Mechanistic/Developmental Interpretability

Mentor

Evan Ryan Gunter

Evan is currently doing research on developmental interpretability and learning theory, funded by the Long-Term Future Fund as part of the ML Alignment & Theory Scholars (MATS) extension phase. During the summer phase of MATS 4.0, Evan co-authored Quantifying Stability of Non-Power-Seeking with fellow scholar Yevgeny Liokumovich and mentor Victoria Krakovna, and also researched formalizing power without rewards. Before MATS, Evan worked as a research engineer at Granica after graduating from Caltech with degrees in math, computer science, and philosophy.

Project

Please read this project document and familiarize yourself with the project details before applying!

I want to understand how neural networks represent and operate on information, and how they learn to do so. I am especially interested in low-level insights that shed light on safety-relevant high-level behavior: for example, how interference from superposition may explain why adversarial examples are so hard to prevent.
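To make the superposition point concrete, here is a minimal NumPy sketch (my own illustration, not code from the project document): when a network stores more feature directions than it has dimensions, the directions cannot all be orthogonal, and the resulting overlaps are exactly the interference terms an adversarial perturbation can exploit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy superposition: embed n = 10 "features" as directions in only
# d = 4 dimensions. They cannot all be orthogonal, so reading one
# feature off an activation vector picks up interference from others.
d, n = 4, 10
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit feature directions

overlaps = W @ W.T                 # pairwise cosine similarities
np.fill_diagonal(overlaps, 0.0)    # ignore each feature's self-overlap
print("max interference between distinct features:", np.abs(overlaps).max())

# A perturbation of size eps along feature j shifts a linear readout of
# feature i by eps * <w_i, w_j>, so nonzero interference gives an
# adversary a lever on features the perturbation never targets directly.
```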

My current research focuses on understanding the geometry of loss landscapes. I'm inspired by work showing that, in certain settings, all local minima are global minima; work explaining why some optimizers (e.g., Adam) work better than others (e.g., SGD); and work on singular learning theory.
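As a toy illustration of the loss-landscape geometry that singular learning theory studies (my own sketch, not from the project document), consider the two-parameter loss L(w1, w2) = (w1 * w2)^2: its minima form the two coordinate axes, which cross at the origin, and at that singular point the Hessian vanishes entirely, so second-order geometry alone cannot describe the landscape there.

```python
import numpy as np

# Toy loss L(w) = (w1 * w2)^2. Its minimum set is the union of the two
# coordinate axes -- a singular set of the kind singular learning
# theory analyzes. (Illustrative example only.)
def loss(w):
    return (w[0] * w[1]) ** 2

def hessian(f, w, eps=1e-4):
    """Numerical Hessian of f at w via central finite differences."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            w_pp = w.copy(); w_pp[i] += eps; w_pp[j] += eps
            w_pm = w.copy(); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = w.copy(); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = w.copy(); w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (f(w_pp) - f(w_pm) - f(w_mp) + f(w_mm)) / (4 * eps**2)
    return H

# At a generic minimum like (0, 2), one eigenvalue is positive and one
# is (numerically) zero: a flat direction along the w1-axis.
print(np.linalg.eigvalsh(hessian(loss, np.array([0.0, 2.0]))))

# At the singular crossing (0, 0), both eigenvalues vanish, so the
# local geometry is not captured by the Hessian at all.
print(np.linalg.eigvalsh(hessian(loss, np.array([0.0, 0.0]))))
```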

Personal Fit

An ideal mentee would have a strong interest in math or physics, and would be open to learning how to run basic ML experiments. I am happy to take mentees without previous experience in ML or alignment.

Mentorship style: As a mentor, I hope to share what I have learned about doing research effectively and to provide guidance and feedback on research directions. I think it is very important to work with someone who has context on your research, so I aim to follow your work closely enough to have that context. I will likely be able to spend 5-8 hours per week on mentoring across all mentees.

Expected time commitment: 6 hours/week.