Guarantees-driven Mechanistic Interpretability

Mentor

Jason Gross

Jason leads a project on formalizing transformers, funded by ARC Theory. He’s excited about approaches that might let us scale interpretability with confidence. Previously, he was a researcher at MIRI and earned his PhD in Programming Languages at MIT.

Before all of the AI stuff, he developed Fiat Cryptography, which enables the majority of secure connections on the internet (the big prime numbers behind the HTTPS lock icon), reported more of the all-time bugs in the Coq proof assistant than anyone else (and joined the dev team to help handle them!), and contributed category theory to the Coq-HoTT library (hoorah for beautiful, unifying mathematics).

Project

My team is developing a guarantees-driven approach to mechanistic interpretability, in which compact formal proofs serve as a metric for evaluating mechanistic interpretability analyses. Our goal is to make interpretability scalable, which requires a level of rigor we can trust enough to automate.
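
To make the framing concrete, here is a minimal sketch of the kind of theorem we target (the notation is my own illustration; $M$, $f$, $\mathcal{D}$, and $\epsilon$ are placeholders, not a prescribed statement): given a model $M$, a ground-truth labeling function $f$, and an input distribution $\mathcal{D}$, produce a machine-checkable proof that

$$\Pr_{x \sim \mathcal{D}}\big[\operatorname{argmax}_i\, M(x)_i = f(x)\big] \;\ge\; 1 - \epsilon.$$

The metric is how compact (cheap to verify) the proof of such a bound is: roughly, a brute-force proof enumerates every input, while a proof that exploits genuine mechanistic structure in the model can be far shorter.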

There are several possible projects that would advance this agenda, and I would be happy to scope out a project with each participant based on their interests and skills. Some ideas are:

  1. Making convexity arguments for multiple attention heads or combinations of MLP neurons (a single-head sketch follows this list)

  2. Extending current analyses to different proof approaches and different theorem statements

  3. Proving guarantees about accuracy / loss at various stages of training
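
To give a flavor of idea 1, here is the single-head base case (a sketch in illustrative notation, not project code): an attention head's output is a convex combination of its value vectors, so Jensen's inequality bounds any convex readout $g$ of that output by checking only the value vectors. With attention weights $\alpha_1, \dots, \alpha_n \ge 0$, $\sum_i \alpha_i = 1$, and value vectors $v_1, \dots, v_n$,

$$g\Big(\sum_{i=1}^{n} \alpha_i v_i\Big) \;\le\; \sum_{i=1}^{n} \alpha_i\, g(v_i) \;\le\; \max_i g(v_i).$$

The open question in idea 1 is how far arguments like this extend when several heads or combinations of MLP neurons contribute to the same quantity.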

At the end of the program, you should have:

  1. Developed intuition for and experience with adjusting the level of rigor of mechanistic interpretability analyses,

  2. Explored hypotheses about the advantages and limits of the compact-guarantees framing,

  3. Picked up soft skills relevant to working on a collaborative research project, and

  4. Written up your results.

Personal Fit

Must-haves

  • Have read through A Mathematical Framework for Transformer Circuits

  • Familiarity with basic asymptotic complexity analysis

Nice-to-haves

  • Have gone through MLAB or equivalent (e.g., ARENA, Mechanistic Interpretability Quickstart Guide by Neel Nanda)

  • Familiarity with formal proofs

Mentorship Style

  • Weekly pair coding or code review

  • Biweekly group meeting for project updates and check-ins

  • Slack response time under 48 hours; happy to point out resources that will help get you unstuck

Expected time commitment: ~10 hr/week

Selection Questions

Please see the following Google Doc.