High-Level Interpretability

Mentor

Arun Jose

Arun Jose is an independent alignment researcher working primarily on high-level interpretability. Previously, he was a MATS scholar with John Wentworth. He has also worked on simulator theory (a conceptual framework for language models), the cyborgism research agenda, figuring out how to accelerate alignment research with AI, and prosaic model evaluation projects at ARC Evals.

Project

Currently, the topic I'm thinking about the most is how to build a robust conceptual form of the alignment properties we care about. Here's a doc where I lay out why I think this is both important and the hard part of solving worst-case alignment.

You can think about this in a few different ways: if you wanted to steer the internal objective of an AI system, what mechanistic structure would you be trying to identify and intervene on? If you were doing low-level interpretability (the kind that looks like reverse-engineering neural nets), what kind of explanations would you actually want, such that they let you decide on various safety properties? What are the fundamental properties underlying desirable mechanisms (e.g., systems that answer questions honestly, as opposed to ones that produce answers that merely convince humans they are honest, a la ELK)? These can sound pretty different at first glance, but they all get at a similar underlying problem: how do you translate some complex concept from our ontology that we think would make a system safe into a general form applicable to an alien ontology, such as a neural network's?

This project is pretty open-ended, so I expect there are quite a few different directions that would be useful for exploring it and understanding the shape of the problem better.

I expect theoretical work is particularly important here, so I'd be excited to work on something that tackles it directly, but I find that empirical work can often be a very tractable way of approaching amorphous problems as well. For instance, here's some empirical work I recently did to answer some questions about how I was thinking about the notion of objectives, and to gain more data on how neural networks represent concepts internally. There are other empirical directions that would be useful to explore too; however, I expect them to involve a fair amount of conceptual work to figure out which operationalisation of an idea would be best.
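To give a rough sense of the shape such empirical work can take, here is a minimal sketch of one common operationalisation: training a linear probe on a transformer's hidden activations to check whether a concept is linearly decodable at a given layer. The model choice (GPT-2), the layer, and the toy sentiment labels below are illustrative assumptions for the sketch, not the actual experiments mentioned above.

# Minimal probing sketch: is a concept linearly decodable from activations?
# The dataset, layer, and "sentiment" concept here are illustrative assumptions.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"   # assumption: any small causal LM with accessible hidden states
LAYER = 6             # assumption: probe a middle layer of the residual stream

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Tiny illustrative dataset of (text, concept label) pairs.
examples = [
    ("The movie was wonderful and I loved it.", 1),
    ("An absolute delight from start to finish.", 1),
    ("This was a complete waste of time.", 0),
    ("I hated every minute of it.", 0),
    # ... in practice, many examples per class
]

def last_token_activation(text: str) -> torch.Tensor:
    """Return the chosen layer's activation at the final token position."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states: one [batch, seq_len, d_model] tensor per layer (plus embeddings)
    return outputs.hidden_states[LAYER][0, -1]

X = torch.stack([last_token_activation(t) for t, _ in examples]).numpy()
y = [label for _, label in examples]

# Fit a linear probe; high accuracy suggests (but does not prove) that the
# concept is represented along a roughly linear direction at this layer.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Probe accuracy at layer {LAYER}: {probe.score(X, y):.2f}")

Probing is only one possible operationalisation; part of the conceptual work is deciding whether an experiment like this actually bears on the property you care about, or whether a different handle (e.g., interventions on activations rather than passive decoding) gets closer to it.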

Broadly, I think the best fit for the program is whatever specific direction seems most tractable or exciting to you. I'm happy to propose some directions myself, but if you have ideas that seem interesting, I'd be glad to help flesh them out and oversee them; in fact, I mildly recommend bringing your own, though don't worry too much about this.

Personal Fit

Must-haves

  • Is familiar with the basics of alignment theory ("Reads LW/AF regularly" would suffice here).

Nice-to-haves

  • Has experience with CS/math theory (as a lower bar, slightly above an undergrad level) or working with formalisms.

  • Has worked on conceptual alignment before.

  • Is familiar with working on open-ended problems generally.

  • Has prior experience with ML engineering (worked on a non-trivial ML project before, or has spent at least 50 hours working with transformers).

  • Has experience doing mechanistic interpretability.

  • Has an idea for a specific direction for this project, and can justify why they think it's useful.

Mentorship style: My level of involvement will probably be somewhere between "one meeting a week, and you can message me if you have any questions" and "spending as much time on it as you are". Because the project is pretty open-ended and there's a lot of potential for drift (not to mention the problem itself being pretty confusing), I'll be more involved than just checking on updates and answering questions, but I probably won't be working on it as directly as you are either. As a median, expect about 5 hours a week.

Expected time commitment: 5-10hr/week.

Selection Question

Pick any alignment proposal you know of, and describe flaws it may have in addressing the alignment problem. Examples of proposals you could pick (you don’t need to anchor onto these; you can pick any proposal you like or are familiar with): anything from 11 proposals, Safety via debate, Relaxed adversarial training, OpenAI’s superalignment plan, High-level interpretability.