Representation Engineering

Mentor

Andy Zou

Andy is a second-year PhD student at CMU and a co-founder of the Center for AI Safety (CAIS), working on AI safety. Recently, he led the papers "Universal and Transferable Adversarial Attacks on Aligned Language Models" and "Representation Engineering: A Top-Down Approach to AI Transparency". For more information, see his website.

Project

The project will focus on representation engineering research (ai-transparency.org/). See project document.

In general, we aim to understand and control high-level phenomena in neural networks, such as deception, vulnerability to adversarial attacks, power-seeking tendencies, and memorization. The specific research directions will be discussed when the program starts.

Personal Fit

Must-haves

  • Familiarity with PyTorch and the Transformers library

  • Experience implementing a transformer from scratch

Nice-to-haves

  • Experience working with LLMs

Mentorship style: Scheduling is relatively flexible, from one to several syncs per week depending on your preference, totaling 0.5-1.5 hr/week.

Expected time commitment: minimum 8 hr/week, ideally 12+ hr/week.