AI Safety Technical Fellowship

CAISH runs a 7-week in-person reading group on AI safety, covering topics such as neural network interpretability, learning from human feedback, and goal misgeneralisation in reinforcement learning agents. We meet weekly in small groups led by experienced facilitators, with dinner provided. Meet safety researchers in Cambridge, find collaborators, and start working on a project of your own.

Applications for the Spring 2024 program are now open. Deadline: Sunday 28 January, 23:59 GMT.

The curriculum

The curriculum is based on the AGI Safety Fundamentals program developed by AI safety researcher Richard Ngo (Governance team at OpenAI, previously AGI safety team at DeepMind). See our curriculum for more details.

  • Week 0: Get up to speed with the foundational concepts in ML. This week is a prerequisite to be completed before the course starts and can be skipped if you have relevant experience.

  • Week 1: This week focuses on the concept of reward misspecification, which occurs when the reward an RL agent is trained on fails to capture what its designers actually intend, so the agent can score highly by misbehaving. We start by looking at some toy examples, both when rewards are hard-coded and when they are based on human feedback; a toy sketch of this kind of reward hacking appears after the curriculum list. We then look at how today’s most capable models (known as foundation models) are currently affected by reward misspecification, and how these failures might become more serious as AIs get more powerful. We finish by exploring a quantitative model for when AGI might be built.

  • Week 2: This week discusses the notion of goal-directed behaviour in AI systems. Such behaviours are common; for instance, agents trained via RL to solve mazes exhibit the goal-directed behaviour of “move towards the maze exit.”

    We first look at some examples of goal misgeneralisation, where AI systems develop goals during training that differ from the ones their developers intended (see the toy maze sketch after the curriculum list). Sophisticated AI systems with goals, regardless of what those goals are, could pursue certain instrumental subgoals such as power-seeking. Moreover, under certain assumptions, sophisticated AI systems with unaligned goals may instrumentally behave well during training in order to prevent their goals from being modified. This failure mode is called deceptive alignment.

  • Week 3: This week, we discuss how progress in AI capabilities might bootstrap its own acceleration. We then look at scalable oversight: methods that enable humans to oversee AI systems solving tasks too complicated for a single human to evaluate. We discuss two approaches to scalable oversight: iterated amplification and adversarial techniques.

  • Week 4: When we train an ML model, it might learn internal representations of the task that are correlated with the truth, but are not the truth itself. For example, if we train it to generate true text by having humans rate its outputs, it may produce errors that merely seem true to the human evaluators. This is one important cause of misaligned behaviour. We discuss one approach to addressing this, called eliciting latent knowledge: instead of trying to explicitly, externally specify truth, we try to search for the implicit, internal “beliefs” or “knowledge” learned by the model.

  • Week 5: Our current methods of training capable neural networks give us little insight into how or why they function. This week we cover the field of mechanistic interpretability, which aims to change this by developing methods for understanding how neural networks “think” at the level of individual neurons and circuits.

  • Week 6: What if it’s very difficult to make progress on alignment through modern ML techniques, and we instead need to take a step back and ask more foundational questions: how does an agent model a world bigger than itself? How does abstraction work? We dip into a few of these more conceptual approaches and then zoom back out in our discussion of AI safety.

  • Week 7: An important crux for working on AI safety is timelines. When do we expect to see transformative AI? What about AGI? What kinds of signs should we expect to see along the way? We’ll discuss biological anchors (“bioanchors”), one of the leading frameworks for forecasting AI timelines, and look at the current predictions of forecasting platforms like Metaculus.
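
To make the Week 1 idea of reward misspecification concrete, here is a minimal, hypothetical sketch in Python (not taken from the course materials; the names and numbers are illustrative assumptions). The designer intends the agent to reach the finish line, but the hard-coded proxy reward only counts pickups, so the highest-scoring policy never finishes:

```python
# Hypothetical toy example of reward misspecification / reward hacking.
# The intended goal is "reach the finish line quickly", but the hard-coded
# reward only counts pickups, so farming the pickup beats finishing.

TRACK_LENGTH = 5   # cells 0..4; cell 4 is the finish line
PICKUP_CELL = 2    # a pickup respawns at this cell every step

def proxy_reward(position):
    """Misspecified reward: +1 for standing on the pickup, nothing for finishing."""
    return 1.0 if position == PICKUP_CELL else 0.0

def rollout(policy, steps=20):
    """Run a deterministic policy (position -> move of -1 or +1) and sum the proxy reward."""
    position, total = 0, 0.0
    for _ in range(steps):
        position = max(0, min(TRACK_LENGTH - 1, position + policy(position)))
        total += proxy_reward(position)
    return total

# Intended behaviour: drive straight to the finish line.
go_to_finish = lambda position: +1
# Reward-hacking behaviour: oscillate around the pickup cell forever.
farm_pickups = lambda position: +1 if position < PICKUP_CELL else -1

print("finish-line policy:", rollout(go_to_finish))    # passes the pickup once -> 1.0
print("pickup-farming policy:", rollout(farm_pickups)) # revisits the pickup -> 10.0
```

An optimiser maximising this proxy would converge on the pickup-farming behaviour even though that is clearly not what the designer intended; the Week 1 readings cover richer versions of this failure, including rewards learned from human feedback.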

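In the same spirit, here is a hypothetical sketch of the Week 2 maze example (again illustrative, not from the course materials). During training the exit always happens to sit in the top-right corner, so a policy that learned “go to the top-right corner” is indistinguishable from one that learned “go to the exit”, and the difference only shows up when the exit moves at deployment:

```python
# Hypothetical toy example of goal misgeneralisation in a gridworld maze.
# The learned goal ("go to the top-right corner") matches the intended goal
# ("go to the exit") on the training distribution, but not once the exit moves.

GRID = 5  # 5x5 grid, positions are (row, col), agent starts at (0, 0)

def step_towards(position, target):
    """Greedy policy: move one cell towards `target` (rows first, then columns)."""
    r, c = position
    tr, tc = target
    if r != tr:
        r += 1 if tr > r else -1
    elif c != tc:
        c += 1 if tc > c else -1
    return (r, c)

def reaches_exit(learned_goal, exit_cell, max_steps=20):
    """Return True if an agent pursuing `learned_goal` ends up on the exit cell."""
    position = (0, 0)
    for _ in range(max_steps):
        position = step_towards(position, learned_goal)
        if position == exit_cell:
            return True
    return False

TOP_RIGHT = (0, GRID - 1)

# Training: the exit happens to be in the top-right corner,
# so the misgeneralised goal looks perfectly aligned.
print("training success:", reaches_exit(learned_goal=TOP_RIGHT, exit_cell=TOP_RIGHT))   # True

# Deployment: the exit moves, but the learned goal does not.
print("deployment success:", reaches_exit(learned_goal=TOP_RIGHT, exit_cell=(4, 2)))    # False
```
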
Course details

  • The course is for people who are interested in making advanced AI safe and in engaging with the technical problems of this field. It may particularly appeal to people with computer science, maths, physics, engineering, and other quantitative backgrounds.

    Those interested in reducing AI risks through governance and policy may also benefit from the grounding this course provides in the technical problems of AI safety.

  • Participants are expected to be familiar with the basic concepts of ML, such as stochastic gradient descent, neural networks, and reinforcement learning. Those without previous ML experience should be willing to learn the basic concepts (e.g. read the Week 0 Prerequisites materials on ML). We may group participants with similar experience levels.

  • In the past, we’ve received ~100 applications and we expect to have capacity for up to 50 students.

  • Meetings will be hosted in our office in Cambridge. Each weekly session lasts 2 hours and consists of a mixture of reading and discussion of the materials. No preparation is required, as all reading is done in-session. Your facilitator will guide the conversation and answer questions. Dinner is provided.

    Accepted applicants will be asked for their time availability and assigned to a cohort time accordingly.

  • Facilitators are CAISH members with research experience in AI safety, usually a mix of final-year undergraduates, Master’s students and PhDs. If you are a PhD student or postdoc, you will most likely be facilitated by a PhD student working directly on AI safety research.

  • The fellowship is open to students and researchers at all levels: past cohorts have included a mix of undergrads, Master’s students, PhDs, and postdocs. We will likely group people with similar backgrounds and experience levels.

  • Please contact hello@cambridgeaisafety.org to discuss other ways of getting involved in AI safety through CAISH! We have more advanced machine learning programs for those who’ve done the Intro fellowship.