Evaluating LM Agency

Mentor

Francis Rhys Ward

Francis Rhys Ward is a final-year PhD student at Imperial College London, a member of Tom Everitt's causal incentives group, a recent GovAI summer fellow, and previously a CLR summer fellow. His technical work focuses on evaluating deception and agency in frontier AI systems. He also organizes several community initiatives in London focused on AI existential safety, and lectures on the Ethics of AI module at Imperial.

Project

The primary threat models for AI x-risk depend on AI systems being agents, or “agentic”. Although there is no complete theory of agency, several important dimensions of it can be identified, such as the coherence of a system's goals or beliefs. This project aims to empirically evaluate key dimensions of agency in frontier LMs, for example by designing a method to evaluate the extent to which OpenAI models pursue the goal of being “helpful and harmless”.

Personal Fit

  • Proficient in Python

  • Strong background in mathematics and/or philosophy

  • Experience evaluating and fine-tuning LMs

Mentorship style: flexible, with weekly meetings at a minimum.

Selection Questions

What properties of agency are particularly important and tractable to measure and evaluate? Design a concrete evaluation method for measuring "agency" (or a key dimension of agency) in language models. (Answer in under 45 minutes.)