How many assumptions about agent biases are needed to infer preferences from irrational behavior?
When agents behave irrationally (violating expected utility axioms, exhibiting biases, or acting suboptimally), their preferences cannot be uniquely determined from behavior alone. This project aims to empirically quantify how many and which types of structural assumptions about an agent's decision-making process are necessary and sufficient to reliably infer its underlying preferences from observed behavior in gridworld environments.
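The identifiability problem the project targets can be seen in a few lines. The following sketch (hypothetical numbers, standard Boltzmann-rationality model — an assumption, not something specified in the project) shows two different hypotheses about an agent's rewards and rationality level that produce identical behavior:

```python
import math

def boltzmann_policy(q_values, beta):
    """Action probabilities for a Boltzmann-(ir)rational agent:
    P(a) ∝ exp(beta * Q(a)). Lower beta = noisier, less rational."""
    exps = [math.exp(beta * q) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothesis A: moderate preferences, fairly rational agent (beta = 2).
q_a = [1.0, 2.0, 0.5]
# Hypothesis B: preferences twice as strong, half as rational (beta = 1).
q_b = [2.0, 4.0, 1.0]

pi_a = boltzmann_policy(q_a, beta=2.0)
pi_b = boltzmann_policy(q_b, beta=1.0)
# pi_a == pi_b: behavior alone cannot distinguish "strong preference,
# noisy agent" from "weak preference, rational agent" — extra
# assumptions about the bias model are needed to break the tie.
```

This is exactly the degeneracy the project's structural assumptions are meant to resolve: fixing the bias model (here, the value of beta) makes the rewards identifiable again.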
Stuart Armstrong's theoretical work shows that a rational agent's preferences can, in principle, be inferred from its behavior via revealed preference theory, but inferring an irrational agent's preferences requires additional assumptions. This matters for AI safety because: (1) deployed AI systems may behave irrationally due to distributional shift, adversarial inputs, or misspecification; (2) we need to detect when AI systems are pursuing unintended objectives; and (3) learning human preferences requires handling human irrationality. Prior work in inverse reinforcement learning typically assumes near-optimal behavior, which is often unrealistic. The synthesising human preferences research agenda (linked in Armstrong's posts) provides broader context on why preference inference under uncertainty is central to alignment.
IN SCOPE: Gridworld environments, discrete action/state spaces, supervised learning approaches to preference classification, systematic variation of assumption types (rationality violations, perceptual biases, decision heuristics), analysis of assumption sufficiency and informativeness.
OUT OF SCOPE: Continuous control, real-world robotics, online/active learning approaches, theoretical proofs (the focus is empirical), human subject experiments.
CONSTRAINTS: Should produce reproducible computational experiments; code must be documented for reuse; gridworlds should be simple enough for clear interpretation but complex enough to demonstrate meaningful results.
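To make the in-scope setup concrete, here is a minimal sketch of the kind of experiment the scope describes: a small deterministic gridworld, one illustrative biased agent (a greedy/myopic heuristic — one of many possible bias types, chosen here for brevity), and a rollout function producing the (state, action) data a preference-inference method would consume. The grid size, goal placement, and bias are all hypothetical choices, not specified by the project:

```python
# Hypothetical 4x4 gridworld: agent starts top-left, goal bottom-right.
SIZE = 4
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition with walls: moves are clamped to the grid."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (max(0, min(SIZE - 1, r + dr)), max(0, min(SIZE - 1, c + dc)))

def myopic_policy(state):
    """A biased (myopic) agent: greedily reduces Manhattan distance
    to the goal, ignoring long-horizon value. Other bias types
    (noisy actions, perceptual distortions) would slot in here."""
    def dist(s):
        return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])
    return min(ACTIONS, key=lambda a: dist(step(state, a)))

def rollout(policy, start=(0, 0), max_steps=20):
    """Collect a trajectory of (state, action) pairs — the observed
    behavior from which preferences must be inferred."""
    traj, s = [], start
    for _ in range(max_steps):
        if s == GOAL:
            break
        a = policy(s)
        traj.append((s, a))
        s = step(s, a)
    return traj
```

A supervised preference-classification experiment would then generate many such trajectories under systematically varied reward functions and bias models, and train a classifier to recover the reward from the trajectories.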
Required: Python programming, machine learning fundamentals (classification, train/test methodology), basic reinforcement learning concepts (MDP formalism, policies, rewards). Helpful: experience with gridworld environments, inverse reinforcement learning, familiarity with AI safety arguments about objective inference.
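As a pointer to the "basic RL concepts" prerequisite, the sketch below shows tabular value iteration for a deterministic MDP — the formalism gridworld experiments are built on. The three-state chain at the bottom is a made-up toy instance for illustration:

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, iters=100):
    """Tabular value iteration for a deterministic MDP.
    `transition(s, a)` returns the next state; `reward(s, a)` the
    immediate reward; gamma is the discount factor."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(reward(s, a) + gamma * V[transition(s, a)]
                    for a in actions)
             for s in states}
    return V

# Toy 3-state chain: move left/right, reward 1 for arriving at state 2.
states, actions = [0, 1, 2], ["left", "right"]
t = lambda s, a: max(0, min(2, s + (1 if a == "right" else -1)))
r = lambda s, a: 1.0 if t(s, a) == 2 else 0.0
V = value_iteration(states, actions, t, r)
# States nearer the rewarding state get higher value: V[0] < V[1].
```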
Created: 2/5/2026
Last updated: 2/9/2026