How many assumptions about agent biases are needed to infer preferences from irrational behavior?
When agents behave irrationally (violating expected utility axioms, exhibiting biases, or acting suboptimally), their preferences cannot be uniquely determined from behavior alone. This project aims to empirically quantify how many and which types of structural assumptions about an agent's decision-making process are necessary and sufficient to reliably infer its underlying preferences from observed behavior in gridworld environments.
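The identifiability problem the project targets can be seen in a few lines. The following sketch (hypothetical numbers, standard Boltzmann-rationality model — an assumption, not something specified in the project) shows two different hypotheses about an agent's rewards and rationality level that produce identical behavior:

```python
import math

def boltzmann_policy(q_values, beta):
    """Action probabilities for a Boltzmann-(ir)rational agent:
    P(a) ∝ exp(beta * Q(a)). Lower beta = noisier, less rational."""
    exps = [math.exp(beta * q) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothesis A: moderate preferences, fairly rational agent (beta = 2).
q_a = [1.0, 2.0, 0.5]
# Hypothesis B: preferences twice as strong, half as rational (beta = 1).
q_b = [2.0, 4.0, 1.0]

pi_a = boltzmann_policy(q_a, beta=2.0)
pi_b = boltzmann_policy(q_b, beta=1.0)
# pi_a == pi_b: behavior alone cannot distinguish "strong preference,
# noisy agent" from "weak preference, rational agent" — extra
# assumptions about the bias model are needed to break the tie.
```

This is exactly the degeneracy the project's structural assumptions are meant to resolve: fixing the bias model (here, the value of beta) makes the rewards identifiable again.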
Stuart Armstrong's theoretical work shows that a rational agent's preferences can, in principle, be inferred from its behavior via revealed preference theory, but inferring an irrational agent's preferences requires additional assumptions. This matters for AI safety because: (1) deployed AI systems may behave irrationally due to distributional shift, adversarial inputs, or misspecification; (2) we need to detect when AI systems are pursuing unintended objectives; and (3) learning human preferences requires handling human irrationality. Prior work in inverse reinforcement learning typically assumes near-optimal behavior, which is often unrealistic. The synthesising human preferences research agenda (linked in Armstrong's posts) provides broader context on why preference inference under uncertainty is central to alignment.
IN SCOPE: Gridworld environments, discrete action/state spaces, supervised learning approaches to preference classification, systematic variation of assumption types (rationality violations, perceptual biases, decision heuristics), analysis of assumption sufficiency and informativeness.
OUT OF SCOPE: Continuous control, real-world robotics, online/active learning approaches, theoretical proofs (the focus is empirical), human subject experiments.
CONSTRAINTS: Should produce reproducible computational experiments; code must be documented for reuse; gridworlds should be simple enough for clear interpretation but complex enough to demonstrate meaningful results.
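To make the in-scope setup concrete, here is a minimal sketch of the kind of experiment the scope describes: a small deterministic gridworld, one illustrative biased agent (a greedy/myopic heuristic — one of many possible bias types, chosen here for brevity), and a rollout function producing the (state, action) data a preference-inference method would consume. The grid size, goal placement, and bias are all hypothetical choices, not specified by the project:

```python
# Hypothetical 4x4 gridworld: agent starts top-left, goal bottom-right.
SIZE = 4
GOAL = (SIZE - 1, SIZE - 1)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition with walls: moves are clamped to the grid."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (max(0, min(SIZE - 1, r + dr)), max(0, min(SIZE - 1, c + dc)))

def myopic_policy(state):
    """A biased (myopic) agent: greedily reduces Manhattan distance
    to the goal, ignoring long-horizon value. Other bias types
    (noisy actions, perceptual distortions) would slot in here."""
    def dist(s):
        return abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1])
    return min(ACTIONS, key=lambda a: dist(step(state, a)))

def rollout(policy, start=(0, 0), max_steps=20):
    """Collect a trajectory of (state, action) pairs — the observed
    behavior from which preferences must be inferred."""
    traj, s = [], start
    for _ in range(max_steps):
        if s == GOAL:
            break
        a = policy(s)
        traj.append((s, a))
        s = step(s, a)
    return traj
```

A supervised preference-classification experiment would then generate many such trajectories under systematically varied reward functions and bias models, and train a classifier to recover the reward from the trajectories.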
Required: Python programming, machine learning fundamentals (classification, train/test methodology), basic reinforcement learning concepts (MDP formalism, policies, rewards). Helpful: experience with gridworld environments, inverse reinforcement learning, familiarity with AI safety arguments about objective inference.
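As a pointer to the "basic RL concepts" prerequisite, the sketch below shows tabular value iteration for a deterministic MDP — the formalism gridworld experiments are built on. The three-state chain at the bottom is a made-up toy instance for illustration:

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, iters=100):
    """Tabular value iteration for a deterministic MDP.
    `transition(s, a)` returns the next state; `reward(s, a)` the
    immediate reward; gamma is the discount factor."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(reward(s, a) + gamma * V[transition(s, a)]
                    for a in actions)
             for s in states}
    return V

# Toy 3-state chain: move left/right, reward 1 for arriving at state 2.
states, actions = [0, 1, 2], ["left", "right"]
t = lambda s, a: max(0, min(2, s + (1 if a == "right" else -1)))
r = lambda s, a: 1.0 if t(s, a) == 2 else 0.0
V = value_iteration(states, actions, t, r)
# States nearer the rewarding state get higher value: V[0] < V[1].
```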
Created: 2/5/2026
Last updated: 2/9/2026