AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

Status: Open

Quantifying Structural Assumptions Needed to Infer Preferences from Irrational Agent Behavior

Tags: interpretability, alignment, research-idea, owain-evans, preference-learning, gridworld, inverse-reinforcement-learning, bounded-rationality, empirical-research
Difficulty: Intermediate
Verification: Research Output
Compute: CPU Only
Source: owain-evans-projects

How many assumptions about agent biases are needed to infer preferences from irrational behavior?

■ Problem Statement

When agents behave irrationally (violating expected utility axioms, exhibiting biases, or acting suboptimally), their preferences cannot be uniquely determined from behavior alone. This project aims to empirically quantify how many and which types of structural assumptions about an agent's decision-making process are necessary and sufficient to reliably infer its underlying preferences from observed behavior in gridworld environments.
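A toy illustration (hypothetical code, not part of the project) of this underdetermination: two agents with different preferences, one rational and one systematically biased, can produce identical trajectories, so no inference procedure can separate them from behavior alone.

```python
# Hypothetical 1-D "corridor" example: an apple at cell 0, an orange at cell 4.
# Agent A rationally prefers the apple; agent B prefers the orange but has a
# systematic bias that makes it always move left. Their behavior is identical.

def rational_agent(pos, goal):
    """Move one step toward the preferred goal."""
    if goal < pos:
        return pos - 1
    if goal > pos:
        return pos + 1
    return pos

def left_biased_agent(pos, goal):
    """Prefers `goal`, but a systematic bias makes it always move left."""
    return max(pos - 1, 0)

def rollout(policy, start, goal, steps=4):
    traj = [start]
    for _ in range(steps):
        traj.append(policy(traj[-1], goal))
    return traj

APPLE, ORANGE = 0, 4
traj_a = rollout(rational_agent, start=2, goal=APPLE)      # truly prefers the apple
traj_b = rollout(left_biased_agent, start=2, goal=ORANGE)  # prefers the orange, biased left

print(traj_a, traj_b, traj_a == traj_b)  # identical behavior, different preferences
```

An observer who assumes rationality infers "prefers the apple" for both agents, and is wrong about agent B; only an added assumption about B's bias (or its absence) breaks the tie.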

■ Background

Stuart Armstrong's theoretical work shows that a rational agent's preferences can in principle be inferred from behavior via revealed preference theory, but irrational agents require additional assumptions. This matters for AI safety because: (1) deployed AI systems may behave irrationally due to distributional shift, adversarial inputs, or misspecification; (2) we need to detect when AI systems have unintended objectives; and (3) learning human preferences requires handling human irrationality. Prior work in inverse reinforcement learning typically assumes near-optimal behavior, which is often unrealistic. The synthesising-human-preferences research agenda (linked in Armstrong's posts) provides broader context on why preference inference under uncertainty is central to alignment.
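One standard way IRL softens the near-optimality assumption is Boltzmann (softmax) rationality, where a temperature parameter interpolates between random and optimal behavior; a minimal sketch with illustrative Q-values:

```python
import math

def boltzmann_policy(q_values, beta):
    """P(a) ∝ exp(beta * Q(a)): the common noisy-rationality model in IRL.
    beta → ∞ recovers the near-optimal agent; beta = 0 is uniform random."""
    weights = [math.exp(beta * q) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

q = [1.0, 0.0]  # action 0 has higher value under the agent's (unknown) reward
for beta in (0.0, 1.0, 10.0):
    print(beta, boltzmann_policy(q, beta))
```

Even this single "assumption knob" changes which preferences best explain a trajectory, which is exactly the kind of dependence this project would quantify systematically.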

■ Scope

In scope:

  • Gridworld environments with discrete state/action spaces
  • Supervised learning approaches to preference classification
  • Systematic variation of assumption types (rationality violations, perceptual biases, decision heuristics)
  • Analysis of assumption sufficiency and informativeness

Out of scope:

  • Continuous control and real-world robotics
  • Online/active learning approaches
  • Theoretical proofs (the focus is empirical)
  • Human-subject experiments

Constraints:

  • Experiments must be reproducible computational experiments
  • Code must be documented for reuse
  • Gridworlds should be simple enough for clear interpretation but complex enough to demonstrate meaningful results

■ Getting Started

■ Impact Assessment

Importance: High
Neglectedness: Medium
Tractability: High

■ Prerequisites

Required: Python programming, machine learning fundamentals (classification, train/test methodology), basic reinforcement learning concepts (MDP formalism, policies, rewards). Helpful: experience with gridworld environments, inverse reinforcement learning, familiarity with AI safety arguments about objective inference.

■ Acceptance Criteria

  • Implemented at least 5 distinct agent types with different combinations of preferences and biases in gridworld environments, with ground-truth preferences documented
  • Trained classifiers to predict preferences under at least 10 different assumption configurations, with accuracy/performance metrics reported for each
  • Produced quantitative analysis showing relationship between number/type of assumptions and inference accuracy, including identification of minimal sufficient assumption sets
  • Published codebase with documentation enabling reproduction and extension of experiments (README, requirements, example runs)
  • Drafted conference/workshop paper (6-8 pages) with introduction, methodology, results, and discussion of implications for AI safety
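The "minimal sufficient assumption sets" criterion above can be framed as a subset search over measured classifier accuracies. A sketch with made-up numbers (the assumption names and accuracies are hypothetical, for illustration only):

```python
from itertools import chain

# Hypothetical accuracies of a preference classifier under different sets of
# encoded assumptions (illustrative numbers, not real results).
ASSUMPTIONS = ["noisy_rationality", "left_bias", "myopia"]
accuracy = {
    frozenset(): 0.52,
    frozenset({"noisy_rationality"}): 0.61,
    frozenset({"left_bias"}): 0.78,
    frozenset({"myopia"}): 0.55,
    frozenset({"noisy_rationality", "left_bias"}): 0.91,
    frozenset({"noisy_rationality", "myopia"}): 0.64,
    frozenset({"left_bias", "myopia"}): 0.90,
    frozenset({"noisy_rationality", "left_bias", "myopia"}): 0.93,
}

def minimal_sufficient_sets(accuracy, threshold):
    """Assumption sets meeting the threshold whose proper subsets all fail it."""
    sufficient = [s for s, acc in accuracy.items() if acc >= threshold]
    return [
        s for s in sufficient
        if not any(t < s for t in sufficient)  # frozenset `<` is proper-subset
    ]

print(minimal_sufficient_sets(accuracy, threshold=0.90))
```

Here two distinct minimal sets clear the 0.90 bar, while the full assumption set is sufficient but not minimal; reporting results in this form makes the number-of-assumptions vs. accuracy trade-off explicit.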

■ Expected Artifacts

  • Repository
  • Paper

Related Resources

Forum Posts

  • AI Safety Research Project Ideas (Evans & Armstrong)

Sources

https://www.alignmentforum.org/posts/f69LK7CndhSNA7oPn/ai-safety-research-project-ideas

Created: 2/5/2026
Last updated: 2/9/2026
Time: Months
Team Size: Solo
  1. READ: Stuart Armstrong's posts linked in the problem description; 'The Easy Goal Inference Problem is Still Hard' (Shah et al.); classic IRL papers (Ng & Russell, Abbeel & Ng); literature on revealed preference with bounded rationality.
  2. EXPLORE: Existing gridworld libraries (MiniGrid, AI Safety Gridworlds); simple IRL implementations on GitHub.
  3. PROTOTYPE: Create 2-3 simple agents in a basic gridworld with one clear preference and one clear bias (e.g., 'prefers apples over oranges but always moves up when both are equidistant').
  4. EXPERIMENT: Train a simple classifier (logistic regression or small neural net) to predict the preference from trajectories, first without assumptions, then with the bias encoded as a feature.
  5. ITERATE: Gradually increase complexity—more preference types, more bias types, systematic assumption combinations.
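The prototype-and-experiment steps (3-4) can be sketched end to end. This is hypothetical code, not the project's implementation: a 1-D "corridor" stands in for the gridworld, the tie bias plays the role of 'always moves up when equidistant', and a majority-vote rule stands in for the trained classifier.

```python
APPLE, ORANGE = 0, 4  # fruit positions on a 1-D corridor

def step(pos, pref):
    """One move of a tie-biased agent: it heads toward its preferred fruit,
    but when both fruits are equidistant a fixed bias makes it move left."""
    d_apple, d_orange = abs(pos - APPLE), abs(pos - ORANGE)
    if d_apple == d_orange:
        return pos - 1, True                       # biased (tie) move
    goal = APPLE if pref == "apple" else ORANGE
    return pos + (1 if goal > pos else -1), False  # preference-driven move

def rollout(pref, start, steps=7):
    pos, moves = start, []
    for _ in range(steps):
        if pos in (APPLE, ORANGE):
            break
        new, tie = step(pos, pref)
        moves.append((new - pos, tie))  # (direction, was it a tie move?)
        pos = new
    return moves

def infer(moves, use_bias_assumption):
    """Majority-direction inference; the 'assumption' discards biased tie moves."""
    if use_bias_assumption:
        moves = [m for m in moves if not m[1]]
    net = sum(d for d, _ in moves)
    return "apple" if net < 0 else "orange"

trials = [(p, s) for p in ("apple", "orange") for s in (1, 2, 3)]
acc = {use: sum(infer(rollout(p, s), use) == p for p, s in trials)
       for use in (False, True)}
print(acc)  # encoding the bias assumption recovers more preferences
```

The naive rule is fooled by the oscillating trajectories the tie bias induces (e.g., an orange-preferrer starting at the midpoint keeps bouncing between cells 1 and 2), while the assumption-aware rule ignores tie moves and recovers the preference, which is the pattern step 5 then scales up.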
