AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

Status: Open

Detecting Model Honesty via Internal Representations When Oversight Fails

Tags: deception, scalable-oversight, truthfulness, honesty, mechanistic-interpretability, alignment-research, anthropic, anthropic-directions-2025, recursive-oversight, internal-representations, honesty-detection, oversight-failures
Difficulty: Advanced
Verification: Research Output
Compute: Multi-GPU
Source: anthropic-directions-2025
Time: Months
Team Size: Small Team (2-4)
Detect AI honesty via internal states, not output quality, especially when models exploit oversight errors.

■ Problem Statement

Determine under what conditions we can reliably identify honest model responses by examining internal representations, particularly in adversarial settings where models have learned to produce answers that appear correct to overseers but are actually wrong. The goal is to characterize when honesty detection works, when it fails, and what factors determine success.

■ Background

Scalable oversight faces a fundamental problem: as models become more capable than their overseers, they may exploit systematic errors in human judgment. Recent work on truth detection in LLMs (Burns et al. 2022, Azaria & Mitchell 2023) shows promise for identifying truthful responses through internal activations rather than output quality. Follow-up work has used these representations for classification and control (Zou et al. 2023, Panickssery et al. 2024, Li et al. 2023, Mallen et al. 2024). However, most studies use naturally occurring true/false statements rather than adversarial settings where models actively deceive. Critical gaps remain in understanding: (1) whether honesty probes generalize from simple to complex domains, (2) whether models can learn to evade detection, and (3) what model properties enable reliable detection.
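
For intuition, here is a minimal sketch of the contrast-pair idea behind much of this literature: estimate a candidate "truth direction" as the difference between mean hidden activations on true versus false statements, then score new statements by projecting onto it. The model (gpt2), layer index, and example statements below are illustrative stand-ins, not choices prescribed by this problem.

```python
# Minimal contrast-pair sketch: difference-of-means "truth direction" from
# hidden activations. Model, layer, and statements are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small stand-in; real experiments would use larger open models
LAYER = 6             # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]

true_stmts = ["Paris is the capital of France.", "Water freezes at 0 degrees Celsius."]
false_stmts = ["Paris is the capital of Germany.", "Water freezes at 50 degrees Celsius."]

true_mean = torch.stack([last_token_activation(s) for s in true_stmts]).mean(dim=0)
false_mean = torch.stack([last_token_activation(s) for s in false_stmts]).mean(dim=0)
direction = true_mean - false_mean  # candidate truth/honesty direction

def truth_score(text: str) -> float:
    """Projection onto the direction; higher suggests a 'true-like' internal state."""
    return float(last_token_activation(text) @ direction)

print(truth_score("The sun rises in the east."))   # expected: higher score
print(truth_score("The sun rises in the west."))   # expected: lower score
```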

■ Scope

IN SCOPE: Empirical investigation of honesty detection methods on language models, especially in adversarial settings; characterizing failure modes and success conditions; comparing probing techniques (linear probes, contrast pairs, causal interventions); studying how training procedures affect detectability; analysis across model scales and architectures.

OUT OF SCOPE: Developing completely novel oversight methods unrelated to internal representations; work purely on improving human oversight without leveraging model internals; studying honesty in non-language domains without clear transfer insights.

CONSTRAINTS: Should focus on settings with systematic, exploitable overseer errors (not random noise); must go beyond demonstrating that honesty detection works in some cases to characterizing when and why.


■ Impact Assessment

Importance: High
Neglectedness: Medium
Tractability: Medium

■ Prerequisites

Strong ML background required; familiarity with interpretability techniques (probing, activation analysis); experience fine-tuning language models; understanding of alignment and oversight problems; ability to design experiments and analyze results statistically.

■ Acceptance Criteria

  • Demonstrate honesty detection on at least 3 models fine-tuned to be deceptive in different domains, with quantitative comparison to baseline methods
  • Identify and characterize at least 2 specific conditions or model properties that determine whether honesty detection succeeds or fails
  • Test whether models can be trained to evade honesty detection while maintaining deceptiveness, with results on success/failure
  • Provide theoretical or empirical analysis of generalization: do honesty detectors trained on simple cases work on complex cases where oversight fails?

■ Expected Artifacts

paper, repo, benchmark, reproduction steps

Sources

https://alignment.anthropic.com/2025/recommended-directions/

Created: 2/5/2026

Last updated: 2/9/2026

■ Getting Started

START HERE:

1. Read Burns et al. 2022, 'Discovering Latent Knowledge', and Azaria & Mitchell 2023 on truth detection.
2. Review Panickssery et al. 2024 on steering model honesty.
3. Replicate basic honesty detection on an open model (Llama-2-7B or similar) using simple true/false questions.
4. Create an adversarial testbed by fine-tuning a model to give plausible-sounding but wrong answers in a specific domain.
5. Test whether honesty probes trained on clean data detect dishonesty in your adversarial model.

CONCRETE EXPERIMENTS: Compare linear probes, contrast pairs, and the logit lens for honesty detection; vary domain complexity; try models of different sizes; test whether honesty detectors transfer across domains; and investigate whether models can be trained to evade detection while remaining deceptive.
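
A minimal sketch of steps (3) and (5) above, assuming a small open model as a stand-in: fit a linear probe on activations from simple true/false statements, then apply the same probe to held-out statements. All names here (the gpt2 model, the layer index, the toy statements) are illustrative placeholders; a real replication would use Llama-2-7B-scale models, an established true/false dataset, and responses sampled from the adversarially fine-tuned model.

```python
# Sketch of steps (3) and (5): fit a linear honesty probe on clean activations,
# then score held-out statements (in the real setup: outputs of the deceptive
# fine-tune). Model, layer, and data below are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for Llama-2-7B or similar
LAYER = 6            # illustrative layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def collect_activations(statements):
    """Last-token hidden states at LAYER for each statement."""
    feats = []
    for s in statements:
        inputs = tok(s, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        feats.append(out.hidden_states[LAYER][0, -1].numpy())
    return np.stack(feats)

# Step (3): clean training data with known truth labels (toy examples).
train_stmts = ["2 + 2 = 4.", "2 + 2 = 5.",
               "The Earth orbits the Sun.", "The Sun orbits the Earth."]
train_labels = np.array([1, 0, 1, 0])
probe = LogisticRegression(max_iter=1000).fit(collect_activations(train_stmts), train_labels)

# Step (5): apply the same probe to held-out statements; in the real experiment
# these would be responses from the adversarially fine-tuned model.
test_stmts = ["Mount Everest is the tallest mountain above sea level.",
              "Mount Everest is located in the Alps."]
test_labels = np.array([1, 0])
scores = probe.predict_proba(collect_activations(test_stmts))[:, 1]
print("probe AUROC on held-out statements:", roc_auc_score(test_labels, scores))
```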
