AI Safety Marketplace

■ Problem Statement

Determine under what conditions we can reliably identify honest model responses by examining internal representations, particularly in adversarial settings where models have learned to produce answers that appear correct to overseers but are actually wrong. The goal is to characterize when honesty detection works, when it fails, and what factors determine success.

■ Background

Scalable oversight faces a fundamental problem: as models become more capable than their overseers, they may exploit systematic errors in human judgment. Recent work on truth detection in LLMs (Burns et al. 2022, Azaria & Mitchell 2023) shows promise for identifying truthful responses through internal activations rather than output quality. Follow-up work has used these representations for classification and control (Zou et al. 2023, Panickssery et al. 2024, Li et al. 2023, Mallen et al. 2024). However, most studies use naturally occurring true/false statements rather than adversarial settings where models actively deceive. Critical gaps remain in understanding: (1) whether honesty probes generalize from simple to complex domains, (2) whether models can learn to evade detection, and (3) what model properties enable reliable detection.

■ Scope

IN SCOPE: Empirical investigation of honesty detection methods on language models, especially in adversarial settings; characterizing failure modes and success conditions; comparing probing techniques (linear probes, contrast pairs, causal interventions); studying how training procedures affect detectability; analysis across model scales and architectures. OUT OF SCOPE: Developing completely novel oversight methods unrelated to internal representations; work purely on improving human oversight without leveraging model internals; studying honesty in non-language domains without clear transfer insights. CONSTRAINTS: Should focus on settings with systematic, exploitable overseer errors (not random noise); must go beyond demonstrating that honesty detection works in some cases to characterizing when and why.

■ Acceptance Criteria

Demonstrate honesty detection on at least 3 models fine-tuned to be deceptive in different domains, with quantitative comparison to baseline methods

Identify and characterize at least 2 specific conditions or model properties that determine whether honesty detection succeeds or fails

Test whether models can be trained to evade honesty detection while maintaining deceptiveness, with results on success/failure

Provide theoretical or empirical analysis of generalization: do honesty detectors trained on simple cases work on complex cases where oversight fails?

START HERE: (1) Read Burns et al. 2022 'Discovering Latent Knowledge' and Azaria & Mitchell 2023 on truth detection; (2) Review Panickssery et al. 2024 on steering model honesty; (3) Replicate basic honesty detection on an open model (Llama-2-7B or similar) using simple true/false questions; (4) Create an adversarial testbed by fine-tuning a model to give wrong answers that sound plausible in a specific domain; (5) Test whether honesty probes trained on clean data detect dishonesty in your adversarial model. CONCRETE EXPERIMENTS: Compare linear probes vs. contrast pairs vs. logit lens for honesty detection; vary the complexity of domains; try models of different sizes; test whether honesty detectors transfer across domains; investigate whether models can be trained to evade detection while maintaining deceptiveness.

Detecting Model Honesty via Internal Representations When Oversight Fails

■ Problem Statement

■ Background

■ Scope

Getting Started

■ Impact Assessment

■ Prerequisites

■ Acceptance Criteria

■ Expected Artifacts

Working On This

Sources

Discussion (0)

Working On This

Discussion (0)