Detect AI honesty via internal states, not output quality, especially when models exploit oversight errors.
Determine under what conditions we can reliably identify honest model responses by examining internal representations, particularly in adversarial settings where models have learned to produce answers that appear correct to overseers but are actually wrong. The goal is to characterize when honesty detection works, when it fails, and what factors determine success.
Scalable oversight faces a fundamental problem: as models become more capable than their overseers, they may exploit systematic errors in human judgment. Recent work on truth detection in LLMs (Burns et al. 2022, Azaria & Mitchell 2023) shows promise for identifying truthful responses through internal activations rather than output quality. Follow-up work has used these representations for classification and control (Zou et al. 2023, Panickssery et al. 2024, Li et al. 2023, Mallen et al. 2024). However, most studies use naturally occurring true/false statements rather than adversarial settings where models actively deceive. Critical gaps remain in understanding: (1) whether honesty probes generalize from simple to complex domains, (2) whether models can learn to evade detection, and (3) what model properties enable reliable detection.
IN SCOPE: Empirical investigation of honesty detection methods on language models, especially in adversarial settings; characterizing failure modes and success conditions; comparing probing techniques (linear probes, contrast pairs, causal interventions); studying how training procedures affect detectability; analysis across model scales and architectures. OUT OF SCOPE: Developing completely novel oversight methods unrelated to internal representations; work purely on improving human oversight without leveraging model internals; studying honesty in non-language domains without clear transfer insights. CONSTRAINTS: Should focus on settings with systematic, exploitable overseer errors (not random noise); must go beyond demonstrating that honesty detection works in some cases to characterizing when and why.
Strong ML background required; familiarity with interpretability techniques (probing, activation analysis); experience fine-tuning language models; understanding of alignment and oversight problems; ability to design experiments and analyze results statistically.
Created: 2/5/2026
Last updated: 2/9/2026
START HERE: (1) Read Burns et al. 2022 'Discovering Latent Knowledge' and Azaria & Mitchell 2023 on truth detection; (2) Review Panickssery et al. 2024 on steering model honesty; (3) Replicate basic honesty detection on an open model (Llama-2-7B or similar) using simple true/false questions; (4) Create an adversarial testbed by fine-tuning a model to give wrong answers that sound plausible in a specific domain; (5) Test whether honesty probes trained on clean data detect dishonesty in your adversarial model. CONCRETE EXPERIMENTS: Compare linear probes vs. contrast pairs vs. logit lens for honesty detection; vary the complexity of domains; try models of different sizes; test whether honesty detectors transfer across domains; investigate whether models can be trained to evade detection while maintaining deceptiveness.
No one is working on this yet. Be the first!
Sign in to indicate you're working on this problem.
to join the discussion