Build unsupervised methods to detect unusual AI behavior without pre-labeled dangerous examples
This problem asks researchers to develop and evaluate anomaly detection techniques that can identify when AI models exhibit unusual or out-of-distribution behavior, without requiring labeled examples of dangerous behaviors.
The Challenge: Supervised monitoring approaches require curated datasets of positive and negative examples of harmful behaviors. However, we often cannot anticipate what dangerous failures will look like ahead of time—especially for novel capabilities or adversarial inputs like jailbreaks. Unsupervised anomaly detection offers an alternative: detecting when a model's computation or behavior deviates significantly from normal patterns.
Two Key Use Cases:
Flagging Out-of-Distribution Inputs: Consider a deployed model with a supervised monitor trained to catch known harmful behaviors. This monitor may fail on rare, unusual inputs (like novel jailbreaks) that differ substantially from training data. An anomaly detector analyzing the model's latent activations could flag these OOD inputs for manual inspection, allowing humans to label them and iteratively improve the supervised monitor's coverage. Existing anomaly detection techniques applied to neural network activations may already be suitable for this task.
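To make this concrete, one standard off-the-shelf approach is a Mahalanobis-distance detector over the model's hidden activations: fit a Gaussian to activations collected on normal traffic, then flag inputs whose activations lie far from that distribution. A minimal NumPy sketch (the `ActivationAnomalyDetector` class name, the regularization constant, and the thresholding scheme are all illustrative choices, not a prescribed method):

```python
import numpy as np

class ActivationAnomalyDetector:
    """Flags inputs whose hidden activations are far from the training
    distribution, measured by squared Mahalanobis distance."""

    def fit(self, acts: np.ndarray) -> "ActivationAnomalyDetector":
        # acts: (n_samples, d) activations collected on in-distribution inputs
        self.mean_ = acts.mean(axis=0)
        cov = np.cov(acts, rowvar=False)
        # Small ridge term keeps the covariance invertible for modest sample sizes
        self.prec_ = np.linalg.inv(cov + 1e-6 * np.eye(acts.shape[1]))
        return self

    def score(self, acts: np.ndarray) -> np.ndarray:
        # Higher score = more anomalous (squared Mahalanobis distance)
        diff = acts - self.mean_
        return np.einsum("nd,de,ne->n", diff, self.prec_, diff)

    def flag(self, acts: np.ndarray, threshold: float) -> np.ndarray:
        # Boolean mask of inputs to route for manual inspection
        return self.score(acts) > threshold

# Usage: fit on clean activations, pick a threshold from their score quantiles
rng = np.random.default_rng(0)
normal_acts = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in for real activations
detector = ActivationAnomalyDetector().fit(normal_acts)
threshold = np.quantile(detector.score(normal_acts), 0.99)
```

In practice the activations would come from a chosen hidden layer of the deployed model rather than synthetic data, and the threshold quantile sets how much human review traffic the pipeline tolerates.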
Detecting Capability Gaps Between Models: More ambitiously, anomaly detection could identify inputs where a strong model understands something a weak model does not. When a powerful model encounters concepts or patterns beyond a weaker model's comprehension, this may produce behavior that appears novel or surprising from the weak model's perspective. Automatically detecting such capability-gap inputs is an unexplored problem that could be valuable for scalable oversight approaches where weak models monitor strong ones.
Why This Matters: As AI systems become more capable, we need monitoring techniques that don't rely solely on our ability to anticipate all possible failure modes. Anomaly detection provides a complementary layer of defense: even if we can't predict specific dangerous behaviors, we might still detect when something unusual is happening that warrants human attention. This is particularly important for AI control strategies, where we want to maintain oversight even as models develop unexpected capabilities.
Current State: Standard anomaly detection techniques exist in machine learning, including methods based on reconstruction error, density estimation, and latent-space analysis, and some preliminary work has explored applying these to neural network activations. However, the specific application to AI safety monitoring, especially the capability-gap detection problem, remains largely unexplored. Key open questions include: Which activation layers are most informative? How do we balance false positive rates against detection sensitivity? Can we distinguish genuinely concerning anomalies from benign distribution shift?
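The reconstruction-error family and the false-positive/sensitivity tradeoff can both be illustrated with a linear (PCA-based) detector: project activations onto their top principal components, score inputs by how poorly the projection reconstructs them, and calibrate the alert threshold to a target false positive rate on held-out clean data. This is a minimal sketch with illustrative function names; real systems might use a learned autoencoder instead of PCA, but the calibration logic is the same:

```python
import numpy as np

def fit_pca_detector(acts: np.ndarray, n_components: int):
    """Fit a linear reconstruction-error detector: keep the top principal
    directions of clean activations as the 'normal' subspace."""
    mean = acts.mean(axis=0)
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    comps = vt[:n_components]  # (k, d) principal directions
    return mean, comps

def reconstruction_error(acts: np.ndarray, mean, comps) -> np.ndarray:
    """Squared error left over after projecting onto the normal subspace;
    large values mean the input does not look like training-time traffic."""
    centered = acts - mean
    recon = centered @ comps.T @ comps
    return np.square(centered - recon).sum(axis=1)

def calibrate_threshold(clean_holdout, mean, comps, target_fpr=0.01):
    """Choose the threshold so that roughly target_fpr of held-out clean
    inputs are flagged. Lowering target_fpr reduces review load but also
    reduces sensitivity to mild anomalies."""
    errs = reconstruction_error(clean_holdout, mean, comps)
    return np.quantile(errs, 1.0 - target_fpr)
```

The `target_fpr` knob is the tradeoff named above in miniature: it directly converts a review-budget decision into a detection threshold, while sensitivity to subtle anomalies has to be measured empirically against whatever test cases are available.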
Created: 2/5/2026
Last updated: 2/9/2026
What Success Looks Like: Progress on this problem would involve developing anomaly detection methods tailored to AI monitoring contexts, demonstrating their effectiveness on realistic test cases (such as detecting jailbreak attempts or capability differences), and understanding their limitations and failure modes. Ideally, solutions would be practical enough to integrate into real deployment pipelines while providing meaningful safety benefits.