AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

© 2026 AI Safety Marketplace. All rights reserved.
Status: Open

Develop unsupervised anomaly detection methods for AI model monitoring and control

Tags: interpretability, monitoring, scalable-oversight, anomaly-detection, alignment-research, anthropic, anthropic-directions-2025, ai-control, unsupervised-learning, out-of-distribution-detection
Difficulty: Intermediate
Verification: Research Output
Compute: Multi-GPU
Source: anthropic-directions-2025
Team Size: Solo
Description

Build unsupervised methods to detect unusual AI behavior without pre-labeled dangerous examples

This problem asks researchers to develop and evaluate anomaly detection techniques that can identify when AI models exhibit unusual or out-of-distribution behavior, without requiring labeled examples of dangerous behaviors.

The Challenge: Supervised monitoring approaches require curated datasets of positive and negative examples of harmful behaviors. However, we often cannot anticipate what dangerous failures will look like ahead of time—especially for novel capabilities or adversarial inputs like jailbreaks. Unsupervised anomaly detection offers an alternative: detecting when a model's computation or behavior deviates significantly from normal patterns.
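One concrete way to operationalize "deviates significantly from normal patterns" is to fit a Gaussian to activations collected on ordinary inputs and score new activations by Mahalanobis distance. The sketch below is illustrative only, using synthetic vectors as stand-ins for real hidden-state activations; all names and the 1% threshold are assumptions, not part of the problem statement:

```python
import numpy as np

def fit_gaussian(acts):
    """Fit mean and regularized precision matrix to normal-behavior activations."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-3 * np.eye(acts.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x, mu, prec):
    """Squared Mahalanobis distance of an activation vector from the fit."""
    d = x - mu
    return float(d @ prec @ d)

# Calibrate on activations gathered under normal inputs, then flag new
# inputs whose score exceeds a percentile-based threshold.
rng = np.random.default_rng(0)
normal_acts = rng.normal(size=(500, 16))   # stand-in for hidden activations
mu, prec = fit_gaussian(normal_acts)
scores = [mahalanobis_score(a, mu, prec) for a in normal_acts]
threshold = np.percentile(scores, 99)      # assumed 1% false-positive budget
outlier = mu + 10.0                        # far-from-distribution probe
print(mahalanobis_score(outlier, mu, prec) > threshold)  # → True
```

In practice the covariance of real activations is high-dimensional and often ill-conditioned, which is one reason layer choice and regularization are open questions here.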

Two Key Use Cases:

  1. Flagging Out-of-Distribution Inputs: Consider a deployed model with a supervised monitor trained to catch known harmful behaviors. This monitor may fail on rare, unusual inputs (like novel jailbreaks) that differ substantially from training data. An anomaly detector analyzing the model's latent activations could flag these OOD inputs for manual inspection, allowing humans to label them and iteratively improve the supervised monitor's coverage. Existing anomaly detection techniques applied to neural network activations may already be suitable for this task.

  2. Detecting Capability Gaps Between Models: More ambitiously, anomaly detection could identify inputs where a strong model understands something a weak model does not. When a powerful model encounters concepts or patterns beyond a weaker model's comprehension, this may produce behavior that appears novel or surprising from the weak model's perspective. Automatically detecting such capability-gap inputs is an unexplored problem that could be valuable for scalable oversight approaches where weak models monitor strong ones.
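To make the second use case concrete, one hedged option (not an established method, just a toy operationalization) is to compare the weak and strong models' next-token distributions and flag inputs where they diverge sharply, since "surprising from the weak model's perspective" should show up as high divergence. A self-contained numpy sketch with hand-set logits:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(strong || weak) between two next-token distributions."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Toy illustration: on an "easy" input the two models roughly agree; on a
# "capability-gap" input the strong model is confident where the weak model
# is near-uniform, producing a much larger divergence.
easy_strong, easy_weak = np.array([2.0, 0.1, 0.1]), np.array([1.8, 0.2, 0.1])
hard_strong, hard_weak = np.array([5.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0])
print(kl_divergence(easy_strong, easy_weak)
      < kl_divergence(hard_strong, hard_weak))  # → True
```

Real evaluation would need divergence aggregated over full sequences and a way to separate capability gaps from ordinary disagreement, which is exactly the unexplored part of the problem.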

Why This Matters: As AI systems become more capable, we need monitoring techniques that don't rely solely on our ability to anticipate all possible failure modes. Anomaly detection provides a complementary layer of defense: even if we can't predict specific dangerous behaviors, we might still detect when something unusual is happening that warrants human attention. This is particularly important for AI control strategies, where we want to maintain oversight even as models develop unexpected capabilities.

Current State: Standard anomaly detection techniques exist in machine learning, including methods based on reconstruction error, density estimation, and latent space analysis. Some preliminary work (referenced in the problem statement) has explored applying these to neural network activations. However, the specific application to AI safety monitoring—especially the capability-gap detection problem—remains largely unexplored. Key open questions include: Which activation layers are most informative? How do we balance false positive rates with detection sensitivity? Can we distinguish genuinely concerning anomalies from benign distribution shift?
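Of the techniques listed, reconstruction error is the simplest to sketch. Below, a PCA fit serves as a linear stand-in for an autoencoder: "normal" activations are assumed to lie near a low-dimensional subspace, and distance from that subspace is the anomaly score. All data is synthetic and the setup is illustrative, not a recommended configuration:

```python
import numpy as np

def fit_pca(acts, k):
    """Fit a rank-k linear 'autoencoder' (PCA) to normal activations."""
    mu = acts.mean(axis=0)
    _, _, vt = np.linalg.svd(acts - mu, full_matrices=False)
    return mu, vt[:k]                   # top-k principal directions

def recon_error(x, mu, components):
    """Reconstruction error: distance of x from the learned subspace."""
    z = (x - mu) @ components.T         # encode
    x_hat = mu + z @ components         # decode
    return float(np.linalg.norm(x - x_hat))

rng = np.random.default_rng(1)
# Synthetic "normal" activations living in a 2-D subspace of a 10-D space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(300, 2)) @ basis
mu, comps = fit_pca(normal, k=2)
in_dist = rng.normal(size=2) @ basis        # new point from the same subspace
off_subspace = rng.normal(size=10) * 5.0    # anomalous direction
print(recon_error(in_dist, mu, comps) < recon_error(off_subspace, mu, comps))
```

A nonlinear autoencoder would replace the encode/decode lines with learned networks, but the scoring logic (flag high reconstruction error) is the same.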

■ Impact Assessment

Importance: High
Neglectedness: High
Tractability: Medium

■ Acceptance Criteria

  • Demonstrate an anomaly detection method that achieves >70% precision and >50% recall at detecting synthetic OOD inputs (e.g., jailbreaks or adversarial examples) on at least one model family, with documented evaluation methodology
  • Provide systematic comparison of at least 3 different anomaly detection approaches (e.g., Mahalanobis distance, autoencoders, isolation forests) on model activations, with analysis of which layers and representations are most informative
  • Characterize failure modes and limitations: identify at least 3 types of inputs that evade detection or produce false positives, with analysis of why the method fails
  • For capability-gap detection: demonstrate a method that flags inputs where model size/capability differences lead to behavioral divergence, with precision >60% on a constructed test set
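The precision and recall targets above can be checked with a straightforward thresholded evaluation. A minimal sketch with hypothetical anomaly scores and labels (label 1 = genuinely OOD input):

```python
def precision_recall(scores, labels, threshold):
    """Precision/recall for a thresholded anomaly detector."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical detector scores: four labeled-OOD and four in-distribution inputs.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3]
labels = [1,   1,   1,   1,   0,   0,   0,   0]
print(precision_recall(scores, labels, threshold=0.5))  # → (0.75, 0.75)
```

Sweeping the threshold over a held-out set traces out the precision/recall trade-off referenced in the "false positive rates vs. detection sensitivity" question above.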

■ Expected Artifacts

repo, paper, benchmark

Sources

https://alignment.anthropic.com/2025/recommended-directions/

Created: 2/5/2026

Last updated: 2/9/2026


What Success Looks Like: Progress on this problem would involve developing anomaly detection methods tailored to AI monitoring contexts, demonstrating their effectiveness on realistic test cases (such as detecting jailbreak attempts or capability differences), and understanding their limitations and failure modes. Ideally, solutions would be practical enough to integrate into real deployment pipelines while providing meaningful safety benefits.
