AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

Status: Open

Robust Activation Monitoring for Detecting AI Misbehavior Under Distribution Shift

Tags: interpretability, monitoring, robustness, distribution-shift, anomaly-detection, alignment-research, anthropic, anthropic-directions-2025, ai-control, activation-probes
Difficulty: Advanced
Verification: Research Output
Compute: Multi-GPU
Source: anthropic-directions-2025
Team Size: Small Team (2-4)
Description

Can we reliably detect AI misbehavior from internal activations despite train-test distribution shift?

Activation monitoring proposes to detect unsafe AI behavior by analyzing a model's internal representations (latent activations) rather than only its input-output behavior. For instance, we could train classifiers (probes) on a model's hidden states to detect when it's generating insecure code, planning deception, or exhibiting other undesired behaviors. This approach offers potential advantages: activations directly reflect the model's internal reasoning process and are harder for the model to manipulate than its outputs.
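The probe idea described above can be sketched in a few lines. This is a minimal illustration, not a real monitor: since actual model hidden states are not available here, "safe" and "unsafe" activations are simulated as Gaussian vectors separated along one hypothetical behavior direction, and the probe is a plain logistic-regression classifier trained by gradient descent. All shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hidden dimension (illustrative; real residual streams are far larger)

# Simulate activations: "unsafe" examples are offset along one direction,
# standing in for a behavior-linked feature in the model's hidden state.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
safe = rng.normal(size=(200, d))
unsafe = rng.normal(size=(200, d)) + 3.0 * direction

X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe on activation vectors (gradient descent)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(unsafe)
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

w, b = train_probe(X, y)
acc = (((X @ w + b) > 0) == y).mean()
```

In practice the probe would be fit on activations extracted from a specific layer while the model processes labeled examples, and the learned direction `w` can itself be inspected as a candidate feature.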

Consider a concrete scenario: you're using a weak AI monitor to check whether a stronger AI code generator produces secure code. The strong model might obfuscate vulnerabilities in ways that fool the weak monitor, or the weak monitor might simply lack the capability to spot sophisticated exploits. By instead monitoring the strong model's own activations when it generates code, we might detect malicious intent or security-relevant concepts directly in its reasoning process, bypassing these failure modes.

However, activation monitoring faces a critical challenge: robustness to distribution shift. The most important shift is from synthetic training data to actual model-generated outputs. If we want to detect dangerous outputs we haven't yet seen, we must create synthetic training examples somehow—perhaps by prompting the model in controlled settings, using red-teaming, or generating adversarial examples. The core question is: will probes trained on these synthetic examples actually fire when the model naturally produces genuinely dangerous outputs in deployment?

This distribution shift problem is particularly acute because:

  1. Models may behave differently when actively trying to circumvent monitoring versus passively responding to test prompts
  2. Real dangerous outputs may have different activation patterns than our best synthetic approximations
  3. We cannot validate monitors on real dangerous outputs without actually producing those outputs
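One way to study the second failure mode empirically is to fit a probe on one "synthetic" distribution and measure how much accuracy it loses on a shifted "deployment" distribution. The sketch below models the shift as a simple mean offset plus rescaling of simulated activations; this is an assumed toy shift, not a claim about how real model-generated outputs differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def make_split(n, shift=0.0, scale=1.0):
    """Simulated safe/unsafe activations; shift and scale model the
    train-test distribution gap (illustrative, not realistic)."""
    safe = scale * rng.normal(size=(n, d)) + shift
    unsafe = scale * rng.normal(size=(n, d)) + shift + 3.0 * direction
    X = np.vstack([safe, unsafe])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def fit(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

X_syn, y_syn = make_split(300)                         # synthetic training data
X_real, y_real = make_split(300, shift=0.5, scale=1.5) # shifted "deployment" data

w, b = fit(X_syn, y_syn)
accuracy = lambda X, y: (((X @ w + b) > 0) == y).mean()
in_dist = accuracy(X_syn, y_syn)
shifted = accuracy(X_real, y_real)
degradation = in_dist - shifted  # quantitative degradation metric
```

Reporting `degradation` across several shift types is one concrete form the quantitative failure-mode characterization below could take.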

While preliminary research shows activation monitoring can work in controlled settings (the 'signs of life' mentioned in recent work), significant research gaps remain before this technique could reliably protect production AI systems. We need better understanding of when probes generalize, methods to generate synthetic training data that matches real threat models, techniques to verify monitor robustness without access to ground-truth dangerous outputs, and empirical studies of failure modes.

This problem sits at the intersection of AI safety, interpretability, and adversarial robustness. Success would provide a scalable, automatic method to detect specific unsafe behaviors—a less ambitious but more tractable alternative to fully understanding model cognition. Failure could mean deploying unreliable safety measures that provide false confidence while missing real threats.

■ Impact Assessment

Importance: High
Neglectedness: Medium
Tractability: Medium

■ Acceptance Criteria

  • Demonstrate activation probes achieving >80% accuracy on detecting a specific unsafe behavior (e.g., insecure code generation) when trained only on synthetic data and tested on model-generated examples
  • Provide empirical characterization of at least 3 distribution shift scenarios where probes fail, with quantitative degradation metrics
  • Develop and validate a method for generating synthetic training data that improves probe generalization by >15% over naive prompting baselines
  • Create reproducible benchmark suite with multiple probe tasks, datasets, and models for evaluating activation monitoring robustness
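The ">15%" bar in the third criterion is ambiguous between percentage points and relative improvement; the sketch below assumes the relative reading. The helper and the accuracy numbers are hypothetical, purely to show how the metric would be computed.

```python
def relative_improvement(candidate_acc: float, baseline_acc: float) -> float:
    """Fractional gain of a candidate data-generation method over a
    naive-prompting baseline, measured by probe accuracy on held-out
    model-generated examples."""
    return (candidate_acc - baseline_acc) / baseline_acc

# Illustrative accuracies (hypothetical, not measured results).
baseline_acc = 0.70   # probe trained on naively prompted synthetic data
candidate_acc = 0.84  # probe trained with the proposed generation method

gain = relative_improvement(candidate_acc, baseline_acc)
meets_bar = gain > 0.15
```

Whichever reading is intended, submissions should state the baseline accuracy explicitly so the improvement is interpretable.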

■ Expected Artifacts

repo, paper, benchmark, reproduction steps

Sources

https://alignment.anthropic.com/2025/recommended-directions/

Created: 2/5/2026

Last updated: 2/9/2026

