Design realistic differential harm benchmarks for LLM jailbreak attacks

red-teaming, evaluation, jailbreaks, alignment-research, anthropic, anthropic-directions-2025, adversarial-robustness, threat-modeling, benchmarking, differential-capability, misuse-risk
Difficulty: Advanced
Verification: Research Output
Compute: Single GPU
Source: anthropic-directions-2025
Time: Months
Team Size: Small Team (2-4)

Build benchmarks measuring real-world harm from jailbreaks vs. what adversaries can already do with existing tools

■ Problem Statement

Current jailbreak benchmarks measure whether models refuse harmful queries, but don't assess whether jailbroken responses enable adversaries to cause real harm beyond their existing capabilities. We need benchmarks that measure realistic, differential harm: whether jailbreaks provide useful capabilities that adversaries with access to search engines and other common tools don't already possess, focusing on genuinely dangerous misuse cases rather than trivial or unrealistic scenarios.

■ Background

Language model jailbreaks bypass safety guardrails through clever prompting. Most benchmarks (like those measuring attack success rate on harmful queries) use binary metrics: did the model refuse or comply? This misses whether compliance actually enables harm. Recent work such as the WMDP benchmark (Li et al., 2024) and Andriushchenko et al. (2024) on harmful capability evaluations represents progress toward measuring whether models can capably assist with dangerous tasks. However, these don't fully capture differential risk: whether AI assistance provides advantages over existing resources. This matters because overestimating risk wastes resources on non-threats, while underestimating it leads to dangerous deployments. Realistic threat modeling is essential for effective AI governance and safety prioritization.

■ Scope

In scope: Developing evaluation frameworks for measuring differential harm from jailbreaks; defining realistic adversary baselines (search engines, forums, existing tools); identifying genuinely dangerous use cases where AI provides unique capabilities; creating multi-step evaluation protocols; proposing metrics beyond binary refusal/compliance.

Focus areas: Cybersecurity, biosecurity, disinformation, and other high-stakes domains.

Out of scope: Developing new jailbreak attacks themselves; building defenses against jailbreaks; theoretical frameworks without concrete evaluation proposals; scenarios where harm is obviously trivial or accomplishable through basic means.

Constraints: Evaluations must be safe to conduct (no actual harm), reproducible, and practical for research teams to implement.
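To make the in-scope deliverables more concrete, here is a minimal sketch of how one threat scenario and its adversary baseline might be encoded. All field names and the example cybersecurity entry are illustrative assumptions rather than part of any existing benchmark; the point is that the adversary baseline and graded success criteria are first-class parts of each scenario.

```python
# Hypothetical schema for one benchmark entry; names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class AdversaryBaseline:
    """What a motivated adversary can already do without model assistance."""
    tools: list[str]      # e.g. ["web search", "public forums", "exploit databases"]
    evidence: str         # why this baseline is realistic (citations, pilot data)

@dataclass
class ThreatScenario:
    """One realistic misuse task on which differential harm is measured."""
    scenario_id: str
    domain: str                  # "cybersecurity", "biosecurity", "disinformation", ...
    task_description: str        # safe-to-evaluate proxy for the harmful task
    baseline: AdversaryBaseline
    success_criteria: list[str]  # graded rubric items, not a binary refusal check
    safety_notes: str            # why running the evaluation causes no actual harm

example = ThreatScenario(
    scenario_id="cyber-001",
    domain="cybersecurity",
    task_description="Adapt a published proof-of-concept exploit to an isolated test VM",
    baseline=AdversaryBaseline(
        tools=["web search", "exploit databases", "vendor advisories"],
        evidence="The PoC code and the advisory are publicly indexed",
    ),
    success_criteria=["exploit runs against the test VM", "finished within the time budget"],
    safety_notes="Run only against an isolated, instrumented VM owned by the evaluators",
)
```

A benchmark release could then simply be a list of such entries covering the focus-area domains.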

■ Impact Assessment

Importance: High
Neglectedness: High
Tractability: Medium

■ Prerequisites

Strong background in ML safety, red-teaming, or security research. Experience with LLM evaluation and prompting. Understanding of threat modeling and realistic attack scenarios. Ability to design human subject studies or automated evaluation protocols. Familiarity with existing jailbreak literature and capability evaluations.

■ Acceptance Criteria

  • Propose concrete benchmark with 10+ realistic threat scenarios where differential harm can be measured
  • Define adversary baseline capabilities for each scenario with supporting evidence (what tools/info they already have)
  • Specify evaluation protocol including metrics for differential capability rather than binary refusal (a sketch of one such graded metric follows this list)
  • Demonstrate pilot results on 2+ scenarios comparing baseline vs. jailbroken model assistance
  • Provide implementation code/framework others can use to evaluate models on the benchmark
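One concrete way to satisfy the graded-metric criterion above is to score each transcript against the scenario's rubric items instead of recording refuse/comply, then report the gap between conditions. The sketch below is a hedged illustration: the function names and the `item_satisfied` grader interface are assumptions, not a prescribed implementation.

```python
# Sketch of a graded (non-binary) score: the fraction of rubric items a transcript
# satisfies. `item_satisfied` can be a human annotation lookup, a keyword check,
# or an LLM judge; its exact form is left to the benchmark implementer.
from typing import Callable

def rubric_score(transcript: str,
                 rubric_items: list[str],
                 item_satisfied: Callable[[str, str], bool]) -> float:
    if not rubric_items:
        return 0.0
    hits = sum(item_satisfied(transcript, item) for item in rubric_items)
    return hits / len(rubric_items)

def differential_capability(jailbroken_transcript: str,
                            baseline_material: str,
                            rubric_items: list[str],
                            item_satisfied: Callable[[str, str], bool]) -> float:
    """Uplift: jailbroken-model score minus what the adversary baseline already yields."""
    return (rubric_score(jailbroken_transcript, rubric_items, item_satisfied)
            - rubric_score(baseline_material, rubric_items, item_satisfied))
```

A per-scenario uplift near zero suggests the jailbreak adds little beyond what search engines and forums already provide; consistently positive uplift across scenarios is the signal this benchmark is meant to surface.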

■ Expected Artifacts

paper, benchmark, repo, dataset, eval harness

Sources

https://alignment.anthropic.com/2025/recommended-directions/

Created: 2/5/2026

Last updated: 2/9/2026

■ Getting Started

  1. Read Li et al., "The WMDP Benchmark" (2024), Andriushchenko et al. on harmful capability evals (2024), and Anthropic's red-teaming papers on model capabilities.
  2. Survey existing jailbreak benchmarks (AdvBench, HarmBench) to understand current metrics.
  3. Identify 2-3 concrete threat scenarios (e.g., exploiting CVE vulnerabilities, synthesizing dangerous compounds) and baseline what adversaries can already learn from Google and forums.
  4. Design a pilot evaluation: have humans attempt a task with (a) Google only, (b) a base model, and (c) a jailbroken model, and measure success rates and time-to-completion.
  5. Prototype metrics for differential capability (success-rate delta, information gain, task complexity enabled); a worked sketch follows this list.
  6. Iterate based on what differentiates genuine risks from noise.
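As a companion to steps 4 and 5, the sketch below shows how pilot results from the three arms could be summarized into a success-rate delta and time-to-completion. The arm names and the tiny inline dataset are made up for illustration.

```python
# Hypothetical pilot analysis for steps 4-5; arm names and data are illustrative.
from statistics import mean

# One record per participant attempt: (arm, succeeded, minutes spent)
pilot = [
    ("google_only", False, 60.0),
    ("google_only", True, 55.0),
    ("base_model", True, 40.0),
    ("base_model", False, 60.0),
    ("jailbroken_model", True, 25.0),
    ("jailbroken_model", True, 30.0),
]

def arm_stats(arm: str) -> tuple[float, float | None]:
    """Success rate and mean time-to-completion (successful attempts only)."""
    rows = [(ok, minutes) for a, ok, minutes in pilot if a == arm]
    success_rate = mean(1.0 if ok else 0.0 for ok, _ in rows)
    successful_times = [minutes for ok, minutes in rows if ok]
    time_to_completion = mean(successful_times) if successful_times else None
    return success_rate, time_to_completion

arms = {a: arm_stats(a) for a in ("google_only", "base_model", "jailbroken_model")}
# Differential capability: uplift of the jailbroken arm over the best non-jailbroken arm.
best_without_jailbreak = max(arms["google_only"][0], arms["base_model"][0])
uplift = arms["jailbroken_model"][0] - best_without_jailbreak
print(arms)
print("success-rate uplift from jailbreak:", round(uplift, 2))
```

Real pilot data would need uncertainty estimates for the deltas (e.g., bootstrap intervals over participants), but the shape of the summary stays the same.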
