Build benchmarks measuring real-world harm from jailbreaks vs. what adversaries can already do with existing tools
Current jailbreak benchmarks measure whether models refuse harmful queries, but don't assess whether jailbroken responses enable adversaries to cause real harm beyond their existing capabilities. We need benchmarks that measure realistic, differential harm: whether jailbreaks provide useful capabilities that adversaries with access to search engines and other common tools don't already possess, focusing on genuinely dangerous misuse cases rather than trivial or unrealistic scenarios.
Language model jailbreaks bypass safety guardrails through clever prompting. Most benchmarks (like those measuring attack success rate on harmful queries) use binary metrics: did the model refuse or comply? This misses whether compliance actually enables harm. Recent work by Souly et al. (2024) on WMDP and Andriushchenko et al. (2024) on harmful capability evaluations represents progress toward measuring whether models can capably assist with dangerous tasks. However, these don't fully capture differential risk—whether AI assistance provides advantages over existing resources. This matters because overestimating risk wastes resources on non-threats, while underestimating it leads to dangerous deployments. Realistic threat modeling is essential for effective AI governance and safety prioritization.
IN SCOPE: Developing evaluation frameworks for measuring differential harm from jailbreaks; defining realistic adversary baselines (search engines, forums, existing tools); identifying genuinely dangerous use-cases where AI provides unique capabilities; creating multi-step evaluation protocols; proposing metrics beyond binary refusal/compliance. FOCUS AREAS: Cybersecurity, biosecurity, disinformation, other high-stakes domains. OUT OF SCOPE: Developing new jailbreak attacks themselves; building defenses against jailbreaks; theoretical frameworks without concrete evaluation proposals; scenarios where harm is obviously trivial or accomplishable through basic means. CONSTRAINTS: Evaluations must be safe to conduct (no actual harm), reproducible, and practical for research teams to implement.
Strong background in ML safety, red-teaming, or security research. Experience with LLM evaluation and prompting. Understanding of threat modeling and realistic attack scenarios. Ability to design human subject studies or automated evaluation protocols. Familiarity with existing jailbreak literature and capability evaluations.
Created: 2/5/2026
Last updated: 2/9/2026
No one is working on this yet. Be the first!
Sign in to indicate you're working on this problem.
to join the discussion