Design realistic differential harm benchmarks for LLM jailbreak attacks

red-teaming, evaluation, jailbreaks, alignment-research, anthropic, anthropic-directions-2025, adversarial-robustness, threat-modeling, benchmarking, differential-capability, misuse-risk
Difficulty: Advanced
Verification: Research Output
Compute: Single GPU
Source: anthropic-directions-2025
Time: Months
Team Size: Small Team (2-4)

Build benchmarks measuring real-world harm from jailbreaks vs. what adversaries can already do with existing tools

■ Problem Statement

Current jailbreak benchmarks measure whether models refuse harmful queries, but don't assess whether jailbroken responses enable adversaries to cause real harm beyond their existing capabilities. We need benchmarks that measure realistic, differential harm: whether jailbreaks provide useful capabilities that adversaries with access to search engines and other common tools don't already possess, focusing on genuinely dangerous misuse cases rather than trivial or unrealistic scenarios.

■ Background

Language model jailbreaks bypass safety guardrails through clever prompting. Most benchmarks (like those measuring attack success rate on harmful queries) use binary metrics: did the model refuse or comply? This misses whether compliance actually enables harm. Recent work such as the WMDP benchmark (Li et al., 2024) and Andriushchenko et al. (2024) on harmful capability evaluations represents progress toward measuring whether models can capably assist with dangerous tasks. However, these don't fully capture differential risk: whether AI assistance provides advantages over existing resources. This matters because overestimating risk wastes resources on non-threats, while underestimating it leads to dangerous deployments. Realistic threat modeling is essential for effective AI governance and safety prioritization.

■ Scope

In scope: Developing evaluation frameworks for measuring differential harm from jailbreaks; defining realistic adversary baselines (search engines, forums, existing tools); identifying genuinely dangerous use cases where AI provides unique capabilities; creating multi-step evaluation protocols; proposing metrics beyond binary refusal/compliance.

Focus areas: Cybersecurity, biosecurity, disinformation, and other high-stakes domains.

Out of scope: Developing new jailbreak attacks themselves; building defenses against jailbreaks; theoretical frameworks without concrete evaluation proposals; scenarios where harm is obviously trivial or accomplishable through basic means.

Constraints: Evaluations must be safe to conduct (no actual harm), reproducible, and practical for research teams to implement.
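To make the in-scope deliverables more concrete, here is a minimal sketch of how one threat scenario and its adversary baseline might be encoded. All field names and the example cybersecurity entry are illustrative assumptions rather than part of any existing benchmark; the point is that the adversary baseline and graded success criteria are first-class parts of each scenario.

```python
# Hypothetical schema for one benchmark entry; names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class AdversaryBaseline:
    """What a motivated adversary can already do without model assistance."""
    tools: list[str]      # e.g. ["web search", "public forums", "exploit databases"]
    evidence: str         # why this baseline is realistic (citations, pilot data)

@dataclass
class ThreatScenario:
    """One realistic misuse task on which differential harm is measured."""
    scenario_id: str
    domain: str                  # "cybersecurity", "biosecurity", "disinformation", ...
    task_description: str        # safe-to-evaluate proxy for the harmful task
    baseline: AdversaryBaseline
    success_criteria: list[str]  # graded rubric items, not a binary refusal check
    safety_notes: str            # why running the evaluation causes no actual harm

example = ThreatScenario(
    scenario_id="cyber-001",
    domain="cybersecurity",
    task_description="Adapt a published proof-of-concept exploit to an isolated test VM",
    baseline=AdversaryBaseline(
        tools=["web search", "exploit databases", "vendor advisories"],
        evidence="The PoC code and the advisory are publicly indexed",
    ),
    success_criteria=["exploit runs against the test VM", "finished within the time budget"],
    safety_notes="Run only against an isolated, instrumented VM owned by the evaluators",
)
```

A benchmark release could then simply be a list of such entries covering the focus-area domains.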

■ Impact Assessment

Importance: High
Neglectedness: High
Tractability: Medium

■ Prerequisites

Strong background in ML safety, red-teaming, or security research. Experience with LLM evaluation and prompting. Understanding of threat modeling and realistic attack scenarios. Ability to design human subject studies or automated evaluation protocols. Familiarity with existing jailbreak literature and capability evaluations.

■ Acceptance Criteria

  • Propose concrete benchmark with 10+ realistic threat scenarios where differential harm can be measured
  • Define adversary baseline capabilities for each scenario with supporting evidence (what tools/info they already have)
  • Specify evaluation protocol including metrics for differential capability rather than binary refusal (a sketch of one such graded metric follows this list)
  • Demonstrate pilot results on 2+ scenarios comparing baseline vs. jailbroken model assistance
  • Provide implementation code/framework others can use to evaluate models on the benchmark
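One concrete way to satisfy the graded-metric criterion above is to score each transcript against the scenario's rubric items instead of recording refuse/comply, then report the gap between conditions. The sketch below is a hedged illustration: the function names and the `item_satisfied` grader interface are assumptions, not a prescribed implementation.

```python
# Sketch of a graded (non-binary) score: the fraction of rubric items a transcript
# satisfies. `item_satisfied` can be a human annotation lookup, a keyword check,
# or an LLM judge; its exact form is left to the benchmark implementer.
from typing import Callable

def rubric_score(transcript: str,
                 rubric_items: list[str],
                 item_satisfied: Callable[[str, str], bool]) -> float:
    if not rubric_items:
        return 0.0
    hits = sum(item_satisfied(transcript, item) for item in rubric_items)
    return hits / len(rubric_items)

def differential_capability(jailbroken_transcript: str,
                            baseline_material: str,
                            rubric_items: list[str],
                            item_satisfied: Callable[[str, str], bool]) -> float:
    """Uplift: jailbroken-model score minus what the adversary baseline already yields."""
    return (rubric_score(jailbroken_transcript, rubric_items, item_satisfied)
            - rubric_score(baseline_material, rubric_items, item_satisfied))
```

A per-scenario uplift near zero suggests the jailbreak adds little beyond what search engines and forums already provide; consistently positive uplift across scenarios is the signal this benchmark is meant to surface.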

■ Expected Artifacts

paper, benchmark, repo, dataset, eval harness

Sources

https://alignment.anthropic.com/2025/recommended-directions/

Created: 2/5/2026

Last updated: 2/9/2026

■ Getting Started

  1. Read Li et al., "The WMDP Benchmark" (2024), Andriushchenko et al. on harmful capability evals (2024), and Anthropic's red-teaming papers on model capabilities.
  2. Survey existing jailbreak benchmarks (AdvBench, HarmBench) to understand current metrics.
  3. Identify 2-3 concrete threat scenarios (e.g., exploiting CVE vulnerabilities, synthesizing dangerous compounds) and baseline what adversaries can already learn from Google and forums.
  4. Design a pilot evaluation: have humans attempt a task with (a) Google only, (b) a base model, and (c) a jailbroken model, and measure success rates and time-to-completion.
  5. Prototype metrics for differential capability (success-rate delta, information gain, task complexity enabled); a worked sketch follows this list.
  6. Iterate based on what differentiates genuine risks from noise.
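As a companion to steps 4 and 5, the sketch below shows how pilot results from the three arms could be summarized into a success-rate delta and time-to-completion. The arm names and the tiny inline dataset are made up for illustration.

```python
# Hypothetical pilot analysis for steps 4-5; arm names and data are illustrative.
from statistics import mean

# One record per participant attempt: (arm, succeeded, minutes spent)
pilot = [
    ("google_only", False, 60.0),
    ("google_only", True, 55.0),
    ("base_model", True, 40.0),
    ("base_model", False, 60.0),
    ("jailbroken_model", True, 25.0),
    ("jailbroken_model", True, 30.0),
]

def arm_stats(arm: str) -> tuple[float, float | None]:
    """Success rate and mean time-to-completion (successful attempts only)."""
    rows = [(ok, minutes) for a, ok, minutes in pilot if a == arm]
    success_rate = mean(1.0 if ok else 0.0 for ok, _ in rows)
    successful_times = [minutes for ok, minutes in rows if ok]
    time_to_completion = mean(successful_times) if successful_times else None
    return success_rate, time_to_completion

arms = {a: arm_stats(a) for a in ("google_only", "base_model", "jailbroken_model")}
# Differential capability: uplift of the jailbroken arm over the best non-jailbroken arm.
best_without_jailbreak = max(arms["google_only"][0], arms["base_model"][0])
uplift = arms["jailbroken_model"][0] - best_without_jailbreak
print(arms)
print("success-rate uplift from jailbreak:", round(uplift, 2))
```

Real pilot data would need uncertainty estimates for the deltas (e.g., bootstrap intervals over participants), but the shape of the summary stays the same.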
