AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

© 2026 AI Safety Marketplace. All rights reserved.
Status: Open

Comprehensive Defense-in-Depth Safety Stack Implementation and Evaluation on Open Models

Tags: red-teaming, alignment, benchmarks, research, safety, adversarial-robustness, peregrine-report, technical-safety, ai-risk-mitigation, open-models, defense-in-depth, empirical-safety, post-training
Difficulty: Advanced
Verification: Research Output
Compute: Multi-GPU
Source: peregrine-2025
Time

Implement all major safety techniques on an open model to measure how they combine against real attacks

■ Problem Statement

Current AI safety research evaluates techniques in isolation, so it remains unknown how different post-training safety interventions interact when combined. This project requires implementing a comprehensive stack of safety techniques on an open-source model and rigorously measuring their individual and combined effectiveness against real adversarial attacks, identifying gaps, conflicts, and optimal configurations for defense-in-depth.

■ Background

Most safety research publishes single-technique results on narrow benchmarks. However, deployed systems need multiple layers of defense, and we lack empirical understanding of how these layers interact. Do circuit breakers and RLHF provide additive protection or redundant coverage? Does adversarial training interfere with constitutional AI? Recent work like STACK (McKenzie et al., 2025) has begun examining these questions, but comprehensive empirical data is lacking. Additionally, standard safety benchmarks like HarmBench and XSTest have known limitations and don't reflect real adversarial pressure from sophisticated attackers. The field needs rigorous engineering work to understand practical defense-in-depth for AI systems.
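The additive-vs-redundant question above can be made concrete with a simple independence baseline: if two layers blocked attacks independently, the stacked attack success rate (ASR) would be the product of each layer's pass-through rate. A minimal sketch of that comparison (the function name, tolerance, and labels are illustrative, not from any cited work):

```python
def interaction_effect(asr_base, asr_a, asr_b, asr_ab, tol=0.02):
    """Classify how two defenses combine, against an independence model.

    asr_base: ASR of the undefended model; asr_a, asr_b: ASR with each
    defense alone; asr_ab: ASR with both. If each layer let attacks through
    independently, the combined ASR would be asr_a * asr_b / asr_base.
    Observed ASR below that suggests synergy; above it, redundancy or
    interference between the layers.
    """
    if asr_base <= 0:
        raise ValueError("baseline ASR must be positive")
    expected = asr_a * asr_b / asr_base
    if asr_ab < expected - tol:
        label = "synergy"
    elif asr_ab > expected + tol:
        label = "redundancy/interference"
    else:
        label = "independent"
    return expected, label
```

With asr_base=0.8 and each defense alone at 0.4, the independence baseline for the pair is 0.2; a measured combined ASR of 0.05 would count as synergy.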

■ Scope

IN SCOPE: implementing major post-training safety techniques (RLHF, SFT, constitutional AI, circuit breakers, representation engineering, adversarial training, unlearning, input/output filtering); evaluation against documented real attack methods; measuring interaction effects between techniques; computational and capability cost analysis; recommendations for deployment.

OUT OF SCOPE: novel safety technique development; training from scratch; closed-source models; theoretical analysis without implementation; runtime performance optimization; production deployment infrastructure.

CONSTRAINTS: must use publicly available open-source models; must test against documented real attacks, not just synthetic benchmarks; should focus on 5-10 major safety techniques rather than exhaustive coverage.
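Of the in-scope techniques, input/output filtering is the one most naturally expressed as a thin wrapper around generation, which also illustrates the modularity the deliverables call for. A hypothetical sketch (the function names, keyword check, and refusal string are placeholders for learned classifiers):

```python
from typing import Callable, List

def with_io_filters(generate: Callable[[str], str],
                    input_checks: List[Callable[[str], bool]],
                    output_checks: List[Callable[[str], bool]],
                    refusal: str = "I can't help with that.") -> Callable[[str], str]:
    """Layer input/output filters around a text-generation function."""
    def guarded(prompt: str) -> str:
        # Input filter: refuse before the model ever sees the prompt.
        if any(check(prompt) for check in input_checks):
            return refusal
        response = generate(prompt)
        # Output filter: catch harmful completions the model produced anyway.
        if any(check(response) for check in output_checks):
            return refusal
        return response
    return guarded

# Toy usage: a keyword check standing in for a learned classifier.
flag = lambda text: "detonator" in text.lower()
safe_generate = with_io_filters(lambda p: f"Echo: {p}", [flag], [flag])
```

Because each filter is just a callable, combinations with other layers (or ablations of individual layers) reduce to changing the lists passed in.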

■ Getting Started

■ Impact Assessment

Importance
High
Neglectedness
High
Tractability
High

■ Prerequisites

Required: Strong Python and PyTorch skills, experience fine-tuning large language models, familiarity with RLHF/DPO techniques, understanding of current AI safety techniques. Helpful: Experience with adversarial ML, knowledge of safety benchmarking, familiarity with transformer internals, prior work with open-source LLMs.
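For readers less familiar with the DPO half of the RLHF/DPO prerequisite, the per-example objective is compact enough to state directly. A plain-Python sketch (variable names are ours; beta=0.1 is a common but not canonical choice):

```python
import math

def dpo_loss(logp_chosen_pol: float, logp_rejected_pol: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.

    The policy is rewarded for widening the log-probability margin of the
    chosen response over the rejected one, relative to a frozen reference
    model; beta scales how far the policy may drift from the reference.
    """
    margin = beta * ((logp_chosen_pol - logp_chosen_ref)
                     - (logp_rejected_pol - logp_rejected_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

At zero margin the loss is log 2; it decreases monotonically as the policy prefers the chosen response more strongly than the reference does.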

■ Acceptance Criteria

  • Successfully implement at least 5 distinct safety techniques on a single open-source model (e.g., DeepSeek) with documented methodology and hyperparameters
  • Evaluate all individual techniques and at least 10 meaningful combinations against a suite of at least 20 documented real-world attack methods
  • Provide quantitative analysis of interaction effects between techniques, identifying at least 3 specific instances of synergy, interference, or redundancy
  • Deliver open-source codebase with modular implementation allowing others to reproduce results or apply techniques to different models
  • Produce comprehensive report with practical recommendations on which technique combinations provide optimal coverage-to-cost ratios for different deployment scenarios
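The combination sweep implied by the criteria above can be organized as a small driver loop. A hypothetical sketch, where apply_stack(subset) stands in for fine-tuning or wrapping a model with a given subset of techniques and returns a predicate reporting whether an attack succeeds:

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Tuple

def evaluate_combinations(
    techniques: Iterable[str],
    attacks: list,
    apply_stack: Callable[[Tuple[str, ...]], Callable[[object], bool]],
    max_size: int = 2,
) -> Dict[Tuple[str, ...], float]:
    """Attack success rate (ASR) for every technique subset up to max_size."""
    techniques = tuple(techniques)
    results = {}
    for k in range(max_size + 1):
        for subset in combinations(techniques, k):
            attack_succeeds = apply_stack(subset)
            hits = sum(1 for a in attacks if attack_succeeds(a))
            results[subset] = hits / len(attacks)
    return results

# Toy usage: two pretend defenses over integer "attacks".
def toy_stack(subset):
    def succeeds(a):
        if "block_even" in subset and a % 2 == 0:
            return False
        if "block_mult3" in subset and a % 3 == 0:
            return False
        return True
    return succeeds

asr = evaluate_combinations(["block_even", "block_mult3"], list(range(10)), toy_stack)
```

In a real run each subset would correspond to an actual trained/wrapped model, so the loop body is where most of the compute budget goes; the table it returns is the raw input to the interaction-effect analysis.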

■ Expected Artifacts

repo, paper, benchmark, blog post, reproduction steps

Related Resources

📄 Papers (1)

The 2025 Peregrine Report (PDF)

✍️ Blog Posts (1)

Peregrine Report Website

Sources

https://riskmitigation.ai/

Created: 2/5/2026

Last updated: 2/9/2026

Time: Months
Team Size: Small Team (2-4)
  1. Read STACK (McKenzie et al., 2025) for a framework on combining safety techniques.
  2. Review recent safety technique papers: Anthropic's constitutional AI, circuit breakers (Zou et al.), representation engineering (Zou et al.), recent RLHF implementations.
  3. Select a base model (DeepSeek-V3, Llama-3.1, or Qwen-2.5 recommended for strong baselines).
  4. Set up an evaluation harness using HarmBench as a starting point, extended with real jailbreaks from sources like jailbreakchat.com.
  5. Implement the first 2-3 techniques independently (safety SFT and RLHF are well-documented starting points).
  6. Design a measurement framework for both individual and combined effectiveness before scaling up.
  7. Fork existing safety implementation repos (e.g., alignment-handbook, OpenRLHF) rather than building from scratch.
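At its core, the evaluation harness in step 4 reduces to running each documented attack through the model and judging the response. A minimal sketch (generate and judge are hypothetical interfaces; a real harness would plug in a HarmBench-style classifier as the judge):

```python
from typing import Callable, Dict, List, Tuple

def run_attack_suite(
    generate: Callable[[str], str],
    attacks: List[dict],                 # each: {"id": str, "prompt": str}
    judge: Callable[[dict, str], bool],  # True if the attack succeeded
) -> Tuple[Dict[str, bool], float]:
    """Run every attack, record per-attack outcomes, and compute overall ASR."""
    outcomes = {a["id"]: judge(a, generate(a["prompt"])) for a in attacks}
    asr = sum(outcomes.values()) / len(outcomes)
    return outcomes, asr
```

Keeping per-attack outcomes (not just the aggregate ASR) is what makes the later interaction analysis possible: two defenses with identical ASRs may block entirely different attacks.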
