Implement all major safety techniques on an open model to measure how they combine against real attacks
Current AI safety research mostly evaluates techniques in isolation, so it remains unclear how different post-training safety interventions interact when combined. This project involves implementing a comprehensive stack of safety techniques on an open-source model and rigorously measuring their individual and combined effectiveness against real adversarial attacks, identifying gaps, conflicts, and optimal configurations for defense-in-depth.
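As a rough illustration of what "measuring individual and combined effectiveness" could look like, here is a minimal sketch (all names are hypothetical, not part of the project description) of a harness that applies subsets of defenses to a base model and records the attack success rate for each combination, so that additive, redundant, or conflicting pairs show up directly in the numbers:

```python
from itertools import combinations
from typing import Callable, Iterable

# Hypothetical types for illustration: a "defense" wraps a model pipeline,
# an "attack" probes it and reports whether it elicited disallowed behaviour.
Defense = Callable[[str], str]          # prompt -> (possibly refused) response
Attack = Callable[[Defense], bool]      # True if the attack succeeds

def attack_success_rate(pipeline: Defense, attacks: Iterable[Attack]) -> float:
    """Fraction of attacks that succeed against a given defense pipeline."""
    attacks = list(attacks)
    return sum(a(pipeline) for a in attacks) / len(attacks)

def evaluate_combinations(base_model: Defense,
                          defenses: dict[str, Callable[[Defense], Defense]],
                          attacks: list[Attack],
                          max_stack: int = 2) -> dict[tuple[str, ...], float]:
    """Measure attack success rate for every subset of defenses up to max_stack.

    Comparing ASR(A), ASR(B), and ASR(A+B) against the undefended baseline
    indicates whether two techniques stack, overlap, or interfere.
    """
    results: dict[tuple[str, ...], float] = {(): attack_success_rate(base_model, attacks)}
    names = list(defenses)
    for k in range(1, max_stack + 1):
        for combo in combinations(names, k):
            pipeline = base_model
            for name in combo:            # apply defenses in a fixed order
                pipeline = defenses[name](pipeline)
            results[combo] = attack_success_rate(pipeline, attacks)
    return results
```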
Most safety research publishes single-technique results on narrow benchmarks. However, deployed systems need multiple layers of defense, and we lack empirical understanding of how these layers interact. Do circuit breakers and RLHF provide additive protection or redundant coverage? Does adversarial training interfere with constitutional AI? Recent work like STACK (McKenzie et al., 2025) has begun examining these questions, but comprehensive empirical data is lacking. Additionally, standard safety benchmarks like HarmBench and XSTest have known limitations and don't reflect real adversarial pressure from sophisticated attackers. The field needs rigorous engineering work to understand practical defense-in-depth for AI systems.
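One way to make questions like "additive or redundant?" quantitative is to compare the observed attack success rate of a stacked pair against what independent coverage would predict. The sketch below is an illustrative definition (not taken from the project description) of such an interaction score:

```python
def interaction_score(asr_none: float, asr_a: float, asr_b: float, asr_ab: float) -> float:
    """Illustrative interaction metric from four attack-success-rate measurements.

    If the two defenses blocked independent fractions of attacks, we would expect
    ASR(A+B) ~= ASR(A) * ASR(B) / ASR(baseline). Scores near zero are consistent
    with independent coverage; positive scores suggest better-than-independent
    (complementary) stacking; negative scores suggest overlapping (redundant)
    coverage or outright interference between the two techniques.
    """
    expected_ab = (asr_a * asr_b) / asr_none if asr_none > 0 else 0.0
    return expected_ab - asr_ab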
IN SCOPE: Implementing major post-training safety techniques (RLHF, SFT, constitutional AI, circuit breakers, representation engineering, adversarial training, unlearning, input/output filtering); evaluation against real documented attack methods; measuring interaction effects between techniques; computational and capability cost analysis; recommendations for deployment.

OUT OF SCOPE: Novel safety technique development; training from scratch; closed-source models; theoretical analysis without implementation; runtime performance optimization; production deployment infrastructure.

CONSTRAINTS: Must use publicly available open-source models; must test against documented real attacks, not just synthetic benchmarks; should focus on 5-10 major safety techniques rather than exhaustive coverage.
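To make the scope concrete, a hypothetical experiment matrix might look like the following; the model name, technique identifiers, and attack suites here are placeholders, not commitments of the project:

```python
# Illustrative experiment configuration (all names and values are placeholders).
EXPERIMENT_CONFIG = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",   # any open-weights model would do
    "defenses": [
        "sft_refusal",                 # supervised fine-tuning on refusal data
        "dpo_safety",                  # RLHF/DPO-style preference optimization
        "constitutional_ai",
        "circuit_breakers",
        "representation_engineering",
        "adversarial_training",
        "unlearning",
        "io_filtering",                # input/output classifiers
    ],
    "attack_suites": [
        "documented_jailbreak_prompts",  # real, published attacks, not synthetic ones
        "gcg_suffixes",
        "many_shot_jailbreaking",
    ],
    "metrics": ["attack_success_rate", "benign_capability_delta", "compute_cost"],
    "max_stack_size": 3,               # evaluate single techniques, pairs, and triples
}
```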
Required: Strong Python and PyTorch skills, experience fine-tuning large language models, familiarity with RLHF/DPO techniques, understanding of current AI safety techniques. Helpful: Experience with adversarial ML, knowledge of safety benchmarking, familiarity with transformer internals, prior work with open-source LLMs.
Created: 2/5/2026
Last updated: 2/9/2026