Implement all major safety techniques on an open model to measure how they combine against real attacks
Current AI safety research mostly evaluates techniques in isolation, so it remains unclear how different post-training safety interventions interact when combined. This project involves implementing a comprehensive stack of safety techniques on an open-source model and rigorously measuring their individual and combined effectiveness against real adversarial attacks, identifying gaps, conflicts, and optimal configurations for defense-in-depth.
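As a rough illustration of what "measuring individual and combined effectiveness" could look like, here is a minimal sketch (all names are hypothetical, not part of the project description) of a harness that applies subsets of defenses to a base model and records the attack success rate for each combination, so that additive, redundant, or conflicting pairs show up directly in the numbers:

```python
from itertools import combinations
from typing import Callable, Iterable

# Hypothetical types for illustration: a "defense" wraps a model pipeline,
# an "attack" probes it and reports whether it elicited disallowed behaviour.
Defense = Callable[[str], str]          # prompt -> (possibly refused) response
Attack = Callable[[Defense], bool]      # True if the attack succeeds

def attack_success_rate(pipeline: Defense, attacks: Iterable[Attack]) -> float:
    """Fraction of attacks that succeed against a given defense pipeline."""
    attacks = list(attacks)
    return sum(a(pipeline) for a in attacks) / len(attacks)

def evaluate_combinations(base_model: Defense,
                          defenses: dict[str, Callable[[Defense], Defense]],
                          attacks: list[Attack],
                          max_stack: int = 2) -> dict[tuple[str, ...], float]:
    """Measure attack success rate for every subset of defenses up to max_stack.

    Comparing ASR(A), ASR(B), and ASR(A+B) against the undefended baseline
    indicates whether two techniques stack, overlap, or interfere.
    """
    results: dict[tuple[str, ...], float] = {(): attack_success_rate(base_model, attacks)}
    names = list(defenses)
    for k in range(1, max_stack + 1):
        for combo in combinations(names, k):
            pipeline = base_model
            for name in combo:            # apply defenses in a fixed order
                pipeline = defenses[name](pipeline)
            results[combo] = attack_success_rate(pipeline, attacks)
    return results
```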
Most safety research publishes single-technique results on narrow benchmarks. However, deployed systems need multiple layers of defense, and we lack empirical understanding of how these layers interact. Do circuit breakers and RLHF provide additive protection or redundant coverage? Does adversarial training interfere with constitutional AI? Recent work like STACK (McKenzie et al., 2025) has begun examining these questions, but comprehensive empirical data is lacking. Additionally, standard safety benchmarks like HarmBench and XSTest have known limitations and don't reflect real adversarial pressure from sophisticated attackers. The field needs rigorous engineering work to understand practical defense-in-depth for AI systems.
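One way to make questions like "additive or redundant?" quantitative is to compare the observed attack success rate of a stacked pair against what independent coverage would predict. The sketch below is an illustrative definition (not taken from the project description) of such an interaction score:

```python
def interaction_score(asr_none: float, asr_a: float, asr_b: float, asr_ab: float) -> float:
    """Illustrative interaction metric from four attack-success-rate measurements.

    If the two defenses blocked independent fractions of attacks, we would expect
    ASR(A+B) ~= ASR(A) * ASR(B) / ASR(baseline). Scores near zero are consistent
    with independent coverage; positive scores suggest better-than-independent
    (complementary) stacking; negative scores suggest overlapping (redundant)
    coverage or outright interference between the two techniques.
    """
    expected_ab = (asr_a * asr_b) / asr_none if asr_none > 0 else 0.0
    return expected_ab - asr_ab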
IN SCOPE: Implementing major post-training safety techniques (RLHF, SFT, constitutional AI, circuit breakers, representation engineering, adversarial training, unlearning, input/output filtering); evaluation against real documented attack methods; measuring interaction effects between techniques; computational and capability cost analysis; recommendations for deployment.

OUT OF SCOPE: Novel safety technique development; training from scratch; closed-source models; theoretical analysis without implementation; runtime performance optimization; production deployment infrastructure.

CONSTRAINTS: Must use publicly available open-source models; must test against documented real attacks, not just synthetic benchmarks; should focus on 5-10 major safety techniques rather than exhaustive coverage.
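To make the scope concrete, a hypothetical experiment matrix might look like the following; the model name, technique identifiers, and attack suites here are placeholders, not commitments of the project:

```python
# Illustrative experiment configuration (all names and values are placeholders).
EXPERIMENT_CONFIG = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",   # any open-weights model would do
    "defenses": [
        "sft_refusal",                 # supervised fine-tuning on refusal data
        "dpo_safety",                  # RLHF/DPO-style preference optimization
        "constitutional_ai",
        "circuit_breakers",
        "representation_engineering",
        "adversarial_training",
        "unlearning",
        "io_filtering",                # input/output classifiers
    ],
    "attack_suites": [
        "documented_jailbreak_prompts",  # real, published attacks, not synthetic ones
        "gcg_suffixes",
        "many_shot_jailbreaking",
    ],
    "metrics": ["attack_success_rate", "benign_capability_delta", "compute_cost"],
    "max_stack_size": 3,               # evaluate single techniques, pairs, and triples
}
```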
Required: Strong Python and PyTorch skills, experience fine-tuning large language models, familiarity with RLHF/DPO techniques, understanding of current AI safety techniques. Helpful: Experience with adversarial ML, knowledge of safety benchmarking, familiarity with transformer internals, prior work with open-source LLMs.
Created: 2/5/2026
Last updated: 2/9/2026