AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

AI Safety Bounties

Search and filter bounties in AI safety research

11 bounties found

Comprehensive Defense-in-Depth Safety Stack Implementation and Evaluation on Open Models
Open
This project aims to systematically implement and evaluate all major post-training safety techniques on a state-of-the-art open-source model (e.g., De...
open-models, defense-in-depth, red-teaming, empirical-safety, post-training, adversarial-robustness, research, peregrine-report, safety, benchmarks, technical-safety, ai-risk-mitigation, alignment
Advanced · Months · Multi-GPU
Automated Analysis Tools for Autonomous Agent Execution Traces
Open
Autonomous AI agents that perform sequential actions—such as coding assistants, web automation tools, or research agents—generate massive execution lo...
agent-foundations, automated-analysis, security, tool-development, anomaly-detection, interpretability, research, peregrine-report, oversight, technical-safety, evaluation, ai-risk-mitigation, alignment
Intermediate · Weeks · CPU Only
Scalable Training Data Attribution for Model Behavior Analysis
Open
Modern AI systems exhibit emergent behaviors that are difficult to predict or explain. Understanding which training examples contribute to specific mo...
mechanistic-interpretability, training-dynamics, scalability, data-attribution, influence-functions, interpretability, research, peregrine-report, safety, technical-safety, ai-risk-mitigation, alignment
Advanced · Months · Multi-GPU
Quantify AI-enabled cyber capability uplift for non-expert actors through comparative behavioral studies
Open
This research project aims to empirically measure how frontier AI systems enable less-sophisticated actors to conduct malicious cyber activities they ...
dual-use, red-teaming, empirical-research, capability-uplift, behavioral-studies, human-subjects, cybersecurity, evaluation, government-research, misuse, aisi-risks, uk-aisi
Advanced · Months · CPU Only
Developing Robust Machine Unlearning Methods Resistant to Few-Shot Re-Learning Attacks
Open
Machine unlearning aims to remove specific knowledge or capabilities from trained neural networks, which is critical for AI safety applications like r...
evaluation-methodology, llm-safety, model-safety, knowledge-removal, adversarial-robustness, robustness, oxford-martin, machine-unlearning, re-learning, fine-tuning, ai-safety
Advanced · Months · Multi-GPU
Quantifying Structural Assumptions Needed to Infer Preferences from Irrational Agent Behavior
Open
This project investigates a fundamental challenge in AI alignment: how can we infer an agent's true preferences from observing its behavior when that ...
gridworld, inverse-reinforcement-learning, bounded-rationality, empirical-research, interpretability, owain-evans, research-idea, alignment, preference-learning
Intermediate · Months · CPU Only
Design realistic differential harm benchmarks for LLM jailbreak attacks
Open
Language models can be jailbroken—prompted in clever ways to bypass their safety guardrails and respond to queries they would normally refuse. Current...
benchmarking, threat-modeling, differential-capability, evaluation, misuse-risk, jailbreaks, adversarial-robustness, alignment-research, red-teaming, anthropic, anthropic-directions-2025
Advanced · Months · Single GPU
Detecting Model Honesty via Internal Representations When Oversight Fails
Open
In scalable oversight, we face a fundamental challenge: as AI models become more capable, they may learn to exploit systematic errors in human oversig...
internal-representations, honesty-detection, mechanistic-interpretability, oversight-failures, truthfulness, deception, honesty, scalable-oversight, recursive-oversight, alignment-research, anthropic, anthropic-directions-2025
Advanced · Months · Multi-GPU
Develop unsupervised anomaly detection methods for AI model monitoring and control
Open
This problem asks researchers to develop and evaluate anomaly detection techniques that can identify when AI models exhibit unusual or out-of-distribu...
unsupervised-learning, scalable-oversight, interpretability, out-of-distribution-detection, ai-control, alignment-research, monitoring, anomaly-detection, anthropic, anthropic-directions-2025
Intermediate · Multi-GPU
Robust Activation Monitoring for Detecting AI Misbehavior Under Distribution Shift
Open
Activation monitoring proposes to detect unsafe AI behavior by analyzing a model's internal representations (latent activations) rather than only its ...
robustness, distribution-shift, interpretability, activation-probes, ai-control, alignment-research, monitoring, anomaly-detection, anthropic, anthropic-directions-2025
Advanced · Multi-GPU
Measuring and improving chain-of-thought faithfulness in language models
Open
Language models increasingly use chain-of-thought (CoT) reasoning to explain their outputs, but these explanations may not faithfully represent the mo...
mechanistic-interpretability, faithfulness, chain-of-thought, reasoning-transparency, explainability, interpretability, evaluations, alignment, model-cognition, alignment-research, anthropic, anthropic-directions-2025
Advanced · Single GPU
