AI Safety Marketplace

Connecting AI safety researchers with tractable problems in alignment and safety. Building safer AI together.

AI Safety Bounties

Search and filter bounties in AI safety research

11 bounties found

Comprehensive Defense-in-Depth Safety Stack Implementation and Evaluation on Open Models
Open
This project aims to systematically implement and evaluate all major post-training safety techniques on a state-of-the-art open-source model (e.g., De...
open-models, defense-in-depth, red-teaming, empirical-safety, post-training, adversarial-robustness, research, peregrine-report, safety, benchmarks, technical-safety, ai-risk-mitigation, alignment
Advanced · Months · Multi-GPU
Automated Analysis Tools for Autonomous Agent Execution Traces
Open
Autonomous AI agents that perform sequential actions—such as coding assistants, web automation tools, or research agents—generate massive execution lo...
agent-foundations, automated-analysis, security, tool-development, anomaly-detection, interpretability, research, peregrine-report, oversight, technical-safety, evaluation, ai-risk-mitigation, alignment
Intermediate · Weeks · CPU Only
Scalable Training Data Attribution for Model Behavior Analysis
Open
Modern AI systems exhibit emergent behaviors that are difficult to predict or explain. Understanding which training examples contribute to specific mo...
mechanistic-interpretability, training-dynamics, scalability, data-attribution, influence-functions, interpretability, research, peregrine-report, safety, technical-safety, ai-risk-mitigation, alignment
Advanced · Months · Multi-GPU
Quantify AI-enabled cyber capability uplift for non-expert actors through comparative behavioral studies
Open
This research project aims to empirically measure how frontier AI systems enable less-sophisticated actors to conduct malicious cyber activities they ...
dual-use, red-teaming, empirical-research, capability-uplift, behavioral-studies, human-subjects, cybersecurity, evaluation, government-research, misuse, aisi-risks, uk-aisi
Advanced · Months · CPU Only
Developing Robust Machine Unlearning Methods Resistant to Few-Shot Re-Learning Attacks
Open
Machine unlearning aims to remove specific knowledge or capabilities from trained neural networks, which is critical for AI safety applications like r...
evaluation-methodology, llm-safety, model-safety, knowledge-removal, adversarial-robustness, robustness, oxford-martin, machine-unlearning, re-learning, fine-tuning, ai-safety
Advanced · Months · Multi-GPU
Quantifying Structural Assumptions Needed to Infer Preferences from Irrational Agent Behavior
Open
This project investigates a fundamental challenge in AI alignment: how can we infer an agent's true preferences from observing its behavior when that ...
gridworld, inverse-reinforcement-learning, bounded-rationality, empirical-research, interpretability, owain-evans, research-idea, alignment, preference-learning
Intermediate · Months · CPU Only
Design realistic differential harm benchmarks for LLM jailbreak attacks
Open
Language models can be jailbroken—prompted in clever ways to bypass their safety guardrails and respond to queries they would normally refuse. Current...
benchmarking, threat-modeling, differential-capability, evaluation, misuse-risk, jailbreaks, adversarial-robustness, alignment-research, red-teaming, anthropic, anthropic-directions-2025
Advanced · Months · Single GPU
Detecting Model Honesty via Internal Representations When Oversight Fails
Open
In scalable oversight, we face a fundamental challenge: as AI models become more capable, they may learn to exploit systematic errors in human oversig...
internal-representations, honesty-detection, mechanistic-interpretability, oversight-failures, truthfulness, deception, honesty, scalable-oversight, recursive-oversight, alignment-research, anthropic, anthropic-directions-2025
Advanced · Months · Multi-GPU
Develop unsupervised anomaly detection methods for AI model monitoring and control
Open
This problem asks researchers to develop and evaluate anomaly detection techniques that can identify when AI models exhibit unusual or out-of-distribu...
unsupervised-learning, scalable-oversight, interpretability, out-of-distribution-detection, ai-control, alignment-research, monitoring, anomaly-detection, anthropic, anthropic-directions-2025
Intermediate · Multi-GPU
Robust Activation Monitoring for Detecting AI Misbehavior Under Distribution Shift
Open
Activation monitoring proposes to detect unsafe AI behavior by analyzing a model's internal representations (latent activations) rather than only its ...
robustness, distribution-shift, interpretability, activation-probes, ai-control, alignment-research, monitoring, anomaly-detection, anthropic, anthropic-directions-2025
Advanced · Multi-GPU
Measuring and improving chain-of-thought faithfulness in language models
Open
Language models increasingly use chain-of-thought (CoT) reasoning to explain their outputs, but these explanations may not faithfully represent the mo...
mechanistic-interpretability, faithfulness, chain-of-thought, reasoning-transparency, explainability, interpretability, evaluations, alignment, model-cognition, alignment-research, anthropic, anthropic-directions-2025
Advanced · Single GPU
