Scale influence functions to trace frontier model behaviors back to training data for targeted safety fixes
Develop computationally tractable methods to attribute specific model behaviors to training examples in frontier AI systems, enabling identification of which data contributes to safety-relevant behaviors. Current influence function approaches are too expensive for models with billions of parameters trained on trillion-token datasets. Solutions must scale to production systems while maintaining sufficient accuracy to guide safety interventions.
Training data attribution has roots in influence functions from robust statistics, adapted to neural networks by Koh & Liang (2017). Recent work includes Anthropic's investigation of influence functions for large language models (Grosse et al., 2023) and Cheng et al.'s scaling methods (2024). The field has produced approximation techniques including TracIn, representer point methods, and datamodels, but these approaches either fail to scale to frontier models or sacrifice too much accuracy. The AI safety motivation is clear: post-training alignment (RLHF, constitutional AI) addresses symptoms but may not fix root causes in the training data. Direct data attribution could enable more robust interventions by identifying and modifying problematic training examples. Related work includes dataset auditing, data valuation in federated learning, and mechanistic interpretability research on how models process information.
IN SCOPE: Algorithmic methods for efficient attribution computation (approximations, sampling, mathematical frameworks); AI-assisted analysis tools for processing attribution outputs; counterfactual simulation capabilities; validation methodologies; application to safety-relevant behaviors (bias, deception, harmful capabilities). OUT OF SCOPE: Full retraining experiments for every validation (too expensive); attribution for models smaller than 1B parameters (already tractable); purely theoretical results without implementation paths; dataset collection or curation itself; legal/copyright aspects of training data. CONSTRAINTS: Methods must be implementable with compute budgets 100-1000x smaller than full retraining; must work with proprietary models where full gradient access may be limited; should integrate with existing ML pipelines.
Strong ML background required: deep learning, optimization theory, experience training large models. Familiarity with influence functions, second-order optimization methods, and computational complexity analysis. Systems engineering skills for implementing efficient pipelines. Python, PyTorch/JAX proficiency essential.
Created: 2/5/2026
Last updated: 2/9/2026
START HERE: (1) Read foundational papers: Koh & Liang 2017 'Understanding Black-box Predictions via Influence Functions', Pruthi et al. 2020 'Estimating Training Data Influence by Tracing Gradient Descent', Grosse et al. (Anthropic) 2023 'Studying Large Language Model Generalization with Influence Functions'. (2) Study recent scaling work: Cheng et al. 2024, Park et al. 2023 'TRAK'. (3) Experiment with existing libraries: Captum (PyTorch), the pytorch-influence-functions repo. (4) Run simple experiments: compute influence scores for toy models (ResNet on CIFAR-10 or GPT-2 on a small corpus), measure how computational cost scales, and test approximation accuracy. (5) Identify bottlenecks: profile where computation time goes (Hessian-vector products, gradient computation, storage). (6) Prototype one improvement: try a specific approximation (low-rank factorization, random projection, or a sampling strategy) and measure the accuracy-vs-speed tradeoff. (7) Join the discussion: AI safety Slack channels, LessWrong posts on interpretability.
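Step (6) can be prototyped without any model at all. The sketch below (pure NumPy; all dimensions, the rank r, and the low-rank gradient structure are made-up assumptions) stands in for per-example gradients with synthetic low-rank vectors, then tests a Johnson-Lindenstrauss random projection of TracIn-style gradient dot products, the same idea behind TRAK-style sketching:

```python
import numpy as np

# Synthetic stand-ins for per-example gradients. Real gradients tend to
# lie near a low-dimensional subspace, which is what makes sketching
# viable; r, n, d, k below are all illustrative choices.
rng = np.random.default_rng(1)
n, d, k, r = 500, 10_000, 512, 20   # examples, params, sketch dim, rank
B = rng.normal(size=(r, d)) / np.sqrt(d)   # shared low-rank basis
G = rng.normal(size=(n, r)) @ B            # "per-example gradients"
g_test = rng.normal(size=r) @ B            # "test-example gradient"

# Exact TracIn-style scores: dot products with the test gradient, O(d)
# per training example.
exact = G @ g_test

# Random projection: P has i.i.d. N(0, 1/k) entries, so
# (P a) @ (P b) approximates a @ b. Scores now cost O(k) per example
# once gradients are sketched.
P = rng.normal(size=(k, d)) / np.sqrt(k)
approx = (G @ P.T) @ (P @ g_test)

# Accuracy side of the tradeoff: how well the ranking survives.
corr = np.corrcoef(exact, approx)[0, 1]
compression = d / k                         # ~20x less storage per example
```

Sketched gradients also shrink storage from O(n*d) to O(n*k), which is what makes precomputing scores over a large corpus plausible; the experiment to run is how corr degrades as k shrinks or as the effective rank of real gradients grows.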