Scale influence functions to trace frontier model behaviors back to training data for targeted safety fixes
Develop computationally tractable methods to attribute specific model behaviors to training examples in frontier AI systems, enabling identification of which data contributes to safety-relevant behaviors. Current influence function approaches are too expensive for models with billions of parameters trained on trillion-token datasets. Solutions must scale to production systems while maintaining sufficient accuracy to guide safety interventions.
Training data attribution has roots in influence functions from robust statistics, adapted to neural networks by Koh & Liang (2017). Recent work includes Anthropic's investigation of influence functions for large language models (Grosse et al., 2023) and Cheng et al.'s scaling methods (2024). The field has produced approximation techniques including TracIn, representer point methods, and datamodels, but these approaches either fail to scale to frontier models or sacrifice too much accuracy. The AI safety motivation is clear: post-training alignment (RLHF, constitutional AI) addresses symptoms but may not fix root causes in the training data. Direct data attribution could enable more robust interventions by identifying and modifying problematic training examples. Related work includes dataset auditing, data valuation in federated learning, and mechanistic interpretability research on how models process information.
IN SCOPE: Algorithmic methods for efficient attribution computation (approximations, sampling, mathematical frameworks); AI-assisted analysis tools for processing attribution outputs; counterfactual simulation capabilities; validation methodologies; application to safety-relevant behaviors (bias, deception, harmful capabilities). OUT OF SCOPE: Full retraining experiments for every validation (too expensive); attribution for models smaller than 1B parameters (already tractable); purely theoretical results without implementation paths; dataset collection or curation itself; legal/copyright aspects of training data. CONSTRAINTS: Methods must be implementable with compute budgets 100-1000x smaller than full retraining; must work with proprietary models where full gradient access may be limited; should integrate with existing ML pipelines.
Strong ML background required: deep learning, optimization theory, experience training large models. Familiarity with influence functions, second-order optimization methods, and computational complexity analysis. Systems engineering skills for implementing efficient pipelines. Python, PyTorch/JAX proficiency essential.
Created: 2/5/2026
Last updated: 2/9/2026
START HERE: (1) Read foundational papers: Koh & Liang 2017 'Understanding Black-box Predictions via Influence Functions', Pruthi et al. 2020 'Estimating Training Data Influence by Tracing Gradient Descent', Grosse et al. (Anthropic) 2023 'Studying Large Language Model Generalization with Influence Functions'. (2) Study recent scaling work: Cheng et al. 2024, Park et al. 2023 'TRAK'. (3) Experiment with existing libraries: Captum (PyTorch), the pytorch-influence-functions repo. (4) Run simple experiments: compute influence scores for toy models (ResNet on CIFAR-10 or GPT-2 on a small corpus), measure how computational cost scales, and test approximation accuracy. (5) Identify bottlenecks: profile where computation time goes (Hessian-vector products, gradient computation, storage). (6) Prototype one improvement: try a specific approximation (low-rank factorization, random projection, or a sampling strategy) and measure the accuracy-vs-speed tradeoff. (7) Join the discussion: AI safety Slack channels, LessWrong posts on interpretability.
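Step (6) can be prototyped without any model at all. The sketch below (pure NumPy; all dimensions, the rank r, and the low-rank gradient structure are made-up assumptions) stands in for per-example gradients with synthetic low-rank vectors, then tests a Johnson-Lindenstrauss random projection of TracIn-style gradient dot products, the same idea behind TRAK-style sketching:

```python
import numpy as np

# Synthetic stand-ins for per-example gradients. Real gradients tend to
# lie near a low-dimensional subspace, which is what makes sketching
# viable; r, n, d, k below are all illustrative choices.
rng = np.random.default_rng(1)
n, d, k, r = 500, 10_000, 512, 20   # examples, params, sketch dim, rank
B = rng.normal(size=(r, d)) / np.sqrt(d)   # shared low-rank basis
G = rng.normal(size=(n, r)) @ B            # "per-example gradients"
g_test = rng.normal(size=r) @ B            # "test-example gradient"

# Exact TracIn-style scores: dot products with the test gradient, O(d)
# per training example.
exact = G @ g_test

# Random projection: P has i.i.d. N(0, 1/k) entries, so
# (P a) @ (P b) approximates a @ b. Scores now cost O(k) per example
# once gradients are sketched.
P = rng.normal(size=(k, d)) / np.sqrt(k)
approx = (G @ P.T) @ (P @ g_test)

# Accuracy side of the tradeoff: how well the ranking survives.
corr = np.corrcoef(exact, approx)[0, 1]
compression = d / k                         # ~20x less storage per example
```

Sketched gradients also shrink storage from O(n*d) to O(n*k), which is what makes precomputing scores over a large corpus plausible; the experiment to run is how corr degrades as k shrinks or as the effective rank of real gradients grows.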