Make machine unlearning robust against few-shot attacks that efficiently restore supposedly forgotten knowledge.
Current machine unlearning methods remove knowledge in ways that are vulnerable to efficient re-learning through minimal fine-tuning. The challenge is to develop unlearning techniques and architectural modifications that make it computationally or structurally difficult to recover forgotten knowledge, while maintaining model utility on retained tasks. This requires both new algorithms and rigorous evaluation methods that distinguish true forgetting from surface-level suppression.
Machine unlearning has emerged as a critical AI safety capability, with applications in privacy (GDPR right-to-be-forgotten), safety (removing hazardous knowledge), and fairness (eliminating the influence of biased data). Existing approaches fall into two camps: (1) exact methods, such as retraining from scratch or SISA, which are computationally expensive, and (2) approximate methods, such as gradient ascent on forget sets, influence-function perturbations, and task-vector negation, which are efficient but potentially superficial. Recent work by Jia et al. (2023) demonstrated that forgotten information can be restored from unlearned models with fewer than 100 fine-tuning examples. The WMDP benchmark (Li et al., 2024) provides standardized evaluation for unlearning hazardous knowledge. Key insight: standard accuracy metrics on forget sets don't capture latent knowledge that enables efficient re-learning. This problem connects to broader questions about knowledge representation, neural network interpretability, and adversarial robustness.
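The two approximate methods named above can be sketched in a few lines of PyTorch. This is an illustrative toy, not a reference implementation: the linear model, forget-set tensors, learning rate, and `alpha` scaling are all stand-in assumptions; a real setup would use an LLM and proper data loaders.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in practice `model` would be an LLM and the forget set
# would come from a data loader. All names and sizes here are hypothetical.
torch.manual_seed(0)
model = nn.Linear(4, 2)
forget_x = torch.randn(8, 4)
forget_y = torch.randint(0, 2, (8,))
criterion = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

base = {k: v.clone() for k, v in model.state_dict().items()}

# (1) Gradient ascent on the forget set: step *up* the loss on forget examples.
for _ in range(5):
    opt.zero_grad()
    (-criterion(model(forget_x), forget_y)).backward()  # negated loss => ascent
    opt.step()

# (2) Task-vector negation: treat (tuned - base) as a "task vector" for the
# forget domain and subtract a scaled copy of it from the base weights.
tuned = model.state_dict()
alpha = 1.0
negated = {k: base[k] - alpha * (tuned[k] - base[k]) for k in base}
```

Both tricks modify weights directly, which is exactly why they may be superficial: the updates are small and local, so a few gradient steps in the opposite direction can often undo them.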
In scope: (1) Developing new unlearning algorithms robust to few-shot fine-tuning attacks (2) Proposing architectural constraints or modifications that prevent efficient knowledge recovery (3) Creating evaluation protocols that measure re-learning resistance (4) Theoretical analysis of fundamental limits and trade-offs in robust unlearning (5) Experiments on language models (LLMs preferred) and vision models. Out of scope: (1) Privacy-preserving training methods that prevent learning in the first place (2) Watermarking or detection methods that don't remove knowledge (3) Simple prompt engineering or output filtering (4) Unlearning in non-neural-network models. Constraints: Solutions should maintain >90% performance on retained tasks, be more efficient than full retraining, and demonstrate robustness across multiple attack scenarios.
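One way to operationalize the "re-learning resistance" the constraints call for is to summarize an attacker's few-shot re-learning curve into a single score. The helper below is a hypothetical sketch of such a metric, not an established one; the function name, inputs, and example numbers are all assumptions.

```python
def relearning_resistance(acc_by_shots, acc_ceiling):
    """Summarize a few-shot re-learning curve into one score in [0, 1].

    acc_by_shots: forget-set accuracy after attacking with k examples, per k.
    acc_ceiling: forget-set accuracy of the original (pre-unlearning) model.
    1.0 = no recovery at any budget; 0.0 = full recovery at every budget.
    """
    recovered = [min(acc / acc_ceiling, 1.0) for acc in acc_by_shots.values()]
    return 1.0 - sum(recovered) / len(recovered)

# Illustrative numbers: the attacker recovers little with 5-100 shots.
score = relearning_resistance({5: 0.05, 10: 0.10, 50: 0.20, 100: 0.30},
                              acc_ceiling=0.90)
```

Averaging over several attack budgets, rather than reporting a single shot count, guards against methods that resist only one budget.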
Strong background in deep learning, experience with PyTorch/JAX, and familiarity with fine-tuning and training dynamics. Understanding of optimization theory, gradient-based methods, and model evaluation is helpful. Prior work with language models or machine unlearning is beneficial but not required.
Created: 2/5/2026
Last updated: 2/9/2026
Start by reading: (1) 'Machine Unlearning' by Bourtoule et al. (2021) for foundational concepts, (2) 'Knowledge Unlearning for LLMs' by Jia et al. (2023) for re-learning vulnerability demonstrations, (3) 'The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning' by Li et al. (2024) for evaluation frameworks. Implement baseline experiments: fine-tune a small language model (GPT-2 or Llama-2-7B) on a specific knowledge domain, apply standard unlearning methods (gradient ascent, task-vector negation), then attempt re-learning with varying shot counts (5, 10, 50, 100 examples). Measure both forget-set accuracy and re-learning sample efficiency. Explore: (1) different unlearning loss functions (2) regularization techniques that prevent efficient gradient updates (3) architectural interventions such as freezing specific layers or adding bottleneck modules (4) adversarial training approaches in which re-learning is part of the unlearning objective. Available resources: WMDP benchmark codebase, open-source unlearning libraries, pretrained models from Hugging Face.
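The few-shot re-learning attack at the heart of the baseline experiments can be sketched as below. A toy linear model stands in for the unlearned LLM, and every name, shot count, and hyperparameter here is illustrative; swap in a real unlearned checkpoint and forget-set data for actual experiments.

```python
import copy

import torch
import torch.nn as nn

def forget_accuracy(model, x, y):
    # Fraction of forget-set examples the model classifies correctly.
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def relearn_attack(unlearned, forget_x, forget_y, shots, steps=20, lr=1e-2):
    """Fine-tune a copy of the (supposedly) unlearned model on `shots`
    forget-set examples, then measure accuracy on the full forget set."""
    model = copy.deepcopy(unlearned)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    crit = nn.CrossEntropyLoss()
    xs, ys = forget_x[:shots], forget_y[:shots]
    for _ in range(steps):
        opt.zero_grad()
        crit(model(xs), ys).backward()
        opt.step()
    return forget_accuracy(model, forget_x, forget_y)

# Toy stand-in for an unlearned model; in practice load GPT-2 / Llama-2-7B.
torch.manual_seed(0)
unlearned = nn.Linear(8, 2)
fx, fy = torch.randn(100, 8), torch.randint(0, 2, (100,))

# Re-learning curve across the shot counts suggested above.
curve = {k: relearn_attack(unlearned, fx, fy, k) for k in (5, 10, 50, 100)}
```

Attacking a deep copy at each budget keeps the shot counts independent, so the resulting curve measures sample efficiency rather than cumulative fine-tuning.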