ArXiv TLDR

Learning, Fast and Slow: Towards LLMs That Adapt Continually

arXiv:2605.12484

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia + 4 more

cs.LG, cs.AI

TLDR

Fast-Slow Training lets LLMs adapt continually with better sample efficiency and less catastrophic forgetting by combining fast context updates with slow parameter updates.

Key contributions

  • Introduces Fast-Slow Training (FST), which combines optimized context ("fast" weights) with model parameters ("slow" weights).
  • Achieves up to 3x higher sample efficiency and a higher final performance asymptote than parameter-only RL training.
  • Reduces catastrophic forgetting by keeping trained models closer to the base model (up to 70% less KL divergence).
  • Preserves plasticity in continual learning, continuing to acquire new tasks where parameter-only RL stalls.

Why it matters

This paper bridges in-context and in-weights learning, enabling LLMs to adapt quickly while retaining knowledge. It advances continual learning by addressing catastrophic forgetting and loss of plasticity as task domains evolve.

Original Abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
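The two-timescale idea in the abstract can be sketched as a toy training loop. Everything below is a hypothetical stand-in, not the paper's implementation: the "model" is a single number rather than an LLM, the fast weights stand in for an optimized context updated every step, and the slow weights stand in for RL-trained parameters updated rarely and regularized toward the base model (a crude proxy for staying low-KL from the base LLM).

```python
BASE_PARAM = 0.0   # "slow" weights start at the base model
TARGET = 1.0       # toy task optimum

def reward(slow, fast):
    # Higher (closer to 0) when slow weights + fast context jointly hit the target.
    return -abs((slow + fast) - TARGET)

def fast_update(slow, fast, lr=0.5):
    # Fast timescale: large, cheap steps that absorb task-specific
    # information (the paper's optimized context / textual feedback).
    return fast + lr * (TARGET - (slow + fast))

def slow_update(slow, fast, lr=0.05, anchor=0.5):
    # Slow timescale: small steps, pulled back toward the base model
    # so general behaviors persist (a stand-in for a KL penalty).
    grad = (TARGET - (slow + fast)) - anchor * (slow - BASE_PARAM)
    return slow + lr * grad

slow, fast = BASE_PARAM, 0.0
for step in range(50):
    fast = fast_update(slow, fast)   # every step: fast context update
    if step % 5 == 0:                # occasionally: slow parameter update
        slow = slow_update(slow, fast)

print(round(reward(slow, fast), 3))       # near 0: task solved
print(round(abs(slow - BASE_PARAM), 3))   # slow weights stay near the base
```

The design point the toy captures: the fast variable does most of the task adaptation, so the slow variable barely drifts from its base value, which is the mechanism the paper credits for reduced forgetting and preserved plasticity.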
