Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao
TLDR
ASTOR is a multi-task RL framework for code LLMs that uses utility-driven data scheduling and policy optimization, letting a single model outperform task-specific specialists.
Key contributions
- Introduces ASTOR, a multi-task RL framework for code LLMs built on utility-driven coordination, where task utility captures each task's learning potential and cross-task synergy.
- Employs Hierarchical Utility-Routed Data Scheduling to allocate training budget across tasks and prioritize the most informative prompts (see the scheduling sketch after this list).
- Uses Adaptive Utility-Calibrated Policy Optimization to dynamically scale per-task KL regularization (a loss sketch follows the Original Abstract below).
- Outperforms the best task-specific specialist by 9.0%-9.5% and the strongest MTRL baseline by 7.5%-12.8% across four coding tasks.
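To make the scheduling idea concrete, here is a minimal Python sketch of utility-routed budget allocation. It assumes task utility can be proxied by the smoothed recent improvement in mean verifiable reward; the `UtilityScheduler` class, the EMA update, the prompt-scoring hook, and the task names are illustrative assumptions, not the paper's actual algorithm.

```python
from collections import defaultdict

class UtilityScheduler:
    """Hypothetical sketch of utility-routed data scheduling: utility is
    proxied by smoothed recent improvement in mean verifiable reward."""

    def __init__(self, tasks, ema=0.9):
        self.ema = ema
        self.reward_hist = defaultdict(list)    # task -> recent mean rewards
        self.utility = {t: 1.0 for t in tasks}  # start from uniform utility

    def update(self, task, mean_reward):
        """Record one step's mean reward and refresh that task's utility."""
        hist = self.reward_hist[task]
        hist.append(mean_reward)
        if len(hist) >= 2:
            delta = max(hist[-1] - hist[-2], 0.0)  # learning-potential proxy
            self.utility[task] = self.ema * self.utility[task] + (1 - self.ema) * delta

    def allocate_budget(self, total_prompts):
        """Split the per-step prompt budget across tasks in proportion to utility."""
        z = sum(self.utility.values()) or 1.0
        return {t: max(1, round(total_prompts * u / z)) for t, u in self.utility.items()}

    def select_prompts(self, pool, k, prompt_scores):
        """Within a task, keep the k most informative prompts by score."""
        return sorted(pool, key=lambda p: prompt_scores.get(p, 0.0), reverse=True)[:k]

# Usage (task names are placeholders, not the paper's benchmark tasks):
sched = UtilityScheduler(["generation", "repair", "translation", "test_gen"])
sched.update("repair", mean_reward=0.42)
budget = sched.allocate_budget(total_prompts=256)
```

The two-level structure mirrors the "hierarchical" framing: the budget split operates across tasks, while prompt selection operates within each task.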
Why it matters
Training a separate LLM specialist for each coding task incurs costs that scale with the number of tasks. Existing multi-task RL methods treat all tasks uniformly under fixed data curricula, which limits training effectiveness. ASTOR offers a unified alternative that dynamically schedules data and calibrates optimization per task, improving a single code LLM across all tasks.
Original Abstract
Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a Hierarchical Utility-Routed Data Scheduling module that hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data, and 2) an Adaptive Utility-Calibrated Policy Optimization module that dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.
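The adaptive KL calibration can likewise be sketched as per-task scaling of the KL penalty inside a PPO-style clipped objective. This is a hypothetical rendering: the function signature, the `1 / (1 + utility)` scaling rule, and the crude KL estimator are assumptions chosen for illustration, not ASTOR's published loss.

```python
import torch

def utility_calibrated_loss(logp_new, logp_old, logp_ref, advantages,
                            task_utility, base_kl_coef=0.05, clip_eps=0.2):
    """Hypothetical PPO-style clipped objective with a per-task KL penalty.

    The KL coefficient shrinks as task utility grows, so tasks with more
    left to learn are constrained less, while saturated tasks stay close
    to the frozen reference policy.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Crude per-token estimate of KL(new || reference).
    kl = (logp_new - logp_ref).mean()

    # Calibrate the constraint: higher utility -> smaller effective KL weight.
    kl_coef = base_kl_coef / (1.0 + task_utility)
    return policy_loss + kl_coef * kl
```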