When Can LLMs Learn to Reason with Weak Supervision?
Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, et al.
TLDR
LLMs generalize under weak supervision when training reward saturates slowly and their reasoning is faithful; SFT on explicit reasoning traces is necessary for this generalization.
Key contributions
- Generalization under weak supervision is governed by training reward saturation dynamics (see the sketch after this list).
- Reasoning faithfulness predicts generalization, while output diversity is uninformative.
- Supervised fine-tuning (SFT) on explicit reasoning traces is necessary for generalization.
- Continual pre-training on domain data amplifies SFT's effect, enabling generalization in Llama3.2-3B-Base.
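The first contribution hinges on how quickly the training reward curve saturates: generalizing models show a prolonged pre-saturation phase, while memorizing models saturate almost immediately. As a rough illustration of what tracking these dynamics could look like, here is a minimal Python sketch that estimates a saturation step from a logged reward curve and flags rapid saturation. The 95% plateau threshold, smoothing window, and 20% early-training cutoff are assumptions for illustration, not values from the paper.

```python
import numpy as np

def saturation_step(rewards, plateau_frac=0.95, window=10):
    """Estimate the training step at which the reward curve saturates.

    rewards:      per-step (or per-epoch) mean training reward.
    plateau_frac: fraction of the late-training plateau treated as
                  "saturated" (0.95 is an illustrative choice).
    window:       moving-average window to smooth step-to-step noise.
    """
    r = np.asarray(rewards, dtype=float)
    # Smooth so a single noisy step doesn't count as saturation.
    kernel = np.ones(window) / window
    smoothed = np.convolve(r, kernel, mode="valid")
    plateau = smoothed[-window:].mean()          # late-training reward level
    threshold = plateau_frac * plateau
    above = np.nonzero(smoothed >= threshold)[0]
    return int(above[0]) if above.size else len(smoothed)

def saturates_rapidly(rewards, total_steps, early_frac=0.2):
    """Crude proxy for the memorization regime: flag runs whose reward
    reaches the plateau within the first `early_frac` of training
    (the 20% cutoff is an assumption, not from the paper)."""
    return saturation_step(rewards) < early_frac * total_steps
```

In this framing, a run that generalizes spends a long stretch of training below the plateau threshold while downstream performance climbs alongside the training reward, whereas a run that memorizes crosses it almost immediately.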
Why it matters
As high-quality reward signals become harder to construct, understanding when LLMs can learn to reason from weak supervision becomes critical. This work identifies the training dynamics and pre-RL properties that govern generalization in that regime, and offers actionable strategies, such as SFT on explicit reasoning traces and continual pre-training on domain data, for improving reasoning under practical constraints.
Original Abstract
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
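The abstract identifies reasoning faithfulness, the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which saturation regime a model falls into. The snippet below is only a hypothetical way such a score could be operationalized; `judge_supports` is a placeholder for whatever judging procedure is available (e.g., an LLM-as-judge wrapper), and none of this is the paper's actual metric.

```python
from typing import Callable, Sequence

def faithfulness_score(
    steps: Sequence[str],
    final_answer: str,
    judge_supports: Callable[[str, str], bool],
) -> float:
    """Fraction of intermediate reasoning steps that a judge deems
    supportive of the final answer.

    steps:          intermediate reasoning steps extracted from a trace.
    final_answer:   the model's final answer for the same problem.
    judge_supports: placeholder callable returning True if a step
                    logically supports the final answer (hypothetical).
    """
    if not steps:
        return 0.0
    supported = sum(judge_supports(step, final_answer) for step in steps)
    return supported / len(steps)
```

Averaged over a set of pre-RL traces, a score like this would give one number per model that could then be compared against whether the model later generalizes under weak supervision.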