ArXiv TLDR

Where does output diversity collapse in post-training?

arXiv:2604.16027

Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

cs.CL, cs.AI, cs.LG

TLDR

Post-training reduces language model output diversity; this collapse is driven primarily by training data composition, is embedded in the model weights during training, and cannot be addressed at inference time alone.

Key contributions

  • The stage at which output diversity collapses co-varies with training data composition across three parallel Olmo 3 post-training lineages (Think, Instruct, RL-Zero).
  • Think models lose most of their semantic diversity at the SFT stage, while DPO's impact is larger in Instruct than in Think models.
  • The collapse is embedded in the model weights by training data, not imposed by the generation format: suppressing chain-of-thought at inference hurts accuracy on hard tasks but leaves answer-level diversity unchanged.
  • Diversity loss decomposes into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs); the split is task-dependent, as sketched below.
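
A minimal sketch of one plausible way to operationalize that decomposition, assuming many sampled outputs per prompt on a verifiable task with correctness labels. This is an illustration, not the paper's exact procedure: distinct-2 stands in for the paper's diversity metrics (not named in this summary), and all function and variable names are hypothetical.

```python
# Illustrative decomposition of diversity loss into quality-control and
# residual components. One plausible reading of the split described above,
# NOT the paper's exact procedure.

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Lexical diversity: unique n-grams divided by total n-grams."""
    grams = []
    for t in texts:
        toks = t.split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

def decompose_diversity_loss(base_outs, base_ok, post_outs, post_ok):
    """Split the base -> post-trained diversity drop on a verifiable task.

    base_outs / post_outs: sampled outputs from each model.
    base_ok / post_ok: booleans marking which outputs are correct.
    Returns (quality_control, residual):
      quality_control - diversity the base model loses just by discarding
                        its own incorrect outputs
      residual        - further narrowing among correct outputs only,
                        i.e. loss of variety that correctness cannot explain
    """
    d_base = distinct_n(base_outs)
    d_base_correct = distinct_n([o for o, ok in zip(base_outs, base_ok) if ok])
    d_post_correct = distinct_n([o for o, ok in zip(post_outs, post_ok) if ok])
    return d_base - d_base_correct, d_base_correct - d_post_correct
```

Under this reading, the paper's finding that Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate corresponds to a smaller residual term alongside a larger overall drop.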

Why it matters

This paper shows that output diversity collapse in post-trained LMs is fundamentally tied to training data composition rather than to any single post-training method or inference strategy. That matters for model reliability and creativity: preserving diversity is a training-data design problem, not something post-hoc inference fixes can solve.

Original Abstract

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
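
The abstract refers to four text diversity metrics without naming them in this summary. As a rough illustration of the kind of semantic-diversity measurement involved, one common metric is the mean pairwise cosine distance between embeddings of sampled outputs. The sketch below is a generic version of that idea, not necessarily one of the paper's metrics, and assumes the embeddings have already been produced by some sentence encoder.

```python
# Illustrative semantic diversity score: mean pairwise cosine distance
# between output embeddings. `embeddings` is assumed to be an
# (n_samples, dim) array from any sentence encoder.
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance; 0 = identical, higher = more diverse."""
    # Normalize rows so dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T                  # (n, n) cosine similarities
    iu = np.triu_indices(len(embeddings), k=1)  # unique pairs, no diagonal
    return float(np.mean(1.0 - sims[iu]))

# Toy example: near-parallel vectors yield a low diversity score.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(round(semantic_diversity(emb), 3))
```

A base-vs-post-trained comparison would compute this score over matched samples from each model on the same prompts; the collapse the paper describes shows up as a lower score for the post-trained model.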
