Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, et al.
TLDR
TEPO improves LLM mathematical reasoning by linking group-level rewards to tokens and using a masked KL constraint, achieving SOTA performance and faster training.
Key contributions
- Introduces TEPO to address sparse token rewards in LLM Chain-of-Thought (CoT) reasoning.
- Links group-level rewards to individual tokens via sequence-level likelihood aggregation (a minimal sketch follows this list).
- Applies a token-level KL-Divergence mask constraint to stabilize policy updates.
- Achieves state-of-the-art performance on mathematical reasoning benchmarks and 50% faster convergence than GRPO/DAPO.
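Since the digest includes no code, here is a minimal sketch of what the first mechanism could look like. Everything below is an assumption: the function names (`group_relative_advantages`, `tepo_token_loss`), the tensor shapes, and the exact way the sequence-level likelihood enters the loss are illustrative, not the paper's formulation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style group baseline: normalize each sequence's reward
    # against the mean and std of its sampling group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def tepo_token_loss(token_logps: torch.Tensor,
                    response_mask: torch.Tensor,
                    rewards: torch.Tensor) -> torch.Tensor:
    # token_logps:   (G, T) log-prob of each generated token for G samples
    # response_mask: (G, T) 1.0 on response tokens, 0.0 on padding
    # rewards:       (G,)   one scalar reward per sampled sequence
    adv = group_relative_advantages(rewards)               # (G,)
    # Sequence-level log-likelihood, aggregated from token log-probs.
    seq_logp = (token_logps * response_mask).sum(dim=-1)   # (G,)
    # Every token's log-prob enters seq_logp, so the group-level
    # advantage flows back through this sum to individual tokens.
    return -(adv * seq_logp).mean()

# Toy usage: a group of G=4 sampled answers, T=6 tokens each.
token_logps = -torch.rand(4, 6)                  # log-probs are <= 0
response_mask = torch.ones(4, 6)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])     # e.g. answer correctness
loss = tepo_token_loss(token_logps, response_mask, rewards)
```

In this reading, the group-level reward never touches a token directly; it reaches each token only through the sequence-level log-likelihood that the tokens jointly compose, which is one plausible interpretation of "token-level aggregation".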
Why it matters
This paper introduces TEPO, which addresses the sparse token-level reward problem in LLM chain-of-thought reasoning. It significantly boosts mathematical reasoning performance and training stability, making reinforcement fine-tuning of LLMs more robust and efficient on complex reasoning tasks.
Original Abstract
Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, which is an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
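To make the second mechanism in the abstract concrete, here is a hedged sketch of a token-level KL mask in the same spirit: the KL penalty is applied only on tokens that have positive advantage and falling entropy, which the abstract identifies as the tokens at risk of abrupt policy updates. The log-ratio (k1) KL estimator, the `beta` coefficient, and all names are assumptions, not the paper's definitions.

```python
import torch

def masked_kl_penalty(logp_new: torch.Tensor,
                      logp_ref: torch.Tensor,
                      entropy_new: torch.Tensor,
                      entropy_old: torch.Tensor,
                      token_adv: torch.Tensor,
                      response_mask: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    # All tensors are (G, T): per-token quantities over G sequences.
    # Simple per-token log-ratio (k1) estimate of the KL to the
    # reference policy; the paper may use a different estimator.
    kl = logp_new - logp_ref
    # Constrain only tokens with positive advantage whose entropy is
    # decreasing: per the abstract, these are the tokens where abrupt
    # updates and entropy collapse are most likely.
    risky = (token_adv > 0) & (entropy_new < entropy_old)
    kl_mask = (risky & response_mask.bool()).float()
    # Average the masked penalty over all response tokens.
    return beta * (kl * kl_mask).sum() / response_mask.sum().clamp(min=1.0)
```

The masked penalty would be added to a policy loss like the earlier sketch; tokens outside the mask are left unconstrained, which is one way to avoid the entropy collapse that undifferentiated regularization causes.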