StaRPO: Stability-Augmented Reinforcement Policy Optimization
Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Ruimin Dai, Xiaoyan Han, et al.
TLDR
StaRPO is an RL framework that improves LLM reasoning by adding stability metrics (ACF and PE) to the reward signal, enhancing logical consistency and final-answer accuracy.
Key contributions
- Introduces StaRPO, an RL framework that enhances LLM reasoning by integrating reasoning "stability" into the optimization objective.
- Decomposes stability into the Autocorrelation Function (ACF) for local step-to-step coherence and Path Efficiency (PE) for global goal-directedness (see the sketch after this list).
- Combines ACF/PE stability rewards with task rewards for process-aware feedback.
- Achieves superior final-answer accuracy and logical stability on four reasoning benchmarks.
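As a concrete illustration of the two metrics, here is a minimal sketch of how ACF and PE could be computed over a trajectory of per-step embeddings. The embedding representation, the cosine-similarity ACF proxy, and the distance-ratio definition of PE are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def acf_coherence(step_embs: np.ndarray, lag: int = 1) -> float:
    """Local coherence proxy: mean lag-k cosine similarity between
    step embeddings (rows of step_embs). Illustrative, not the
    paper's exact ACF definition."""
    # Normalize each step embedding to unit length (epsilon avoids /0).
    norms = np.linalg.norm(step_embs, axis=1, keepdims=True) + 1e-12
    normed = step_embs / norms
    # Cosine similarity between steps t and t+lag, averaged over t.
    sims = (normed[:-lag] * normed[lag:]).sum(axis=1)
    return float(sims.mean())

def path_efficiency(step_embs: np.ndarray) -> float:
    """Global goal-directedness proxy: straight-line distance from the
    first to the last step divided by the total path length
    (1.0 = perfectly direct trajectory)."""
    direct = np.linalg.norm(step_embs[-1] - step_embs[0])
    path = np.linalg.norm(np.diff(step_embs, axis=0), axis=1).sum()
    return float(direct / path) if path > 0 else 1.0

# Toy trajectory of 4 step embeddings (e.g. mean-pooled hidden
# states per reasoning step).
steps = np.array([[1.0, 0.0], [0.9, 0.3], [0.7, 0.6], [0.5, 0.9]])
print(acf_coherence(steps), path_efficiency(steps))
```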
Why it matters
Current RL training for LLMs often produces fluent but logically flawed reasoning. StaRPO addresses this by explicitly optimizing for reasoning stability, not just final-answer correctness, using two lightweight metrics (ACF and PE). The result is more reliable and coherent LLM reasoning on complex tasks.
Original Abstract
Reinforcement learning (RL) is effective in enhancing the accuracy of large language models on complex reasoning tasks. However, existing RL policy optimization frameworks rely on final-answer correctness as the feedback signal and rarely capture the internal logical structure of the reasoning process. Consequently, models can generate responses that are fluent and semantically relevant yet logically inconsistent, structurally erratic, or redundant. To address this, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. StaRPO decomposes stability into two lightweight, computable metrics: the Autocorrelation Function (ACF), which evaluates local step-to-step coherence, and Path Efficiency (PE), which evaluates the global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary, process-aware feedback. We validate the ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms the compared baselines, improving both final-answer accuracy and logical stability.
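The abstract states that the stability rewards are "combined with task rewards" but this summary does not give the combination rule. The sketch below shows one plausible additive shaping; the weights w_acf and w_pe, and the additive form itself, are illustrative assumptions rather than the paper's published objective.

```python
def shaped_reward(task_reward: float, acf: float, pe: float,
                  w_acf: float = 0.1, w_pe: float = 0.1) -> float:
    """Combine final-answer correctness with process-aware stability
    signals. The additive form and weights are assumptions for
    illustration, not StaRPO's published objective."""
    return task_reward + w_acf * acf + w_pe * pe

# Example: correct answer (task reward 1.0) with high local coherence
# and a fairly direct reasoning path.
print(shaped_reward(task_reward=1.0, acf=0.85, pe=0.7))  # 1.155
```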