ArXiv TLDR

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

2605.11922

Hao Wang, Rui Li, Lei Sha, Jie M. Zhang

cs.SE, cs.CL

TLDR

StepCodeReasoner uses RL to align code reasoning with stepwise execution traces, achieving SOTA performance by supervising intermediate states.

Key contributions

  • Introduces StepCodeReasoner, a framework for explicit intermediate execution-state supervision in code reasoning.
  • Automatically inserts print-based execution-trace anchors to predict runtime states at each step.
  • Proposes Bi-Level GRPO, an RL algorithm for structured credit assignment across and within execution paths.
  • Achieves SOTA with a 7B model on CRUXEval (91.1%) and LiveCodeBench (86.5%), outperforming GPT-4o (85.6% and 75.1%) and the CodeReasoner-7B baseline (86.0% and 77.7%).
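
To make the idea of print-based execution-trace anchors concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `[STEP k]` anchor format, the choice to anchor only simple assignments, and the helper names `insert_anchors` and `run_and_capture` are all assumptions made for illustration. It rewrites a program so that each assignment is followed by a print of the new value, then executes the rewritten program to collect the ground-truth runtime states that a model would be asked to predict.

```python
import ast
import contextlib
import io
import itertools


def _make_anchor(step, name):
    # Builds the statement: print("[STEP <step>] <name> =", repr(<name>))
    # The "[STEP k]" format is a hypothetical choice, not taken from the paper.
    return ast.Expr(value=ast.Call(
        func=ast.Name(id="print", ctx=ast.Load()),
        args=[
            ast.Constant(value=f"[STEP {step}] {name} ="),
            ast.Call(func=ast.Name(id="repr", ctx=ast.Load()),
                     args=[ast.Name(id=name, ctx=ast.Load())], keywords=[]),
        ],
        keywords=[],
    ))


def _instrument(body, steps):
    # Walk a list of statements in source order, appending an anchor after
    # every assignment to a plain variable name, and recursing into nested
    # blocks (function bodies, loops, if/else branches).
    out = []
    for stmt in body:
        out.append(stmt)
        target = None
        if isinstance(stmt, ast.Assign) and len(stmt.targets) == 1:
            target = stmt.targets[0]
        elif isinstance(stmt, ast.AugAssign):
            target = stmt.target
        if isinstance(target, ast.Name):
            out.append(_make_anchor(next(steps), target.id))
        for field in ("body", "orelse"):
            inner = getattr(stmt, field, None)
            if isinstance(inner, list):
                setattr(stmt, field, _instrument(inner, steps))
    return out


def insert_anchors(source):
    # Return the source code with print-based anchors inserted.
    tree = ast.parse(source)
    tree.body = _instrument(tree.body, itertools.count(1))
    return ast.unparse(tree)  # requires Python 3.9+


def run_and_capture(source):
    # Execute the anchored program and return the printed trace lines.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(compile(source, "<anchored>", "exec"), {})
    return buf.getvalue().splitlines()


EXAMPLE = """
def f(xs):
    total = 0
    for x in xs:
        total += x
    return total

result = f([1, 2, 3])
"""

anchored = insert_anchors(EXAMPLE)
trace = run_and_capture(anchored)
```

Here `trace` holds one line per executed anchor, so an anchor inside a loop contributes one entry per iteration. Each entry is a checkable target: a predicted runtime state can be compared against it by exact match.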

Why it matters

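Most code reasoning methods supervise only final outputs, which invites reward hacking: correct answers reached through inconsistent reasoning. StepCodeReasoner instead supervises intermediate execution states, so the model's reasoning is checked against actual execution at each step. The 7B model improves on the CodeReasoner-7B baseline on CRUXEval (91.1% vs. 86.0%), LiveCodeBench (86.5% vs. 77.7%), and the execution-trace benchmark REval (82.9% vs. 72.3%), and the approach also improves code generation.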

Original Abstract

Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
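
The abstract does not give the formulas behind Bi-Level GRPO, so the following Python sketch only illustrates the two levels of credit assignment it describes, under stated assumptions. The inter-trajectory part uses the standard group-relative normalization from GRPO (reward minus group mean, divided by group standard deviation). The intra-trajectory weighting, in which a correct step earns credit in proportion to how many of the steps from that point onward are also correct, is a hypothetical stand-in for "impact on downstream correctness" and is not taken from the paper. All function names are illustrative.

```python
from statistics import mean, pstdev


def step_matches(predicted, actual):
    # 1 where the predicted state equals the executed state, else 0.
    # Missing predictions count as wrong.
    return [
        1 if i < len(predicted) and predicted[i] == actual[i] else 0
        for i in range(len(actual))
    ]


def inter_trajectory_advantages(rewards, eps=1e-8):
    # Standard group-relative advantage: normalize each trajectory's reward
    # against the other trajectories sampled for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def intra_trajectory_credit(matches):
    # Hypothetical weighting (an assumption, not the paper's formula):
    # a correct step receives the fraction of steps from it onward
    # that are correct; an incorrect step receives zero.
    credits = []
    for i, m in enumerate(matches):
        tail = matches[i:]
        credits.append(m * sum(tail) / len(tail))
    return credits


# Ground-truth states from executing the anchored program (illustrative).
actual = ["0", "1", "3", "6"]

# Three sampled trajectories: fully correct, wrong intermediate state but
# correct final state, and correct intermediates but wrong final state.
group = [
    ["0", "1", "3", "6"],
    ["0", "1", "4", "6"],
    ["0", "1", "3", "7"],
]

matches = [step_matches(p, actual) for p in group]
rewards = [mean(m) for m in matches]
advantages = inter_trajectory_advantages(rewards)
credits = [intra_trajectory_credit(m) for m in matches]
```

In this toy group, the second trajectory reaches the correct final state through a wrong intermediate state, the pattern the abstract calls reward hacking. Because the reward here averages over all steps rather than checking only the final answer, that trajectory ends up with a negative advantage relative to the fully correct one.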
