CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction
Jun Gao, Yun Peng, Qian Qiao, Changhai Zhou, Yuhua Zhou, et al.
TLDR
CoRE is a new benchmark that evaluates LLM code reasoning by testing consistency across functionally equivalent implementations and accuracy in predicting intermediate execution states.
Key contributions
- Introduces CoRE, a benchmark for evaluating LLM code reasoning beyond just output prediction.
- Evaluates implementation invariance (consistency across functionally equivalent code) and process transparency (correct reasoning about intermediate execution states); see the sketch after this list.
- Reveals a "robustness gap" where LLMs perform inconsistently across functionally equivalent implementations.
- Identifies "superficial execution," where models get correct outputs but fail on intermediate steps.
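To make the first axis concrete, here is a minimal sketch (our illustration, not material from the CoRE dataset; the function names and inputs are hypothetical). An implementation-invariant reasoner should predict the same output for both functions, since they differ only in surface form:

```python
# Illustrative sketch, not from the CoRE benchmark itself.
# Two functionally equivalent implementations of the same task;
# a robust code reasoner should treat them identically.

def sum_of_squares_loop(nums):
    """Imperative style: accumulate squares in a running total."""
    total = 0
    for n in nums:
        total += n * n
    return total

def sum_of_squares_functional(nums):
    """Functional style: identical semantics, different surface form."""
    return sum(n * n for n in nums)

if __name__ == "__main__":
    inputs = [1, 2, 3, 4]
    # Both return 30 (1 + 4 + 9 + 16). A model whose output predictions
    # diverge between the two forms exhibits the "robustness gap".
    assert sum_of_squares_loop(inputs) == sum_of_squares_functional(inputs) == 30
```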
Why it matters
This paper introduces CoRE, a benchmark that moves beyond output prediction to assess whether models genuinely reason about code execution. It exposes significant limitations in current LLMs: their predictions are not robust across equivalent implementations, and they often reach correct outputs without faithful intermediate reasoning. CoRE is a valuable tool for developing more faithful and reliable code-reasoning models.
Original Abstract
Despite strong performance on code generation tasks, it remains unclear whether large language models (LLMs) genuinely reason about code execution. Existing code reasoning benchmarks primarily evaluate final output correctness under a single canonical implementation, leaving two critical aspects underexplored: (1) whether LLMs can maintain consistency across functionally equivalent implementations, and (2) whether LLMs can accurately reason about intermediate execution states. We introduce **CoRE**, a **Co**de **Re**asoning benchmark that evaluates code reasoning through **implementation invariance** and **process transparency**. Extensive evaluations on eight frontier LLMs reveal two fundamental limitations. First, models exhibit a substantial **robustness gap**, with performance varying significantly across equivalent implementations. Second, we observe **superficial execution**, where models arrive at correct final outputs without correctly reasoning about intermediate execution states. Together, these findings demonstrate that output-only evaluations are insufficient for assessing code reasoning and position CoRE as a necessary benchmark for evaluating robust and faithful code reasoning. Data and code are available at https://github.com/ZJUSig/CoRE.
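To illustrate what reasoning about intermediate execution states looks like, here is another sketch of ours (hypothetical names, not an item from CoRE): a process-transparency question asks for the values a correct execution passes through, not just the final return value.

```python
# Illustrative sketch, not from the CoRE benchmark itself.
# Process-transparency questions probe intermediate states, e.g. the
# value of `best` after each loop iteration, not only the final output.

def running_max(nums):
    best = nums[0]
    states = []  # (index, element, best-so-far) recorded after each step
    for i, n in enumerate(nums):
        if n > best:
            best = n
        states.append((i, n, best))
    return best, states

final, trace = running_max([3, 1, 4, 1, 5])
print(final)  # 5
print(trace)  # [(0, 3, 3), (1, 1, 3), (2, 4, 4), (3, 1, 4), (4, 5, 5)]
# A model that predicts the final value 5 but misstates this trace shows
# "superficial execution": correct output, incorrect intermediate reasoning.
```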