ArXiv TLDR

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

arXiv: 2605.02195

Fazle Rabbi, Soumit Kanti Saha, Jinqiu Yang

cs.SE

TLDR

Many reported LLM code translation failures are false negatives caused by the evaluation setup rather than by logical errors, underscoring the need for better evaluation standards.

Key contributions

  • Reveals many LLM code translation "failures" are false negatives due to evaluation setup, not logical errors.
  • Identifies common evaluation-induced errors such as improper compilation flags and missing library links (see the sketch after this list).
  • Large-scale study across 5 languages, 3 benchmarks, and 3 LLMs (GPT-4o, DeepSeek, Magicoder).
  • Categorizes failures into pipeline-induced (general) and model-dependent behaviors.
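
To make the failure mode concrete, here is a minimal sketch of how an evaluation harness can produce a false negative. The file name, compiler invocation, and flags are assumptions chosen for illustration; they are not taken from the paper's actual pipeline.

```python
# Minimal sketch of an evaluation-induced false negative (assumed setup:
# a translated C file "translated.c" and gcc on PATH; flags are illustrative).
import subprocess

def compiles(source: str, out: str, extra_flags=()) -> bool:
    """Return True if gcc produces a binary for `source`."""
    cmd = ["gcc", source, "-o", out, *extra_flags]
    return subprocess.run(cmd, capture_output=True).returncode == 0

# Naive harness: a logically correct translation that calls math functions
# fails to link because -lm was never passed, and is scored as a model error.
naive_ok = compiles("translated.c", "a.out")

# Configuration-aware harness: the same source with proper flags links cleanly.
aware_ok = compiles("translated.c", "a.out", extra_flags=["-std=c99", "-lm"])

print(f"naive: {naive_ok}, configuration-aware: {aware_ok}")
```

Under the naive configuration the translation is marked as failed even though its logic is correct, which is exactly the kind of false negative the study attributes to the evaluation setup rather than the model.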

Why it matters

This paper matters because it exposes a fundamental flaw in how LLM code translation is currently evaluated. By separating true model errors from evaluation-induced false negatives, it enables a more accurate assessment of progress, leading to more reliable benchmarks and better-informed model development.

Original Abstract

Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are not due to incorrect logic, but rather evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.
