MUCOCO: Automated Consistency Testing of Code LLMs

2604.19086

Chua Jin Chou, Khant That Lwin, Ezekiel Soremekun

cs.SE

TLDR

MUCOCO is an automated consistency testing method that uses semantic-preserving mutation to expose inconsistent behaviors in Code LLMs. It outperforms the closest baseline (TURBULENCE).

Key contributions

  • Proposes MUCOCO, an automated method for consistency testing of Code LLMs.
  • Employs semantic-preserving mutation analysis to generate equivalent program variants.
  • Detects inconsistencies by comparing outputs/test results between original and mutated programs.
  • Evaluated on 4 coding tasks and 7 LLMs; about 15% (roughly one in seven) of the inputs MUCOCO generated exposed inconsistencies.
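
The mutate-then-compare idea in the list above can be illustrated with a small sketch. This is not MUCOCO's implementation: the variable-renaming mutation, the helper names (`mutate`, `run`, `is_inconsistent`), and the two stand-in "models" are all assumptions made for illustration, and a real setup would query a Code LLM instead of a stub.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename one identifier everywhere; behavior of the program is unchanged."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        if node.arg == self.old:
            node.arg = self.new
        return node


def mutate(source, old, new):
    """Return a semantically equivalent variant (a 'mutant') of the source."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def run(source, func_name, args):
    """Execute the program and call one of its functions (ground truth)."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](args)


def is_inconsistent(model, original, mutant, query):
    """Flag an inconsistency when the model answers differently for equivalent code."""
    return model(original, query) != model(mutant, query)


# Hypothetical stand-ins for a Code LLM asked to predict a program's output.
def robust_model(source, query):
    return run(source, "total", query)


def brittle_model(source, query):
    # Deliberately flawed: only answers correctly when a familiar name appears.
    return run(source, "total", query) if "acc" in source else None


ORIGINAL = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
MUTANT = mutate(ORIGINAL, "acc", "v0")

if __name__ == "__main__":
    query = [1, 2, 3]
    print("robust model inconsistent: ", is_inconsistent(robust_model, ORIGINAL, MUTANT, query))
    print("brittle model inconsistent:", is_inconsistent(brittle_model, ORIGINAL, MUTANT, query))
```

Because the original and the mutant compute the same result, any difference in a model's answers can be attributed to the model rather than to the programs. Renaming is only one possible semantic-preserving transformation, chosen here because it is easy to verify.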

Why it matters

Code LLMs often behave inconsistently, yet most benchmarks used to assess them are hand-crafted, static, and do not target consistency. MUCOCO offers an automated way to discover such inconsistencies systematically, and its results motivate making consistency testing part of how Code LLMs are evaluated.

Original Abstract

Code LLMs often portray inconsistent program behaviors. Developers typically employ benchmarks to assess Code LLMs, but most benchmarks are hand-crafted, static and do not target consistency property. In this work, we pose the scientific question: how can we automatically discover inconsistent program behaviors in Code LLMs? To address this challenge, we propose an automated consistency testing method, called MUCOCO, which employs semantic-preserving mutation analysis to expose inconsistent behaviors in code LLMs. Given a coding query, MUCOCO automatically transforms its program into semantically equivalent programs (aka mutants) and detects inconsistencies between the mutants and the original program (e.g., different output or test failure). We evaluate MUCOCO using four (4) coding tasks and seven (7) LLMs. Results show that MUCOCO is effective in exposing inconsistency and outperforms the closest baseline (TURBULENCE). About one in seven (15%) inputs generated by MUCOCO exposed inconsistencies. Our work motivates the need to test Code LLMs for consistency property.
