LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB
Vekil Bekmyradov, Noah C. Pütz, Thomas Bartz-Beielstein
TLDR
LLMs excel at test generation for familiar open-source code but struggle with unseen proprietary systems, prioritizing compilability over semantic correctness.
Key contributions
- Compared LLM test generation on open-source (LevelDB) vs. proprietary (SAP HANA) systems.
- Showed LLMs excel on familiar code but struggle with unseen, complex domains.
- Found LLMs prioritize compilability over semantic effectiveness (mutation score; see the formula after this list) in generated tests.
- Emphasized the need for evaluation frameworks that reward true generalization rather than shortcuts.
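Semantic effectiveness here is quantified by mutation score: seeded faults (mutants) are injected into the code under test, and a suite scores well only if its assertions fail on (kill) those mutants. As a minimal sketch, using the standard definition (the paper does not spell out its exact variant):

```latex
\text{mutation score} = \frac{\#\,\text{killed mutants}}{\#\,\text{generated mutants} - \#\,\text{equivalent mutants}}
```

A generated suite can compile and pass cleanly yet kill few mutants, which is precisely the compilability-over-effectiveness gap described above.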
Why it matters
This paper provides independent software-engineering evidence that current LLMs lack robust reasoning, especially when faced with unseen, complex codebases. It underscores the need for evaluation frameworks that penalize trivial shortcuts and reward genuine generalization, a prerequisite for more reliable AI in software testing.
Original Abstract
Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.
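The abstract's iterative compiler-feedback repair loop is easiest to see in code. The sketch below shows the general technique only, not the authors' pipeline; `query_llm`, `try_compile`, and `MAX_REPAIR_ROUNDS` are hypothetical names, and the compiler invocation assumes a local g++ toolchain.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_REPAIR_ROUNDS = 5  # hypothetical iteration budget

def try_compile(source: str) -> tuple[bool, str]:
    """Syntax-check a candidate C++ test file; return (success, compiler diagnostics)."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "generated_test.cc"
        test_file.write_text(source)
        result = subprocess.run(
            ["g++", "-std=c++17", "-fsyntax-only", str(test_file)],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0, result.stderr

def repair_loop(query_llm, unit_under_test: str) -> str | None:
    """Iteratively regenerate a test until it compiles or the budget runs out.
    `query_llm` is a hypothetical callable mapping a prompt to test source."""
    prompt = f"Write a C++ unit test for:\n{unit_under_test}"
    candidate = query_llm(prompt)
    for _ in range(MAX_REPAIR_ROUNDS):
        ok, diagnostics = try_compile(candidate)
        if ok:
            return candidate  # compilable -- but not necessarily semantically strong
        # Feed the compiler errors back to the model and ask for a repair.
        prompt = (
            f"This test fails to compile:\n{candidate}\n\n"
            f"Compiler errors:\n{diagnostics}\n\nReturn a corrected test."
        )
        candidate = query_llm(prompt)
    return None  # budget exhausted without a compilable test
```

Note that the loop's stopping criterion is compilability alone: nothing here rewards meaningful assertions, which is why the paper pairs the repair loop with mutation score to expose tests that compile but verify little.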