ArXiv TLDR

Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

arXiv:2604.24703

Amal Akli, Mike Papadakis, Maxime Cordy, Yves Le Traon

cs.SE, cs.AI

TLDR

Introduces SpecValidator, a lightweight classifier that effectively detects defective task descriptions in LLM-based code generation, outperforming larger models.

Key contributions

  • Developed SpecValidator, a lightweight, finetuned classifier for detecting defects in LLM task descriptions.
  • Achieves F1=0.804 and MCC=0.745, significantly outperforming GPT-5-mini and Claude Sonnet 4.
  • Demonstrates SpecValidator's ability to generalize and detect unknown Under-Specification defects in real data.
  • Reveals LLM robustness to defects depends on defect type and description characteristics, not model capacity.
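The headline numbers above (F1 = 0.804, MCC = 0.745) are standard binary-classification metrics. As a minimal sketch of how such scores are computed from a confusion matrix of defective vs. clean task descriptions — the counts below are hypothetical, not taken from the paper:

```python
import math

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; stays informative
    even when the defective/clean classes are imbalanced."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Hypothetical counts on a defect-detection test set
tp, tn, fp, fn = 80, 90, 20, 10
print(round(f1_score(tp, fp, fn), 3))  # 0.842
print(round(mcc(tp, tn, fp, fn), 3))   # 0.704
```

MCC is the stricter of the two: unlike F1 it also rewards correct "clean" predictions (true negatives), which is why the paper reports both.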

Why it matters

Defective task descriptions can severely degrade LLM code generation. SpecValidator detects these issues automatically, improving reliability. The analysis also shows that LLM robustness depends on the defect type rather than model capacity, underscoring the value of structured task descriptions.

Original Abstract

Large language models are widely used for code generation, yet they rely on an implicit assumption that task descriptions are sufficiently detailed and well-formed. In practice, however, users may provide defective descriptions, which can strongly affect code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small, parameter-efficiently finetuned model, to automatically detect task description defects. We evaluate SpecValidator on three types of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) across three benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469, MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518, MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs to task description defects depends primarily on the type of defect and the characteristics of the task description, rather than on model capacity, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
