
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

arXiv:2604.26923

Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, et al.

cs.SE, cs.CL

TLDR

ClassEval-Pro is a new cross-domain benchmark for evaluating LLMs on class-level code generation, revealing significant challenges in compositional code creation.

Key contributions

  • Introduces ClassEval-Pro, a 300-task, 11-domain benchmark for class-level code generation.
  • Constructed via an automated three-stage pipeline, with every task validated by an LLM Judge Ensemble and test suites exceeding 90% line coverage.
  • The best of five frontier LLMs achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models (see the estimator sketch after this list).
  • Logic (56.2%) and dependency (38.0%) errors dominate, highlighting cross-method coordination as a key bottleneck.
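
Pass@1 is conventionally computed with the unbiased pass@k estimator from Chen et al. (2021); the summary does not spell out the paper's exact harness, so the following is a minimal sketch assuming that convention, where a sample presumably counts as correct only if the whole generated class passes its test suite:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator (Chen et al., 2021):
        #   pass@k = 1 - C(n - c, k) / C(n, k)
        # n: samples generated per task
        # c: samples that pass the task's full test suite
        if n - c < k:
            return 1.0  # fewer failing samples than the budget: a pass is guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 10 samples per task, 4 of which pass the class-level tests:
    print(pass_at_k(10, 4, 1))  # 0.4

Requiring the entire class to pass, rather than a single isolated function, is what makes class-level Pass@1 so much harsher than its function-level counterpart.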

Why it matters

This paper fills a critical gap in LLM evaluation for compositional, class-level code. ClassEval-Pro offers a robust, scalable benchmark and reveals that current LLMs struggle with cross-method coordination, giving clear directions for improving LLM capabilities in complex software development.

Original Abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes -- compositional code creation, i.e., building a complete, internally structured class from a specification -- remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark's discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
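
The abstract's quality gate (test suites with over 90% line coverage) maps naturally onto coverage.py's standard CLI. The sketch below is an illustrative reconstruction under that assumption, not the authors' pipeline; the tests/ layout, threshold constant, and function name are all hypothetical:

    import subprocess

    COVERAGE_THRESHOLD = 90  # minimum line coverage (%), per the abstract

    def passes_coverage_gate(task_dir: str) -> bool:
        # Run the task's test suite under coverage measurement.
        tests = subprocess.run(
            ["coverage", "run", "-m", "pytest", "tests/"],
            cwd=task_dir,
        )
        if tests.returncode != 0:
            return False  # the suite itself must pass before coverage matters
        # `coverage report --fail-under=N` exits non-zero when total
        # line coverage falls below N percent.
        report = subprocess.run(
            ["coverage", "report", f"--fail-under={COVERAGE_THRESHOLD}"],
            cwd=task_dir,
        )
        return report.returncode == 0

A gate like this rejects any candidate task whose reference tests either fail or exercise too little of the class, which is what makes the benchmark's correctness signal trustworthy at scale.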
