CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation
Zirui Wan, Zhaonan Wu, Xinyi Hou, Yanjie Zhao, Pengcheng Xia + 1 more
TLDR
CommitSuite is a new benchmark for commit classification and message generation, featuring 63k CCS-compliant commits and a reference-free evaluation framework.
Key contributions
- Introduces CommitSuite, a large benchmark of 63,533 CCS-compliant commits from 243 repos.
- Enriches commits with AST-level code changes and LLM-assisted semantic annotations.
- Proposes a novel reference-free evaluation framework for commit message generation.
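To make "CCS-compliant" concrete, here is a minimal sketch of checking whether a commit header follows the Conventional Commits shape `type(scope)!: description`. It is not the benchmark's own filter. The names `CCS_TYPES`, `HEADER_RE`, `CommitHeader`, and `parse_ccs_header` are hypothetical, and the type list is an assumption: `feat` and `fix` come from the specification, the rest from the common Angular convention, and the paper's exact label set is not given in this summary.

```python
import re
from typing import NamedTuple, Optional

# Assumed label set: "feat" and "fix" are named by the specification itself;
# the others follow the Angular convention. Not confirmed by the paper summary.
CCS_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
)

# Header shape: type, optional "(scope)", optional "!", then ": " and a description.
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


class CommitHeader(NamedTuple):
    type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_ccs_header(message: str) -> Optional[CommitHeader]:
    """Parse the first line of a commit message; return None if it is not CCS-shaped."""
    lines = message.splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0])
    if match is None or match.group("type") not in CCS_TYPES:
        return None
    return CommitHeader(
        type=match.group("type"),
        scope=match.group("scope"),
        breaking=match.group("breaking") is not None,
        description=match.group("description"),
    )


if __name__ == "__main__":
    print(parse_ccs_header("feat(parser)!: add array support"))
    print(parse_ccs_header("Update README"))
```

The sketch inspects only the header line. The specification also defines bodies and footers (for example `BREAKING CHANGE:`), which a fuller check would need to handle.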
Why it matters
This paper addresses the need for large-scale benchmarks and reliable evaluation methods in commit message research. CommitSuite provides a unified, semantically rich resource that supports more robust and reproducible studies of commit classification and generation. Its reference-free evaluation framework also makes it possible to assess generated messages without human-written references, with LLM-based evaluation reaching 0.849 Cohen's Kappa agreement against human judgments.
Original Abstract
High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.
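The agreement figure quoted in the abstract is Cohen's Kappa, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance. The sketch below computes it for two lists of binary verdicts. It is an illustration only: the function name `cohens_kappa` and the toy verdicts are made up here, and the abstract does not say how the paper pooled verdicts across the five metrics.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy verdicts (1 = pass, 0 = fail) for one binary metric; not data from the paper.
    human_verdicts = [1, 1, 0, 0]
    llm_verdicts = [1, 1, 0, 1]
    print(cohens_kappa(human_verdicts, llm_verdicts))
```

In the toy case the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is p_e = 0.5, so kappa = 0.5. Kappa is preferred over raw agreement here because binary pass/fail metrics can show high raw agreement simply from one label dominating.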