ArXiv TLDR

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

arXiv:2604.26523

Dong Xu, Mingwei Liu, Xiwen Wang, Jianfeng Zhong, Zibin Zheng

cs.SE

TLDR

RepoDoc uses a repository knowledge graph to generate comprehensive, semantically structured code documentation 3x faster and with 85% fewer tokens than state-of-the-art tools, and it supports efficient incremental updates.

Key contributions

  • Introduces RepoDoc, a framework that uses a repository knowledge graph (RepoKG) as the semantic foundation for generating and maintaining code documentation.
  • Generates modular, cross-referenced documentation with auto-generated diagrams by querying the RepoKG.
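  • Enables efficient incremental updates via semantic impact propagation, which traverses the RepoKG in both directions and regenerates only the documentation affected by a code change.
  • Outperforms state-of-the-art tools on 24 repositories across 8 programming languages: 3x faster generation with 85% fewer tokens, and 73% less update time.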
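The incremental-update idea can be illustrated with a small graph traversal. This is only a sketch under stated assumptions, not RepoDoc's implementation: the edge list, the `max_hops` cutoff, and the names `affected_entities` and `modules_to_regenerate` are all hypothetical. It treats every dependency edge as traversable in both directions, collects the entities within a few hops of a change, and maps them to the modules whose documentation would need regenerating.

```python
from collections import deque


def affected_entities(edges, changed, max_hops=2):
    """Return entities within max_hops of any changed entity.

    Edges are followed in both directions, so callers and callees
    of a changed entity are both reached. Hypothetical helper.
    """
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    seen = set(changed)
    queue = deque((node, 0) for node in sorted(changed))
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def modules_to_regenerate(module_of, affected):
    """Map affected entities to the modules that document them."""
    return {module_of[e] for e in affected if e in module_of}


# Toy dependency graph (assumed data): (caller, callee) pairs.
EDGES = [
    ("cli.main", "api.handler"),
    ("api.handler", "core.parse"),
    ("core.parse", "util.tokenize"),
    ("report.render", "report.format"),
]
MODULE_OF = {
    "cli.main": "cli",
    "api.handler": "api",
    "core.parse": "core",
    "util.tokenize": "util",
    "report.render": "report",
    "report.format": "report",
}

one_hop = affected_entities(EDGES, {"core.parse"}, max_hops=1)
stale_modules = modules_to_regenerate(MODULE_OF, one_hop)
```

In this toy graph a change to `core.parse` marks the `core`, `api`, and `util` modules as stale and leaves `cli` and `report` untouched, which is the kind of selective regeneration the bullet above describes.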

Why it matters

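Existing LLM-based documentation tools process source code as flat fragments, which wastes tokens, slows generation, and yields isolated documents with no semantic structure. RepoDoc instead grounds generation in a repository knowledge graph, improving API coverage by 32.5% and completeness by 10.4% while cutting generation time and token usage. Because updates regenerate only the parts affected by a change, keeping documentation accurate for large, evolving codebases becomes far more practical.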

Original Abstract

Maintaining up-to-date, comprehensive documentation for large codebases is a persistent challenge. Recent progress in automated documentation has moved from template-based rules to large language models (LLMs), yet existing tools still process source code as flat fragments, producing isolated documents that lack semantic structure. This design also leads to excessive token consumption and slow generation, while failing to capture how code changes propagate across dependencies. We propose RepoDoc, a system that uses a repository knowledge graph (RepoKG) as the semantic foundation for the entire documentation lifecycle. Our framework consists of three stages: (1) RepoKG construction, which extracts code entities and their relationships; (2) module clustering, which groups code into functionally cohesive, hierarchical units; and (3) skillful agent-based generation, which queries the graph to create modular, cross-referenced documentation with auto-generated Mermaid diagrams. For incremental maintenance, a semantic impact propagation mechanism navigates the RepoKG bidirectionally to pinpoint all affected parts, allowing selective, targeted regeneration. Evaluated on 24 repositories across 8 programming languages, RepoDoc substantially outperforms state-of-the-art alternatives. It improves API coverage by 32.5% and completeness by 10.4%, while generating documentation 3x faster with 85% fewer tokens. For incremental updates, it cuts update time by 73% and token usage by 77%, and achieves 10.2% higher update recall, more accurately reflecting code changes in the regenerated documentation. The source code and experimental artifacts are available at https://github.com/SYSUSELab/RepoDoc.
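To make the first stage and the diagram output more concrete, here is a minimal sketch that extracts function entities and call relationships from a single Python source string and renders them as a Mermaid flowchart. It is an illustration under assumptions, not the paper's extractor: RepoDoc targets 8 languages and its parsing approach is not described here, while this sketch relies only on Python's standard `ast` module. The names `call_edges`, `to_mermaid`, and `SAMPLE` are hypothetical.

```python
import ast


def call_edges(source):
    """Extract (caller, callee) pairs between functions defined in `source`.

    Only direct calls by bare name are detected; method calls and
    imported functions are ignored in this simplified sketch.
    """
    tree = ast.parse(source)
    functions = {
        node.name: node
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    edges = set()
    for name, node in functions.items():
        for child in ast.walk(node):
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                callee = child.func.id
                if callee in functions and callee != name:
                    edges.add((name, callee))
    return sorted(edges)


def to_mermaid(edges):
    """Render the edges as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    lines += [f"    {src} --> {dst}" for src, dst in edges]
    return "\n".join(lines)


# Assumed sample input for the sketch.
SAMPLE = '''
def tokenize(text):
    return text.split()

def parse(text):
    return [t.lower() for t in tokenize(text)]

def main():
    return parse("Hello World")
'''

edges = call_edges(SAMPLE)
diagram = to_mermaid(edges)
```

For the sample above, `edges` holds the pairs `main -> parse` and `parse -> tokenize`, and `diagram` is Mermaid text that a documentation page could embed directly.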
