How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Xinjian Wu, Jingzhi Gong, Gunel Jahangirova, Jie Zhang
TLDR
A controlled study of RAG code completion finds that function-based chunking significantly underperforms, that Sliding Window and cAST chunking offer the best cost-quality trade-off, and that cross-file context length is the dominant parameter.
Key contributions
- Function-based chunking significantly underperforms other strategies for RAG code completion.
- Sliding Window and cAST chunking strategies offer the best cost-quality trade-off (a sliding-window sketch follows this list).
- Increasing cross-file context length from 2k to 8k tokens improves performance by up to 4.2 percentage points.
- The choice of chunking strategy has a statistically significant impact on RAG code completion quality.
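To make the Sliding Window strategy concrete, here is a minimal sketch of a line-based sliding-window chunker. This is an illustration only, not the paper's implementation: the `window` and `stride` values are hypothetical, and the study's chunkers may operate on tokens or characters rather than lines.

```python
def sliding_window_chunks(source: str, window: int = 30, stride: int = 15) -> list[str]:
    """Split a source file into overlapping fixed-size chunks of lines.

    Each chunk spans up to `window` lines; consecutive chunks start
    `stride` lines apart, so they overlap by `window - stride` lines and
    code near a chunk boundary remains retrievable from more than one unit.
    """
    lines = source.splitlines()
    chunks: list[str] = []
    for start in range(0, max(len(lines), 1), stride):
        chunk = "\n".join(lines[start:start + window])
        if chunk.strip():  # skip all-whitespace windows
            chunks.append(chunk)
        if start + window >= len(lines):  # the tail is covered; stop
            break
    return chunks
```

The overlap is the key design choice: it lets a completion site near a chunk boundary still retrieve its surrounding context, which fits the paper's finding that chunk size itself has only a weak, non-monotonic effect.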
Why it matters
This paper provides empirical evidence for optimizing RAG pipelines for code completion, addressing the lack of empirical justification behind commonly adopted chunking strategies. It challenges intuition by showing that function-based chunking underperforms, while Sliding Window and cAST sit on the cost-quality Pareto front, giving practitioners concrete guidance for designing more effective RAG systems.
Original Abstract
Retrieval-augmented generation (RAG) pipelines for code completion rely on chunking to segment source files into retrievable units, yet chunking strategies are typically adopted without empirical justification, and practitioner recommendations are notably inconsistent. We present a controlled empirical study isolating the effect of chunking on code completion quality by crossing four representative strategies (Function, Declaration, Sliding Window, and cAST) with four retrievers, five generators, and nine parameter configurations on two benchmarks (RepoEval and CrossCodeEval), totaling 864 experimental settings. Our results reveal that chunking strategy has a statistically significant effect on RAG-based code completion. Contrary to intuition, chunking based on functions underperforms all other strategies by 3.57--5.64 percentage points on RepoEval (Cliff's delta = -1.0), while the remaining chunking strategies perform comparably. Our further analysis demonstrates that this observation holds across all retriever--generator combinations. We also find that cross-file context length is the dominant parameter: increasing the budget from 2,048 to 8,192 tokens yields up to 4.2 percentage points of improvement, whereas chunk size has a weaker, non-monotonic effect. On the cost--quality Pareto front, Sliding Window and cAST dominate both benchmarks; Function chunking is never Pareto-optimal.
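To make the cross-file context-length parameter concrete, below is a hedged sketch of how retrieved chunks might be packed into a fixed token budget before generation. The greedy packing logic is an assumption for illustration, not the authors' pipeline; the 8,192-token default mirrors the paper's largest setting, and token counting uses tiktoken, which may differ from the tokenizers actually used in the study.

```python
import tiktoken

# Assumption: any BPE tokenizer gives a reasonable token count here.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def pack_context(chunks: list[str], budget: int = 8192) -> str:
    """Greedily pack retrieved chunks (assumed sorted best-first by
    retrieval score) into a cross-file context of at most `budget` tokens."""
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # skip chunks that would overflow the budget
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```

Under this framing, raising `budget` from 2,048 to 8,192 simply admits more (or longer) retrieved chunks into the prompt, which is the lever the paper identifies as dominant.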