ArXiv TLDR

Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation

2605.00318

Pooja Guttal, Varun Magotra, Vasudeva Mahavishnu, Natasha Chanto, Sidharth Sivaprasad + 1 more

cs.CL, cs.IR

TLDR

STC is a structure-aware chunking framework for tabular data in RAG that keeps each row's fields together in a key-value block, cutting chunk count by up to 56% and raising hybrid-retrieval MRR from 0.3576 to 0.5945.

Key contributions

  • Proposes Structure-Aware Tabular Chunking (STC) for RAG, addressing limitations of text-based chunking.
  • Uses a hierarchical Row Tree and key-value encoding to preserve row-level semantic relationships.
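  • Reduces chunk count on the MAUD dataset by up to 40% against standard recursive chunking and up to 56% against a key-value baseline, while improving token utilization and processing efficiency.
  • Improves MRR from 0.3576 to 0.5945 in a hybrid retrieval setting and Recall@1 from 0.366 to 0.754 in BM25-only retrieval.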
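
For readers unfamiliar with the two retrieval metrics quoted above, here is a minimal sketch of how MRR and Recall@k are commonly computed. It assumes exactly one relevant chunk per query, which the summary does not confirm, and the function names and chunk ids are hypothetical, not taken from the paper.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """Return 1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def recall_at_k(runs, k):
    """Fraction of queries whose relevant chunk appears in the top k results.

    Assumption: one relevant chunk per query, so per-query recall is 0 or 1.
    """
    hits = sum(1 for ranked, gold in runs if gold in ranked[:k])
    return hits / len(runs)


# Hypothetical retrieval results: (ranked chunk ids, id of the relevant chunk).
runs = [
    (["c2", "c1", "c3"], "c1"),  # relevant chunk at rank 2
    (["c5", "c4"], "c5"),        # relevant chunk at rank 1
    (["c7", "c8"], "c9"),        # relevant chunk not retrieved
]
mrr = mean_reciprocal_rank(runs)
recall_at_1 = recall_at_k(runs, 1)
```

Under this definition, Recall@1 counts only queries where the relevant chunk is ranked first, while MRR also gives partial credit for lower ranks.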

Why it matters

Most RAG chunking strategies are designed for unstructured text and ignore table structure, which can separate the fields of a row and hurt retrieval. STC chunks along row boundaries instead, producing fewer, denser chunks and higher retrieval scores on the MAUD dataset. That makes it relevant to enterprise data pipelines that run RAG over CSV and Excel files.

Original Abstract

Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. Across evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines, respectively, while improving token utilization and processing efficiency. In retrieval benchmarks, STC improves MRR from 0.3576 to 0.5945 in a hybrid setting and increases Recall@1 from 0.366 to 0.754 in BM25-only retrieval. These results demonstrate that preserving structure during chunking improves retrieval performance, highlighting the importance of structure-aware chunking for RAG over tabular data.
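
The pipeline the abstract describes (key-value row encoding, token-constrained splitting at structural boundaries, overlap-free greedy merging) can be illustrated with a rough sketch. This is not the authors' implementation: the Row Tree is reduced here to a flat list of per-row field blocks, token counts are approximated by whitespace splitting, and every function name is hypothetical.

```python
import csv
import io


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def encode_row(header, row):
    """Encode one row as a list of 'key: value' fields."""
    return [f"{key}: {value}" for key, value in zip(header, row)]


def split_row(fields, max_tokens):
    """Split one row into blocks within the budget, breaking only between fields.

    A single field larger than max_tokens is kept whole in its own block.
    """
    blocks, current, used = [], [], 0
    for field in fields:
        n = count_tokens(field)
        if current and used + n > max_tokens:
            blocks.append("\n".join(current))
            current, used = [], 0
        current.append(field)
        used += n
    if current:
        blocks.append("\n".join(current))
    return blocks


def greedy_merge(blocks, max_tokens):
    """Pack adjacent blocks into chunks up to the budget, with no overlap."""
    chunks, current, used = [], [], 0
    for block in blocks:
        n = count_tokens(block)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_table(csv_text, max_tokens):
    """Chunk a CSV string: encode rows, split oversized rows, merge small ones."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    blocks = []
    for row in body:
        blocks.extend(split_row(encode_row(header, row), max_tokens))
    return greedy_merge(blocks, max_tokens)
```

Calling `chunk_table(csv_text, max_tokens=12)` on a small three-column CSV packs whole rows together until the budget is reached, so each field appears in exactly one chunk alongside its column name.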
