Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé + 3 more
TLDR
This paper empirically evaluates PDF parsing and chunking strategies for RAG-based financial question answering, offering practical guidelines.
Key contributions
- Systematically evaluates multiple PDF parsers and chunking strategies for RAG systems.
- Focuses on Question Answering in the financial domain using two benchmarks.
- Introduces TableQuest, a new publicly available financial QA benchmark.
- Provides practical guidelines for building robust RAG pipelines for PDF understanding.
Why it matters
PDFs are hard to process automatically, and no comprehensive evaluation exists of how the components and design choices of a RAG system affect PDF understanding. This paper fills that gap by systematically examining parsing and chunking for financial QA, and derives practical guidelines for building robust RAG pipelines.
Original Abstract
PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems, to automate PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
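To make "chunking strategies (with varied overlap)" concrete, here is a minimal Python sketch of fixed-size chunking with a sliding overlap. It is an illustration only, not the paper's implementation: the function name `chunk_text`, the character-based sizing, and the parameter names `chunk_size` and `overlap` are all assumptions. The summary above does not say which chunkers or size units the authors used.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper for illustration; sizes are counted in characters,
    whereas real pipelines often count tokens or sentences instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Revenue rose 12% in Q3. Operating margin fell to 8%."
    for piece in chunk_text(sample, chunk_size=20, overlap=5):
        print(repr(piece))
```

A larger overlap makes it less likely that a fact (for example, a table row and its header) is split across two chunks, at the cost of indexing more redundant text. That is the kind of trade-off the study varies.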