ParseBench: A Document Parsing Benchmark for AI Agents
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, et al.
TLDR
ParseBench is a new benchmark that evaluates the semantic document parsing capabilities AI agents need on enterprise documents, revealing the limitations of current systems.
Key contributions
- Introduces ParseBench, a new benchmark with ~2,000 human-verified enterprise document pages.
- Evaluates document parsing across five key dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding.
- Benchmarks 14 methods, revealing fragmented capabilities and no consistently strong performer across all dimensions (see the sketch after this list).
- Highlights significant remaining capability gaps in current AI document parsing systems for enterprise automation.
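To make the "no consistently strong performer" finding concrete, here is a minimal sketch of checking which method wins each of the five dimensions. The scores are made up for illustration, not ParseBench results; a fragmented landscape is one where no single method tops every dimension.

```python
# Hypothetical per-dimension scores (0-100); illustrative only, not ParseBench results.
scores = {
    "method_a": {"tables": 85, "charts": 60, "faithfulness": 90, "formatting": 70, "grounding": 65},
    "method_b": {"tables": 70, "charts": 88, "faithfulness": 75, "formatting": 80, "grounding": 72},
    "method_c": {"tables": 78, "charts": 74, "faithfulness": 82, "formatting": 76, "grounding": 80},
}

def best_per_dimension(scores):
    """Return the top-scoring method for each capability dimension."""
    dims = next(iter(scores.values()))
    return {d: max(scores, key=lambda m: scores[m][d]) for d in dims}

winners = best_per_dimension(scores)
print(winners)
# A fragmented landscape means different methods win different dimensions,
# i.e. the set of winners contains more than one method.
print("single dominant method:", len(set(winners.values())) == 1)
```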
Why it matters
AI agents need robust document parsing to make autonomous decisions, especially in enterprise settings. ParseBench addresses a critical gap by providing a comprehensive framework for evaluating semantic correctness, guiding future research and development in this area.
Original Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
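For readers who want to try the benchmark, a minimal sketch of loading the dataset from the HuggingFace repository linked above follows. The repo id comes from the paper's link; the split and column names are not stated here, so inspect the dataset card before building an evaluation loop.

```python
# Minimal sketch: load ParseBench with the Hugging Face `datasets` library.
# The repo id is taken from the paper's link; split/column names are
# assumptions -- check the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("llamaindex/ParseBench")
print(ds)  # shows the available splits and their columns
```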