ParseBench: A Document Parsing Benchmark for AI Agents
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, et al.
TLDR
ParseBench is a new benchmark that evaluates the semantic document parsing capabilities AI agents need on enterprise documents, revealing the limitations of current systems.
Key contributions
- Introduces ParseBench, a new benchmark with ~2,000 human-verified enterprise document pages.
- Evaluates document parsing across five key dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding.
- Benchmarks 14 methods, revealing fragmented capabilities and no consistently strong performer across all dimensions (see the sketch after this list).
- Highlights significant remaining capability gaps in current AI document parsing systems for enterprise automation.
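To make the "no consistently strong performer" finding concrete, here is a minimal sketch of checking which method wins each of the five dimensions. The scores are made up for illustration, not ParseBench results; a fragmented landscape is one where no single method tops every dimension.

```python
# Hypothetical per-dimension scores (0-100); illustrative only, not ParseBench results.
scores = {
    "method_a": {"tables": 85, "charts": 60, "faithfulness": 90, "formatting": 70, "grounding": 65},
    "method_b": {"tables": 70, "charts": 88, "faithfulness": 75, "formatting": 80, "grounding": 72},
    "method_c": {"tables": 78, "charts": 74, "faithfulness": 82, "formatting": 76, "grounding": 80},
}

def best_per_dimension(scores):
    """Return the top-scoring method for each capability dimension."""
    dims = next(iter(scores.values()))
    return {d: max(scores, key=lambda m: scores[m][d]) for d in dims}

winners = best_per_dimension(scores)
print(winners)
# A fragmented landscape means different methods win different dimensions,
# i.e. the set of winners contains more than one method.
print("single dominant method:", len(set(winners.values())) == 1)
```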
Why it matters
AI agents need robust document parsing to make autonomous decisions, especially in enterprise settings. ParseBench addresses a critical gap by providing a comprehensive framework for evaluating semantic correctness, guiding future research and development in this area.
Original Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
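For readers who want to try the benchmark, a minimal sketch of loading the dataset from the HuggingFace repository linked above follows. The repo id comes from the paper's link; the split and column names are not stated here, so inspect the dataset card before building an evaluation loop.

```python
# Minimal sketch: load ParseBench with the Hugging Face `datasets` library.
# The repo id is taken from the paper's link; split/column names are
# assumptions -- check the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("llamaindex/ParseBench")
print(ds)  # shows the available splits and their columns
```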