CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu + 8 more
TLDR
CC-OCR V2 introduces a new benchmark for evaluating Large Multimodal Models on real-world document processing, revealing that current models fall short of practical requirements.
Key contributions
- Introduces CC-OCR V2, a comprehensive benchmark for real-world enterprise document processing.
- Features 7,093 high-difficulty samples across 5 OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction (KIE), and document question answering (QA).
- Benchmarks 14 advanced LMMs, revealing their substantial performance degradation in practical settings.
- Identifies a critical gap between current LMM performance on existing benchmarks and real-world application needs.
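The summary does not describe the released evaluation toolkit's API, but benchmarks like this typically score the text-recognition track with an edit-distance-based metric. The sketch below is a generic, hypothetical illustration of such a metric (the function names `edit_distance` and `recognition_score` are our own, not from the CC-OCR V2 toolkit):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance(a[:i], b[:j]) for current row i
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            if a[i - 1] == b[j - 1]:
                dp[j] = prev  # characters match: no extra cost
            else:
                # 1 + min(substitute, delete, insert)
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[n]


def recognition_score(prediction: str, ground_truth: str) -> float:
    """1 minus normalized edit distance, in [0, 1]; higher is better."""
    denom = max(len(prediction), len(ground_truth), 1)
    return 1.0 - edit_distance(prediction, ground_truth) / denom
```

A model's track score would then be the mean of `recognition_score` over all samples; the actual metrics used per track (parsing, grounding, KIE, QA) are defined in the paper's released toolkit.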
Why it matters
This paper matters because it exposes the limitations of current Large Multimodal Models in practical, real-world document processing. As a new and challenging benchmark, CC-OCR V2 provides a crucial tool for developing LMMs that genuinely meet enterprise needs, and its findings point to a significant research direction for improving AI document literacy.
Original Abstract
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.