GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
TLDR
Current OCR models generalize poorly beyond a handful of common scripts: evaluated on 100+ diverse Unicode scripts with GlotOCR Bench, most perform well on fewer than ten, and even frontier models fail beyond thirty.
Key contributions
- Introduces GlotOCR Bench, a comprehensive benchmark for OCR generalization across 100+ Unicode scripts.
- Evaluates open-weight and proprietary models on clean and degraded multilingual texts.
- Finds most models perform well on fewer than ten scripts; even the strongest frontier models fail to generalize beyond thirty.
- Shows OCR performance correlates with script-level pretraining coverage, not just visual recognition.
Why it matters
This paper highlights a critical limitation of current OCR models, showing their poor generalization across diverse global scripts. It provides a much-needed comprehensive benchmark to drive future research towards more inclusive and robust multilingual OCR systems. The findings emphasize the need for broader pretraining data.
Original Abstract
Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
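The abstract describes evaluating OCR output against reference texts across scripts, but does not name the scoring metric. A minimal sketch of such an evaluation, assuming character error rate (CER) as the metric since it is the standard choice for OCR and works uniformly across Unicode scripts:

```python
# Hedged sketch: the paper's exact metric is not stated in the abstract;
# character error rate (CER) is assumed here as the standard OCR measure.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

# Works on any script, e.g. a Devanagari reference vs. model output.
print(cer("नमस्ते", "नमस्ते"))  # identical strings -> 0.0
```

Because CER operates on Unicode code points rather than glyphs, it applies equally to LTR and RTL scripts; a model that hallucinates characters from a visually similar script would still score a high CER against the reference.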