ArXiv TLDR

Document-as-Image Representations Fall Short for Scientific Retrieval

arXiv: 2604.18508

Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, Bhuwan Dhingra

cs.IR · cs.AI · cs.CL

TLDR

Document-as-image representations are suboptimal for scientific document retrieval; text-based representations perform better, even for figure-based queries, and interleaved text+image representations also outperform the image-only approach.

Key contributions

  • Document-as-image representations are consistently suboptimal for scientific retrieval, especially with longer documents.
  • Text-based representations prove most effective, even for figure-based queries, using captions and context.
  • Interleaved text+image representations outperform document-as-image without requiring specialized training.
  • Introduced ArXivDoc, a new benchmark from LaTeX sources, enabling structured scientific retrieval studies.

Why it matters

Document-as-image representations, though common in recent embedding models, are suboptimal for scientific retrieval. This paper introduces ArXivDoc, a new benchmark built from LaTeX sources, and shows that text-based and interleaved multimodal representations are superior, pointing the way toward more effective retrieval systems for scientific documents.

Original Abstract

Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
