ArXiv TLDR

Very Efficient Listwise Multimodal Reranking for Long Documents

2605.11864

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

cs.IR · cs.AI · cs.CV · cs.MM

TLDR

ZipRerank is a highly efficient listwise multimodal reranker that significantly speeds up M-RAG for long documents by reducing input length and eliminating autoregressive decoding.

Key contributions

  • Reduces input length with a lightweight query-image early interaction mechanism.
  • Eliminates autoregressive decoding by scoring all candidates in a single forward pass.
  • Uses two-stage training: listwise pretraining on text-as-images and VLM-teacher-distilled finetuning.
  • Matches SOTA multimodal rerankers while reducing LLM inference latency by an order of magnitude.
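The core efficiency idea above — scoring every candidate in one forward pass instead of decoding a ranking token by token — can be illustrated with a toy sketch. Everything here is illustrative (the dummy token-overlap scorer stands in for the paper's VLM scoring head; none of these names come from the ZipRerank codebase):

```python
# Toy sketch of single-pass listwise reranking (not the paper's code).
# A real system would encode the query and candidate pages with a VLM;
# here a trivial token-overlap scorer stands in for the scoring head.

def listwise_scores(query, candidates):
    """Return one relevance score per candidate in a single 'pass'
    (dummy scorer: word overlap between query and candidate)."""
    q_tokens = set(query.lower().split())
    return [len(q_tokens & set(c.lower().split())) for c in candidates]

def rerank(query, candidates):
    """Rank all candidates at once by sorting their scores -- no
    multi-step autoregressive decoding is involved."""
    scores = listwise_scores(query, candidates)
    order = sorted(range(len(candidates)),
                   key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]

docs = [
    "a survey of image captioning",
    "efficient reranking for multimodal retrieval",
    "notes on graph algorithms",
]
print(rerank("efficient multimodal reranking", docs)[0])
```

The point of the sketch is the shape of the computation: one scoring call over the whole candidate list, then a sort, rather than generating an ordered sequence of candidate identifiers step by step.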

Why it matters

Multimodal reranking over long documents is a major computational bottleneck in M-RAG pipelines. ZipRerank offers an efficient alternative that preserves accuracy while cutting LLM inference latency by up to an order of magnitude, making advanced M-RAG systems practical for real-world, latency-sensitive applications.

Original Abstract

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
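The abstract's "VLM-teacher-distilled soft-ranking supervision" suggests the student is trained to match a teacher's score distribution over the candidate list. A common way to express such a listwise distillation objective is a KL divergence between softmax-normalized score lists; the sketch below shows that standard formulation, which may differ from the paper's exact loss:

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a distribution over candidates."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_kl(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the candidate list -- a standard
    soft-ranking distillation loss (illustrative, not the paper's
    exact objective)."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# When the student's scores match the teacher's, the loss is zero;
# a reversed ranking incurs a positive penalty.
print(listwise_kl([2.0, 1.0, 0.5], [2.0, 1.0, 0.5]))
print(listwise_kl([2.0, 1.0, 0.5], [0.5, 1.0, 2.0]))
```

Because the supervision is a full distribution rather than hard labels, the student learns graded preferences among candidates instead of only the top-1 choice — the usual motivation for "soft" ranking targets.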
