SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Hector G. Rodriguez, Marcus Rohrbach
TLDR
SIEVES improves selective prediction in MLLMs by scoring visual evidence localization, tripling coverage on OOD tasks without model-specific training.
Key contributions
- Introduces SIEVES, a selective prediction method for MLLMs using visual evidence scoring.
- Learns to estimate the quality of localized visual evidence provided by reasoner models.
- Achieves up to 3x better coverage on challenging OOD benchmarks compared to baselines.
- Transfers to proprietary MLLMs (e.g., Gemini-3-Pro) without needing internal access.
Why it matters
Reliable MLLM deployment in real-world, out-of-distribution scenarios requires knowing when to abstain. By scoring the visual evidence a model localizes while answering, SIEVES substantially boosts coverage at a fixed risk level and generalizes across diverse models and datasets, making MLLMs more trustworthy and practical to deploy.
Original Abstract
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
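To make the selective-prediction setup in the abstract concrete, here is a minimal illustrative sketch (not the SIEVES selector itself): given per-answer confidence scores and correctness labels on a calibration set, pick the lowest threshold whose accepted answers keep the empirical risk (error rate) at or below a user-defined level, then report the resulting coverage. All names and the toy data are hypothetical.

```python
def calibrate_threshold(confidences, correct, max_risk):
    """Return the smallest confidence threshold whose accepted subset
    has error rate <= max_risk, or None if no threshold qualifies."""
    # Try thresholds at each observed confidence, lowest first, so we
    # keep coverage as high as possible while respecting the risk level.
    for t in sorted(set(confidences)):
        accepted = [c for conf, c in zip(confidences, correct) if conf >= t]
        if not accepted:
            continue
        risk = 1 - sum(accepted) / len(accepted)
        if risk <= max_risk:
            return t
    return None

def coverage(confidences, threshold):
    """Share of inputs the system answers (confidence >= threshold)."""
    return sum(c >= threshold for c in confidences) / len(confidences)

# Toy calibration data: confidence per answer and whether it was correct.
confs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
labels = [1, 1, 1, 1, 0, 1, 0, 0]

t = calibrate_threshold(confs, labels, max_risk=0.2)
print(t, coverage(confs, t))  # threshold 0.5 gives coverage 0.75 here
```

SIEVES's contribution is in how the confidence score is produced (by learning to judge the quality of the reasoner's localized visual evidence rather than relying on model logits); the abstain-below-threshold mechanism above is the standard selective-prediction wrapper it plugs into.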