ArXiv TLDR

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

2604.02323

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo + 1 more

cs.CV

TLDR

A new benchmark, RSC, challenges visual grounding models to infer targets from complex scenarios, revealing current model limitations.

Key contributions

  • Introduces Referring Scenario Comprehension (RSC), a new benchmark for scenario-based visual grounding.
  • RSC features paragraph-length queries requiring inference from roles, intentions, and relational context.
  • Provides interpretable difficulty tags and an out-of-distribution split for fine-grained analysis.

Why it matters

Current visual grounding benchmarks rarely require complex inference, so models can score well without deep contextual understanding. This paper introduces a challenging benchmark that exposes those limitations, pushing the field toward more robust visual comprehension systems, and provides concrete tools for developing models that can resolve targets from real-world scenario context.

Original Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
