ArXiv TLDR

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

2604.20623

Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

cs.CV cs.AI

TLDR

RSRCC is a new benchmark for remote sensing change question-answering, enabling fine-grained semantic reasoning about localized changes.

Key contributions

  • Presents RSRCC, a 126k-question benchmark for remote sensing change QA, focusing on localized semantic reasoning.
  • Addresses the gap in fine-grained change explanation, moving beyond traditional image-level change detection.
  • Develops a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as the final ambiguity-resolution stage for robust data construction.
  • The pipeline leverages semantic segmentation masks, image-text embeddings, and retrieval-augmented validation (see the sketch after this list).
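A minimal sketch of what such a screen-then-rank stage could look like, assuming a cheap image-text embedding similarity for initial filtering and a retrieval-augmented vision-language judge for the final Best-of-N selection. The function and parameter names below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): screen candidate change regions
# with an image-text embedding score, then resolve ambiguity by keeping the
# single best of N candidates under a vision-language judge. All helper names
# (embed_score, vlm_score) and thresholds are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Candidate:
    region_id: str   # change region derived from the semantic segmentation masks
    question: str    # generated localized, change-specific question
    answer: str      # generated answer describing the semantic change

def best_of_n(
    candidates: List[Candidate],
    embed_score: Callable[[Candidate], float],  # image-text embedding similarity (cheap screen)
    vlm_score: Callable[[Candidate], float],    # retrieval-augmented VLM judge (expensive)
    embed_threshold: float = 0.25,
    n: int = 8,
) -> Optional[Candidate]:
    """Drop candidates that fail the embedding screen, keep the top N,
    and return the best one under the VLM judge (or None if all are noisy)."""
    screened = [c for c in candidates if embed_score(c) >= embed_threshold]
    if not screened:
        return None  # all candidates for this region were noisy or ambiguous
    top_n = sorted(screened, key=embed_score, reverse=True)[:n]
    return max(top_n, key=vlm_score)
```

The two-stage layout mirrors the paper's description: the inexpensive embedding screen discards most noisy candidates at scale, and the costly VLM-based Best-of-N ranking is reserved for the remaining ambiguous cases.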

Why it matters

This paper addresses a key limitation in remote sensing by enabling models to explain *what* changes occur, beyond just *where*. This fine-grained semantic reasoning is vital for applications like environmental monitoring and urban planning, offering more actionable insights.

Original Abstract

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.
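Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the `datasets` library. The snippet below is a sketch: the split names follow the train/validation/test split described in the abstract, but the exact split and column names are assumptions about the dataset card.

```python
# Illustrative loading snippet; split and field layout are assumptions based on
# the abstract's 87k / 17.1k / 22k train / validation / test split.
from datasets import load_dataset

rsrcc = load_dataset("google/RSRCC")  # repo from the paper: huggingface.co/datasets/google/RSRCC
print(rsrcc)                          # inspect the available splits and columns
example = rsrcc["train"][0]           # assumed split name
print(example)
```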
