ArXiv TLDR

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

2604.14044

Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin + 3 more

cs.CV

TLDR

Delta-LLaVA is a new MLLM framework for remote sensing change detection, paired with the Delta-QA benchmark; it overcomes "temporal blindness" with change-aware attention mechanisms.

Key contributions

  • Introduces Delta-QA, a 180k VQA benchmark for multi-temporal remote sensing change interpretation.
  • Proposes Delta-LLaVA, an MLLM framework for remote sensing, addressing "temporal blindness" in change detection.
  • Features a Change-Enhanced Attention module and a Change-SEG module that amplify visual differences and extract difference features as input for the LLM.
  • Employs Local Causal Attention to prevent cross-temporal context leakage, improving change understanding.
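The paper does not publish implementation details, but the idea behind Local Causal Attention can be illustrated with a masking sketch: visual tokens from each temporal image attend causally only within their own image's block, so information from one timestamp cannot leak into another at the attention level. The function name and block layout below are hypothetical, assuming a fixed number of visual tokens per temporal image.

```python
# Hypothetical sketch: a "local causal" attention mask over the visual
# tokens of T temporal images. mask[i][j] is True iff token i may attend
# to token j. Attention is causal (j <= i) and restricted to the token's
# own temporal block, preventing cross-temporal context leakage.
def local_causal_mask(block_sizes):
    """block_sizes: number of visual tokens per temporal image, e.g. [4, 4]."""
    n = sum(block_sizes)
    mask = [[False] * n for _ in range(n)]
    start = 0
    for size in block_sizes:
        for i in range(start, start + size):
            for j in range(start, i + 1):  # causal within the same block only
                mask[i][j] = True
        start += size
    return mask

# Two temporal images with 2 tokens each: token 2 (first token of the
# second image) cannot attend to tokens 0-1 from the first image.
mask = local_causal_mask([2, 2])
```

In a real model this boolean mask would be converted to additive `-inf` biases on the attention logits; subsequent text tokens could still attend to all visual tokens, so the LLM integrates both timestamps while the visual streams stay separated.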

Why it matters

This paper tackles "temporal blindness", a key obstacle to applying MLLMs in remote sensing, with a new benchmark (Delta-QA) and a tailored model (Delta-LLaVA). Its specialized architecture markedly improves change detection and understanding in earth observation, enabling more intelligent monitoring.

Original Abstract

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.
