Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin + 3 more
TLDR
Delta-LLaVA is an MLLM framework for remote sensing change detection and understanding, introduced alongside the Delta-QA benchmark. It targets the "temporal blindness" of existing MLLMs with attention mechanisms designed for multi-temporal imagery.
Key contributions
- Introduces Delta-QA, a benchmark of 180k visual question-answering samples that unifies pixel-level segmentation and VQA across bi- and tri-temporal remote sensing scenarios.
- Proposes Delta-LLaVA, an MLLM framework for remote sensing, addressing "temporal blindness" in change detection.
- Features a Change-Enhanced Attention module that isolates and amplifies visual differences, and a Change-SEG module that uses Change Prior Embedding to extract difference features as input for the LLM.
- Employs Local Causal Attention to prevent cross-temporal context leakage, improving change understanding.
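The summary does not spell out how Local Causal Attention is built, so the following is only a minimal sketch of one plausible reading: a causal attention mask that is additionally restricted to tokens from the same time step, so that tokens from one acquisition date cannot attend to tokens from another. The function names `local_causal_mask` and `masked_attention`, and the idea of grouping tokens into per-time-step segments, are assumptions made for illustration and are not taken from the paper.

```python
import math


def local_causal_mask(segment_lengths):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumption (not from the paper): tokens are laid out as consecutive
    segments, one per time step, and attention is causal *within* a segment
    and blocked *across* segments.
    """
    # Segment id of every token position, e.g. [2, 2] -> [0, 0, 1, 1].
    seg_ids = [s for s, n in enumerate(segment_lengths) for _ in range(n)]
    total = len(seg_ids)
    return [
        [seg_ids[i] == seg_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention over plain Python lists, honoring the mask."""
    dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Only keys permitted by the mask take part in the softmax.
        allowed = [j for j in range(len(k)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(dim)
            for j in allowed
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        out.append([
            sum(e / norm * v[j][d] for e, j in zip(exps, allowed))
            for d in range(len(v[0]))
        ])
    return out


# Two time steps with two tokens each (toy 2-dimensional features).
mask = local_causal_mask([2, 2])
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
attended = masked_attention(tokens, tokens, tokens, mask)
```

Under this mask, changing the tokens of the first time step leaves the outputs for the second time step untouched, which is the "no cross-temporal leakage" property the bullet describes. How the actual model lets the language tokens combine information from the separate time steps is not covered by this sketch.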
Why it matters
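General-purpose MLLMs lack built-in mechanisms for comparing images across time and for precise spatial grounding, a limitation the authors call "temporal blindness". The paper addresses it with both a benchmark (Delta-QA) and a model (Delta-LLaVA) that handle segmentation and question answering in a single framework. The authors report that Delta-LLaVA outperforms leading generalist MLLMs and specialized segmentation models on complex change deduction and boundary localization, which would support more capable automated monitoring in earth observation.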
Original Abstract
While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.