
TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

2604.21806

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li + 1 more

cs.CV

TLDR

TEMA is a Composed Image Retrieval framework built for complex, multi-modification text queries; the authors report that it outperforms existing methods on four benchmark datasets.

Key contributions

  • Tackles limitations in Composed Image Retrieval (CIR) regarding insufficient entity coverage and clause-entity misalignment.
  • Proposes TEMA, the first CIR framework specifically designed for multi-modification text queries.
  • Introduces two new instruction-rich multi-modification datasets: M-FashionIQ and M-CIRR.
  • Reports stronger retrieval results on four benchmark datasets, in both the original and multi-modification settings, while balancing retrieval accuracy against computational cost.

Why it matters

Existing Composed Image Retrieval (CIR) setups rely on simple modification texts that cover only a few salient changes, so they struggle when a query asks for several modifications at once. This work contributes TEMA and two multi-modification datasets aimed at that gap, bringing CIR closer to the longer, more detailed requests seen in real-world use.

Original Abstract

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.
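
For readers new to the paradigm, the generic CIR setup described in the abstract (a reference image plus modification text, used together to rank candidate images) can be sketched in a few lines. This is only an illustration and not TEMA's architecture: the toy 3-dimensional embeddings, the element-wise-sum fusion in `compose_query`, and all function and image names are assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compose_query(image_emb, text_emb):
    # Hypothetical fusion step: an element-wise sum of the two embeddings.
    # Real CIR models learn this composition; the sum is only a placeholder.
    return [i + t for i, t in zip(image_emb, text_emb)]


def retrieve(image_emb, text_emb, gallery, k=1):
    """Rank gallery images by similarity to the composed query."""
    query = compose_query(image_emb, text_emb)
    ranked = sorted(
        gallery,
        key=lambda name: cosine(query, gallery[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed): axis 0 = "red dress", axis 1 = "long sleeves",
# axis 2 = "jeans".
reference = [1.0, 0.0, 0.0]      # reference image: a red dress
modification = [0.0, 1.0, 0.0]   # modification text: "make the sleeves long"
gallery = {
    "red_dress_short_sleeves": [1.0, 0.0, 0.0],
    "red_dress_long_sleeves": [1.0, 1.0, 0.0],
    "blue_jeans": [0.0, 0.0, 1.0],
}

top = retrieve(reference, modification, gallery, k=2)
print(top)
```

In this toy gallery the image that keeps the reference content and also applies the requested change ranks first. A multi-modification query would bundle several such changes into one text, which is the case the abstract says simple modification texts fail to cover.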
