
ROSE: Retrieval-Oriented Segmentation Enhancement

2604.14147

Song Tang, Guangquan Jie, Henghui Ding, Yu-Gang Jiang

cs.CV

TLDR

ROSE is a plug-and-play framework that enhances MLLM segmentation by integrating real-time web retrieval to handle novel and emerging entities.

Key contributions

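  • Introduces NEST (Novel Emerging Segmentation Task), which targets novel entities absent from an MLLM's training data and emerging entities that require up-to-date external information to recognize.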
  • Builds a NEST benchmark from news-related samples generated by an automated pipeline.
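  • Proposes ROSE, a plug-and-play framework for any MLLM-based segmentation model, built from four components: an Internet Retrieval-Augmented Generation module, a Textual Prompt Enhancer, a Visual Prompt Enhancer, and a WebSense module that decides when retrieval is needed.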
  • On the NEST benchmark, ROSE outperforms a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
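
For readers unfamiliar with the metric, here is a minimal sketch of gIoU. It assumes the convention common in LISA-style reasoning segmentation work, where gIoU is the mean of per-image IoUs and cIoU is the cumulative intersection over the cumulative union. This summary does not spell out the paper's exact evaluation code, so the function names and the empty-mask convention below are illustrative assumptions.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (1 = foreground). Illustrative type only.
Mask = Sequence[Sequence[int]]


def _intersection_and_union(pred: Mask, target: Mask) -> tuple:
    """Count intersecting and united foreground pixels of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoUs (assumed LISA-style definition)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    total = 0.0
    for pred, target in zip(preds, targets):
        inter, union = _intersection_and_union(pred, target)
        # Assumption: two empty masks count as a perfect match.
        total += 1.0 if union == 0 else inter / union
    return total / len(preds)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union, shown for contrast."""
    inter_sum = 0
    union_sum = 0
    for pred, target in zip(preds, targets):
        inter, union = _intersection_and_union(pred, target)
        inter_sum += inter
        union_sum += union
    return 1.0 if union_sum == 0 else inter_sum / union_sum
```

Because gIoU weights every image equally, small objects count as much as large ones, whereas cIoU is dominated by large masks. The reported 19.2 gap is presumably on a 0 to 100 scale, that is, these values multiplied by 100.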

Why it matters

MLLM-based segmentation models struggle with novel and emerging entities because they cannot incorporate up-to-date knowledge. This paper introduces the NEST task and ROSE, a retrieval-augmented framework that supplies the model with real-time web text and images. On the NEST benchmark, ROSE beats a Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU, which makes MLLM segmentation more practical for fast-changing content such as news.

Original Abstract

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model's knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model's perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs' lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
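
The control flow described in the abstract can be sketched roughly as follows. This is not the authors' implementation: every name here (`Query`, `Retrieved`, `EnhancedPrompt`, `rose_style_segment`, and the injected `needs_web`, `retrieve`, and `segment` callables) is hypothetical, and the prompt template is an assumption. The sketch only shows how a WebSense-style gate, a retrieval step, and the two prompt enhancers could wrap an unmodified segmentation model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Query:
    image: str  # identifier or path of the user-provided image
    text: str   # user instruction, e.g. "segment the winner of the race"


@dataclass
class Retrieved:
    passages: List[str] = field(default_factory=list)    # web text snippets
    image_refs: List[str] = field(default_factory=list)  # web image identifiers


@dataclass
class EnhancedPrompt:
    image: str
    text: str
    reference_images: List[str] = field(default_factory=list)


def enhance_text(instruction: str, passages: List[str]) -> str:
    """Textual enhancement: prepend retrieved background to the instruction."""
    if not passages:
        return instruction
    context = "\n".join(f"- {p}" for p in passages)
    return f"Background:\n{context}\n\nInstruction: {instruction}"


def rose_style_segment(
    query: Query,
    needs_web: Callable[[Query], bool],     # stand-in for the WebSense decision
    retrieve: Callable[[Query], Retrieved], # stand-in for internet retrieval
    segment: Callable[[EnhancedPrompt], Any],  # the wrapped segmentation model
) -> Any:
    """Run the base model directly, or with retrieved context when needed."""
    if not needs_web(query):
        # Skip retrieval entirely to keep the common case cheap.
        return segment(EnhancedPrompt(query.image, query.text))
    web = retrieve(query)
    text = enhance_text(query.text, web.passages)
    # Visual enhancement: pass web images along as references for novel entities.
    return segment(EnhancedPrompt(query.image, text, list(web.image_refs)))
```

Passing the three stages in as callables mirrors the plug-and-play claim: the base segmentation model is treated as a black box, and only its prompt changes.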
