ArXiv TLDR

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

arXiv: 2604.20146

Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong + 4 more

cs.IR, cs.CL

TLDR

SAKE improves GMNER by adaptively combining an MLLM's internal knowledge with external search, using self-aware reasoning to decide when retrieval is needed and avoid noisy evidence.

Key contributions

  • Harmonizes internal knowledge exploitation and external exploration for GMNER in open-world settings.
  • Quantifies model uncertainty via Difficulty-aware Search Tag Generation to signal knowledge gaps.
  • Creates SAKE-SeCoT, a CoT dataset for supervised fine-tuning of self-awareness and tool-use.
  • Utilizes agentic reinforcement learning with hybrid rewards for adaptive, self-aware retrieval decisions.
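The Difficulty-aware Search Tag Generation step quantifies entity-level uncertainty through multiple forward samplings and turns low confidence into an explicit knowledge-gap signal. A minimal sketch of that idea, assuming agreement rate among sampled predictions as the confidence proxy and a hypothetical `<search>` tag format (function name, threshold, and tag strings are illustrative, not from the paper):

```python
from collections import Counter

def uncertainty_search_tag(samples, threshold=0.6):
    """Sketch of difficulty-aware search tagging: given K sampled
    predictions for one entity, treat the majority answer's frequency
    as a confidence proxy; emit an explicit <search> tag when
    confidence falls below the threshold (a knowledge-gap signal)."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    confidence = freq / len(samples)
    tag = "<no_search>" if confidence >= threshold else "<search>"
    return answer, confidence, tag

# High agreement across samples -> no retrieval needed
print(uncertainty_search_tag(["Paris Saint-Germain"] * 5))
# Scattered answers -> flag a knowledge gap
print(uncertainty_search_tag(["PSG", "Paris FC", "PSG", "Lyon", "Metz"]))
```

In the paper these signals are then used to build the SAKE-SeCoT Chain-of-Thought dataset, so the fine-tuned model learns when a reasoning trace should include a tool call.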

Why it matters

This paper introduces SAKE, a novel approach for Grounded Multimodal Named Entity Recognition in dynamic, open-world environments. By intelligently combining internal model knowledge with external search, it overcomes limitations of prior methods, reducing hallucinations and noise. This advancement is crucial for robust entity extraction in social media and similar applications.

Original Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
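The second training stage uses agentic reinforcement learning with a hybrid reward that penalizes unnecessary retrieval. A minimal sketch of how such a reward could be shaped, assuming an outcome term for entity correctness, a grounding term based on an IoU threshold, and a fixed penalty when search was invoked but not needed (all names, weights, and the 0.5 IoU threshold are illustrative assumptions, not the paper's exact formulation):

```python
def hybrid_reward(entity_correct, grounding_iou, used_search, search_needed,
                  iou_thresh=0.5, search_penalty=0.2):
    """Sketch of a hybrid reward: rewards correct entity extraction and
    visual grounding, and subtracts a penalty when the model invoked
    retrieval for an entity it could have answered from internal
    knowledge (discouraging unnecessary search)."""
    r = 0.0
    if entity_correct:
        r += 1.0                      # outcome reward: entity span/type correct
    if grounding_iou >= iou_thresh:
        r += 1.0                      # grounding reward: box overlaps gold region
    if used_search and not search_needed:
        r -= search_penalty           # penalize redundant tool invocation
    return r

# Correct answer without retrieval scores higher than the same
# answer obtained via an unnecessary search call.
print(hybrid_reward(True, 0.7, used_search=False, search_needed=False))
print(hybrid_reward(True, 0.7, used_search=True, search_needed=False))
```

Under this shaping, the policy is pushed from rigid search imitation (learned in the SFT stage) toward invoking the tool only when its internal knowledge is insufficient.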
