ArXiv TLDR

Personal Visual Context Learning in Large Multimodal Models

arXiv: 2605.10936

Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

cs.CV

TLDR

This paper formalizes Personal Visual Context Learning (Personal VCL) for LMMs, presents the Personal-VCL-Bench benchmark, and proposes the Agentic Context Bank to enable personalized visual reasoning.

Key contributions

  • Formalizes Personal Visual Context Learning (Personal VCL) for LMMs on wearable devices: the prompt-time use of user-specific visual context (see the prompt sketch after this list).
  • Introduces Personal-VCL-Bench, a benchmark for personalized visual reasoning spanning persons, objects, and behaviors.
  • Identifies a "context utilization gap": frontier LMMs struggle both to leverage visual evidence and to aggregate multiple visual observations.
  • Proposes the Agentic Context Bank, an inference-time baseline that organizes a user's visual context into a structured, self-refining memory with query-adaptive evidence selection.
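
To make the task concrete, below is a minimal sketch of the Personal VCL prompt shape: user-specific visual context (first-person frames with short descriptions) supplied at prompt time alongside a personalized query. This is an illustration only; the schema and names (`Observation`, `PersonalContext`, `build_prompt`) are hypothetical rather than from the paper, and the message format assumes a generic chat-style LMM API.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One first-person frame plus a short description (hypothetical schema)."""
    image_path: str
    caption: str

@dataclass
class PersonalContext:
    """The user-specific visual context supplied at prompt time."""
    observations: list[Observation] = field(default_factory=list)

def build_prompt(ctx: PersonalContext, query: str) -> list[dict]:
    """Assemble a chat-style message: context images first, then the query."""
    content = []
    for obs in ctx.observations:
        content.append({"type": "image", "path": obs.image_path})
        content.append({"type": "text", "text": obs.caption})
    content.append({"type": "text", "text": query})
    return [{"role": "user", "content": content}]

# A personalized query that is only resolvable from the wearer's own context.
ctx = PersonalContext([Observation("frame_0412.jpg",
                                   "My blue water bottle on the kitchen counter.")])
messages = build_prompt(ctx, "Where did I last leave my water bottle?")
```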

Why it matters

This paper tackles visual personalization for LMMs on wearable devices, a prerequisite for true personal assistants. By formalizing Personal VCL, introducing a dedicated benchmark, and proposing the Agentic Context Bank, it charts a practical path toward personalized LMMs.

Original Abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
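
The abstract does not detail how the Agentic Context Bank is implemented, so the sketch below is only a rough illustration of the stated idea: observations are distilled into structured entries that are refined as new evidence arrives, and evidence is selected per query. Naive de-duplicating merges stand in for self-refinement, and word overlap stands in for whatever query-adaptive selector the paper actually uses; the class and method names (`AgenticContextBank`, `ingest`, `select_evidence`) are hypothetical.

```python
import re
from dataclasses import dataclass, field

def _words(text: str) -> set:
    """Lowercase word tokens; a crude stand-in for real retrieval features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class BankEntry:
    """One structured memory item distilled from the user's visual stream."""
    topic: str                              # e.g. "my water bottle"
    facts: list[str] = field(default_factory=list)

class AgenticContextBank:
    """Toy bank: entries are refined as observations arrive, and a
    query-adaptive selector picks which evidence enters the LMM prompt."""

    def __init__(self) -> None:
        self.entries: dict[str, BankEntry] = {}

    def ingest(self, topic: str, fact: str) -> None:
        # Self-refinement stand-in: fold the new observation into an existing
        # entry rather than appending raw frames; drop exact duplicates.
        entry = self.entries.setdefault(topic, BankEntry(topic))
        if fact not in entry.facts:
            entry.facts.append(fact)

    def select_evidence(self, query: str, k: int = 3) -> list:
        # Query-adaptive selection stand-in: rank entries by word overlap with
        # the query; a real system might use a learned retriever or the LMM itself.
        q = _words(query)
        def score(e: BankEntry) -> int:
            return len(q & _words(e.topic + " " + " ".join(e.facts)))
        return sorted(self.entries.values(), key=score, reverse=True)[:k]

# Build a tiny bank and select evidence for a personalized query.
bank = AgenticContextBank()
bank.ingest("my water bottle", "blue bottle on the kitchen counter this morning")
bank.ingest("my water bottle", "packed into the gym bag on Tuesday")
bank.ingest("my front door", "keypad lock, code changed last week")
print(bank.select_evidence("Where did I last leave my water bottle?", k=1))
```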
