ArXiv TLDR

EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

arXiv: 2604.22595

Hyo Jin Jon, Longbin Jin, Eun Yi Kim

cs.CV

TLDR

EV-CLIP efficiently adapts CLIP for few-shot video action recognition, using mask prompts and context prompts to handle visual challenges such as low-light scenes and egocentric viewpoints.

Key contributions

  • Addresses spatial perception limitations of CLIP in action recognition, especially under visual challenges.
  • Introduces EV-CLIP, an efficient adaptation framework for few-shot video action recognition.
  • Uses mask prompts to guide attention to action-relevant regions and context prompts for lightweight temporal modeling (see the sketch after this list).
  • Outperforms existing parameter-efficient methods and maintains efficiency regardless of backbone scale.
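As a rough illustration of the mask-prompt idea, the sketch below applies a learnable per-pixel weight map to input frames before a frozen CLIP image encoder. This is a minimal sketch under assumptions: the class name MaskPromptSketch, the mask_logits parameter, and the shapes are illustrative choices, not the authors' implementation (which is in the linked GitHub repo).

```python
import torch
import torch.nn as nn

class MaskPromptSketch(nn.Module):
    """Illustrative sketch (not the EV-CLIP code): a learnable per-pixel
    weight map that reweights input frames before a frozen CLIP encoder,
    nudging attention toward action-relevant regions."""

    def __init__(self, image_size: int = 224):
        super().__init__()
        # One learnable logit per pixel, shared across frames and channels.
        self.mask_logits = nn.Parameter(torch.zeros(1, 1, image_size, image_size))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch * num_frames, 3, H, W)
        weights = torch.sigmoid(self.mask_logits)  # values in (0, 1)
        return frames * weights                    # pixel-wise reweighting
```

In such a setup only the prompt parameters would be trained while the CLIP backbone stays frozen, which is what would keep the adaptation parameter-efficient and its cost roughly independent of backbone scale.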

Why it matters

CLIP often struggles with spatial understanding in action recognition under real-world visual challenges such as low light or egocentric viewpoints. EV-CLIP offers a parameter-efficient adaptation that addresses this weakness while keeping its overhead independent of backbone scale, making CLIP more practical for diverse, resource-constrained video applications.

Original Abstract

CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.
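The abstract describes context prompts only as compressing frame-wise features into a compact representation. One plausible, purely illustrative reading is a small learnable query that cross-attends over per-frame CLIP embeddings, sketched below; the class name ContextPromptSketch, the single-query design, and all shapes are assumptions rather than the paper's actual mechanism.

```python
import torch
import torch.nn as nn

class ContextPromptSketch(nn.Module):
    """Illustrative sketch (not the EV-CLIP code): a single learnable query
    cross-attends over frame-wise CLIP features and compresses them into
    one compact video-level representation."""

    def __init__(self, feat_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.context_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim) per-frame CLIP embeddings
        query = self.context_query.expand(frame_feats.size(0), -1, -1)
        video_feat, _ = self.attn(query, frame_feats, frame_feats)
        return video_feat.squeeze(1)  # (batch, feat_dim) compact video feature
```

A pooling step of this kind adds only a handful of parameters on top of the frozen encoder, which matches the paper's emphasis on lightweight temporal modeling.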
