
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

arXiv:2605.02834

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi + 4 more

cs.CV, cs.LG

TLDR

VideoNet is a new large-scale dataset and benchmark for domain-specific action recognition; it reveals that modern VLMs struggle with such actions and contributes a large-scale training set that lets a fine-tuned 4B model surpass open-weight 8B models.

Key contributions

  • Introduces VideoNet, a new benchmark with 1,000 domain-specific actions across 37 domains.
  • Shows modern VLMs struggle significantly on domain-specific action recognition, even with few-shot examples.
  • Highlights that VLMs fail to exploit in-context examples as effectively as non-expert humans do.
  • Provides the first large-scale training dataset for domain-specific actions (nearly 500k video question-answer pairs) and fine-tunes Molmo2-4B to surpass all open-weight 8B models on the benchmark.

Why it matters

Because sufficiently diverse and challenging data has been lacking, modern VLMs are no longer rigorously evaluated on action recognition. VideoNet fills this gap, exposing current VLM limitations and providing a crucial dataset for future model development.

Original Abstract

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
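To make the evaluation ladder in the abstract concrete, here is a minimal sketch of the binary and k-shot in-context settings. The query_vlm stub, prompt wording, and data layout are illustrative assumptions, not the paper's actual harness.

    import random

    # Hypothetical stand-in for a real VLM call; the paper's actual
    # prompting and decoding setup is not specified in this summary,
    # so this placeholder just returns a random answer.
    def query_vlm(video_path: str, prompt: str) -> str:
        return random.choice(["yes", "no"])

    def binary_prompt(action: str) -> str:
        # Relaxed binary setting from the abstract: random chance is 50%.
        return f"Does this video show the action '{action}'? Answer yes or no."

    def few_shot_prompt(action: str, example_videos: list[str]) -> str:
        # k in {1, 2, 3} in-context example clips of the target action.
        shots = "\n".join(f"[example clip of '{action}': {v}]" for v in example_videos)
        return shots + "\n" + binary_prompt(action)

    def accuracy(dataset: list[tuple[str, str, str]],
                 k_examples: dict[str, list[str]]) -> float:
        # dataset: (video_path, action, gold_answer) triples,
        # with gold_answer in {"yes", "no"}.
        correct = 0
        for video, action, gold in dataset:
            prompt = few_shot_prompt(action, k_examples.get(action, []))
            correct += query_vlm(video, prompt) == gold
        return correct / len(dataset)

Under this framing, the abstract's few-shot result is simply the accuracy with k_examples populated versus left empty: Qwen gains +7.0% while Gemini loses 4.8%, both short of the +13.6% gain non-expert humans show from the same examples.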
