MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
Xiran Zhao, Jing Jin, Yan Bai, Zhongan Wang, Yifeng Sun, et al.
TLDR
Introduces MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection, and VISTA, a post-trained model that surpasses GPT-5.4 on the new benchmark.
Key contributions
- MMVIAD: the first continuous multi-view video dataset for industrial anomaly detection and understanding.
- The dataset spans 48 object categories, 14 environments, and 6 structural anomaly types, and supports four tasks: anomaly detection, defect classification, object classification, and anomaly visible-time localization (see the schema sketch after this list).
- Proposes VISTA, a model produced by a two-stage post-training pipeline (PS-SFT followed by VISTA-GRPO) for transferable anomaly understanding.
- On MMVIAD-Unseen, VISTA raises the base model's four-task average score from 45.0 to 57.5, surpassing GPT-5.4.
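To make the annotation structure concrete, below is a minimal sketch of what a single MMVIAD sample record could look like. The field names, types, and layout are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MMVIADSample:
    """Hypothetical record for one 2-second inspection clip
    (~120 degrees of camera motion around the object)."""
    clip_path: str                        # path to the video clip
    object_category: str                  # one of 48 object categories
    environment: str                      # one of 14 capture environments
    is_anomalous: bool                    # anomaly detection label
    anomaly_type: Optional[str] = None    # one of 6 structural anomaly types
    visible_span: Optional[Tuple[float, float]] = None  # (start_s, end_s) defect visibility

# The four benchmark tasks map onto these fields:
#   anomaly detection         -> is_anomalous
#   defect classification     -> anomaly_type
#   object classification     -> object_category
#   visible-time localization -> visible_span
```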
Why it matters
This paper addresses the need for realistic multi-view video datasets in industrial anomaly detection, which is vital to manufacturing quality control. It contributes MMVIAD, a novel dataset and benchmark, and VISTA, a model that significantly advances the state of the art. Together they offer crucial resources and a strong baseline for future research on robust industrial inspection.
Original Abstract
Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
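The abstract names VISTA-GRPO's two reward signals (a semantic-gated defect reward and a visibility-aware temporal reward) but not their formulas. The sketch below is one plausible instantiation, assuming exact-match gating, interval IoU, and equal weights; none of these specifics are confirmed by the paper.

```python
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def vista_grpo_reward(pred_defect: str, gt_defect: str,
                      pred_span: Tuple[float, float],
                      gt_span: Tuple[float, float],
                      w_defect: float = 0.5, w_time: float = 0.5) -> float:
    """Hypothetical combined reward; the paper's actual matching rule,
    gating, and weights are not specified in the abstract."""
    # Semantic-gated defect reward: credit only for a semantically correct
    # defect label (exact string match stands in for the paper's matcher).
    defect_r = 1.0 if pred_defect == gt_defect else 0.0
    # Visibility-aware temporal reward: overlap with the interval in which
    # the anomaly is visible, gated on a correct defect label (an
    # assumption of this sketch).
    time_r = temporal_iou(pred_span, gt_span) if defect_r > 0 else 0.0
    return w_defect * defect_r + w_time * time_r
```

In GRPO-style training, such per-response rewards would then be normalized within each group of sampled responses to form relative advantages before the policy update.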