ArXiv TLDR

SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere

arXiv:2605.08003

Chao Huang, Penfei Wei, Wei Wang, Jie Wen, Zhihua Wang + 3 more

cs.CV

TLDR

SphereVAD offers training-free video anomaly detection by leveraging pre-trained MLLM features and geometric inference on a unit hypersphere.

Key contributions

  • Proposes SphereVAD, a training-free, zero-shot VAD framework using geometric inference on a unit hypersphere.
  • Exploits the geometric discriminability latent in intermediate-layer features of pre-trained MLLMs, rather than relying on the language output pathway.
  • Combines three steps: Fréchet mean centering, Holistic Scene Attention (HSA), and vMF-guided Spherical Geodesic Pulling (SGP).
  • Reports new state-of-the-art results among training-free approaches on three major benchmarks, while remaining competitive with fully supervised baselines.
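
The geometric building blocks named above can be illustrated with a small sketch. This is not the authors' code (the abstract says it has not been released yet); it only shows, under stated assumptions, what a Fréchet mean and a geodesic "pull" toward a prototype look like on the unit hypersphere. The function names, the `strength` parameter, and the iterative log/exp-map scheme for the mean are assumptions made for illustration, and the sketch assumes no two input points are antipodal.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]


def log_map(base, x):
    """Tangent vector at `base` pointing toward `x`; its length is the geodesic distance."""
    cos_theta = max(-1.0, min(1.0, dot(base, x)))
    theta = math.acos(cos_theta)
    if theta < 1e-12:
        return [0.0] * len(base)
    # Component of x orthogonal to base has norm sin(theta); rescale it to length theta.
    residual = [xi - cos_theta * bi for xi, bi in zip(x, base)]
    scale = theta / math.sin(theta)
    return [scale * r for r in residual]


def exp_map(base, tangent):
    """Walk from `base` along the great circle given by `tangent`."""
    length = math.sqrt(dot(tangent, tangent))
    if length < 1e-12:
        return list(base)
    moved = [math.cos(length) * b + math.sin(length) * t / length
             for b, t in zip(base, tangent)]
    return normalize(moved)


def frechet_mean(points, iterations=50, tol=1e-10):
    """Approximate the spherical Frechet mean by repeated tangent-space averaging."""
    # Start from the normalized Euclidean mean (fails if the points sum to zero).
    mean = normalize([sum(c) for c in zip(*points)])
    for _ in range(iterations):
        logs = [log_map(mean, p) for p in points]
        step = [sum(c) / len(points) for c in zip(*logs)]
        if math.sqrt(dot(step, step)) < tol:
            break
        mean = exp_map(mean, step)
    return mean


def geodesic_pull(x, prototype, strength):
    """Move `x` a fraction `strength` (0..1) of the way along the geodesic to `prototype`."""
    return exp_map(x, [strength * t for t in log_map(x, prototype)])
```

With `strength=0` a feature stays where it is and with `strength=1` it lands on the prototype; how SphereVAD actually chooses the pulling amount for ambiguous segments is not specified in the abstract.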

Why it matters

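SphereVAD targets the deployment bottleneck in VAD: existing methods need large-scale annotations or task-specific training before they can be used on a new scene. By reusing the geometric structure already present in pre-trained MLLM features, it can be applied to novel scenes without costly annotations or retraining. According to the abstract, only a small set of synthetic images is needed for calibration.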

Original Abstract

Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Frechet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
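
To make the "vMF likelihood-ratio" idea in the abstract concrete, here is a minimal sketch of a two-class von Mises-Fisher log-likelihood ratio. It is not the paper's implementation. It assumes one "anomaly" and one "normal" directional prototype that share the same concentration `kappa`, in which case the vMF normalizing constants cancel and the ratio reduces to `kappa * (mu_anomaly - mu_normal) . x`. The prototype vectors, `kappa`, and `prior_anomaly` are hypothetical inputs chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vmf_log_likelihood_ratio(x, mu_anomaly, mu_normal, kappa):
    """log p(x | anomaly) - log p(x | normal) for two vMF densities with equal kappa.

    All vectors are assumed to be unit-norm. With a shared concentration the
    normalizing constants cancel, leaving a difference of cosine similarities.
    """
    return kappa * (dot(mu_anomaly, x) - dot(mu_normal, x))


def anomaly_probability(x, mu_anomaly, mu_normal, kappa, prior_anomaly=0.5):
    """Posterior probability of the anomaly class via Bayes' rule (assumed prior)."""
    logit = vmf_log_likelihood_ratio(x, mu_anomaly, mu_normal, kappa)
    logit += math.log(prior_anomaly / (1.0 - prior_anomaly))
    # Numerically stable sigmoid.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

A feature lying exactly on the anomaly prototype gets a positive ratio, one on the normal prototype gets a negative ratio, and one equidistant from both gets zero. Larger `kappa` makes the decision sharper around that boundary.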
