Training-Free Semantic Multi-Object Tracking with Vision-Language Models
Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero
TLDR
TF-SMOT is a training-free Semantic Multi-Object Tracking (SMOT) pipeline that composes pre-trained detection, segmentation, and vision-language models to produce trajectories together with video summaries, instance captions, and interaction labels.
Key contributions
- Proposes TF-SMOT, a novel training-free pipeline for Semantic Multi-Object Tracking (SMOT).
- Composes pre-trained models (D-FINE, SAM2, InternVideo2.5) for robust detection, tracking, and video-language generation.
- Achieves state-of-the-art tracking performance and improves video summary/caption quality on BenSMOT.
- Utilizes contour grounding and LLM disambiguation for semantic video summaries and interaction labels.
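At its core, the training-free design is plain function composition: a detector proposes first-frame boxes, a promptable segmentation tracker propagates masks through the video, and a video-language model captions each resulting tracklet. The sketch below shows that control flow with injected callables; the names `detect`, `track`, and `caption` are placeholders for D-FINE, SAM2, and InternVideo2.5, whose real interfaces differ and are not reproduced here.

```python
def run_pipeline(frames, detect, track, caption):
    """Training-free SMOT as composition of pre-trained components.

    detect(frame)          -> list of boxes        (stand-in for D-FINE)
    track(frames, boxes)   -> {track_id: masks}    (stand-in for SAM2)
    caption(frames, masks) -> instance caption     (stand-in for InternVideo2.5)
    All three signatures are illustrative assumptions, not the models' APIs.
    """
    # 1) Detect objects in the first frame to obtain tracking prompts.
    boxes = detect(frames[0])
    # 2) Propagate instance masks through the whole video.
    tracklets = track(frames, boxes)
    # 3) Generate a caption per temporally consistent tracklet.
    return {tid: caption(frames, masks) for tid, masks in tracklets.items()}
```

Because each stage is swappable, upgrading to a newer foundation model only means replacing one callable, with no retraining.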
Why it matters
Existing SMOT systems are trained end-to-end, which makes them expensive to build and slow to adapt. TF-SMOT's training-free approach offers a flexible, efficient way to plug in new foundation models as they appear, advancing semantic understanding of dynamic scenes through human-interpretable descriptions.
Original Abstract
Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
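The abstract's last step, aligning free-form interaction predicates to BenSMOT's WordNet label space, can be illustrated by scoring a predicate phrase against synset glosses (definitions) and keeping the best match. This is a minimal sketch: a tiny hard-coded gloss index stands in for the real WordNet lexicon, and Jaccard token overlap stands in for the paper's semantic retrieval plus LLM disambiguation.

```python
def best_synset(predicate_phrase, gloss_index):
    """Map a free-form interaction predicate to the synset whose gloss
    overlaps it most (toy stand-in for gloss-based semantic retrieval).

    gloss_index: {synset_name: gloss_text}, a hypothetical miniature
    of the WordNet verb lexicon used by BenSMOT.
    """
    query = set(predicate_phrase.lower().split())

    def score(gloss):
        tokens = set(gloss.lower().split())
        union = query | tokens
        # Jaccard similarity between predicate tokens and gloss tokens.
        return len(query & tokens) / len(union) if union else 0.0

    return max(gloss_index, key=lambda name: score(gloss_index[name]))
```

The exact-match difficulty the abstract reports is visible even in this toy setting: near-synonymous glosses produce close scores, so a coarse scorer (or a fine-grained, long-tailed label space) easily flips the chosen synset.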