ArXiv TLDR

UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos

2604.15225

Joel Perca, Luis Sante, Juanpablo Heredia, Joao Rulff, Claudio Silva + 1 more

cs.HC

TLDR

URBANCLIPATLAS is a visual analytics system that uses retrieval-augmented generation (RAG) and a vision-language model (VLM) to efficiently retrieve and interpret events in long urban videos.

Key contributions

  • Introduces URBANCLIPATLAS, a visual analytics system for urban video event and scene retrieval.
  • Combines RAG, taxonomy-aware entity extraction, and video grounding for robust analysis.
  • Segments videos into short clips, generates VLM descriptions, and indexes them for semantic search and interpretation.
  • Aligns textual reasoning with visual evidence to reduce validation effort and refine hypotheses.

Why it matters

This paper addresses the challenge of analyzing vast urban video collections by automating event retrieval and interpretation. By integrating AI with visual analytics, it substantially reduces the manual effort of sifting through raw footage, supporting insights into urban safety and mobility.

Original Abstract

Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.
