ArXiv TLDR

Static and Dynamic Graph Alignment Network for Temporal Video Grounding

arXiv: 2605.00684

Zhanjie Hu, Bolin Zhang, Jianhua Wang, Jianbo Zheng, Chenchen Yan + 3 more

cs.CV

TLDR

SDGAN enhances Temporal Video Grounding by aligning static/dynamic features, using query-aware graphs, and employing multi-granularity training.

Key contributions

  • Aligns static and dynamic visual features using Position-wise Nodes Alignment for robust visual representation (see the sketch after this list).
  • Employs Query-Clip Contrastive Learning and Adaptive Graph Modeling for query-aware visual representations.
  • Integrates multi-granularity temporal proposals with a Progressive Easy-to-Hard Training Strategy.
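
The first contribution can be pictured with a small code sketch. This is a hedged illustration rather than the authors' implementation: the projection heads, feature dimensions, cosine-based alignment loss, and mean fusion below are assumptions about how position-wise alignment between static and dynamic clip graphs might be realized.

```python
# Minimal sketch of position-wise node alignment between static and dynamic
# clip graphs. NOT SDGAN's released code; projection heads, the cosine-based
# alignment loss, and mean fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionWiseAlignment(nn.Module):
    def __init__(self, static_dim=2048, dynamic_dim=1024, hidden_dim=512):
        super().__init__()
        # Project both modalities into a shared node space (assumed design).
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)

    def forward(self, static_feats, dynamic_feats):
        # static_feats:  (B, T, static_dim)  e.g. appearance features per clip
        # dynamic_feats: (B, T, dynamic_dim) e.g. motion features per clip
        s = F.normalize(self.static_proj(static_feats), dim=-1)
        d = F.normalize(self.dynamic_proj(dynamic_feats), dim=-1)
        # Position-wise alignment: node t of the static graph is pulled toward
        # node t of the dynamic graph via cosine similarity.
        align_loss = 1.0 - (s * d).sum(dim=-1).mean()
        # Fused node representation for downstream graph reasoning (assumed: mean).
        fused = 0.5 * (s + d)
        return fused, align_loss

if __name__ == "__main__":
    B, T = 2, 16
    module = PositionWiseAlignment()
    fused, loss = module(torch.randn(B, T, 2048), torch.randn(B, T, 1024))
    print(fused.shape, loss.item())  # torch.Size([2, 16, 512])
```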

Why it matters

Existing GCN-based TVG methods suffer from incomplete visual representations (static or dynamic features alone), query-agnostic temporal graphs, and single-granularity semantic matching. SDGAN addresses all three by fusing static and dynamic features, building query-aware graphs, and training progressively over multi-granularity proposals, which substantially improves temporal grounding precision.

Original Abstract

Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCN) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) Most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics, 2) Most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation, and 3) Most methods often suffer from single-granularity semantic matching, while direct training on the complex temporal localization task may lead to slow convergence and suboptimal precision. To address these challenges, we propose the Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representation. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within a Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Code and datasets are available at https://github.com/ZhanJieHu/SDGAN.
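
The abstract's second component, Query-Clip Contrastive Learning with Adaptive Graph Modeling, can be sketched as follows. This is only an illustrative reading: the InfoNCE-style loss, the sigmoid relevance gate, and the row-normalized query-aware adjacency are assumptions, not SDGAN's published formulation.

```python
# Illustrative sketch of query-clip contrastive learning and a query-aware
# ("adaptive") adjacency matrix. Temperature, similarity kernel, and masking
# are assumptions; SDGAN's exact losses and graph construction may differ.
import torch
import torch.nn.functional as F

def query_clip_contrastive(clip_feats, query_feat, inside_mask, tau=0.07):
    # clip_feats:  (T, D) clip embeddings
    # query_feat:  (D,)   sentence-level query embedding
    # inside_mask: (T,)   1 for clips inside the ground-truth moment, else 0
    c = F.normalize(clip_feats, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    logits = c @ q / tau                                # (T,)
    log_prob = logits - torch.logsumexp(logits, dim=0)  # softmax over clips
    # Pull ground-truth clips toward the query, push the rest away (InfoNCE-style).
    return -(log_prob * inside_mask).sum() / inside_mask.sum().clamp(min=1)

def query_aware_adjacency(clip_feats, query_feat):
    # Clip-clip similarity reweighted by each clip's relevance to the query,
    # so message passing focuses on query-relevant temporal neighbors.
    c = F.normalize(clip_feats, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    relevance = torch.sigmoid(c @ q)          # (T,) query-conditioned gate
    adj = torch.relu(c @ c.t())               # (T, T) base temporal graph
    adj = adj * relevance.unsqueeze(0) * relevance.unsqueeze(1)
    return F.softmax(adj, dim=-1)             # row-normalized for GCN use

if __name__ == "__main__":
    T, D = 16, 512
    clips, query = torch.randn(T, D), torch.randn(D)
    mask = torch.zeros(T); mask[5:9] = 1.0
    print(query_clip_contrastive(clips, query, mask).item())
    print(query_aware_adjacency(clips, query).shape)  # torch.Size([16, 16])
```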
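
For the third component, a minimal sketch of an easy-to-hard curriculum over multi-granularity proposals is shown below. The granularity sizes, stage boundaries, and sliding-window proposal generation are illustrative assumptions rather than the paper's exact training recipe.

```python
# Hedged sketch of a progressive easy-to-hard curriculum over multi-granularity
# temporal proposals. Schedule, proposal lengths, and strides are assumptions.
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    epochs: range
    granularities: tuple  # proposal lengths (in clips) active at this stage

# Easy stage: only coarse proposals (rough semantic localization).
# Later stages: finer proposals added for precise boundary refinement.
SCHEDULE = [
    CurriculumStage(range(0, 10), (16, 8)),        # coarse only
    CurriculumStage(range(10, 20), (16, 8, 4)),    # add medium
    CurriculumStage(range(20, 40), (16, 8, 4, 2)), # full multi-granularity
]

def active_granularities(epoch):
    for stage in SCHEDULE:
        if epoch in stage.epochs:
            return stage.granularities
    return SCHEDULE[-1].granularities

def enumerate_proposals(num_clips, granularities):
    # Sliding-window proposals: (start_clip, end_clip) at each active length.
    proposals = []
    for length in granularities:
        for start in range(0, num_clips - length + 1, max(length // 2, 1)):
            proposals.append((start, start + length))
    return proposals

if __name__ == "__main__":
    for epoch in (0, 12, 30):
        props = enumerate_proposals(32, active_granularities(epoch))
        print(f"epoch {epoch}: {len(props)} proposals at {active_granularities(epoch)}")
```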
