ArXiv TLDR

MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

arXiv: 2604.19631

Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He

cs.CV

TLDR

MoSA improves Dynamic Scene Graph Generation by using motion-guided semantic alignment to better model fine-grained and tail relationships in videos.

Key contributions

  • Motion Feature Extractor (MFE) encodes object-pair motion attributes like distance, velocity, and persistence.
  • Motion-guided Interaction Module (MIM) fuses motion with spatial features for motion-aware relationship representations.
  • Action Semantic Matching (ASM) aligns visual features with text embeddings to enhance semantic discrimination.
  • A category-weighted loss strategy reweights training to emphasize rare tail relationships.
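The paper does not include code for the Motion Feature Extractor, but the listed attributes suggest simple pairwise statistics over tracked boxes. Here is a minimal sketch of how such attributes might be computed from per-frame bounding boxes for one subject-object pair; the function names, box format `(x1, y1, x2, y2)`, and the thresholding choices are all assumptions, not the authors' implementation:

```python
import math

def center(box):
    # box = (x1, y1, x2, y2); return its center point
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def motion_attributes(subj_boxes, obj_boxes, fps=30.0):
    """Sketch of pairwise motion attributes (distance, velocity,
    persistence, directional consistency) for one subject-object
    pair tracked across T frames."""
    dists, vels = [], []
    prev = None
    for sb, ob in zip(subj_boxes, obj_boxes):
        (sx, sy), (ox, oy) = center(sb), center(ob)
        d = math.hypot(sx - ox, sy - oy)
        dists.append(d)
        if prev is not None:
            vels.append((d - prev) * fps)  # rate of change of distance
        prev = d
    # persistence: fraction of frames where the pair stays closer
    # than the mean distance (threshold choice is an assumption)
    thresh = sum(dists) / len(dists)
    persistence = sum(d <= thresh for d in dists) / len(dists)
    # directional consistency: fraction of consecutive velocity
    # samples sharing the same sign (approaching vs. receding)
    same_sign = sum((a > 0) == (b > 0) for a, b in zip(vels, vels[1:]))
    consistency = same_sign / max(len(vels) - 1, 1)
    return {
        "mean_distance": sum(dists) / len(dists),
        "mean_velocity": sum(vels) / len(vels) if vels else 0.0,
        "persistence": persistence,
        "directional_consistency": consistency,
    }
```

In MoSA these attributes would then be fused with spatial relationship features by the MIM; the fusion itself is not sketched here.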

Why it matters

This paper addresses key limitations in Dynamic Scene Graph Generation, particularly in modeling complex and rare relationships. By integrating motion features and semantic alignment, MoSA offers a more robust and accurate approach to understanding dynamic interactions in videos. This advancement is crucial for high-level video understanding tasks.

Original Abstract

Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.
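The abstract does not give the form of the category-weighted loss. A common realization of this idea is a negative log-likelihood weighted by inverse class frequency, so rare (tail) relationship categories contribute more to the gradient. The sketch below illustrates that pattern in plain Python; the smoothing exponent `beta`, the normalization, and the function name are assumptions, not the paper's definition:

```python
import math

def category_weighted_nll(log_probs, target, class_counts, beta=0.5):
    """Negative log-likelihood for one sample, weighted so that
    rarer relationship categories (smaller counts) receive larger
    weights -- a simple stand-in for a category-weighted loss."""
    # inverse-frequency weights, softened by the exponent beta
    weights = [1.0 / (c ** beta) for c in class_counts]
    mean_w = sum(weights) / len(weights)
    weights = [w / mean_w for w in weights]  # normalize around 1
    return -weights[target] * log_probs[target]
```

With counts like `[100, 1]` the tail class gets a weight roughly ten times the head class here, so a misclassified tail relationship is penalized proportionally more.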
