ArXiv TLDR

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

2604.11804

Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin + 7 more

cs.CV

TLDR

OmniShow unifies multimodal conditions (text, image, audio, pose) for high-quality Human-Object Interaction Video Generation, achieving SOTA performance.

Key contributions

  • OmniShow: an end-to-end framework for Human-Object Interaction Video Generation (HOIVG).
  • Unified Channel-wise Conditioning & Gated Local-Context Attention for precise multimodal control.
  • Decoupled-Then-Joint Training strategy to overcome data scarcity in HOIVG.
  • Establishes HOIVG-Bench, a comprehensive benchmark for evaluating HOIVG.

Why it matters

OmniShow significantly advances content creation for e-commerce, short videos, and entertainment by automating human-object interaction video generation. It unifies diverse multimodal conditions and addresses data scarcity, setting a new standard for controllable and high-quality synthetic media.

Original Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
