OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, et al.
TLDR
OmniShow unifies multimodal conditions (text, image, audio, pose) for high-quality Human-Object Interaction Video Generation, achieving SOTA performance.
Key contributions
- OmniShow: an end-to-end framework for Human-Object Interaction Video Generation (HOIVG).
- Unified Channel-wise Conditioning & Gated Local-Context Attention for precise multimodal control.
- Decoupled-Then-Joint Training strategy to overcome data scarcity in HOIVG.
- Establishes HOIVG-Bench, a comprehensive benchmark for evaluating HOIVG.
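The paper does not include code, but the idea behind Unified Channel-wise Conditioning — injecting image and pose signals by widening the latent's channel dimension rather than adding separate cross-attention streams — can be sketched in a few lines. Everything below (shapes, the 1x1 projection, variable names) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a video latent plus two spatially aligned
# condition latents (reference image, pose). Real latents would come
# from a VAE encoder; these are random stand-ins.
C, H, W = 4, 8, 8
latent = rng.standard_normal((C, H, W))
image_cond = rng.standard_normal((C, H, W))
pose_cond = rng.standard_normal((C, H, W))

# Channel-wise conditioning: concatenate conditions along the channel
# axis so the backbone sees one tensor instead of separate streams.
stacked = np.concatenate([latent, image_cond, pose_cond], axis=0)  # (3C, H, W)

# A 1x1 projection (a per-pixel linear map over channels) restores
# the backbone's expected channel width.
proj = rng.standard_normal((C, 3 * C)) / np.sqrt(3 * C)
conditioned = np.einsum("oc,chw->ohw", proj, stacked)

print(stacked.shape, conditioned.shape)  # (12, 8, 8) (4, 8, 8)
```

The appeal of this style of injection is efficiency: no extra attention layers are needed, only a wider input projection, which is why the abstract frames it as avoiding the controllability-versus-quality trade-off.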
Why it matters
OmniShow advances automated content creation for e-commerce, short-form video, and entertainment by generating human-object interaction videos end to end. By unifying diverse multimodal conditions and addressing the data scarcity that has limited this task, it sets a strong baseline for controllable, high-quality video synthesis.
Original Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
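To make the Gated Local-Context Attention idea concrete: the abstract suggests each video frame should attend only to temporally nearby audio tokens, with a gate controlling how strongly the audio signal is injected. The following is a minimal numpy sketch under those assumptions; the function name, window size, and scalar gate are all hypothetical, not the paper's actual design:

```python
import numpy as np

def gated_local_attention(video_tokens, audio_tokens, gate, window=2):
    """Each video token attends to a local window of audio tokens;
    the attended context is scaled by `gate` and added residually."""
    T, d = video_tokens.shape
    out = np.zeros_like(video_tokens)
    for t in range(T):
        # Restrict attention to audio tokens near frame t (local context).
        lo, hi = max(0, t - window), min(len(audio_tokens), t + window + 1)
        ctx = audio_tokens[lo:hi]
        # Scaled dot-product attention over the local window.
        scores = ctx @ video_tokens[t] / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        # Gated residual: gate=0 leaves the video tokens untouched.
        out[t] = video_tokens[t] + gate * (w @ ctx)
    return out

rng = np.random.default_rng(1)
video = rng.standard_normal((5, 8))
audio = rng.standard_normal((5, 8))
fused = gated_local_attention(video, audio, gate=1.0)
print(fused.shape)  # (5, 8)
```

Localizing the attention window is a plausible way to get the precise audio-visual synchronization the abstract claims, since it prevents a frame from attending to audio far in the past or future; the gate lets the model fall back to pure video generation when audio is absent.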