OneHOI: Unifying Human-Object Interaction Generation and Editing

April 15, 2026

arXiv:2604.14062

Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan

cs.CV, cs.MM

TLDR

OneHOI unifies human-object interaction (HOI) generation and editing in a single diffusion transformer. The authors report state-of-the-art results on both tasks under layout-guided, layout-free, arbitrary-mask, and mixed-condition control.

Key contributions

Introduces OneHOI, a unified diffusion transformer for HOI generation and editing.
Employs a Relational Diffusion Transformer (R-DiT) with role- and instance-aware HOI tokens, layout-based Action Grounding, Structured HOI Attention, and HOI RoPE.
Supports diverse control conditions: layout-guided, layout-free, arbitrary-mask, and mixed-condition.
Achieves state-of-the-art performance in both HOI generation and editing tasks.

Why it matters

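Existing HOI generation and editing methods form two disjoint families, each with its own limits: generation struggles with mixed conditions such as HOI plus object-only entities, while editing struggles to decouple pose from physical contact and to scale to multiple interactions. OneHOI handles both tasks with a single diffusion transformer and shared structured interaction representations, so one model can synthesise and edit scenes under several kinds of control.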

Original Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
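
The abstract does not say how the triplets are stored or how modality dropout is applied, so the following is only a minimal sketch under assumptions, not the authors' implementation. It represents a <person, action, object> triplet with optional layout boxes and randomly drops the layout during training, which is one common way a single model can learn both layout-guided and layout-free conditioning. The names `HOITriplet`, `drop_layout`, and `p_drop` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box format: (x0, y0, x1, y1), normalised to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOITriplet:
    """A <person, action, object> triplet with optional layout boxes (assumed structure)."""
    person: str
    action: str
    obj: str
    person_box: Optional[Box] = None
    object_box: Optional[Box] = None


def drop_layout(triplet: HOITriplet, p_drop: float, rng: random.Random) -> HOITriplet:
    """With probability p_drop, return a copy of the triplet without its boxes.

    This mimics modality dropout on the layout condition only; the paper's
    actual dropout scheme is not described in the abstract.
    """
    if rng.random() < p_drop:
        return HOITriplet(triplet.person, triplet.action, triplet.obj)
    return triplet


if __name__ == "__main__":
    rng = random.Random(0)
    sample = HOITriplet("person", "ride", "bicycle",
                        person_box=(0.1, 0.1, 0.5, 0.9),
                        object_box=(0.2, 0.5, 0.6, 0.95))
    # Layout is always dropped when p_drop is 1.0.
    print(drop_layout(sample, p_drop=1.0, rng=rng))
```

In this sketch a triplet whose boxes are `None` stands for the layout-free case, and a triplet with boxes stands for the layout-guided case; mask and mixed-condition inputs would need further fields that are not modelled here.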

