arXiv TLDR

Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

arXiv: 2604.12929

Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin + 2 more

cs.CV

TLDR

GraG rapidly reconstructs dynamic 3D hand-object interactions from monocular video using a compact Sum-of-Gaussians representation.

Key contributions

  • Reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work.
  • Employs a compact Sum-of-Gaussians (SoG) representation for efficient and stable tracking.
  • Initializes objects via video-adapted SAM3D, then converts to lightweight SoG for fidelity.
  • Refines hand motion with 2D joint/depth losses, avoiding complex 3D hand appearance models.
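The paper itself is not reproduced here, but the core idea of a Sum-of-Gaussians (SoG) representation can be sketched: model a shape as a weighted mixture of isotropic Gaussians, obtained by subsampling a dense Gaussian reconstruction. A minimal illustration (all function names and the random-subsampling strategy are hypothetical simplifications, not the authors' implementation):

```python
import numpy as np

def sog_density(points, means, sigmas, weights):
    """Evaluate a Sum-of-Gaussians density at query points.

    points:  (P, 3) query locations
    means:   (K, 3) Gaussian centers
    sigmas:  (K,)   isotropic standard deviations
    weights: (K,)   per-Gaussian weights
    """
    # Pairwise squared distances between query points and Gaussian centers
    d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (P, K)
    # Each Gaussian contributes an unnormalized isotropic kernel
    return (weights * np.exp(-0.5 * d2 / sigmas ** 2)).sum(-1)     # (P,)

def subsample_to_sog(dense_means, k, seed=None):
    """Reduce a dense Gaussian set to a lightweight SoG by subsampling k centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(dense_means), size=k, replace=False)
    return dense_means[idx]
```

Because the mixture has only a few hundred terms instead of millions of splats, evaluating and differentiating it per tracking step stays cheap, which is the intuition behind the reported speedup.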

Why it matters

This paper advances fast 3D reconstruction of complex hand-object interactions from single monocular videos. By tracking with a compact Gaussian representation instead of optimizing heavy neural models, it delivers substantial speed and accuracy gains over prior methods, opening up applications in AR/VR, robotics, and human-computer interaction.

Original Abstract

We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand's per-joint position error by over 65%.
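The abstract's "2D joint and depth alignment losses" for hand refinement can be illustrated with a minimal sketch: project the 3D hand joints into the image and penalize their distance to detected 2D keypoints, plus a depth-consistency term. This is an assumed simplification (pinhole projection, camera-space joints, hypothetical function names), not the paper's actual loss:

```python
import numpy as np

def project(points_3d, K):
    """Pinhole projection of camera-space 3D points with intrinsics K (3x3)."""
    uvw = points_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def hand_refinement_loss(joints_3d, joints_2d_obs, depth_obs, K, w_depth=1.0):
    """2D joint reprojection loss plus a depth alignment loss.

    joints_3d:     (J, 3) estimated hand joints in camera space
    joints_2d_obs: (J, 2) detected 2D keypoints
    depth_obs:     (J,)   per-joint depth observations
    """
    reproj = ((project(joints_3d, K) - joints_2d_obs) ** 2).sum(-1).mean()
    depth = ((joints_3d[:, 2] - depth_obs) ** 2).mean()
    return reproj + w_depth * depth
```

Optimizing pose parameters against such image-space terms avoids fitting a detailed 3D hand appearance model per frame, which matches the complementary hand strategy the abstract describes.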
