HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen, and 2 others
TLDR
HumanSplatHMR jointly optimizes 3D human pose and high-fidelity Gaussian Splatting avatars for novel-view and novel-pose synthesis via differentiable rendering.
Key contributions
- Proposes HumanSplatHMR, a joint optimization for 3D human pose refinement and high-fidelity avatar learning.
- Closes the loop between geometric pose estimation and differentiable rendering for improved accuracy.
- Backpropagates image-level losses through a renderer to refine pose parameters and global position.
- Achieves better novel-view/pose synthesis and pose recovery without relying on motion capture.
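The core idea in the contributions above can be sketched in a few lines: render the posed model with a differentiable renderer, compare against the observed image, and let the gradient of that image-level loss update the pose parameters themselves. The sketch below is a minimal toy illustration, not the authors' implementation: it uses a simplified soft point-splatting "renderer" in place of Gaussian Splatting, only a photometric (MSE) loss in place of the full photometric/segmentation/depth objective, and refines only a global translation; all names (`render_soft`, `translation`, etc.) are hypothetical.

```python
import torch

def render_soft(points_3d, H=16, W=16, sigma=1.5):
    """Toy differentiable renderer: orthographically splat 3D points onto an
    H x W image as isotropic Gaussian kernels (a stand-in for splatting)."""
    ys = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, W, 1)
    px = points_3d[:, 0].view(1, 1, -1)  # x pixel coordinates
    py = points_3d[:, 1].view(1, 1, -1)  # y pixel coordinates
    d2 = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2)).sum(dim=-1)

torch.manual_seed(0)
body = torch.rand(20, 3) * 6 + 5.0  # canonical "body" points near image center

# Observed image: the body rendered at an unknown global offset (2, -1, 0).
true_offset = torch.tensor([2.0, -1.0, 0.0])
target = render_soft(body + true_offset).detach()

# Pose parameter to refine, starting from a wrong (zero) estimate.
translation = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([translation], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    rendered = render_soft(body + translation)
    loss = torch.nn.functional.mse_loss(rendered, target)  # photometric loss
    loss.backward()  # gradient flows through the renderer into the pose
    opt.step()

print(translation.detach())  # x/y components should move toward (2.0, -1.0)
```

In the paper's actual pipeline, the same principle applies with a full Gaussian Splatting renderer and richer losses: because rendering is differentiable with respect to the SMPL-style pose parameters and global position, the avatar's appearance supervision also corrects the pose, which is what "closing the loop" refers to.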
Why it matters
This paper addresses critical shortcomings in existing human avatar methods by jointly optimizing pose and appearance. It enables more accurate 3D human geometry recovery and better generalization for novel views and poses from in-the-wild video, crucial for VR and digital twinning.
Original Abstract
Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.