GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao + 4 more
TLDR
GenLCA introduces a 3D diffusion model for photorealistic full-body avatars, trained on millions of in-the-wild videos using a novel visibility-aware strategy.
Key contributions
- Generates photorealistic full-body avatars from text/image inputs with high-fidelity animation.
- Trains a 3D diffusion model using millions of partially observable 2D in-the-wild videos.
- Employs a pretrained avatar reconstruction model as an animatable 3D tokenizer.
- Introduces a visibility-aware diffusion strategy to handle partial body observations.
Why it matters
GenLCA makes it possible to train a 3D diffusion model on large-scale real-world video data despite only partial body observations. This scalability yields superior photorealism and generalizability for full-body avatar generation and editing, outperforming existing methods.
Original Abstract
We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
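The visibility-aware training strategy in the abstract has two parts: invalid (unobserved) regions of the 3D tokens are replaced with learnable tokens, and the diffusion loss is computed only over valid regions. A minimal NumPy sketch of that idea follows; the function names, shapes, and the mean-squared-error loss are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mask_invalid(tokens: np.ndarray, valid: np.ndarray,
                 learnable_token: np.ndarray) -> np.ndarray:
    """Replace unobserved token regions with a shared placeholder.

    tokens: (N, D) 3D tokens; valid: (N,) boolean visibility mask;
    learnable_token: (D,) placeholder (a trained parameter in practice).
    """
    out = tokens.copy()
    out[~valid] = learnable_token  # swap invalid regions for the learnable token
    return out

def visibility_aware_loss(pred: np.ndarray, target: np.ndarray,
                          valid: np.ndarray) -> float:
    """Per-token squared error, averaged only over valid (observed) tokens.

    Invalid regions contribute nothing, so blurring/transparency artifacts
    in unobserved body parts never enter the training signal.
    """
    err = ((pred - target) ** 2).mean(axis=-1)  # (N,) per-token error
    return float(err[valid].mean())
```

In a real training loop the placeholder would be a learnable parameter updated by the optimizer, and the loss would be the flow-matching objective rather than plain MSE; the masking logic, however, is the same.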