MUA: Mobile Ultra-detailed Animatable Avatars
Heming Zhu, Guoxing Sun, Marc Habermann
TLDR
MUA enables ultra-detailed, animatable avatars for mobile devices by using a novel representation that drastically reduces computational cost and model size.
Key contributions
- Introduces "Wavelet-guided Multi-level Spatial Factorized Blendshapes" for efficient, high-fidelity avatars.
- Achieves up to 2000X lower computational cost and 10X smaller model size than the high-quality teacher model.
- Preserves ultra-high fidelity, dynamics, and appearance details comparable to server-class models.
- Enables real-time performance (24 FPS) on standalone mobile VR devices like Meta Quest 3.
Why it matters
Existing high-fidelity avatars are computationally expensive, while lightweight ones lack detail. This paper bridges the gap by enabling ultra-detailed, animatable avatars to run efficiently on mobile devices. This significantly improves the practicality of immersive applications like VR.
Original Abstract
Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: Ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
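The abstract's core idea, coupling wavelet decomposition with low-rank factorization in texture space, can be illustrated with a minimal numpy-only sketch. This is not the paper's implementation: the Haar transform, the rank, the texture size, and the per-subband SVD truncation below are all illustrative assumptions, chosen only to show why storing low-rank factors of wavelet subbands is far cheaper than storing a full texture-space blendshape.

```python
import numpy as np

def haar2d(x):
    # Single-level 2D Haar wavelet transform (illustrative stand-in for the
    # paper's multi-level spectral decomposition): returns the LL, LH, HL, HH
    # subbands, each at half the resolution of the input.
    a = (x[0::2, :] + x[1::2, :]) / 2  # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2  # row differences
    LL = (a[:, 0::2] + a[:, 1::2]) / 2
    LH = (a[:, 0::2] - a[:, 1::2]) / 2
    HL = (d[:, 0::2] + d[:, 1::2]) / 2
    HH = (d[:, 0::2] - d[:, 1::2]) / 2
    return LL, LH, HL, HH

def low_rank(band, rank):
    # Truncated SVD as a simple low-rank structural factorization: instead of
    # the full subband, store a (tall, wide) factor pair of the given rank.
    U, S, Vt = np.linalg.svd(band, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

rng = np.random.default_rng(0)
tex = rng.standard_normal((256, 256))   # stand-in for one blendshape texture map
bands = haar2d(tex)
rank = 8                                # hypothetical per-subband rank budget
factors = [low_rank(b, rank) for b in bands]

full = tex.size
compact = sum(u.size + v.size for u, v in factors)
print(f"params per blendshape: {full} -> {compact} ({full / compact:.0f}x smaller)")
```

Run as written, the four 128x128 subbands each shrink to a rank-8 factor pair, so the parameter count drops from 65536 to 8192, an 8x reduction for this toy setting; the paper's much larger savings come from its full multi-level, distillation-trained design.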