ArXiv TLDR

Face Anything: 4D Face Reconstruction from Any Image Sequence

arXiv:2604.19702

Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner

cs.CV

TLDR

Face Anything reconstructs high-fidelity 4D faces from any image sequence using canonical facial point prediction, achieving state-of-the-art accuracy.

Key contributions

  • Introduces "canonical facial point prediction" for unified, high-fidelity 4D face reconstruction.
  • Transforms dynamic reconstruction into a canonical problem, ensuring temporally consistent geometry.
  • A transformer model jointly predicts depth and canonical coordinates for robust tracking.
  • Achieves state-of-the-art results: roughly 3× lower correspondence error, 16% better depth accuracy, and faster inference than prior dynamic reconstruction methods.

Why it matters

This paper tackles the challenge of reconstructing dynamic 4D faces from image sequences, a core requirement for realistic digital humans. Its unified approach improves both accuracy and temporal consistency over existing methods, with direct applications in AR/VR, animation, and virtual try-on.

Original Abstract

Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
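The core idea in the abstract, assigning each pixel a coordinate in a shared canonical space so that cross-frame tracking becomes a nearest-neighbour lookup, can be illustrated with a toy sketch. This is not the authors' model: the random arrays below are hypothetical stand-ins for the transformer's per-pixel depth and canonical-coordinate predictions.

```python
import numpy as np

H, W = 4, 4
rng = np.random.default_rng(0)

# Stand-ins for per-frame network outputs: a 3D canonical facial
# coordinate and a depth value for every pixel.
canon_t0 = rng.normal(size=(H, W, 3))                     # frame t
canon_t1 = canon_t0 + 0.01 * rng.normal(size=(H, W, 3))   # frame t+1, slightly deformed
depth_t1 = rng.uniform(0.5, 1.5, size=(H, W))             # frame t+1 depth map

def track(pix, canon_src, canon_dst):
    """Track a pixel across frames: find the destination pixel whose
    canonical coordinate is nearest to the query's canonical coordinate."""
    q = canon_src[pix]                                    # (3,) query point
    d = np.linalg.norm(canon_dst.reshape(-1, 3) - q, axis=1)
    return divmod(int(d.argmin()), canon_dst.shape[1])    # (row, col)

# Because frame t+1 is only a tiny perturbation, the pixel tracks to itself.
r, c = track((2, 3), canon_t0, canon_t1)
print((r, c))          # (2, 3)
print(depth_t1[r, c])  # matched pixel's predicted depth lifts it back to 3D
```

The point of the formulation is that correspondence needs no explicit motion model: any two frames that map the same facial point to the same canonical coordinate are automatically in correspondence, and the jointly predicted depth turns each match into temporally consistent 3D geometry.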

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.