CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping
TLDR
CA-IDD is presented as the first diffusion-based face swapping method, combining multi-scale cross-attention with multi-modal guidance for identity-consistent, realistic results.
Key contributions
- Presents CA-IDD, the first diffusion-based face swapping approach for identity-consistent image generation.
- Integrates multi-modal guidance (gaze, identity, facial parsing) via multi-scale cross-attention.
- Utilizes hierarchical attention and expert-guided supervision for accurate identity transfer and visual quality.
- Outperforms GAN-based baselines such as FaceShifter and MegaFS, offering stable training, robust generalization, and an FID of 11.73.
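The core mechanism in the contributions above — injecting a precomputed identity embedding into the denoiser via cross-attention at multiple scales — can be sketched as follows. This is a minimal illustration, not the authors' code: the class name, layer sizes, and the use of `nn.MultiheadAttention` are assumptions.

```python
import torch
import torch.nn as nn

class IdentityCrossAttention(nn.Module):
    """Sketch of one cross-attention block: spatial denoiser features
    (queries) attend to a precomputed identity embedding (keys/values).
    Names and dimensions are hypothetical, not from the paper."""

    def __init__(self, feat_dim: int, id_dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.attn = nn.MultiheadAttention(
            feat_dim, heads, kdim=id_dim, vdim=id_dim, batch_first=True
        )

    def forward(self, feats: torch.Tensor, id_embed: torch.Tensor) -> torch.Tensor:
        # feats:    (B, H*W, C) flattened spatial features from the denoiser
        # id_embed: (B, 1, D)   identity embedding, e.g. from a face recognizer
        attended, _ = self.attn(self.norm(feats), id_embed, id_embed)
        return feats + attended  # residual injection of identity cues

# One block per U-Net resolution would mirror the multi-scale design;
# the channel widths here are illustrative.
blocks = nn.ModuleList(
    IdentityCrossAttention(c, id_dim=512) for c in (320, 640, 1280)
)
```

Keeping the identity embedding fixed as key/value while the spatial features act as queries is what makes the conditioning spatially adaptive: each location decides how strongly to pull in identity information.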
Why it matters
Existing GAN-based face swapping methods struggle to balance identity preservation and realism. CA-IDD offers a stable, robust diffusion framework that sidesteps these limitations: its multi-modal guidance improves identity consistency and visual quality over GAN baselines, pointing toward diffusion as the default framework for face editing.
Original Abstract
Face swapping aims to generate realistic facial images by transferring the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
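The expert-guided supervision described in the abstract amounts to a weighted training objective: the standard denoising loss plus identity, parsing, and gaze-consistency terms. A minimal sketch, assuming MSE for denoising and gaze, cosine distance for identity, and cross-entropy for parsing — the exact losses and weights are not specified in this summary and are assumptions here.

```python
import torch
import torch.nn.functional as F

def swap_loss(noise_pred, noise, id_src, id_gen,
              parse_tgt, parse_logits, gaze_tgt, gaze_gen,
              w_id=1.0, w_parse=0.5, w_gaze=0.5):
    """Hypothetical combined objective for identity-conditional diffusion.
    Term choices and weights are illustrative, not from the paper."""
    # Standard diffusion loss: predict the noise added at this timestep.
    l_diff = F.mse_loss(noise_pred, noise)
    # Identity consistency: source vs. generated face embeddings.
    l_id = 1.0 - F.cosine_similarity(id_src, id_gen, dim=-1).mean()
    # Semantic coherence: parsing expert's labels on the generated face.
    l_parse = F.cross_entropy(parse_logits, parse_tgt)
    # Gaze consistency: gaze expert's direction estimates.
    l_gaze = F.mse_loss(gaze_gen, gaze_tgt)
    return l_diff + w_id * l_id + w_parse * l_parse + w_gaze * l_gaze
```

Since every term is non-negative, the combined loss stays bounded below by zero, and the weights trade identity fidelity against semantic and gaze agreement with the target.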