ArXiv TLDR

Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

arXiv: 2604.25889

Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran

cs.CV

TLDR

This paper introduces a robust deepfake detection framework that combines an extreme compound degradation engine with a calibrated multi-stream ensemble to mitigate spatial attention drift under real-world degradations.

Key contributions

  • Integrates an extreme compound degradation engine with a multi-stream architecture for robust deepfake detection.
  • Optimizes a DINOv2-Giant backbone to extract degradation-invariant geometric and semantic priors.
  • Employs three specialized streams: Global Texture, Localized Facial, and Hybrid Semantic Fusion (incorporating CLIP).
  • Aggregates stream predictions via a calibrated, discretized voting ensemble that suppresses background attention drift.
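The calibrated, discretized voting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the temperature values, decision threshold, and stream outputs are assumptions.

```python
import math

def calibrated_discretized_vote(stream_probs, temperatures, threshold=0.6):
    """Combine per-stream 'fake' probabilities via calibrated, hard voting.

    Each raw sigmoid output is temperature-scaled (a standard calibration
    technique), then discretized to a 0/1 vote so that no single stream's
    drifted attention can dominate the ensemble; the majority decides.
    """
    votes = []
    for p, t in zip(stream_probs, temperatures):
        logit = math.log(p / (1 - p))                # invert the sigmoid
        calibrated = 1 / (1 + math.exp(-logit / t))  # temperature scaling
        votes.append(1 if calibrated >= threshold else 0)
    return 1 if sum(votes) > len(votes) / 2 else 0   # 1 = fake

# Hypothetical raw outputs from the Global Texture, Localized Facial,
# and Hybrid Semantic Fusion streams, with per-stream temperatures:
print(calibrated_discretized_vote([0.9, 0.7, 0.4], [1.5, 2.0, 1.0]))  # → 1
```

Discretizing before aggregation is what makes the ensemble robust: a single stream with a wildly overconfident score contributes at most one vote.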

Why it matters

Current deepfake detectors struggle under real-world degradations because their spatial attention drifts away from forensic cues. This paper offers a framework that stabilizes attention and extracts degradation-invariant features, achieving the stable zero-shot generalization needed for practical deepfake detection.

Original Abstract

Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. Through analyzing spatial attribution via Score-CAM and feature stability using Cosine Similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating these predictions via a calibrated, discretized voting mechanism, our ensemble successfully suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
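The feature-stability analysis mentioned in the abstract — cosine similarity between backbone features of clean and degraded inputs — can be sketched as below. The toy feature vectors are stand-ins, not real DINOv2 embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def feature_stability(clean_feats, degraded_feats):
    """Mean cosine similarity between features of clean images and their
    degraded counterparts; values near 1.0 indicate representations that
    are invariant to the degradation pipeline."""
    sims = [cosine_similarity(c, d)
            for c, d in zip(clean_feats, degraded_feats)]
    return sum(sims) / len(sims)

# Toy stand-ins for backbone embeddings of two images, before and after
# a mild synthetic degradation:
clean = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0]]
degraded = [[0.95, 0.05, 0.2], [0.1, 0.9, 0.1]]
print(round(feature_stability(clean, degraded), 3))
```

A score near 1.0 across a degraded validation set would support the paper's claim that the degradation-trained backbone extracts invariant priors.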
