TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection
Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller
TLDR
This paper benchmarks modern Vision Foundation Models for AI-generated image detection and introduces Tunable Attention Pooling (TAP), achieving state-of-the-art results.
Key contributions
- Benchmarked diverse Vision Foundation Models (VFMs) for AI-generated image detection.
- Found that the best VFMs outperform the original CLIP by more than 12% in accuracy, beating established detection approaches.
- Introduced Tunable Attention Pooling (TAP) to refine VFM features for AIGI classification (see the sketch after this list).
- Achieved new state-of-the-art on two challenging benchmarks for in-the-wild AIGI detection.
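The paper describes TAP as aggregating a VFM's output tokens into a refined global representation. Below is a minimal sketch of what such a head could look like, assuming a PyTorch setting in which a frozen VFM returns a sequence of patch tokens; the `TAPHead` name, the single learnable query, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): a tunable attention pooling (TAP)
# head that aggregates VFM patch tokens into one global representation via
# a learnable query, then classifies real vs. AI-generated.
import torch
import torch.nn as nn

class TAPHead(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 2):
        super().__init__()
        # Learnable query that attends over the patch tokens (assumed design).
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens from a frozen vision foundation model.
        q = self.query.expand(tokens.size(0), -1, -1)  # (B, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)       # cross-attend over patches
        return self.fc(self.norm(pooled.squeeze(1)))   # (B, num_classes) logits
```

Compared with taking only the CLS token or mean-pooling, a learnable query lets the head weight individual patch tokens unevenly, which is one plausible reading of how TAP refines the global representation.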
Why it matters
Detecting AI-generated images is crucial for combating misinformation. This work shows that modern Vision Foundation Models, combined with a novel pooling method, significantly advance the state of the art. It provides a robust approach for identifying both fully generated and inpainted AI images.
Original Abstract
Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP's release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
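To make the probing setup concrete, here is a hedged usage sketch that trains the `TAPHead` above on top of a frozen backbone, in the spirit of the out-of-the-box evaluation the abstract describes. The timm model name, the behavior of `forward_features`, and the toy batch are assumptions for illustration, not the paper's benchmark protocol.

```python
# Hedged usage sketch: train only the TAP head, keep the VFM frozen.
import timm
import torch
import torch.nn.functional as F

backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # the foundation model stays frozen

head = TAPHead(dim=backbone.embed_dim, num_classes=2)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)   # stand-in for a real image batch
labels = torch.tensor([0, 1, 0, 1])    # 0 = real, 1 = AI-generated (toy labels)

with torch.no_grad():
    tokens = backbone.forward_features(images)  # (B, tokens, embed_dim) for timm ViTs

opt.zero_grad()
loss = F.cross_entropy(head(tokens), labels)
loss.backward()
opt.step()
```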