One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai
TLDR
This paper identifies a hubness vulnerability in cross-modal encoders such as CLIP: a single "hub text" can achieve unreasonably high similarity scores with many unrelated images.
Key contributions
- Introduces a method for identifying the hub embedding and its corresponding "hub text" in cross-modal encoders.
- Demonstrates that a single hub text achieves unreasonably high similarity scores with many unrelated images.
- Reveals vulnerabilities in cross-modal encoders such as CLIP through image captioning evaluation and image-to-text retrieval tasks.
- Shows that a hub text's similarity scores can match or exceed those of human-written reference captions, indicating a flaw in the encoders.
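The hub phenomenon behind these contributions is easy to see with a toy model. The sketch below is not CLIP or the paper's identification method; it builds synthetic L2-normalized embeddings that, like CLIP's, cluster in a narrow cone, and uses the normalized centroid direction as a stand-in for a hub embedding. The noise scale and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_images = 512, 1000

# Toy stand-ins for image embeddings: a shared mean direction plus
# noise, then L2-normalized (CLIP embeddings likewise occupy a cone).
mean_dir = rng.normal(size=dim)
imgs = mean_dir + 2.0 * rng.normal(size=(n_images, dim))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

# Hub candidate: the normalized centroid of the image embeddings.
hub = imgs.mean(axis=0)
hub /= np.linalg.norm(hub)

# Ordinary candidate: one arbitrary embedding from the same cone.
ordinary = imgs[0]

hub_sims = imgs @ hub        # cosine similarity of the hub to every image
ord_sims = imgs @ ordinary   # cosine similarity of a normal point to every image

print(f"hub mean similarity:      {hub_sims.mean():.3f}")
print(f"ordinary mean similarity: {ord_sims.mean():.3f}")
print(f"images where hub wins:    {(hub_sims > ord_sims).mean():.3f}")
```

With these settings the centroid direction beats the ordinary embedding for nearly every image, mirroring how one hub text can dominate similarity scores across an entire image collection.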
Why it matters
This paper exposes a critical hubness flaw in cross-modal encoders such as CLIP: a single "hub text" can attain misleadingly high similarity scores across many images. This undermines reliable cross-modal retrieval and automatic evaluation, and highlights the need for more robust encoder design and evaluation.
Original Abstract
The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat for purposes such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications, and thus, the existence of hubs may pose practical threats. To reveal the vulnerabilities of cross-modal encoders, we propose a method for identifying the hub embedding and its corresponding hub text. Experiments on image captioning evaluation in MSCOCO and nocaps along with image-to-text retrieval tasks in MSCOCO and Flickr30k showed that our method can identify a single hub text that unreasonably achieves comparable or higher similarity scores than human-written reference captions in many images, thereby revealing the vulnerabilities in cross-modal encoders.
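The retrieval threat described in the abstract can also be sketched with synthetic embeddings. This is not the paper's search method: it simply plants a centroid-style hub candidate in a caption pool, with an assumed caption-noise level chosen so reference similarities are comparable to the hub's, and counts how often image-to-text retrieval returns the hub instead of the matching caption.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_images = 512, 500

# Synthetic shared space: image embeddings in a cone, each with one
# matching caption embedding nearby; everything L2-normalized.
base = rng.normal(size=dim)
imgs = base + 2.0 * rng.normal(size=(n_images, dim))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)

# Matching captions: image embedding plus noise of norm ~2 (an assumed
# level that makes reference scores comparable to the hub's).
caps = imgs + (2.0 / np.sqrt(dim)) * rng.normal(size=(n_images, dim))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)

# One hub candidate (the normalized image centroid) joins the caption pool.
hub = imgs.mean(axis=0)
hub /= np.linalg.norm(hub)
pool = np.vstack([caps, hub])          # hub sits at index n_images

# Image-to-text retrieval: index of the highest-similarity text per image.
top1 = (imgs @ pool.T).argmax(axis=1)
hub_share = (top1 == n_images).mean()
print(f"images whose top-1 text is the hub: {hub_share:.2f}")
```

Even though every image has its own matching caption in the pool, a substantial fraction of queries retrieve the single hub candidate, which is the failure mode the paper demonstrates on MSCOCO and Flickr30k.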