ArXiv TLDR

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

arXiv: 2604.18572

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

cs.CV, cs.AI, cs.LG

TLDR

Evidence for cross-modal representational convergence in neural networks is weaker than previously thought, especially at scale.

Key contributions

  • Evidence for cross-modal representational convergence is fragile and depends on the evaluation regime.
  • Alignment degrades substantially when scaling datasets from 1K to millions of samples.
  • Remaining alignment shows coarse semantic overlap, not consistent fine-grained structure.
  • One-to-one evaluations inflate alignment; realistic many-to-many settings reduce it.

Why it matters

This paper critically re-evaluates the "Platonic Representation Hypothesis," challenging the claim that neural networks trained on different modalities converge toward the same representation of reality. Its findings suggest that modality choice still matters: models may learn equally rich but distinct views of the world. This reframes how results on cross-modal learning and alignment should be interpreted.

Original Abstract

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
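The abstract refers to alignment measured with mutual nearest neighbors, the metric used in the Platonic Representation Hypothesis evaluations of Huh et al. The sketch below shows one way such a score can be computed; the function names, the cosine-similarity choice, and the value of k are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Illustrative mutual k-nearest-neighbor alignment between two
    representation spaces computed over the same n samples.

    feats_a: (n, d_a) features from model A (e.g., a vision model)
    feats_b: (n, d_b) features from model B (e.g., a language model)
    Returns the mean fraction of shared nearest neighbors, in [0, 1].
    """
    def knn_indices(feats: np.ndarray) -> np.ndarray:
        # Cosine similarity between all pairs of samples.
        normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # exclude self-matches
        # Indices of the k most similar samples for each row.
        return np.argsort(-sim, axis=1)[:, :k]

    nn_a = knn_indices(feats_a)
    nn_b = knn_indices(feats_b)

    # For each sample, count how many of its k neighbors agree across spaces.
    overlap = [
        len(set(nn_a[i]) & set(nn_b[i])) / k
        for i in range(feats_a.shape[0])
    ]
    return float(np.mean(overlap))
```

Because the neighbor sets depend on the size and composition of the evaluation pool, a score computed on roughly 1K samples need not persist when the pool grows to millions, which is the scaling effect the paper highlights.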
