The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models
TLDR
Scientific foundation models pay a "Geometric Alignment Tax": forcing continuous geometry through discrete tokenization distorts their internal representations, even when predictive accuracy is high.
Key contributions
- Identifies the "Geometric Alignment Tax" as the cost of discrete tokenization on continuous geometry in scientific FMs.
- Shows continuous heads reduce geometric distortion by up to 8.5x compared to discrete cross-entropy.
- Demonstrates finer quantization in learned codebooks worsens geometry despite improving reconstruction.
- Evaluates 14 biological FMs, identifying failure regimes like Local-Global Decoupling and Geometric Vacuity.
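To make the head-swap ablation concrete, here is a minimal sketch of the idea: the same encoder trained under a discrete cross-entropy objective versus a continuous regression objective. The module names, toy 3-dimensional state, and binning scheme are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch, not the paper's code: one shared encoder, two heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, HIDDEN, N_BINS = 3, 128, 256   # toy dynamical-system state

class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class DiscreteHead(nn.Module):
    """Classifies each state dimension into one of N_BINS tokens (cross-entropy)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, STATE_DIM * N_BINS)
    def forward(self, h):
        return self.proj(h).view(-1, STATE_DIM, N_BINS)

class ContinuousHead(nn.Module):
    """Regresses the next state directly (MSE), keeping the target continuous."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, STATE_DIM)
    def forward(self, h):
        return self.proj(h)

encoder = SharedEncoder()
x, y = torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM)  # (state, next state)
h = encoder(x)

# Discrete objective: quantize the continuous target into bins, then classify.
bin_edges = torch.linspace(-3.0, 3.0, N_BINS)
y_tokens = torch.bucketize(y, bin_edges).clamp(max=N_BINS - 1)
ce_loss = F.cross_entropy(DiscreteHead()(h).flatten(0, 1), y_tokens.flatten())

# Continuous objective: no quantization step at all.
mse_loss = F.mse_loss(ContinuousHead()(h), y)
```

The quantization step in the discrete branch is the categorical bottleneck the paper identifies as the source of geometric distortion; the continuous branch simply omits it.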
Why it matters
This paper pins down a fundamental limitation of scientific foundation models: discrete tokenization inherently distorts the continuous geometric structure of the systems they model. Understanding this "Geometric Alignment Tax" is crucial for building more accurate and robust models in fields like biology and physics, and it points toward concrete choices of architecture and training objective.
Original Abstract
Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes: Local-Global Decoupling, Representational Compression, and Geometric Vacuity. A controlled experiment confirms that Evo 2's reverse-complement robustness on real DNA reflects conserved sequence composition, not learned symmetry. No model achieves simultaneously low distortion, high mutual information, and global coherence.
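As background for the evaluation the abstract describes, MINE (Mutual Information Neural Estimation) maximizes a Donsker-Varadhan lower bound on mutual information with a small "statistics network." The sketch below is a generic illustration of that estimator with toy dimensions and data; it is not drawn from the paper.

```python
# Illustrative MINE estimator (Donsker-Varadhan bound); dimensions and data are toy.
import math
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """T(x, z): scores joint pairs higher than shuffled (product-of-marginals) pairs."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_lower_bound(T, x, z):
    """I(X;Z) >= E_joint[T(x,z)] - log E_marginal[exp(T(x,z'))]."""
    joint = T(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]          # break the (x, z) pairing
    marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
    return joint - marginal

# Toy usage: inputs x and embeddings z with a known linear dependence.
x = torch.randn(512, 16)
z = x @ torch.randn(16, 8) + 0.1 * torch.randn(512, 8)
T = StatisticsNet(16, 8)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    (-mine_lower_bound(T, x, z)).backward()            # maximize the bound
    opt.step()
print("MI lower-bound estimate (nats):", mine_lower_bound(T, x, z).item())
```

Maximizing this bound over the statistics network yields mutual-information estimates of the kind the paper uses, alongside rate-distortion analysis, to characterize the three failure regimes.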