ArXiv TLDR

Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability

2604.04469

Jon-Paul Cacioli

cs.CL · q-bio.QM

TLDR

Transformers' magnitude representations show *decreasing* variability with magnitude, the opposite of biological scalar variability, despite reproducing log-compressive geometry.

Key contributions

  • Analyzed hidden-state variability of numerical magnitudes in Llama-3 and Mistral models.
  • Found representational variability *decreased* with magnitude, opposite to biological scalar variability.
  • The anti-scalar pattern was 3-5x stronger along the magnitude axis than along orthogonal dimensions, and per-magnitude variability was strongly predicted by corpus frequency.
  • Suggests distributional learning alone is insufficient for scalar variability in LLMs.
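The dispersion analysis in the first two bullets can be sketched with synthetic data. Everything below is illustrative: the hidden states, the noise scale, and the randomly drawn "magnitude axis" are stand-ins for quantities the paper extracts from real models.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_sentences = 64, 40          # hypothetical hidden size / carrier-sentence count
magnitudes = np.arange(1, 27)    # 26 magnitudes, as in the paper

# Unit "magnitude axis" (the paper fits this from data; here it is random).
axis = rng.normal(size=d)
axis /= np.linalg.norm(axis)

# Stand-in hidden states: a log-compressive mean along the axis plus isotropic
# noise. A real analysis would take these from the model at the number token,
# one vector per (magnitude, carrier sentence) pair.
hidden = {
    m: np.log(m) * axis + rng.normal(0, 0.1, size=(n_sentences, d))
    for m in magnitudes
}

# Per-magnitude dispersion along the magnitude axis: SD of projections onto it.
axis_sd = {m: float(np.std(h @ axis)) for m, h in hidden.items()}
```

With isotropic noise this dispersion is flat across magnitudes; the paper's finding is that in real models it instead *decreases* as magnitude grows.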

Why it matters

This paper reveals a fundamental difference between how transformers and biological systems represent numerical magnitudes. Transformers fail to exhibit scalar variability, a key property of biological magnitude systems. This highlights a limitation of current distributional learning approaches in replicating human-like cognitive properties.

Original Abstract

Scalar variability -- the finding that representational noise scales proportionally with magnitude, producing a constant coefficient of variation -- is a hallmark of biological magnitude systems. We tested whether transformer language models exhibit this property by analysing the dispersion of hidden-state representations across carrier sentences for 26 numerical magnitudes in three 7-8B parameter models (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base; data from Cacioli, 2026). We found the opposite: representational variability decreased with magnitude along the magnitude axis (scaling exponent alpha approx -0.19; 0/16 primary layers with alpha > 0, all three models). The negative sign was consistent in full-dimensional space (alpha approx -0.04) and after sentence-identity correction (alpha approx -0.007). The anti-scalar pattern was 3-5x stronger along the magnitude axis than orthogonal dimensions, and corpus frequency strongly predicted per-magnitude variability (rho = .84). These results demonstrate that distributional learning alone is insufficient to produce scalar variability: transformers reproduce log-compressive magnitude geometry but not the constant-CV noise signature observed in biological systems.
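To make the abstract's scaling exponent concrete, here is a minimal sketch of how such an exponent can be fit: simulate per-magnitude dispersions under scalar variability (SD proportional to magnitude, so the coefficient of variation is constant and alpha is approximately +1) and under the reported anti-scalar pattern (SD proportional to magnitude^-0.19), then fit the slope of log(SD) against log(magnitude). The simulation parameters are illustrative, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
magnitudes = np.arange(1, 27)  # 26 magnitudes, as in the paper

def scaling_exponent(sds, mags):
    """Fit log(SD) = alpha * log(magnitude) + c; return the slope alpha."""
    alpha, _ = np.polyfit(np.log(mags), np.log(sds), 1)
    return alpha

# Scalar variability (biological): noise SD grows in proportion to magnitude,
# so SD / mean is constant and the fitted exponent is alpha ≈ +1.
bio_sds = np.array([np.std(rng.normal(m, 0.15 * m, 500)) for m in magnitudes])

# Anti-scalar pattern (as reported for transformers): SD shrinks with
# magnitude, here SD ∝ magnitude^(-0.19), giving alpha ≈ -0.19.
llm_sds = np.array([np.std(rng.normal(m, m ** -0.19, 500)) for m in magnitudes])

print(scaling_exponent(bio_sds, magnitudes))   # ≈ +1 (scalar variability)
print(scaling_exponent(llm_sds, magnitudes))   # ≈ -0.19 (anti-scalar)
```

The same log-log fit applied to real per-magnitude hidden-state dispersions is what yields the negative exponents reported in the abstract.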
