Geometric Factual Recall in Transformers
Shauli Ravfogel, Gilad Yehudai, Joan Bruna, Alberto Bietti
TLDR
Transformers memorize facts geometrically, using embeddings that encode relational structure and an MLP as a relation-conditioned selector.
Key contributions
- Proposes 'geometric memorization' where embeddings encode relational structure, and MLPs select attributes.
- Proves that a logarithmic embedding dimension suffices to memorize random bijections: subject embeddings encode linear superpositions of their attribute vectors.
- Shows the MLP acts as a relation-conditioned selector using ReLU gating, not as an associative key-value map.
- Extends the analysis to multi-hop queries, with constructions (with and without chain-of-thought) exhibiting a provable capacity-depth tradeoff and a matching information-theoretic lower bound.
- Demonstrates that the trained MLP transfers zero-shot to entirely new bijections, revealing a generic selection mechanism rather than memorized facts.
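The superposition idea can be illustrated with a minimal NumPy sketch. This is a toy analogue, not the paper's construction: the dimensions, counts, and seed below are arbitrary, and a plain dot-product readout stands in for the paper's ReLU-gated MLP selector. Each subject embedding is the sum of one near-orthogonal attribute vector per relation; conditioning on the relation then lets a linear scorer pick out the right attribute from the superposition.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 256          # embedding dimension (the theory needs only O(log #facts))
n_subjects = 50
n_relations = 3
n_attrs = 10     # attributes per relation

# Random near-orthogonal unit vectors for every (relation, attribute) pair.
attr_vecs = rng.standard_normal((n_relations, n_attrs, d))
attr_vecs /= np.linalg.norm(attr_vecs, axis=-1, keepdims=True)

# Ground-truth facts: for each relation, a random map subject -> attribute.
facts = rng.integers(0, n_attrs, size=(n_relations, n_subjects))

# Geometric memorization: each subject embedding is a linear superposition
# of its attribute vectors, one per relation.
subj_emb = np.zeros((n_subjects, d))
for r in range(n_relations):
    subj_emb += attr_vecs[r, facts[r]]

def recall(subject, relation):
    # Relation-conditioned readout: score each candidate attribute of this
    # relation against the subject embedding; near-orthogonality makes the
    # stored attribute dominate the cross terms.
    scores = attr_vecs[relation] @ subj_emb[subject]
    return int(np.argmax(scores))

correct = sum(recall(s, r) == facts[r, s]
              for r in range(n_relations) for s in range(n_subjects))
print(f"{correct} / {n_relations * n_subjects} facts recovered")
```

With these settings the cross terms between independent random unit vectors scale like $1/\sqrt{d}$, so all facts are recovered despite the embedding dimension being far smaller than the number of stored associations.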
Why it matters
This paper offers a new perspective on how transformers store factual knowledge, moving beyond the view of weight matrices as simple associative memories. Understanding geometric memorization could lead to more parameter-efficient and generalizable models for factual recall, since the MLP learns a generic selection mechanism rather than memorizing individual facts.
Original Abstract
How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, \emph{geometric} form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode \emph{linear superpositions} of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as ``Who is the mother of the wife of $x$?'' -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.