Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
TLDR
Three lightweight gating mechanisms inspired by the Free Energy Principle raise the probability that a Mixture-of-Experts router assigns to the correct expert at domain transitions from 0.006 to 0.748.
Key contributions
- Identifies that standard MoE routing fails at domain transitions, assigning only 0.006 probability to the correct expert.
- Introduces three lightweight gate modifications: temporal memory (beta), precision-weighted gating (Pi), and anticipatory routing.
- Demonstrates a super-additive interaction between beta and anticipatory routing, closing 75% of the oracle gap.
- Raises correct-expert probability at transitions from 0.006 to 0.748 (124x) and, in a character-level MoE language model, reduces transition-step Bits Per Character (BPC) from 6.56 to 4.01.
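The super-additivity claim can be checked with simple arithmetic on the gains reported in the abstract. The sketch below is only a worked check, not code from the paper: the helper name `interaction_excess` is hypothetical, and the oracle probability of 1.0 is an assumption the abstract does not state.

```python
# Gains in correct-expert probability over standard routing,
# copied from the ablation figures quoted in the abstract.
gain_ant_only = 0.000   # anticipatory routing alone
gain_beta_only = 0.295  # temporal memory (beta) alone
gain_combined = 0.741   # beta + anticipatory routing


def interaction_excess(combined, *individual):
    """Combined gain minus the sum of individual gains.

    A positive value means the interaction is super-additive.
    (Hypothetical helper, introduced here for illustration.)
    """
    return combined - sum(individual)


excess = interaction_excess(gain_combined, gain_beta_only, gain_ant_only)

# Fold improvement of the full three-mechanism gate over standard routing.
baseline = 0.006
improved = 0.748
fold = improved / baseline

# ASSUMPTION: an oracle router places probability 1.0 on the correct expert.
# The abstract does not define the oracle; this is only a plausibility check.
oracle = 1.0
gap_closed = gain_combined / (oracle - baseline)

print(f"interaction excess: {excess:+.3f}")
print(f"fold improvement:   {fold:.1f}x")
print(f"oracle gap closed:  {gap_closed:.1%}")
```

Under these assumptions the excess matches the reported +0.446, and the fraction of the gap closed comes out close to the reported 75%.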
Why it matters
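Sparse MoE models route poorly at domain transitions, where consecutive tokens come from different distributions. The paper shows that gate modifications drawn from the Free Energy Principle and from spiking-neuron (LIF) dynamics raise correct-expert probability at the transition by two orders of magnitude in a controlled experiment, and that anticipation helps only when paired with temporal memory. The results point toward MoE architectures that adapt more robustly when the input distribution shifts.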
Original Abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives modest gain (+0.295 +/- 0.013); combined they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
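To make the three mechanisms concrete, here is a minimal NumPy sketch of how such a gate might be wired together. It is an illustration under stated assumptions, not the reference implementation linked above: the class name `SketchGate`, its parameters, the exponential-moving-average variance estimate, and the choice to combine terms by adding log-precision to the logits are all hypothetical. The membrane is a plain leaky integrator without spiking or reset.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


class SketchGate:
    """Hypothetical gate combining temporal memory, precision, and anticipation."""

    def __init__(self, d_model, n_experts, beta=0.9, err_decay=0.9,
                 eps=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Standard affinity projection and an (assumed) linear next-state predictor.
        self.W_aff = rng.normal(scale=0.1, size=(n_experts, d_model))
        self.W_ant = rng.normal(scale=0.1, size=(n_experts, d_model))
        self.beta = beta
        self.err_decay = err_decay
        self.eps = eps
        self.membrane = np.zeros(n_experts)  # per-expert leaky routing memory
        self.state = np.zeros(d_model)       # beta-accumulated hidden state
        self.err_var = np.ones(n_experts)    # running squared prediction error

    def step(self, h, expert_errors=None):
        """Return routing probabilities for hidden state h (shape: d_model)."""
        affinity = self.W_aff @ h
        # Temporal memory (beta): leaky accumulation of routing context.
        self.membrane = self.beta * self.membrane + (1 - self.beta) * affinity
        self.state = self.beta * self.state + (1 - self.beta) * h
        # Anticipatory routing: predictor conditioned on the accumulated state.
        anticipated = self.W_ant @ self.state
        # Precision (Pi): inverse of a running estimate of squared error.
        if expert_errors is not None:
            sq = np.square(np.asarray(expert_errors, dtype=float))
            self.err_var = (self.err_decay * self.err_var
                            + (1 - self.err_decay) * sq)
        precision = 1.0 / (self.err_var + self.eps)
        # ASSUMPTION: precision enters as an additive log term, so each
        # probability is scaled in proportion to that expert's precision.
        logits = self.membrane + anticipated + np.log(precision)
        return softmax(logits)


gate = SketchGate(d_model=8, n_experts=4)
rng = np.random.default_rng(1)
for _ in range(5):
    probs = gate.step(rng.normal(size=8), expert_errors=[0.1, 0.1, 0.1, 2.0])
print(probs)
```

Adding log-precision keeps the output a valid distribution while down-weighting experts whose recent errors were large. The abstract does not say how the three signals are combined, so treat this combination rule as one possible choice.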