ArXiv TLDR

Compute Where It Counts: Self-Optimizing Language Models

arXiv:2605.10875

Yash Akhauri, Mohamed S. Abdelfattah

cs.LG cs.CL

TLDR

Self-Optimizing Language Models (SOL) dynamically allocate computation per token, improving LLM inference efficiency and quality over static methods.

Key contributions

  • Introduces Self-Optimizing Language Models (SOL) for dynamic, token-level compute allocation.
  • Policy network dynamically controls attention sparsity, MLP pruning, and activation quantization (a minimal sketch follows this list).
  • Trains the policy with group-relative policy optimization on teacher-forced episodes, comparing counterfactual compute schedules for the same token path.
  • Achieves up to a 7.3% MMLU accuracy gain and a better quality-efficiency Pareto front than uniform budget allocation.
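To make the first two bullets concrete, here is a minimal sketch of such a policy head, assuming the paper's setup of a frozen LLM paired with a lightweight network. The class name EfficiencyPolicy, the two-layer MLP, and the action count of 8 are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EfficiencyPolicy(nn.Module):
    """Lightweight policy head: reads the frozen LLM's hidden state at a
    decode step and emits logits over discrete efficiency actions (e.g.,
    combinations of attention sparsity, MLP pruning ratio, and activation
    bit-width). All sizes here are illustrative."""

    def __init__(self, hidden_size: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, num_actions),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size), taken from the base model;
        # the base model's weights are never updated.
        return self.net(hidden_state)

# At each decode step: sample an action, then run the step under the
# corresponding compute configuration (sparsity / pruning / bit-width).
policy = EfficiencyPolicy(hidden_size=4096, num_actions=8)
logits = policy(torch.randn(1, 4096))
action = torch.distributions.Categorical(logits=logits).sample()
```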

Why it matters

Current LLM inference methods apply a uniform compute budget to every generated token, over-spending on easy tokens and under-spending on hard ones. This paper learns to adjust compute per token at decode time, improving both efficiency and quality, and offers a complementary axis for optimizing LLM inference.

Original Abstract

Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
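As a rough illustration of the training recipe described above, the sketch below computes group-relative advantages for a batch of counterfactual schedules. The function name grpo_advantages, the squared budget penalty, and penalty_weight are assumptions made for the example; the abstract specifies only soft penalties on episode-average budget usage.

```python
import torch

def grpo_advantages(log_likelihoods: torch.Tensor,
                    avg_budgets: torch.Tensor,
                    target_budget: float,
                    penalty_weight: float = 1.0) -> torch.Tensor:
    """Group-relative advantages over counterfactual compute schedules.

    For one teacher-forced token sequence, several schedules are sampled
    that differ only in their per-token efficiency actions. Each
    schedule's reward trades off the sequence log-likelihood under the
    frozen LLM against a soft penalty on the gap between its
    episode-average budget and the requested target. The squared penalty
    form and penalty_weight are assumptions for this sketch.
    """
    rewards = log_likelihoods - penalty_weight * (avg_budgets - target_budget) ** 2
    # Group-relative baseline: standardize rewards within the group so
    # the policy gradient compares schedules against each other rather
    # than against an absolute scale.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: four counterfactual schedules for the same token path.
ll = torch.tensor([-12.3, -11.8, -13.1, -12.0])   # sequence log-likelihoods
budgets = torch.tensor([0.55, 0.62, 0.48, 0.50])  # episode-average budgets
advantages = grpo_advantages(ll, budgets, target_budget=0.5)
```

Standardizing within the group plays the role of a learned baseline: each schedule is rewarded only for beating its counterfactual siblings on the same token path, which avoids training a separate value network.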
