MinT: Managed Infrastructure for Training and Serving Millions of LLMs
Mind Lab: Song Cao, Vic Cao, Andrew Chen + 58 more
TLDR
MinT is a managed infrastructure system for efficiently training and serving millions of LoRA-adapted LLMs over shared base models.
Key contributions
- Provides managed infrastructure for LoRA post-training and online serving.
- Scales LoRA RL to frontier-scale dense and MoE architectures beyond 1T parameters.
- Reduces handoff overhead by moving only small exported LoRA adapters, cutting measured step time by up to 18.3x (size arithmetic sketched after this list).
- Manages million-scale LoRA policy catalogs and thousands of active adapters on shared base models.
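To see why adapter-only handoff is so cheap, here is a back-of-the-envelope sketch. The hidden width, layer count, and set of adapted projections below are illustrative assumptions, not MinT's actual configuration:

```python
# Back-of-the-envelope: size of a rank-1 LoRA adapter vs. a 4B base model.
# Hidden width, layer count, and adapted projections are assumed values.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)

d = 4096                # hidden width (assumed)
layers = 36             # transformer blocks (assumed)
adapted_per_layer = 4   # e.g. q/k/v/o projections (assumed)
rank = 1

adapter = layers * adapted_per_layer * lora_params(d, d, rank)
base = 4_000_000_000    # nominal 4B dense base model

print(f"adapter params: {adapter:,}")             # ~1.2M
print(f"fraction of base: {adapter / base:.4%}")  # ~0.03%, well under 1%
```

At rank 1 the adapter amounts to a few megabytes of tensors, consistent with the paper's under-1%-of-base figure, so shipping adapters instead of merged checkpoints removes almost all of the data movement per step.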
Why it matters
This paper tackles the challenge of efficiently managing and deploying vast numbers of LoRA-fine-tuned LLMs. By keeping base models resident and moving only small adapter revisions, MinT sharply reduces resource overhead, making large-scale experimentation with and serving of personalized or task-specific models practical at scale.
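A minimal sketch of this resident-base, moving-adapter serving pattern follows, assuming a hypothetical engine API; the class and method names (`load_adapter`, `unload`, `generate`) are illustrative, not MinT's actual interface:

```python
# Sketch: a durable adapter catalog over one resident base-model engine.
# Only a bounded working set of adapters is kept loaded on the GPU;
# the base model itself is never moved or reloaded.
from collections import OrderedDict

class AdapterCatalog:
    def __init__(self, engine, max_active: int = 1000):
        self.engine = engine          # shared base-model deployment (assumed API)
        self.active = OrderedDict()   # adapter_id -> GPU handle, LRU order
        self.max_active = max_active

    def generate(self, adapter_id: str, prompt: str) -> str:
        handle = self.active.pop(adapter_id, None)
        if handle is None:
            if len(self.active) >= self.max_active:
                # Evict the least-recently-used adapter; base stays resident.
                _, old = self.active.popitem(last=False)
                self.engine.unload(old)
            # Cold load moves only the small LoRA tensors, not the base.
            handle = self.engine.load_adapter(adapter_id)
        self.active[adapter_id] = handle  # re-insert as most recently used
        return self.engine.generate(prompt, adapter=handle)
```

The point of the separation is that the catalog can address far more adapters than fit in GPU memory at once, with cold loads handled as routine, scheduled service work.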
Original Abstract
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step time by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
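The abstract's packed MoE LoRA tensors likely target the cost of loading many tiny per-expert weights. A minimal sketch of the general packing idea follows; the shapes and packing scheme are assumptions, not MinT's on-disk format:

```python
# Sketch: pack per-expert LoRA weights into contiguous stacked tensors.
# Unpacked, each expert contributes its own small (A, B) pair, which means
# many tiny host-to-device copies; packed, the whole wave moves in two copies.
import torch

num_experts, d, rank = 64, 2048, 8  # assumed MoE LoRA dimensions

# Unpacked: one (A, B) pair per expert -> 2 * num_experts transfers.
unpacked = [(torch.randn(rank, d), torch.randn(d, rank))
            for _ in range(num_experts)]

# Packed: stack along a leading expert axis -> 2 transfers total.
packed_A = torch.stack([a for a, _ in unpacked])  # (E, rank, d)
packed_B = torch.stack([b for _, b in unpacked])  # (E, d, rank)

# Engine-side indexing recovers per-expert views without extra copies.
expert_id = 17
a17, b17 = packed_A[expert_id], packed_B[expert_id]
assert torch.equal(a17, unpacked[expert_id][0])
```

Collapsing many small loads into a couple of large contiguous ones is a plausible mechanism behind the reported 8.5-8.7x live-loading improvement, since transfer and dispatch overhead is dominated by the number of operations rather than total bytes at these sizes.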