LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
TLDR
LoRA enables efficient fine-tuning of large language models by freezing the pre-trained weights and injecting trainable low-rank matrices, drastically reducing trainable parameters and GPU memory usage without sacrificing performance.
Key contributions
- Proposes Low-Rank Adaptation (LoRA), which freezes the pre-trained weights and injects trainable low-rank matrices into each Transformer layer for efficient fine-tuning (see the sketch after this list).
- Reduces trainable parameters by up to 10,000x and GPU memory usage by 3x relative to fine-tuning GPT-3 175B with Adam.
- Matches or exceeds full fine-tuning quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, with higher training throughput and, unlike adapters, no added inference latency.
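To make the adaptation mechanism in the first bullet concrete, below is a minimal PyTorch sketch of a LoRA-style linear layer. It is an illustration under assumptions, not the authors' released implementation (available in the linked repository): the class name `LoRALinear`, the `merge()` helper, and the default hyperparameters are hypothetical, while the core idea, a frozen weight W plus a trainable rank-r update scaled by alpha/r, follows the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Stand-in for the pre-trained weight; in practice W is loaded from
        # the base model and kept frozen (requires_grad=False).
        self.weight = nn.Parameter(torch.empty(out_features, in_features), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # Trainable low-rank factors: A has shape (r, in), B has shape (out, r).
        # B starts at zero so the adapted model equals the base model at init.
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + scaling * B @ A, computed without ever
        # materializing the full-rank update.
        base = F.linear(x, self.weight)
        update = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return base + self.scaling * update

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the low-rank update into W for deployment: the layer becomes
        # a plain linear map again, so inference incurs no extra latency.
        self.weight += self.scaling * (self.lora_B @ self.lora_A)
```

For a 1024x1024 weight with r=8, this layer trains 2 * 8 * 1024 = 16,384 values instead of 1024 * 1024 = 1,048,576 frozen ones, and calling `merge()` before deployment folds the update back into W, which is why LoRA adds no inference latency.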
Why it matters
As language models grow larger, full fine-tuning becomes computationally and economically impractical, especially for deploying multiple task-specific versions. LoRA addresses this challenge by enabling parameter-efficient adaptation that maintains model quality while significantly reducing resource demands. This makes it feasible to customize massive models for diverse applications, accelerating research and deployment in NLP.
Original Abstract
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.