CRAFT: Clustered Regression for Adaptive Filtering of Training data
Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda
TLDR
CRAFT efficiently selects high-quality training data subsets for seq2seq models using a two-stage clustered regression, delivering large selection speedups with strong downstream performance.
Key contributions
- Proposes CRAFT, a vectorization-agnostic two-stage method for selecting training data for seq2seq models.
- Matches validation source distribution via k-means clusters and selects target pairs minimizing conditional distance.
- Achieves 43.34 BLEU on En-Hi translation, outperforming TSDS by 2.13 points while selecting data over 40x faster.
- Selection is 2.8x faster than TAROT, and with TF-IDF vectorization the full pipeline completes in under a minute on CPU.
Why it matters
Efficient data selection is crucial for fine-tuning large models, reducing costs and improving performance. CRAFT provides a fast, effective solution, significantly speeding up the selection process while maintaining or improving model quality. This makes large-scale model fine-tuning more accessible and practical.
Original Abstract
Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
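The two-stage selection described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' code): it assumes precomputed source/target embedding matrices, uses a small numpy-only k-means in place of whatever clustering implementation CRAFT uses, and measures the conditional expected distance as the mean Euclidean distance to the cluster's validation targets.

```python
import numpy as np

def craft_select(train_src, train_tgt, val_src, val_tgt, budget, k=4, iters=20, seed=0):
    """Hypothetical sketch of CRAFT's two-stage selection.

    Stage 1: k-means over validation *source* embeddings, then allocate the
    selection budget across clusters in proportion to validation cluster sizes.
    Stage 2: within each cluster, rank candidate training pairs by the mean
    distance of their *target* embedding to that cluster's validation targets,
    and keep the closest ones.
    Returns indices into the training set.
    """
    rng = np.random.default_rng(seed)

    # --- Stage 1: cluster validation sources (naive Lloyd's k-means) ---
    centroids = val_src[rng.choice(len(val_src), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(val_src[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = val_src[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)

    # Proportional budget allocation (floor, then spread the remainder)
    counts = np.bincount(assign, minlength=k)
    alloc = np.floor(budget * counts / counts.sum()).astype(int)
    alloc[: budget - alloc.sum()] += 1

    # Assign each training source to its nearest validation centroid
    td = np.linalg.norm(train_src[:, None] - centroids[None], axis=-1)
    train_assign = td.argmin(axis=1)

    # --- Stage 2: conditional target-distance selection per cluster ---
    selected = []
    for c in range(k):
        idx = np.where(train_assign == c)[0]
        vt = val_tgt[assign == c]
        if len(idx) == 0 or len(vt) == 0 or alloc[c] == 0:
            continue
        # Mean distance from each candidate target to this cluster's validation targets
        score = np.linalg.norm(train_tgt[idx][:, None] - vt[None], axis=-1).mean(axis=1)
        selected.extend(idx[np.argsort(score)[: alloc[c]]])
    return np.array(sorted(selected))
```

The sketch can return slightly fewer than `budget` pairs if a cluster has no training candidates; the paper's KL-divergence bound concerns the proportional allocation step, which this code reproduces only in spirit.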