ArXiv TLDR

Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus

arXiv: 2605.03742

Mullosharaf K. Arabov

cs.CL

TLDR

This paper benchmarks PEFT strategies for LLMs on Tajik, creating the largest open Tajik corpus and finding that Mistral 7B with QLoRA (r=16) performs best.

Key contributions

  • Created and released the Tajik Web Corpus, the largest open-access Tajik dataset (~1.11 billion characters).
  • Systematically benchmarked 17 LLM configurations using full fine-tuning, LoRA, and QLoRA for Tajik text generation.
  • Found that Mistral 7B with QLoRA (r=16) achieved the best perplexity (5.03) for Tajik while keeping computational costs low (a configuration sketch follows this list).
  • Observed that full fine-tuning of small GPT-2 models yielded lower perplexity (3.48 for GPT-2 Medium) but induced catastrophic forgetting.
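
The best-performing setup above maps onto a standard QLoRA recipe. Below is a minimal sketch, assuming the Hugging Face transformers/peft/bitsandbytes stack; the paper does not publish its training code, so the checkpoint name, target modules, and all hyperparameters other than r=16 are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint; paper only says "Mistral 7B"

# QLoRA: quantize the frozen base model to 4-bit NF4 and train only the adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# LoRA inserts trainable low-rank factors B @ A (rank r) alongside selected
# weight matrices; r=16 is the rank the paper reports as best.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of the 7B params
```

Because only the low-rank adapters train while the 4-bit base stays frozen, the 7B model fits in far less GPU memory than full fine-tuning; per the paper, raising the rank from 8 to 16 gave no statistically significant quality gain while increasing memory consumption.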

Why it matters

This work addresses the critical lack of resources for low-resource languages like Tajik by providing a substantial new corpus. It offers practical, data-driven recommendations for efficient LLM adaptation, guiding researchers in selecting optimal architectures and fine-tuning strategies to minimize computational costs.

Original Abstract

This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik, comprising 319,298 documents (~1.11 billion characters). On a subsample of 10,000 documents, 17 configurations were benchmarked, covering autoregressive, encoder-decoder, and encoder-only models with three fine-tuning strategies: full fine-tuning, LoRA, and QLoRA (ranks 8 and 16). Quality was assessed via perplexity and cross-entropy loss; peak GPU memory and training time were also recorded. Best results were achieved by Mistral 7B with QLoRA (r=16): mean perplexity 5.03, standard deviation 0.03. Increasing rank from 8 to 16 gave statistically insignificant improvement while raising memory consumption. For small GPT-2 family models, full fine-tuning yielded lower perplexity (3.48 for GPT-2 Medium) than LoRA (7.60-8.42), but induced catastrophic forgetting. The encoder-only XLM-RoBERTa showed the worst results (perplexity 59.3). The novelty lies in creating the largest verified Tajik corpus and the first systematic analysis of PEFT effectiveness for Tajik text generation. Practical value lies in recommendations for architecture and fine-tuning strategy selection, optimizing computational costs without substantial quality loss.
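
The abstract's two quality metrics are directly linked: perplexity is the exponential of the mean token-level cross-entropy loss. A minimal evaluation sketch of that relation, assuming a Hugging Face causal LM (the paper's evaluation pipeline is not published, and the model and sample text here are illustrative):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative stand-in; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Забони тоҷикӣ бо хатти кириллӣ навишта мешавад."  # sample Tajik sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = math.exp(loss.item())  # PPL = exp(cross-entropy)
print(f"cross-entropy: {loss.item():.3f}  perplexity: {perplexity:.2f}")
```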
