ArXiv TLDR

An empirical study of LoRA-based fine-tuning of large language models for automated test case generation

arXiv: 2604.06946

Milad Moradi, Ke Yan, David Colwell, Rhona Asgari

cs.SE, cs.AI

TLDR

This study empirically evaluates LoRA fine-tuning of LLMs for automated test case generation, showing that fine-tuned open-source models can approach the performance of proprietary ones.

Key contributions

  • Empirical study on LoRA fine-tuning for LLMs in automated test case generation from natural language requirements.
  • Evaluates key LoRA hyperparameters (rank, scaling factor, dropout; see the configuration sketch after this list) and introduces a GPT-4o-based framework for assessing test case quality across nine dimensions.
  • Shows LoRA fine-tuning significantly boosts the performance of all open-source LLMs, with Ministral-8B performing best among them.
  • Demonstrates that a fine-tuned 8B open-source model achieves performance comparable to GPT-4.1 without task-specific fine-tuning.

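The LoRA hyperparameters called out above map directly onto a standard parameter-efficient fine-tuning setup. Below is a minimal sketch using Hugging Face peft, assuming a causal-LM checkpoint; the checkpoint name and the hyperparameter values are illustrative, not the settings reported in the paper.

```python
# Hypothetical LoRA setup with Hugging Face peft; checkpoint name and values
# are illustrative assumptions, not the paper's reported configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-8B-Instruct-2410"  # assumed Ministral-8B checkpoint
)

lora_config = LoraConfig(
    r=16,               # rank of the low-rank update matrices
    lora_alpha=32,      # scaling factor applied to the update
    lora_dropout=0.05,  # dropout on the LoRA layers
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```
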
Why it matters

This paper is important because it demonstrates that cost-efficient, locally deployable open-source LLMs, when fine-tuned with LoRA, can become viable alternatives to expensive proprietary systems for automated test case generation. It provides crucial insights into effective fine-tuning strategies and model selection, making advanced test automation more accessible.

Original Abstract

Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.
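
For context, the GPT-4o-based evaluation described in the abstract amounts to an LLM-as-judge scoring step. The sketch below shows one plausible shape for such a step using the OpenAI Python SDK; the dimension names, prompt wording, and 1-5 scale are assumptions for illustration, not the paper's nine-dimension framework.

```python
# Hypothetical LLM-as-judge scoring, loosely modeled on the GPT-4o-based
# evaluation mentioned in the abstract; dimensions and prompt are assumed.
import json
from openai import OpenAI

client = OpenAI()

# Illustrative dimensions only; the paper defines nine, not listed here.
DIMENSIONS = ["correctness", "completeness", "clarity", "traceability"]

def score_test_case(requirement: str, test_case: str) -> dict:
    prompt = (
        "Rate the generated test case against the requirement on a 1-5 scale "
        f"for each dimension: {', '.join(DIMENSIONS)}. "
        "Return a JSON object mapping each dimension to its score.\n\n"
        f"Requirement:\n{requirement}\n\nTest case:\n{test_case}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)
```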
