STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky
TLDR
STELLAR-E is an automated system that generates high-quality synthetic datasets for rigorous, scalable, and domain-adaptable evaluation of LLM applications.
Key contributions
- Introduces STELLAR-E, an automated system generating custom, high-quality synthetic datasets for LLM evaluation.
- Utilizes a modified TGRT Self-Instruct framework for controllable data generation without relying on pre-existing datasets (a generation-loop sketch follows this list).
- Includes an evaluation pipeline with statistical and LLM-based metrics to assess synthetic dataset quality (see the scoring sketch after the abstract).
- Demonstrates comparable quality: models score on average +5.7% higher under LLM-as-a-judge evaluation on the synthetic datasets than on existing language-specific benchmarks.
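The paper does not include implementation code, so the following is a minimal, hypothetical sketch of what a Self-Instruct-style generation loop looks like, assuming a generic text-completion client. The `llm_complete` stub, the prompt wording, and the exact-match de-duplication filter are illustrative assumptions, not the authors' TGRT Self-Instruct implementation.

```python
import random

# Hypothetical stand-in for an LLM API call; not from the paper.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def generate_synthetic_dataset(seed_tasks, domain, target_size, num_examples=4):
    """Self-Instruct-style loop: start from a few human-written seed tasks
    and repeatedly prompt the LLM, with in-context examples, to produce
    new domain-constrained tasks until the pool reaches target_size."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        examples = random.sample(pool, min(num_examples, len(pool)))
        prompt = (
            f"You write evaluation tasks for the {domain} domain.\n"
            "Here are some existing tasks:\n"
            + "\n".join(f"- {t}" for t in examples)
            + "\nWrite one new, distinct task:"
        )
        candidate = llm_complete(prompt).strip()
        # Naive exact-match de-duplication; the paper's pipeline applies
        # stricter statistical and LLM-based quality checks instead.
        if candidate and candidate not in pool:
            pool.append(candidate)
    return pool[:target_size]
```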
Why it matters
This paper addresses the critical need for robust, domain-specific LLM evaluation datasets, which are challenging to collect manually due to privacy and cost. STELLAR-E offers a scalable, automated solution, enabling faster and more efficient quality assurance for LLM applications. It provides a fair and adaptable benchmarking framework.
Original Abstract
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) we modify the TGRT Self-Instruct framework into a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) we build an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for evaluating LLM-based applications. The synthetic datasets reach an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially smaller models, this work establishes a scalable, domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality-assurance cycles.
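The abstract reports LLM-as-a-judge scores but does not specify the rubric or judge prompt, so the sketch below shows a generic form such scoring typically takes. The 1-10 rubric, the `JUDGE_TEMPLATE` wording, and the `judge_model` stub are illustrative placeholders, not the paper's setup.

```python
from statistics import mean

# Hypothetical judge call; swap in any chat-completion client.
def judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in your judge LLM here")

JUDGE_TEMPLATE = (
    "Rate the following answer on a 1-10 scale for correctness and "
    "completeness. Reply with the number only.\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def llm_as_judge_score(pairs):
    """Average judge score over (question, answer) pairs, normalized to [0, 1]."""
    scores = []
    for question, answer in pairs:
        reply = judge_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
        try:
            scores.append(int(reply.strip()) / 10.0)
        except ValueError:
            continue  # skip malformed judge replies
    return mean(scores) if scores else 0.0

# Under this reading, the paper's +5.7% is the average gap between a
# model's score on the synthetic set and on a real benchmark:
# delta = llm_as_judge_score(synthetic_pairs) - llm_as_judge_score(benchmark_pairs)
```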