ArXiv TLDR

Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

arXiv: 2605.09781

Dongxin Guo, Jikun Wu, Siu Ming Yiu

cs.NE, cs.AI, cs.CL, cs.LG

TLDR

QD-LLM uses neuroevolution to evolve prompt embeddings, enabling diverse and high-quality LLM outputs without fine-tuning.

Key contributions

  • Evolves prompt embeddings via gradient-free neuroevolution to steer frozen LLMs for diverse generation.
  • Introduces hybrid behavior characterization with formal coverage bounds for diverse output assessment.
  • Achieves 46.4% higher coverage and 41.4% higher QD-Score on benchmarks vs. baselines.
  • Improves downstream tasks: test generation (34% more edge cases) and fine-tuning data quality (+8.3% accuracy gain).
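The core loop behind these contributions is Quality-Diversity optimization over prompt embeddings. A minimal MAP-Elites-style sketch is below; the LLM, the real quality score, and the hybrid behavior characterization are replaced by toy stand-ins, and all names and parameter values are hypothetical, not the paper's implementation.

```python
import random

EMB_DIM = 8        # toy embedding size (the paper evolves ~32K parameters)
GRID = 10          # cells per behavior dimension in the archive
GENERATIONS = 200

def evaluate(emb):
    """Stand-in for generating with a frozen LLM and scoring the output.
    Returns (quality, behavior descriptor in [0, 1]^2)."""
    quality = -sum(x * x for x in emb)            # toy quality: prefer small norm
    clip = lambda v: min(max(v, 0.0), 1.0)
    b1 = clip((sum(emb[:4]) / 4.0 + 1.0) / 2.0)   # toy "semantic" feature
    b2 = clip((sum(emb[4:]) / 4.0 + 1.0) / 2.0)   # toy "explicit" feature
    return quality, (b1, b2)

def cell(behavior):
    """Map a behavior descriptor to a discrete archive cell."""
    return tuple(min(int(b * GRID), GRID - 1) for b in behavior)

def mutate(emb, sigma=0.1):
    """Gradient-free variation: Gaussian perturbation of the embedding."""
    return [x + random.gauss(0.0, sigma) for x in emb]

def map_elites():
    random.seed(0)
    archive = {}  # cell -> (quality, embedding): best elite per behavior cell
    for _ in range(GENERATIONS):
        if archive:
            parent = random.choice(list(archive.values()))[1]
            child = mutate(parent)
        else:
            child = [random.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]
        q, b = evaluate(child)
        c = cell(b)
        if c not in archive or q > archive[c][0]:
            archive[c] = (q, child)   # keep the highest-quality elite per cell
    return archive

archive = map_elites()
coverage = len(archive) / (GRID * GRID)
print(f"cells filled: {len(archive)}, coverage: {coverage:.2f}")
```

Coverage here is simply the fraction of occupied archive cells; the paper's QD-Score would additionally sum elite qualities across the archive.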

Why it matters

This paper tackles LLM mode collapse, enabling diverse, high-quality outputs without expensive fine-tuning. By evolving compact prompt embeddings instead of model weights, the method stays parameter-efficient while improving downstream test generation and fine-tuning data quality, effectively bridging neuroevolution with modern LLMs.

Original Abstract

Large Language Models exhibit mode collapse, producing homogeneous outputs that fail to explore valid solution spaces. We present QD-LLM, a framework for parameter-efficient neuroevolution that evolves prompt embeddings, compact neural interfaces (~32K parameters) that steer generation in frozen LLMs (70B+ parameters), within a Quality-Diversity (QD) optimization framework. Our contributions: (1) evolved prompt embeddings via gradient-free optimization enabling behavioral steering without model fine-tuning; (2) hybrid behavior characterization combining semantic and explicit features with formal coverage bounds (Theorem 1) under validated near-independence (NMI $= 0.08 \pm 0.02$); (3) co-evolutionary variation operators including targeted behavioral mutation via finite-difference gradient estimation. On HumanEval (164 problems), MBPP, and creative writing benchmarks, QD-LLM achieves 46.4% higher coverage and 41.4% higher QD-Score than QDAIF ($p<0.001$, 30 runs, Vargha-Delaney $A=0.94$). We demonstrate downstream utility: diverse archives improve test generation (34% more edge cases) and fine-tuning data quality (8.3% accuracy gain). We validate across open-source LLMs (Llama-3-70B, Mistral-Large) with full embedding access, establishing prompt embedding evolution as an effective paradigm bridging neuroevolution and modern LLMs.
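The abstract's third contribution, "targeted behavioral mutation via finite-difference gradient estimation", can be illustrated with a small sketch. Since the behavior characterizer is not differentiable through a frozen LLM, one can estimate its sensitivity to the embedding numerically and nudge the embedding toward a desired behavior value. Everything below (the scalar `behavior` function, step sizes, function names) is a hypothetical stand-in, not the paper's code.

```python
def behavior(emb):
    """Toy scalar behavior feature of an embedding (stand-in for the
    real behavior characterization of an LLM's output)."""
    return sum(x * x for x in emb)

def fd_gradient(emb, eps=1e-4):
    """Central finite-difference estimate of d behavior / d embedding."""
    grad = []
    for i in range(len(emb)):
        plus, minus = emb[:], emb[:]
        plus[i] += eps
        minus[i] -= eps
        grad.append((behavior(plus) - behavior(minus)) / (2 * eps))
    return grad

def targeted_mutation(emb, target, step=0.05):
    """Nudge the embedding so its behavior moves toward `target`."""
    g = fd_gradient(emb)
    direction = 1.0 if target > behavior(emb) else -1.0
    return [x + direction * step * gi for x, gi in zip(emb, g)]

emb = [0.5, -0.3, 0.2]
before = behavior(emb)                              # 0.38 for this toy emb
after = behavior(targeted_mutation(emb, target=1.0))
print(f"behavior moved {before:.3f} -> {after:.3f} toward target 1.0")
```

The point of the design is that only black-box evaluations of the behavior function are needed, so it composes with any frozen LLM that exposes embedding access.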
