ArXiv TLDR

Uncertainty Quantification for LLM-based Code Generation

arXiv: 2605.12201

Senrong Xu, Yuhao Tan, Yanke Zhou, Guangyuan Wu, Zenan Li + 4 more

cs.SE, cs.AI

TLDR

RisCoSet quantifies uncertainty in LLM-based code generation by constructing risk-controlled prediction sets, represented as partial programs, that are guaranteed to contain a correct solution with high confidence.

Key contributions

  • Proposes RisCoSet, a novel approach for uncertainty quantification in LLM-based code generation.
  • Leverages multiple hypothesis testing to construct risk-controlling prediction sets for code.
  • Generates partial programs guaranteed to contain a correct solution with high confidence.
  • Reduces code removal by up to 24.5% compared with the state of the art, at the same risk level.
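The summary does not spell out RisCoSet's algorithm, but the core idea of selecting a prediction-set threshold via multiple hypothesis testing can be sketched in a generic conformal-risk-control style. Everything below is an assumption for illustration: the threshold parameter `tau`, the 0/1 loss convention, and the Bonferroni correction are standard ingredients of risk-controlling prediction sets, not the paper's actual procedure.

```python
import math

def binomial_pvalue(k, n, alpha):
    """P-value for H0: true failure rate >= alpha, given k failures in n
    calibration trials. Under H0 the tail P(X <= k) is maximized at
    rate = alpha, so a small value lets us reject H0 (risk is controlled)."""
    return sum(math.comb(n, i) * alpha**i * (1 - alpha)**(n - i)
               for i in range(k + 1))

def select_threshold(calib_losses_by_tau, alpha=0.1, delta=0.05):
    """Hypothetical threshold selection for a risk-controlling prediction set.

    calib_losses_by_tau maps each candidate threshold tau to a list of 0/1
    losses on calibration problems (1 = the prediction set built at tau
    contained no correct solution).  Each tau is tested with a binomial test
    at a Bonferroni-corrected level delta/m, and the largest tau that passes
    is returned (None if none passes)."""
    taus = sorted(calib_losses_by_tau)
    m = len(taus)  # number of simultaneous hypotheses
    valid = []
    for tau in taus:
        losses = calib_losses_by_tau[tau]
        k, n = sum(losses), len(losses)
        if binomial_pvalue(k, n, alpha) <= delta / m:  # Bonferroni correction
            valid.append(tau)
    return max(valid) if valid else None
```

With this sketch, any returned threshold controls the miss rate at level `alpha` with probability at least `1 - delta`, simultaneously over all candidates — the kind of guarantee the paper's prediction sets provide, though its actual test statistic and search over partial programs may differ.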

Why it matters

This paper addresses a critical challenge in LLM-based code generation: reliably quantifying uncertainty. By providing a method to generate highly confident, risk-controlled prediction sets, it significantly improves the trustworthiness and practical utility of AI-generated code, making it more robust for real-world applications.

Original Abstract

Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt proposes PAC prediction sets but is limited by its strong monotonicity assumption on risk and single-label classification framework, which severely limits the space of candidate programs and cannot accommodate the multiple valid outputs inherent to code generation. To address these limitations, we propose an approach RisCoSet that leverages multiple hypothesis testing to construct risk-controlling predictions for LLM-based code generation. Given a trained code generation model, we produce a prediction set represented by a partial program, which is guaranteed to contain a correct solution with high confidence. Extensive experiments on three LLMs demonstrate the effectiveness of the proposed method. For instance, compared with the state-of-the-art, our method can significantly reduce the code removal by up to 24.5%, at the same level of risk.
