Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude
TLDR
This paper introduces fine-grained confidence calibration methods for LLMs in automated code revision tasks, producing confidence scores that more faithfully reflect the likelihood that a model's output is correct.
Key contributions
- Tackles LLM confidence miscalibration in automated code revision (ACR) tasks.
- Introduces local Platt-scaling for three distinct fine-grained confidence scores.
- Demonstrates fine-grained scores consistently lower calibration error across diverse models/tasks.
- Finds calibration error further reduced when combined with global Platt-scaling.
Why it matters
LLMs are vital in software engineering but their imperfections hinder productivity. This work provides a practical solution for well-calibrated confidence scores, enabling more trustworthy and efficient use of LLMs in automated code revision tasks.
Original Abstract
In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. A canonical mitigation is to provide calibrated confidence scores that faithfully reflect the likelihood of correctness at the instance level. Such information allows users to make immediate decisions about output acceptance, abstain from error-prone outputs, and better align their expectations with the model's capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods. Global Platt-scaling of sequence-level confidence scores has proven effective in many generative software engineering tasks, but remains unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 different models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution to eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined usage of imperfect models in ACR tasks.
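To make the core technique concrete, here is a minimal sketch of Platt scaling as the abstract describes it at the global level: fit a sigmoid `p = σ(a·s + b)` mapping a raw confidence score `s` (e.g., a sequence-level log-probability from the LLM) to the empirical probability of correctness, by minimising log loss on held-out labelled examples. The function names and the plain gradient-descent fit are illustrative assumptions, not the paper's implementation; the paper's local variant would fit such maps at a finer granularity (e.g., per edit) rather than one global map.

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit sigmoid(a*s + b) to binary correctness labels by
    gradient descent on the log loss (Platt scaling).
    scores: raw model confidence scores; labels: 1 = output correct.
    Illustrative sketch only -- not the paper's exact procedure."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            # d(log loss)/da = (p - y) * s, d/db = (p - y)
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrated_confidence(score, a, b):
    """Map a raw score to a calibrated probability of correctness."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))
```

A user-facing system could then threshold `calibrated_confidence` to accept, flag, or abstain from a proposed code revision.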