ArXiv TLDR

Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling

arXiv:2604.15124

Zhijun Guo, Alvina Lai, Emmanouil Korakas, Aristeidis Vagenas, Irshad Ahamed + 4 more

cs.CL

TLDR

In a blinded multi-rater evaluation, a retrieval-grounded LLM outperformed clinician-authored responses on quality, empathy, and actionability in CGM-informed diabetes counseling.

Key contributions

  • Developed a retrieval-grounded LLM-based conversational agent for CGM interpretation and diabetes counseling support (a minimal pipeline sketch follows this list).
  • LLM responses received significantly higher overall quality scores than clinician-authored responses (mean 4.37 vs 3.58; P<.001).
  • The LLM's largest advantages were in empathy (+1.06 points) and actionability (+0.99 points).
  • Safety concerns were similarly rare for LLM and clinician-authored responses (major flags in 0.7% of ratings in each group).
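The paper's implementation is not shown here, so the following is a minimal, self-contained sketch of what a retrieval-grounded pipeline of this kind could look like: a toy word-overlap retriever over a few guideline snippets, plus a prompt builder that grounds the answer in retrieved context while explicitly ruling out individualized therapeutic advice (the constraint the abstract describes). The snippet texts, `retrieve`, and `build_prompt` are illustrative stand-ins, not the authors' system.

```python
# Minimal sketch of a retrieval-grounded prompt builder for CGM counseling.
# The guideline snippets, scoring rule, and prompt wording are illustrative
# stand-ins, not the study's actual system or knowledge base.

from dataclasses import dataclass


@dataclass
class CGMSummary:
    time_in_range_pct: float    # % of readings within 3.9-10.0 mmol/L
    time_below_range_pct: float
    mean_glucose_mmol: float


# Tiny in-memory "knowledge base"; a real system would index guideline text.
GUIDELINE_SNIPPETS = [
    "A time-in-range above 70% is a common consensus target for many adults.",
    "Time below range over 4% suggests the hypoglycemia pattern should be "
    "discussed with the care team.",
    "Post-meal glucose rises are influenced by carbohydrate amount and timing.",
]


def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Rank snippets by naive word overlap with the query (a stand-in for an
    embedding-based retriever) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(snippets,
                    key=lambda s: len(q_words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]


def build_prompt(summary: CGMSummary, question: str) -> str:
    """Assemble a grounded prompt that asks for a plain-language explanation
    while forbidding individualized therapeutic advice, mirroring the
    constraint described in the abstract."""
    context = "\n".join(f"- {s}" for s in retrieve(question, GUIDELINE_SNIPPETS))
    return (
        "You explain CGM data in plain language. Do NOT give individualized "
        "therapeutic advice (e.g., dose changes); direct such questions to "
        "the care team.\n"
        f"CGM summary: time in range {summary.time_in_range_pct:.0f}%, "
        f"time below range {summary.time_below_range_pct:.1f}%, "
        f"mean glucose {summary.mean_glucose_mmol:.1f} mmol/L.\n"
        f"Guideline context:\n{context}\n"
        f"Patient question: {question}"
    )


if __name__ == "__main__":
    s = CGMSummary(time_in_range_pct=64, time_below_range_pct=5.2,
                   mean_glucose_mmol=8.9)
    print(build_prompt(s, "Why does my glucose go low at night?"))
```

A production system would swap the word-overlap scoring for an embedding-based retriever over an indexed guideline corpus; the grounding-plus-constraint prompt structure is the point of the sketch.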

Why it matters

This study highlights the potential of LLMs to enhance patient education and pre-consultation preparation in diabetes care. With rated quality and empathy exceeding clinician-authored responses, retrieval-grounded systems could serve as valuable adjunct tools. The authors nonetheless caution against autonomous therapeutic decision-making and emphasize the need for human oversight.

Original Abstract

Background: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited.

Objective: To evaluate whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations.

Methods: We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models.

Results: A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each).

Conclusions: Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
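The primary analysis is described only as "linear mixed-effects models", so the exact random-effects structure is an assumption here. The sketch below fits one plausible specification on simulated ratings, with a fixed effect for response source and a random intercept per rater, using statsmodels; the simulated means are set to the reported 4.37 and 3.58 so the source coefficient lands near the published 0.78-point difference. The data are synthetic, not the study's.

```python
# Sketch of the primary analysis: a linear mixed-effects model estimating the
# CA-vs-clinician mean quality difference with a random intercept per rater.
# Simulated ratings; the study's actual random-effects structure may differ.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

n_raters, n_per_source = 6, 48  # illustrative sizes, not the study's design
rows = []
for rater in range(n_raters):
    rater_bias = rng.normal(0, 0.3)  # rater leniency (random intercept)
    for source, base in [("clinician", 3.58), ("CA", 4.37)]:
        for _ in range(n_per_source):
            score = base + rater_bias + rng.normal(0, 0.5)
            rows.append({"rater": rater, "source": source,
                         "score": float(np.clip(score, 1, 5))})
df = pd.DataFrame(rows)

# Fixed effect of interest: source (CA vs clinician, clinician as reference);
# random intercept grouped by rater.
model = smf.mixedlm("score ~ C(source, Treatment('clinician'))",
                    df, groups=df["rater"])
result = model.fit()
print(result.summary())  # the source coefficient approximates the 0.78 gap
```

The real model likely also accounts for case- or question-level clustering, which `mixedlm`'s `vc_formula` argument can express as additional variance components.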
