ArXiv TLDR

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2604.20791

Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio + 2 more

cs.CL, cs.AI

TLDR

LLMs in healthcare often lack empathy and clarity, but collaborative rewriting significantly improves their communication alignment with clinical standards.

Key contributions

  • Baseline LLMs show higher negative affect and linguistic complexity than physicians.
  • Empathy-oriented prompting reduces negativity and complexity but doesn't boost semantic fidelity.
  • Collaborative rewriting, particularly rephrasing, achieves highest semantic similarity and improves readability.
  • Patients consistently prefer rewritten LLM responses for clarity and tone, but no model surpasses physicians on epistemic criteria.
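Semantic similarity scores like the mean of 0.93 cited here are typically cosine similarities between text embeddings. The summary doesn't name the embedding model the study used, so this minimal sketch just applies the cosine formula to hypothetical toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for physician / rewritten-LLM answer
# embeddings; the actual study would embed full answers with a sentence encoder.
physician = [0.8, 0.1, 0.6]
rewritten = [0.7, 0.2, 0.6]
print(round(cosine_similarity(physician, rewritten), 2))  # → 0.99
```

A score near 1 means the rewritten answer stays semantically faithful to the physician's; orthogonal (unrelated) answers would score near 0.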

Why it matters

Compared with physicians, LLMs often lack empathy and clarity in clinical communication. This paper shows that collaborative rewriting significantly improves both their alignment with clinical standards and patient preference, which is crucial for responsibly integrating AI into healthcare as a communication enhancer rather than a replacement for clinical expertise.

Original Abstract

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
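The FKGL figures quoted in the abstract (e.g. 16.91-17.60 for larger models vs. 11.47-12.50 for physicians) come from the standard Flesch-Kincaid Grade Level formula over sentence length and syllable density. A minimal sketch, assuming a crude vowel-group syllable heuristic rather than the dictionary-based counting a real readability tool would use:

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real FKGL tools use syllable dictionaries."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:  # drop a typical silent final 'e'
        n -= 1
    return max(n, 1)

def fkgl(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words)) - 15.59)

# Short monosyllabic prose scores a very low grade level:
print(round(fkgl("The cat sat on the mat."), 2))  # → -1.45
```

Longer sentences and polysyllabic vocabulary push the score up toward the 16-17 range reported for GPT-5 and Claude, i.e. college-level reading difficulty.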
