ArXiv TLDR

Evaluating the False Trust engendered by LLM Explanations

arXiv: 2605.10930

Vardhan Palod, Upasana Biswas, Subbarao Kambhampati

cs.HC

TLDR

Study finds common LLM explanations foster false trust, while a novel dual explanation method significantly improves users' ability to discern AI correctness.

Key contributions

  • Common LLM explanations (reasoning traces, their summaries, and post-hoc explanations) are persuasive, increasing user acceptance regardless of AI correctness.
  • These standard explanations are uninformative, failing to help users identify incorrect AI-generated answers.
  • A contrastive "dual explanation" method, presenting arguments for and against the AI's answer, significantly improves users' error detection (see the sketch after this list).
  • The study develops a user-centered evaluation protocol to measure the false trust engendered by different LLM explanation types.
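
To make the dual explanation condition concrete, here is a minimal sketch of how pro and con arguments for an AI answer might be elicited. The prompt wording, the `gpt-4o` model name, and the use of the OpenAI chat-completions client are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a "dual explanation": ask the model for an argument
# supporting its answer and an argument against it, then show both to the
# user instead of a single one-sided explanation.
from openai import OpenAI  # assumed client; any chat-completion API would do

client = OpenAI()

def dual_explanation(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Return a pro argument and a con argument for a proposed answer."""
    prompts = {
        "pro": f"Question: {question}\nProposed answer: {answer}\n"
               "Give the strongest argument that this answer is correct.",
        "con": f"Question: {question}\nProposed answer: {answer}\n"
               "Give the strongest argument that this answer is incorrect.",
    }
    return {
        side: client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        for side, prompt in prompts.items()
    }
```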

Why it matters

As LLMs are deployed for critical tasks, understanding how explanations influence user trust is vital. This research shows that typical explanations often foster dangerous false trust, but it also introduces a promising dual explanation method that genuinely helps users identify AI errors, an important step toward more trustworthy AI systems.

Original Abstract

Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly used for critical tasks, yet they provide no guarantees about the correctness of their solutions. Users must decide whether to trust the model's answer, aided by reasoning traces, their summaries, or post-hoc generated explanations. These reasoning traces, despite evidence that they are neither faithful representations of the model's computations nor necessarily semantically meaningful, are often interpreted as provenance explanations. It is unclear whether explanations or reasoning traces help users identify when the AI is incorrect, or whether they simply persuade users to trust the AI regardless. In this paper, we take a user-centered approach and develop an evaluation protocol to study how different explanation types affect users' ability to judge the correctness of AI-generated answers and engender false trust in the users. We conduct a between-subject user study, simulating a setting where users do not have the means to verify the solution and analyze the false trust engendered by commonly used LLM explanations - reasoning traces, their summaries and post-hoc explanations. We also test a contrastive dual explanation setting where we present arguments for and against the AI's answer. We find that reasoning traces and post-hoc explanations are persuasive but not informative: they increase user acceptance of LLM predictions regardless of their correctness. In contrast, dual explanation is the only condition that genuinely improves users' ability to distinguish correct from incorrect AI outputs.
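
One way to operationalize the abstract's distinction between "persuasive" and "informative" is to compare acceptance rates on correct versus incorrect answers within each explanation condition. The sketch below is an assumption about how such an analysis might look (the trial fields and the simple rate-difference discrimination score are illustrative, not the paper's reported analysis).

```python
# Hypothetical analysis sketch: an explanation condition is "persuasive" if it
# raises acceptance overall, and "informative" only if users accept correct
# answers more often than incorrect ones (positive discrimination).

def acceptance_rate(trials, ai_correct=None):
    """trials: dicts with boolean keys 'accepted' and 'ai_correct'."""
    if ai_correct is not None:
        trials = [t for t in trials if t["ai_correct"] == ai_correct]
    return sum(t["accepted"] for t in trials) / len(trials)

def summarize(trials):
    return {
        "overall_acceptance": acceptance_rate(trials),                  # persuasion
        "discrimination": acceptance_rate(trials, ai_correct=True)
                          - acceptance_rate(trials, ai_correct=False),  # informativeness
    }

# A condition where users accept everything: highly persuasive, zero discrimination.
trials = [
    {"accepted": True, "ai_correct": True},
    {"accepted": True, "ai_correct": False},  # accepted error -> false trust
    {"accepted": True, "ai_correct": True},
    {"accepted": True, "ai_correct": False},
]
print(summarize(trials))  # overall_acceptance = 1.0, discrimination = 0.0
```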
