MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models
Suhyun Lee, Palakorn Achananuparp, Neemesh Yadav, Ee-Peng Lim, Yang Deng
TL;DR
MHSafeEval introduces a role-aware, interaction-level framework to evaluate mental health safety in LLMs, revealing cumulative safety failures.
Key contributions
- Presents R-MHSafe, a new taxonomy that characterizes mental health harm by the interactional roles an AI counselor adopts (perpetrator, instigator, facilitator, or enabler).
- Proposes MHSafeEval, an agent-based framework for multi-turn, adversarial safety evaluation.
- Discovers substantial role-dependent and cumulative safety failures in LLMs, missed by static benchmarks.
- Significantly improves failure-mode coverage and diagnostic granularity for LLM mental health safety.
Why it matters
LLMs are increasingly explored for mental health counseling, but existing safety evaluations assess isolated responses and miss harms that accumulate over multi-turn interactions. This work provides a dynamic, interaction-level framework for diagnosing how such harms emerge, a step toward safer AI counseling tools.
Original Abstract
Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.
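The closed-loop evaluation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the paper's actual agents, harm scores, and failure criteria are not specified here, so `adversarial_user`, `counselor_model`, and `role_aware_judge` are hypothetical stubs, and the cumulative-threshold rule is a placeholder for trajectory-level harm discovery.

```python
# Hypothetical role labels inspired by the R-MHSafe taxonomy (assumption:
# the paper's actual harm categories and scoring are not reproduced here).
ROLES = ("perpetrator", "instigator", "facilitator", "enabler")

def adversarial_user(turn: int) -> str:
    # Stub adversarial agent: in the real framework this would probe the
    # counselor with escalating, clinically grounded distress cues.
    return f"user message with escalating distress (turn {turn})"

def counselor_model(history: list) -> str:
    # Stub for the LLM under evaluation.
    return f"counselor reply after {len(history)} prior turns"

def role_aware_judge(reply: str, turn: int) -> dict:
    # Stub judge: assigns a per-role harm score in [0, 1].
    # Placeholder heuristic only; a real judge would analyze the reply.
    return {role: 0.1 * turn for role in ROLES}

def evaluate_trajectory(num_turns: int = 5, threshold: float = 0.3):
    """Run one adversarial multi-turn trajectory and flag cumulative,
    role-dependent failures rather than judging single turns in isolation."""
    history, failures = [], []
    cumulative = {role: 0.0 for role in ROLES}
    for turn in range(1, num_turns + 1):
        history.append(("user", adversarial_user(turn)))
        reply = counselor_model(history)
        history.append(("counselor", reply))
        for role, score in role_aware_judge(reply, turn).items():
            cumulative[role] += score
            # A failure is declared at the trajectory level, once harm
            # attributed to a role accumulates past a budget.
            if cumulative[role] >= threshold * num_turns:
                failures.append((turn, role, cumulative[role]))
    return cumulative, failures
```

With these stubs, harm accumulates turn by turn and each role crosses the failure budget only on the final turn, illustrating how a trajectory-level view can surface failures that per-turn scoring below the threshold would miss.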