ArXiv TLDR

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

2604.11655

Riccardo Rosati, Edoardo Colucci, Massimiliano Bolognini, Adriano Mancini, Paolo Sernani

cs.CL cs.AI cs.MA

TLDR

RPA-Check is a multi-stage automated framework for objectively evaluating dynamic LLM-based role-playing agents in complex environments.

Key contributions

  • Introduces RPA-Check, a multi-stage automated framework for evaluating LLM role-playing agents.
  • Assesses role adherence, logical consistency, and long-term narrative stability.
  • Employs a four-step pipeline: Dimension Definition, Augmentation, Semantic Filtering, and LLM-as-a-Judge.
  • Validated on LLM Court, showing smaller models (8-9B) can outperform larger ones in consistency.
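The four-step pipeline above can be sketched as follows. This is a minimal illustrative mock, not the authors' implementation: all dimension names, checklist indicators, and the `verifier` callback are hypothetical stand-ins for the paper's actual criteria and chain-of-thought judge calls.

```python
# Hypothetical sketch of an RPA-Check-style four-step pipeline.
# Dimensions, indicators, and the verifier are illustrative placeholders.

def define_dimensions():
    # Step 1: Dimension Definition — high-level qualitative behavioral criteria.
    return ["role adherence", "logical consistency", "narrative stability"]

def augment(dimensions):
    # Step 2: Augmentation — expand each dimension into granular
    # boolean checklist indicators (example templates, not from the paper).
    templates = {
        "role adherence": ["stays in character", "uses role-appropriate register"],
        "logical consistency": ["no contradictions with prior turns"],
        "narrative stability": ["maintains established facts across the session"],
    }
    return [(d, ind) for d in dimensions for ind in templates.get(d, [])]

def semantic_filter(indicators):
    # Step 3: Semantic Filtering — exact-duplicate removal stands in here
    # for the paper's objectivity, redundancy, and agent-isolation checks.
    seen, kept = set(), []
    for dim, ind in indicators:
        if ind not in seen:
            seen.add(ind)
            kept.append((dim, ind))
    return kept

def judge(transcript, indicators, verifier):
    # Step 4: LLM-as-a-Judge — `verifier` mocks a chain-of-thought LLM call
    # returning True/False per boolean indicator; fidelity is the pass rate.
    results = {ind: verifier(transcript, ind) for _, ind in indicators}
    fidelity = sum(results.values()) / len(results)
    return fidelity, results

# Usage with a trivial always-pass verifier:
indicators = semantic_filter(augment(define_dimensions()))
score, detail = judge("<agent transcript>", indicators, lambda t, i: True)
```

In a real deployment, `verifier` would prompt a judge LLM with the transcript and one indicator at a time, asking for step-by-step reasoning before a boolean verdict.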

Why it matters

Standard NLP metrics fail to capture the behavior of dynamic LLM role-playing agents. RPA-Check provides a standardized, objective, and reproducible framework for this, and its results reveal, surprisingly, that smaller instruction-tuned models can outperform larger ones in procedural consistency. This is crucial for developing reliable generative agents.

Original Abstract

The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.
