ArXiv TLDR

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

2604.16270

Van-Truong Le

cs.CL, cs.AI

TLDR

This paper evaluates LLMs on Vietnamese legal texts using a dual-aspect framework, revealing trade-offs between readability and accurate legal reasoning.

Key contributions

  • Introduces a dual-aspect framework for evaluating LLMs on complex Vietnamese legal texts.
  • Benchmarks 4 SOTA LLMs (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Grok-1) on Accuracy, Readability, and Consistency.
  • Conducts a large-scale error analysis with an expert-validated typology on 60 Vietnamese legal articles.
  • Reveals LLM trade-offs between readability and accurate legal reasoning, highlighting 'Incorrect Example' and 'Misinterpretation' errors.
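The two aspects above can be pictured as a small evaluation harness: aspect one averages per-article scores over the three dimensions, aspect two tallies error-type counts from the expert-validated typology. The sketch below is purely illustrative — the dimension names, error labels, and toy data are assumptions, not the paper's actual scoring protocol.

```python
from collections import Counter
from statistics import mean

# Hypothetical dual-aspect evaluation sketch (names and data are
# illustrative, not taken from the paper's released materials).
DIMENSIONS = ("accuracy", "readability", "consistency")
ERROR_TYPOLOGY = ("incorrect_example", "misinterpretation", "omission")

def benchmark(scores_per_article):
    """Aspect 1: average each dimension over the annotated articles."""
    return {dim: mean(a[dim] for a in scores_per_article) for dim in DIMENSIONS}

def error_profile(annotations):
    """Aspect 2: count occurrences of each error type across annotations."""
    counts = Counter(err for article in annotations for err in article)
    return {err: counts.get(err, 0) for err in ERROR_TYPOLOGY}

# Toy data: per-article scores and annotated errors for one model.
scores = [
    {"accuracy": 0.9, "readability": 0.6, "consistency": 0.8},
    {"accuracy": 0.7, "readability": 0.8, "consistency": 0.8},
]
errors = [["misinterpretation"], ["incorrect_example", "misinterpretation"]]

print(benchmark(scores))      # per-dimension averages
print(error_profile(errors))  # error-type counts
```

Combining the two views is what surfaces the trade-off the paper reports: a model can score well on the benchmark averages while its error profile reveals frequent reasoning failures.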

Why it matters

This paper provides a crucial, in-depth assessment of LLMs for legal applications, moving beyond surface-level metrics. It highlights that accurate legal reasoning, not just summarization, remains a significant challenge for current models. This work offers actionable insights for developing more reliable legal AI.

Original Abstract

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints 'Incorrect Example' and 'Misinterpretation' as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.
