RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

April 21, 2026 · arXiv:2604.19593

Mircea Timpuriu, Dumitru-Clementin Cercel

cs.CL, cs.AI, cs.LG

TLDR

This paper introduces RoLegalGEC, the first parallel dataset for Romanian legal grammatical error correction, and evaluates neural models on it.

Key contributions

Introduces RoLegalGEC, the first parallel dataset for Romanian legal grammatical error correction.
Comprises 350,000 annotated examples of grammatical errors in legal passages.
Evaluates various neural network models, including Transformers, for detection and correction tasks.
Enriches the resource base for further research on Romanian natural language processing.

Why it matters

Clear legal text is crucial, but Romanian lacks specialized grammatical error correction resources for this domain. This dataset and model evaluation fill a critical gap, enabling better legal AI tools and advancing Romanian NLP research.

Original Abstract

The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.
