EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

April 23, 2026 · arXiv:2604.21890

Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi

cs.CL

TLDR

Introduces EVENT5Ws, a large, manually annotated open-domain event extraction dataset, benchmarking LLMs and enabling generalizable algorithms.

Key contributions

Created EVENT5Ws, a large, manually annotated, open-domain dataset for event extraction.
Designed a systematic annotation pipeline and provided insights into annotation complexity.
Evaluated SOTA LLMs on EVENT5Ws, establishing a new benchmark for open-domain event extraction.
Demonstrated that models trained on EVENT5Ws generalize effectively to diverse geographical contexts.

Why it matters

This paper addresses the need for robust open-domain event extraction by introducing EVENT5Ws, a large, manually annotated dataset. It provides a benchmark for evaluating large language models and shows that models trained on the dataset generalize to other geographical contexts, which supports real-world applications such as emergency response.

Original Abstract

Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.
