SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe
TLDR
SwanNLP proposes an LLM-based framework for narrative word sense disambiguation, showing that large commercial LLMs with dynamic few-shot prompting closely replicate human plausibility judgments.
Key contributions
- Proposed an LLM-based framework for plausibility scoring of word senses in narrative texts using structured reasoning.
- Explored fine-tuning low-parameter LLMs and dynamic few-shot prompting for large-parameter models.
- Found large commercial LLMs with dynamic few-shot prompting closely replicate human plausibility judgments.
- Demonstrated that model ensembling slightly improves performance and better simulates the agreement patterns of human annotators.
Why it matters
This paper addresses a gap in LLM evaluation by testing their practical applicability to word-sense plausibility judgments in narrative contexts. It shows that advanced LLMs can achieve human-like judgments, paving the way for more nuanced, context-aware NLU systems.
Original Abstract
Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
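The ensembling result in the abstract, where averaging several models' scores tracks the mean of five human annotators better than any single model, can be sketched with plain arithmetic. The data below is invented for illustration; the paper does not report these numbers, and a simple mean ensemble is only one plausible reading of the method.

```python
from statistics import mean

def ensemble(preds_per_model):
    """Average each item's score across models (a simple mean ensemble)."""
    return [mean(scores) for scores in zip(*preds_per_model)]

def mae(preds, gold):
    """Mean absolute error against the gold score per item."""
    return mean(abs(p - g) for p, g in zip(preds, gold))

# Hypothetical gold labels: the mean of five annotators' 1-5 ratings per item.
gold = [mean([5, 4, 4, 4, 3]), mean([2, 2, 2, 2, 2]), mean([3, 3, 3, 3, 3])]

# Hypothetical per-item scores from two single models.
model_a = [5, 1, 3]   # overshoots item 1, undershoots item 2
model_b = [3, 3, 3]   # errs in the opposite direction

ens = ensemble([model_a, model_b])  # [4.0, 2.0, 3.0]
```

Because the two models err in opposite directions, their average lands closer to the annotator mean than either model alone, which mirrors the slight improvement the paper reports.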