ArXiv TLDR

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

arXiv: 2604.24645

Soyeon Kim, Cheongwoong Kang, Myeongjin Lee, Eun-Chul Chang, Jaedeok Lee + 1 more

cs.CL, cs.AI

TLDR

K-MetBench is a new benchmark for evaluating multimodal LLMs on Korean weather forecasting, revealing gaps in expert reasoning, locality, and multimodality.

Key contributions

  • Introduces K-MetBench, a diagnostic benchmark for Korean weather-forecasting LLMs, grounded in national qualification exams.
  • Evaluates models along four dimensions: expert visual reasoning, logical validity, Korean geo-cultural comprehension, and fine-grained domain analysis.
  • Reveals significant modality and reasoning gaps, with models hallucinating their rationales even when their predictions are correct.
  • Shows Korean models outperforming much larger global models in local contexts, highlighting that scale alone cannot resolve cultural dependencies.

Why it matters

This paper introduces K-MetBench, a benchmark for developing reliable, culturally aware AI assistants in expert domains like weather forecasting. It shows that raw model scale isn't enough: specialized, local, and multimodal evaluation is essential. The findings provide a clear roadmap for improving expert AI.

Original Abstract

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench.
