ArXiv TLDR

K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

arXiv: 2604.24645

Soyeon Kim, Cheongwoong Kang, Myeongjin Lee, Eun-Chul Chang, Jaedeok Lee + 1 more

cs.CL, cs.AI

TLDR

K-MetBench is a new benchmark for evaluating multimodal LLMs on Korean weather forecasting, revealing gaps in expert reasoning, locality, and multimodality.

Key contributions

  • Introduces K-MetBench, a diagnostic benchmark for Korean weather-forecasting LLMs, grounded in national qualification exams.
  • Evaluates models along four dimensions: expert visual reasoning, logical validity, Korean geo-cultural comprehension, and fine-grained domain analysis.
  • Reveals significant modality and reasoning gaps, with models hallucinating their rationales even when their predictions are correct.
  • Shows Korean models outperforming much larger global models in local contexts, highlighting that scale alone cannot resolve cultural dependencies.

Why it matters

This paper introduces K-MetBench, a benchmark for developing reliable, culturally aware AI assistants in expert domains like weather forecasting. It shows that raw model scale isn't enough: specialized, local, and multimodal evaluation is essential. The findings provide a clear roadmap for improving expert AI.

Original Abstract

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench.
