ArXiv TLDR

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

arXiv: 2604.15210

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff + 2 more

cs.AI · cs.CL

TLDR

This paper introduces IRS (Incongruity-Resolution Supervision), a framework that uses incongruity-resolution supervision to teach multimodal models structured reasoning for humor understanding.

Key contributions

  • Introduces IRS, a framework for multimodal humor understanding based on incongruity-resolution theory.
  • Decomposes humor into incongruity modeling, resolution modeling, and preference alignment.
  • Supervises intermediate reasoning with structured traces, making humor interpretation explicit and learnable.
  • Achieves state-of-the-art performance on NYCC, with its largest model approaching expert-level ranking.
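
The three-stage decomposition can be pictured as a structured reasoning trace attached to each candidate caption. The sketch below is purely illustrative; the class and field names are assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    """Hypothetical trace mirroring the IRS decomposition (names are illustrative)."""
    incongruity: str         # mismatch identified in the visual scene
    resolution: str          # coherent reinterpretation of that mismatch
    preference_score: float  # alignment with human judgments (higher = preferred)

def rank_captions(traces):
    """Order candidate interpretations by preference score, best first."""
    return sorted(traces, key=lambda t: t.preference_score, reverse=True)

# Toy example: two interpretations of the same cartoon incongruity.
traces = [
    ReasoningTrace("a dog wears a business suit",
                   "the office is literally a dog-eat-dog world", 0.7),
    ReasoningTrace("a dog wears a business suit",
                   "casual Friday went too far", 0.4),
]
best = rank_captions(traces)[0]
print(best.resolution)
```

Supervising such traces, rather than only the final caption choice, is what makes the path from perception to interpretation explicit and learnable.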

Why it matters

This paper addresses a critical gap in AI humor understanding by moving beyond black-box prediction to structured reasoning. By explicitly supervising the cognitive steps behind humor comprehension, it enables models to "think" like human captionists. The results suggest that reasoning structure, not model scale alone, is what matters for complex cognitive tasks.

Original Abstract

Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.