ArXiv TLDR

When Drawing Is Not Enough: Exploring Spontaneous Speech with Sketch for Intent Alignment in Multimodal LLMs

arXiv: 2604.11964

Weiyan Shi, Dorien Herremans, Kenny Tsu Wei Choo

cs.HC, cs.MM

TLDR

This paper shows that combining spontaneous speech with sketches significantly improves Multimodal LLMs' ability to align with user intent in early design ideation.

Key contributions

  • Introduces TalkSketchD, a dataset of spontaneous speech temporally aligned with freehand sketches.
  • Compares MLLM sketch-to-image generation with and without concurrent speech transcripts.
  • Uses a reasoning MLLM to judge generated images against designers' self-reported intent.
  • Shows that adding concurrent speech significantly improves judged intent alignment across form, function, experience, and overall intent.

Why it matters

Early design sketches often leave intent implicit, which makes them hard for MLLMs to interpret. This work shows that spontaneous speech recorded while sketching supplies that missing context, significantly improving MLLMs' alignment with designers' true intent, and points toward more effective AI-powered design tools.

Original Abstract

Early-stage design ideation often relies on rough sketches created under time pressure, leaving much of the designer's intent implicit. In practice, designers frequently speak while sketching, verbally articulating functional goals and ideas that are difficult to express visually. We introduce TalkSketchD, a sketch-while-speaking dataset that captures spontaneous speech temporally aligned with freehand sketches during early-stage toaster ideation. To examine the dataset's value, we conduct a sketch-to-image generation study comparing sketch-only inputs with sketches augmented by concurrent speech transcripts using multimodal large language models (MLLMs). Generated images are evaluated against designers' self-reported intent using a reasoning MLLM as a judge. Quantitative results show that incorporating spontaneous speech significantly improves judged intent alignment of generated design images across form, function, experience, and overall intent. These findings demonstrate that temporally aligned sketch-and-speech data can enhance MLLMs' ability to interpret user intent in early-stage design ideation.
