Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue
Biswesh Mohapatra, Giovanni Duca, Laurent Romary, Justine Cassell
TLDR
This paper proposes a visual scaffolding framework using machine mental imagery to enhance common ground in situated dialogue and mitigate 'representational blur'.
Key contributions
- Addresses 'representational blur' in situated dialogue where agents fail to maintain persistent common ground.
- Proposes an active visual scaffolding framework that builds persistent visual histories from dialogue state.
- This framework reduces representational blur and enforces concrete scene commitments, improving common ground.
- A hybrid multimodal approach, integrating visual and textual information, yields the best performance.
Why it matters
This paper addresses a critical flaw in conversational agents: their inability to maintain persistent common ground, leading to 'representational blur.' By introducing a visual scaffolding framework, it significantly improves dialogue coherence and robustness for real-world applications.
Original Abstract
Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call 'representational blur', in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.
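To make the core mechanism concrete, here is a minimal sketch of the incremental-externalization idea the abstract describes: each dialogue turn is converted into explicit "scene commitments" stored in a persistent history, which is later retrieved instead of re-reading raw past utterances. All names (`SceneCommitment`, `VisualHistory`, `externalize`, `retrieve`) are hypothetical illustrations, not the paper's actual implementation, which operates on real visual representations rather than attribute dictionaries.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of incremental externalization; not the paper's code.

@dataclass
class SceneCommitment:
    """A concrete, depictive commitment extracted from one dialogue turn."""
    turn: int
    entity: str
    attributes: dict

@dataclass
class VisualHistory:
    """Persistent store of scene commitments, built incrementally per turn."""
    commitments: list = field(default_factory=list)

    def externalize(self, turn: int, turn_entities: dict) -> None:
        # Convert the current dialogue state into explicit commitments,
        # rather than relying on the raw text of earlier turns.
        for entity, attrs in turn_entities.items():
            self.commitments.append(SceneCommitment(turn, entity, dict(attrs)))

    def retrieve(self, entity: str):
        # Return the most recent commitment for an entity, so two similar
        # entities cannot "blur" into interchangeable descriptions.
        for c in reversed(self.commitments):
            if c.entity == entity:
                return c
        return None

history = VisualHistory()
history.externalize(1, {"mug_left": {"color": "red"},
                        "mug_right": {"color": "blue"}})
history.externalize(2, {"mug_left": {"color": "red", "position": "near window"}})
latest = history.retrieve("mug_left")  # turn-2 commitment wins
```

The design choice to illustrate is that grounding comes from the stored commitments, not from the context window: once a distinction (red vs. blue mug) has been externalized, it survives however long the dialogue runs.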