ArXiv TLDR

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

🐦 Tweet
2604.02315

Sarath Shekkizhar, Romain Cosentino, Adam Earle

cs.AI

TLDR

This paper introduces user-turn generation as a novel method to probe interaction awareness in LLMs, revealing it's distinct from task accuracy.

Key contributions

  • Proposes user-turn generation to measure LLM interaction awareness beyond assistant responses.
  • Shows interaction awareness is decoupled from task accuracy across 11 LLMs and 5 datasets.
  • Reveals latent interaction awareness in LLMs through higher temperature sampling.
  • Demonstrates improved interaction awareness via collaboration-oriented post-training.

Why it matters

Current LLM benchmarks only evaluate assistant responses, missing crucial interaction awareness. This paper introduces a new probe, user-turn generation, to uncover this hidden dimension. It highlights that models can be accurate without being interactively aware, opening new avenues for training more collaborative LLMs.

Original Abstract

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41\%$ ($0.8$B) to $96.8\%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent with follow up rates reaching $22\%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.