ArXiv TLDR

Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation

2604.28031

Garvin Kruthof

cs.CL

TLDR

In multi-turn ideation, LLMs often violate the original constraints even when they can still accurately recall them, a phenomenon the paper calls "knows-but-violates."

Key contributions

  • Introduced DriftBench, a benchmark to evaluate LLM constraint adherence in multi-turn ideation.
  • Found LLMs often violate original constraints despite recalling them (knows-but-violates, KBV), with rates up to 99%.
  • Iterative pressure increases complexity and reduces adherence; checkpointing only partially mitigates KBV.
  • Released DriftBench as an open benchmark, including all data, for further research.

Why it matters

This paper reveals a critical flaw in LLM-assisted ideation: models often violate original constraints despite recalling them. Understanding and mitigating this "knows-but-violates" phenomenon is crucial for reliable human-LLM collaboration in creative and scientific tasks. The open benchmark provides a valuable tool for future research.

Original Abstract

When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.
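The abstract defines the KBV rate as constraint non-compliance despite preserved recall. A minimal sketch of how such a rate could be computed over scored runs (the field names and schema here are hypothetical, not DriftBench's actual format):

```python
from dataclasses import dataclass

@dataclass
class Run:
    # Hypothetical per-run flags; the benchmark's real schema may differ.
    recalled: bool  # model accurately restated the original constraint
    adhered: bool   # model's output actually satisfied the constraint

def kbv_rate(runs: list[Run]) -> float:
    """Fraction of runs with preserved recall in which the constraint
    was nonetheless violated (knows-but-violates)."""
    recalled = [r for r in runs if r.recalled]
    if not recalled:
        return 0.0
    return sum(1 for r in recalled if not r.adhered) / len(recalled)

# Toy example: of three runs with preserved recall, two violate the constraint.
runs = [Run(True, False), Run(True, True), Run(True, False), Run(False, False)]
print(round(kbv_rate(runs), 3))  # 0.667
```

The key point the metric captures: runs where the model forgot the constraint (`recalled=False`) are excluded, so a high KBV rate isolates the recall/behavior dissociation rather than simple forgetting.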
