Pause or Fabricate? Training Language Models for Grounded Reasoning
Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, and 7 others
TLDR
GRIL is a new RL framework that trains LLMs to detect incomplete information, pause, and clarify, reducing fabrication and improving grounded reasoning.
Key contributions
- GRIL trains LLMs to detect missing premises and pause, preventing ungrounded reasoning.
- Decomposes reasoning into a "clarify and pause" stage and a "grounded reasoning" stage.
- Uses stage-specific rewards to penalize hallucinations and encourage clarification.
- Achieves up to 45% better premise detection and a 30% higher task success rate on GSM8K-Insufficient and MetaMATH-Insufficient, while cutting average response length by over 20%.
Why it matters
Large language models often fabricate information when inputs are incomplete, leading to unreliable conclusions. This paper introduces GRIL, a framework that teaches LLMs to recognize information gaps and seek clarification. It significantly enhances model reliability and trustworthiness for complex reasoning tasks.
Original Abstract
Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
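The abstract's core idea, stage-specific rewards that penalize proceeding without sufficient premises and reward pausing to clarify, can be illustrated with a minimal sketch. All function names, action labels, and reward values below are illustrative assumptions, not the paper's actual reward design:

```python
# Hypothetical sketch of a stage-specific reward in the spirit of GRIL.
# The stage names, actions, and numeric values are assumptions for illustration.

def stage_reward(stage: str, action: str, premises_sufficient: bool) -> float:
    """Score one turn of a clarify-then-reason episode."""
    if stage == "clarify":
        if action == "ask_clarification":
            # Pausing to ask is rewarded only when premises really are missing;
            # unnecessary clarification requests are mildly penalized.
            return 1.0 if not premises_sufficient else -0.5
        if action == "proceed":
            # Proceeding on incomplete inputs is the "ungrounded reasoning"
            # failure mode and gets the strongest penalty.
            return 1.0 if premises_sufficient else -1.0
    elif stage == "reason":
        # In the reasoning stage, only solving with established premises pays off.
        return 1.0 if action == "solve" and premises_sufficient else -1.0
    return 0.0
```

Under this kind of shaping, the policy's best response to an insufficient input is to pause and ask, then resume solving once the missing premise is supplied, which matches the clarify-then-reason behavior the paper trains for.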