ArXiv TLDR

Generating Effective CoT Traces for Mitigating Causal Hallucination

arXiv:2604.12748

Yiheng Zhao, Jun Yan

cs.CL

TLDR

This paper introduces a pipeline for generating Chain-of-Thought (CoT) traces that mitigate causal hallucination in smaller LLMs, together with a new metric, the Causal Hallucination Rate (CHR), to quantify it.

Key contributions

  • Investigates the criteria that effective CoT traces should satisfy to mitigate causal hallucination in smaller LLMs.
  • Designs a pipeline that generates CoT traces meeting these criteria for event causality identification (ECI).
  • Introduces the Causal Hallucination Rate (CHR), a new metric for quantifying causal hallucination (see the sketch after this list).
  • Shows that fine-tuning with the generated CoT traces reduces causal hallucination and improves mean accuracy in smaller LLMs.
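The digest does not define how CHR is computed. As a rough sketch only, a hallucination rate for ECI could measure how often a model asserts a causal link between event pairs that are annotated as non-causal; the labels ("causal"/"none") and the normalization below are illustrative assumptions, not the paper's definition.

```python
from typing import List

def causal_hallucination_rate(predictions: List[str], gold: List[str]) -> float:
    """Illustrative CHR sketch: the fraction of genuinely non-causal
    event pairs that the model nonetheless labels as causal. The label
    scheme and normalization are assumptions; the paper's exact
    definition is not given in this digest."""
    assert len(predictions) == len(gold)
    # Count spurious causal claims: predicted "causal" on gold "none".
    spurious = sum(1 for p, g in zip(predictions, gold)
                   if p == "causal" and g == "none")
    # Normalize by the number of non-causal pairs in the gold labels.
    non_causal = sum(1 for g in gold if g == "none")
    return spurious / non_causal if non_causal else 0.0
```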

Why it matters

This work addresses causal hallucination, a critical issue that limits the reliability of smaller LLMs. By providing both a method for generating effective CoT traces and a metric for quantifying the problem, it enables more accurate and robust event causality identification, advancing the practical deployment of smaller, more efficient models.

Original Abstract

Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently a lack of CoT trace dataset available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.
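For concreteness, a fine-tuning example built from such a CoT trace might pair an ECI question with a reasoning trace and a final answer. The prompt/completion format, field names, and example text below are assumptions for illustration; the paper's actual data schema is not given in this digest.

```python
# Hypothetical shape of one CoT fine-tuning example for ECI; the field
# names and text are illustrative, not the paper's actual format.
example = {
    "prompt": (
        "Document: The storm knocked out power across the city. "
        "Hospitals switched to backup generators.\n"
        "Question: Does the event 'knocked out power' cause the event "
        "'switched to backup generators'? Answer yes or no, with reasoning."
    ),
    "completion": (
        "Reasoning: The outage is the stated reason hospitals needed "
        "generators, so the causal chain runs from the power loss to the "
        "switch to backup power.\n"
        "Answer: yes"
    ),
}
```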

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.