ArXiv TLDR

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

2605.08070

James Petullo, Sonny George, Dylan Cashman, Nianwen Xue

cs.AI

TLDR

VecCISC improves confidence-informed self-consistency by clustering reasoning traces, cutting total token usage by 47% while maintaining or exceeding CISC's accuracy.

Key contributions

  • Proposes VecCISC, a lightweight framework for confidence-informed self-consistency.
  • Uses semantic similarity to filter redundant, degenerate, or hallucinated reasoning traces.
  • Reduces critic LLM calls, decreasing total token usage by 47% compared to CISC.
  • Maintains or exceeds the accuracy of Confidence-Informed Self-Consistency (CISC).
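The filtering step above can be illustrated with a toy sketch. The paper does not specify the embedding model or similarity threshold, so the bag-of-words embedding and the 0.9 cutoff below are illustrative assumptions only; the idea is that near-duplicate traces are collapsed to one representative, and only the survivors are sent to the critic LLM:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding. VecCISC would use a real semantic
    embedding; this stand-in is an assumption for illustration."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_traces(traces, threshold=0.9):
    """Greedily keep one representative per cluster of near-duplicate
    reasoning traces; only the kept traces need critic scoring."""
    kept, kept_vecs = [], []
    for trace in traces:
        vec = embed(trace)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(trace)
            kept_vecs.append(vec)
    return kept

traces = [
    "Add 3 and 4 to get 7",
    "Add 3 and 4 to get 7",        # duplicate trace, filtered out
    "Subtract 4 from 10 to get 6",  # distinct reasoning, kept
]
print(filter_traces(traces))  # only the two distinct traces survive
```

With a real embedding model the same loop would also catch paraphrased duplicates, which is where the critic-call savings come from.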

Why it matters

Weighted majority voting (like CISC) improves LLM reasoning but is costly due to many critic LLM calls. VecCISC addresses this by efficiently filtering reasoning traces. This makes advanced LLM inference techniques more practical and affordable, broadening their real-world applicability.
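The contrast between plain Self-Consistency and confidence-weighted voting can be sketched as follows; the candidate answers and critic confidence scores are made-up values for illustration:

```python
from collections import defaultdict

def self_consistency(answers):
    """Plain Self-Consistency: pick the most common candidate answer."""
    counts = defaultdict(int)
    for ans in answers:
        counts[ans] += 1
    return max(counts, key=counts.get)

def confidence_weighted_vote(answers, confidences):
    """CISC-style weighted majority voting: each answer accumulates the
    confidence a critic assigned to its trace; highest total wins."""
    scores = defaultdict(float)
    for ans, conf in zip(answers, confidences):
        scores[ans] += conf
    return max(scores, key=scores.get)

# Five sampled traces produce these answers; a critic scores each trace.
answers = ["42", "41", "42", "41", "41"]
confidences = [0.9, 0.3, 0.8, 0.2, 0.4]

print(self_consistency(answers))                       # "41" (3 of 5 votes)
print(confidence_weighted_vote(answers, confidences))  # "42" (1.7 vs 0.9)
```

The example shows why the critic calls matter: the unweighted majority picks "41", but the high-confidence traces flip the weighted vote to "42". Each confidence score costs an extra LLM call per trace, which is the overhead VecCISC's filtering reduces.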

Original Abstract

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.