ArXiv TLDR

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

arXiv:2604.11141

Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki, Vaibhav Shrivastava + 8 more

cs.LG · cs.CR

TLDR

This paper introduces HUMBR, a Minimum Bayes Risk framework that significantly reduces LLM hallucination in high-stakes enterprise workflows.

Key contributions

  • Introduces Hybrid Utility Minimum Bayes Risk (HUMBR) for mitigating LLM hallucination.
  • Combines semantic embedding similarity with lexical precision to identify consensus answers without ground-truth references.
  • Provides rigorous error bounds and empirical evaluation on TruthfulQA, LegalBench, and real-world data.
  • Significantly outperforms Universal Self-Consistency, with 81% of its suggestions preferred over human-crafted ground truth.
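
The MBR selection rule at the heart of this approach can be sketched as follows. Note that the specific utility blend, the `alpha` weighting, and the bag-of-words "embedding" and token-level F1 scorers below are illustrative stand-ins, not the paper's actual formulation (which would use a neural embedding model and its own lexical metric):

```python
# Sketch of Minimum Bayes Risk (MBR) candidate selection with a hybrid
# utility, in the spirit of HUMBR: pick the sampled answer that agrees most
# with the other samples, no reference answer required.
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (stand-in for a neural embedding)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def token_f1(a: str, b: str) -> float:
    """Token-level F1: a simple lexical-precision-style score."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cb.values())
    r = overlap / sum(ca.values())
    return 2 * p * r / (p + r)

def hybrid_utility(a: str, b: str, alpha: float = 0.5) -> float:
    """Blend semantic similarity with lexical overlap (alpha is a hypothetical knob)."""
    return alpha * bow_cosine(a, b) + (1 - alpha) * token_f1(a, b)

def mbr_select(candidates: list[str], alpha: float = 0.5) -> str:
    """Return the candidate with the highest expected utility against the
    other candidates, i.e. the consensus answer."""
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(hybrid_utility(c, o, alpha) for o in others) / len(others)
    return max(candidates, key=expected_utility)

samples = [
    "The contract terminates on 30 June 2025.",
    "The contract terminates on 30 June 2025 unless renewed.",
    "The agreement ends June 30, 2025.",
    "Penalty clauses apply from day one.",  # outlier / likely hallucination
]
print(mbr_select(samples))
```

The intuition: a hallucinated answer tends to disagree with the bulk of the samples on both semantic and lexical axes, so it scores low expected utility and is filtered out, while the consensus answer is selected.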

Why it matters

Hallucinations in LLMs pose severe risks in critical enterprise applications such as legal and risk management. This paper offers a robust, theoretically grounded solution that dramatically reduces these risks, outperforming existing methods and achieving high human preference. That makes it directly relevant to deploying reliable AI in high-stakes environments.

Original Abstract

Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high-stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real-world data from a Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
