Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
Chenhao Fang, Jordi Mola, Mark Harman, Jason Nawrocki, Vaibhav Shrivastava + 8 more
TLDR
This paper introduces HUMBR, a Minimum Bayes Risk framework that significantly reduces LLM hallucination in high-stakes enterprise workflows.
Key contributions
- Introduces Hybrid Utility Minimum Bayes Risk (HUMBR) for mitigating LLM hallucination.
- Synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references (see the sketch after this list).
- Provides rigorous error bounds and empirical evaluation on TruthfulQA, LegalBench, and real-world data.
- Significantly outperforms Universal Self-Consistency; 81% of the pipeline's suggestions were preferred over human-crafted ground truth.
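
For intuition, here is a minimal sketch of how a hybrid-utility MBR selection could work, assuming the utility is a convex combination of embedding cosine similarity and token-level lexical precision. `LAMBDA`, `lexical_precision`, and `select_consensus` are illustrative names and parameters, not the paper's published implementation.

```python
# Hypothetical sketch of hybrid-utility MBR consensus selection:
# sample several LLM outputs for the same prompt, then pick the one
# that maximizes expected utility against the rest of the pool.
import numpy as np
from sentence_transformers import SentenceTransformer

LAMBDA = 0.5  # hypothetical weight between semantic and lexical utility


def lexical_precision(hyp: str, ref: str) -> float:
    """Fraction of hypothesis tokens that also appear in the reference."""
    hyp_tokens = hyp.lower().split()
    ref_tokens = set(ref.lower().split())
    if not hyp_tokens:
        return 0.0
    return sum(t in ref_tokens for t in hyp_tokens) / len(hyp_tokens)


def select_consensus(candidates: list[str], model: SentenceTransformer) -> str:
    """Pick the candidate with the highest mean hybrid utility against
    all other sampled candidates (no ground-truth reference needed)."""
    if len(candidates) == 1:
        return candidates[0]
    emb = model.encode(candidates, normalize_embeddings=True)
    cos = emb @ emb.T  # pairwise cosine similarities
    n = len(candidates)
    scores = []
    for i in range(n):
        utils = [
            LAMBDA * cos[i, j]
            + (1 - LAMBDA) * lexical_precision(candidates[i], candidates[j])
            for j in range(n)
            if j != i
        ]
        scores.append(float(np.mean(utils)))
    return candidates[int(np.argmax(scores))]


# Usage:
# model = SentenceTransformer("all-MiniLM-L6-v2")
# best = select_consensus(samples, model)
```

Selecting a medoid of the sample pool this way rewards answers that agree with the majority both semantically and verbatim, which is why no ground-truth reference is required.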
Why it matters
Hallucinations in LLMs pose severe risks in critical enterprise applications such as legal review and risk management. This paper offers a robust, theoretically grounded mitigation that outperforms existing methods and achieves high human preference, which matters for deploying reliable AI in high-stakes environments.
Original Abstract
Although LLMs drive automation, high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance demand exceptional care. For Meta, and other organizations like ours, a single hallucinated clause in such high-stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely used public benchmark suites (TruthfulQA and LegalBench) as well as real-world data from a Meta production deployment. Our empirical results show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
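
For readers less familiar with MBR, the decision rule selects the candidate with the highest expected utility against the sampled pool $\mathcal{C}$. The hybrid utility below is a plausible form consistent with the abstract, with a hypothetical mixing weight $\lambda$, not the paper's exact definition:

$$
y^{*} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{C}} \; \frac{1}{|\mathcal{C}|} \sum_{y' \in \mathcal{C}} U(y, y'),
\qquad
U(y, y') \;=\; \lambda\,\mathrm{sim}_{\mathrm{emb}}(y, y') \;+\; (1 - \lambda)\,\mathrm{prec}_{\mathrm{lex}}(y, y').
$$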