FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification
Dongxin Guo, Jikun Wu, Siu Ming Yiu
TLDR
FinGround is a three-stage pipeline that reduces hallucinations in financial document QA by decomposing answers into atomic claims and verifying each one against the underlying regulatory filings, which matters for compliance.
Key contributions
- FinGround is a three-stage pipeline for detecting and grounding financial hallucinations.
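- Decomposes answers into atomic claims, classifies each with a six-type financial taxonomy, and verifies it with a type-routed strategy such as formula reconstruction.
- With identical retrieval across all systems, reduces hallucination rates by 68% over the strongest baseline (p < 0.01); the full pipeline achieves a 78% reduction relative to GPT-4o.
- An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling deployment at $0.003/query.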
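The type-routed verification idea in the bullets above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Claim` structure, the two type labels (`"reported"` and `"computed"`), the tolerance, and the toy table are hypothetical, and the paper's actual six-type taxonomy and verification strategies are not spelled out in this summary. The sketch only shows the general shape: a claim that restates a filed number is checked by lookup, while a derived quantity is recomputed from table cells.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical table representation: cells keyed by (row label, column label).
Cell = Tuple[str, str]
Table = Dict[Cell, float]


@dataclass
class Claim:
    text: str                # the atomic claim as written in the answer
    kind: str                # hypothetical type label: "reported" or "computed"
    value: float             # the number the answer asserts
    cells: Tuple[Cell, ...]  # table cells the claim depends on


def verify_reported(claim: Claim, table: Table, tol: float = 0.005) -> bool:
    """Direct lookup: the asserted value must match the cited cell."""
    cell = table.get(claim.cells[0])
    if cell is None:
        return False
    return abs(cell - claim.value) <= tol * max(abs(cell), 1.0)


def verify_margin(claim: Claim, table: Table, tol: float = 0.005) -> bool:
    """Formula reconstruction: recompute (revenue - cost) / revenue."""
    revenue = table.get(claim.cells[0])
    cost = table.get(claim.cells[1])
    if revenue is None or cost is None or revenue == 0:
        return False
    recomputed = (revenue - cost) / revenue
    return abs(recomputed - claim.value) <= tol


# Route each claim type to its own verification strategy.
ROUTES: Dict[str, Callable[[Claim, Table], bool]] = {
    "reported": verify_reported,
    "computed": verify_margin,
}


def verify(claim: Claim, table: Table) -> bool:
    return ROUTES[claim.kind](claim, table)


# Toy data, not taken from any real filing.
table: Table = {
    ("Revenue", "FY2023"): 1200.0,
    ("Cost of revenue", "FY2023"): 780.0,
}
claims = [
    Claim("FY2023 revenue was 1,200.", "reported", 1200.0,
          (("Revenue", "FY2023"),)),
    Claim("Gross margin was 40%.", "computed", 0.40,
          (("Revenue", "FY2023"), ("Cost of revenue", "FY2023"))),
]
results = [verify(c, table) for c in claims]
```

In this toy case the revenue claim passes the lookup, while the margin claim fails because recomputing from the cells gives 35%, not 40%. A detector that only matched the answer's text against the filing, without redoing the arithmetic, would have no cell containing "40%" to compare against, which is the kind of computational error the summary says uniform detectors miss.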
Why it matters
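Financial AI systems fabricate metrics, invent citations, and miscalculate derived quantities, and these errors carry direct regulatory consequences as the EU AI Act's high-risk enforcement deadline (August 2026) approaches. Existing detectors that treat all claims uniformly miss 43% of computational errors; FinGround instead routes each claim to a type-specific check, including arithmetic re-verification against structured tables. The reported reductions in hallucination rate make it relevant to high-stakes, compliance-sensitive financial applications.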
Original Abstract
Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct regulatory consequences as the EU AI Act's high-risk enforcement deadline approaches (August 2026). Existing hallucination detectors treat all claims uniformly, missing 43% of computational errors that require arithmetic re-verification against structured tables. We present FinGround, a three-stage verify-then-ground pipeline for financial document QA. Stage 1 performs finance-aware hybrid retrieval over text and tables. Stage 2 decomposes answers into atomic claims classified by a six-type financial taxonomy and verified with type-routed strategies including formula reconstruction. Stage 3 rewrites unsupported claims with paragraph- and table-cell-level citations. To cleanly isolate verification value from retrieval quality, we propose retrieval-equalized evaluation as standard methodology for RAG verification research: when all systems receive identical retrieval, FinGround still reduces hallucination rates by 68% over the strongest baseline ($p < 0.01$). The full pipeline achieves a 78% reduction relative to GPT-4o. An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling $0.003/query deployment, supported by qualitative signals from a four-week analyst pilot.
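The retrieval-equalized evaluation described in the abstract can be sketched as a small harness: retrieval runs once per question, every system answers from the same context, and hallucination rates are then compared as a relative reduction. This is a minimal sketch under stated assumptions, not the authors' evaluation code. The function names, the `judge` callback, and the stub systems are hypothetical, and the toy numbers below are not results from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical signatures: a system maps (question, context) to a list of
# atomic claims; a judge marks each claim as hallucinated (True) or not.
System = Callable[[str, str], List[str]]
Judge = Callable[[List[str], str], List[bool]]


def hallucination_rate(flags: List[bool]) -> float:
    """Fraction of claims flagged as hallucinated."""
    return sum(flags) / len(flags) if flags else 0.0


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction of `system` versus `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline rate must be non-zero")
    return (baseline - system) / baseline


def evaluate_equalized(
    systems: Dict[str, System],
    questions: List[str],
    retrieve: Callable[[str], str],
    judge: Judge,
) -> Dict[str, float]:
    # Retrieval runs once; every system sees the same context per question.
    contexts = [retrieve(q) for q in questions]
    rates: Dict[str, float] = {}
    for name, answer in systems.items():
        flags: List[bool] = []
        for question, context in zip(questions, contexts):
            flags.extend(judge(answer(question, context), context))
        rates[name] = hallucination_rate(flags)
    return rates


# Toy stubs standing in for real systems and human or automatic judging.
def baseline_system(question: str, context: str) -> List[str]:
    return ["bad", "bad", "ok", "ok"]


def verified_system(question: str, context: str) -> List[str]:
    return ["bad", "ok", "ok", "ok"] if question == "q1" else ["ok"] * 4


rates = evaluate_equalized(
    systems={"baseline": baseline_system, "verified": verified_system},
    questions=["q1", "q2"],
    retrieve=lambda q: f"context for {q}",
    judge=lambda claims, ctx: [c == "bad" for c in claims],
)
reduction = relative_reduction(rates["baseline"], rates["verified"])
```

Because the contexts are computed once and shared, any difference in the resulting rates comes from how each system answers and verifies, not from how well it retrieves. That is the separation the abstract's 68% figure relies on.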