LLM Reliability in Finance Statistics

Hallucination rates of 5% to over 20% on finance-specific factual tasks have been measured across leading models; on FinQA numerical reasoning, the best systems still trailed human expert accuracy by a wide margin. Those are the failure modes the figures below document. Each datapoint comes from peer-reviewed benchmarks and published evaluations, with source and year; none was generated by this site. Model versions change, but the structural finding holds: finance is unusually unforgiving of small numerical and citation errors, so reliability has to be built into the surrounding system.

6 stats · Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team

Statistics

The numbers worth quoting

A purpose-built benchmark measured hallucination rates of 5% to over 20% across leading models on finance-specific factual questions

Error rates varied widely by model and question type, with abstract conceptual questions and time-sensitive figures producing the most fabrications.

Source Kang and Liu, Deficiency of Large Language Models in Finance

On the FinQA numerical-reasoning benchmark, the best published systems trailed expert human accuracy by a wide margin

FinQA pairs questions with earnings-report tables and text. In the original paper, the strongest baseline reached about 61% execution accuracy against about 91% for human experts, a gap that marks multi-step financial arithmetic as a weak point.

Source Chen et al., FinQA: A Dataset of Numerical Reasoning over Financial Data
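
To make "multi-step financial arithmetic" concrete, here is a minimal sketch of how a FinQA-style reasoning program can be executed. It assumes a simplified subset of operations and a `#n` convention for referring to earlier step results; the function name `run_program` and the revenue figures are hypothetical and are not taken from the dataset.

```python
import re

# Simplified subset of FinQA-style operations (assumption: the real
# dataset defines more operations, including table aggregations).
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Matches one step such as "divide(#0, 100.0)".
STEP = re.compile(r"(\w+)\(([^()]*)\)")


def run_program(program: str) -> float:
    """Execute steps in order; '#n' refers to the result of step n."""
    results = []
    for op, raw_args in STEP.findall(program):
        args = []
        for token in raw_args.split(","):
            token = token.strip()
            if token.startswith("#"):
                args.append(results[int(token[1:])])
            else:
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]


# Hypothetical figures: revenue grew from 100.0 to 120.5.
growth = run_program("subtract(120.5, 100.0), divide(#0, 100.0)")
print(f"{growth:.3f}")  # 0.205
```

Because the second step consumes the first step's result, a wrong operand or operation early in the program makes the final answer wrong even when every later step is executed correctly.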

ConvFinQA extended financial QA to multi-turn conversations, where chained reasoning errors compounded across turns

Accuracy degraded as dialogues lengthened, because an early arithmetic mistake propagated into every dependent later answer.

Source Chen et al., ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance
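
The compounding effect can be illustrated with a back-of-the-envelope calculation. The sketch below assumes each reasoning step is correct independently with the same probability, which is a simplification; the 0.9 per-step accuracy is an illustrative value, not a figure from ConvFinQA, and `chain_accuracy` is a hypothetical helper.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a dependent chain is correct,
    assuming independent steps with equal accuracy (a simplification)."""
    return per_step_accuracy ** steps


for steps in (1, 3, 5):
    print(steps, round(chain_accuracy(0.9, steps), 3))
# 1 0.9
# 3 0.729
# 5 0.59
```

Under these assumptions, a model that gets 90% of individual steps right answers a five-step dependent chain correctly only about 59% of the time.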

Across general benchmarks, retrieval-augmented generation reduces but does not eliminate hallucination, with grounded systems still producing unsupported claims

The survey documents that grounding answers in retrieved context lowers fabrication rates yet leaves residual faithfulness errors, including citing sources that do not support the claim.

Source Ji et al., Survey of Hallucination in Natural Language Generation
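
One crude way to surface unsupported numeric claims in a grounded answer is to check that every number in the answer also appears in the retrieved context. The sketch below is an illustrative heuristic, not a method from the survey; the function names and the sample text are hypothetical, and the check will miss derived figures, unit mismatches, and non-numeric claims.

```python
import re

# Digits with optional thousands separators and decimals, e.g. "4,200" or "5.0".
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[float]:
    """Extract numeric values, normalizing '4,200' and '5.0' to floats."""
    return {float(m.replace(",", "")) for m in NUMBER.findall(text)}


def unsupported_numbers(answer: str, context: str) -> set[float]:
    """Numbers stated in the answer that never appear in the context."""
    return numbers_in(answer) - numbers_in(context)


# Hypothetical retrieved passage and model answer.
context = "Revenue was 4,200 million in 2023, up 5.0% from 4,000 million."
answer = "Revenue rose 5% to 4,200 million, with a margin of 18%."

print(unsupported_numbers(answer, context))  # {18.0}
```

Here the 18% margin is flagged because nothing in the retrieved passage supports it, while the 5% growth and 4,200 million revenue figures pass.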

The IMF cautioned that opaque AI models in finance raise model-risk and explainability concerns that current governance frameworks only partly address

The report tied reliability and explainability gaps directly to financial-stability risk when models drive trading or credit decisions at scale.

Source International Monetary Fund, Global Financial Stability Report

Domain-adapted financial language models such as BloombergGPT outperformed comparable general models on finance tasks without sacrificing performance on general-language benchmarks, while still trailing far larger general models

Domain pre-training improved financial-task accuracy, confirming that finance reliability benefits from specialization but is not solved by it.

Source Wu et al., BloombergGPT: A Large Language Model for Finance

Key Takeaways

Measured hallucination rates on finance questions span 5 percent to over 20 percent depending on model and task.

Multi-step numerical reasoning over financial tables is a persistent weak point versus expert humans.

Conversational finance compounds errors across turns, so early mistakes propagate.

Retrieval grounding lowers but does not remove fabrication and source misattribution.

Domain adaptation helps finance accuracy but does not by itself make a model reliable enough to trust unverified.

Methodology

Figures are drawn from peer-reviewed benchmarks and published evaluations, an ACM Computing Surveys article, and an IMF report, each cited by its original source. Benchmark numbers reflect the model versions evaluated in the cited work and will shift as models are updated. No statistic on this page is derived from data collected by this site.

Sources & References

Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination — Haoqiang Kang and Xiao-Yang Liu (2023)
FinQA: A Dataset of Numerical Reasoning over Financial Data — Zhiyu Chen et al., EMNLP (2021)
Survey of Hallucination in Natural Language Generation — Ziwei Ji et al., ACM Computing Surveys (2023)
