LLM Accuracy on Financial Benchmarks Statistics

On FinBen, GPT-4 led on quantification, extraction, and numerical reasoning, but models broadly struggled with forecasting and generation tasks. That pattern (strong extraction, weak prediction) runs across the figures below. Each datapoint comes from a published benchmark paper or study, cited with source and year; none was generated by this site. Numbers reflect the model versions tested in the cited work. Models update, so use these figures to understand the shape of relative performance rather than as a current leaderboard.

5 stats · Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team

Statistics

The numbers worth quoting

FinBen evaluated 15 leading LLMs across 36 datasets and 24 financial tasks spanning extraction, analysis, QA, generation, risk, forecasting, and decision-making

FinBen is among the most comprehensive open financial benchmarks and was the first to add stock-trading evaluation, giving a broad picture of where models succeed and fail.

Source: Xie et al., FinBen: A Holistic Financial Benchmark for Large Language Models (2024)

On FinBen, GPT-4 led on quantification, extraction, numerical reasoning, and stock trading, while models broadly struggled with text generation and forecasting

The benchmark found strength concentrated in structured extraction and analysis and weakness in open-ended generation and prediction, a pattern that holds across the models tested.

Source: Xie et al., FinBen: A Holistic Financial Benchmark for Large Language Models (2024)

A fine-tuned FinBERT reached about 0.88 accuracy and 0.87 F1 on the Financial PhraseBank sentiment task, competitive with or ahead of few-shot GPT-4o

The study found GPT-4o with few-shot examples could match a well fine-tuned FinBERT, indicating a small specialized model still holds its own on a focused finance classification task.

Source: Shen and Zhang, Financial Sentiment Analysis on News and Reports Using Large Language Models and FinBERT (2024)
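
To make the two metrics in this statistic concrete, here is a minimal sketch of how accuracy and F1 are computed for a three-class sentiment task. It assumes macro-averaged F1, which the figure above does not specify, and the labels below are hypothetical toy data, not examples from the Financial PhraseBank or the cited study.

```python
# Assumed label set for a three-class sentiment task.
LABELS = ("negative", "neutral", "positive")


def accuracy(gold, pred):
    """Fraction of examples where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 (an assumption; other averages exist)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the cited study.
gold = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

On imbalanced label sets the two numbers can diverge: a model that rarely predicts the minority class can keep a respectable accuracy while its macro F1 drops, which is one reason sentiment studies tend to report both.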

On the FinQA numerical-reasoning benchmark, the best system in the original paper (FinQANet with RoBERTa-large) reached about 61% execution accuracy, against about 91% for human experts

FinQA pairs questions with earnings-report tables and text. The model-versus-expert gap shows multi-step financial arithmetic remains a structural weak point, even as newer models improve.

Source: Chen et al., FinQA: A Dataset of Numerical Reasoning over Financial Data (2021)
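
FinQA answers are expressed as short programs of arithmetic steps, where a later step can refer to an earlier result. The sketch below shows why multi-step arithmetic is unforgiving: one wrong operand anywhere in the chain changes the final answer. It is a simplified illustration, not the benchmark's official evaluation script. It covers only four operations, and the function names, tolerance, and numbers are assumptions chosen for the example.

```python
import re

# Only four arithmetic operations are handled in this simplified sketch.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Matches one step such as "subtract(5829, 5735)".
STEP = re.compile(r"(\w+)\(([^()]*)\)")


def run_program(program):
    """Execute steps left to right; '#k' refers to the result of step k."""
    results = []
    for op, raw_args in STEP.findall(program):
        args = []
        for token in raw_args.split(","):
            token = token.strip()
            if token.startswith("#"):
                args.append(results[int(token[1:])])
            else:
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]


def execution_match(pred_program, gold_answer, tol=1e-4):
    """True if the predicted program runs and lands within tol of the gold answer."""
    try:
        return abs(run_program(pred_program) - gold_answer) <= tol
    except (KeyError, IndexError, ValueError, ZeroDivisionError, TypeError):
        return False


# Hypothetical question: percentage change from 5735 to 5829.
gold = (5829 - 5735) / 5735
correct = "subtract(5829, 5735), divide(#0, 5735)"
wrong = "subtract(5829, 5735), divide(#0, 5829)"  # divides by the wrong year

print(execution_match(correct, gold))
print(execution_match(wrong, gold))
```

The second program picks the right numbers and the right operations but divides by the wrong base, so it fails the check. Scoring by executed result rather than by text overlap is what makes this kind of benchmark sensitive to small reasoning slips.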

Domain-adapted financial models such as BloombergGPT outperformed comparable general models on finance tasks without sacrificing performance on general-language benchmarks, while still trailing far larger general models

Domain pre-training improved finance-task accuracy, confirming that specialization helps but does not by itself close the gap on harder reasoning.

Source: Wu et al., BloombergGPT: A Large Language Model for Finance (2023)

Key Takeaways

LLM accuracy on finance is task-dependent: strong on extraction and analysis, weak on forecasting and generation.

On FinBen's 24 tasks, GPT-4 led structured tasks but generation and forecasting remained hard across models.

A fine-tuned FinBERT stays competitive with few-shot GPT-4o on focused sentiment classification.

Multi-step numerical reasoning (FinQA) still trails expert humans by a wide margin.

Domain adaptation (BloombergGPT) helps finance accuracy but does not solve general reasoning.

Methodology

Figures are drawn from peer-reviewed benchmark papers (FinBen, FinQA, BloombergGPT) and a published sentiment-analysis study, each reported with its source and year. Benchmark numbers reflect the model versions evaluated in the cited work and will shift as models update. No statistic on this page is derived from data collected by this site.

Sources & References

FinBen: A Holistic Financial Benchmark for Large Language Models — Qianqian Xie et al., NeurIPS Datasets and Benchmarks (2024)
Financial Sentiment Analysis on News and Reports Using Large Language Models and FinBERT — Yanxin Shen and Pulin Kirin Zhang (2024)
FinQA: A Dataset of Numerical Reasoning over Financial Data — Zhiyu Chen et al., EMNLP (2021)
BloombergGPT: A Large Language Model for Finance — Shijie Wu et al. (2023)