The short answer

Four public benchmarks cover financial QA over SEC filings, from short passages to multi-filing reasoning: FinQA (short-passage numeric; human experts score 91.16%), FinanceBench (open-book whole-filing QA; GPT-4-Turbo with retrieval answered incorrectly or refused on 81% of a 150-case sample), DocFinQA (full filings, averaging ~123K words), and Fin-RATE 2026 (cross-entity and longitudinal; best model ~43% vs. human experts ~83-92%). A whole-filing read costs ~$0.018 on DeepSeek V4 Flash, ~$0.21 on Gemini 3.5 Flash, and ~$0.68 on Claude Opus 4.8 (all 1M context, list prices verified 2026-06-16).

Four public benchmarks decide whether an LLM can answer questions over SEC filings, annual reports, and earnings calls: FinQA, FinanceBench, DocFinQA, and the 2026 Fin-RATE release. They differ on the axis that matters most for production work — document length. FinQA tests one short passage; FinanceBench and DocFinQA test whole filings; Fin-RATE tests reasoning that spans multiple filings and multiple years. This guide reads each benchmark honestly, reports the published model and human numbers, and then prices what it costs today to run a whole-filing QA pass on the three 1M-context models a 2026 stack actually reaches for: DeepSeek V4 Flash, Gemini 3.5 Flash, and Claude Opus 4.8.

Why "best LLM for financial analysis" routes through a benchmark

The question "which model is best for financial analysis" has no answer without a task and a scoring rule. A model that tops a short-passage numeric benchmark can still fail on a 100,000-token filing because position bias and retrieval errors only appear at length. The four benchmarks below are the published scoring rules. Pick the one whose document length and reasoning shape match your workload, run your candidate models against it, and the "best model" question resolves itself into a number you can defend.

The four split cleanly on two axes: how long the context is, and whether the answer needs reasoning inside one document or across several.

| Benchmark | Context length | Reasoning span | Year |
| --- | --- | --- | --- |
| FinQA | Short (one passage + table) | Single document | 2021 |
| FinanceBench | Long (full filing, open-book) | Single document | 2023 |
| DocFinQA | Very long (avg ~123K words) | Single document | 2024 |
| Fin-RATE | Chunked, multi-source | Single, cross-entity, and longitudinal | 2026 |

FinQA: the short-context numeric floor

FinQA is the oldest of the four and still the cleanest test of numeric reasoning. It holds 8,281 expert-written question-answer pairs built from S&P 500 earnings reports, each paired with a gold reasoning program that makes the arithmetic auditable.[1] The context per question is a single passage plus one table, under 700 words on average, so it isolates calculation from retrieval entirely.

The human numbers are the reason FinQA still gets cited. Financial experts reach 91.16 percent execution accuracy on it; general crowd workers reach about 50.68 percent.[1] That 40-point gap is the headline: the dataset is solvable, but only by people who read filings for a living. A model that clears the crowd baseline is doing real work; a model near the expert line is rare.
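Execution accuracy, the metric quoted above, means running a reasoning program and comparing its result with the gold answer. The sketch below assumes a simplified form of the FinQA program notation: comma-separated steps such as `subtract(a, b)`, with `#k` referring to the result of step k. It covers only the four basic arithmetic operations; the function names and the tolerance are illustrative choices, not the dataset's official evaluator, and the numbers in the example are made up.

```python
import re

# Assumed subset of operations; table operations and named constants
# from the full notation are deliberately left out of this sketch.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Matches one step such as "divide(#0, 5735)".
STEP = re.compile(r"(\w+)\(([^()]*)\)")


def run_program(program: str) -> float:
    """Execute steps left to right; '#k' reads the result of step k."""
    results = []
    for op, raw_args in STEP.findall(program):
        args = []
        for token in raw_args.split(","):
            token = token.strip()
            if token.startswith("#"):
                args.append(results[int(token[1:])])
            else:
                args.append(float(token))
        results.append(OPS[op](*args))
    return results[-1]


def execution_match(program: str, gold_answer: float, tol: float = 1e-4) -> bool:
    """True when the executed result is within `tol` of the gold answer."""
    return abs(run_program(program) - gold_answer) <= tol


# Illustrative numbers only: a year-over-year change expressed as a ratio.
example = "subtract(5829, 5735), divide(#0, 5735)"
print(run_program(example))
```

Because every intermediate step is explicit, a wrong answer can be traced to the step that produced it, which is what makes the arithmetic auditable.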

Use FinQA when your task is a bounded numeric question against a single known table. It tells you nothing about whole-filing behavior, because there is no whole filing in the prompt. Treat a strong FinQA score as necessary, not sufficient.

FinanceBench: open-book QA over real filings

FinanceBench moved the test to full documents. Patronus AI built it with fifteen financial-industry domain experts: 10,231 question-answer-evidence triplets covering publicly traded companies, drawn from SEC 10-Ks, 10-Qs, 8-Ks, earnings reports, and earnings-call transcripts.[2] Every question ships with an evidence string, so a grader can check whether the model found the right passage, not just the right number.

The published result is the one to internalize. The authors evaluated 16 model configurations (including GPT-4-Turbo, Llama 2, and Claude 2, run with vector stores and with long-context prompts) on a 150-case sample reviewed by hand. GPT-4-Turbo paired with a retrieval system answered incorrectly or refused on 81 percent of questions.[2] The open-source slice is 150 annotated examples; the full 10,231-triplet set is withheld to limit contamination.[2]

FinanceBench is the benchmark to run when your pipeline is open-book QA over filings a user actually holds. The 81 percent failure figure is dated to a 2023 model generation, so re-run it on your own stack rather than quoting it as a current ceiling.

DocFinQA: long-context, by construction

DocFinQA exists to break the assumption that a relevant snippet is already in the prompt. The authors took 7,437 questions from FinQA and re-attached each to its full source document, pushing the average context from under 700 words to roughly 123,000 words.[3] Filings in the set often run past 150 pages. The question and the gold reasoning program are unchanged; only the haystack grows.

This is the realistic setting for whole-10-K prompting, and it is where retrieval and long-context models both struggle. The authors report that DocFinQA is a significant challenge for state-of-the-art retrieval pipelines and long-context models alike: the model now has to find the evidence before it can reason over it, and position within a 100,000-token window measurably degrades that search.[3]

Run DocFinQA when your design choice is "stuff the whole filing in context" versus "retrieve then answer." It is the benchmark that most directly measures the failure mode that whole-filing prompting introduces.
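The two designs differ only in what goes into the prompt, which a toy sketch can make concrete. Everything here is an assumption for illustration: the word-overlap scorer stands in for a real retriever (embedding search, BM25, or similar), the chunk size is arbitrary, the filing text is synthetic, and `build_prompt` is a hypothetical template rather than anything the DocFinQA authors used.

```python
import re


def chunk_words(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(question: str, chunk: str) -> int:
    """Count distinct lowercase tokens shared by question and chunk."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c)


def retrieve(question: str, filing: str, k: int = 3, size: int = 200) -> str:
    """Return the k highest-scoring chunks, joined into one context string."""
    chunks = chunk_words(filing, size)
    ranked = sorted(chunks, key=lambda ch: overlap_score(question, ch), reverse=True)
    return "\n\n".join(ranked[:k])


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt template shared by both designs."""
    return (
        f"Filing text:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with the figure and the supporting sentence."
    )


# Synthetic stand-in for a filing: one relevant sentence buried in filler.
filing = (
    "filler text about risk factors " * 300
    + "Total net revenue for fiscal 2023 was 4,210 million dollars. "
    + "more filler about legal proceedings " * 300
)
question = "What was total net revenue in fiscal 2023?"

stuffed = build_prompt(filing, question)  # whole filing in context
retrieved = build_prompt(retrieve(question, filing, k=1, size=50), question)
print(len(stuffed.split()), len(retrieved.split()))
```

The stuffed prompt always contains the evidence but pays for every token and exposes the model to position effects; the retrieved prompt is short but is only as good as the ranking step. Scoring both against the same labeled questions shows which failure mode costs more on a given stack.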

Fin-RATE: the 2026 multi-document benchmark

Fin-RATE, presented at KDD 2026 by a Yale-and-Goldman-Sachs author group, is the newest and the most demanding.[4] It is built from 15,311 document chunks across 2,472 SEC filings, spanning 43 companies in 36 industries over 2020 to 2025, and splits 7,500 question instances evenly across three task types.[4]

| Fin-RATE task | What it tests | GPT-5 + web search | Human expert |
| --- | --- | --- | --- |
| DR-QA (detail and reasoning) | One filing, deep read | 42.96% | 92.40% |
| EC-QA (enterprise comparison) | Across companies | 43.64% | 82.68% |
| LT-QA (longitudinal tracking) | Across reporting periods | 43.52% | 85.88% |

The paper's headline finding is the degradation pattern: accuracy drops 18.60 percent and 14.35 percent as the task shifts from single-document reasoning toward longitudinal and cross-entity analysis.[4] The human-versus-model gap is the wider story: the strongest configuration tested sits in the low 40s while domain experts clear 82 percent on every task type.[4] Fin-RATE is the benchmark to cite when someone claims an LLM can replace a filings analyst; the published numbers say it assists one.
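The size of the human-versus-model gap per task follows directly from the table figures. The short calculation below uses only the GPT-5 + web search and human-expert numbers reported above; the variable and function names are illustrative.

```python
# (model accuracy %, human expert accuracy %) per Fin-RATE task,
# copied from the table above.
FIN_RATE = {
    "DR-QA": (42.96, 92.40),
    "EC-QA": (43.64, 82.68),
    "LT-QA": (43.52, 85.88),
}


def gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Human minus model accuracy, in percentage points, per task."""
    return {task: round(human - model, 2) for task, (model, human) in scores.items()}


for task, gap in gaps(FIN_RATE).items():
    print(f"{task}: {gap:.2f} points")
```

The gap ranges from about 39 points on EC-QA to about 49 points on DR-QA, so even the task where this configuration comes closest to the experts leaves it far short of them.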

What it costs to run a whole-filing pass today

A DocFinQA-style or FinanceBench-style pass means putting a full filing in context and asking a question. A large-cap 10-K resolves to roughly 120,000 input tokens after HTML-to-text normalization, and a grounded answer is a few thousand output tokens. The four models below all carry a 1M-token window, so the filing fits in one call with no chunking. List prices were verified 2026-06-16 against each vendor's official page.

| Model | Input $/Mtok | Output $/Mtok | Context | ~Cost per 120K-token read + 3K out |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Flash | $0.14 | $0.28 | 1M | ~$0.018 |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | ~$0.044 |
| Gemini 3.5 Flash | $1.50 | $9.00 | 1M | ~$0.207 |
| Claude Opus 4.8 | $5.00 | $25.00 | 1M | ~$0.675 |

DeepSeek V4 Flash sits at $0.14 input and $0.28 output per million tokens with a 1M-token window, making it the floor for a budget whole-filing read.[5] Gemini 3.5 Flash, Google's current frontier model, runs $1.50 / $9.00; the older Gemini 2.5 Flash stays at $0.30 / $2.50.[6] Claude Opus 4.8 is $5 / $25, ships the full 1M window at standard per-token pricing, and adds a fast mode at $10 / $50 plus a Batch API rate of $2.50 / $12.50.[7]

The spread is roughly 38x from DeepSeek V4 Flash to Opus 4.8 on the same filing ($0.675 against $0.0176 before rounding). That gap is the cost side of the accuracy bet the benchmarks describe: the cheap model reads the filing for under two cents, but the benchmark numbers warn that whole-filing QA accuracy is fragile, so the right move is to price both lanes and run an eval on your own filings before committing the cheap one to production.
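The per-read figures are plain arithmetic on the list prices, and reproducing them makes it easy to substitute a different filing length. The sketch below assumes the same 120,000 input tokens and 3,000 output tokens used in the table and ignores caching, batch discounts, and any tokenizer differences between vendors.

```python
# (input $/Mtok, output $/Mtok), list prices as quoted in the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Claude Opus 4.8": (5.00, 25.00),
}


def read_cost(
    input_price: float,
    output_price: float,
    input_tokens: int = 120_000,
    output_tokens: int = 3_000,
) -> float:
    """Dollar cost of one uncached call at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


costs = {model: read_cost(*price) for model, price in PRICES.items()}
for model, cost in costs.items():
    print(f"{model}: ${cost:.4f}")

# Ratio of the most expensive to the cheapest read, from unrounded costs.
spread = costs["Claude Opus 4.8"] / costs["DeepSeek V4 Flash"]
print(f"spread: {spread:.1f}x")
```

Computed from unrounded costs, the ratio comes out a little above 38. Input tokens dominate every row, which is why filing length matters far more to the bill than answer length.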

Prompt caching changes the arithmetic for repeated reads

Whole-filing QA is input-heavy: a long filing in, a short answer out. When the same filing or the same system prompt is queried repeatedly (an analyst asking eight questions of one 10-K, say), prompt caching collapses the input cost. Anthropic bills a five-minute cache write at 1.25x the base input price and each cache read at 0.1x, so caching is already cheaper on the second read: 1.35x of one uncached read in total, against 2x without caching.[7] For a single filing read once, caching does not help, and the write premium makes it slightly more expensive; for an interactive session over one document, each cached read costs roughly 90 percent less on the dominant cost term. Model both paths in the Token-Cost Optimizer rather than assuming the cached lane.
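The break-even can be checked with a few lines of arithmetic. The sketch below uses the 1.25x write and 0.1x read multipliers quoted above and makes simplifying assumptions: the whole filing is the cached prefix, every follow-up read lands inside the cache lifetime, and the per-question tokens and output tokens are ignored. Costs are expressed in units of one uncached filing read unless a dollar `base` is passed.

```python
def input_cost_uncached(reads: int, base: float = 1.0) -> float:
    """Input cost when every read pays the full base price."""
    return reads * base


def input_cost_cached(
    reads: int,
    base: float = 1.0,
    write_mult: float = 1.25,
    read_mult: float = 0.1,
) -> float:
    """One cache write for the first read, then cache reads for the rest."""
    if reads <= 0:
        return 0.0
    return base * (write_mult + read_mult * (reads - 1))


def savings(reads: int) -> float:
    """Fraction of input cost saved by caching over `reads` reads."""
    return 1 - input_cost_cached(reads) / input_cost_uncached(reads)


for n in (1, 2, 8):
    print(n, round(input_cost_cached(n), 3), round(savings(n), 3))

# Dollar view for an eight-question session on a 120K-token filing at a
# $5.00/Mtok input price, so base = 120_000 * 5.00 / 1e6 = $0.60 per read.
base = 120_000 * 5.00 / 1_000_000
print(input_cost_uncached(8, base), input_cost_cached(8, base))
```

One read costs 25 percent more with caching, two reads already cost less, and an eight-question session saves about three quarters of the input bill. The per-read saving is 90 percent, but the session-level saving only approaches that figure as the number of reads grows, because the write premium is amortized.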

Picking a model by task, not by leaderboard

The benchmarks map onto a tiered model stack. For high-volume extraction that feeds a QA pipeline, the cheapest capable model wins because the work is mechanical; DeepSeek V4 Flash or Gemini 2.5 Flash read a filing for cents. For the answer step where the benchmark gap lives — cross-document and longitudinal reasoning, the Fin-RATE EC-QA and LT-QA tasks — a frontier model earns its rate because errors compound. The disciplined pattern is a cheap reader and a strong reasoner, sized per task in the Model Selector for Finance.

  • High-volume filing reads and extraction: DeepSeek V4 Flash or Gemini 2.5 Flash.
  • Single-filing open-book QA at quality: Gemini 3.5 Flash or Claude Opus 4.8.
  • Cross-entity and multi-year synthesis: Claude Opus 4.8, with an eval on Fin-RATE-style tasks.
  • Repeated questions over one filing: layer prompt caching on the chosen model.
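The tiering above can be written down as a small routing table so that the choice is explicit and testable. This is a minimal sketch: the task labels, the `ROUTES` name, and the `pick_models` function are hypothetical, and the candidate lists simply restate the bullets.

```python
# Task label -> candidate models, restating the list above.
ROUTES = {
    "extraction": ["DeepSeek V4 Flash", "Gemini 2.5 Flash"],
    "single_filing_qa": ["Gemini 3.5 Flash", "Claude Opus 4.8"],
    "cross_entity_or_multi_year": ["Claude Opus 4.8"],
}


def pick_models(task: str, repeated_reads: bool = False) -> dict:
    """Return candidate models for a task and whether to enable caching."""
    if task not in ROUTES:
        raise ValueError(f"unknown task: {task!r}")
    return {"candidates": ROUTES[task], "prompt_caching": repeated_reads}


print(pick_models("single_filing_qa", repeated_reads=True))
```

Keeping the table in one place means an eval result that changes the ranking turns into a one-line edit rather than a search through pipeline code.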

How to wire a benchmark into your own pipeline

Quoting a published number is not the same as running the benchmark on your stack. The minimum harness is 50 to 200 of your own labeled question-answer-evidence triplets, scored on exact match for numeric answers and span overlap for evidence retrieval, with every prompt version regressed against the prior one before shipping. FinanceBench's evidence strings and FinQA's reasoning programs are the templates for that label schema. The evaluation harness methodology for finance LLMs covers the harness shape, and the hub's LLM financial benchmark accuracy statistics collects sourced accuracy numbers across the wider benchmark set so you start from real baselines.

References

  1. Chen, Z., Chen, W., Smiley, C., Shah, S., Borova, I., Langdon, D., Moussa, R., Beane, M., Huang, T.-H., Routledge, B., and Wang, W. Y. (2021). "FinQA: A Dataset of Numerical Reasoning over Financial Data." EMNLP 2021. https://arxiv.org/abs/2109.00122. 8,281 expert-written QA pairs over S&P 500 earnings reports with gold reasoning programs; reported execution accuracy of 91.16% for financial experts and ~50.68% for general crowd workers.

  2. Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., and Vidgen, B. (2023). "FinanceBench: A New Benchmark for Financial Question Answering." https://arxiv.org/abs/2311.11944. 10,231 question-answer-evidence triplets across 10-K, 10-Q, 8-K, earnings reports, and earnings-call transcripts; 16 model configurations evaluated on a 150-case sample, where GPT-4-Turbo with a retrieval system answered incorrectly or refused on 81% of questions. Open-source sample: 150 annotated examples (github.com/patronus-ai/financebench).

  3. Reddy, V., Koncel-Kedziorski, R., Lai, V. D., Krumdick, M., Lovering, C., and Tanner, C. (2024). "DocFinQA: A Long-Context Financial Reasoning Dataset." ACL 2024 (short). https://arxiv.org/abs/2401.06915. Augments 7,437 FinQA questions with full-document SEC-filing context, raising average context length from under 700 words to ~123,000 words; reported as a significant challenge for both retrieval pipelines and long-context models.

  4. Jiang, Y., Chen, J., Makri, E., Chen, J., Li, P., Maatouk, A., Tassiulas, L., Brenner, E., Xiang, B., and Ying, R. (2026). "Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings." KDD 2026. https://arxiv.org/abs/2602.07294. 7,500 QA instances (2,500 each across DR-QA, EC-QA, LT-QA) over 15,311 chunks from 2,472 SEC filings, 43 companies in 36 industries (2020-2025); 17 LLMs evaluated. Reported accuracy drop of 18.60% and 14.35% across task shifts; human-expert baselines of 92.40% (DR-QA), 82.68% (EC-QA), and 85.88% (LT-QA).

  5. DeepSeek. "Models & Pricing." https://api-docs.deepseek.com/quick_start/pricing, verified 2026-06-16. DeepSeek V4 Flash: $0.14 cache-miss input / $0.28 output per million tokens, 1M-token context window.

  6. Google. "Gemini API Pricing." https://ai.google.dev/gemini-api/docs/pricing, verified 2026-06-16. Gemini 3.5 Flash $1.50 / $9.00; Gemini 2.5 Flash $0.30 / $2.50 (1M context); Gemini 2.5 Pro $1.25 / $10.00 for prompts up to 200K tokens, $2.50 / $15.00 above.

  7. Anthropic. "Pricing." https://platform.claude.com/docs/en/about-claude/pricing, verified 2026-06-16. Claude Opus 4.8 $5 / $25 with the full 1M context window at standard pricing; 5-minute cache write 1.25x and cache read 0.1x base input; fast mode $10 / $50; Batch API $2.50 / $12.50. Dollar figures are subject to change; treat each as valid for the date checked.

Frequently asked questions

What is the best benchmark for long-context financial QA over SEC filings?
DocFinQA tests whole-filing QA directly (avg ~123K words), and Fin-RATE 2026 adds cross-entity and longitudinal reasoning. FinanceBench is the open-book whole-filing standard; FinQA is the short-context numeric floor. Match the benchmark's document length to your workload.
What is the best LLM for financial analysis in 2026?
It is task-tiered, not one model. Use DeepSeek V4 Flash or Gemini 2.5 Flash for high-volume filing reads, and Gemini 3.5 Flash or Claude Opus 4.8 for the reasoning step where benchmark accuracy gaps appear. Validate the pick on a benchmark, not a leaderboard.
How well do LLMs score on financial QA benchmarks?
On Fin-RATE 2026, the strongest configuration tested scored about 43% across task types while human experts scored roughly 83-92% (results per the cited papers). On FinanceBench, GPT-4-Turbo with retrieval answered incorrectly or refused on 81% of a 150-case sample in 2023.
What does it cost to run a whole 10-K through an LLM for QA?
A ~120K-token filing plus a 3K-token answer costs about $0.018 on DeepSeek V4 Flash, $0.044 on Gemini 2.5 Flash, $0.21 on Gemini 3.5 Flash, and $0.68 on Claude Opus 4.8, all with 1M context (verified 2026-06-16). Prompt caching cuts the input cost of each repeated read of one filing by roughly 90%.