Should I fine-tune a model for 10-K extraction?

Usually start with a capable general model plus a precise schema and verification, and only fine-tune if the evaluation shows a specific, high-volume field that the general model cannot handle reliably. Fine-tuning helps for narrow, stable extraction formats but adds a maintenance burden and can degrade when the provider updates the base model. Measure where the general model falls short before paying that cost.

What accuracy is good enough for production extraction?

It depends on the downstream use and whether a human reviews the output. For a figure that feeds an automated decision, the bar is very high and a verification layer is mandatory regardless of model accuracy. For a draft that a human reviews, a lower bar is tolerable. Define the bar from the cost of an error in your specific use, not from a generic benchmark number.

How do I keep the evaluation current as models update?

Turn the gold set into a regression suite and re-run it on every model update and prompt change. Providers change model behavior between versions, and a model that passed last quarter can quietly regress. Logging the model version alongside results lets you localize a drop to a specific update, which is far cheaper than discovering the regression through bad production output.
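
As a rough sketch of that regression check, assuming per-field accuracy has already been computed on the gold set for a baseline run and for the current run. The function names, the record layout, and the version strings are hypothetical placeholders, not any provider's API.

```python
import json

def find_regressions(baseline: dict, current: dict, tolerance: float = 0.0) -> dict:
    """Return fields whose gold-set accuracy dropped by more than `tolerance`.

    Both arguments map field name -> accuracy (0.0 to 1.0). A field missing
    from the current run is treated as accuracy 0.0.
    """
    return {
        field: {"was": baseline[field], "now": current.get(field, 0.0)}
        for field in baseline
        if current.get(field, 0.0) < baseline[field] - tolerance
    }

def run_record(model_version: str, prompt_version: str, per_field_accuracy: dict) -> str:
    """Serialize one evaluation run so a later drop can be tied to a version."""
    return json.dumps(
        {
            "model_version": model_version,
            "prompt_version": prompt_version,
            "per_field_accuracy": per_field_accuracy,
        },
        sort_keys=True,
    )

# Illustrative numbers only.
baseline = {"revenue": 0.98, "segment_revenue": 0.90}
current = {"revenue": 0.98, "segment_revenue": 0.81}
regressions = find_regressions(baseline, current, tolerance=0.02)
log_line = run_record("model-2026-05", "prompt-v3", current)
```

Storing one such record per run keeps the comparison cheap: a drop shows up as a difference between two adjacent records with different model or prompt versions.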


How to Evaluate an LLM for 10-K Extraction

Extracting structured fields from 10-K filings is a task where a model can look impressive on a clean example and fail quietly on the messy reality of footnotes, segment tables, and restatements. Choosing a model by vibe or by one demo is how silent errors reach a decision. The reliable approach is a measured evaluation against a labeled set. This guide lays out how to build that evaluation and what to score, so that the model you choose earns its place.

9 min read · Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team


Before You Start

Set up the inputs that make the next steps easier

A defined extraction schema: the exact fields you need, their types, and their units.

A set of representative filings, including hard ones, with human-verified correct values for each field.

A short list of candidate models and your own API access to run them.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

1. Define the extraction schema precisely

Before evaluating any model, pin down exactly what you are extracting: each field name, its data type, its unit, and how ambiguous cases are resolved. A vague schema makes evaluation impossible because you cannot tell a wrong answer from a different-but-valid one. A precise schema also lets you validate the model's output structurally before you even check the values, catching malformed responses cheaply.

Specify units and sign conventions explicitly. Most extraction disagreements on filings come from thousands-versus-millions and parenthesis-means-negative ambiguity, not from the model missing the number.
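
A minimal sketch of what "precise" can look like in code, assuming a schema where every field carries a required unit, and a convention where parentheses mean negative and the table header states the scale. `FieldSpec`, `SCALES`, `parse_reported`, and `validate_record` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class FieldSpec:
    name: str
    unit: str  # unit the schema requires, e.g. "USD"

# Scale stated in the table header, mapped to a multiplier.
SCALES = {"ones": 1, "thousands": 1_000, "millions": 1_000_000}

def parse_reported(printed: str, reported_in: str) -> Decimal:
    """Convert a figure as printed in a filing to unscaled units.

    Assumed convention: a value wrapped in parentheses is negative.
    """
    s = printed.strip().replace("$", "").replace(",", "").strip()
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1].strip()
    value = Decimal(s) * SCALES[reported_in]
    return -value if negative else value

def validate_record(record: dict, schema: list[FieldSpec]) -> list[str]:
    """Structural check only: field present, value is a Decimal, unit matches."""
    problems = []
    for spec in schema:
        entry = record.get(spec.name)
        if not isinstance(entry, dict):
            problems.append(f"{spec.name}: missing")
        elif not isinstance(entry.get("value"), Decimal):
            problems.append(f"{spec.name}: value is not a Decimal")
        elif entry.get("unit") != spec.unit:
            problems.append(f"{spec.name}: expected unit {spec.unit}")
    return problems

schema = [FieldSpec("net_income", "USD"), FieldSpec("revenue", "USD")]
record = {
    "net_income": {"value": parse_reported("(1,234)", "thousands"), "unit": "USD"},
    "revenue": {"value": "5.2", "unit": "USD"},  # malformed on purpose: a string
}
problems = validate_record(record, schema)
```

The structural check runs before any value is compared with a label, so a malformed response is rejected without spending review time on it.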

2. Build a labeled gold set

Assemble a set of filings and have a human record the correct value for every field in your schema. Include variety on purpose: different industries, filing sizes, and presentation styles. The gold set is the ground truth your evaluation scores against, so its quality bounds the quality of every conclusion. A gold set of a handful of clean filings will overstate accuracy; one that spans the real distribution will not.

Have a second person spot-check the labels. Gold-set errors masquerade as model errors and quietly corrupt the entire evaluation.
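
One way to organize the spot-check, sketched under the assumption that gold labels are stored in a dictionary keyed by `(filing_id, field)`. The function names and the sample data are hypothetical.

```python
import random

def spot_check_sample(gold: dict, k: int, seed: int = 0) -> list:
    """Pick up to k label keys for a second labeler to re-check.

    A fixed seed keeps the sample reproducible between runs.
    """
    keys = sorted(gold)
    return random.Random(seed).sample(keys, min(k, len(keys)))

def disagreements(gold: dict, second_pass: dict) -> list:
    """Return keys where the second labeler's value differs from the gold label."""
    return [key for key, value in second_pass.items() if gold.get(key) != value]

# Illustrative labels only.
gold = {
    ("filing_a", "revenue"): 5_200_000,
    ("filing_a", "net_income"): -1_234_000,
    ("filing_b", "revenue"): 910_000,
}
to_check = spot_check_sample(gold, k=2)
second_pass = {("filing_a", "net_income"): 1_234_000}  # second labeler read it as positive
conflicts = disagreements(gold, second_pass)
```

Any key returned by `disagreements` should be resolved by a person before the gold set is used for scoring, since the code cannot tell which labeler was right.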
3. Score field-level accuracy and faithfulness

Run each candidate model over the gold set and score two things separately: whether it got each field's value right, and whether each value is actually supported by the source text it cited. A model can be accurate by luck while citing the wrong passage, which is fragile. Field-level accuracy tells you how often the answer is correct; faithfulness tells you whether you can trust the evidence behind it. Report per-field accuracy, not just an overall average, since some fields are far harder than others.

Track which specific fields fail. A model that is excellent on the income statement but weak on segment footnotes needs a targeted fix, not a wholesale replacement.
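
The two scores can be kept apart with a small tally per field. This is a sketch under stated assumptions: each prediction carries the normalized `value`, the figure as `printed` in the filing, and the `quote` the model cited. Faithfulness is approximated here as "the quote appears verbatim in the filing and contains the printed figure", which is a simple proxy rather than a full entailment check. All names are hypothetical.

```python
from collections import defaultdict

def score_by_field(predictions: list, gold: dict, sources: dict) -> dict:
    """Return {field: {"accuracy": ..., "faithfulness": ...}}.

    gold maps (filing_id, field) -> correct value.
    sources maps filing_id -> full filing text.
    """
    tally = defaultdict(lambda: {"n": 0, "correct": 0, "faithful": 0})
    for p in predictions:
        t = tally[p["field"]]
        t["n"] += 1
        if p["value"] == gold[(p["filing_id"], p["field"])]:
            t["correct"] += 1
        # Proxy for faithfulness: cited passage exists and contains the figure.
        if p["quote"] in sources[p["filing_id"]] and p["printed"] in p["quote"]:
            t["faithful"] += 1
    return {
        field: {"accuracy": t["correct"] / t["n"], "faithfulness": t["faithful"] / t["n"]}
        for field, t in tally.items()
    }

# Illustrative data only.
sources = {"a": "Total revenue was 5,200 for the year.", "b": "Total revenue was 910."}
gold = {("a", "revenue"): 5200, ("b", "revenue"): 910}
predictions = [
    {"filing_id": "a", "field": "revenue", "value": 5200,
     "printed": "5,200", "quote": "Total revenue was 5,200"},
    {"filing_id": "b", "field": "revenue", "value": 910,
     "printed": "910", "quote": "Segment revenue of 910"},  # right value, wrong passage
]
scores = score_by_field(predictions, gold, sources)
```

In the sample data the second prediction is correct but cites a passage that is not in the filing, so accuracy and faithfulness diverge for that field.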
4. Stress-test the hard cases

The average filing is easy; the value of the evaluation is in the hard cases. Deliberately test footnote-buried figures, restated prior-year numbers, multi-segment tables, and unusual formatting. These are where models silently err and where the cost of an error is highest. A model that handles the clean cases but collapses on footnotes is not ready for production extraction, and only a stress test reveals that.

Restatements are a classic trap: the model must extract the figure as filed, not the corrected one, unless your schema says otherwise. Test this case explicitly.
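
To keep the hard cases from being averaged away, results can be sliced by case type. The sketch below assumes each gold-set item has been tagged by hand with labels such as `footnote` or `restatement`; the tag names, the function, and the 0.9 bar are illustrative assumptions.

```python
from collections import defaultdict

def accuracy_by_tag(results: list, bar: float) -> dict:
    """results is a list of (tags, is_correct) pairs, one per scored field.

    Returns {tag: {"accuracy": ..., "passes": ...}} so a weak slice is visible
    even when the overall average looks fine.
    """
    totals = defaultdict(lambda: [0, 0])  # tag -> [correct, seen]
    for tags, is_correct in results:
        for tag in tags:
            totals[tag][0] += int(is_correct)
            totals[tag][1] += 1
    return {
        tag: {"accuracy": correct / seen, "passes": correct / seen >= bar}
        for tag, (correct, seen) in totals.items()
    }

# Illustrative results only.
results = [
    ({"clean"}, True),
    ({"clean"}, True),
    ({"footnote"}, False),
    ({"footnote", "restatement"}, True),
]
report = accuracy_by_tag(results, bar=0.9)
```

Here the overall accuracy is 75 percent, but the report shows the failures are concentrated in the footnote slice.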
5. Verify the numbers the winner extracts

Even the best model in your evaluation will make transcription and unit errors at some rate. Build a per-number verification step that checks each extracted figure against the source text and flags mismatches for review. This is not a substitute for choosing a good model; it is the production safety net that catches the residual errors the chosen model still makes. Evaluation picks the model; verification protects the output in the field.

Surface disagreements rather than silently overriding them. A flagged mismatch routed to a human is safe; a silent correction can hide a real problem in the source.
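
A per-number check can be as simple as confirming that each extracted figure, as printed, appears among the numbers in the source text. The following is a sketch under assumptions: the regular expression covers plain figures with commas, decimals, dollar signs, and parentheses for negatives, and the function names are hypothetical. It flags mismatches and never rewrites a value.

```python
import re
from decimal import Decimal

# Matches figures such as 1,234 or $5,678.50 or (1,234).
NUMBER = re.compile(r"\(?\$?\s*\d[\d,]*(?:\.\d+)?\)?")

def numbers_in(text: str) -> set:
    """Collect every figure in the text as a Decimal; parentheses mean negative."""
    found = set()
    for token in NUMBER.findall(text):
        token = token.strip()
        negative = token.startswith("(") and token.endswith(")")
        value = Decimal(re.sub(r"[^\d.]", "", token))
        found.add(-value if negative else value)
    return found

def verify(extracted: dict, source_text: str) -> dict:
    """Map each field to "ok" or "flag_for_review".

    extracted maps field name -> the figure as the model says it was printed.
    """
    available = numbers_in(source_text)
    report = {}
    for field, printed in extracted.items():
        candidates = numbers_in(printed)
        supported = bool(candidates) and candidates <= available
        report[field] = "ok" if supported else "flag_for_review"
    return report

# Illustrative data only; the revenue digits are transposed on purpose.
source = "Net loss was (1,234) and revenue was $5,678.50 in thousands."
report = verify({"net_income": "(1,234)", "revenue": "5,768.50"}, source)
```

The check only proves that a figure exists somewhere in the source, not that it belongs to the right field, so it complements the faithfulness score rather than replacing it.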

6. Compare the survivors on cost

Only after the accuracy and faithfulness scores narrow the field to acceptable models should cost decide. Estimate the token cost per filing for each survivor, including the input size of a full 10-K and any cache benefit from a stable prompt. The right model is the cheapest one that clears your accuracy bar on the hard cases, not the cheapest one overall and not the most accurate regardless of price.

A 10-K is a large input, so per-filing cost is dominated by input tokens. Caching a stable extraction prompt and schema can change the cost ranking of the survivors.
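
The arithmetic is short enough to sketch. Every number below is a placeholder, not a real price or token count: substitute your provider's published per-million-token rates and your own measured sizes. The function names and candidate records are hypothetical.

```python
def cost_per_filing(input_tokens: int, output_tokens: int, cached_tokens: int,
                    price_in: float, price_out: float, price_cached_in: float) -> float:
    """Estimated cost of one extraction pass. Prices are per million tokens.

    cached_tokens is the part of the input (for example a stable prompt and
    schema) billed at the cached rate instead of the normal input rate.
    """
    uncached = input_tokens - cached_tokens
    total = uncached * price_in + cached_tokens * price_cached_in + output_tokens * price_out
    return total / 1_000_000

def pick_model(candidates: list, accuracy_bar: float):
    """Cheapest candidate that clears the bar on hard cases, or None."""
    eligible = [c for c in candidates if c["hard_case_accuracy"] >= accuracy_bar]
    return min(eligible, key=lambda c: c["cost"], default=None)

# Placeholder sizes and prices, for illustration only.
example_cost = cost_per_filing(
    input_tokens=100_000, output_tokens=1_000, cached_tokens=10_000,
    price_in=3.0, price_out=15.0, price_cached_in=0.3,
)
candidates = [
    {"name": "model_a", "hard_case_accuracy": 0.97, "cost": 0.29},
    {"name": "model_b", "hard_case_accuracy": 0.99, "cost": 0.80},
    {"name": "model_c", "hard_case_accuracy": 0.88, "cost": 0.05},
]
winner = pick_model(candidates, accuracy_bar=0.95)
```

With these placeholder values, the cheapest model overall is filtered out by the accuracy bar and the most accurate one loses on price, which is the ordering the step describes.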


Common Mistakes

The misses that undo good inputs

Judging a model on one clean demo filing

A single well-formatted filing hides every failure mode that matters. The model that aces a demo can fail on footnotes, restatements, and segment tables, which is exactly where extraction errors are most costly.

Scoring accuracy but not source faithfulness

A model can produce a correct value while citing the wrong passage, which is luck rather than reliability. Without a faithfulness score you cannot tell a robust extractor from one that will fail unpredictably on the next filing.

Letting cost decide before accuracy does

Choosing the cheapest model first and hoping accuracy is adequate inverts the priority. In extraction a wrong figure can be far more expensive than the token savings, so accuracy on hard cases must gate the choice before cost is considered.


FAQ

Questions people ask next

The short answers readers usually want after the first pass.

How large should the gold set be?

Large enough to span the real distribution of filings you will process, including the hard cases, rather than a fixed count. A gold set that is all clean, large-cap, single-segment filings will overstate accuracy. Prioritize variety over volume: a few dozen filings chosen to cover different industries, sizes, formats, and known traps are more informative than hundreds of similar easy ones.


