How to Evaluate an LLM for 10-K Extraction
Extracting structured fields from 10-K filings is a task where a model can look impressive on a clean example and fail quietly on the messy reality of footnotes, segment tables, and restatements. Choosing a model by vibe or by one demo is how silent errors reach a decision. The reliable approach is a measured evaluation against a labeled set. This guide lays out how to build that evaluation and what to score, so that the model you choose earns its place.
Guide Steps
Move through it in order
Each step focuses on one decision so you can keep momentum without losing the thread.
- 1
Define the extraction schema precisely
Before evaluating any model, pin down exactly what you are extracting: each field name, its data type, its unit, and how ambiguous cases are resolved. A vague schema makes evaluation impossible because you cannot tell a wrong answer from a different-but-valid one. A precise schema also lets you validate the model's output structurally before you even check the values, catching malformed responses cheaply.
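As a minimal sketch of what a precise schema can look like, assuming a Python pipeline: the field names, unit labels, and the `FieldSpec` and `validate_structure` helpers below are hypothetical and chosen only for illustration. The idea is that a schema written as data can be checked structurally before any value is scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type            # expected Python type of the extracted value
    unit: str              # hypothetical unit label, e.g. "USD_millions"
    nullable: bool = False


# Hypothetical schema: these fields and units are illustrative only.
SCHEMA = [
    FieldSpec("total_revenue", float, "USD_millions"),
    FieldSpec("net_income", float, "USD_millions"),
    FieldSpec("fiscal_year", int, "year"),
    FieldSpec("segment_count", int, "count", nullable=True),
]


def _type_ok(value, dtype) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if dtype is float:
        return isinstance(value, (int, float))
    return isinstance(value, dtype)


def validate_structure(record: dict, schema=SCHEMA) -> list[str]:
    """Return structural problems; an empty list means the record is well-formed."""
    problems = []
    for spec in schema:
        if spec.name not in record:
            problems.append(f"missing field: {spec.name}")
            continue
        value = record[spec.name]
        if value is None:
            if not spec.nullable:
                problems.append(f"null not allowed: {spec.name}")
            continue
        if not _type_ok(value, spec.dtype):
            problems.append(f"wrong type: {spec.name}")
    expected = {spec.name for spec in schema}
    for key in record:
        if key not in expected:
            problems.append(f"unexpected field: {key}")
    return problems
```

A record that fails this check can be rejected or retried without consulting the gold set, which keeps malformed responses out of the accuracy numbers.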
Specify units and sign conventions explicitly. Most extraction disagreements on filings come from thousands-versus-millions and parenthesis-means-negative ambiguity, not from the model missing the number.
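One way to make the unit and sign conventions concrete is a single normalization function that both the gold labels and the model outputs pass through before comparison. The sketch below is an assumption about how that could look, not a standard: `normalize_amount` and the `SCALE` table are hypothetical names, and the function deliberately does not handle dashes used for zero or footnote markers.

```python
from decimal import Decimal

# Multipliers for the reporting scale stated in a table header.
SCALE = {
    "units": Decimal(1),
    "thousands": Decimal(1_000),
    "millions": Decimal(1_000_000),
}


def normalize_amount(raw: str, scale: str) -> Decimal:
    """Convert a figure as printed (e.g. "(1,234)") into a signed amount in base units."""
    text = raw.strip().replace("$", "").replace(",", "").strip()
    # Accounting convention: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    value = Decimal(text) * SCALE[scale]
    return -value if negative else value
```

Using `Decimal` rather than `float` avoids rounding noise when two normalized figures are compared for exact equality.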
Use the tool: Structured Schema Validator for Finance (Playgrounds)
Paste LLM JSON output and validate against four pre-built finance schemas (research output, trade decision, risk snapshot, peer comparison) with sanity.
- 2
Build a labeled gold set
Assemble a set of filings and have a human record the correct value for every field in your schema. Include variety on purpose: different industries, filing sizes, and presentation styles. The gold set is the ground truth your evaluation scores against, so its quality bounds the quality of every conclusion. A gold set of a handful of clean filings will overstate accuracy; one that spans the real distribution will not.
Have a second person spot-check the labels. Gold-set errors masquerade as model errors and quietly corrupt the entire evaluation.
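The spot-check can be kept simple. The sketch below assumes the gold set is a dictionary keyed by `(filing_id, field)` pairs; that layout and both function names are hypothetical. It draws a reproducible sample for the second labeler and then lists every label where the two people disagree.

```python
import random


def sample_for_spot_check(gold: dict, fraction: float = 0.2, seed: int = 0) -> list:
    """Pick a reproducible subset of gold keys for a second labeler to re-label."""
    keys = sorted(gold)
    k = max(1, round(len(keys) * fraction))
    return random.Random(seed).sample(keys, k)


def label_disagreements(gold: dict, second_pass: dict) -> list:
    """Return (key, first_label, second_label) for every re-labeled item that differs."""
    return [
        (key, gold.get(key), second_pass[key])
        for key in sorted(second_pass)
        if gold.get(key) != second_pass[key]
    ]
```

Each disagreement is worth resolving by going back to the filing, since either label could be the wrong one.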
- 3
Score field-level accuracy and faithfulness
Run each candidate model over the gold set and score two things separately: did it get each field's value right, and is each value actually supported by the source text it cited. A model can be accurate by luck while citing the wrong passage, which is fragile. Field-level accuracy tells you how often the answer is correct; faithfulness tells you whether you can trust the reason. Report per-field accuracy, not just an overall average, since some fields are far harder than others.
Track which specific fields fail. A model that is excellent on the income statement but weak on segment footnotes needs a targeted fix, not a wholesale replacement.
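The two scores can be computed side by side. This is a rough sketch under stated assumptions: each prediction is a dictionary with `filing_id`, `field`, `value`, `raw` (the figure as printed) and `evidence` (the passage the model quoted), and "faithful" is approximated as "the quoted passage appears verbatim in the source and contains the printed figure". Those key names and that definition of faithfulness are illustrative choices, not a standard metric.

```python
from collections import defaultdict


def score(predictions: list, gold: dict, sources: dict) -> dict:
    """Per-field accuracy and faithfulness.

    gold maps (filing_id, field) -> correct value.
    sources maps filing_id -> full source text.
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "faithful": 0})
    for p in predictions:
        t = totals[p["field"]]
        t["n"] += 1
        # Accuracy: does the extracted value match the human label?
        if p["value"] == gold[(p["filing_id"], p["field"])]:
            t["correct"] += 1
        # Faithfulness (approximation): the cited passage exists in the source
        # and actually contains the figure the model says it read.
        evidence = p["evidence"]
        source = sources[p["filing_id"]]
        if evidence and evidence in source and p["raw"] in evidence:
            t["faithful"] += 1
    return {
        field: {
            "accuracy": t["correct"] / t["n"],
            "faithfulness": t["faithful"] / t["n"],
            "n": t["n"],
        }
        for field, t in totals.items()
    }
```

A field with high accuracy but low faithfulness is the "right by luck" pattern described above, and it only becomes visible when the two numbers are reported separately per field.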
- 4
Stress-test the hard cases
The average filing is easy; the value of the evaluation is in the hard cases. Deliberately test footnote-buried figures, restated prior-year numbers, multi-segment tables, and unusual formatting. These are where models silently err and where the cost of an error is highest. A model that handles the clean cases but collapses on footnotes is not ready for production extraction, and only a stress test reveals that.
Restatements are a classic trap: the model must extract the figure as filed, not the corrected one, unless your schema says otherwise. Test this case explicitly.
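A stress test is easier to read when each gold-set item carries tags for the hard-case categories it belongs to, so accuracy can be reported per category instead of being averaged away. The sketch below assumes results arrive as `(tags, is_correct)` pairs; the tag names and the `accuracy_by_tag` helper are hypothetical.

```python
from collections import Counter


def accuracy_by_tag(results) -> dict:
    """results: iterable of (tags, is_correct), where tags is a set of category labels."""
    seen, correct = Counter(), Counter()
    for tags, is_correct in results:
        for tag in tags:
            seen[tag] += 1
            correct[tag] += int(is_correct)
    return {tag: correct[tag] / seen[tag] for tag in seen}


# Illustrative data only: an item can belong to more than one category.
results = [
    ({"clean"}, True),
    ({"clean"}, True),
    ({"footnote"}, False),
    ({"footnote", "restatement"}, True),
    ({"restatement"}, False),
]
by_tag = accuracy_by_tag(results)
```

In this made-up data the overall accuracy is 60 percent, which hides that the clean cases are perfect while footnotes and restatements are right only half the time.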
- 5
Verify the numbers the winner extracts
Even the best model in your evaluation will make transcription and unit errors at some rate. Build a per-number verification step that checks each extracted figure against the source text and flags mismatches for review. This is not a substitute for choosing a good model; it is the production safety net that catches the residual errors the chosen model still makes. Evaluation picks the model; verification protects the output in the field.
Surface disagreements rather than silently overriding them. A flagged mismatch routed to a human is safe; a silent correction can hide a real problem in the source.
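A per-number check along these lines can be sketched in a few lines of Python. It is a deliberately conservative assumption of how verification might work: it compares digit strings only, so it confirms that the figure is printed somewhere in the source but does not check sign, scale, or which line item the figure belongs to. The function names are hypothetical.

```python
import re

# Matches figures as printed: digits with optional thousands separators and decimals.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _digits(text: str) -> str:
    """Keep only digits and the decimal point, so "$(1,200)" becomes "1200"."""
    return re.sub(r"[^\d.]", "", text)


def printed_numbers(source_text: str) -> set[str]:
    """Collect every figure printed in the source, in normalized digit form."""
    return {_digits(match) for match in NUMBER.findall(source_text)}


def verify(extracted: dict[str, str], source_text: str) -> dict[str, str]:
    """Map each field to "ok" or "flag".

    extracted holds figures as printed strings. Nothing is corrected here:
    a mismatch is only flagged so a human can look at it.
    """
    available = printed_numbers(source_text)
    return {
        field: "ok" if _digits(raw) in available else "flag"
        for field, raw in extracted.items()
    }
```

Because the function only labels and never rewrites a value, every mismatch stays visible for review.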
Use the tool: Hallucination Detector (Playgrounds)
Paste a source document and an LLM's extraction. Every numeric claim in the output is checked against the source. Runs client-side. Catches silent fabrication.
- 6
Compare the survivors on cost
Only after the accuracy and faithfulness scores narrow the field to acceptable models should cost decide. Estimate the token cost per filing for each survivor, including the input size of a full 10-K and any cache benefit from a stable prompt. The right model is the cheapest one that clears your accuracy bar on the hard cases, not the cheapest one overall and not the most accurate regardless of price.
A 10-K is a large input, so per-filing cost is dominated by input tokens. Caching a stable extraction prompt and schema can change the cost ranking of the survivors.
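The arithmetic behind this comparison is short enough to write down. In the sketch below, prices are expressed in USD per million tokens, `cached_tokens` is the part of the input (the stable prompt and schema) billed at a cached rate, and the rest of the input is the filing itself. All names are hypothetical, and any prices passed in are the caller's own assumptions, since real pricing varies by provider.

```python
def cost_per_filing(input_tokens: int, output_tokens: int, cached_tokens: int,
                    price_in: float, price_cached: float, price_out: float) -> float:
    """Estimated USD cost of one extraction pass. Prices are per million tokens."""
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * price_in
             + cached_tokens * price_cached
             + output_tokens * price_out)
    return total / 1_000_000


def pick_model(candidates: list, accuracy_bar: float):
    """Return the cheapest candidate whose hard-case accuracy clears the bar, or None."""
    eligible = [c for c in candidates if c["hard_case_accuracy"] >= accuracy_bar]
    return min(eligible, key=lambda c: c["cost"], default=None)
```

The order of operations mirrors the rule in the text: accuracy on hard cases filters the candidates first, and cost only chooses among those that remain.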
Use the tool: Financial Document Token Estimator (Calculators)
Paste a 10-K, 10-Q, 8-K, or earnings transcript and see the token count and one-pass extraction cost across eight frontier LLMs, with a cache-hit toggle.
Common Mistakes
The misses that undo good inputs
Judging a model on one clean demo filing
A single well-formatted filing hides every failure mode that matters. The model that aces a demo can fail on footnotes, restatements, and segment tables, which is exactly where extraction errors are most costly.
Scoring accuracy but not source faithfulness
A model can produce a correct value while citing the wrong passage, which is luck rather than reliability. Without a faithfulness score you cannot tell a robust extractor from one that will fail unpredictably on the next filing.
Letting cost decide before accuracy does
Choosing the cheapest model first and hoping accuracy is adequate inverts the priority. In extraction, a wrong figure can cost far more than the token savings, so accuracy on hard cases must gate the choice before cost is considered.