
How to Evaluate an LLM for 10-K Extraction

Extracting structured fields from 10-K filings is a task where a model can look impressive on a clean example and fail quietly on the messy reality of footnotes, segment tables, and restatements. Choosing a model by vibe or by one demo is how silent errors reach a decision. The reliable approach is a measured evaluation against a labeled set. This guide lays out how to build that evaluation and what to score, so that the model you choose earns its place.

By AI Fin Hub Research · AI Fin Hub Team

Before You Start

Set up the inputs that make the next steps easier

A defined extraction schema: the exact fields you need, their types, and their units.
A set of representative filings, including hard ones, with human-verified correct values for each field.
A short list of candidate models and your own API access to run them.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

  1. Define the extraction schema precisely

    Before evaluating any model, pin down exactly what you are extracting: each field name, its data type, its unit, and how ambiguous cases are resolved. A vague schema makes evaluation impossible because you cannot tell a wrong answer from a different-but-valid one. A precise schema also lets you validate the model's output structurally before you even check the values, catching malformed responses cheaply.

    Specify units and sign conventions explicitly. Most extraction disagreements on filings come from thousands-versus-millions and parenthesis-means-negative ambiguity, not from the model missing the number.
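
As an illustration of the step above, here is a minimal Python sketch of a schema with a cheap structural check and an explicit unit and sign convention. The field names, the `FieldSpec` class, the `{"value": ..., "unit": ...}` record shape, and the rule that every amount is stored as an integer in base units are all assumptions made for the example, not a prescribed format.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class FieldSpec:
    name: str
    unit: str  # unit the stored value must be expressed in


# Hypothetical schema: every amount is stored as an integer in base units.
SCHEMA = (
    FieldSpec("total_revenue", "USD"),
    FieldSpec("net_income", "USD"),
    FieldSpec("diluted_shares", "shares"),
)

SCALE = {"units": 1, "thousands": 1_000, "millions": 1_000_000}


def parse_reported_number(text: str, scale: str) -> Decimal:
    """Convert a figure as printed in a filing to a signed base-unit value."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:  # accounting convention: parentheses mean negative
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned) * SCALE[scale]
    return -value if negative else value


def validate_structure(record: dict) -> list[str]:
    """List structural problems; says nothing about whether values are right."""
    problems = []
    known = {spec.name for spec in SCHEMA}
    for name in record:
        if name not in known:
            problems.append(f"unexpected field: {name}")
    for spec in SCHEMA:
        entry = record.get(spec.name)
        if not isinstance(entry, dict):
            problems.append(f"missing or malformed field: {spec.name}")
            continue
        value = entry.get("value")
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{spec.name}: value must be an integer")
        if entry.get("unit") != spec.unit:
            problems.append(f"{spec.name}: unit must be {spec.unit}")
    return problems
```

Under these assumptions, `parse_reported_number("(1,234)", "thousands")` resolves both ambiguities at once and yields -1234000, and `validate_structure` rejects a response before any value is compared against the gold set.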

    Use the tool (Playgrounds)

    Structured Schema Validator for Finance

    Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.

  2. Build a labeled gold set

    Assemble a set of filings and have a human record the correct value for every field in your schema. Include variety on purpose: different industries, filing sizes, and presentation styles. The gold set is the ground truth your evaluation scores against, so its quality bounds the quality of every conclusion. A gold set of a handful of clean filings will overstate accuracy; one that spans the real distribution will not.

    Have a second person spot-check the labels. Gold-set errors masquerade as model errors and quietly corrupt the entire evaluation.
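
One way to organize the spot-check is sketched below, assuming the gold set is a plain mapping of filing ID to field values. The function names, the 20% default sampling fraction, and the filing IDs in the usage note are hypothetical.

```python
import random


def spot_check_sample(gold, fraction=0.2, seed=0):
    """Pick a reproducible random subset of (filing_id, field) labels to re-check."""
    keys = sorted((fid, field) for fid, fields in gold.items() for field in fields)
    k = min(len(keys), max(1, round(len(keys) * fraction)))
    return random.Random(seed).sample(keys, k)


def label_disagreements(gold, second_pass):
    """Return labels where the second reviewer's value differs from the gold value."""
    return [
        (fid, field, gold[fid][field], value)
        for (fid, field), value in sorted(second_pass.items())
        if gold[fid][field] != value
    ]
```

The second reviewer labels only the sampled keys, for example `{("acme-2023", "net_income"): 56000}`, and every tuple returned by `label_disagreements` is a label to resolve before any model is scored. A fixed seed keeps the sample reproducible across runs.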

  3. Score field-level accuracy and faithfulness

    Run each candidate model over the gold set and score two things separately: did it get each field's value right, and is each value actually supported by the source text it cited. A model can be accurate by luck while citing the wrong passage, which is fragile. Field-level accuracy tells you how often the answer is correct; faithfulness tells you whether you can trust the reason. Report per-field accuracy, not just an overall average, since some fields are far harder than others.

    Track which specific fields fail. A model that is excellent on the income statement but weak on segment footnotes needs a targeted fix, not a wholesale replacement.
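
The two scores can be kept separate with a small per-field tally, sketched here in Python. The data shapes are assumptions: each prediction carries a `value` and a `quote` (the passage the model cited), and faithfulness is approximated by whether that quote appears in the source after whitespace normalization. That proxy is deliberately crude; a stricter check would also confirm that the quoted passage states the value.

```python
from collections import defaultdict


def _norm(text: str) -> str:
    """Collapse whitespace so line breaks in the filing do not break matching."""
    return " ".join(text.split())


def score(gold, predictions, sources):
    """Per-field accuracy and faithfulness over the whole gold set.

    gold:        {filing_id: {field: true_value}}
    predictions: {filing_id: {field: {"value": ..., "quote": "..."}}}
    sources:     {filing_id: full filing text}
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "faithful": 0})
    for filing_id, fields in gold.items():
        source = _norm(sources[filing_id])
        for field, truth in fields.items():
            pred = predictions.get(filing_id, {}).get(field, {})
            tally = counts[field]
            tally["n"] += 1
            tally["correct"] += pred.get("value") == truth
            quote = _norm(pred.get("quote", ""))
            # Assumed proxy: the cited passage must exist verbatim in the source.
            tally["faithful"] += bool(quote) and quote in source
    return {
        field: {
            "accuracy": t["correct"] / t["n"],
            "faithfulness": t["faithful"] / t["n"],
        }
        for field, t in counts.items()
    }
```

Reporting the result keyed by field, rather than as a single average, is what makes a weak field visible next to a strong one.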

  4. Stress-test the hard cases

    The average filing is easy; the value of the evaluation is in the hard cases. Deliberately test footnote-buried figures, restated prior-year numbers, multi-segment tables, and unusual formatting. These are where models silently err and where the cost of an error is highest. A model that handles the clean cases but collapses on footnotes is not ready for production extraction, and only a stress test reveals that.

    Restatements are a classic trap: the model must extract the figure as filed, not the corrected one, unless your schema says otherwise. Test this case explicitly.
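
A simple way to make the stress test measurable is to tag each gold-set case with the trap it exercises and report accuracy per tag. The sketch below assumes a flat list of cases and a prediction lookup keyed by `(filing_id, field)`; the tag names and the `"clean"` default are illustrative.

```python
from collections import defaultdict


def accuracy_by_tag(cases, predictions):
    """Accuracy per hard-case tag; untagged cases are counted as "clean".

    cases:       [{"filing_id", "field", "value", "tags": [...]}, ...]
    predictions: {(filing_id, field): predicted_value}
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for case in cases:
        key = (case["filing_id"], case["field"])
        correct = predictions.get(key) == case["value"]
        for tag in case["tags"] or ["clean"]:
            totals[tag] += 1
            hits[tag] += correct
    return {tag: hits[tag] / totals[tag] for tag in totals}
```

For a restatement case, the gold `value` should hold whichever figure the schema calls for, so a model that returns the other one is scored as wrong under the `restatement` tag instead of disappearing into the overall average.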

  5. Verify the numbers the winner extracts

    Even the best model in your evaluation will make transcription and unit errors at some rate. Build a per-number verification step that checks each extracted figure against the source text and flags mismatches for review. This is not a substitute for choosing a good model; it is the production safety net that catches the residual errors the chosen model still makes. Evaluation picks the model; verification protects the output in the field.

    Surface disagreements rather than silently overriding them. A flagged mismatch routed to a human is safe; a silent correction can hide a real problem in the source.
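
A per-number check along these lines can be sketched in a few lines of Python. This is a naive illustration, not the method behind any particular tool: it assumes extracted values are integers in base units, that the filing prints figures in units, thousands, or millions, and that parentheses mark negatives. It only flags; it never rewrites a value.

```python
import re
from decimal import Decimal

# Digits with optional thousands separators, decimals, and wrapping parentheses.
NUMBER = re.compile(r"\(?\d[\d,]*(?:\.\d+)?\)?")


def source_numbers(text: str) -> set:
    """Collect every figure printed in the source as a signed Decimal."""
    numbers = set()
    for token in NUMBER.findall(text.replace("$", "")):
        negative = token.startswith("(") and token.endswith(")")
        value = Decimal(token.strip("()").replace(",", ""))
        numbers.add(-value if negative else value)
    return numbers


def verify(extracted: dict, source_text: str, scales=(1, 1_000, 1_000_000)) -> list:
    """Return the fields whose value matches no printed figure at any scale."""
    printed = source_numbers(source_text)
    flagged = []
    for field, value in extracted.items():
        if not any(Decimal(value) == n * s for n in printed for s in scales):
            flagged.append(field)  # route to a human; do not auto-correct
    return flagged
```

A figure printed without parentheses under a "net loss" caption would be flagged by this sketch even when the extraction is right, which is the safe direction for an error: the mismatch reaches a reviewer instead of being silently resolved.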

    Use the tool (Playgrounds)

    Hallucination Detector

    Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.

  6. Compare the survivors on cost

    Only after the accuracy and faithfulness scores narrow the field to acceptable models should cost decide. Estimate the token cost per filing for each survivor, including the input size of a full 10-K and any cache benefit from a stable prompt. The right model is the cheapest one that clears your accuracy bar on the hard cases, not the cheapest one overall and not the most accurate regardless of price.

    A 10-K is a large input, so per-filing cost is dominated by input tokens. Caching a stable extraction prompt and schema can change the cost ranking of the survivors.
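
The selection rule can be written down as a short calculation. In the sketch below every number is made up for illustration: the per-million-token rates, the token counts, and the model names are hypothetical, and real pricing and cache behavior vary by provider. It assumes only the stable prompt and schema are cacheable, while the filing text is billed at the full input rate on every call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float         # USD per million uncached input tokens
    cached_input_per_mtok: float  # USD per million cached input tokens
    output_per_mtok: float        # USD per million output tokens


def cost_per_filing(p, prompt_tokens, filing_tokens, output_tokens, prompt_cached=False):
    """Estimated USD cost of one extraction pass over one filing."""
    prompt_rate = p.cached_input_per_mtok if prompt_cached else p.input_per_mtok
    total = (
        prompt_tokens * prompt_rate
        + filing_tokens * p.input_per_mtok
        + output_tokens * p.output_per_mtok
    )
    return total / 1_000_000


def cheapest_passing(candidates, accuracy_bar, **usage):
    """Name of the cheapest model whose hard-case accuracy clears the bar."""
    passing = {
        name: cost_per_filing(c["pricing"], **usage)
        for name, c in candidates.items()
        if c["hard_case_accuracy"] >= accuracy_bar
    }
    return min(passing, key=passing.get) if passing else None


# Hypothetical candidates; rates and accuracies are invented for the example.
candidates = {
    "model_a": {"pricing": Pricing(1.00, 0.10, 4.00), "hard_case_accuracy": 0.96},
    "model_b": {"pricing": Pricing(0.90, 0.90, 4.00), "hard_case_accuracy": 0.95},
    "model_c": {"pricing": Pricing(0.20, 0.20, 1.00), "hard_case_accuracy": 0.80},
}
usage = {"prompt_tokens": 20_000, "filing_tokens": 100_000, "output_tokens": 1_000}
```

With these invented figures and an accuracy bar of 0.95, `model_c` is excluded despite being cheapest, `model_b` wins when nothing is cached, and `model_a` wins once the 20,000-token prompt is served from cache. The size of that effect depends on how large the stable prompt is relative to the filing.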

    Use the tool (Calculators)

    Financial Document Token Estimator

    Paste a 10-K, 10-Q, 8-K or earnings transcript and see token count + one-pass extraction cost across eight frontier LLMs, with cache-hit toggle.


Common Mistakes

The misses that undo good inputs

1. Judging a model on one clean demo filing

A single well-formatted filing hides every failure mode that matters. The model that aces a demo can fail on footnotes, restatements, and segment tables, which is exactly where extraction errors are most costly.

2. Scoring accuracy but not source faithfulness

A model can produce a correct value while citing the wrong passage, which is luck rather than reliability. Without a faithfulness score you cannot tell a robust extractor from one that will fail unpredictably on the next filing.

3. Letting cost decide before accuracy does

Choosing the cheapest model first and hoping accuracy is adequate inverts the priority. In extraction a wrong figure can be far more expensive than the token savings, so accuracy on hard cases must gate the choice before cost is considered.

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

How large should the gold set be?

Large enough to span the real distribution of filings you will process, including the hard cases, rather than a fixed count. A gold set that is all clean, large-cap, single-segment filings will overstate accuracy. Prioritize variety over volume: a few dozen filings chosen to cover different industries, sizes, formats, and known traps is more informative than hundreds of similar easy ones.

Planning estimates only — not financial, tax, or investment advice.