TL;DR
LLM filing extraction reaches high recall on raw numerics and low reliability on the metadata that makes a number usable. Six traps account for almost every downstream error: scale ("dollars in thousands"), reporting currency, GAAP versus non-GAAP, diluted versus basic share counts, prior-period restatements, and rounded-versus-exact reporting. Better prose prompting does not close the gap. The pattern that does is structured output with explicit unit, scale, basis, share_class, and is_restated fields, paired with a post-extraction sanity check that compares each value against the three-year moving mean from prior filings. Below: the traps, a runnable schema, and a numeric-drift flag in about seventy lines of Python.
The shape of the problem
A practitioner running a 10-K pipeline will observe the following split. Gross extraction recall on numeric tokens — "what numbers does this filing contain" — is consistently high. Committing to the right number, expressed in the right unit, reconciled to the right accounting basis, for the right share class, from the right reporting period, is where accuracy falls apart. The model confidently returns revenue = 150.3 from a filing whose statements are captioned "in thousands, except per-share data" and whose cover page is denominated in EUR. Downstream code multiplies that 150.3 by a peer-group multiple and prints a fair-value estimate off by roughly three orders of magnitude.
The fix is not a longer system prompt. The fix is structural: force the model to commit to metadata on every numeric claim, then verify the commitments against the filing and against prior filings before any number is allowed into a downstream calculation. LLM Prompt Patterns for 10-K and 8-K Extraction covers the general structure; this piece narrows to the six numeric traps that produce the majority of observed errors.
The six traps
Each trap is illustrated with a synthetic excerpt from "SYNTHETIC_A Corp," a fictional large-cap filer. Every failure pattern below has been observed across current-generation frontier models during filings extraction work; the fix in each case is the same shape — promote the hidden assumption into an explicit structured field.
Trap 1: Unit and scale handling
10-K and 10-Q filings almost always carry a scale declaration at the top of each financial statement. A typical form:
```
SYNTHETIC_A Corp — Consolidated Statements of Operations
(In millions of U.S. dollars, except per-share data and share counts)

                       FY2025    FY2024
Revenue                12,430    11,210
Cost of revenue         7,102     6,488
Operating income        2,914     2,605
```
The number "12,430" means 12.43 billion USD, not 12,430 USD. A model asked "what is SYNTHETIC_A's revenue?" under a naive prompt will often return 12430 with no scale annotation, or the string "12,430 million," which a downstream numeric parser treats as 12,430. Both are wrong in ways that are invisible until a dashboard or valuation model produces absurd output.
The fix is to require a scale field with an enumerated set of values (units, thousands, millions, billions) and a value_numeric field that holds only the raw number as printed. The post-processor multiplies value_numeric by the scale factor before use. The model is never asked to do the multiplication itself — multiplication is deterministic code, not a generative task.
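A minimal sketch of that deterministic step (the `SCALE_FACTORS` table and `normalize_scale` helper are illustrative names, not part of any library):

```python
# Deterministic scale normalization. The model emits the printed number and an
# enumerated scale; code does the multiplication.
SCALE_FACTORS = {
    "units": 1,
    "thousands": 1_000,
    "millions": 1_000_000,
    "billions": 1_000_000_000,
}

def normalize_scale(value_numeric: float, scale: str) -> float:
    """Convert a value exactly as printed to base currency units."""
    return value_numeric * SCALE_FACTORS[scale]  # KeyError on unknown scale, by design

# "12,430" under "(In millions)" is 12.43 billion base units.
assert normalize_scale(12_430, "millions") == 12_430_000_000
```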
Trap 2: Currency handling
Filers domiciled outside the US report in local currency on the primary statements and often provide a translation note with the relevant exchange rates. European issuers filing a 20-F typically report in EUR; Japanese issuers in JPY. A typical excerpt:
```
SYNTHETIC_A AG — Consolidated Income Statement
(In millions of EUR, except per-share amounts)

                       FY2025    FY2024
Revenue                10,840     9,720

Note 34 — Exchange rates
Closing rate at year-end 2025:  1 EUR = 1.082 USD
Average rate for FY2025:        1 EUR = 1.074 USD
```
Naive prompts produce revenue = 10840 with no currency flag. Downstream comparison against a USD-denominated peer set silently mixes currencies. Even when the filing disambiguates, the LLM often strips the annotation in the output, because "10,840" is the answer to "what is the number."
The fix is a unit field with values such as USD, EUR, GBP, JPY. Currency conversion happens in a separate, auditable step — never inside the extraction model. Conversion-rate sources should be pinned to the filing's own note (Note 34 above) rather than a present-day FX feed, because the filing's disclosed peer comparisons are computed at historical rates.
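A sketch of that separate, auditable step, assuming the rate has already been extracted from the filing's own note; `FilingFxRate` and `to_usd` are illustrative names, not part of the schema below:

```python
from dataclasses import dataclass

@dataclass
class FilingFxRate:
    """Exchange rate as disclosed in the filing itself (e.g. Note 34), not a live feed."""
    base: str         # e.g. "EUR"
    quote: str        # e.g. "USD"
    rate: float       # e.g. 1.074, the average rate for the period
    source_note: str  # provenance, e.g. "Note 34, average rate FY2025"

def to_usd(value: float, unit: str, fx: FilingFxRate) -> float:
    """Convert using only the filing-disclosed rate; refuse anything else."""
    if unit == "USD":
        return value
    if unit != fx.base or fx.quote != "USD":
        raise ValueError(f"no filing-disclosed rate for {unit} -> USD")
    return value * fx.rate

note_34 = FilingFxRate("EUR", "USD", 1.074, "Note 34, average rate FY2025")
print(to_usd(10_840, "EUR", note_34))  # 11642.16 (still in millions)
```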
Trap 3: GAAP versus non-GAAP
SEC Regulation G requires any non-GAAP measure presented to investors to be reconciled to the most directly comparable GAAP measure within the same filing. The reconciliation table is mandatory. The trap is that 10-K and 10-Q filings frequently emphasise the non-GAAP figure in the narrative and in MD&A, while reporting a different — usually lower — GAAP figure in the audited statements. Excerpt:
```
SYNTHETIC_A Corp — Management Discussion and Analysis

Adjusted EBITDA for the fiscal year totaled $3,214 million, an increase
of 14.1% over the prior year. GAAP operating income was $2,914 million,
including $180 million of restructuring charges and $120 million of
stock-based compensation not included in the adjusted measure.
```
A prompt asking "what was operating income?" reliably returns one of the two, with no flag. Which one depends on which passage the model attended to, which depends on document order, chunking, and context window pressure. Two consecutive runs against the same filing can return different numbers.
The fix is a basis field with values gaap, non_gaap, or adjusted, and a rule that forbids emitting a claim without a basis commitment. Any valuation pipeline consuming the extraction should filter on basis == "gaap" by default, and only use non-GAAP measures where the downstream model is explicitly calibrated on them.
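The default filter is a few lines; a sketch assuming the NumericFact dataclass defined in the schema section below (`gaap_only` is an illustrative helper, not an established API):

```python
def gaap_only(facts: list["NumericFact"]) -> list["NumericFact"]:
    """Default for valuation inputs: audited GAAP figures only.

    Non-GAAP facts are not discarded upstream; they simply never enter a
    calculation unless the consumer opts in explicitly.
    """
    return [f for f in facts if f.basis == "gaap"]
```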
Trap 4: Diluted versus basic share count
Per-share metrics — EPS, book value per share, free cash flow per share — come in two flavours. Basic EPS uses the weighted-average share count during the period. Diluted EPS adds the dilutive effect of options, RSUs, convertibles, and warrants. For growth-stage filers the two can differ materially:
```
SYNTHETIC_A Corp — Earnings Per Share

                                                           FY2025    FY2024
Basic EPS                                                   $2.41     $2.18
Diluted EPS                                                 $2.19     $2.02
Weighted-average shares outstanding — basic (millions)      1,209
Weighted-average shares outstanding — diluted (millions)    1,331
```
A valuation using diluted shares produces one fair value; using basic shares produces another. A PEG ratio mixing basic EPS from one filing with diluted from another is arithmetically correct and financially meaningless.
The fix is a share_class field with values basic, diluted, or null (for non-per-share figures). Downstream pipelines enforce a consistent class across comparisons.
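A sketch of that enforcement, again assuming the NumericFact dataclass defined below; the helper name is illustrative:

```python
def assert_single_share_class(facts: list["NumericFact"]) -> None:
    """Fail loudly when a comparison set mixes basic and diluted per-share figures."""
    classes = {f.share_class for f in facts if f.share_class is not None}
    if len(classes) > 1:
        raise ValueError(f"mixed share classes in comparison set: {sorted(classes)}")
```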
Trap 5: Restatements
Prior-period numbers in a current 10-K are not always identical to the numbers reported in that period's original 10-K. Filers restate for discontinued operations, segment reporting changes, accounting-standard transitions, and material errors. The current 10-K shows the restated figure; the archived prior 10-K shows the original. Naive extraction across a historical corpus mixes the two.
```
SYNTHETIC_A Corp — Note 2: Restatement

Prior-year comparative amounts have been restated to reflect the
adoption of ASU 2024-03 on revenue from customer contracts. The
restatement reduced previously reported FY2024 revenue from
$11,485 million to $11,210 million.
```
Running the same extraction against the archived FY2024 10-K returns 11,485; running against the FY2025 10-K returns 11,210 for the same period. A time series stitched from raw extractions shows a spurious jump. Worse, the naive output gives no hint that a restatement occurred.
The fix is an is_restated boolean plus a source_snippet that captures the verbatim text the number was read from, including any restatement footnote. Pipelines should prefer the most recent filing's restated figures for historical comparisons, and should log every case where a value differs from prior extractions.
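A sketch of restatement-aware stitching over per-filing extractions; the DataFrame columns here are assumptions about your storage layout, not part of the schema above:

```python
import pandas as pd

def stitch_series(extractions: pd.DataFrame) -> pd.Series:
    """One value per fiscal period, preferring the most recent filing's figure.

    Expects columns: period_end, filing_date, value_numeric.
    """
    # Log every period where filings disagree, i.e. a likely restatement.
    for period, group in extractions.groupby("period_end"):
        if group["value_numeric"].nunique() > 1:
            print(f"restatement candidate for {period}: "
                  f"{sorted(group['value_numeric'].unique())}")
    # Within each period, keep the value from the latest filing.
    return (extractions.sort_values("filing_date")
                       .groupby("period_end")
                       .last()["value_numeric"])

# FY2024 revenue as originally filed vs as restated in the FY2025 10-K:
df = pd.DataFrame({
    "period_end": ["2024-12-31", "2024-12-31"],
    "filing_date": ["2025-02-15", "2026-02-14"],
    "value_numeric": [11_485, 11_210],
})
print(stitch_series(df))  # keeps 11,210 and logs the 11,485 -> 11,210 divergence
```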
Trap 6: Rounding and significant figures
The MD&A frequently rounds: "revenue of $12.4 billion," "approximately $1.2B in capital expenditure." The income statement gives the exact figure to the nearest million: 12,430, 1,234. An extractor that pulls the MD&A text loses precision irreversibly. When the rounded figure is compared against a prior-period exact figure, artificial drift appears.
The rule is simple: always extract from the audited statements when possible, fall back to MD&A only with a flag, and never present a rounded figure as exact. A precision annotation on the structured output — "reported to nearest million" versus "reported to one decimal billion" — preserves the information for downstream code to decide whether a comparison is meaningful.
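The schema in the next section does not carry a precision field, so here is a minimal sketch of how such an annotation could drive comparisons; the labels and tolerance values are illustrative:

```python
# Rounding half-width implied by each illustrative precision label, in base units.
TOLERANCE = {
    "nearest_million": 0.5e6,       # statement figures such as 12,430 (millions)
    "one_decimal_billion": 0.05e9,  # MD&A figures such as "$12.4 billion"
}

def consistent(exact: float, rounded: float, precision: str) -> bool:
    """True when the rounded figure could plausibly round to/from the exact one."""
    return abs(exact - rounded) <= TOLERANCE[precision]

# "$12.4 billion" in MD&A vs 12,430 million in the statements: agreement, not drift.
assert consistent(12_430e6, 12.4e9, "one_decimal_billion")
```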
A structured output schema
The following schema bundles the five metadata fields (unit, scale, basis, share_class, is_restated) plus the two that make errors auditable (as_of_date, source_snippet). The example uses Anthropic's tool-use interface on Claude Sonnet 4.6; the same shape works with JSON mode on any modern frontier model.[^1]
```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Optional

from anthropic import Anthropic

Unit = Literal["USD", "EUR", "GBP", "JPY", "CHF", "CNY", "OTHER"]
Scale = Literal["units", "thousands", "millions", "billions"]
Basis = Literal["gaap", "non_gaap", "adjusted"]
ShareCls = Literal["basic", "diluted"]


@dataclass
class NumericFact:
    field_name: str
    value_numeric: float             # raw number exactly as printed, never scaled
    unit: Unit
    scale: Scale
    basis: Basis
    share_class: Optional[ShareCls]  # None for non-per-share figures
    as_of_date: str                  # ISO YYYY-MM-DD, period-end date
    is_restated: bool
    source_snippet: str              # verbatim, <= 240 chars


EXTRACTION_TOOL = {
    "name": "emit_numeric_fact",
    "description": (
        "Emit one numeric fact from the filing. Every field is required. "
        "source_snippet must be copied verbatim from the filing text."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "field_name": {"type": "string"},
            "value_numeric": {"type": "number"},
            "unit": {"type": "string", "enum": list(Unit.__args__)},
            "scale": {"type": "string", "enum": list(Scale.__args__)},
            "basis": {"type": "string", "enum": list(Basis.__args__)},
            "share_class": {"type": ["string", "null"],
                            "enum": list(ShareCls.__args__) + [None]},
            "as_of_date": {"type": "string"},
            "is_restated": {"type": "boolean"},
            "source_snippet": {"type": "string", "maxLength": 240},
        },
        "required": ["field_name", "value_numeric", "unit", "scale",
                     "basis", "share_class", "as_of_date",
                     "is_restated", "source_snippet"],
    },
}

SYSTEM = (
    "Extract numeric facts from the attached SEC filing. For each fact, "
    "call the emit_numeric_fact tool exactly once. Use scale=units when "
    "the value is stated without a scale declaration. Set share_class=null "
    "for non-per-share values. source_snippet must be an exact substring "
    "of the filing; never paraphrase. Do not infer values not stated."
)


def extract_facts(filing_text: str, fields: list[str]) -> list[NumericFact]:
    client = Anthropic()
    user = f"Fields to extract: {fields}\n\nFiling:\n{filing_text}"
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=SYSTEM,
        tools=[EXTRACTION_TOOL],
        tool_choice={"type": "any"},  # force a tool call; no free-text answers
        messages=[{"role": "user", "content": user}],
    )
    facts = []
    for block in resp.content:
        if block.type == "tool_use" and block.name == "emit_numeric_fact":
            facts.append(NumericFact(**block.input))
    return facts
```
Three properties of this setup matter. First, the model cannot emit a number without committing to the five metadata fields — the tool schema is enforced by the API. Second, the source_snippet field makes every extraction auditable against the filing text by substring check. Third, scale multiplication never happens inside the model; it happens in deterministic code against the enumerated scale value.
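A sketch of the cheap audit pass implied by the second and third properties; `audit_fact` is an illustrative helper, not part of any SDK:

```python
def audit_fact(fact: NumericFact, filing_text: str) -> list[str]:
    """Post-extraction checks that need no model call. Empty list = pass."""
    problems = []
    # Property two: the snippet must be verbatim filing text.
    if fact.source_snippet not in filing_text:
        problems.append("source_snippet is not a verbatim substring of the filing")
    # Property three: scaling stays deterministic; reject anything unenumerated.
    if fact.scale not in Scale.__args__:
        problems.append(f"unenumerated scale: {fact.scale!r}")
    return problems

# Assuming facts and filing_text from extract_facts above: anything that fails
# a check never reaches downstream code.
clean = [f for f in facts if not audit_fact(f, filing_text)]
```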
Sanity check against history
Structured extraction eliminates most of the metadata errors but not all of the numeric ones. A model can still misread "12,430" as "1,243" or transpose digits. The cheapest defence is a sanity check against the filer's own history: any value that diverges by more than two standard deviations from the three-year moving mean gets flagged for human review.
```python
from __future__ import annotations

from dataclasses import dataclass

import pandas as pd


@dataclass
class DriftFlag:
    field_name: str
    new_value: float
    historical_mean: float
    historical_std: float
    z_score: float
    requires_review: bool


def drift_check(
    field_name: str,
    new_value: float,
    history: pd.Series,  # prior 3-12 fiscal years, oldest first
    z_threshold: float = 2.0,
    min_history: int = 3,
) -> DriftFlag:
    h = history.dropna().astype(float)
    if len(h) < min_history:
        # Too little history to score; send straight to human review.
        return DriftFlag(field_name, new_value, float("nan"),
                         float("nan"), float("nan"), requires_review=True)
    mu = float(h.mean())
    sd = float(h.std(ddof=1)) if len(h) > 1 else 0.0
    if sd == 0.0:
        # Flat history: any deviation at all is maximally suspicious.
        z = 0.0 if new_value == mu else float("inf")
    else:
        z = (new_value - mu) / sd
    return DriftFlag(
        field_name=field_name,
        new_value=new_value,
        historical_mean=mu,
        historical_std=sd,
        z_score=float(z),
        requires_review=abs(z) > z_threshold,
    )


# Example usage on a synthetic revenue history:
history = pd.Series([9_200, 9_740, 10_380, 11_210], name="revenue_millions")
flag = drift_check("revenue_millions", new_value=12_430, history=history)
# flag.z_score ~ 2.66 -> requires_review=True
```
This is deliberately conservative. A two-sigma gate catches digit transpositions, scale-shift errors (a thousand-fold jump is off the charts), and wrong-currency returns on non-US filers. It does not catch subtle errors within the typical range of year-over-year variation; those require the contradiction-triangle pattern from LLM Prompt Patterns for 10-K and 8-K Extraction.
Two-sigma is a starting point, not a universal threshold. Growth-stage filers with structurally high variance will over-flag; mature slow-movers will under-flag. Practitioners in production calibrate the threshold per industry or per filer.
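One lightweight way to encode that calibration; the cohort labels and threshold values below are placeholders to be fit on your own corpus, not recommendations:

```python
# Hypothetical per-cohort z thresholds; fit these on held-out filings.
Z_BY_COHORT = {"growth": 3.0, "mature_slow_mover": 1.5}

def threshold_for(cohort: str, default: float = 2.0) -> float:
    """Look up a calibrated z threshold, falling back to the two-sigma default."""
    return Z_BY_COHORT.get(cohort, default)

# Reusing drift_check and history from the block above:
flag = drift_check("revenue_millions", new_value=12_430, history=history,
                   z_threshold=threshold_for("growth"))  # z ~ 2.66 passes at 3.0
```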
Qualitative model priors on the six traps
Published head-to-head benchmarks on SEC filings numeric extraction across current frontier models are not available at the time of writing. The table below reflects qualitative priors based on practitioner consensus and on capabilities documented in vendor model cards, not measured benchmark values. Use it as a starting hypothesis, not a ranking. Any production pipeline should validate on its own filing corpus before relying on one model over another.
| Trap | Frontier A | Frontier B | Frontier C |
|---|---|---|---|
| Unit / scale | LOW | LOW | MEDIUM |
| Currency | LOW | MEDIUM | MEDIUM |
| GAAP vs non-GAAP | MEDIUM | MEDIUM | MEDIUM |
| Diluted vs basic | MEDIUM | MEDIUM | HIGH |
| Restatements | HIGH | HIGH | HIGH |
| Rounding / significant figures | LOW | LOW | LOW |
"LOW / MEDIUM / HIGH" refers to qualitative error frequency on the underlying trap when prompted without structured output. All three columns tested with structured output and enforced schemas collapse to LOW across every row except restatements, which is a data problem the model cannot solve on its own (the filing order matters).
Two patterns hold across models. First, the single largest lift comes from schema enforcement, not from model choice; any frontier model with reliable tool use is close to parity once the schema is in place. Second, restatements require access to the current filing's footnotes and to prior filings' text. A single-document prompt cannot catch them regardless of model.
Connects to
- Reading Financial Filings with LLMs in 2026 — pillar context on how filings extraction fits into a broader LLM-in-markets stack.
- LLM Prompt Patterns for 10-K and 8-K Extraction — general prompt patterns (field-by-field, citation-required, contradiction-triangle) that compose with the schema here.
- 5 Failure Modes of LLM Trading Agents — numeric precision errors are one of the failure modes in production extraction loops.
- Prompt Patterns for Earnings Calls — adjacent source format with its own numeric traps (forward-looking guidance, analyst-cited figures).
- Structured Schema Validator for Finance — browser tool that validates `NumericFact`-shaped outputs against the rules above.
- Hallucination Detector — pairs with this schema to check that every `source_snippet` actually occurs in the filing text.
- Agent Skill Tester — replay harness for running the extraction schema against a corpus of filings and scoring drift.
References
- U.S. Securities and Exchange Commission. "Form 10-K." General Instructions and Item specifications, sec.gov/files/form-10-k.pdf. Authoritative source for the structure and disclosure requirements referenced in the traps above.
- U.S. Securities and Exchange Commission. "Form 10-Q." General Instructions and Item specifications, sec.gov/files/form-10-q.pdf. Governs quarterly disclosures and interim restatement rules.
- U.S. Securities and Exchange Commission. "Regulation G — Disclosure of Non-GAAP Financial Measures." 17 CFR Parts 228, 229, and 244. Governs the reconciliation requirement referenced in Trap 3.
- Financial Accounting Standards Board. "ASC 260 — Earnings Per Share." Specifies basic and diluted share-count calculations referenced in Trap 4.
- U.S. Securities and Exchange Commission. "EDGAR Filer Manual (Volume II)." sec.gov/info/edgar/edgarfm-vol2-v68.pdf. Mechanical specification for filing structure and amendment / restatement handling referenced in Trap 5.
Footnotes

[^1]: Anthropic. (2026). "Tool use with the Messages API." Anthropic documentation, docs.anthropic.com. Describes the `tools` parameter, enforced input schemas, and the `tool_choice` field used above.