TL;DR
Reading SEC filings is the single largest AI-in-finance workload in 2026. A 10-K runs 15 to 25 thousand tokens; a 10-Q runs 4 to 8 thousand; an 8-K is often under 1 thousand. Eight distinct tasks sit on top of that corpus: extraction, summarization, peer comparison, document Q and A, classification, tone-shift detection, forecast-feature generation, and disclosure-change detection. Each has its own sweet spot in the model-by-pattern-by-cost matrix. There is no single best model and no single best pattern. This guide is the map: what each task actually demands, which model tier and prompting pattern fits, and which supporting tools and articles cover the precision, cost, and evaluation concerns that cut across every workload.
The filings workload, sized honestly
SEC filings form the largest public text corpus an AI-in-finance engineer will ever touch. EDGAR holds roughly 18 million filings and accepts about 1.2 million new submissions per year.[1] A single large-cap 10-K is commonly 80 to 140 printed pages. After HTML-to-text normalization it resolves to 15,000 to 25,000 tokens on the Anthropic tokenizer and a comparable count on OpenAI's o200k_base.[2] Item 1A risk factors alone can exceed 8,000 tokens.
Token counts matter because they set the economic floor. At published April-2026 rates, Claude Sonnet 4.6 processes input at roughly $3 per million tokens; a single 20,000-token 10-K costs about $0.06 per read before any output.[3] GPT-5 standard input sits in a similar band; Gemini 2.5 Pro is priced below both for inputs under 200,000 tokens. Haiku 4.5 processes the same 10-K for about $0.015 per read, and Opus 4.1 for about $0.30. Pricing has moved twice in 2026 already, so treat any specific dollar figure as valid only for the date it was checked.
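The per-read arithmetic is worth scripting as a budget sanity check. The sketch below uses the per-million-token rates implied by the April-2026 per-read figures above; treat them as placeholders to re-check against current vendor pricing.

```python
# Back-of-envelope input cost per filing read. Rates are the per-million-token
# figures implied by the April-2026 per-read costs quoted above; verify against
# current vendor pricing before budgeting.
RATES_PER_MTOK = {"Haiku 4.5": 0.75, "Sonnet 4.6": 3.00, "Opus 4.1": 15.00}

def input_cost_per_read(filing_tokens: int, rate_per_mtok: float) -> float:
    """Input-side cost of one pass over a filing, before any output tokens."""
    return filing_tokens / 1_000_000 * rate_per_mtok

for model, rate in RATES_PER_MTOK.items():
    print(f"{model}: ${input_cost_per_read(20_000, rate):.3f} per 20,000-token 10-K")
```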
The workload naturally separates into eight tasks. They share a corpus and share infrastructure, but they do not share a sweet spot. Treating them as a single "read filings with LLM" problem is the most common architectural mistake in finance AI stacks.
The eight tasks
Task A: Structured extraction
Structured extraction pulls specific fields out of filings into typed records: revenue by segment, auditor name, going-concern language flag, number of board members, a named risk factor's presence, cash-flow sub-items. The output is a JSON object matching a fixed schema.
Latency matters here because extraction usually runs as a batch over thousands of filings and the wall-clock sets how fresh downstream tables stay. Accuracy sensitivity is high for numeric fields and moderate for categorical ones. The sweet spot is a Haiku or Sonnet tier model with structured-output mode on, paired with a deterministic validator that rejects outputs failing JSON schema or unit checks. Long context is wasted: extraction rarely needs more than the single filing section the field lives in, so chunk-then-route outperforms whole-document prompting by a factor of three to eight on cost.
For numeric fields the model should return both the value and the span it was extracted from. Span-anchored extractions are the only kind a downstream system can audit. The LLM Prompt Patterns for 10-K Extraction article covers the three extraction prompt shapes that survive contact with real filings.
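A minimal sketch of what a span-anchored extraction record can look like, assuming an illustrative field layout rather than any fixed standard: every numeric value carries its unit, its stated scale, and the verbatim span it was read from, so a downstream validator can re-check it.

```python
from dataclasses import dataclass

# Illustrative shape of a span-anchored numeric extraction. Field names are
# assumptions, not a fixed standard; the point is that every value carries
# the verbatim span it came from so a validator can re-check it later.
@dataclass
class NumericExtraction:
    field: str          # e.g. "revenue_total"
    value: float        # normalized to base units (e.g. USD)
    unit: str           # "USD", after applying the stated scale
    scale: str          # "millions" / "thousands" as stated in the filing
    source_span: str    # verbatim text the number was read from
    section: str        # e.g. "Item 8"

record = NumericExtraction(
    field="revenue_total",
    value=8_234_000_000.0,
    unit="USD",
    scale="millions",
    source_span="Total revenue of $8,234 million",
    section="Item 8",
)
```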
A realistic extraction throughput target on Haiku 4.5 is 40 to 80 filings per minute per API key, limited more by rate limits than by model inference time. At that rate, a weekly extraction pass over the 500 largest US issuers' quarterly filings completes in under fifteen minutes and costs roughly $0.40 to $1.20 total. The same pass on Sonnet 4.6 costs eight to ten times more and rarely produces measurably better numeric extractions when the prompt is well-constructed and the validator is deterministic. Practitioners in regulated settings routinely cross-check a random 2 to 5 percent sample against a second model tier as a drift detector rather than as the primary extractor.
Task B: Summarization
Summarization produces a human-readable digest: an executive summary of a 10-K, a compressed Item 1A, a one-paragraph take on a proxy statement. Unlike extraction, the output is prose and its quality is judged qualitatively.
The dominant cost driver is the input, not the output, because the system prompt and filing body are long while the summary is short. Prompt caching collapses this: Anthropic caches a system prompt for five minutes after the last read at 10 percent of the uncached input price.[3] A one-time system prompt of 6,000 tokens, read against 200 filings in a batch window, costs one cache write plus 199 cache reads rather than 200 full-price input passes. Haiku and Sonnet tiers are the common picks. Opus is only justified when the summary feeds a regulated deliverable.
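A rough sketch of that arithmetic, assuming a Sonnet-tier base input rate and the cache-read and cache-write ratios quoted in this guide; the exact rates are placeholders.

```python
# System-prompt cost across a batch, cached vs uncached. Base rate and ratios
# are placeholders: ~$3 per million input tokens, cache reads at 10% of base,
# cache writes at 125% of base (per the figures quoted in this guide).
BASE = 3.00 / 1_000_000          # $ per input token, Sonnet-tier assumption
CACHE_WRITE = BASE * 1.25
CACHE_READ = BASE * 0.10

def system_prompt_cost(system_tokens: int, n_filings: int) -> tuple[float, float]:
    """Return (uncached, cached) cost of reusing the same system prompt."""
    uncached = system_tokens * BASE * n_filings
    cached = system_tokens * (CACHE_WRITE + CACHE_READ * (n_filings - 1))
    return uncached, cached

uncached, cached = system_prompt_cost(6_000, 200)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")   # roughly $3.60 vs $0.38
```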
Summarization quality degrades in predictable ways. Generic prompts produce generic summaries; summaries that reference specific dollar figures, segment names, and named risks are worth twice as much to a human reader. A prompt that requires the model to cite at least three numeric anchors and at least two named risk factors produces outputs that a research analyst actually reads rather than skims. The same prompt applied to an 8-K, where the whole filing is often under 1,000 tokens, wastes structure; short filings deserve a different summarization prompt that emphasizes classification of the event type over narrative compression.
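A hedged example of what such a prompt can look like; the wording is illustrative, but the constraints mirror the requirements above.

```python
# Illustrative summary prompt enforcing the specificity requirements above.
SUMMARY_PROMPT = """\
Summarize the filing below for an equity research analyst in 200-300 words.
Requirements:
- Cite at least three specific numeric figures from the filing, with units.
- Name at least two specific risk factors from Item 1A.
- Flag any change in segment reporting or revenue-recognition language.
Use only information that appears in the filing.

<filing>
{filing_text}
</filing>"""

def build_summary_prompt(filing_text: str) -> str:
    return SUMMARY_PROMPT.format(filing_text=filing_text)
```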
Task C: Peer comparison
Peer comparison takes five to fifteen filings for a synthetic peer set and asks for synthesis: which issuers called out supply-chain risk, which changed revenue-recognition language, which lengthened their going-concern paragraph. This is the task that most visibly rewards longer context.
Three patterns work, in rising order of cost. First, pre-summarize each filing into a 500-token brief and concatenate briefs into the comparison prompt. Second, retrieve the relevant sections per question (risk factors only, MD and A only) and concatenate at section granularity. Third, place full filings in a 200K or 1M context window and ask directly. Pattern one costs about $0.05 per comparison; pattern three can exceed $1.50. The accuracy delta between one and three is smaller than most teams assume once the briefs are well-structured. Sonnet or GPT-5 standard handles one and two cleanly; Opus or GPT-5 Pro is reserved for three when cross-document reasoning demands it.
Peer-comparison prompts fail most often at the aggregation step rather than the per-document step. A common failure mode is the model answering four of five comparative questions from the first filing in context and degrading as the context position grows, which published needle-in-a-haystack results predict. Mitigation is structural: require a per-issuer table as intermediate output, with one row per filing, before the cross-issuer synthesis. The table forces the model to touch every document and makes downstream validation trivial. Pattern one's pre-summarization step does this implicitly; pattern three needs the discipline imposed explicitly.
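A minimal sketch of that discipline as a prompt builder, assuming per-issuer briefs or sections are already assembled; the input shape and the column names are illustrative.

```python
# Prompt builder that forces a per-issuer table before any cross-issuer
# synthesis. The briefs-per-issuer input and the column names are illustrative.
def build_peer_comparison_prompt(briefs: dict[str, str], question: str) -> str:
    docs = "\n\n".join(
        f'<filing issuer="{ticker}">\n{brief}\n</filing>'
        for ticker, brief in briefs.items()
    )
    return (
        f"{docs}\n\n"
        "Step 1: Output a table with exactly one row per issuer above. Columns: "
        "issuer; supply-chain risk called out (yes/no); revenue-recognition "
        "language changed (yes/no); going-concern paragraph lengthened (yes/no).\n"
        f"Step 2: Only after completing the table, answer: {question}\n"
        "Every claim in Step 2 must reference a row from the Step 1 table."
    )
```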
Task D: Document Q and A
Q and A is single-user, interactive, and latency-bound. A research analyst types "what did the issuer say about data-center capex guidance," and the system retrieves and answers in under five seconds.
Two architectures dominate. For a single filing, direct prompting with a mid-tier model and the filing in context is simplest and accurate. For a corpus (an issuer's last eight years, or a sector's quarterly filings), retrieval-augmented generation is required: a vector index over chunked filings, top-k retrieval, and a mid-tier model assembling the answer. Chunk size and overlap dominate RAG quality; the SEC Filing Chunk Optimizer surfaces the tradeoff explicitly for 10-K, 10-Q, and 8-K structures.
A sensible default chunk configuration for SEC filings is 800 to 1,200 tokens per chunk with 150 to 200 tokens of overlap, segmented on section boundaries where possible. Risk-factor sections deserve smaller chunks (400 to 600 tokens) because individual risks are short and retrieval should land on the specific risk, not a cluster of three. Financial-statement tables are the hardest case: naive chunking destroys row-column alignment. A production stack extracts tables separately into a structured store and retrieves them by metadata rather than by text similarity. This hybrid pattern, documented in the Lewis et al. RAG follow-ups, outperforms pure text RAG on filings by 20 to 40 points on exact-match numeric questions.[4]
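A sketch of those chunking defaults, with token counts approximated by a crude words-per-token ratio so the example stays self-contained; a production pipeline would count with the vendor tokenizer and use real section metadata.

```python
# Chunking sketch using the defaults above. Token counts are approximated with
# a crude words-per-token ratio so the example is self-contained; a production
# pipeline would count with the vendor tokenizer. Section labels are assumed.
def chunk_section(text: str, section: str) -> list[str]:
    # Smaller chunks for Item 1A so retrieval lands on one risk, not a cluster.
    target_tokens, overlap_tokens = (500, 100) if "1a" in section.lower() else (1000, 175)
    words_per_token = 0.75                    # rough heuristic, not a tokenizer
    size = int(target_tokens * words_per_token)
    overlap = int(overlap_tokens * words_per_token)
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap
    return chunks
```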
Task E: Classification
Classification assigns a taxonomy label: risk-factor category (cybersecurity, supply chain, regulatory, FX), sentiment band on an earnings call paragraph, disclosure-material flag. The output space is small and fixed.
Classification is the one task where fine-tuning still earns its keep. A few-thousand-example fine-tune on Haiku-tier or an open-weights 7B model can match Sonnet zero-shot at one tenth the inference cost. The break-even depends on volume: below 10,000 classifications per month, few-shot prompting with a cheaper model is the right choice; above 100,000, fine-tuning pays back the dataset-curation work within a quarter. The fine-tuning versus RAG versus long-context piece covers the break-even math in full.
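A toy version of the break-even math under assumed costs (none of these are quoted prices): with a Sonnet-tier few-shot baseline and a fine-tuned model at roughly one tenth the inference cost, the quarter-long payback at 100,000 calls per month falls out directly.

```python
# Toy payback model. Every figure is an assumption: a Sonnet-tier few-shot
# baseline at ~$0.015 per call, a fine-tuned small model at one tenth of that,
# and a one-off dataset-curation plus training cost of $4,000.
FEWSHOT_COST_PER_CALL = 0.015
FINETUNED_COST_PER_CALL = 0.0015
FINETUNE_FIXED_COST = 4_000.0

def months_to_payback(calls_per_month: int) -> float:
    saving = (FEWSHOT_COST_PER_CALL - FINETUNED_COST_PER_CALL) * calls_per_month
    return float("inf") if saving <= 0 else FINETUNE_FIXED_COST / saving

for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9,} calls/month -> payback in {months_to_payback(volume):.1f} months")
# ~30 months at 10K/month (not worth it), ~3 months at 100K/month.
```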
Task F: Tone-shift detection across earnings calls
Tone-shift detection compares consecutive earnings-call transcripts for the same issuer and flags directional changes: optimism about a product line that softened, a hedge that tightened, a category that stopped being mentioned. This is structured sentiment, not summarization.
The pattern is a two-pass structured output. Pass one extracts a feature vector per transcript: topic presence, sentiment polarity per topic, hedging density, forward-looking statement count. Pass two computes the delta between consecutive vectors. Sonnet-tier handles pass one; the delta is a deterministic computation, not an LLM task. The prompt patterns for earnings calls article covers the feature-vector schema that works across GICS sectors.
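Pass two is plain code, not a prompt. A minimal sketch with illustrative feature names and an assumed sentiment-shift threshold:

```python
# Pass two: deterministic delta between consecutive per-transcript feature
# vectors. Feature names and the 0.3 sentiment-shift threshold are illustrative.
def tone_delta(prev: dict[str, dict], curr: dict[str, dict]) -> list[dict]:
    """Flag topics that appeared, dropped, or whose sentiment moved materially."""
    flags = []
    for topic in sorted(set(prev) | set(curr)):
        before, after = prev.get(topic), curr.get(topic)
        if before is None or after is None:
            flags.append({"topic": topic, "change": "appeared" if before is None else "dropped"})
        elif abs(after["sentiment"] - before["sentiment"]) >= 0.3:
            direction = "softened" if after["sentiment"] < before["sentiment"] else "firmed"
            flags.append({"topic": topic, "change": direction,
                          "delta": round(after["sentiment"] - before["sentiment"], 2)})
    return flags

q1 = {"data-center capex": {"sentiment": 0.6}, "china demand": {"sentiment": 0.2}}
q2 = {"data-center capex": {"sentiment": 0.1}}
print(tone_delta(q1, q2))   # capex softened; china demand dropped
```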
Task G: Forecast-feature generation
Forecast-feature generation converts filings into inputs for downstream forecasting models: a numeric score for capex intensity, a binary flag for pending litigation, a categorical label for business-model stage. The feature vector feeds an econometric model, not a human.
Precision is paramount because downstream models amplify feature noise. This is the one task where Opus-tier or GPT-5 Pro earns its price per token: errors at this stage propagate multiplicatively. The pattern is extraction plus transformation, with schema validation at both the LLM boundary and the numeric-unit boundary. Span anchoring is mandatory; features without provenance cannot be debugged when a downstream model behaves unexpectedly.
Task H: Disclosure-change detection
Disclosure-change detection compares a new filing to the issuer's prior filing of the same type and flags material additions, removals, or rewrites. Compliance and legal teams use this to surface disclosures that merit review.
The first pass is not an LLM task: it is a diff over normalized text. The LLM's job is to classify each non-trivial diff as material or boilerplate, and to summarize material changes. This keeps token cost linear in the diff size, not the filing size. Sonnet-tier suffices; the downstream consumer is a human, so a false positive is cheap and a false negative is expensive. Threshold the model toward recall.
Normalization before the diff is the underappreciated step. Boilerplate variations (auditor signature blocks, date stamps, page numbers, whitespace) generate noise diffs that swamp signal. A pre-diff pipeline that strips these patterns, normalizes number formats, and sentence-tokenizes the output reduces LLM classification calls by 70 to 90 percent. The remaining diffs are almost all substantive, which also improves classification accuracy because the model is no longer being asked to distinguish formatting from substance.
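A minimal sketch of that pre-diff pipeline using the standard library; the boilerplate regexes are illustrative starting points, not a complete catalog.

```python
import difflib
import re

# Pre-diff normalization plus a sentence-level diff using the standard library.
# The boilerplate regexes are illustrative starting points, not a full catalog.
def normalize(filing_text: str) -> list[str]:
    text = re.sub(r"Page \d+ of \d+", " ", filing_text)             # page footers
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "<DATE>", text)   # date stamps
    text = re.sub(r"(?<=\d),(?=\d{3})", "", text)                   # 8,234 -> 8234
    text = re.sub(r"\s+", " ", text)                                # collapse whitespace
    return [s.strip() for s in re.split(r"(?<=[.;])\s", text) if s.strip()]

def substantive_diffs(prior: str, current: str) -> list[str]:
    """Sentences added or rewritten in the new filing; only these reach the LLM."""
    a, b = normalize(prior), normalize(current)
    changed = []
    for op, _, _, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op in ("replace", "insert"):
            changed.extend(b[j1:j2])
    return changed
```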
Master selection table
The eight tasks summarized against five axes:
| Task | Model band | Latency class | Cost band per doc | Accuracy sensitivity | Pattern |
|---|---|---|---|---|---|
| A. Extraction | Haiku / Sonnet | Batch (seconds) | $0.01 - $0.08 | High for numeric | Direct, chunk-routed, structured output |
| B. Summarization | Haiku / Sonnet | Batch (seconds) | $0.02 - $0.10 | Medium | Direct with prompt caching |
| C. Peer comparison | Sonnet / Opus | Interactive (10-60s) | $0.05 - $1.50 | High | Pre-summary, RAG, or long-context |
| D. Document Q and A | Sonnet | Interactive (<5s) | $0.01 - $0.10 | Medium-high | Direct for single doc, RAG for corpus |
| E. Classification | Fine-tuned Haiku / Sonnet few-shot | Batch | $0.001 - $0.02 | Medium | Fine-tuning above 100K/month, few-shot below |
| F. Tone-shift | Sonnet | Batch | $0.05 - $0.20 | Medium-high | Two-pass: extract features, diff numerically |
| G. Forecast features | Opus / GPT-5 Pro | Batch | $0.10 - $0.50 | Very high | Span-anchored extraction with validator |
| H. Disclosure change | Sonnet | Batch | $0.02 - $0.15 | Medium (recall-biased) | Text diff then LLM classify and summarize |
Dollar bands assume published April-2026 rates and do not include prompt-cache savings, which can cut input cost by 60 to 90 percent for any task where the same system prompt is reused across many documents.
Cross-cutting concern 1: precision and unit handling
LLMs hallucinate numbers in predictable ways. A 10-K that reports revenue of $8,234 million can be surfaced by a careless extraction as $8.234 million or, worse, as a bare $8,234 with the scale dropped entirely. Numeric precision failures are the single highest-severity bug class in finance AI. The fix is not prompting harder; it is span-anchored extraction plus deterministic unit validation at the boundary. A well-engineered pipeline rejects any numeric field whose rendered span cannot be re-parsed to match the model's output. The numeric precision in LLM filings piece covers the validator shapes that catch over 95 percent of these failures before they reach downstream tables.
Unit handling is a separate failure surface. SEC filings report numbers "in thousands" and "in millions" in header and footnote conventions that the model must carry through accurately. A typical validator keeps an explicit unit field on every numeric extraction and cross-checks it against the scale conventions stated in the filing's Item 8 statement headers and footnotes. The Structured Schema Validator for Finance codifies this pattern for the most common filing types.
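A minimal sketch of the boundary check both paragraphs describe: re-parse the quoted span, apply the stated scale, and reject the record if it does not reproduce the model's value. The scale table and tolerance are assumptions.

```python
import re

# Boundary validator: re-parse the quoted span, apply the stated scale, and
# reject the record if it does not reproduce the model's value. Scale words
# and the 0.5% tolerance are assumptions.
SCALE = {"units": 1, "thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}

def validate_numeric(value: float, scale: str, source_span: str) -> bool:
    match = re.search(r"\$?\s*([\d,]+(?:\.\d+)?)", source_span)
    if match is None or scale not in SCALE:
        return False                                  # span not re-parseable: reject
    span_value = float(match.group(1).replace(",", "")) * SCALE[scale]
    return abs(span_value - value) <= 0.005 * abs(value)

print(validate_numeric(8_234_000_000.0, "millions", "revenue of $8,234 million"))  # True
print(validate_numeric(8_234_000.0, "millions", "revenue of $8,234 million"))      # False
```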
Cross-cutting concern 2: cost management and caching
Filings workloads are input-heavy: thousands of tokens in, tens to hundreds out. This shape is exactly what prompt caching was designed for. Anthropic's five-minute cache stores a block at 10 percent of the uncached price; writes cost 25 percent more than uncached input.[3] The break-even is under three reads. For any of tasks B through H above, batch windows of 20 to 200 filings against a stable system prompt should always be cached. The prompt caching economics for finance article works through the math; the Financial Document Token Estimator gives per-filing token counts so budgets are not guessed. Batch API submissions are a second lever: Anthropic and OpenAI both offer 50 percent discounts on batch jobs completing within 24 hours.
The interaction between caching and batching matters. Batch jobs do not share a cache across submissions, so the cache discount and the batch discount are not additive for the same tokens. A common configuration sends real-time and near-real-time workloads through the cached synchronous path and sends nightly reconciliation workloads through the batch path. The Token Cost Optimizer models both paths side by side so the right lane is selected per workload rather than by habit.
Cross-cutting concern 3: evaluating reliability
Without an evaluation harness, model-upgrade decisions are guesses. The minimum viable harness for a filings workload is 50 to 200 human-labeled records per task, scored on exact match for numeric extractions, span overlap for text extractions, and rubric scoring for summarization and Q and A. Version every prompt; regress every harness run against the prior prompt version before shipping. The evaluation harness methodology for finance LLMs piece covers the harness shape, the scoring rubrics, and the continuous-integration pattern that keeps prompt changes from silently degrading production.
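A minimal sketch of two of those scoring functions, with an assumed relative tolerance for numerics and token-overlap F1 as the span-overlap measure; rubric scoring for summaries remains a human or judge-model step.

```python
# Two of the harness scoring functions: numeric exact match with a small
# relative tolerance, and token-overlap F1 as the span-overlap measure.
# The tolerance is an assumption; rubric scoring for summaries is not shown.
def numeric_exact_match(predicted: float, gold: float, rel_tol: float = 1e-4) -> bool:
    return abs(predicted - gold) <= rel_tol * max(abs(gold), 1e-9)

def span_overlap_f1(predicted: str, gold: str) -> float:
    p, g = predicted.lower().split(), gold.lower().split()
    common = sum(min(p.count(tok), g.count(tok)) for tok in set(p) & set(g))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

assert numeric_exact_match(8_234_000_000.0, 8_234_000_000.0)
print(round(span_overlap_f1("supply chain disruption in Asia",
                            "disruption to the Asia supply chain"), 2))   # 0.73
```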
Architecture selection: fine-tuning, RAG, or long context
The three patterns are not substitutes; they solve different problems.
Fine-tuning pays back when the task has stable inputs, a small output space, and high volume. Classification is the clearest case. It does not help when the knowledge base is dynamic: a model fine-tuned on 2024 filings does not know what a 2026 filing says.
RAG pays back when the corpus is large, queries are diverse, and the knowledge base updates continuously. Document Q and A against a multi-year corpus is the canonical case. RAG quality is dominated by chunking and retrieval, not by the generator model; cheap retrieval with a mid-tier generator usually beats expensive retrieval with a top-tier generator.
Long context pays back when cross-document reasoning matters and the document count per query is small. Peer comparison against ten filings is the canonical case. Long context is expensive per query and rewards aggressive caching of anything reused across queries.
Most production filings stacks combine all three: fine-tuned classifiers for high-volume tagging, RAG for analyst-facing Q and A, long context for peer-comparison deliverables, with extraction and summarization layered across them. The fine-tuning versus RAG versus long-context piece is the full decision framework.
The decision reduces to three axes per task:
| Pattern | Best when | Weakness | Typical filings task |
|---|---|---|---|
| Fine-tuning | Stable input shape, small output space, volume above 100K/month | Knowledge goes stale; costly to re-train | Risk-factor taxonomy, sentiment bands |
| RAG | Large dynamic corpus, diverse queries, interactive latency | Chunking quality dominates; hard on tables | Multi-year Q and A, corpus-wide search |
| Long context | Small document count per query, cross-document reasoning | Expensive per query; position bias | 5-15 filing peer comparison |
What this pillar connects to
This piece is the map. Each of the cross-cutting concerns and each of the eight tasks has a dedicated companion that goes deeper than this guide can. The 2026 engineer's guide to AI in markets is the sister pillar covering the full trading stack; this pillar drills into the filings layer specifically.
Connects to
- Fine-Tuning vs RAG vs Long Context for Filings - the architecture decision framework referenced throughout this guide.
- Prompt Caching Economics for Finance - cost math for input-heavy filings workloads.
- Numeric Precision in LLM Filing Extraction - the validator patterns that catch unit and precision errors.
- Prompt Patterns for Earnings Calls - the two-pass feature-extraction shape for tone-shift detection.
- LLM Prompt Patterns for 10-K Extraction - the three extraction prompt shapes that survive production.
- The 2026 Engineer's Guide to AI in Markets - sister pillar on the broader trading stack.
- Evaluation Harness Methodology for Finance LLMs - the harness shape that prevents silent regression.
- Token-Cost Reality for LLM Trading Research - the wider cost model for LLM finance loops.
- Financial Document Token Estimator - per-filing token counts for Anthropic, OpenAI, and Google tokenizers.
- SEC Filing Chunk Optimizer - chunking strategies tuned to 10-K, 10-Q, and 8-K structures.
- Structured Schema Validator for Finance - JSON-schema plus unit validation for extraction outputs.
- Hallucination Detector - span-anchored fact-checking for extracted numerics.
- Token-Cost Optimizer - cost calibration across model tiers and caching strategies.
References
- Google DeepMind. (2025). "Gemini 2.5 Technical Report." https://deepmind.google/technologies/gemini/. Covers the 1M-token context window and long-document benchmarks relevant to whole-filing prompting.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems 33. The foundational RAG paper; still the clearest statement of when retrieval beats larger context.
- Anthropic. "Message Batches API." https://docs.anthropic.com/en/docs/build-with-claude/batch-processing. OpenAI. "Batch API." https://platform.openai.com/docs/guides/batch. Both vendors document the 50 percent batch discount referenced in the cost section.
Footnotes
1. U.S. Securities and Exchange Commission. "EDGAR Full-Text Search and Filer Manual." https://www.sec.gov/edgar. Statistics on filing volume are published in the annual EDGAR Operations report and the SEC FY2025 Agency Financial Report.
2. Anthropic. "Models and Context Windows." https://docs.anthropic.com/en/docs/about-claude/models. OpenAI. "Models Reference." https://platform.openai.com/docs/models. Tokenizer specifications for Claude and the o200k_base encoding used by GPT-5.
3. Anthropic. "Prompt Caching." https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching. Pricing rates accessed April 2026; all dollar figures in this article reflect published rates on that date and are subject to change.
4. Izacard, G., and Grave, E. (2021). "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." EACL 2021. Hybrid retrieval over structured tables plus unstructured text consistently outperforms pure text RAG on numeric questions; the filings-specific adaptation is covered in the chunking-optimizer companion tool.