How to Build a RAG Pipeline Over SEC Filings
SEC filings are long, structured, and dense with numbers. That makes them a strong fit for retrieval-augmented generation (RAG), and it also makes mistakes costly. A pipeline that retrieves the right passage and grounds its answer in that passage is far more reliable than a model answering from memory. But retrieval alone does not stop fabrication, and filings carry restatement and point-in-time traps. This guide covers the pipeline end to end, from ingestion through numerical verification.
Before You Start
Set up the inputs that make the next steps easier
Guide Steps
Move through it in order
Each step focuses on one decision so you can keep momentum without losing the thread.
- 1
Ingest filings with point-in-time discipline
Pull the filings and store both the document and the date it was filed, so a query about a past period retrieves what was knowable then rather than a later restatement. SEC filings get amended, and fundamentals get restated; answering a historical question with restated figures is a subtle form of look-ahead bias. Normalize the document structure (items, sections, tables) during ingestion so later chunking can respect those boundaries.
Store the filing type and filing date as metadata on every chunk. This lets you filter retrieval to the right period and the right document, which matters when a company has dozens of filings.
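As a rough illustration of the point-in-time rule, the sketch below picks the version of a filing that was public on a given date. The `Filing` record, its field names, the `latest_as_of` helper, and the ACME sample data are all hypothetical, and matching on `period_end` is one assumed way to link an amendment to its original.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Filing:
    company: str       # hypothetical identifier (ticker or CIK)
    form_type: str     # e.g. "10-K", "10-K/A", "10-Q"
    period_end: date   # fiscal period the filing reports on
    filed_on: date     # date the filing became public
    text: str


def latest_as_of(filings, company, period_end, as_of) -> Optional[Filing]:
    """Return the newest version of a period's filing that was public on `as_of`."""
    known = [
        f for f in filings
        if f.company == company
        and f.period_end == period_end
        and f.filed_on <= as_of  # drop anything filed after the as-of date
    ]
    return max(known, key=lambda f: f.filed_on, default=None)


# Made-up sample data: an original 10-K and a later amendment for the same period.
filings = [
    Filing("ACME", "10-K", date(2022, 12, 31), date(2023, 2, 20), "original figures"),
    Filing("ACME", "10-K/A", date(2022, 12, 31), date(2023, 8, 15), "amended figures"),
]

then = latest_as_of(filings, "ACME", date(2022, 12, 31), as_of=date(2023, 3, 1))
now = latest_as_of(filings, "ACME", date(2022, 12, 31), as_of=date(2024, 1, 1))
print(then.form_type, now.form_type)
```

A query dated March 2023 sees only the original filing, while a query dated 2024 sees the amendment. Copying the same fields onto every chunk keeps this filter available at retrieval time.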
- 2
Chunk on structural boundaries
Split filings into chunks that respect the document's structure rather than cutting at arbitrary character counts. A chunk that spans the boundary between two unrelated items dilutes retrieval relevance, and one that splits a table from its header loses the meaning of the numbers. Tune chunk size and overlap so each chunk is self-contained: large enough to carry context, small enough to stay on one topic. Compare fixed-size, recursive, and structure-aware strategies on your filings.
Watch for chunks that split financial tables. A retrieved half-table is worse than useless because it gives the model numbers without their labels.
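One way to sketch structure-aware chunking, assuming the filing has already been normalized to plain text where items start with a line like "Item 7." and blank lines separate paragraphs and tables. The regular expression, the function names, and the `max_chars` default are illustrative choices, not a standard.

```python
import re

# Assumed heading shape: "Item 1.", "Item 1A.", "Item 7." at the start of a line.
ITEM_HEADING = re.compile(r"^Item\s+\d+[A-Z]?\.", re.IGNORECASE | re.MULTILINE)


def split_items(text):
    """Split a filing into (heading, body) sections at item headings."""
    starts = [m.start() for m in ITEM_HEADING.finditer(text)]
    if not starts:
        return [("", text.strip())]
    sections = []
    preamble = text[:starts[0]].strip()
    if preamble:
        sections.append(("", preamble))
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        heading, _, body = text[begin:end].partition("\n")
        sections.append((heading.strip(), body.strip()))
    return sections


def chunk_section(heading, body, max_chars=1200):
    """Pack whole blocks into chunks; a block (paragraph or table) is never cut."""
    chunks, current = [], []
    for block in re.split(r"\n\s*\n", body):
        block = block.strip()
        if not block:
            continue
        size = sum(len(b) for b in current) + len(block)
        if current and size > max_chars:
            chunks.append(current)
            current = []
        current.append(block)  # an oversized block becomes its own chunk
    if current:
        chunks.append(current)
    # Prefix the item heading so every chunk stays self-contained.
    return [(heading + "\n" + "\n\n".join(c)).strip() for c in chunks]


sample = (
    "Item 1. Business\n"
    "We make widgets.\n\n"
    "We sell them worldwide.\n\n"
    "Item 7. Management's Discussion\n"
    "Revenue (USD millions)\n2023 | 120\n2022 | 100\n"
)
chunks = [c for h, b in split_items(sample) for c in chunk_section(h, b, max_chars=30)]
print(len(chunks))
```

With the deliberately tiny `max_chars=30`, the two business paragraphs land in separate chunks, while the revenue table stays in one chunk with its header row even though it exceeds the limit. No chunk crosses an item boundary.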
Use the tool: SEC Filing Chunk Optimizer (Generators)
Pick a filing archetype, tune chunk size and overlap, and see chunk count, embedding cost, and structural-boundary warnings across three chunking strategies.
- 3
Embed, index, and estimate the cost
Embed each chunk with your embedding model and store the vectors in an index alongside the chunk text and metadata. The embedding step has a real and predictable cost that scales with total token count, so estimate it before running the full corpus. For a large filing archive this is the dominant one-time cost, and the chunk size you chose directly drives it: smaller chunks mean more chunks and more embeddings.
Estimate the token count and embedding cost for one representative filing first, then multiply by your corpus size. This catches budget surprises before you embed thousands of documents.
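The arithmetic can be sketched as follows. Every number in the example is a placeholder: the token count of the sample filing, the corpus size, and especially the price per million tokens are assumptions to replace with your own measurements and your provider's current pricing. The function name and parameters are hypothetical.

```python
import math


def estimate_embedding_cost(sample_tokens, n_filings, chunk_tokens,
                            overlap_tokens, usd_per_million_tokens):
    """Estimate chunk count, embedded tokens, and cost from one sample filing."""
    stride = chunk_tokens - overlap_tokens
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    if sample_tokens <= chunk_tokens:
        n_chunks = 1
    else:
        n_chunks = math.ceil((sample_tokens - overlap_tokens) / stride)
    # Each overlap region is embedded twice, once in each neighboring chunk.
    tokens_per_filing = sample_tokens + (n_chunks - 1) * overlap_tokens
    total_tokens = tokens_per_filing * n_filings
    return {
        "chunks_per_filing": n_chunks,
        "tokens_per_filing": tokens_per_filing,
        "total_tokens": total_tokens,
        "total_usd": total_tokens / 1_000_000 * usd_per_million_tokens,
    }


# Placeholder inputs: a 100,000-token filing, 2,000 filings, $0.10 per million tokens.
large = estimate_embedding_cost(100_000, 2_000, chunk_tokens=500,
                                overlap_tokens=50, usd_per_million_tokens=0.10)
small = estimate_embedding_cost(100_000, 2_000, chunk_tokens=250,
                                overlap_tokens=50, usd_per_million_tokens=0.10)
print(large["chunks_per_filing"], small["chunks_per_filing"])
print(round(large["total_usd"], 2), round(small["total_usd"], 2))
```

Halving the chunk size while keeping the same overlap more than doubles the chunk count in this example, and the duplicated overlap raises the embedded token total as well.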
- 4
Retrieve the relevant passages
At query time, embed the question and retrieve the top matching chunks, filtered by the metadata that scopes the query to the right company, filing, and period. Retrieve enough chunks to cover the answer but few enough to fit the context window without burying the relevant passage in noise. Quality here is decisive: if retrieval misses the passage that contains the answer, no amount of clever prompting will recover it.
Always scope retrieval by company and period metadata before semantic ranking. Semantic similarity alone will happily return the right concept from the wrong year.
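A minimal sketch of filter-then-rank retrieval over an in-memory index. The chunk dictionaries, their `company` and `fiscal_year` keys, and the two-dimensional toy vectors are assumptions for illustration; a real index would hold your embedding model's vectors and would usually apply the filter inside the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, company, fiscal_year, top_k=3):
    """Scope by metadata first, then rank the survivors by similarity."""
    scoped = [
        c for c in index
        if c["company"] == company and c["fiscal_year"] == fiscal_year
    ]
    ranked = sorted(scoped, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]


# Toy index: the 2022 chunk is the closest semantic match to the query vector.
index = [
    {"id": "acme-2022-rev", "company": "ACME", "fiscal_year": 2022, "vector": [0.9, 0.1]},
    {"id": "acme-2023-rev", "company": "ACME", "fiscal_year": 2023, "vector": [0.8, 0.2]},
    {"id": "acme-2023-risk", "company": "ACME", "fiscal_year": 2023, "vector": [0.1, 0.9]},
]

hits = retrieve([0.9, 0.1], index, company="ACME", fiscal_year=2023, top_k=1)
print([h["id"] for h in hits])
```

Without the metadata filter, the 2022 revenue chunk would rank first for a question about 2023, because its vector is identical to the query. Filtering first removes it from consideration.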
- 5
Generate with mandatory citations
Prompt the model to answer only from the retrieved passages and to cite the specific passage backing each claim. Then check that the cited passage actually supports the statement, and reject or flag answers where it does not. Grounding lowers fabrication but does not eliminate it: models still occasionally state claims their own cited source does not contain. The citation-faithfulness check is what turns grounding into something you can rely on.
Treat an uncited claim as a failed answer, not a stylistic lapse. If the model cannot point to a passage, it is answering from memory, which is what RAG exists to prevent.
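The checks above could be wired up roughly like this, assuming the model was prompted to cite passages with bracketed numbers such as [1] placed before the sentence's final punctuation. The `word_overlap` function is a deliberately naive stand-in for a real support check (an entailment model or a second LLM pass); it, the citation format, and the sample passages are all assumptions.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def word_overlap(sentence, passage, threshold=0.5):
    """Naive support proxy: share of the sentence's longer words found in the passage."""
    words = {w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3}
    if not words:
        return True
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(words & passage_words) / len(words) >= threshold


def check_citations(answer, passages, supports=word_overlap):
    """Return (problem, sentence) pairs; an empty list means every check passed."""
    problems = []
    # Crude sentence split; abbreviations such as "Inc." will confuse it.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(n) for n in CITATION.findall(sentence)]
        if not cited:
            problems.append(("uncited", sentence))
            continue
        for n in cited:
            if n not in passages:
                problems.append(("unknown_citation", sentence))
            elif not supports(sentence, passages[n]):
                problems.append(("unsupported", sentence))
    return problems


# Made-up retrieved passages and a made-up model answer.
passages = {
    1: "Total revenue was $120 million in fiscal 2023.",
    2: "The company operates in 14 countries.",
}
answer = (
    "Revenue was $120 million in fiscal 2023 [1]. "
    "Margins improved sharply. "
    "Headcount doubled after the merger [2]. "
    "The company operates in 14 countries [3]."
)
problems = check_citations(answer, passages)
for kind, sentence in problems:
    print(kind, "->", sentence)
```

Only the first sentence passes. The second has no citation, the third cites a passage that says nothing about headcount, and the fourth cites a passage that was never retrieved.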
- 6
Verify every extracted number
Do not trust a number just because it appears in a retrieved passage and the model repeated it. Models transcribe figures incorrectly, mix up units, and pull the wrong line from a table. Run each numeric claim in the output through a check against the source text, and surface any mismatch rather than silently accepting the model's value. For derived figures like ratios, compute them deterministically and have the model present the verified result.
Numeric transcription errors are among the most common and most damaging failures in filing extraction. A per-number check catches them before they reach a decision.
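A per-number check can be sketched as below. It assumes citation markers have already been stripped from the model output, and it only sees a scale word (thousand, million, billion) or percent sign written directly after the number, so units declared in a table header are outside its reach. The regular expression, the function names, and the sample text are illustrative.

```python
import re
from decimal import Decimal

# A number with optional thousands separators and decimals, plus an adjacent unit.
NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(%|thousand|million|billion)?", re.IGNORECASE)


def numeric_claims(text):
    """Extract (value, unit) pairs, normalizing separators and unit case."""
    return {
        (Decimal(value.replace(",", "")), unit.lower())
        for value, unit in NUMBER.findall(text)
    }


def unverified_numbers(output, source):
    """Numbers in the model output that do not appear in the source text."""
    return sorted(numeric_claims(output) - numeric_claims(source))


def pct_change(new, old):
    """Compute a derived figure deterministically instead of asking the model."""
    return (new - old) / old * 100


# Made-up source passage and a model output with two transcription errors.
source = "Total revenue was $1,204.5 million in 2023, up 8% from $1,115.3 million in 2022."
output = "Revenue was $1,240.5 million in 2023, up 8% from $1,115.3 billion in 2022."

mismatches = unverified_numbers(output, source)
print(mismatches)
print(round(pct_change(Decimal("1204.5"), Decimal("1115.3")), 1))
```

The check flags both the transposed digits (1,240.5 instead of 1,204.5) and the unit mix-up (billion instead of million). The growth rate is recomputed from the source values and is not taken from the model's text.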
Use the tool: Hallucination Detector (Playgrounds)
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.
Common Mistakes
The misses that undo good inputs
Chunking by character count and ignoring structure
Arbitrary cuts split tables from headers and merge unrelated sections, which degrades retrieval relevance and feeds the model numbers stripped of their labels. Structure-aware chunking is the single highest-leverage fix.
Trusting retrieved numbers without verification
Retrieval gets the right passage in front of the model, but the model can still transcribe a figure wrong or read the wrong table row. Without a numeric check, these errors flow straight into the answer with full confidence.
Ignoring restatements and point-in-time dates
Answering a historical question with later restated figures is look-ahead bias. A backtest or analysis built on it will look better than reality, because it used numbers that were not available at the time.