How the SEC Filing Chunk Optimizer works
How the SEC Filing Chunk Optimizer estimates chunk count, token distribution, and embedding cost for SEC filing archetypes — and why the three chunking strategies have different failure modes on numeric tables.
What the tool computes
Given an archetype (10-K body, MD&A, Notes to Financials, earnings call transcript), a target chunk size in tokens, an overlap percentage, a chunking strategy (structural / recursive / semantic), and an embedding model, the tool returns an estimated chunk count, average and min/max tokens per chunk, the cost to embed the filing once, the cost for a batch of query re-embeds, and a list of warnings if the configuration is likely to damage retrieval quality.
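In code terms, one configuration maps to one estimate. A hypothetical sketch of those two shapes, with field names that are illustrative rather than the tool's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkConfig:
    archetype: str        # "10-K body" | "MD&A" | "Notes" | "transcript"
    chunk_size: int       # target tokens per chunk
    overlap_pct: float    # e.g. 0.10 for 10%
    strategy: str         # "structural" | "recursive" | "semantic"
    model: str            # e.g. "text-embedding-3-small"

@dataclass
class ChunkEstimate:
    chunk_count: int
    avg_tokens: float
    min_tokens: int
    max_tokens: int
    ingest_cost_usd: float        # embed the filing once
    query_batch_cost_usd: float   # re-embed a batch of queries
    warnings: list[str] = field(default_factory=list)
```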
The three chunking strategies
Structural chunkers walk the document's own skeleton. On a Form 10-K that means Item 1, Item 1A, Item 7, Item 7A, Item 8, and so on. On a Notes section it means walking the numbered footnotes. On a transcript it means speaker turns. Structural chunkers keep a heading and its immediate table together, which preserves the semantic context of the numbers. The cost is chunk-size variance: a two-sentence Item 9B becomes a tiny chunk, and a sprawling risk-factor section is split further by the target size. Retrieval is high-fidelity but index statistics look uneven.
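A minimal sketch of the structural approach, assuming a plain-text 10-K where Item headings start a line; real filings are HTML and need a real parser, so the regex is illustrative:

```python
import re

# Matches "Item 1.", "Item 1A.", "Item 7A.", etc. at the start of a line.
ITEM_HEADING = re.compile(r"(?mi)^(Item\s+\d+[A-B]?\.)")

def structural_chunks(filing_text: str) -> list[str]:
    parts = ITEM_HEADING.split(filing_text)
    # re.split keeps the captured headings: [preamble, heading, body, ...].
    # Re-attach each heading to its body so the numbers keep their context;
    # oversized sections would then be split further toward the target size.
    return [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
```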
Recursive chunkers (the LangChain RecursiveCharacterTextSplitter family) take a target size and a hierarchy of separators — paragraph, sentence, word — and walk them until the chunk fits. They are fast, deterministic, and structure-blind. On narrative-heavy documents that is fine. On a 10-K Item 8 financial statement, recursive splitting will cheerfully bisect a tabular row between its label and its numbers, which is the single most expensive retrieval failure in finance RAG: the table fragment retrieves without its header, and the model hallucinates what the number refers to.
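A sketch of this strategy using the splitter family named above, assuming the langchain-text-splitters package. The token-aware constructor measures size in cl100k_base tokens instead of characters:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Target ~1,200 tokens with ~12.5% overlap. The default separator
# hierarchy (paragraph, line, word) knows nothing about table rows,
# which is exactly the Item 8 failure mode described above.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1200,
    chunk_overlap=150,
)
item8_text = open("item8.txt", encoding="utf-8").read()  # illustrative path
chunks = splitter.split_text(item8_text)
```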
Semantic chunkers embed every sentence, then merge adjacent sentences until the cosine distance crosses a threshold. They excel on MD&A prose and earnings-call Q&A because they group by topic drift, not by arbitrary token count. They struggle on dense numeric tables (every cell looks semantically similar to its neighbors) and on small target sizes below roughly 1,000 tokens, where the similarity signal becomes noise. Cost is higher: every ingestion run needs an extra pass of sentence-level embeddings.
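A minimal sketch of that merge loop, assuming `embed(sentences)` is any sentence-level embedding call returning one vector per sentence (that extra pass is the added ingestion cost noted above); the threshold value is illustrative:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.25) -> list[str]:
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between this sentence and the previous one.
        cos_dist = 1.0 - float(vecs[i - 1] @ vecs[i])
        if cos_dist > threshold:  # topic drifted: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

On a dense numeric table, every row embeds close to its neighbors, so the distance never crosses the threshold and the whole table tends to collapse into one oversized chunk.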
Table splitting is the real pitfall
The single largest source of incorrect answers in finance RAG is a chunk that contains a numeric value whose row header landed in a different chunk. The model retrieves "$2,341" with no anchor for what it means, then confidently asserts it is revenue when it is in fact an accrued-liability balance. Structural and recursive chunkers diverge here: a structural chunker detects a table block and keeps it whole; a recursive chunker splits straight through it. The tool's warnings block flags the table-heavy archetypes (10-K body, MD&A, Notes) when a table-blind strategy is paired with them.
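A sketch of that warnings check; the archetype strings and the sets below are illustrative, not the tool's internals:

```python
TABLE_HEAVY = {"10-K body", "MD&A", "Notes to Financials"}
# Semantic is included because it merges by similarity, not table structure.
TABLE_BLIND = {"recursive", "semantic"}

def table_warnings(archetype: str, strategy: str) -> list[str]:
    if archetype in TABLE_HEAVY and strategy in TABLE_BLIND:
        return [f"'{strategy}' chunking can split table rows from their "
                f"headers in '{archetype}'; prefer structural chunking "
                f"or pre-extract the tables."]
    return []
```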
Chunk size versus retrieval quality
There is no universal optimum. Small chunks (256–512 tokens) give sharp retrievals but lose the surrounding context an analyst actually reads: the adjective "net of the $400 million goodwill impairment" needs the sentence it modifies. Large chunks (4K+ tokens) give broad context but dilute the embedding — a 4K-token chunk that talks about five different topics embeds as an average of five topics and matches none of them well. For 10-K body and Notes, 800–1,500 tokens with 10–15% overlap tends to win retrieval benchmarks; MD&A and transcripts can go larger because the content is more narrative.
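To make the tradeoff concrete, a back-of-envelope count for a hypothetical 120,000-token 10-K body (the assumptions below note that real filings vary by about 3×):

```python
import math

total_tokens = 120_000  # hypothetical 10-K body
for size, overlap in [(256, 0.10), (1_200, 0.125), (4_096, 0.15)]:
    stride = size * (1 - overlap)
    count = math.ceil(total_tokens / stride)
    print(f"{size:>5}-token chunks: ~{count:>3} chunks, "
          f"~{count * size:,} tokens embedded")
```

Smaller chunks multiply the index size several times over while the embedded-token bill barely moves, so the real cost of small chunks is lost context, not dollars.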
Inputs and assumptions
- Archetype token counts are representative medians from public EDGAR filings, tokenized with OpenAI's cl100k_base encoding (token counting is sketched after this list). Real filings vary in length by a factor of 3×.
- Strategy chunk-count formulas are heuristics derived from LangChain / LlamaIndex defaults; they are not calls to a live chunker.
- Per-query cost assumes a 40-token average embedding query. Long hypothetical-document queries inflate this.
- Cache / batch discounts on embedding endpoints are not modeled.
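For reference, counting tokens the same way the medians were counted, assuming the tiktoken package (the file path is illustrative):

```python
import tiktoken

# cl100k_base is the encoding named in the assumptions above.
enc = tiktoken.get_encoding("cl100k_base")
text = open("filing.txt", encoding="utf-8").read()  # illustrative path
print(f"{len(enc.encode(text)):,} tokens")
```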
Formulas and sources
Pricing verified 2026-04-23. Sources:
- OpenAI API pricing — text-embedding-3-small at $0.02 per 1M input tokens.
- Voyage AI pricing — voyage-finance-2 at $0.12 per 1M input tokens.
- Cohere pricing — embed-english-v3.0 at $0.10 per 1M input tokens.
stride = chunk_size × (1 − overlap_pct)
base_count = ceil(total_tokens / stride)
structural → max(boundary_count, base_count)
recursive → base_count
semantic → ceil(base_count × 0.95), wider variance
ingest_cost = tokens_embedded × $/M_tokens, where tokens_embedded ≈ chunk_count × chunk_size (overlapped tokens are embedded more than once)
query_cost = (40 × n_queries) × $/M_tokens
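The formulas above as a runnable sketch. Prices are the per-1M-token rates listed; `boundary_count` (the number of structural sections) and the query batch size are illustrative defaults:

```python
import math

PRICE_PER_M = {                  # $ per 1M input tokens, from the sources above
    "text-embedding-3-small": 0.02,
    "voyage-finance-2": 0.12,
    "embed-english-v3.0": 0.10,
}

def estimate(total_tokens, chunk_size, overlap_pct, strategy, model,
             boundary_count=20, n_queries=1_000):
    stride = chunk_size * (1 - overlap_pct)
    base_count = math.ceil(total_tokens / stride)
    if strategy == "structural":
        chunk_count = max(boundary_count, base_count)
    elif strategy == "semantic":
        chunk_count = math.ceil(base_count * 0.95)  # wider variance in practice
    else:                                           # recursive
        chunk_count = base_count
    rate = PRICE_PER_M[model]
    ingest_cost = chunk_count * chunk_size / 1e6 * rate
    query_cost = 40 * n_queries / 1e6 * rate        # 40-token average query
    return chunk_count, ingest_cost, query_cost
```

For a 120,000-token body at 1,200 tokens and 12.5% overlap with text-embedding-3-small, this yields roughly 115 chunks and an ingest cost under a cent.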
Limitations
- This is a planning tool, not investment advice. It does not pick a winning strategy for your filings — it shows tradeoffs.
- Numbers reflect published rates as of 2026-04-23; vendors change pricing frequently.
- The warnings block uses heuristics. A retrieval eval on your own filings is the only reliable signal.
- Table-integrity handling (keeping a header-plus-table block intact) is not modeled chunk-for-chunk; structural chunking assumes it is available.
Related articles
- Fine-tune vs RAG vs long context on filings — when chunking stops being the bottleneck.
- Reading financial filings with LLMs (2026) — the structural skeleton of a 10-K and why chunkers should honor it.
- Numeric precision in LLM filings — why a split table row causes the biggest hallucinations.