SEC Filing Chunk Optimizer: Examples
Embedding cost is trivial; the warnings are not. These scenarios pair a chunk size, overlap percentage, and chunking strategy with a document archetype, then embed using OpenAI's text-embedding-3-small at $0.02/M tokens. The 10-K body is roughly 120,000 tokens and table-heavy. The output that matters is the chunk count and any warnings flagging configurations that split tables, fragment sections, or inflate cost without improving recall.
Worked Examples
See the inputs and outcome together
Each scenario keeps the starting point, the outcome, and the actual lesson in one place so the page reads like a decision notebook, not a data dump.
1. Structural chunking of a 10-K body
A 1,024-token chunk with 10 percent overlap using a structural splitter that respects section and table boundaries. The recommended starting config for a dense filing.
131 chunks, average 1,017 tokens, one-time embedding cost $0.00266, zero warnings.
Document: 10-K full body (~120K tokens)
Chunk size: 1,024 tokens
Overlap: 10%
Strategy: Structural
Embedding model: text-embedding-3-small
Structural chunking on a 10-K produces zero warnings because it keeps tables and section headers intact. At under a third of a cent to embed the whole filing, cost is irrelevant; the value is the clean, warning-free split that protects retrieval quality.
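The cost figure for this scenario follows from simple arithmetic on the reported chunk count and average chunk length. Here is a minimal sketch, assuming the cost is chunks times average tokens per chunk, billed at the stated $0.02 per million tokens. The function name `embedding_cost` is illustrative and not part of the tool.

```python
def embedding_cost(chunks: int, avg_tokens: int,
                   price_per_million: float = 0.02) -> float:
    """Return the one-time embedding cost in dollars.

    Assumption: cost = chunks * average tokens per chunk * price per token.
    """
    ingested_tokens = chunks * avg_tokens
    return ingested_tokens * price_per_million / 1_000_000


# Scenario 1 figures from this page: 131 chunks averaging 1,017 tokens.
cost = embedding_cost(131, 1017)
print(f"${cost:.5f}")  # $0.00266
```

At 133,227 ingested tokens the result is a fraction of a cent, which is why the comparison between scenarios turns on warnings rather than price.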
2. Recursive splitter on the same filing
Identical chunk size and overlap, but a recursive character splitter that ignores document structure. The popular default that misbehaves on tables.
131 chunks at the same $0.00266 cost, but one warning: recursive splitters are table-blind on a table-heavy filing.
Document: 10-K full body (~120K tokens)
Chunk size: 1,024 tokens
Overlap: 10%
Strategy: Recursive
Embedding model: text-embedding-3-small
The cost and chunk count are identical to the structural plan, yet the tool raises a warning. A table-blind splitter severs numeric rows from their headers, so retrieval can return half a table. Same price, worse retrieval, which is exactly the trap a cost-only comparison hides.
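The toy sketch below shows how a table-blind split severs rows from their header. It is not the tool's detector. The filing text is made up, lines stand in for tokens, and every function name is hypothetical. A fixed-size splitter cuts every three lines, while a structure-aware splitter packs whole blocks and never cuts inside one.

```python
HEADER = "| Segment | FY2023 | FY2022 |"

# Made-up toy filing: each inner list is one structural block.
BLOCKS = [
    ["Item 7. Management's Discussion", "Revenue grew year over year."],
    [HEADER, "| Cloud | 410 | 350 |", "| Devices | 220 | 240 |"],
    ["Margins were stable."],
]


def fixed_size_split(lines, max_lines):
    """Table-blind: cut every max_lines lines regardless of content."""
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


def structural_split(blocks, max_lines):
    """Pack whole blocks into chunks; never cut inside a block.

    A block longer than max_lines becomes one oversized chunk.
    """
    chunks, current = [], []
    for block in blocks:
        if current and len(current) + len(block) > max_lines:
            chunks.append(current)
            current = []
        current.extend(block)
    if current:
        chunks.append(current)
    return chunks


def orphaned_table_chunks(chunks, header):
    """Count chunks that hold table rows without the table's header."""
    count = 0
    for chunk in chunks:
        has_rows = any(l.startswith("|") and l != header for l in chunk)
        if has_rows and header not in chunk:
            count += 1
    return count


flat = [line for block in BLOCKS for line in block]
print(orphaned_table_chunks(fixed_size_split(flat, 3), HEADER))    # 1
print(orphaned_table_chunks(structural_split(BLOCKS, 3), HEADER))  # 0
```

The fixed-size split leaves the header in one chunk and both data rows in the next. That is the "half a table" retrieval failure the warning describes.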
3. Chunks too small on a dense filing
Shrinking to 512-token chunks with a recursive splitter, which roughly doubles the chunk count and risks slicing risk-factor paragraphs and tables.
261 chunks, average 511 tokens, cost $0.00267, table-blind warning still active.
Document: 10-K full body (~120K tokens)
Chunk size: 512 tokens
Overlap: 10%
Strategy: Recursive
Embedding model: text-embedding-3-small
Halving the chunk size roughly doubles the chunk count, from 131 to 261, while the embedding cost barely moves: the document length and overlap ratio are unchanged, so total ingested tokens stay near 133K. Cost is flat, but small chunks combined with a table-blind splitter on a dense filing are the worst of both worlds for relevance.
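The relationship between chunk size and chunk count can be sketched with a simple sliding-window estimate. This is an assumed model, not the tool's documented method: each chunk advances by the chunk size minus the overlapped tokens, and the document is taken to be exactly 120,000 tokens. Under those assumptions the estimate lands on the chunk counts reported here.

```python
import math


def estimate_chunks(doc_tokens: int, chunk_size: int, overlap: float) -> int:
    """Estimate chunk count for a sliding window.

    Assumption: each chunk advances by chunk_size minus the overlapped
    tokens, and the count is the document length divided by that stride,
    rounded up. Splitters that respect boundaries will differ.
    """
    overlap_tokens = int(chunk_size * overlap)
    stride = chunk_size - overlap_tokens
    return math.ceil(doc_tokens / stride)


for size in (1024, 512):
    chunks = estimate_chunks(120_000, size, 0.10)
    print(size, chunks)  # 1024 131, then 512 261
```

Because the stride shrinks in proportion to the chunk size, the count roughly doubles. The total tokens embedded depend mainly on document length and overlap ratio, which is why the cost stays flat.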
4. Overlap set too high
Back to structural chunking but with overlap raised to 30 percent, above the level where extra overlap stops helping recall.
168 chunks, ingested tokens rise to 171,192, cost $0.00342, warning: overlap above 25% inflates cost without recall gain.
Document: 10-K full body (~120K tokens)
Chunk size: 1,024 tokens
Overlap: 30%
Strategy: Structural
Embedding model: text-embedding-3-small
Thirty percent overlap pushes ingested tokens from 133K to 171K, a 28 percent jump in embedding and storage cost, and the tool flags that published RAG benchmarks show no recall benefit past about 25 percent. This is pure waste the warning catches before you pay for it at scale.
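The overlap check and the size of the jump can be reproduced from the figures on this page. This is a hedged sketch. The 25% threshold and the token counts come from the scenarios above, while the function names and the warning wording are illustrative rather than the tool's actual code.

```python
from typing import Optional


def overlap_warning(overlap: float, threshold: float = 0.25) -> Optional[str]:
    """Return a warning string when overlap exceeds the threshold."""
    if overlap > threshold:
        return f"overlap above {threshold:.0%} inflates cost without recall gain"
    return None


def percent_increase(before: float, after: float) -> float:
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100


baseline_tokens = 131 * 1017    # scenario 1: 133,227 ingested tokens
high_overlap_tokens = 171_192   # scenario 4, as reported on this page

print(overlap_warning(0.30))
print(f"{percent_increase(baseline_tokens, high_overlap_tokens):.1f}%")  # 28.5%
print(f"${high_overlap_tokens * 0.02 / 1_000_000:.5f}")                  # $0.00342
```

A 10% overlap returns `None` from `overlap_warning`, so only the last scenario trips the check.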
Sources & References
- OpenAI Embeddings Pricing — OpenAI (2026)
- Dense Passage Retrieval for Open-Domain Question Answering — Karpukhin et al., EMNLP (2020)