SEC Filing Chunk Optimizer
Chunk-sizing and embedding-cost calculator for SEC filings such as the 10-K. Pick a filing archetype, chunk size, overlap, strategy, and embedding model. Runs entirely in the browser. Free.
- Inputs: Configuration
- Runtime: Instant
- Privacy: Client-side · no upload
- API key: Not required
- Methodology: see the methodology page
1 · Configure chunk strategy
Chunk geometry
1,024 tok
15% overlap · structural · 138 chunks of avg 1,021 tok
Ingest $0.002818 · 100 queries $0.000080 · text-embedding-3-small
Strategy note
Respects Items / section headers / speaker turns. Preserves table blocks by keeping heading+table together. Chunk sizes are uneven but semantically clean.
No structural warnings at these settings. Still run a retrieval eval before production — heuristics can't replace ground truth.
Archetype reference: Form 10-K business + risk + MD&A + financials. ~12 Items. Dense tables in Item 7 / 8.
Detail
Total chunks
138
Avg tokens/chunk
1,021
min 1,024 · max 1,024
Ingest cost (once)
$0.002818
text-embedding-3-small
Query cost (100 re-embeds)
$0.000080
Tokens embedded
140,898
3 · Compare strategies (same archetype + chunk size)
| Strategy | Chunks | Avg tok | Min / Max | Ingest cost | Tradeoff |
|---|---|---|---|---|---|
| structural (selected) | 138 | 1,021 | 1,024 / 1,024 | $0.002818 | Highest fidelity; uneven chunk sizes. |
| recursive | 138 | 1,021 | 614 / 1,024 | $0.002818 | Cheap + deterministic; blind to tables. |
| semantic | 132 | 1,061 | 409 / 1,433 | $0.002801 | Coherent prose groups; variable sizes, higher compute. |
How the estimate works
stride = chunk_size × (1 − overlap_pct)
base_count = ceil(total_tokens / stride)
structural → max(boundary_count, base_count)
recursive → base_count
semantic → ceil(base_count × 0.95), wider size variance
ingest_cost = tokens_embedded × $/M_tokens
query_cost = (40 × n_queries) × $/M_tokens
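The formulas above can be sketched as a few small JavaScript functions. This is an illustrative reading of the published formulas, not the tool's source code. The function names, the 120,000-token filing size, and the boundary count of 12 are assumptions chosen for the example. The rate of $0.02 per million tokens is the one implied by the ingest figure shown for text-embedding-3-small (140,898 tokens for $0.002818).

```javascript
// Sketch of the estimate formulas. All names and sample inputs are assumptions.
function estimateChunks({ totalTokens, chunkSize, overlapPct, boundaryCount, strategy }) {
  const stride = chunkSize * (1 - overlapPct);       // tokens advanced per chunk
  const baseCount = Math.ceil(totalTokens / stride); // chunks needed to cover the filing
  switch (strategy) {
    case "structural":
      return Math.max(boundaryCount, baseCount);     // never fewer chunks than boundaries
    case "recursive":
      return baseCount;
    case "semantic":
      return Math.ceil(baseCount * 0.95);
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

// Cost of embedding a number of tokens at a per-million-token rate.
function embeddingCost(tokens, usdPerMillionTokens) {
  return (tokens * usdPerMillionTokens) / 1e6;
}

// The page's formula assumes 40 tokens per query.
function queryCost(nQueries, usdPerMillionTokens, tokensPerQuery = 40) {
  return embeddingCost(tokensPerQuery * nQueries, usdPerMillionTokens);
}

// Hypothetical filing: 120,000 tokens with 12 Item boundaries.
const sample = { totalTokens: 120000, chunkSize: 1024, overlapPct: 0.15, boundaryCount: 12 };
for (const strategy of ["structural", "recursive", "semantic"]) {
  console.log(strategy, estimateChunks({ ...sample, strategy }));
}
console.log(embeddingCost(140898, 0.02).toFixed(6)); // 0.002818
console.log(queryCost(100, 0.02).toFixed(6));        // 0.000080
```

With these assumed inputs the sketch gives 138 chunks for structural and recursive and 132 for semantic, matching the chunk counts in the comparison table.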
Pricing verified 2026-04-23. See methodology for archetype sources and the table-splitting pitfall.
How to use
Step-by-step
1. Upload the filing (10-K, 10-Q, 8-K) or paste the SEC EDGAR URL.
2. Pick a chunk strategy: retrieval (4K tokens, for embedding search), summarization (16K tokens, for long-context), or structured (XBRL fields extracted separately).
3. Run the optimizer. It splits on Item boundaries and MD&A sub-sections, producing coherent chunks rather than arbitrary character cuts.
4. Download the chunks as JSON. Each chunk includes section path, character offsets, and token count.
5. For Q&A use cases, pair with hierarchical retrieval (filing → item → paragraph), which outperforms flat retrieval on filing-Q&A benchmarks.
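As a rough illustration of the last two steps, downloaded chunks could be grouped into a filing → item tree before retrieval. The field names (`sectionPath`, `charStart`, `charEnd`, `tokenCount`) and the sample values are hypothetical. The tool's actual JSON keys are defined by its contract and may differ.

```javascript
// Hypothetical chunk records; the real JSON keys may differ.
const chunks = [
  { sectionPath: ["10-K", "Item 1A", "para 1"], charStart: 0, charEnd: 4100, tokenCount: 1010 },
  { sectionPath: ["10-K", "Item 1A", "para 2"], charStart: 4100, charEnd: 8350, tokenCount: 1024 },
  { sectionPath: ["10-K", "Item 7", "para 1"], charStart: 8350, charEnd: 12400, tokenCount: 998 },
];

// Build a filing -> item -> chunks tree so a retriever can narrow to an Item
// first and only then search paragraphs inside it.
function groupByItem(chunkList) {
  const tree = new Map();
  for (const chunk of chunkList) {
    const [filing, item] = chunk.sectionPath;
    if (!tree.has(filing)) tree.set(filing, new Map());
    const items = tree.get(filing);
    if (!items.has(item)) items.set(item, []);
    items.get(item).push(chunk);
  }
  return tree;
}

const tree = groupByItem(chunks);
console.log([...tree.get("10-K").keys()]);         // [ 'Item 1A', 'Item 7' ]
console.log(tree.get("10-K").get("Item 1A").length); // 2
```

Grouping by the first two path segments keeps the paragraph-level chunks intact while giving the retriever a coarse level to filter on.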
For agents
Use in an agent
Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.
import { compute } from "https://aifinhub.io/engines/sec-filing-chunk-optimizer.js";
Contract: /contracts/sec-filing-chunk-optimizer.json
Questions people ask next
FAQ
What chunks does the tool produce?
Sections of an SEC filing (10-K, 10-Q, 8-K) split for downstream LLM ingestion. Splits respect document structure — Item boundaries (Item 1, Item 1A, etc.), MD&A sub-sections — rather than naive character-count splits. The methodology page documents the parsing rules.
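A simplified sketch of Item-boundary splitting on plain text is shown below. It is not the tool's parser: the methodology page documents the real rules, which work on the filing's HTML/XBRL structure. The function name `splitByItem` and the sample text are assumptions, and the regex only recognizes headers of the form "Item 1." or "Item 1A." at the start of a line.

```javascript
// Split plain filing text at lines that start with "Item <number>[letter]."
// Any text before the first Item header is dropped in this sketch.
function splitByItem(text) {
  const headerRe = /^Item\s+(\d+[A-Z]?)\./gim;
  const matches = [...text.matchAll(headerRe)];
  return matches.map((m, i) => {
    const start = m.index;
    const end = i + 1 < matches.length ? matches[i + 1].index : text.length;
    return { item: `Item ${m[1].toUpperCase()}`, text: text.slice(start, end).trim() };
  });
}

// Hypothetical, heavily shortened filing text.
const sampleFiling = [
  "Item 1. Business",
  "We make widgets.",
  "Item 1A. Risk Factors",
  "Demand may fall.",
  "Item 7. Management's Discussion and Analysis",
  "Revenue grew.",
].join("\n");

const sections = splitByItem(sampleFiling);
console.log(sections.map((s) => s.item)); // [ 'Item 1', 'Item 1A', 'Item 7' ]
```

Each section can then be chunked on its own, so no chunk straddles two Items.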
Why not use a generic text splitter?
Generic splitters break in the middle of tables, footnotes, or risk-factor sentences, producing chunks that lose meaning. SEC filings have machine-readable structure (HTML/XBRL) that the tool exploits. The result is fewer but more coherent chunks.
How big are the chunks?
Configurable, with sensible defaults: 4,000 tokens for retrieval (comfortably inside the roughly 8K-token input limit of text-embedding-3-small; check the limit of whichever embedding model you pick) and 16,000 tokens for long-context summarization. The tool overlaps chunks by 200 tokens to preserve context across boundaries. Larger chunks mean fewer retrieval hops, but each hop is more expensive.
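To make the fixed 200-token overlap concrete, here is a minimal sketch that computes token-offset spans for a sliding window. It assumes the text is already tokenized and only the token count matters. The function name `chunkSpans` is hypothetical, and the sketch ignores the structural boundaries the tool respects.

```javascript
// Return [start, end) token offsets for fixed-size chunks with a fixed overlap.
function chunkSpans(totalTokens, chunkSize = 4000, overlap = 200) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const stride = chunkSize - overlap; // each chunk starts this many tokens after the last
  const spans = [];
  for (let start = 0; start < totalTokens; start += stride) {
    const end = Math.min(start + chunkSize, totalTokens);
    spans.push([start, end]);
    if (end === totalTokens) break;   // last chunk reached the end of the text
  }
  return spans;
}

console.log(chunkSpans(10000));
// [ [ 0, 4000 ], [ 3800, 7800 ], [ 7600, 10000 ] ]
```

Each span begins 200 tokens before the previous one ends, so a sentence cut at a boundary appears whole in at least one chunk.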
Does it handle XBRL?
Yes — XBRL-tagged numerical data is extracted as structured fields (revenue, net income, R&D spend by year), separate from the narrative text chunks. This is critical for finance use cases where you want exact numbers, not LLM-paraphrased ones. The methodology page lists supported XBRL taxonomies.
What's the optimal chunk strategy for filing Q&A?
Documented on the methodology page: large embedding-search chunks (4K) for retrieval, then re-rank with a smaller model, then send only the top-3 chunks to the answering model. This keeps cost under control while preserving recall. Hierarchical (filing → item → paragraph) retrieval beats flat retrieval on most filing-Q&A benchmarks.
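The retrieve, re-rank, then top-3 flow can be sketched as follows. Everything here is a stand-in: the two-dimensional embeddings are toy values, and `keywordScore` only counts a keyword where a real pipeline would call a smaller re-ranking model. The function and field names are assumptions, not part of the tool.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stage 1: shortlist by embedding similarity. Stage 2: re-rank the shortlist
// with a second scorer and keep only the top K chunks for the answering model.
function retrieveThenRerank(queryVec, chunkList, rerankScore, { candidates = 10, topK = 3 } = {}) {
  const shortlisted = chunkList
    .map((chunk) => ({ chunk, sim: cosine(queryVec, chunk.embedding) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, candidates);
  return shortlisted
    .map(({ chunk }) => ({ chunk, score: rerankScore(chunk) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ chunk }) => chunk);
}

// Toy data: hypothetical chunk ids, embeddings, and text.
const toyChunks = [
  { id: "item7-p1", embedding: [0.9, 0.1], text: "Revenue grew 12% on higher volume." },
  { id: "item7-p2", embedding: [0.8, 0.2], text: "Revenue mix shifted toward services revenue." },
  { id: "item1a-p1", embedding: [0.1, 0.9], text: "Demand may fall." },
  { id: "item8-p1", embedding: [0.7, 0.3], text: "Consolidated statements of operations." },
  { id: "item1-p1", embedding: [0.6, 0.4], text: "We sell widgets." },
];

// Stand-in for a re-ranking model: counts occurrences of "revenue".
const keywordScore = (chunk) => (chunk.text.toLowerCase().match(/revenue/g) || []).length;

const top = retrieveThenRerank([1, 0], toyChunks, keywordScore, { candidates: 4, topK: 3 });
console.log(top.map((c) => c.id));
```

Only the shortlist is re-ranked, so the more expensive second stage runs on a handful of chunks rather than the whole filing.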
Related deep dive
Read further
Long-form context behind the tool output.
- Comparison · Benchmark · 10 min
Financial QA LLM Benchmarks 2026: FinanceBench & Fin-RATE
Financial QA LLM benchmarks 2026: FinQA, FinanceBench, DocFinQA, and Fin-RATE leaderboard scores, plus whole-filing read costs verified 2026-06-17.
- Pillar · Guide · 14 min
Reading Financial Filings With LLMs: 2026 Playbook
A map of eight filing tasks — extraction, summarization, peer comparison, Q&A, classification, sentiment, forecasting input, compliance — with model.
- Methodology · Opinion · 12 min
Fine-Tuning vs RAG vs Long-Context for Filings
Decision matrix for finance LLMs: when RAG wins, when long-context wins, and when fine-tuning makes sense. Cost math from published 2026-04 vendor rates.
Complementary tools
Users of this tool often explore
Financial Document Token Estimator
Paste a 10-K, 10-Q, 8-K or earnings transcript and see token count + one-pass extraction cost across eight frontier LLMs, with cache-hit toggle.
Structured Schema Validator for Finance
Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.
Hallucination Detector
Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.