
SEC Filing Chunk Optimizer

Author: AI Fin Hub Research

Chunk-sizing and embedding-cost calculator for SEC filings such as the 10-K. Pick a filing archetype, chunk size, overlap, chunking strategy, and embedding model. Runs entirely in the browser. Free.

AI Fin Hub Research · Published Apr 23, 2026

Inputs: Configuration
Runtime: Instant
Privacy: Client-side · no upload
API key: Not required

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

1 · Configure chunk strategy

Filing archetype: Form 10-K
Embedding model: text-embedding-3-small
Chunking strategy: structural
Chunk size (tokens): 1,024
Overlap: 15%
Query re-embed count: 100

Chunk geometry

1,024 tok

15% overlap · structural · 138 chunks of avg 1,021 tok

Ingest $0.002818 · 100 queries $0.000080 · text-embedding-3-small

Strategy note

Respects Items / section headers / speaker turns. Preserves table blocks by keeping heading+table together. Chunk sizes are uneven but semantically clean.

No structural warnings at these settings. Still run a retrieval eval before production — heuristics can't replace ground truth.

Archetype reference: Form 10-K business + risk + MD&A + financials. ~12 Items. Dense tables in Item 7 / 8.

Detail

Total chunks

138

Avg tokens/chunk

1,021

min 1,024 · max 1,024

Ingest cost (once)

$0.002818

text-embedding-3-small

Query cost (100 re-embeds)

$0.000080

Tokens embedded

140,898

3 · Compare strategies (same archetype + chunk size)

Strategy	Chunks	Avg tok	Min / Max tok	Ingest cost	Tradeoff
structural (selected)	138	1,021	1,024 / 1,024	$0.002818	Highest fidelity; uneven chunk sizes.
recursive	138	1,021	614 / 1,024	$0.002818	Cheap + deterministic; blind to tables.
semantic	132	1,061	409 / 1,433	$0.002801	Coherent prose groups; variable sizes, higher compute.

How the estimate works

stride        = chunk_size × (1 − overlap_pct)
base_count    = ceil(total_tokens / stride)
structural    → max(boundary_count, base_count)
recursive     → base_count
semantic      → ceil(base_count × 0.95), wider size variance
ingest_cost   = (tokens_embedded / 1,000,000) × price_per_M_tokens
query_cost    = (40 × n_queries / 1,000,000) × price_per_M_tokens   (40 tokens assumed per query)

Pricing verified 2026-04-23. See methodology for archetype sources and the table-splitting pitfall.
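
The formulas above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's engine: the function names, the `boundaryCount` parameter, the 120,000-token filing length, and the $0.02 per million tokens price for text-embedding-3-small are all assumptions made for the example. The page does not state how `tokens_embedded` is derived, so the sketch takes it as an input.

```javascript
// Sketch of the estimate formulas. Names and constants are illustrative assumptions.
const PRICE_PER_M_TOKENS = 0.02; // USD per 1M tokens, assumed for text-embedding-3-small
const TOKENS_PER_QUERY = 40;     // per-query token count used by the query_cost formula

function estimateChunkCount({ totalTokens, chunkSize, overlapPct, strategy, boundaryCount = 0 }) {
  const stride = chunkSize * (1 - overlapPct);
  const baseCount = Math.ceil(totalTokens / stride);
  if (strategy === "structural") return Math.max(boundaryCount, baseCount);
  if (strategy === "semantic") return Math.ceil(baseCount * 0.95);
  return baseCount; // recursive
}

function embeddingCost(tokens, pricePerMTokens = PRICE_PER_M_TOKENS) {
  return (tokens / 1e6) * pricePerMTokens;
}

function queryCost(nQueries, pricePerMTokens = PRICE_PER_M_TOKENS) {
  return embeddingCost(TOKENS_PER_QUERY * nQueries, pricePerMTokens);
}

// totalTokens is a hypothetical filing length; 140898 is the "Tokens embedded" figure shown above.
const recursiveCount = estimateChunkCount({
  totalTokens: 120000, chunkSize: 1024, overlapPct: 0.15, strategy: "recursive",
});
console.log(recursiveCount);                    // 138
console.log(embeddingCost(140898).toFixed(6));  // 0.002818
console.log(queryCost(100).toFixed(6));         // 0.000080
```

Under these assumptions the sketch reproduces the chunk count and both cost figures shown in the Detail panel, which is a quick way to sanity-check a configuration before running a retrieval eval.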

How to use

Step-by-step

1. Upload the filing (10-K, 10-Q, 8-K) or paste the SEC EDGAR URL.
2. Pick a chunk strategy: retrieval (4K tokens, for embedding search), summarization (16K tokens, for long-context models), or structured (XBRL fields extracted separately).
3. Run the optimizer. It splits on Item boundaries and MD&A sub-sections, producing coherent chunks rather than arbitrary character cuts.
4. Download the chunks as JSON. Each chunk includes its section path, character offsets, and token count.
5. For Q&A use cases, pair the chunks with hierarchical retrieval (filing → item → paragraph), which beats flat retrieval on most filing-Q&A benchmarks.
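
To make the hierarchical idea in the last step concrete, here is a small sketch that narrows to the best-matching Item first and only then ranks paragraphs inside it. The `sectionPath` and `text` field names are hypothetical (the export schema is not shown here), character offsets and token counts are omitted for brevity, and the term-overlap score is a toy stand-in for embedding similarity.

```javascript
// Hypothetical chunk records: sectionPath is [filing, item, paragraph].
const chunks = [
  { sectionPath: ["10-K", "Item 1A", "p1"], text: "Supply chain disruption could reduce revenue and margins." },
  { sectionPath: ["10-K", "Item 1A", "p2"], text: "Cybersecurity incidents may interrupt operations." },
  { sectionPath: ["10-K", "Item 7", "p1"], text: "Revenue grew on higher unit volume." },
  { sectionPath: ["10-K", "Item 7", "p2"], text: "Operating margin declined due to freight costs." },
];

// Lowercased set of alphanumeric terms in a string.
const terms = (s) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);

// Toy relevance: number of distinct query terms that also appear in the text.
function score(query, text) {
  const textTerms = terms(text);
  let hits = 0;
  for (const t of terms(query)) if (textTerms.has(t)) hits += 1;
  return hits;
}

function hierarchicalRetrieve(query, allChunks, topK = 2) {
  // Group paragraphs by item (second element of the section path).
  const byItem = new Map();
  for (const c of allChunks) {
    const item = c.sectionPath[1];
    if (!byItem.has(item)) byItem.set(item, []);
    byItem.get(item).push(c);
  }
  // Pick the item whose combined text best matches the query.
  let bestItem = null;
  let bestScore = -1;
  for (const [item, members] of byItem) {
    const s = score(query, members.map((c) => c.text).join(" "));
    if (s > bestScore) { bestScore = s; bestItem = item; }
  }
  if (bestItem === null) return [];
  // Rank paragraphs inside the winning item only.
  return byItem.get(bestItem)
    .map((c) => ({ ...c, score: score(query, c.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const riskHits = hierarchicalRetrieve("What risks could disrupt the supply chain?", chunks);
console.log(riskHits.map((c) => c.sectionPath.join(" > ")));
// [ '10-K > Item 1A > p1', '10-K > Item 1A > p2' ]
```

Restricting the paragraph ranking to one Item keeps unrelated sections (here, MD&A) out of the candidate set entirely, which is the main difference from a flat search over all chunks.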

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/sec-filing-chunk-optimizer.js";

Contract: /contracts/sec-filing-chunk-optimizer.json


Questions people ask next

FAQ

What chunks does the tool produce?

Sections of an SEC filing (10-K, 10-Q, 8-K) split for downstream LLM ingestion. Splits respect document structure — Item boundaries (Item 1, Item 1A, etc.), MD&A sub-sections — rather than naive character-count splits. The methodology page documents the parsing rules.

Why not use a generic text splitter?

Generic splitters break in the middle of tables, footnotes, or risk-factor sentences, producing chunks that lose meaning. SEC filings have machine-readable structure (HTML/XBRL) that the tool exploits. The result is chunks that align with section boundaries and stay coherent.
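
As a rough illustration of structure-aware splitting, the sketch below cuts plain filing text at "Item N." headings so that every section starts at its own heading. The function name and the regular expression are assumptions for the example, not the tool's parsing rules; real filings also need handling for tables of contents, cross-references that start a line, and HTML markup.

```javascript
// Split plain filing text at lines that begin with an "Item N." heading.
function splitByItem(text) {
  const heading = /^Item\s+\d+[A-Z]?\./gim;
  const starts = [...text.matchAll(heading)].map((m) => m.index);
  if (starts.length === 0) return [{ item: null, text }];

  const sections = [];
  // Keep any non-empty text before the first heading (e.g. a cover page).
  const preamble = text.slice(0, starts[0]).trim();
  if (preamble.length > 0) sections.push({ item: null, text: preamble });

  starts.forEach((start, i) => {
    const end = i + 1 < starts.length ? starts[i + 1] : text.length;
    const body = text.slice(start, end).trim();
    // Label the section with its heading, without the trailing period.
    sections.push({ item: body.match(/^Item\s+\d+[A-Z]?/i)[0], text: body });
  });
  return sections;
}

const sampleFiling = [
  "Item 1. Business",
  "We sell widgets.",
  "Item 1A. Risk Factors",
  "Demand may fall.",
  "Item 7. MD&A",
  "Revenue rose.",
].join("\n");

const itemSections = splitByItem(sampleFiling);
console.log(itemSections.map((s) => s.item)); // [ 'Item 1', 'Item 1A', 'Item 7' ]
```

A character-count splitter applied to the same text could cut between a heading and its first sentence; splitting at headings first, then sizing within each section, avoids that.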

How big are the chunks?

Configurable, with sensible defaults: 4,000 tokens for retrieval (confirm this fits your embedding model's input limit, which varies by model), 16,000 tokens for long-context summarization. The tool overlaps chunks by 200 tokens to preserve context across boundaries. Larger chunks mean fewer retrieval hops, but each hop is more expensive.
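
A fixed-token overlap can be sketched as a sliding window over a token array. The function below is a hypothetical helper, not part of the tool; it assumes the text has already been tokenized and uses integers as stand-in tokens.

```javascript
// Sliding-window chunking with a fixed token overlap between neighbours.
function chunkWithOverlap(tokens, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new RangeError("overlap must be smaller than chunkSize");
  const stride = chunkSize - overlap; // tokens advanced per chunk
  const windows = [];
  for (let start = 0; start < tokens.length; start += stride) {
    windows.push(tokens.slice(start, start + chunkSize));
    if (start + chunkSize >= tokens.length) break; // this window reached the end
  }
  return windows;
}

// 10,000 stand-in tokens, 4,000-token chunks, 200-token overlap.
const fakeTokens = Array.from({ length: 10000 }, (_, i) => i);
const windows = chunkWithOverlap(fakeTokens, 4000, 200);
console.log(windows.map((w) => w.length)); // [ 4000, 4000, 2400 ]
console.log(windows[1][0]);                // 3800
```

A 200-token overlap on a 4,000-token chunk is a 5% overlap, so the stride is 3,800 tokens, which matches `chunk_size × (1 − overlap_pct)` in the estimate formulas.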

Does it handle XBRL?

Yes — XBRL-tagged numerical data is extracted as structured fields (revenue, net income, R&D spend by year), separate from the narrative text chunks. This is critical for finance use cases where you want exact numbers, not LLM-paraphrased ones. The methodology page lists supported XBRL taxonomies.

What's the optimal chunk strategy for filing Q&A?

Documented on the methodology page: large embedding-search chunks (4K) for retrieval, then re-rank with a smaller model, then send only the top-3 chunks to the answering model. This keeps cost under control while preserving recall. Hierarchical (filing → item → paragraph) retrieval beats flat retrieval on most filing-Q&A benchmarks.
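
The retrieve, re-rank, top-3 flow can be sketched with two pluggable scoring callbacks. Everything named here is hypothetical: `firstPassScore` stands in for an embedding search, `rerankScore` for a smaller re-ranking model, and the toy chunks carry precomputed scores only so the example runs without any model.

```javascript
// Two-stage retrieval: wide first pass, re-rank the survivors, keep the top few.
function retrieveForAnswer(query, chunkList, { firstPassScore, rerankScore, recallK = 20, finalK = 3 }) {
  const byScoreDesc = (a, b) => b.score - a.score;
  const candidates = chunkList
    .map((chunk) => ({ chunk, score: firstPassScore(query, chunk) }))
    .sort(byScoreDesc)
    .slice(0, recallK); // only these reach the re-ranker
  return candidates
    .map(({ chunk }) => ({ chunk, score: rerankScore(query, chunk) }))
    .sort(byScoreDesc)
    .slice(0, finalK)
    .map(({ chunk }) => chunk);
}

// Toy chunks with precomputed scores in place of real models.
const toyChunks = [
  { id: "a", coarse: 0.9, fine: 0.2 },
  { id: "b", coarse: 0.8, fine: 0.9 },
  { id: "c", coarse: 0.7, fine: 0.5 },
  { id: "d", coarse: 0.6, fine: 0.7 },
  { id: "e", coarse: 0.1, fine: 1.0 },
];

const topChunks = retrieveForAnswer("any query", toyChunks, {
  firstPassScore: (_query, c) => c.coarse,
  rerankScore: (_query, c) => c.fine,
  recallK: 4,
});
console.log(topChunks.map((c) => c.id)); // [ 'b', 'd', 'c' ]
```

Chunk `e` has the best re-rank score but never reaches the re-ranker because the first pass drops it. That is the recall tradeoff in this design: `recallK` has to be wide enough for the first pass not to discard the right chunk, while `finalK` keeps the answering model's input small.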


Complementary tools

Financial Document Token Estimator

Paste a 10-K, 10-Q, 8-K or earnings transcript and see token count + one-pass extraction cost across eight frontier LLMs, with cache-hit toggle.


Structured Schema Validator for Finance

Paste LLM JSON output and validate against four pre-built finance schemas — research output, trade decision, risk snapshot, peer comparison — with sanity.


Hallucination Detector

Paste a source document + an LLM's extraction. Every numeric claim in the output is checked against the source. Client-side. Catches silent fabrication.