aifinhub

Generator

SEC Filing Chunk Optimizer

SEC filing chunk sizing + 10-K chunking cost calculator. Pick archetype, chunk size, overlap, strategy, and embedding model. Browser-only. Free.

Inputs: Configuration
Runtime: Instant
Privacy: Client-side · no upload
API key: Not required

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

1 · Configure chunk strategy

Chunking strategy: structural
Chunk size: 1,024 tok
Overlap: 15%
Queries: 100

Chunk geometry

1,024 tok

15% overlap · structural · 138 chunks of avg 1,021 tok

Ingest $0.002818  ·  100 queries $0.000080  ·  text-embedding-3-small

Strategy note

Respects Items / section headers / speaker turns. Preserves table blocks by keeping heading+table together. Chunk sizes are uneven but semantically clean.

No structural warnings at these settings. Still run a retrieval eval before production — heuristics can't replace ground truth.

Archetype reference: Form 10-K business + risk + MD&A + financials. ~12 Items. Dense tables in Item 7 / 8.

Detail

Total chunks

138

Avg tokens/chunk

1,021

min 1,024 · max 1,024

Ingest cost (once)

$0.002818

text-embedding-3-small

Query cost (100 re-embeds)

$0.000080

Tokens embedded

140,898

3 · Compare strategies (same archetype + chunk size)

Strategy | Chunks | Avg tok | Min / Max | Ingest cost | Tradeoff
structural (selected) | 138 | 1,021 | 1,024 / 1,024 | $0.002818 | Highest fidelity; uneven chunk sizes.
recursive | 138 | 1,021 | 614 / 1,024 | $0.002818 | Cheap + deterministic; blind to tables.
semantic | 132 | 1,061 | 409 / 1,433 | $0.002801 | Coherent prose groups; variable sizes, higher compute.

How the estimate works

stride        = chunk_size × (1 − overlap_pct)
base_count    = ceil(total_tokens / stride)
structural    → max(boundary_count, base_count)
recursive     → base_count
semantic      → ceil(base_count × 0.95), wider size variance
ingest_cost   = tokens_embedded × $/M_tokens
query_cost    = (40 × n_queries) × $/M_tokens

Pricing verified 2026-04-23. See methodology for archetype sources and the table-splitting pitfall.
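
The formulas above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's own engine: the function names, the 120,000-token filing length, and the boundary count of 12 are hypothetical, and the $0.02 per million tokens rate is inferred from the ingest and query figures shown on this page rather than taken from a price list.

```javascript
// Tokens assumed per query, taken from: query_cost = (40 × n_queries) × $/M_tokens
const TOKENS_PER_QUERY = 40;

// Chunk-count estimate for one strategy (hypothetical helper, not the tool's API).
function estimateChunks({ totalTokens, chunkSize, overlapPct, strategy, boundaryCount = 0 }) {
  const stride = chunkSize * (1 - overlapPct);
  const baseCount = Math.ceil(totalTokens / stride);
  switch (strategy) {
    case "structural":
      return Math.max(boundaryCount, baseCount);
    case "recursive":
      return baseCount;
    case "semantic":
      return Math.ceil(baseCount * 0.95);
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

// Embedding cost in USD for a token count at a per-million-token rate.
function embeddingCost(tokens, usdPerMillionTokens) {
  return (tokens * usdPerMillionTokens) / 1_000_000;
}

// Assumed inputs: totalTokens and boundaryCount are made up for illustration.
const settings = { totalTokens: 120_000, chunkSize: 1024, overlapPct: 0.15, boundaryCount: 12 };
const structural = estimateChunks({ ...settings, strategy: "structural" });
const semantic = estimateChunks({ ...settings, strategy: "semantic" });

// 140,898 is the "Tokens embedded" figure from the Detail panel; 0.02 is the inferred rate.
const ingestCost = embeddingCost(140_898, 0.02);
const queryCost = embeddingCost(TOKENS_PER_QUERY * 100, 0.02);

console.log(structural, semantic, ingestCost.toFixed(6), queryCost.toFixed(6));
// prints: 138 132 0.002818 0.000080
```

With these assumed inputs the sketch reproduces the chunk counts and costs shown in the Detail panel and the comparison table. How the tool derives the tokens-embedded figure is not stated here, so the sketch takes it as given.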

How to use

Step-by-step

  1. Upload the filing (10-K, 10-Q, 8-K) or paste the SEC EDGAR URL.
  2. Pick chunk strategy: retrieval (4K tokens, for embedding-search), summarization (16K tokens, for long-context), or structured (XBRL fields extracted separately).
  3. Run the optimizer. It splits respecting Item boundaries and MD&A sub-sections, giving coherent chunks rather than arbitrary character cuts.
  4. Download chunks as JSON. Each chunk includes section path, character offsets, and token count.
  5. For Q&A use cases, pair with hierarchical retrieval (filing → item → paragraph), which outperforms flat retrieval on filing-Q&A benchmarks.

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/sec-filing-chunk-optimizer.js";

Contract: /contracts/sec-filing-chunk-optimizer.json


Questions people ask next

FAQ

What chunks does the tool produce?

Sections of an SEC filing (10-K, 10-Q, 8-K) split for downstream LLM ingestion. Splits respect document structure — Item boundaries (Item 1, Item 1A, etc.), MD&A sub-sections — rather than naive character-count splits. The methodology page documents the parsing rules.
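
As a rough illustration of splitting on Item boundaries, the sketch below cuts plain text at lines that start with an Item heading. It is not the tool's parser: the regular expression, the `splitByItems` name, and the sample text are assumptions, and real filings need extra handling (for example, table-of-contents entries that also begin with "Item").

```javascript
// Split plain text into sections at "Item N." headings (hypothetical helper).
// Text before the first heading is ignored in this sketch.
function splitByItems(text) {
  const heading = /^Item\s+(\d{1,2}[A-C]?)\./gim;
  const matches = [...text.matchAll(heading)];
  return matches.map((m, i) => {
    const start = m.index;
    // A section runs until the next heading, or to the end of the text.
    const end = i + 1 < matches.length ? matches[i + 1].index : text.length;
    return { item: m[1].toUpperCase(), start, end, text: text.slice(start, end).trim() };
  });
}

// Made-up sample, not taken from a real filing.
const sample = [
  "Item 1. Business",
  "We make widgets.",
  "Item 1A. Risk Factors",
  "Demand may fall.",
  "Item 7. Management's Discussion and Analysis",
  "Revenue grew.",
].join("\n");

const sections = splitByItems(sample);
console.log(sections.map((s) => s.item).join(", "));
// prints: 1, 1A, 7
```

Each returned section carries its character offsets, so a later size-based split can run inside one Item without crossing into the next.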

Why not use a generic text splitter?

Generic splitters break in the middle of tables, footnotes, or risk-factor sentences, producing chunks that lose meaning. SEC filings have machine-readable structure (HTML/XBRL) that the tool exploits. The result is more coherent chunks; per the estimate formulas, the structural chunk count is never lower than that of a fixed-stride split.

How big are the chunks?

Configurable, with defaults of 4,000 tokens for retrieval and 16,000 tokens for long-context summarization. Input limits vary by embedding model, so confirm that the chunk size fits the model you select. The tool overlaps chunks by 200 tokens to preserve context across boundaries. Larger chunks mean fewer retrieval hops, but each hop is more expensive.
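
To make the overlap concrete, here is a minimal sliding-window sketch using the 4,000-token chunk size and 200-token overlap mentioned above. The `windowTokens` helper and the 10,000-token input are assumptions for illustration; the tool itself splits on document structure first, which this sketch does not attempt.

```javascript
// Fixed-size windows over a token array, each sharing `overlap` tokens with the previous one.
function windowTokens(tokens, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new RangeError("overlap must be smaller than chunkSize");
  const stride = chunkSize - overlap;
  const chunks = [];
  for (let start = 0; start < tokens.length; start += stride) {
    const end = Math.min(start + chunkSize, tokens.length);
    chunks.push({ start, end, tokens: tokens.slice(start, end) });
    if (end === tokens.length) break; // the last window already reached the end
  }
  return chunks;
}

// Stand-in for a tokenized filing: 10,000 integer "tokens" (hypothetical).
const tokens = Array.from({ length: 10_000 }, (_, i) => i);
const chunks = windowTokens(tokens, 4000, 200);
console.log(chunks.map((c) => `${c.start}-${c.end}`).join(", "));
// prints: 0-4000, 3800-7800, 7600-10000
```

Each window starts 3,800 tokens after the previous one, so the last 200 tokens of one chunk reappear at the start of the next.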

Does it handle XBRL?

Yes — XBRL-tagged numerical data is extracted as structured fields (revenue, net income, R&D spend by year), separate from the narrative text chunks. This is critical for finance use cases where you want exact numbers, not LLM-paraphrased ones. The methodology page lists supported XBRL taxonomies.

What's the optimal chunk strategy for filing Q&A?

Documented on the methodology page: large embedding-search chunks (4K) for retrieval, then re-rank with a smaller model, then send only the top-3 chunks to the answering model. This keeps cost under control while preserving recall. Hierarchical (filing → item → paragraph) retrieval beats flat retrieval on most filing-Q&A benchmarks.
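
The retrieve, re-rank, and top-3 flow described above can be sketched as follows. Everything here is illustrative: the two-dimensional vectors stand in for real embeddings, and the keyword-counting `rerankScore` is a placeholder for whatever smaller re-ranking model you use. Hierarchical retrieval is not shown.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Stage 1: shortlist by embedding similarity. Stage 2: re-rank and keep the top K.
function retrieveThenRerank(queryVec, chunks, rerankScore, { candidates = 20, topK = 3 } = {}) {
  return chunks
    .map((chunk) => ({ chunk, sim: cosine(queryVec, chunk.vector) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, candidates)
    .map(({ chunk }) => ({ chunk, score: rerankScore(chunk) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(({ chunk }) => chunk);
}

// Toy corpus (made up); real vectors would come from an embedding model.
const corpus = [
  { id: "a", vector: [1.0, 0.0], text: "Revenue rose 8% on services revenue." },
  { id: "b", vector: [0.9, 0.1], text: "Revenue recognition policy." },
  { id: "c", vector: [0.8, 0.2], text: "Risk factors overview." },
  { id: "d", vector: [0.1, 0.9], text: "Executive compensation." },
  { id: "e", vector: [0.7, 0.3], text: "Net revenue by segment: product revenue, services revenue." },
];

// Placeholder re-ranker: counts mentions of "revenue" (an assumption, not a real model).
const rerankScore = (chunk) => (chunk.text.match(/revenue/gi) || []).length;

const top = retrieveThenRerank([1, 0], corpus, rerankScore, { candidates: 4, topK: 3 });
console.log(top.map((c) => c.id).join(", "));
// prints: e, a, b
```

Only the three surviving chunks would be sent to the answering model, which is where the cost saving comes from.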


Planning estimates only — not financial, tax, or investment advice.