SEC Filing Chunk Optimizer: Examples

Embedding cost is trivial; the warnings are not. Each scenario pairs a chunk size, overlap percentage, and chunking strategy with a document archetype, then prices the embedding with OpenAI's text-embedding-3-small at $0.02 per million tokens. The document here is a table-heavy 10-K body of roughly 120,000 tokens. The output that matters is the chunk count and any warnings flagging configurations that split tables, fragment sections, or inflate cost without improving recall.

By AI Fin Hub Research · AI Fin Hub Team

SEC Filing Chunk Optimizer

Pick a filing archetype, tune chunk size and overlap, and see chunk count, embedding cost, and structural-boundary warnings across three chunking strategies.

Worked Examples

See the inputs and outcome together

Each scenario keeps the starting point, the outcome, and the actual lesson in one place so the page reads like a decision notebook, not a data dump.

  1. Structural chunking of a 10-K body

    A 1,024-token chunk with 10 percent overlap using a structural splitter that respects section and table boundaries. The recommended starting config for a dense filing.

    131 chunks, average 1,017 tokens, one-time embedding cost $0.00266, zero warnings.

    Document: 10-K full body (~120K tokens)
    Chunk size: 1,024 tokens
    Overlap: 10%
    Strategy: Structural
    Embedding model: text-embedding-3-small

    Structural chunking on a 10-K produces zero warnings because it keeps tables and section headers intact. At under a third of a cent to embed the whole filing, cost is irrelevant; the value is the clean, warning-free split that protects retrieval quality.

  2. Recursive splitter on the same filing

    Identical chunk size and overlap, but a recursive character splitter that ignores document structure. The popular default that misbehaves on tables.

    131 chunks at the same $0.00266 cost, but one warning: recursive splitters are table-blind on a table-heavy filing.

    Document: 10-K full body (~120K tokens)
    Chunk size: 1,024 tokens
    Overlap: 10%
    Strategy: Recursive
    Embedding model: text-embedding-3-small

    The cost and chunk count are identical to the structural plan, yet the tool raises a warning. A table-blind splitter severs numeric rows from their headers, so retrieval can return half a table. Same price, worse retrieval, which is exactly the trap a cost-only comparison hides.

  3. Chunks too small on a dense filing

    Shrinking to 512-token chunks with a recursive splitter, which roughly doubles the chunk count and risks slicing risk-factor paragraphs and tables.

    261 chunks, average 511 tokens, cost $0.00267, table-blind warning still active.

    Document: 10-K full body (~120K tokens)
    Chunk size: 512 tokens
    Overlap: 10%
    Strategy: Recursive
    Embedding model: text-embedding-3-small

    Halving the chunk size roughly doubles the chunk count, from 131 to 261, while the embedding cost barely moves: the overlap percentage is unchanged, so the total ingested tokens stay almost the same. Cost is flat, but small chunks combined with a table-blind splitter on a dense filing are the worst of both worlds for relevance.

  4. Overlap set too high

    Back to structural chunking but with overlap raised to 30 percent, above the level where extra overlap stops helping recall.

    168 chunks, ingested tokens rise to 171,192, cost $0.00342, warning: overlap above 25% inflates cost without recall gain.

    Document: 10-K full body (~120K tokens)
    Chunk size: 1,024 tokens
    Overlap: 30%
    Strategy: Structural
    Embedding model: text-embedding-3-small

    Raising overlap from 10 to 30 percent pushes ingested tokens from about 133K to 171K, a jump of roughly 28 percent in embedding and storage cost, and the tool flags that published RAG benchmarks show no recall benefit past about 25 percent. This is pure waste that the warning catches before you pay for it at scale.
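
The chunk counts in the four scenarios can be approximated with simple stride arithmetic. The sketch below is an assumption about how such figures arise, not the tool's actual implementation; `estimate_plan` and its parameter names are hypothetical. It assumes each chunk after the first re-embeds `chunk_size * overlap` tokens, and it uses the $0.02 per million token price quoted on this page.

```python
import math

# Price quoted on this page for text-embedding-3-small (USD per million tokens).
PRICE_PER_MILLION_USD = 0.02


def estimate_plan(doc_tokens: int, chunk_size: int, overlap: float) -> dict:
    """Approximate chunk count, ingested tokens, and one-time embedding cost.

    Hypothetical helper: a real splitter adjusts boundaries to the text,
    so its token totals will differ slightly from this estimate.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be a fraction in [0, 1)")
    # Each chunk advances by the non-overlapping part of its window.
    stride = chunk_size * (1 - overlap)
    chunks = math.ceil(doc_tokens / stride)
    # Every chunk after the first repeats the overlapping tokens.
    ingested = round(doc_tokens + (chunks - 1) * chunk_size * overlap)
    cost_usd = ingested * PRICE_PER_MILLION_USD / 1_000_000
    return {"chunks": chunks, "ingested_tokens": ingested, "cost_usd": cost_usd}


if __name__ == "__main__":
    # (chunk size, overlap) pairs from the scenarios above, on a ~120K-token 10-K.
    for chunk_size, overlap in [(1024, 0.10), (512, 0.10), (1024, 0.30)]:
        plan = estimate_plan(120_000, chunk_size, overlap)
        print(chunk_size, overlap, plan["chunks"], plan["ingested_tokens"],
              f"${plan['cost_usd']:.5f}")
```

Under these assumptions the estimate gives 131, 261, and 168 chunks, matching the scenarios, and its ingested-token totals land within a fraction of a percent of the reported figures. The small gap is expected, because the reported averages (1,017 and 511 tokens) show the splitter does not fill every chunk to the limit.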

Patterns

Embedding cost for an entire 10-K is under a cent, so config should optimize retrieval quality, not embedding price.
A structural splitter avoids the table-blind warning that recursive splitters trigger on table-heavy filings.
Halving chunk size roughly doubles the chunk count but barely changes cost, since total ingested tokens depend on the overlap percentage rather than the chunk size.
Overlap above about 25 percent inflates ingested tokens and storage with no measured recall gain.
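
The two warnings seen in the scenarios can be written as plain rules. This is a minimal sketch of that logic as described on this page, not the tool's source; the function name `config_warnings`, its parameters, and the lowercase strategy labels are hypothetical.

```python
# Threshold described on this page: overlap above about 25 percent shows no recall gain.
OVERLAP_WARN_THRESHOLD = 0.25


def config_warnings(strategy: str, overlap: float, table_heavy: bool) -> list[str]:
    """Return the warnings a chunking config would raise under the rules above.

    Hypothetical helper: `strategy` is "structural" or "recursive",
    `overlap` is a fraction such as 0.10, and `table_heavy` describes the document.
    """
    warnings = []
    # A recursive character splitter ignores table boundaries.
    if strategy == "recursive" and table_heavy:
        warnings.append("recursive splitters are table-blind on a table-heavy filing")
    # Extra overlap past the threshold only adds ingested tokens.
    if overlap > OVERLAP_WARN_THRESHOLD:
        warnings.append("overlap above 25% inflates cost without recall gain")
    return warnings


if __name__ == "__main__":
    print(config_warnings("structural", 0.10, table_heavy=True))
    print(config_warnings("recursive", 0.10, table_heavy=True))
    print(config_warnings("structural", 0.30, table_heavy=True))
```

Note that neither rule looks at price: chunk size never enters the check, which mirrors the point that configuration should be judged on retrieval quality rather than embedding cost.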

Planning estimates only — not financial, tax, or investment advice.