How does chunk overlap interact with these strategies?

Overlap repeats a few tokens between adjacent chunks. It is a separate knob that helps either strategy: a sentence near a boundary appears in both neighboring chunks, so retrieval does not miss it. It matters most for recursive chunking, where boundaries are placed by size and can land mid-idea; overlap softens those arbitrary cuts. Semantic chunking needs less overlap because its boundaries already fall at natural topic shifts, though a small overlap still guards against edge cases.
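As a rough illustration of the overlap knob, the sketch below slides a fixed-size window over a token list and repeats the last few tokens of each chunk at the start of the next. The function name chunk_with_overlap and the whitespace "tokenizer" are assumptions made for this example; a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split a token list into windows of `size` tokens,
    repeating `overlap` tokens between neighboring windows."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
tokens = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
print(chunks)
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each boundary token ("d" and "g" here) lands in both neighbors, which is the property that keeps a boundary sentence retrievable from either side.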

What about filings with tables and structured sections?

Neither generic strategy handles tables and itemized sections well, because both treat the document as flowing text. For filings, a structure-aware layer that recognizes headings, item numbers, and table boundaries usually matters more than the choice between semantic and recursive splitting. The best results often come from first segmenting by document structure, then applying recursive or semantic chunking within each structural unit, so that a table or a numbered item is never split across chunks.
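One way to picture the structure-first approach is the sketch below. It first cuts the filing at "Item N." headings, then packs whole blank-line-delimited blocks into chunks inside each item, so a block such as a small table is never split and no chunk crosses an item boundary. The heading pattern, the helper names split_by_items and chunk_section, and the assumption that a table arrives as one blank-line-delimited block are all illustrative choices, not a description of any particular filing parser. Paragraph packing stands in here for the recursive or semantic step applied within each unit.

```python
import re

# Assumed heading style: "Item 1.", "Item 1A.", ... at the start of a line.
ITEM_HEADING = re.compile(r"^Item\s+\d+[A-Z]?\.", re.MULTILINE)


def split_by_items(text):
    """Return (heading_line, body) pairs, one per 'Item N.' section."""
    starts = [m.start() for m in ITEM_HEADING.finditer(text)]
    if not starts:
        return [("", text.strip())]
    sections = []
    if text[:starts[0]].strip():  # any preamble before the first item
        sections.append(("", text[:starts[0]].strip()))
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        heading, _, body = text[begin:end].partition("\n")
        sections.append((heading.strip(), body.strip()))
    return sections


def chunk_section(heading, body, max_words):
    """Pack whole blocks into chunks of roughly max_words words.
    A single oversized block is kept whole rather than split."""
    chunks, current, count = [], [], 0
    for block in filter(None, (b.strip() for b in body.split("\n\n"))):
        n = len(block.split())
        if current and count + n > max_words:
            chunks.append({"item": heading, "text": "\n\n".join(current)})
            current, count = [], 0
        current.append(block)
        count += n
    if current:
        chunks.append({"item": heading, "text": "\n\n".join(current)})
    return chunks


filing = """Item 1. Business
We sell widgets worldwide.

Our segments are retail and wholesale.

Item 1A. Risk Factors
Demand may fall.

Year | Revenue
2023 | 10
2024 | 12
"""

chunks = [c for h, b in split_by_items(filing)
          for c in chunk_section(h, b, max_words=8)]
for c in chunks:
    print(repr(c["item"]), "->", repr(c["text"]))
```

Carrying the item heading on every chunk is a design choice that also gives the retriever a cheap metadata filter.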

AI in Markets Comparison

Semantic vs Recursive Chunking

Chunking decides how a long document like a 10-K is cut into pieces before embedding and retrieval, and bad chunks cap the quality of everything downstream. Recursive chunking is the common default: it tries to split on natural separators in order of preference and recurses to finer ones until each piece fits a token budget. Semantic chunking is meaning-aware: it embeds candidate units and places a boundary where the topic shifts, so chunks group related sentences regardless of length. One optimizes for predictable size and speed; the other for coherent boundaries. This matrix compares them for filing-style documents.

Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team

Semantic Chunking Option

Embeds sentences or small units and places chunk boundaries where embedding similarity drops, so each chunk is a topically coherent span rather than a fixed size.

Pros

Boundaries fall at genuine topic shifts, so chunks rarely split a single idea in half
Each chunk is topically coherent, which embeds cleanly and retrieves more precisely
Adapts chunk size to content, keeping a tight discussion together and breaking at real transitions
Reduces the dilution that hurts dense retrieval when a chunk mixes unrelated topics

Cons

Requires embedding every candidate unit up front, adding meaningful cost and latency
Variable chunk sizes complicate token budgeting and can produce awkwardly large or tiny chunks
Sensitive to the similarity threshold and the embedding model, adding tuning and nondeterminism
More complex to implement and to reproduce exactly across runs

Best for: documents with topic drift inside sections, prose-heavy filings, and pipelines where boundary quality justifies the embedding cost
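The boundary rule described for semantic chunking can be sketched in a few lines: embed each sentence, compare neighbors, and start a new chunk wherever similarity drops below a threshold. The toy_embed function below is a deliberately crude bag-of-words stand-in so the example runs without any model; in practice you would pass a real sentence-embedding function as the embed argument. The function names and the threshold value are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model (assumption):
    a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold, embed=toy_embed):
    """Group consecutive sentences; open a new chunk where the
    similarity between neighboring sentences falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([])  # topic shift: start a new chunk
        groups[-1].append(sentence)
    return [" ".join(g) for g in groups]


sentences = [
    "Revenue grew on higher widget revenue.",
    "Widget revenue growth came from pricing.",
    "Litigation risk remains material.",
    "The litigation concerns a patent dispute.",
]
chunks = semantic_chunks(sentences, threshold=0.15)
print(chunks)
```

With these four sentences the only similarity below 0.15 is between the second and third, so the revenue discussion and the litigation discussion end up in separate chunks. Moving the threshold changes the result, which is the tuning sensitivity listed among the cons.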

Recursive Chunking Option

Splits text using an ordered list of separators, paragraphs then sentences then characters, recursing to finer ones until each chunk fits a target token size.

Pros

Fast, deterministic, and cheap: no embeddings needed to decide the splits
Predictable chunk sizes that make token budgeting and overlap straightforward
Respects document structure when separators map to real boundaries like paragraphs
Simple to implement, reproduce, and debug, with broad library support

Cons

Can cut mid-thought when a topic spans the target size and the separator falls awkwardly
Ignores meaning: a boundary is placed by size and separator, not by where the idea ends
Fixed targets force unrelated content together or split related content apart
Quality depends on choosing good separators and a size that matches the document

Best for: most filings as a strong default, structured documents with clear separators, and pipelines that need speed and determinism
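The separator cascade can be sketched as a short recursive function: try the coarsest separator first, recurse into any part that is still too large using finer separators, and fall back to a hard character cut only when no separator is left. This is a minimal sketch, not the implementation of any specific library. The budget is counted in characters as a stand-in for tokens, and the name recursive_split and the default separator list are assumptions.

```python
def recursive_split(text, max_len, separators=("\n\n", ". ", " ")):
    """Split text into chunks of at most max_len characters,
    preferring coarse separators and recursing to finer ones."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Last resort: no separator left, cut by raw character count.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep each separator attached to the part before it, so that
    # concatenating all chunks reproduces the original text exactly.
    parts = [p + sep for p in parts[:-1]] + parts[-1:]
    chunks = []
    for part in parts:
        for piece in recursive_split(part, max_len, finer):
            # Greedily pack small neighbors together up to the budget.
            if chunks and len(chunks[-1]) + len(piece) <= max_len:
                chunks[-1] += piece
            else:
                chunks.append(piece)
    return chunks


text = ("Revenue rose. Costs fell.\n\n"
        "Risk factors include litigation and supply shocks.")
chunks = recursive_split(text, max_len=30)
for c in chunks:
    print(len(c), repr(c))
```

The first paragraph fits the budget and stays whole, while the longer second paragraph has no sentence break to use and is therefore cut at a word boundary. That is the mid-thought split listed among the cons.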

Decision Table

See the tradeoffs side by side

Criterion	Semantic Chunking	Recursive Chunking
Boundary basis	Topic shift via embeddings	Separators and target size
Chunk size	Variable, content-driven	Predictable, near target
Cost to chunk	Higher, embeds every unit	Low, no embeddings
Determinism	Lower, threshold-sensitive	High, reproducible
Splits mid-thought	Rarely	Can happen
Default recommendation	When topic drift hurts retrieval	Strong general default

Verdict

Start with structure-aware recursive chunking. For most filings it is a strong default: it is fast, deterministic, and cheap, and when its separators map to real document structure such as headings and paragraphs, its boundaries are perfectly serviceable. Treat recursive chunking as the baseline to beat, not as an approach to assume is inferior. Move to semantic chunking when you have evidence that topic drift inside sections is degrading retrieval, for example when chunks routinely mix two unrelated discussions and dense retrieval returns muddy results. Semantic chunking buys coherent boundaries that embed and retrieve better, at the cost of embedding every unit, variable sizes that complicate budgeting, and a similarity threshold to tune. The pragmatic path is to measure retrieval quality on your own corpus with recursive chunking first, then test semantic chunking only on the document types where the baseline visibly fails, rather than paying its cost everywhere by default.
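To make "measure retrieval quality on your own corpus" concrete, the sketch below computes a hit rate at k: the share of labeled queries for which at least one of the top k retrieved chunks contains the expected answer text. The lexical score function is a toy stand-in for whatever retriever the pipeline actually uses, and the two tiny chunk sets and queries are made-up illustrations, not results from any real filing.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, chunk):
    """Toy lexical-overlap score (assumption); a real pipeline
    would call its dense retriever here instead."""
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    return sum(min(q[w], c[w]) for w in q)


def hit_rate_at_k(chunks, labeled_queries, k):
    """Fraction of (query, answer_text) pairs whose answer text
    appears in at least one of the k best-scoring chunks."""
    hits = 0
    for query, answer in labeled_queries:
        top = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]
        hits += any(answer in ch for ch in top)
    return hits / len(labeled_queries)


# Same text cut two ways: one boundary lands mid-sentence, one does not.
baseline = ["Revenue rose 8 percent. Litigation",
            "risk remains material for the patent dispute."]
candidate = ["Revenue rose 8 percent.",
             "Litigation risk remains material for the patent dispute."]
queries = [
    ("How much did revenue rise?", "rose 8 percent"),
    ("Is litigation risk material?", "Litigation risk remains material"),
]

print(hit_rate_at_k(baseline, queries, k=1))   # 0.5
print(hit_rate_at_k(candidate, queries, k=1))  # 1.0
```

Running the same labeled queries against each chunking strategy gives a like-for-like number, which is what lets you confine semantic chunking to the document types where the baseline measurably falls short.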

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

Is semantic chunking always better than recursive chunking?

No. Semantic chunking produces more coherent boundaries in principle, but on well-structured documents where separators already align with topic boundaries, recursive chunking achieves similar retrieval quality for a fraction of the cost. Semantic chunking also adds variable chunk sizes, a tunable threshold, and embedding overhead, any of which can backfire if mis-set. The right comparison is empirical on your corpus, and recursive chunking is frequently good enough, which is why it remains the common default.
