Semantic vs Recursive Chunking
Chunking decides how a long document like a 10-K is cut into pieces before embedding and retrieval, and bad chunks cap the quality of everything downstream. Recursive chunking is the common default: it tries to split on natural separators in order of preference and recurses to finer ones until each piece fits a token budget. Semantic chunking is meaning-aware: it embeds candidate units and places a boundary where the topic shifts, so chunks group related sentences regardless of length. One optimizes for predictable size and speed; the other for coherent boundaries. This matrix compares them for filing-style documents.
Semantic Chunking
Embeds sentences or other small units and places a chunk boundary wherever the similarity between neighboring embeddings drops, so each chunk is a topically coherent span rather than a fixed-size window.
Pros
- Boundaries fall at genuine topic shifts, so chunks rarely split a single idea in half
- Each chunk is topically coherent, which embeds cleanly and retrieves more precisely
- Adapts chunk size to content, keeping a tight discussion together and breaking at real transitions
- Reduces the dilution that hurts dense retrieval when a chunk mixes unrelated topics
Cons
- Requires embedding every candidate unit up front, adding meaningful cost and latency
- Variable chunk sizes complicate token budgeting and can produce awkwardly large or tiny chunks
- Sensitive to the similarity threshold and the embedding model, adding tuning and nondeterminism
- More complex to implement and to reproduce exactly across runs
Best for: documents with topic drift inside sections, prose-heavy filings, and pipelines where boundary quality justifies the embedding cost.
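The boundary rule described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the `threshold` value of 0.2 is arbitrary and would need tuning on real data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Group consecutive sentences; start a new chunk when the similarity
    between adjacent sentences falls below the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: topic shift
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(chunk) for chunk in chunks]


# Illustrative sentences (made up for this sketch, not from a real filing).
sample_sentences = [
    "Revenue grew on higher cloud revenue.",
    "Cloud revenue growth offset lower hardware revenue.",
    "The company faces litigation risk in Delaware.",
    "Litigation risk may result in material losses.",
]
semantic_result = semantic_chunks(sample_sentences)
print(semantic_result)
```

With the toy embedding, the two revenue sentences land in one chunk and the two litigation sentences in another. Note that every sentence is embedded before any boundary is chosen, which is where the cost listed under Cons comes from, and that changing `threshold` or `embed` changes the boundaries.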
Recursive Chunking
Splits text using an ordered list of separators (paragraphs, then sentences, then characters), recursing to finer separators until each chunk fits a target token size.
Pros
- Fast, deterministic, and cheap: no embeddings needed to decide the splits
- Predictable chunk sizes that make token budgeting and overlap straightforward
- Respects document structure when separators map to real boundaries like paragraphs
- Simple to implement, reproduce, and debug, with broad library support
Cons
- Can cut mid-thought when a topic spans the target size and the separator falls awkwardly
- Ignores meaning: a boundary is placed by size and separator, not by where the idea ends
- Fixed size targets can force unrelated content together or split related content apart
- Quality depends on choosing good separators and a size that matches the document
Best for: most filings as a strong default, structured documents with clear separators, and pipelines that need speed and determinism.
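The separator recursion described in this section can be sketched as follows. This is a simplified illustration, not any particular library's implementation. It assumes whitespace-separated words approximate tokens (the hypothetical `count_tokens` helper), so the finest fallback here is a hard cut by words rather than by characters, and it drops the separator at each chunk boundary.

```python
def count_tokens(text):
    # Assumption: words approximate tokens; swap in a real tokenizer in practice.
    return len(text.split())


def recursive_split(text, max_tokens=50, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first, recursing to finer
    separators only for pieces that still exceed the budget."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard cut by words as a last resort.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep merging small pieces up to the budget
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece is still too large: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer))
    if current:
        chunks.append(current)
    return chunks


# Illustrative text (made up for this sketch, not from a real filing).
sample_doc = (
    "Revenue rose eight percent this year. Margins held steady. "
    "Costs fell slightly.\n\nLitigation remains a material risk."
)
recursive_result = recursive_split(sample_doc, max_tokens=8)
print(recursive_result)
```

No embeddings are involved, so the same input and settings always produce the same chunks. The sketch also shows the weakness listed under Cons: boundaries are placed by budget and separator, so a sentence can be separated from the one it depends on.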
Decision Table
See the tradeoffs side by side
| Criterion | Semantic Chunking | Recursive Chunking |
|---|---|---|
| Boundary basis | Topic shift via embeddings | Separators and target size |
| Chunk size | Variable, content-driven | Predictable, near target |
| Cost to chunk | Higher, embeds every unit | Low, no embeddings |
| Determinism | Lower, threshold-sensitive | High, reproducible |
| Splits mid-thought | Rarely | Can happen |
| Default recommendation | When topic drift hurts retrieval | Strong general default |
Verdict
Start with structure-aware recursive chunking. For most filings it is a strong default: it is fast, deterministic, and cheap, and when its separators map to real document structure such as headings and paragraphs, its boundaries are serviceable. Treat recursive chunking as the baseline to beat, not as an approach to assume is inferior. Move to semantic chunking when you have evidence that topic drift inside sections is degrading retrieval, for example when chunks routinely mix two unrelated discussions and dense retrieval returns muddy results. Semantic chunking buys coherent boundaries that embed and retrieve better, at the cost of embedding every unit, variable sizes that complicate budgeting, and a similarity threshold to tune. The pragmatic path is to measure retrieval quality on your own corpus with recursive chunking first, then test semantic chunking only on the document types where the baseline visibly fails, rather than paying its cost everywhere by default.
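One way to make "measure retrieval quality on your own corpus" concrete is a small hit-rate check over labeled queries. The sketch below is illustrative only: `top_k` is a toy word-overlap retriever standing in for dense retrieval, `hit_rate` is a hypothetical helper, and the chunks and query are made up to show the mechanics rather than to report a result.

```python
import re


def tokens(text):
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k(query, chunks, k=1):
    """Toy retriever: rank chunks by word overlap with the query.
    A stand-in for dense retrieval, assumed for illustration."""
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]


def hit_rate(labeled_queries, chunks, k=1):
    """Fraction of queries whose expected answer text appears in a top-k chunk."""
    hits = sum(
        any(answer in chunk for chunk in top_k(query, chunks, k))
        for query, answer in labeled_queries
    )
    return hits / len(labeled_queries)


# Made-up example: the same text chunked two ways.
baseline_chunks = [
    "Revenue rose eight percent. Litigation remains",
    "a material risk in Delaware.",
]
candidate_chunks = [
    "Revenue rose eight percent.",
    "Litigation remains a material risk in Delaware.",
]
labeled_queries = [("litigation risk", "Litigation remains a material risk")]

baseline_score = hit_rate(labeled_queries, baseline_chunks)
candidate_score = hit_rate(labeled_queries, candidate_chunks)
print(baseline_score, candidate_score)
```

In practice the labeled queries would come from questions your users actually ask, and the comparison would run per document type, so that semantic chunking is adopted only where the recursive baseline measurably falls short.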