Embedding vs BM25 Retrieval
Retrieval is the part of a RAG pipeline that decides which passages the model gets to read, so its failures cap the whole system's accuracy. BM25 scores documents by exact term overlap, weighting each match by term frequency, inverse document frequency, and document length; it is a strong, cheap, decades-old baseline. Dense retrieval encodes the query and passages into vectors and ranks by similarity, capturing meaning rather than surface form. The two fail in opposite ways: BM25 misses synonyms and paraphrase, while embeddings can miss rare exact tokens such as a specific ticker or footnote reference. In finance, where exact identifiers and numbers carry much of the signal, that contrast drives the choice of retriever. The comparison below sets the two side by side.
Dense Embedding Retrieval
Encodes queries and passages into dense vectors and retrieves by vector similarity, matching semantic meaning rather than exact words.
Pros
- Captures paraphrase and synonymy, finding relevant passages that share no keywords with the query
- Handles vague or conceptual queries where the user does not know the exact terminology
- Strong recall on semantically similar content across different phrasings
- Improves with better embedding models without changing the pipeline
Cons
- Can miss rare exact tokens like a ticker, CUSIP, or specific number that carry the real signal
- Requires building and serving a vector index, adding embedding cost and infrastructure
- Quality hinges on the embedding model and on chunking, both of which need tuning
- Out-of-domain or numeric-heavy text often embeds poorly, hurting precision on filings
Best for: conceptual and paraphrased queries, semantic recall over prose, and finding relevant passages that lack shared keywords
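As a rough illustration of ranking by vector similarity, the sketch below scores passages by cosine similarity to a query vector. The three-dimensional vectors, passage ids, and the `dense_rank` helper are all hypothetical and hand-written for the example; a real system would obtain the vectors from an embedding model and usually serve them from a vector index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_rank(query_vec, passage_vecs):
    """Return passage ids sorted by descending cosine similarity to the query."""
    scores = {pid: cosine(query_vec, vec) for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings, made up for illustration only.
passage_vecs = {
    "p1_sales_declined": [0.9, 0.1, 0.0],  # "Net sales declined year over year"
    "p2_new_ceo": [0.0, 0.2, 0.9],         # "The board appointed a new chief executive"
    "p3_senior_notes": [0.1, 0.9, 0.1],    # "The company issued senior notes"
}
query_vec = [0.8, 0.2, 0.1]  # stands in for the query "Did revenue drop?"

ranking = dense_rank(query_vec, passage_vecs)
print(ranking)
```

The query and the top passage share no keywords in their text; they rank together only because their (assumed) vectors point in a similar direction, which is the behavior the pros above describe.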
BM25 (Sparse Lexical) Retrieval
A bag-of-words ranking function scoring documents by exact term overlap, weighted by term frequency and inverse document frequency. The classic strong baseline.
Pros
- Excellent at exact-term matching: tickers, CUSIPs, line items, and specific numbers land precisely
- Cheap, fast, and requires no model training or vector infrastructure
- Transparent and debuggable, since you can see exactly which terms matched
- A genuinely strong baseline that dense methods do not always beat, especially on rare terms
Cons
- Blind to paraphrase and synonymy: a passage that means the same as the query but shares no words with it scores zero
- Sensitive to vocabulary mismatch between how users ask and how filings are written
- No notion of semantic relevance beyond surface term overlap
- Struggles with conceptual queries that do not contain the document's exact wording
Best for: exact identifiers and numbers, keyword-precise queries, and a cheap, transparent baseline for filing retrieval
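To make the scoring concrete, here is a minimal BM25 sketch in plain Python. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the Lucene-style IDF variant; the sample documents and queries are invented for illustration, and a production system would normally rely on a search engine's inverted index rather than scoring every document in a loop.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumption; real analyzers do more)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Invented sample passages.
docs = [
    "AAPL reported net sales of 94.9 billion",
    "the company saw revenue decline amid weak demand",
    "senior notes due 2031 were issued in march",
]

exact = bm25_scores("AAPL net sales", docs)
paraphrase = bm25_scores("falling turnover", docs)
print(exact)
print(paraphrase)
```

The first query hits only the passage that contains its exact terms. The second query is a paraphrase of the second passage, yet every score is zero because no term overlaps, which is the vocabulary-mismatch failure listed in the cons.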
Decision Table
See the tradeoffs side by side
| Criterion | Dense Embedding Retrieval | BM25 (Sparse Lexical) Retrieval |
|---|---|---|
| Matches on | Semantic meaning | Exact terms |
| Paraphrase and synonyms | Handled | Missed |
| Exact tickers and numbers | Can miss | Precise |
| Infrastructure cost | Higher, vector index | Lower, inverted index |
| Transparency | Opaque similarity | Visible term matches |
| Best on financial filings | Prose and concepts | Identifiers and figures |
Verdict
For financial retrieval the honest answer is rarely one or the other, because the two fail in complementary ways. BM25 is hard to beat at the things finance cares about most: exact tickers, CUSIPs, specific line items, and numbers. It is also cheap, fast, and debuggable, so it should almost always be in the stack. Dense embeddings recover what BM25 misses: the relevant passage phrased entirely differently from the query, which matters for conceptual questions over prose. The strong default is hybrid retrieval: run both and fuse the rankings, for example with reciprocal rank fusion, so that exact-term precision and semantic recall reinforce rather than compete. If you must pick one, BM25 is the safer baseline for filings, because losing a ticker or a number is usually worse than losing a paraphrase, but you give up real recall on conceptual queries. Whatever you choose, the retriever caps the whole RAG system's accuracy, so evaluate it on its own before blaming the generator.
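Reciprocal rank fusion can be sketched in a few lines. The version below assumes each retriever returns an ordered list of document ids and uses `k=60`, a commonly used smoothing constant; the document ids and the two input rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) from every list it appears in."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of a BM25 retriever and a dense retriever for one query.
bm25_ranking = ["doc_ticker", "doc_footnote", "doc_prose"]
dense_ranking = ["doc_prose", "doc_ticker", "doc_outlook"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the fusion uses only ranks, it does not need BM25 scores and cosine similarities to be on a comparable scale. Documents that both retrievers return rise to the top, while documents found by only one retriever still survive lower in the fused list.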