TL;DR

Fine-tuning vs RAG vs long-context for SEC filings is a three-axis decision driven by corpus size, query cadence, and precision need. None of the three wins everywhere. For most retail operators (one analyst, 5 to 50 tickers, ad-hoc queries), long-context beats retrieval augmented generation below roughly 200 documents per query because the engineering cost of a retrieval layer exceeds the token savings. RAG dominates at scale (above 500 documents per query) or at high query cadence (above 100 queries per day) where the amortised indexing cost is trivial. Fine-tuning is the last resort and makes sense only for narrow, repeatable classifiers with more than 10,000 labeled examples. The common engineering-team mistake is defaulting to RAG because that is the default answer on every vendor blog. For a single analyst running ad-hoc questions against a watchlist, that default burns more engineering time than it saves in tokens.

The decision is not about quality; it is about three axes

Every public comparison frames the choice as "which approach is better," which is the wrong question. All three approaches can produce correct answers on a well-scoped SEC filing query. The real variables are corpus size (total documents the query has to cover), query cadence (how often a question runs), and precision need (exact numeric extraction versus fuzzy synthesis). A practitioner picking an architecture without pinning those three numbers will end up with an overbuilt pipeline.

Long-context: what it actually is

Long-context means pushing entire documents into a single model call without retrieval. Claude Sonnet 4.6 accepts 200K tokens of input, Gemini 2.5 Pro accepts 2M tokens, and GPT-5 accepts 400K tokens at published 2026-04 rates [1][2][3]. A 10-K typically runs 80K to 250K tokens after cleaning. That means one or two recent 10-Ks fit in a Sonnet call, ten to twenty fit in a Gemini 2.5 Pro call, and a full company filing history of eight annuals plus quarterlies sits comfortably inside a 2M-token window.
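
A quick sanity check on the fit, as a short calculation. The window sizes and the 150K-token mid-range filing are the figures quoted above; the 20 percent headroom reserved for the prompt and the answer is an assumption, not a vendor number.

# How many cleaned filings fit in each advertised context window,
# reserving headroom for instructions and output.
WINDOWS = {"Sonnet 4.6": 200_000, "GPT-5": 400_000, "Gemini 2.5 Pro": 2_000_000}
TOKENS_PER_FILING = 150_000   # mid-range cleaned 10-K
HEADROOM = 0.20               # assumed fraction reserved for prompt + output

for model, window in WINDOWS.items():
    usable = window * (1 - HEADROOM)
    print(f"{model}: about {int(usable // TOKENS_PER_FILING)} filings per call")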

What it costs

At Sonnet 4.6 published 2026-04 rates of $3 per 1M input tokens and $15 per 1M output tokens [1], a single query over a 200K-token corpus costs $0.60 input plus whatever the output tokens cost. A 2K-token structured JSON output at $15/1M is another $0.03. Total: roughly $0.63 per query. Gemini 2.5 Pro at $1.25 input and $10 output [3] lands closer to $0.27 per query for the same corpus. Anthropic prompt caching drops repeat reads of the same context to $0.30/1M (10x reduction), so the second identical-corpus query through cache costs roughly $0.09.
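
The same arithmetic as a few lines of Python, using only the rates and sizes quoted above.

# Per-query cost, USD per 1M tokens as quoted in the text.
SONNET_INPUT, SONNET_OUTPUT, SONNET_CACHE_READ = 3.00, 15.00, 0.30
GEMINI_INPUT, GEMINI_OUTPUT = 1.25, 10.00

CORPUS_TOKENS = 200_000   # context pushed into the call
OUTPUT_TOKENS = 2_000     # structured JSON answer

def query_cost(in_rate: float, out_rate: float) -> float:
    return (CORPUS_TOKENS * in_rate + OUTPUT_TOKENS * out_rate) / 1_000_000

print(f"Sonnet 4.6, cold:   ${query_cost(SONNET_INPUT, SONNET_OUTPUT):.2f}")       # ~$0.63
print(f"Sonnet 4.6, cached: ${query_cost(SONNET_CACHE_READ, SONNET_OUTPUT):.2f}")  # ~$0.09
print(f"Gemini 2.5 Pro:     ${query_cost(GEMINI_INPUT, GEMINI_OUTPUT):.2f}")       # ~$0.27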

When it wins

Ad-hoc queries over a small, bounded corpus where indexing engineering would cost more than the marginal tokens. A single analyst asking thirty questions per week across ten tickers with the full 10-K loaded pays maybe $20 per month. Standing up a vector database, chunking logic, an embedding pipeline, and a retrieval evaluation rig costs more engineering time in week one than a year of long-context fees.

When it fails

Above roughly 200 documents per query, long-context stops being viable even on 2M-token models. Needle-in-haystack accuracy also degrades as context fills past 60 to 70 percent of the advertised window for most models [4], so a 180K-token Sonnet call will miss fine-grained facts that a 40K-token call would nail. Long-context is also a bad fit for sub-second latency needs because prefill time scales with input length.

Retrieval augmented generation: what it actually is

RAG chunks documents into passages, embeds each passage into a vector, stores the vectors in a similarity-search index, retrieves the top-K most relevant passages at query time, and sends only those passages to the LLM. The original formulation comes from Lewis et al. 2020 [5]. Modern implementations range from BM25 sparse retrieval through dense embeddings to late-interaction rerankers like ColBERT [6].
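
A minimal sketch of the retrieve-then-read loop under deliberately naive assumptions: chunking is a fixed-size word split, embed() is a placeholder that returns random unit vectors rather than a real embedding call, and the filing path and query are hypothetical. The shape of the pipeline (index once, retrieve top-K per query, send only those chunks onward) is the point.

import numpy as np

def chunk(text: str, size: int = 400) -> list[str]:
    # Naive fixed-size split by words -- exactly the kind of chunker that
    # cuts tables and cross-references in half (see "When it fails" below).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(passages: list[str]) -> np.ndarray:
    # Placeholder for a real embedding model; returns random unit vectors
    # so the script runs end to end without external calls.
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(passages), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query: str, passages: list[str], index: np.ndarray, k: int = 20) -> list[str]:
    q = embed([query])[0]
    scores = index @ q                       # cosine similarity on unit vectors
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

# Index once (one-time cost), retrieve per query, prompt the LLM with top-K only.
passages = chunk(open("filings/aapl_10k_2025.txt").read())   # hypothetical path
index = embed(passages)
context = "\n\n".join(retrieve("What drove the change in gross margin?", passages, index))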

What it costs

RAG has two cost surfaces: one-time indexing and per-query retrieval. Indexing a 10-K at roughly 200K tokens with OpenAI text-embedding-3-small at $0.02/1M tokens [2] costs $0.004 per filing. A 500-filing corpus costs $2 to index, amortised across all future queries. Per query, retrieval returns 10 to 30 chunks of roughly 500 tokens each, so total input is 5K to 15K tokens. At Sonnet 4.6 rates that is $0.015 to $0.045 per query input plus output. The per-query token cost is 10x to 40x lower than long-context.

When it wins

Any of three regimes: corpus above 500 documents (long-context infeasible), query cadence above 100 per day (indexing amortises), or latency below two seconds required (short prefill). A hedge fund analyst team running continuous monitoring across 2,000 issuers crosses all three thresholds and has no real alternative to RAG.

When it fails

Precision extraction across document structure breaks when chunking splits a table or a cross-reference. The classic case: a 10-K risk factor that cross-references an MD&A segment disclosure. A naive chunker sends the risk factor to the retriever and misses the disclosure, producing a partial answer with high confidence. Chunk-boundary engineering is the hidden tax on RAG pipelines. Evaluation is also harder: recall@K, mean reciprocal rank, and end-to-end answer accuracy are three different metrics and pipelines routinely optimise the wrong one.
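
The metric confusion is easy to make concrete. A sketch of recall@K and mean reciprocal rank over a labeled eval set; the chunk IDs and relevance judgments are invented, and note that neither metric measures end-to-end answer accuracy, which needs generated answers graded against gold answers.

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Fraction of the human-marked relevant chunks that appear in the top K.
    return len(relevant & set(retrieved[:k])) / len(relevant) if relevant else 0.0

def reciprocal_rank(relevant: set[str], retrieved: list[str]) -> float:
    # 1/rank of the first relevant chunk; 0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical eval set: per query, what a human marked relevant versus what
# the retriever returned, in rank order.
eval_set = [
    {"relevant": {"risk_07", "mdna_12"}, "retrieved": ["risk_07", "risk_02", "item1_03"]},
    {"relevant": {"mdna_04"},            "retrieved": ["item7a_01", "mdna_04", "risk_11"]},
]

k = 3
print(f"recall@{k}: {sum(recall_at_k(e['relevant'], e['retrieved'], k) for e in eval_set) / len(eval_set):.2f}")
print(f"MRR:       {sum(reciprocal_rank(e['relevant'], e['retrieved']) for e in eval_set) / len(eval_set):.2f}")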

Fine-tuning: what it actually is

Fine-tuning updates the model weights on a task-specific dataset. For SEC filings the realistic target is a narrow classifier (section type, risk category, sentiment), not open-ended question answering. Base models already handle general filing comprehension well, so fine-tuning is about specialising to a repeatable format or closed-vocabulary output.

What it costs

OpenAI fine-tuning at published 2026-04 rates charges training tokens plus an inference premium over the base model [2]. Training a 10,000-example dataset at ~2K tokens each runs 20M training tokens. At the fine-tuning training rate (roughly $25/1M for GPT-4.1-mini class models [2]) that is $500 one-off. Inference on the tuned model sits at 1.5x to 2x the base model rate. A tuned extractor running 10K queries per month at 5K input / 500 output tokens costs roughly $75 to $120 per month at tuned GPT-4.1-mini rates, versus $40 to $60 on the base model.
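
The arithmetic, with the assumptions separated out. The training rate and the 1.5x to 2x inference premium are the figures above; the base-model monthly range is the estimate from this paragraph, not something derived from a price sheet here.

EXAMPLES = 10_000
TOKENS_PER_EXAMPLE = 2_000
TRAIN_RATE = 25.00            # USD per 1M training tokens (GPT-4.1-mini class, per the text)

train_tokens = EXAMPLES * TOKENS_PER_EXAMPLE
print(f"one-off training: {train_tokens / 1e6:.0f}M tokens -> ${train_tokens * TRAIN_RATE / 1e6:.0f}")

# Monthly inference: tuned model priced as a premium over the base model.
base_monthly = (40, 60)       # prompted base model at 10K queries/mo, per the estimate above
for premium in (1.5, 2.0):
    lo, hi = (round(c * premium) for c in base_monthly)
    print(f"tuned at {premium}x premium: ${lo}-${hi} per month")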

When it wins

Closed-vocabulary classification where the label set is stable and labeled data exists. The canonical example: a 12-bucket MD&A risk classifier trained on 15,000 labeled MD&A sections. At inference time a fine-tuned small model beats a prompted frontier model on per-token cost and latency while matching accuracy because the task is narrow.

When it fails

Everything else. Data drift destroys tuned models because filings from 2020 read differently from filings from 2026 (supply-chain language dominated 2021 to 2022, AI disclosures dominate 2025 to 2026). Regime shifts require retraining. Labeling is expensive and inconsistent across human labelers. For any task with a moving vocabulary, RAG over primary sources or long-context over the full document beats a tuned model that silently memorised 2023 language.

Six decision scenarios

Each row pins corpus size, cadence, and output type. The winner column names the architecture that minimises total cost (tokens plus engineering) under the listed constraints.

#  | Corpus size     | Cadence    | Output type           | Winner                  | One-line reason
1  | 1 filing        | Ad-hoc     | Synthesis             | Long-context            | One 10-K fits in one Sonnet call; any retrieval layer is pure overhead.
2  | 10 filings      | Ad-hoc     | Extraction            | Long-context            | 10 filings fit in Gemini 2.5 Pro; retrieval adds engineering, not accuracy.
3  | 50 filings      | Weekly     | Synthesis             | Long-context + caching  | Fits a 2M-token Gemini window; prompt cache makes repeat queries cheap.
4  | 200 filings     | Daily      | Mixed                 | Either; RAG edges ahead | Cost crossover sits near here; RAG wins if cadence grows.
5  | 2,000 filings   | Hourly     | Extraction            | RAG                     | Long-context infeasible; indexing amortises across thousands of queries.
6  | 10,000 sections | Continuous | Closed classification | Fine-tuning             | Narrow labels, stable vocabulary, >10K examples: tuned small model wins on cost and latency.

Scenario 1 is the single-analyst-reading-a-10-K workflow. Scenario 2 covers watchlist monitoring across a sector basket. Scenario 3 is a weekly competitive teardown. Scenario 4 is the crossover and the exact regime where architectural choice matters most. Scenario 5 is continuous monitoring for a research team. Scenario 6 is the one place fine-tuning has survived the 2024 to 2026 frontier-model sweep.

Runnable cost comparison

The snippet below computes monthly cost for long-context vs RAG across a range of corpus sizes at published 2026-04 Sonnet 4.6 rates. It uses input tokens only for clarity; add output cost on top for a complete picture.

import pandas as pd

# Published 2026-04 rates, USD per 1M tokens
SONNET_INPUT = 3.00
SONNET_CACHE_READ = 0.30
EMBEDDING_INPUT = 0.02  # text-embedding-3-small

TOKENS_PER_FILING = 150_000
CHUNK_TOKENS = 500
TOP_K_CHUNKS = 20

def longcontext_monthly_cost(n_filings: int, queries_per_month: int,
                             cache_hit_rate: float = 0.0) -> float:
    tokens_per_query = n_filings * TOKENS_PER_FILING
    fresh = tokens_per_query * (1 - cache_hit_rate)
    cached = tokens_per_query * cache_hit_rate
    per_query_cost = (fresh * SONNET_INPUT + cached * SONNET_CACHE_READ) / 1_000_000
    return per_query_cost * queries_per_month

def rag_monthly_cost(n_filings: int, queries_per_month: int) -> float:
    index_cost = (n_filings * TOKENS_PER_FILING * EMBEDDING_INPUT) / 1_000_000
    per_query_tokens = TOP_K_CHUNKS * CHUNK_TOKENS
    per_query_cost = (per_query_tokens * SONNET_INPUT) / 1_000_000
    return index_cost + per_query_cost * queries_per_month

rows = []
for n in [1, 10, 50, 200, 500, 2000]:
    for q in [10, 100, 1000]:
        rows.append({
            "filings": n, "queries/mo": q,
            "long_context": round(longcontext_monthly_cost(n, q), 2),
            "long_context_cached": round(longcontext_monthly_cost(n, q, 0.8), 2),
            "rag": round(rag_monthly_cost(n, q), 2),
        })

print(pd.DataFrame(rows).to_string(index=False))

The output shows RAG's token cost lower at every corpus size, with the gap widening as corpus and cadence grow. What it does not show is the engineering cost of the retrieval layer, which is what actually places the practical crossover near 200 filings for low-cadence workloads; the next section works that through.

The RAG vs long-context crossover

Solve for the corpus size N where long-context and RAG produce equal monthly input cost at a fixed query cadence Q. Using the numbers above:

Long-context cost per month = N x 150,000 x $3 / 1,000,000 x Q = $0.45 x N x Q.

RAG cost per month = N x 150,000 x $0.02 / 1,000,000 (indexing) + 20 x 500 x $3 / 1,000,000 x Q = $0.003 x N + $0.03 x Q.

Setting them equal and solving for N at Q = 30 queries per month: $0.45 x N x 30 = $0.003 x N + $0.03 x 30, which gives N roughly 0.07. Read literally, that says RAG's token cost wins for any corpus bigger than a fraction of a filing: indexing is trivially cheap, so the per-query token count does all the work. At Q = 30, long-context pays $13.50 x N per month while RAG pays $0.90 flat plus trivial indexing, so for any N of 1 or more long-context is more expensive per query. The absolute gap only starts to matter once N x Q pushes monthly spend past a threshold the analyst cares about.

The practical crossover is set by engineering cost, not token cost. A RAG pipeline with chunking, embeddings, vector store, retrieval evaluation, and reranking takes a competent engineer two to four weeks to build correctly. At a fully loaded cost of $150 per hour that is $12,000 to $24,000 in engineering. Long-context at Sonnet rates burns through that budget only when spend passes roughly $1,000 per month (a one-year payback on the low-end build), which lands at roughly 200 filings per query at ad-hoc cadence (ten or so queries per month) or roughly 75 filings per query at daily cadence. Below that line, long-context is the rational choice even when RAG has lower per-query cost.

Prompt caching moves the crossover further in favour of long-context. At an 80 percent cache hit rate on a fixed corpus, the effective Sonnet input rate drops from $3 to roughly $0.84 per 1M tokens (about a 3.6x reduction), which pushes the engineering-payback threshold out to roughly 700 to 800 filings per query at ad-hoc cadence.
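
The payback arithmetic in one place, as a sketch. The rates and filing size are the ones used throughout; the $12,000 budget, one-year payback window, and ten-queries-per-month definition of ad-hoc are the assumptions from the paragraphs above.

SONNET_INPUT, SONNET_CACHE_READ = 3.00, 0.30   # USD per 1M tokens
TOKENS_PER_FILING = 150_000
ENG_BUDGET = 12_000            # low end of the 2-4 week RAG build
PAYBACK_MONTHS = 12
QUERIES_PER_MONTH = 10         # "ad-hoc" cadence, per the text

monthly_budget = ENG_BUDGET / PAYBACK_MONTHS
for hit_rate in (0.0, 0.8):
    eff_rate = (1 - hit_rate) * SONNET_INPUT + hit_rate * SONNET_CACHE_READ
    per_filing_per_query = TOKENS_PER_FILING * eff_rate / 1_000_000
    crossover = monthly_budget / (per_filing_per_query * QUERIES_PER_MONTH)
    print(f"cache hit {hit_rate:.0%}: effective ${eff_rate:.2f}/1M, "
          f"crossover ~{crossover:.0f} filings per query")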

Fine-tuning sober-up

Fine-tuning rarely beats RAG for SEC filings because finance has the worst possible profile for tuned weights: moving vocabulary, structured cross-references, and regime-dependent language. A model tuned on 2022 10-Ks will under-recognise AI governance language that dominates 2025 to 2026 filings [7]. Retraining every six months is technically feasible and organisationally painful, and the labeling burden to keep 10,000 examples current is a recurring cost most teams underestimate.

The common failure mode: a team tunes a model on their 2023 corpus, ships a classifier, sees 92 percent accuracy in validation, then watches accuracy drop to 73 percent on 2026 filings as new sections (climate disclosures, AI risk) appear that were rare or absent in training. The fix is always RAG over primary sources or long-context over the full document, not another round of tuning.

Where fine-tuning does win

The MD&A 12-bucket risk classifier is the concrete counterexample. A team with 15,000 labeled MD&A sections across 12 stable risk categories (operational, financial, regulatory, supply chain, cyber, litigation, environmental, macro, competitive, talent, technology, governance) can tune a GPT-4.1-mini class model and hit 89 percent accuracy at roughly 40 percent of the per-call cost of a prompted frontier model. Because the label set is closed and stable, drift is bounded: new risk language still maps to one of the 12 buckets. The economics work because the query volume is high (every MD&A section on every new filing) and the task is narrow.
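
For concreteness, one labeled record in the chat-format JSONL that OpenAI-style fine-tuning consumes (check the current fine-tuning docs for the exact schema); the excerpt and its label are invented, and the label set is the 12 buckets above.

import json

LABELS = ["operational", "financial", "regulatory", "supply chain", "cyber",
          "litigation", "environmental", "macro", "competitive", "talent",
          "technology", "governance"]

# One training record; the JSONL file holds one of these per line,
# 15,000 lines for the corpus described above.
record = {
    "messages": [
        {"role": "system",
         "content": "Classify the MD&A excerpt into exactly one of: " + ", ".join(LABELS)},
        {"role": "user",
         "content": "A prolonged disruption at our sole contract manufacturer "
                    "could materially delay product shipments and increase unit costs."},
        {"role": "assistant", "content": "supply chain"},
    ]
}
print(json.dumps(record))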

The decision heuristic: fine-tune only when the label set is closed, the corpus is stable, labeled data exceeds 10,000 examples, and query volume exceeds 10,000 per month. If any one of those fails, RAG or long-context is the better answer.

Practical recipe

A 4-step decision tree for a practitioner picking an architecture; a sketch encoding the same thresholds as a function follows the list:

  1. Is the corpus under 200 filings and queries ad-hoc? Use long-context. Skip the retrieval layer. Pay the token cost. The engineering savings dwarf the marginal token spend. Add prompt caching the moment the same corpus gets queried twice.

  2. Is the corpus above 500 filings or cadence above 100 per day? Use RAG. Start with hybrid retrieval (BM25 plus dense embeddings) and a reranker. Budget two to four weeks for the initial pipeline and another two weeks for the evaluation rig. Measure recall@K on a labeled eval set rather than end-to-end answer quality alone; a high-accuracy system with bad recall is silently wrong in ways the end-to-end metric hides.

  3. Is the corpus between 200 and 500 filings or cadence between 10 and 100 per day? This is the ambiguous zone. Default to long-context with caching if the workload is stable; move to RAG when the corpus grows past 500 or cadence past 100. Do not build RAG pre-emptively.

  4. Is the task a closed-vocabulary classifier with 10,000+ labeled examples? Fine-tune a small model. Every other task: do not fine-tune in 2026. The frontier models prompted with a few examples will match or beat a tuned model on open-ended tasks and save the retraining burden.
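
A minimal encoding of the four steps above, assuming ad-hoc means fewer than 10 queries per day (the boundary step 3 uses); the thresholds are the ones in the list and nothing else.

def pick_architecture(n_filings: int, queries_per_day: float,
                      closed_vocab: bool = False, labeled_examples: int = 0) -> str:
    if closed_vocab and labeled_examples >= 10_000:
        return "fine-tune a small model"                                   # step 4
    if n_filings < 200 and queries_per_day < 10:
        return "long-context; add prompt caching on repeat corpora"        # step 1
    if n_filings > 500 or queries_per_day > 100:
        return "RAG: hybrid retrieval + reranker, plus an eval rig"        # step 2
    return "ambiguous zone: long-context with caching until corpus or cadence grows"  # step 3

print(pick_architecture(50, 1))        # watchlist analyst
print(pick_architecture(2_000, 500))   # continuous monitoring
print(pick_architecture(300, 20))      # the crossover regime
print(pick_architecture(10_000, 50, closed_vocab=True, labeled_examples=15_000))  # MD&A classifier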

The step-one default matters most. The common mistake is engineering teams building RAG pipelines for 50-filing workloads because RAG is the default answer on vendor blogs. For a single analyst running thirty queries per week, long-context plus prompt caching costs under $30 per month and requires zero infrastructure. That is the correct answer until the workload grows past step two.

One more reality check is worth running before shipping any architecture: measure the distribution of query types, not the peak. Teams tend to size their pipeline for the single hardest query they can imagine (a cross-filing comparison across fifty issuers) and then discover 95 percent of actual queries are single-filing extractions that long-context handles trivially. Sizing for the median query and falling back to a wider context or a retrieval pass for the tail produces a simpler, cheaper system than pre-building a RAG pipeline for a workload that has not materialised yet. The same applies in reverse: a team that scoped for single-filing questions and then gets hit with a weekly cross-sector teardown will feel the pain first in latency, second in token spend, and only last in accuracy.

References

  • Anthropic. (2026). "Prompt caching documentation." docs.anthropic.com/en/docs/build-with-claude/prompt-caching. Primary source for cache-read pricing and TTL behaviour.
  • Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP 2020. Dense retrieval foundations cited in most production RAG stacks.
  • Gao, L., Ma, X., Lin, J., & Callan, J. (2023). "Precise Zero-Shot Dense Retrieval without Relevance Labels." ACL 2023. HyDE and related retrieval-refinement techniques.

Footnotes

  1. Anthropic. (2026). "Claude API pricing." docs.anthropic.com/en/docs/about-claude/pricing. Accessed 2026-04-23.

  2. OpenAI. (2026). "API pricing." openai.com/api/pricing. Accessed 2026-04-23.

  3. Google. (2026). "Gemini API pricing." ai.google.dev/pricing. Accessed 2026-04-23.

  4. Beltagy, I., Peters, M. E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. Foundational long-context architecture paper; subsequent needle-in-haystack evaluations trace their methodology here.

  5. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020, pp. 9459-9474.

  6. Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR 2020, pp. 39-48.

  7. SEC. (2026). "Filing trends: climate, cyber, and AI disclosures 2020 to 2026." sec.gov/edgar. Aggregate filing frequency shows AI-related risk factor language increasing roughly 14x between 2022 and 2026 filings.