10-K Token Estimator: Cache Hit Rate as the Cost Driver

For a 10-K extraction with a 3-peer synthesis pass and a 2,000-token summary, the Financial Document Token Estimator puts the per-filing cost on Claude Sonnet 4.6 at $0.0584 (one pass) and $0.1436 (with synthesis) at a 60% cache hit rate. The same run at a 0% cache hit rate costs $0.0917 and $0.2769, increases of 57% and 93%. On Claude Opus 4.8 the costs are $0.0973 and $0.2393 at 60% cache, or $0.1529 and $0.4614 at zero cache. Cache hit rate is the dominant cost driver on this workload, not model choice. The decision rule that follows is keyed to cache hit rate, peer-set size, and per-filing budget.

TL;DR

For ten models, three cache regimes, and one workload (3-peer 10-K extraction with 2,000-token output), the engine returns:

Model	0% cache (synth)	60% cache (synth)	90% cache (synth)
Claude Haiku 4.5	$0.0923	$0.0479	$0.0256
Claude Sonnet 4.6	$0.2769	$0.1436	$0.0769
Claude Opus 4.1 (retired, prior generation)	$1.3843	$0.7178	$0.3845
Claude Opus 4.8	$0.4614	$0.2393	$0.1282
GPT-5.5	$0.4200	$0.3120	$0.2580
GPT-5.4 mini	$0.0630	$0.0468	$0.0387
Gemini 3.5 Flash	n/a	$0.0774	n/a
Gemini 2.5 Flash	n/a	$0.0169	n/a
Gemini 2.5 Flash-Lite	n/a	$0.0048	n/a
Gemini 2.5 Pro	n/a	$0.0695	n/a

Raising the cache hit rate from 0% to 90% cuts the Claude synthesis cost by 72% on both Sonnet 4.6 and Opus 4.8. The cross-vendor pattern matters: Gemini 2.5 Pro at 60% cache ($0.0695) undercuts Sonnet at 90% cache ($0.0769). The cost structure of the workload is therefore set by cache architecture first, then model tier, then vendor.

The workload anatomy

A 10-K extraction has three token components in this engine's archetype taxonomy.

The body. The engine's 10-K archetype represents ~20,571 input tokens for the filing body itself (under the engine's chars-per-token conversion). That is below most retail expectations of "a 10-K is huge", because the engine measures the useful extraction surface (MD&A + Risk Factors + Notes), not the cover-to-cover token count. A literal end-to-end 10-K can be 80,000–120,000 tokens; the engine prices the working extraction at 20–21k.

The peers. With peers = 3 the engine runs synthesis on three peer filings, producing a synthesisInputTokens of ~82,284 (= 4 × 20,571 for one focal + three peers). The synthesis cost is therefore 4× the one-pass cost in input-token terms; in dollar terms it is less than 4× (about 2.46× on the Anthropic models at 60% cache) because the 2,000-token output cost is the same on both paths.

The output. 2,000 tokens out per call. At Claude Sonnet 4.6 output rate ($15/Mtok) the output contributes 2000 × 15 / 1,000,000 = $0.030 per call, about half the one-pass cost at 60% cache. Output tokens are not amortized by cache; they are pure marginal cost.
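The three components can be sketched in a few lines. This is an illustration that assumes the archetype figures quoted above (20,571 body tokens, 2,000 output tokens, $15/Mtok Sonnet 4.6 output); the function names are hypothetical and are not the engine's API.

```python
BODY_TOKENS = 20_571        # 10-K archetype body, as quoted above
OUTPUT_TOKENS = 2_000       # summary length per call
SONNET_OUTPUT_RATE = 15.0   # USD per million output tokens, as quoted above


def synthesis_input_tokens(peers: int, body_tokens: int = BODY_TOKENS) -> int:
    """Input tokens for one focal filing plus `peers` peer filings."""
    if peers < 0:
        raise ValueError("peers must be non-negative")
    return (1 + peers) * body_tokens


def output_cost(output_tokens: int, rate_per_mtok: float) -> float:
    """Dollar cost of the output tokens; caching does not reduce this."""
    return output_tokens * rate_per_mtok / 1_000_000


if __name__ == "__main__":
    print(synthesis_input_tokens(3))
    print(round(output_cost(OUTPUT_TOKENS, SONNET_OUTPUT_RATE), 4))
```

With peers = 0 the input collapses to the one-pass body, which is why the same helper covers both paths.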

Why cache hit rate dominates model choice

Cache reads are 10× cheaper than full inputs on the Anthropic models (cacheReadMultiplier = 0.1) and 2× cheaper on the OpenAI models (cacheReadMultiplier = 0.5). For Anthropic models a 60% cache hit rate effectively re-prices the input portion of every call at 40% × 1.0 + 60% × 0.1 = 0.46, less than half the no-cache rate. The same arithmetic on OpenAI: 40% × 1.0 + 60% × 0.5 = 0.70, a 30% reduction. Gemini's cache pricing is more aggressive still on this engine's input-shape model.
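The re-pricing arithmetic above reduces to a single weighted average. The sketch below is an illustration, not the engine's code: the function name is hypothetical, and it assumes cache writes carry no premium, which the text does not specify.

```python
def effective_input_multiplier(cache_hit_rate: float, cache_read_multiplier: float) -> float:
    """Blend full-price and cache-read input pricing into one multiplier.

    Uncached tokens are billed at 1.0x the list input rate; cached tokens
    are billed at `cache_read_multiplier` times the list input rate.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return (1.0 - cache_hit_rate) * 1.0 + cache_hit_rate * cache_read_multiplier


if __name__ == "__main__":
    # Multipliers quoted in the text: 0.1 for Anthropic, 0.5 for OpenAI.
    for vendor, multiplier in (("anthropic", 0.1), ("openai", 0.5)):
        print(vendor, round(effective_input_multiplier(0.6, multiplier), 2))
```

Because the multiplier is linear in the hit rate, each additional point of cache hit rate saves the same fraction of the uncached input cost.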

For the canonical input, the consequence is that the four Anthropic models hold a fixed cost ratio across cache regimes. On the synthesis path in the TL;DR table, Sonnet at 0% cache ($0.2769) costs 60% as much as Opus 4.8 ($0.4614); at 90% cache Sonnet ($0.0769) still costs 60% as much as Opus 4.8 ($0.1282). The dollar gap shrinks at high cache (from about $0.18 to about $0.05 per filing) but the ratio is constant, which is exactly why the original thesis ("cache hit rate above 50% makes Opus cheaper than Sonnet on annual-cycle filing work") fails to materialize on these engine numbers. Opus 4.8 is consistently 1.67× the cost of Sonnet at every cache regime within this workload. The defensible thesis is the opposite: Opus only earns its premium when extraction-quality tolerance falls below ~96%, and the cost gap is durable across cache hit rates.
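The constant ratio can be checked with a short sketch. The per-Mtok rates below are assumptions inferred from the figures in this article ($3/$15 for Sonnet 4.6 and $5/$25 for Opus 4.8), not values read from the engine, and `synthesis_cost` is a hypothetical helper.

```python
BODY_TOKENS = 20_571
OUTPUT_TOKENS = 2_000
CACHE_READ_MULTIPLIER = 0.1  # Anthropic cache-read multiplier quoted in the text

# (input, output) USD per million tokens. Assumed: inferred from the article's figures.
RATES = {"sonnet": (3.0, 15.0), "opus": (5.0, 25.0)}


def synthesis_cost(model: str, cache_hit_rate: float, peers: int = 3) -> float:
    """Per-filing synthesis cost in USD under a single blended cache hit rate."""
    input_rate, output_rate = RATES[model]
    input_tokens = (1 + peers) * BODY_TOKENS
    multiplier = (1 - cache_hit_rate) + cache_hit_rate * CACHE_READ_MULTIPLIER
    return (input_tokens * input_rate * multiplier + OUTPUT_TOKENS * output_rate) / 1_000_000


if __name__ == "__main__":
    for hit_rate in (0.0, 0.6, 0.9):
        sonnet = synthesis_cost("sonnet", hit_rate)
        opus = synthesis_cost("opus", hit_rate)
        print(f"{hit_rate:.0%}: sonnet ${sonnet:.4f}, opus ${opus:.4f}, ratio {opus / sonnet:.2f}")
```

Under these assumed rates both the input and the output price scale by the same factor between the two tiers, so the cache term cancels out of the ratio.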

The synthesis multiplier

Synthesis costs about 2.46× the one-pass cost on every Anthropic model at 60% cache (Sonnet: $0.0584 → $0.1436 = 2.46×; Opus 4.8: $0.0973 → $0.2393 = 2.46×). On the OpenAI models it is 2.84× and on Gemini about 2.1×; the spread comes from the differing cache-read multipliers, not the model tier. The multiplier is near-constant within a vendor because synthesisInputTokens (82,284) is fixed by the workload and each vendor's models share one cache-read multiplier; it stays below the 4× input-token ratio because the 2,000-token output cost is paid once on either path.

For peer-set sizing: a 3-peer extraction's synthesis cost is about 2.5× the one-pass cost on Claude Sonnet at 60% cache. A 1-peer extraction (focal + 1 peer = 2 × 20,571 = 41,142 input tokens) would synthesize at about 1.5× the one-pass cost. A 5-peer extraction (6 × 20,571 = 123,426 input tokens) would synthesize at about 3.4× the one-pass cost. The marginal cost of an extra peer is non-trivial: adding a fourth peer to a 3-peer extraction adds about 20% to the synthesis cost on the canonical input.
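A rough way to explore peer-set sizing is to recompute the synthesis-to-one-pass ratio for different peer counts. The sketch assumes a linear model (each peer adds one archetype body of input, and output is paid once) and a Sonnet 4.6 input rate of $3/Mtok inferred from this article's figures; neither is confirmed engine behavior.

```python
BODY_TOKENS = 20_571
OUTPUT_TOKENS = 2_000
INPUT_RATE = 3.0             # USD per Mtok, assumed (inferred from the article's figures)
OUTPUT_RATE = 15.0           # USD per Mtok, quoted in the text
CACHE_READ_MULTIPLIER = 0.1  # Anthropic cache-read multiplier quoted in the text


def synthesis_to_one_pass_ratio(peers: int, cache_hit_rate: float = 0.6) -> float:
    """Synthesis cost divided by one-pass cost for a given peer count."""
    multiplier = (1 - cache_hit_rate) + cache_hit_rate * CACHE_READ_MULTIPLIER
    per_filing_input = BODY_TOKENS * INPUT_RATE * multiplier / 1_000_000
    output = OUTPUT_TOKENS * OUTPUT_RATE / 1_000_000
    one_pass = per_filing_input + output
    synthesis = (1 + peers) * per_filing_input + output
    return synthesis / one_pass


if __name__ == "__main__":
    for peers in (1, 3, 4, 5):
        print(peers, round(synthesis_to_one_pass_ratio(peers), 2))
```

Printing the ratio for a few peer counts makes it easy to compare the sizing figures against the engine's own output before committing to a peer universe.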

The annual-cycle budget

For a 250-ticker watchlist run quarterly (1,000 extractions per year):

Model / cache	One-pass annual ($)	Synthesis annual ($)
Sonnet @ 60%	58	144
Sonnet @ 90%	42	77
Opus 4.8 @ 60%	97	239
Gemini 2.5 Pro @ 60%	32	70

Under $100/year is the realistic budget for a solo retail extraction workflow with synthesis. Sonnet at 60% cache lands at $144 annual synthesis, over budget but within an order of magnitude. Gemini 2.5 Pro at 60% cache lands at $70, well within budget. Opus 4.8 lands at $239 — defensible only if the quality differential against Sonnet is measurable on the buyer's own documents.
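The annual figures are the per-filing synthesis costs scaled by 250 tickers × 4 quarters. The sketch below takes the per-filing numbers from the tables above as given; the helper name and the budget constant are illustrative.

```python
ANNUAL_BUDGET = 100.0  # USD per year, the budget discussed above

# Per-filing synthesis cost in USD, copied from the engine tables in this article.
SYNTHESIS_COST = {
    "Sonnet @ 60%": 0.1436,
    "Sonnet @ 90%": 0.0769,
    "Opus 4.8 @ 60%": 0.2393,
    "Gemini 2.5 Pro @ 60%": 0.0695,
}


def annual_cost(per_filing_cost: float, tickers: int = 250, runs_per_year: int = 4) -> float:
    """Yearly spend for a watchlist extracted `runs_per_year` times."""
    return per_filing_cost * tickers * runs_per_year


if __name__ == "__main__":
    for label, cost in SYNTHESIS_COST.items():
        total = annual_cost(cost)
        status = "within" if total <= ANNUAL_BUDGET else "over"
        print(f"{label}: ${total:.0f}/year ({status} budget)")
```

Changing `tickers` or `runs_per_year` rescales every row linearly, so the ranking of the configurations does not depend on watchlist size.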

For the 250-ticker scenario the Earnings Call Summarization Cost tool produces a parallel calculation specific to call transcripts; see /articles/earnings-call-summarization-cost-tickers/ for the cross-check.

The cache-architecture decision

Three cache regimes drive cost in this workload:

Per-filing prompt cache. The 5–10 prompts that ask different extractions from the same filing share the filing body in the prompt prefix. Anthropic prompt cache amortizes that prefix across calls if calls hit within the cache TTL. Hit rate 60–80% is typical for a same-day run.
Per-section system prompt cache. The instruction block (extraction schema, format requirements) is identical across thousands of filings. Hit rate >95% is typical because it never changes within a calendar quarter.
Per-peer body cache. Synthesis re-uses peer filings across many focal extractions. Hit rate depends on the peer set's stability; for a fixed sector universe it is high (80%+), for an ad-hoc peer pick it is low (10–30%).

The engine's single cache_hit_rate parameter is a blended estimate across all three; the architecture decision is what mix of the three the workflow can sustain. The Token Cost Optimizer lets the user separate cache-write and cache-read costs and produce a per-call effective cost from per-component hit rates.
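One plausible way to collapse the three regimes into a single cache_hit_rate is a token-weighted average. This is an assumption for illustration, since the text does not say how the blended figure should be derived; the 1,500-token system prompt and the per-component hit rates in the example are hypothetical.

```python
from typing import Iterable, Tuple


def blended_hit_rate(components: Iterable[Tuple[int, float]]) -> float:
    """Token-weighted average hit rate over (input_tokens, hit_rate) pairs."""
    pairs = list(components)
    total_tokens = sum(tokens for tokens, _ in pairs)
    if total_tokens <= 0:
        raise ValueError("components must contain a positive token count")
    return sum(tokens * rate for tokens, rate in pairs) / total_tokens


if __name__ == "__main__":
    components = [
        (20_571, 0.70),      # focal filing body, same-day run (hypothetical rate)
        (1_500, 0.95),       # system prompt block (hypothetical size and rate)
        (3 * 20_571, 0.80),  # three peer bodies, stable sector universe (hypothetical rate)
    ]
    print(round(blended_hit_rate(components), 3))
```

Because peer bodies dominate the token count on the synthesis path, the stability of the peer set moves the blended rate far more than the system prompt does.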

Where the estimator breaks

The engine assumes the extraction is a single call per filing-section pair. A research workflow with retry logic, structured-output validation failures, or thinking-token expansion can produce 1.3–2.5× the engine's reported cost. The retry path is the largest source of underestimation in production; structured-output failures on the Opus tier are rare but expensive.

The 20,571 input tokens for the 10-K archetype is a representative median. Real 10-Ks vary from 12,000 to 45,000 useful tokens depending on issuer and year. The estimator does not model this dispersion; reports should treat the engine output as the median expectation and budget for 2× on the worst quartile.

The engine also does not model batch-API discounts. For a 250-ticker quarterly extraction, Anthropic's 50% batch discount cuts the Sonnet 60% cache cost from $0.1436 to $0.0718 per filing. The Batch vs Realtime Cost Calculator handles that case directly.
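The three adjustments in this section (retry overhead, filing-size dispersion, batch discount) can be layered on top of the engine's per-filing figure. The sketch assumes the factors combine multiplicatively and independently, which is a simplification and not something the engine confirms; the function name is hypothetical.

```python
def adjusted_cost(
    engine_cost: float,
    retry_overhead: float = 1.0,
    size_factor: float = 1.0,
    batch_discount: float = 0.0,
) -> float:
    """Scale the engine's per-filing cost by production adjustments.

    retry_overhead: 1.3-2.5 covers the retry/validation range cited above.
    size_factor:    2.0 approximates the worst-quartile filing size.
    batch_discount: 0.5 is the batch discount cited above.
    """
    if retry_overhead < 1.0 or size_factor <= 0.0:
        raise ValueError("retry_overhead must be >= 1 and size_factor must be > 0")
    if not 0.0 <= batch_discount < 1.0:
        raise ValueError("batch_discount must be in [0, 1)")
    return engine_cost * retry_overhead * size_factor * (1.0 - batch_discount)


if __name__ == "__main__":
    base = 0.1436  # Sonnet synthesis cost at 60% cache, from the engine table
    print(round(adjusted_cost(base, batch_discount=0.5), 4))
    print(round(adjusted_cost(base, retry_overhead=1.3, size_factor=2.0), 4))
```

Treating the engine output as the base and the adjustments as explicit multipliers keeps the median estimate and the worst-case budget visibly separate.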

Connects to

LLM Prompt Patterns for 10-K and 8-K Extraction — three structured patterns the cost model assumes.
Reading Financial Filings with LLMs 2026 — the broader workflow this cost analysis fits into.
Prompt Caching Economics for Finance — deeper dive on the three cache regimes.
Financial Document Token Estimator — the engine.
Token Cost Optimizer — companion calculator for the cost-per-validated-trade view.
SEC Filing Chunk Optimizer — chunk-strategy cost feed for the same workload.

References

Anthropic. "Pricing." docs.anthropic.com, accessed 2026-05-21. https://www.anthropic.com/pricing
OpenAI. "API Pricing." platform.openai.com, accessed 2026-05-21. https://openai.com/api/pricing/
Anthropic. "Prompt caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed 2026-05-21.
Google. "Gemini API pricing." ai.google.dev, accessed 2026-05-21. https://ai.google.dev/pricing
SEC. "Financial Reporting Manual." sec.gov, accessed 2026-05-21. Reference for 10-K archetype section composition. https://www.sec.gov/divisions/corpfin/cffinancialreportingmanual.shtml

Verified engine output

Inputs
mode	archetype
archetype_id	10-k
output_tokens	2000
peers	3
cache_hit_rate	0.6

Result
(10 items)	[...]

Computed live at build time.

Frequently asked questions

Why is Opus 4.8 so much cheaper than Opus 4.1 in the engine?: Anthropic dropped the flagship rate across the 4.5-4.7 generation: the prior-generation Opus 4.1 sits at $15/$75 per Mtok (input/output) while the engine prices Opus 4.8 at $5/$25. The engine surfaces both rows so the user sees the generational price drop and can pick by context window.
Why does Gemini 2.5 Flash cost so little on this workload?: Its list rate is among the lowest and its cache pricing is aggressive. On the 3-peer synthesis path it lands at $0.0169 per filing; reliability is a separate question the engine does not measure.
Should I always use the synthesis cost number, not the one-pass cost?: Use synthesisCost when the workflow includes peer comparison (peers > 0); use onePassCost for a single-filing extraction. Most retail workflows that publish anything beyond a single ticker need synthesis.
How sensitive is the estimate to cache hit rate input?: Linearly sensitive on the cache-read portion. Each 10-point increase in cache hit rate lowers Anthropic input cost by about 9% of its uncached value; total cost falls by about 6% on the one-pass path because output tokens are unaffected by cache.
Does the engine model context-window overflow?: Yes, it reports fitsInContext per model. All ten pass for the 3-peer 10-K case; Haiku starts to overflow at peers ≥ 8 because its 200k context becomes the binding constraint.