Claude Sonnet 4.6 prices cache reads at $0.30 per million tokens against a base input rate of $3.00, a 90% discount on cache hits. GPT-5.5's token-level cache is also a 90% discount, dropping its $5.00/M base input to $0.50/M cached. For the same workload and prompt structure, the cost cross-over depends on the idle gap between calls: Anthropic's 5-minute TTL forces cache rewrites on workloads with idle windows over 5 minutes, while OpenAI's token-level cache survives longer. The Token Cost Optimizer was run on 18k input + 1.2k output tokens per call, 5 calls per idea, 40 ideas/day, 18% validation rate, 65% cache hit rate, and 10% retry rate. On Sonnet 4.6 the effective cost per call is $0.040, per idea $0.222, per validated trade $1.235, and per year $3,245.

TL;DR

  • Claude Sonnet 4.6: 5-minute TTL cache at $0.30/M cached read (90% off base $3.00).
  • GPT-5.5: token-level cache at 90% off a $5/M base input rate → $0.50/M cached.
  • On cost, Sonnet wins at matching cache states: its base ($3/M) and cached ($0.30/M) rates both sit below GPT-5.5's base ($5/M) and cached ($0.50/M). The one exception is an idle gap long enough to cool Sonnet's cache while GPT-5.5's stays warm. The gap is now ~1.67x on the rates, not the wide margin the older $10/M GPT-5 generation showed.
  • Token Cost Optimizer on Sonnet 4.6 at 18k/1.2k, 5 calls/idea, 40 ideas/day, 65% hit rate, 10% retry: $0.040/call, $0.222/idea, $1.235/validated trade, $3,245/year. The engine prices the same workload on GPT-5.5 at $831.60/month against Sonnet's $266.71, because the optimiser models the cache discount on the Anthropic path only.
  • Cache hit rate is a major cost lever; the ~1.67x base-rate gap between the two providers is the smaller, durable one at typical retail volumes.

The two cache mechanics

Anthropic and OpenAI both publish caching as a cost-reduction lever, but the mechanics differ [1][2]:

| Provider | Discount | Lifetime / TTL | Granularity |
| --- | --- | --- | --- |
| Anthropic Claude Sonnet 4.6 | 90% off base input (cached read at $0.30/M vs $3.00/M base) | 5 minutes from last access | Prefix-cached: stable system prompt + early user turns |
| OpenAI GPT-5.5 | 90% off input rate (cached at $0.50/M vs $5.00/M base) | Indefinite while warm (typically hours) | Token-level (model-managed) |

Both apply only to input tokens. Output tokens are charged at full rate by both providers. Cache writes on Anthropic carry a 25% premium on the first write ($3.75/M vs $3.00/M base). A single cache read within the TTL already recovers that premium (it saves $2.70/M against a $0.75/M surcharge), and with 4+ reads within 5 minutes the premium becomes a small share of the total.
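The amortisation arithmetic can be sketched in a few lines. This is a minimal illustration using only the three Anthropic rates quoted above; the function names are hypothetical and not part of any provider SDK.

```python
# Rates in USD per million input tokens, taken from the table above.
BASE = 3.00         # uncached input
CACHE_WRITE = 3.75  # first write, 25% premium over base
CACHE_READ = 0.30   # cache hit within the TTL


def cached_cost(reads: int) -> float:
    """Cost per million prefix tokens for one write followed by `reads` cache hits."""
    return CACHE_WRITE + reads * CACHE_READ


def uncached_cost(reads: int) -> float:
    """Cost of sending the same prefix (reads + 1) times with caching disabled."""
    return (reads + 1) * BASE


for reads in (0, 1, 4):
    saving = uncached_cost(reads) - cached_cost(reads)
    print(f"{reads} reads: cached ${cached_cost(reads):.2f}, "
          f"uncached ${uncached_cost(reads):.2f}, saving ${saving:.2f}")
```

With zero reads the write is a pure surcharge; from the first read onward caching comes out ahead under these rates.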

The scenario

A research loop runs extraction passes over annual filings with 18k tokens of stable system prompt + retrieved filing context per call, 1.2k tokens of structured output per call, 5 calls per investment idea, 40 ideas per day, and 18% of ideas passing validation. The cache hit rate is 65% and the retry rate is 10%.

The Token Cost Optimizer returns:

| Metric | Value |
| --- | --- |
| Effective cost per call | $0.040 |
| Cost per idea (5 calls) | $0.222 |
| Cost per validated trade | $1.235 |
| Cost per day (40 ideas) | $8.89 |
| Cost per month | $266.71 |
| Cost per year | $3,244.92 |
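The figures above can be reproduced with a short calculation. The sketch below is not the optimiser's source code: the formula is inferred from the published inputs and outputs, and it assumes that cache misses pay the base input rate (no write premium), that retries scale the call count by 1 + retry rate, and that a month is 30 days and a year 365. The `Model` class and `workload_cost` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    input_usd_per_mtoken: float
    output_usd_per_mtoken: float
    cache_read_usd_per_mtoken: Optional[float] = None  # None: no cached rate modelled


def workload_cost(model, input_tokens=18_000, output_tokens=1_200, calls_per_idea=5,
                  retry_rate=0.10, ideas_per_day=40, validation_rate=0.18,
                  cache_hit_rate=0.65):
    # Assumption: hits pay the cached-read rate, misses pay the base input rate.
    if model.cache_read_usd_per_mtoken is None:
        input_rate = model.input_usd_per_mtoken
    else:
        input_rate = (cache_hit_rate * model.cache_read_usd_per_mtoken
                      + (1 - cache_hit_rate) * model.input_usd_per_mtoken)
    per_call = (input_tokens * input_rate
                + output_tokens * model.output_usd_per_mtoken) / 1_000_000
    per_idea = per_call * calls_per_idea * (1 + retry_rate)  # retries add calls
    per_day = per_idea * ideas_per_day
    return {
        "per_call": per_call,
        "per_idea": per_idea,
        "per_validated_trade": per_idea / validation_rate,
        "per_day": per_day,
        "per_month": per_day * 30,   # assumption: 30-day month
        "per_year": per_day * 365,   # assumption: 365-day year
    }


sonnet = workload_cost(Model(3.00, 15.00, cache_read_usd_per_mtoken=0.30))
gpt = workload_cost(Model(5.00, 30.00))  # priced at the uncached base rate

for name, result in (("Sonnet 4.6", sonnet), ("GPT-5.5", gpt)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Leaving `cache_read_usd_per_mtoken` unset mirrors how the engine prices GPT-5.5 at its uncached base rate, which is where the $831.60/month figure comes from.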

Where Claude wins

Workloads with tight call clustering — 5+ calls per idea executed within 5 minutes — get full Anthropic cache benefit. The 90% read discount drops the effective input cost from $3.00/M to $0.30/M plus the small write premium amortised across the cached reads.

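Example: a research idea that takes 5 sequential LLM calls in a 90-second window. The system prompt + retrieval context is written to the cache on call 1 ($3.75/M) and read 4 more times within the 5-minute TTL ($0.30/M each). For that idea in isolation the blended input rate is (3.75 + 4 × 0.30) / 5 = $0.99/M. The roughly $0.40/M figure including the write premium applies only when the same prefix stays warm across consecutive ideas, at roughly 34 reads per write.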

The same workload on GPT-5.5: token-level cache gives 90% off the input rate. With GPT-5.5's $5/M base input rate, the effective cached cost is $0.50/M. Anthropic's cached path (roughly $0.40/M including the write premium) still edges it on this shape, though the gap is narrow now that both providers cache at a 90% discount.
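How far the write premium dilutes depends on how many warm reads follow each cold call. The sketch below averages one cold call and a number of warm calls for each provider. It assumes a cold GPT-5.5 call pays the $5/M base rate with no write surcharge; `cluster_rate` is a hypothetical helper.

```python
def cluster_rate(cold_rate: float, warm_rate: float, reads: int) -> float:
    """Average USD per million prefix tokens over one cold call plus `reads` warm calls."""
    return (cold_rate + reads * warm_rate) / (reads + 1)


# Sonnet 4.6: the cold call pays the $3.75/M write rate, warm calls pay $0.30/M.
# GPT-5.5 (assumption): the cold call pays the $5/M base rate, warm calls pay $0.50/M.
for reads in (4, 9, 34):
    sonnet = cluster_rate(3.75, 0.30, reads)
    gpt = cluster_rate(5.00, 0.50, reads)
    print(f"{reads} reads per cold call: Sonnet ${sonnet:.2f}/M, GPT-5.5 ${gpt:.2f}/M")
```

The average only approaches the cached-read rate as the number of reads per write grows; under these assumptions a figure near $0.40/M on Sonnet corresponds to a few dozen reads per write.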

Where GPT-5.5's cache survival matters (but not on cost)

The headline mechanic favours GPT-5.5 in one dimension only: cache survival. Workloads with idle windows longer than 5 minutes lose the Anthropic cache, and each cache-cold call pays the $3.75/M write rate (base input plus the 25% premium) on first access. GPT-5.5's token-level cache survives longer: practical hit rates of 60-80% on stable prefixes across hour-long windows are common, where Anthropic's prefix cache has gone cold.

But survival is not the same as cost. On the verified pricing table GPT-5.5's base input rate is $5/M [3], about 1.67x Sonnet 4.6's $3/M [4]. With GPT-5.5's 90% cache discount the cached input rate is $0.50/M, still above Sonnet's cached $0.30/M, and its cache-cold $5/M sits above Sonnet's cache-cold $3/M. So at matching cache states GPT-5.5 does not win on per-token input cost; what it buys is fewer cache misses, not cheaper tokens, and the margin is far tighter than the older $10/M GPT-5 generation implied.

Example: a research loop that processes 1 idea every 30 minutes. Each idea's 5 calls span 1 minute, so within-idea caching works on both providers. The next idea, 30 minutes later, has lost the Anthropic cache and pays a write again. GPT-5.5's cache likely survives, and a surviving $0.50/M cached token is cheaper than a Sonnet token only when Sonnet is cold ($3/M); against a warm Sonnet cache ($0.30/M), Sonnet stays ahead.

The cross-over rule

Because GPT-5.5's base rate is ~1.67x Sonnet's, the cost-minimising answer on this workload is Sonnet 4.6 across essentially every idle-gap regime, though the margins are narrower than the prior GPT-5 generation showed:

  • Idle gap < 5 minutes (always cache-warm): Sonnet wins, effective input ~$0.40/M vs GPT-5.5's $0.50/M cached.
  • Idle gap 5-30 minutes (Anthropic cache misses, OpenAI cache hits): closer now. Sonnet's cache-cold $3/M base is above GPT-5.5's $0.50/M cached rate, so a warm GPT-5.5 cache can undercut a cold Sonnet here; Sonnet regains the edge whenever its cache is warm. This is the one regime where GPT-5.5's longer cache survival can flip the cost answer.
  • Idle gap > 30 minutes (both miss): compare base rates. Sonnet at $3/M is under GPT-5.5's $5/M. Sonnet wins.
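The three regimes can be written as a small lookup. This is a sketch under stated assumptions: it uses the raw base and cached-read rates and ignores the Anthropic write premium, and the 30-minute limit for GPT-5.5 is the regime boundary used in this article, not a published OpenAI TTL. All names are hypothetical.

```python
SONNET_TTL_MIN = 5        # Anthropic TTL quoted in the article
GPT_WARM_LIMIT_MIN = 30   # assumption: the article's regime boundary, not a documented TTL

# USD per million input tokens for a warm (cached) and cold (uncached) prefix.
RATES = {
    "sonnet-4.6": {"warm": 0.30, "cold": 3.00},
    "gpt-5.5": {"warm": 0.50, "cold": 5.00},
}


def input_rate(provider: str, idle_gap_min: float) -> float:
    """Input rate for the stable prefix on the first call after an idle gap."""
    limit = SONNET_TTL_MIN if provider == "sonnet-4.6" else GPT_WARM_LIMIT_MIN
    state = "warm" if idle_gap_min < limit else "cold"
    return RATES[provider][state]


def cheaper_provider(idle_gap_min: float) -> str:
    """Provider with the lower input rate for the given idle gap."""
    return min(RATES, key=lambda provider: input_rate(provider, idle_gap_min))


for gap in (2, 15, 60):
    print(f"idle gap {gap} min -> {cheaper_provider(gap)}")
```

Only the middle regime flips the answer, which is why measuring the real idle-gap distribution matters more than the headline discount.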

GPT-5.5 enters the conversation on capability, cache survival, and a now-competitive cached rate. For a pure cost optimisation on this finance research loop, Sonnet 4.6 is still the answer when its cache stays warm; the decision tightens once idle gaps exceed the 5-minute TTL.

What the engine doesn't model

The Token Cost Optimizer prices on:

  • Input tokens × input rate.
  • Output tokens × output rate.
  • Cache hit rate × cached-read rate.

It does not model:

  1. Capability gap. Sonnet 4.6 and GPT-5.5 are not identical on finance tasks. Run an eval before optimising for cost (see Model Selection Framework for Finance).
  2. Cache eviction. Anthropic's 5-minute TTL is a hard cliff; OpenAI's eviction is model-managed and probabilistic. The engine assumes the configured hit rate; real-world rates may differ.
  3. Multi-region pricing. Some workloads run in EU regions at different rates.
  4. Long-context tier pricing. Anthropic's 1M-token Sonnet tier and OpenAI's 400k GPT-5.5 are priced separately.

The retail accounting

For the scenario above ($3,245/year), the strategy needs to generate $5-10k/year in net P&L to justify the LLM line alone. At 40 ideas/day × 18% validation × 1 trade per validated idea × 365 days = ~2,600 trades/year, that translates to a per-trade target of roughly $2-4 in expected profit after slippage and fees.

That sets a minimum bar on the strategy's edge per trade — under $1 expected profit per trade, the LLM cost line dominates the strategy economics. The Cost-Per-Validated-Trade Framework walks the full accounting.
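The per-trade arithmetic is easy to check. The sketch below assumes the loop runs every calendar day (365 days), which is what the annual cost figure implies; the variable and function names are illustrative.

```python
IDEAS_PER_DAY = 40
VALIDATION_RATE = 0.18
DAYS_PER_YEAR = 365            # assumption: the loop runs every calendar day
LLM_COST_PER_YEAR = 3244.92    # Sonnet 4.6 scenario above, in USD

trades_per_year = IDEAS_PER_DAY * VALIDATION_RATE * DAYS_PER_YEAR
llm_cost_per_trade = LLM_COST_PER_YEAR / trades_per_year


def required_profit_per_trade(target_net_pnl: float) -> float:
    """Expected profit per trade needed to reach a yearly net P&L target."""
    return target_net_pnl / trades_per_year


print(f"trades per year: {trades_per_year:.0f}")
print(f"LLM cost per trade: ${llm_cost_per_trade:.2f}")
for target in (5_000, 10_000):
    print(f"${target:,} target -> ${required_profit_per_trade(target):.2f} per trade")
```

Comparing `llm_cost_per_trade` with the strategy's measured edge per trade shows directly whether the LLM line dominates.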

What about Anthropic's 1M-context tier?

For workloads that benefit from single-prompt large-context analysis (multi-filing peer comparison without retrieval), the Sonnet 1M tier prices input at $6.00/M and output at $22.50/M: 2x the base Sonnet input rate and 1.5x the base output rate ($15/M). The cache mechanics are the same.

GPT-5.5 ships a 400k context window. For multi-filing analysis up to 200-400k tokens of context, GPT-5.5's $5/M base now sits just under the Sonnet 1M tier's $6/M, so input cost no longer decides it on its own; the choice turns on whether the workload needs GPT-5.5's specific capability and whether 400k context is enough versus the Sonnet 1M tier or Gemini Pro's 2M.

The right choice depends on whether the workload benefits from the larger context window. See Finetune vs RAG vs Long Context for Filings for the design-pattern comparison.

Provider-level reliability factors

Cost is one input; reliability is another. Anthropic's prompt-cache performance has been stable through 2025-2026 with no documented cache-eviction regressions in the public incident history. OpenAI's token-level cache, being model-managed, occasionally surprises users — a workload that achieved 70% hit rate one week may drop to 50% after a model version bump, without any deployer-side change. The Fallback Chain Simulator handles the reliability side; the cost optimiser does not.

For workloads that need predictable cache economics, Anthropic's explicit 5-minute TTL is easier to reason about than OpenAI's opaque token-level mechanic. The trade-off is the harder TTL cliff. For workloads that need maximum cache survival across long idle periods, OpenAI wins on practical hit rate but at the cost of cache-state opacity.

A worked monthly comparison

A research loop running 8 hours per day, with calls clustered into 3 batches per day of ~45 minutes each:

  • Within-batch: 45 minutes of repeated calls. Both providers' caches stay warm. Anthropic's 5-minute TTL renews on each access.
  • Across-batch gap: 1.5-3 hours between batches. Anthropic cache cold; OpenAI cache likely warm.
  • Across-day gap: 16 hours. Both caches cold.

Blended input cost under this pattern:

  • Anthropic Sonnet 4.6: ~55% hit rate (within-batch warm, across-batch always cold). Effective input cost: ~$1.50/M tokens (blend of $3/M cold and $0.30/M cached).
  • OpenAI GPT-5.5: ~70% hit rate (within-batch warm, across-batch mostly warm, across-day cold). Effective input cost: ~$1.85/M tokens (blend of $5/M cold and $0.50/M cached).

For this access pattern Sonnet still wins on blended cost, but only by roughly 1.2x — its lower base and cached rates keep it ahead, while GPT-5.5's higher hit rate nearly closes the gap. The earlier $10/M GPT-5 generation lost this comparison by ~4x; the refreshed $5/M GPT-5.5 makes it a near-tie. The exact numbers depend on the batch geometry; the optimiser computes the Sonnet side with cache applied, and prices GPT-5.5 at its uncached base rate (the engine models the cached-read discount on the Anthropic path only).
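The blend is a weighted average of the cached and cold rates. A minimal sketch, using the hit rates and prices quoted in this section and ignoring the Anthropic write premium (as the $3/M cold figure does); the helper names are hypothetical.

```python
def blended_rate(hit_rate: float, cached: float, base: float) -> float:
    """Effective USD per million input tokens at a given cache hit rate."""
    return hit_rate * cached + (1 - hit_rate) * base


def break_even_hit_rate(target: float, cached: float, base: float) -> float:
    """Hit rate at which a provider's blended rate equals `target`."""
    return (base - target) / (base - cached)


sonnet = blended_rate(0.55, cached=0.30, base=3.00)
gpt = blended_rate(0.70, cached=0.50, base=5.00)

print(f"Sonnet 4.6: ${sonnet:.3f}/M, GPT-5.5: ${gpt:.3f}/M, ratio {gpt / sonnet:.2f}x")
print(f"GPT-5.5 hit rate needed to match Sonnet: "
      f"{break_even_hit_rate(sonnet, cached=0.50, base=5.00):.1%}")
```

Under these assumptions GPT-5.5 would need a hit rate in the high 70s to tie Sonnet at 55%, which shows how sensitive the comparison is to batch geometry.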

Failure modes

  • Quoting headline cache discount as if it always applies. Cache hit rate is workload-specific. Measure it; don't assume.
  • Ignoring capability differences. The cheaper option that fails on extraction fidelity costs more in downstream errors than the saving.
  • Treating cache write premiums as ignorable. On workloads with low cache reuse, where many writes are never read back within the TTL, the 25% write premium can make caching a net cost.
  • Comparing on cache discount percentage instead of absolute rate. GPT-5.5 and Anthropic both discount cached reads by 90%, but the discount applies to different bases ($5/M vs $3/M); the effective cached rates are $0.50/M vs $0.30/M, so the absolute rate, not the headline percentage, is what decides cost.

References

Footnotes

  1. Anthropic (2024). "Prompt caching for the Claude API." docs.anthropic.com

  2. OpenAI (2024). "Token-level prompt caching guide." platform.openai.com

  3. OpenAI Pricing (2026). openai.com/api/pricing

  4. Anthropic Pricing (2026). anthropic.com/pricing

Verified engine output

Claude Sonnet 4.6: 18k input + 1.2k output, 5 calls/idea, 40 ideas/day, 65% cache hit, 18% validation, 10% retry
Inputs
input_tokens_per_call: 18000
output_tokens_per_call: 1200
calls_per_idea: 5
retry_rate: 0.1
ideas_per_day: 40
validation_rate: 0.18
cache_hit_rate: 0.65
Result
model › id: claude-sonnet-4-6
model › provider: anthropic
model › name: Claude Sonnet 4.6
model › input usd per mtoken: 3
model › output usd per mtoken: 15
model › cache write usd per mtoken: 3.75
model › cache read usd per mtoken: 0.3
model › context window: 500000
model › notes: Best price/performance for bulk research loops.
effective cost per call: 0.04041
cost per idea: 0.222255
cost per validated trade: 1.23475
cost per day: 8.8902
cost per month: 266.706
cost per year: 3244.923

Computed live at build time.

GPT-5.5 on the same workload (base input rate $5/M, no cached-read rate modelled in the engine table)
Inputs
model_id: gpt-5
input_tokens_per_call: 18000
output_tokens_per_call: 1200
calls_per_idea: 5
retry_rate: 0.1
ideas_per_day: 40
validation_rate: 0.18
cache_hit_rate: 0.65
Result
model › id: gpt-5
model › provider: openai
model › name: GPT-5.5
model › input usd per mtoken: 5
model › output usd per mtoken: 30
model › context window: 400000
model › notes: OpenAI frontier model (GPT-5.5).
effective cost per call: 0.126
cost per idea: 0.693
cost per validated trade: 3.85
cost per day: 27.72
cost per month: 831.6
cost per year: 10117.8

Computed live at build time.

Frequently asked questions

What hit rate should I assume in cost planning?
For research loops with stable system prompts: 50-75% typical, 80%+ achievable with careful prompt design. Lower (20-40%) for ad-hoc analytical workloads where prompts vary per call.

Does the Anthropic 5-minute TTL renew with each access?
Yes. Anthropic's TTL is from last access: every read pushes the expiry 5 minutes forward. Workloads with frequent accesses stay warm indefinitely.

Can I cache across both providers?
No. Each provider's cache is independent. A workload that runs on Anthropic and falls back to OpenAI doesn't share cache state. The fallback path always pays cache-cold rates on the first call.