The short answer

Prompt caching cuts the input cost of a finance LLM agent, and only the input cost. On an output-heavy research loop (40k input, 2k output, 6 calls per idea, 20 ideas a day), the Token Cost Optimizer takes Claude Opus 4.7 from $990.00/month at 0% cache to $348.48/month at 90% cache [1], a 65% cut. But Gemini 3.5 Flash on the identical workload, with no cache benefit applied, costs $308.88/month [2]. Opus cannot beat it even at 90% cache, because caching discounts input and Opus's output rate ($25/Mtok) is what dominates. Claude Sonnet 4.6 does cross under Gemini 3.5 Flash, but only above roughly a 70% cache hit rate. That crossover is the real ROI question, and the answer is a cache-rate threshold, not a yes/no.

TL;DR

Model                Cache hit rate       Cost / month
Claude Opus 4.7      0%                   $990.00
Claude Opus 4.7      50%                  $633.60
Claude Opus 4.7      90%                  $348.48
Claude Sonnet 4.6    0%                   $594.00
Claude Sonnet 4.6    70%                  $294.62
Gemini 3.5 Flash     no cache benefit     $308.88

Same workload for every row: 40,000 input + 2,000 output tokens per call, 6 calls per idea, a 10% retry rate, 20 ideas a day, and a 0.5 validation rate. The Token Cost Optimizer applies cache pricing to Anthropic input only (reads at 0.1x base input); Google input is priced at full list rate regardless of cache, which is why the Gemini row is cache-invariant in this engine.
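The table rows can be reproduced with a few lines of arithmetic. The sketch below is not the Token Cost Optimizer's code; it is a minimal model that assumes a fraction of input equal to the cache hit rate is billed at the cache-read rate, that cache-write charges are ignored, and that a month is 30 days. The names `Rates` and `monthly_cost` are illustrative.

```python
from dataclasses import dataclass

MTOK = 1_000_000
DAYS_PER_MONTH = 30  # assumption: monthly cost = daily cost x 30


@dataclass(frozen=True)
class Rates:
    input_usd: float       # USD per million input tokens
    output_usd: float      # USD per million output tokens
    cache_read_usd: float  # USD per million cache-read tokens


def monthly_cost(rates, hit_rate, input_tokens=40_000, output_tokens=2_000,
                 calls_per_idea=6, retry_rate=0.1, ideas_per_day=20):
    """Monthly USD cost when hit_rate of the input is billed at the cache-read rate."""
    blended_input = hit_rate * rates.cache_read_usd + (1 - hit_rate) * rates.input_usd
    per_call = (input_tokens * blended_input + output_tokens * rates.output_usd) / MTOK
    calls_per_month = calls_per_idea * (1 + retry_rate) * ideas_per_day * DAYS_PER_MONTH
    return per_call * calls_per_month


OPUS = Rates(input_usd=5.0, output_usd=25.0, cache_read_usd=0.5)
SONNET = Rates(input_usd=3.0, output_usd=15.0, cache_read_usd=0.3)
# No cache benefit applied: cache reads are priced at the full list input rate.
GEMINI = Rates(input_usd=1.5, output_usd=9.0, cache_read_usd=1.5)

rows = [
    ("Claude Opus 4.7", OPUS, 0.0),
    ("Claude Opus 4.7", OPUS, 0.5),
    ("Claude Opus 4.7", OPUS, 0.9),
    ("Claude Sonnet 4.6", SONNET, 0.0),
    ("Claude Sonnet 4.6", SONNET, 0.7),
    ("Gemini 3.5 Flash", GEMINI, 0.0),
]
for name, rates, hit in rows:
    print(f"{name:18s} {hit:4.0%}  ${monthly_cost(rates, hit):,.2f}")
```

The validation rate is left out on purpose: it changes cost per validated trade, not the monthly bill.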

What "caching ROI" actually means

The naive question is "does caching save money." The answer is always yes when input is repeated, so it is the wrong question. The decision question is: at what cache hit rate does the cached premium model become cheaper than the uncached budget alternative you would otherwise use? That is a break-even on the cache hit rate, and for a given pair of price lists it depends on the workload's output ratio.

The agent loop here is deliberately output-heavy relative to a filing-extraction job: 2,000 output tokens against 40,000 input, a 1:20 ratio. A multi-step research agent that reasons, calls tools, and writes intermediate analysis generates a lot of output per call. That is exactly the regime where caching helps least, because caching never touches the output bill.

Opus 4.7 cannot cache its way under Gemini 3.5 Flash

Run the numbers. Claude Opus 4.7 starts at $990.00/month with no cache. Caching the 40k input prefix drives it down: $633.60 at 50%, $562.32 at 60%, $491.04 at 70%, $419.76 at 80%, $348.48 at 90%. The floor at 90% cache is $348.48.

Gemini 3.5 Flash on the same loop costs $308.88/month. Opus 4.7 does not get there at any cache rate priced here. Even a 90% cache hit leaves Opus 13% more expensive than uncached Gemini 3.5 Flash, because caching cannot reach the output bill: the $25/Mtok output rate on 2,000 tokens times six calls times every idea comes to $198.00/month on its own, and the input that remains after a 90% cache hit adds another $150.48. Caching closes an input gap. It does not close an output gap.
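To see which part of the bill caching can move, it helps to split the monthly cost into an input line and an output line. This is a sketch under the same assumptions as the figures above (cache reads at 0.1x base input, cache writes ignored, 30-day month); `monthly_split` is an illustrative name, not part of any tool.

```python
MTOK = 1_000_000
# calls per idea x (1 + retry rate) x ideas per day x 30 days
CALLS_PER_MONTH = 6 * 1.1 * 20 * 30


def monthly_split(input_usd, cache_read_usd, output_usd, hit_rate,
                  input_tokens=40_000, output_tokens=2_000):
    """Return (input_bill, output_bill) in USD per month."""
    blended_input = hit_rate * cache_read_usd + (1 - hit_rate) * input_usd
    input_bill = input_tokens * blended_input / MTOK * CALLS_PER_MONTH
    output_bill = output_tokens * output_usd / MTOK * CALLS_PER_MONTH
    return input_bill, output_bill


opus_in, opus_out = monthly_split(5.0, 0.5, 25.0, hit_rate=0.9)
gemini_in, gemini_out = monthly_split(1.5, 1.5, 9.0, hit_rate=0.0)

print(f"Opus 4.7 @ 90% cache: input ${opus_in:.2f} + output ${opus_out:.2f}")
print(f"Gemini 3.5 Flash:     input ${gemini_in:.2f} + output ${gemini_out:.2f}")
```

The Opus output line is the same at every cache hit rate, which is the part of the bill caching leaves untouched.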

This is the finding worth carrying: if you are choosing Opus on an output-heavy loop and justifying it with "we will cache aggressively," the cache math does not rescue the choice against a cheaper frontier model. Caching is a reason to keep Opus where you already want Opus for its reasoning, not a reason to pick it over a cheaper model on cost.

Sonnet 4.6 does cross over, around 70% cache

Claude Sonnet 4.6 tells the opposite story. It starts at $594.00/month uncached, already below Opus, and its lower $15/Mtok output rate means caching can actually carry it under the Gemini line. At a 70% cache hit rate Sonnet 4.6 lands at $294.62/month, under Gemini 3.5 Flash's $308.88. Below roughly 70% cache, Gemini 3.5 Flash is cheaper; above it, cached Sonnet wins.

So the break-even is real and model-specific: for Sonnet 4.6 on this loop the cache hit rate has to clear about 70% before the Anthropic-plus-caching path beats the uncached Gemini path. For Opus 4.7 there is no break-even within the 0% to 90% range priced here: at the listed rates the crossover would need a cache hit rate of roughly 96%, beyond the 90% best case modeled above.
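Because the per-call cost falls linearly with the cache hit rate in this pricing model, the break-even can be solved directly instead of read off a table. The function below is a sketch that assumes cache reads are the only cache effect on price; `break_even_hit_rate` is a hypothetical helper name. A result above the highest hit rate you can realistically sustain means there is no practical crossover.

```python
MTOK = 1_000_000


def break_even_hit_rate(input_usd, cache_read_usd, output_usd, target_per_call,
                        input_tokens=40_000, output_tokens=2_000):
    """Cache hit rate at which the per-call cost equals target_per_call.

    Returns None when caching gives no input discount, so no hit rate helps.
    """
    uncached = (input_tokens * input_usd + output_tokens * output_usd) / MTOK
    saving_at_full_hit = input_tokens * (input_usd - cache_read_usd) / MTOK
    if saving_at_full_hit <= 0:
        return None
    return (uncached - target_per_call) / saving_at_full_hit


# Uncached Gemini 3.5 Flash per-call cost on the same 40k/2k workload.
gemini_per_call = (40_000 * 1.5 + 2_000 * 9.0) / MTOK

sonnet_threshold = break_even_hit_rate(3.0, 0.3, 15.0, gemini_per_call)
opus_threshold = break_even_hit_rate(5.0, 0.5, 25.0, gemini_per_call)

print(f"Sonnet 4.6 break-even hit rate: {sonnet_threshold:.1%}")
print(f"Opus 4.7 break-even hit rate:   {opus_threshold:.1%}")
```

Retries, calls per idea, and ideas per day scale both sides equally, so the threshold can be computed per call.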

When the break-even flips in caching's favor

The crossover moves toward caching as the input-to-output ratio rises. On a filing-extraction job (130k input, 6k output, roughly a 1:22 output-to-input ratio, slightly more input-heavy than this agent loop's 1:20 and with more than three times the input per call), a larger share of the bill is cacheable input, so a given cache hit rate buys a bigger absolute saving. The lesson is not "caching is weak"; it is "caching's ROI is a function of how input-dominated the workload is." Map your workload's output ratio first; that ratio, not the cache rate alone, decides whether caching changes the model ranking.
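A quick way to map a workload is to compute the share of the uncached per-call cost that is input, and the per-call saving a given hit rate would buy. The sketch below uses the Opus 4.7 rates from this article for both workloads and assumes cache reads at 0.1x base input; `cacheable_profile` is an illustrative name.

```python
MTOK = 1_000_000


def cacheable_profile(input_tokens, output_tokens, hit_rate,
                      input_usd=5.0, cache_read_usd=0.5, output_usd=25.0):
    """Return (input share of uncached per-call cost, per-call USD saving at hit_rate)."""
    input_cost = input_tokens * input_usd / MTOK
    output_cost = output_tokens * output_usd / MTOK
    input_share = input_cost / (input_cost + output_cost)
    saving = hit_rate * input_tokens * (input_usd - cache_read_usd) / MTOK
    return input_share, saving


agent_share, agent_saving = cacheable_profile(40_000, 2_000, hit_rate=0.9)
filing_share, filing_saving = cacheable_profile(130_000, 6_000, hit_rate=0.9)

print(f"Agent loop:        input share {agent_share:.1%}, saving ${agent_saving:.4f}/call")
print(f"Filing extraction: input share {filing_share:.1%}, saving ${filing_saving:.4f}/call")
```

Running it on your own token counts and rates shows how much of each call caching can reach before any cache hit rate is measured.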

Decision guidance

  1. Compute your output ratio before you bank on caching. Output-dominated loops (agents, long syntheses) get little from caching; input-dominated jobs (full-context extraction, long-document QA) get a lot.
  2. Use caching to keep the model you want, not to justify a pricier one. If Opus earns its place on reasoning, caching makes it cheaper to run. It will not make it cheaper than a budget frontier model on an output-heavy loop.
  3. Find the crossover cache rate, not just the saving. The number that matters is the cache hit rate at which your premium-plus-cache path undercuts your budget alternative. For Sonnet 4.6 here it is about 70%; measure yours.
  4. Measure real cache hit rate in production. Anthropic returns cache_read_input_tokens and cache_creation_input_tokens per response; aggregate over 24 hours to get the true blended rate, then plug it into the Token Cost Optimizer.
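The measurement in step 4 can be sketched as a small aggregation over logged usage records. This assumes each response's usage has been logged as a plain dict, and that total input is the sum of `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` (on the understanding that `input_tokens` counts only uncached input); verify that against the current API documentation. `blended_cache_hit_rate` and the sample records are hypothetical.

```python
def blended_cache_hit_rate(usage_records):
    """Share of all input tokens that were served from cache reads."""
    fields = ("input_tokens", "cache_creation_input_tokens", "cache_read_input_tokens")
    totals = dict.fromkeys(fields, 0)
    for usage in usage_records:
        for field in fields:
            # Fields may be missing or None on some responses; count them as zero.
            totals[field] += usage.get(field) or 0
    all_input = sum(totals.values())
    if all_input == 0:
        return 0.0
    return totals["cache_read_input_tokens"] / all_input


# Hypothetical log: the first call writes the 39.5k prefix, the next two read it.
sample_usage = [
    {"input_tokens": 500, "cache_creation_input_tokens": 39_500, "cache_read_input_tokens": 0},
    {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 39_500},
    {"input_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 39_500},
]
rate = blended_cache_hit_rate(sample_usage)
print(f"Blended cache hit rate: {rate:.1%}")
```

Summing tokens before dividing weights large calls more heavily than small ones, which is what a cost estimate needs; averaging per-call rates would not.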

References

Footnotes

  1. Anthropic. "Pricing." platform.claude.com, verified 2026-05-26. https://platform.claude.com/docs/en/about-claude/pricing

  2. Google. "Gemini Developer API pricing." ai.google.dev, verified 2026-05-26. https://ai.google.dev/gemini-api/docs/pricing

Verified engine output

Agent loop — Claude Opus 4.7 at 0% cache
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: claude-opus-4-7
  cache_hit_rate: 0
Result
  model › id: claude-opus-4-7
  model › provider: anthropic
  model › name: Claude Opus 4.7
  model › input usd per mtoken: 5
  model › output usd per mtoken: 25
  model › cache write usd per mtoken: 6.25
  model › cache read usd per mtoken: 0.5
  model › context window: 1000000
  model › notes: Flagship reasoning model — 1M context.
  effective cost per call: 0.25
  cost per idea: 1.6500000000000001
  cost per validated trade: 3.3000000000000003
  cost per day: 33
  cost per month: 990
  cost per year: 12045

Computed live at build time.

Agent loop — Claude Opus 4.7 at 50% cache
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: claude-opus-4-7
  cache_hit_rate: 0.5
Result
  model › id: claude-opus-4-7
  model › provider: anthropic
  model › name: Claude Opus 4.7
  model › input usd per mtoken: 5
  model › output usd per mtoken: 25
  model › cache write usd per mtoken: 6.25
  model › cache read usd per mtoken: 0.5
  model › context window: 1000000
  model › notes: Flagship reasoning model — 1M context.
  effective cost per call: 0.16
  cost per idea: 1.056
  cost per validated trade: 2.112
  cost per day: 21.12
  cost per month: 633.6
  cost per year: 7708.8

Computed live at build time.

Agent loop — Claude Opus 4.7 at 90% cache (floor)
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: claude-opus-4-7
  cache_hit_rate: 0.9
Result
  model › id: claude-opus-4-7
  model › provider: anthropic
  model › name: Claude Opus 4.7
  model › input usd per mtoken: 5
  model › output usd per mtoken: 25
  model › cache write usd per mtoken: 6.25
  model › cache read usd per mtoken: 0.5
  model › context window: 1000000
  model › notes: Flagship reasoning model — 1M context.
  effective cost per call: 0.088
  cost per idea: 0.5808
  cost per validated trade: 1.1616
  cost per day: 11.616
  cost per month: 348.48
  cost per year: 4239.84

Computed live at build time.

Agent loop — Claude Sonnet 4.6 at 0% cache
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: claude-sonnet-4-6
  cache_hit_rate: 0
Result
  model › id: claude-sonnet-4-6
  model › provider: anthropic
  model › name: Claude Sonnet 4.6
  model › input usd per mtoken: 3
  model › output usd per mtoken: 15
  model › cache write usd per mtoken: 3.75
  model › cache read usd per mtoken: 0.3
  model › context window: 500000
  model › notes: Best price/performance for bulk research loops.
  effective cost per call: 0.15
  cost per idea: 0.99
  cost per validated trade: 1.98
  cost per day: 19.8
  cost per month: 594
  cost per year: 7227

Computed live at build time.

Agent loop — Claude Sonnet 4.6 at 70% cache (crossover)
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: claude-sonnet-4-6
  cache_hit_rate: 0.7
Result
  model › id: claude-sonnet-4-6
  model › provider: anthropic
  model › name: Claude Sonnet 4.6
  model › input usd per mtoken: 3
  model › output usd per mtoken: 15
  model › cache write usd per mtoken: 3.75
  model › cache read usd per mtoken: 0.3
  model › context window: 500000
  model › notes: Best price/performance for bulk research loops.
  effective cost per call: 0.0744
  cost per idea: 0.49104
  cost per validated trade: 0.98208
  cost per day: 9.8208
  cost per month: 294.624
  cost per year: 3584.592

Computed live at build time.

Agent loop — Gemini 3.5 Flash (no cache benefit applied; the budget-frontier line to beat)
Inputs
  input_tokens_per_call: 40000
  output_tokens_per_call: 2000
  calls_per_idea: 6
  retry_rate: 0.1
  ideas_per_day: 20
  validation_rate: 0.5
  model_id: gemini-3-5-flash
  cache_hit_rate: 0
Result
  model › id: gemini-3-5-flash
  model › provider: google
  model › name: Gemini 3.5 Flash
  model › input usd per mtoken: 1.5
  model › output usd per mtoken: 9
  model › context window: 1000000
  model › notes: Frontier agent-tier at Flash speed — not a budget model (output ~3.6x Gemini 2.5 Flash).
  effective cost per call: 0.078
  cost per idea: 0.5148
  cost per validated trade: 1.0296
  cost per day: 10.296000000000001
  cost per month: 308.88000000000005
  cost per year: 3758.0400000000004

Computed live at build time.

Frequently asked questions

Does prompt caching make Claude Opus 4.7 cheaper than Gemini 3.5 Flash for finance agents?
No, not on an output-heavy loop. The Token Cost Optimizer takes Opus 4.7 from $990.00/month at 0% cache to $348.48/month at 90% cache, but Gemini 3.5 Flash uncached costs $308.88/month on the same workload. Caching discounts input only; Opus's $25/Mtok output rate is what keeps it above the Gemini line.
What cache hit rate does Claude Sonnet 4.6 need to beat Gemini 3.5 Flash on cost?
About 70%. Sonnet 4.6 starts at $594.00/month uncached and falls to $294.62/month at a 70% cache hit rate, just under Gemini 3.5 Flash's $308.88/month. Below roughly 70% cache, Gemini is cheaper; above it, cached Sonnet wins.
Why does caching help input-heavy workloads more than agent loops?
Caching discounts repeated input tokens (Anthropic reads at 0.1x base input) and never touches output. An agent loop generates a lot of output per call, so a large part of its bill is uncacheable. A full-context extraction job is input-dominated, so a given cache hit rate buys a bigger absolute saving there.
How do I measure my real prompt-cache hit rate?
Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens on every response. Aggregate them over 24 hours and compute reads divided by total input tokens to get the blended hit rate, then enter that rate in the Token Cost Optimizer to price your actual workload.
Is prompt caching ever a reason to choose a more expensive model?
Caching is a reason to run a model you already want more cheaply, not a reason to pick a pricier model over a budget one on cost. On output-heavy loops, even a near-perfect cache hit will not bring a premium model under a budget frontier model, because caching cannot reach the output bill.