The short answer
Prompt caching cuts a finance LLM agent's input cost only. On an output-heavy loop, the Token Cost Optimizer takes Claude Opus 4.7 from $990.00 to $348.48/month at 90% cache, but Gemini 3.5 Flash uncached costs $308.88. Opus does not win on cost at any cache rate up to 90%; caching closes an input gap, not an output gap.
Prompt caching cuts the input cost of a finance LLM agent, and only the input cost. On an output-heavy research loop (40k input, 2k output, 6 calls per idea, 20 ideas a day), the Token Cost Optimizer takes Claude Opus 4.7 from $990.00/month at 0% cache to $348.48/month at 90% cache [1], a 65% cut. But Gemini 3.5 Flash on the identical workload, with no cache benefit applied, costs $308.88/month [2]. Opus cannot beat it even at 90% cache, because caching discounts input only and, once the input is cached, Opus's $25/Mtok output rate accounts for most of what remains. Claude Sonnet 4.6 does cross under Gemini 3.5 Flash, but only at roughly a 70% cache hit rate or above. That crossover is the real ROI question, and the answer is a cache-rate threshold, not a yes/no.
TL;DR
| Model | Cache hit rate | Cost / month |
|---|---|---|
| Claude Opus 4.7 | 0% | $990.00 |
| Claude Opus 4.7 | 50% | $633.60 |
| Claude Opus 4.7 | 90% | $348.48 |
| Claude Sonnet 4.6 | 0% | $594.00 |
| Claude Sonnet 4.6 | 70% | $294.62 |
| Gemini 3.5 Flash | no cache benefit | $308.88 |
Same workload for every row: 40,000 input + 2,000 output tokens per call, 6 calls per idea, a 10% retry rate, 20 ideas a day, and a 0.5 validation rate. The Token Cost Optimizer applies cache pricing to Anthropic input only (reads at 0.1x base input); Google input is priced at full list rate regardless of cache, which is why the Gemini row is cache-invariant in this engine.
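The table can be reproduced with a few lines of arithmetic. The sketch below is an assumption about how the engine prices a call, inferred from the published rows rather than taken from its source: cached input is billed at the cache-read rate, the remainder at the base input rate, retries scale the call count by one plus the retry rate, and a month is 30 days. The function and parameter names are illustrative.

```python
def monthly_cost(input_rate, output_rate, cache_read_rate=None, cache_hit_rate=0.0,
                 input_tokens=40_000, output_tokens=2_000, calls_per_idea=6,
                 retry_rate=0.10, ideas_per_day=20, days_per_month=30):
    """Monthly cost in USD; all rates are USD per million tokens."""
    if cache_read_rate is None:
        # No cache pricing modeled for this provider: bill every input token at list rate.
        cache_hit_rate, cache_read_rate = 0.0, input_rate
    # Assumed linear blend: misses at the base rate, hits at the cache-read rate.
    blended_input_rate = (1 - cache_hit_rate) * input_rate + cache_hit_rate * cache_read_rate
    cost_per_call = (input_tokens * blended_input_rate + output_tokens * output_rate) / 1_000_000
    calls_per_day = calls_per_idea * (1 + retry_rate) * ideas_per_day
    return cost_per_call * calls_per_day * days_per_month


# (model, input rate, output rate, cache-read rate, cache hit rate)
# Rates are the ones listed under "Verified engine output".
SCENARIOS = [
    ("Claude Opus 4.7", 5.0, 25.0, 0.5, 0.0),
    ("Claude Opus 4.7", 5.0, 25.0, 0.5, 0.5),
    ("Claude Opus 4.7", 5.0, 25.0, 0.5, 0.9),
    ("Claude Sonnet 4.6", 3.0, 15.0, 0.3, 0.0),
    ("Claude Sonnet 4.6", 3.0, 15.0, 0.3, 0.7),
    ("Gemini 3.5 Flash", 1.5, 9.0, None, 0.0),
]

for name, input_rate, output_rate, read_rate, hit_rate in SCENARIOS:
    usd = monthly_cost(input_rate, output_rate, read_rate, hit_rate)
    print(f"{name:<18} {hit_rate:>4.0%}  ${usd:,.2f}")
```

Note that this blend does not charge the cache-write rate on misses, which matches the published figures; a model that did would come out somewhat higher at every non-zero hit rate.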
What "caching ROI" actually means
The naive question is "does caching save money?" The answer is always yes when input is repeated, so it is the wrong question. The decision question is: at what cache hit rate does the cached premium model become cheaper than the uncached budget alternative I would otherwise use? That is a break-even on the cache hit rate, and for a given pair of price lists it depends on the workload's output ratio.
The agent loop here is deliberately output-heavy relative to a filing-extraction job: 2,000 output tokens against 40,000 input, a 1:20 ratio. A multi-step research agent that reasons, calls tools, and writes intermediate analysis generates a lot of output per call. That is exactly the regime where caching helps least, because caching never touches the output bill.
Opus 4.7 cannot cache its way under Gemini 3.5 Flash
Run the numbers. Claude Opus 4.7 starts at $990.00/month with no cache. Caching the 40k input prefix drives it down: $633.60 at 50%, $562.32 at 60%, $491.04 at 70%, $419.76 at 80%, $348.48 at 90%. The floor at 90% cache is $348.48.
Gemini 3.5 Flash on the same loop costs $308.88/month. Opus 4.7 does not get there at any cache rate in this sweep. Even a 90% cache hit leaves Opus 13% more expensive than uncached Gemini 3.5 Flash. The part of Opus's bill that caching cannot reach, the $25/Mtok output rate on 2,000 tokens times six calls times every idea, comes to $198.00/month on its own. That leaves only $110.88/month of room for input under the Gemini line, and at 90% cache Opus's input still costs $150.48/month. Caching closes an input gap. It does not close an output gap.
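Splitting the monthly bill into its input and output parts shows where the gap comes from. This is a sketch under the same assumptions the published figures imply (a linear blend of base and cache-read input rates, 10% retries, a 30-day month); the helper name is hypothetical.

```python
# calls per idea x (1 + retry rate) x ideas per day x days per month
CALLS_PER_MONTH = 6 * 1.10 * 20 * 30


def monthly_split(input_rate, output_rate, cache_read_rate, cache_hit_rate,
                  input_tokens=40_000, output_tokens=2_000):
    """Return (input_usd, output_usd) per month; rates are USD per million tokens."""
    blended = (1 - cache_hit_rate) * input_rate + cache_hit_rate * cache_read_rate
    input_usd = input_tokens * blended / 1_000_000 * CALLS_PER_MONTH
    output_usd = output_tokens * output_rate / 1_000_000 * CALLS_PER_MONTH
    return input_usd, output_usd


opus_input, opus_output = monthly_split(5.0, 25.0, 0.5, 0.9)
# Gemini is priced with no cache benefit, so its read rate equals its input rate.
gemini_total = sum(monthly_split(1.5, 9.0, 1.5, 0.0))
input_headroom = gemini_total - opus_output

print(f"Opus output (uncacheable):   ${opus_output:.2f}")     # $198.00
print(f"Opus input at 90% cache:     ${opus_input:.2f}")      # $150.48
print(f"Gemini 3.5 Flash total:      ${gemini_total:.2f}")    # $308.88
print(f"Input budget left for Opus:  ${input_headroom:.2f}")  # $110.88
```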
This is the finding worth carrying: if you are choosing Opus on an output-heavy loop and justifying it with "we will cache aggressively," the cache math does not rescue the choice against a cheaper frontier model. Caching is a reason to keep Opus where you already want Opus for its reasoning, not a reason to pick it over a cheaper model on cost.
Sonnet 4.6 does cross over, around 70% cache
Claude Sonnet 4.6 tells the opposite story. It starts at $594.00/month uncached, already below Opus, and its lower $15/Mtok output rate means caching can actually carry it under the Gemini line. At a 70% cache hit rate Sonnet 4.6 lands at $294.62/month, under Gemini 3.5 Flash's $308.88. Below roughly 70% cache, Gemini 3.5 Flash is cheaper; above it, cached Sonnet wins.
So the break-even is real and model-specific: for Sonnet 4.6 on this loop the cache hit rate has to clear about 70% before the Anthropic-plus-caching path beats the uncached Gemini path. For Opus 4.7 on this output ratio the same arithmetic puts the break-even at roughly 96%, above the 90% ceiling modeled here; for planning purposes, treat it as out of reach.
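The break-even hit rate can be solved for directly instead of read off a sweep. The sketch below assumes the linear cache blend implied by the published figures, and that both models run the same number of calls, so the comparison reduces to cost per call. The function name is illustrative. Under these assumptions it puts Sonnet 4.6's break-even near 67% and Opus 4.7's near 96%; a result above 1.0 would mean no break-even exists.

```python
def break_even_hit_rate(input_rate, output_rate, cache_read_rate, target_cost_per_call,
                        input_tokens=40_000, output_tokens=2_000):
    """Cache hit rate at which the cached model's cost per call equals the target.

    Rates are USD per million tokens. Values above 1.0 mean caching alone
    cannot reach the target; values at or below 0 mean no caching is needed.
    """
    uncached = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    # Saving per call if every input token were served from cache.
    saving_at_full_hit = input_tokens * (input_rate - cache_read_rate) / 1_000_000
    return (uncached - target_cost_per_call) / saving_at_full_hit


# Gemini 3.5 Flash, uncached: 40k input at $1.50 plus 2k output at $9.00 per Mtok.
gemini_per_call = (40_000 * 1.5 + 2_000 * 9.0) / 1_000_000

sonnet = break_even_hit_rate(3.0, 15.0, 0.3, gemini_per_call)
opus = break_even_hit_rate(5.0, 25.0, 0.5, gemini_per_call)
print(f"Sonnet 4.6 break-even: {sonnet:.1%}")
print(f"Opus 4.7 break-even:   {opus:.1%}")
```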
When the break-even flips in caching's favor
The crossover moves toward caching as the input-to-output ratio rises. On a filing-extraction job (130k input, 6k output, roughly a 1:22 output-to-input ratio, slightly more input-heavy than this agent loop's 1:20 and carrying more than three times the input per call), a larger share of the bill is cacheable input, so a given cache hit rate buys a bigger absolute saving. The lesson is not "caching is weak"; it is "caching's ROI is a function of how input-dominated the workload is." Map your workload's output ratio first; that ratio, not the cache rate alone, decides whether caching changes the model ranking.
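To make "bigger absolute saving" concrete, the sketch below compares the per-call saving at the same hit rate for the two workload shapes, using Sonnet 4.6's input and cache-read rates. It assumes the same linear cache blend as the published figures and stays at per-call level, since no call volume is given for the filing-extraction job. The function name is hypothetical.

```python
def cache_saving_per_call(input_tokens, cache_hit_rate, input_rate, cache_read_rate):
    """USD saved on one call when a share of the input is billed at the cache-read rate."""
    return input_tokens * cache_hit_rate * (input_rate - cache_read_rate) / 1_000_000


SONNET_INPUT, SONNET_CACHE_READ = 3.0, 0.3  # USD per million tokens
HIT_RATE = 0.7

agent_loop = cache_saving_per_call(40_000, HIT_RATE, SONNET_INPUT, SONNET_CACHE_READ)
filing_job = cache_saving_per_call(130_000, HIT_RATE, SONNET_INPUT, SONNET_CACHE_READ)

print(f"agent loop:        ${agent_loop:.4f} saved per call")
print(f"filing extraction: ${filing_job:.4f} saved per call")
```

At a 70% hit rate this works out to roughly $0.076 per call on the agent loop against roughly $0.246 per call on the filing job, a gap that tracks the difference in input size.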
Decision guidance
- Compute your output ratio before you bank on caching. Output-dominated loops (agents, long syntheses) get little from caching; input-dominated jobs (full-context extraction, long-document QA) get a lot.
- Use caching to keep the model you want, not to justify a pricier one. If Opus earns its place on reasoning, caching makes it cheaper to run. On this output-heavy loop it does not make Opus cheaper than Gemini 3.5 Flash at any cache rate up to 90%.
- Find the crossover cache rate, not just the saving. The number that matters is the cache hit rate at which your premium-plus-cache path undercuts your budget alternative. For Sonnet 4.6 here it is about 70%; measure yours.
- Measure real cache hit rate in production. Anthropic returns cache_read_input_tokens and cache_creation_input_tokens per response; aggregate them over 24 hours, divide cache reads by total input tokens (input_tokens plus both cache fields) to get the true blended rate, then plug it into the Token Cost Optimizer.
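The last step can be sketched as a small aggregation. This assumes you already log each response's usage block as a plain dict; the logging format and the function name are illustrative. The three field names are the ones Anthropic reports, where input_tokens counts only the uncached portion of the prompt, so total input is the sum of all three.

```python
def blended_cache_hit_rate(usages):
    """Share of all input tokens that were served from cache across logged responses."""
    reads = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    writes = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total_input = reads + writes + uncached
    return reads / total_input if total_input else 0.0


# Hypothetical log: the first call writes a 38k-token prefix, the second reads it back.
usage_log = [
    {"input_tokens": 2_000, "cache_creation_input_tokens": 38_000, "cache_read_input_tokens": 0},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 38_000},
]

print(f"blended hit rate: {blended_cache_hit_rate(usage_log):.1%}")  # 47.5%
```

Counting cache writes in the denominator matters: a window full of cold starts will show a lower blended rate than the steady-state figure, and that lower number is the one to price against.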
Connects to
- Token Cost Optimizer: the engine behind every cache-regime cost here. Recompute with your own output ratio.
- The LLM-in-Finance Economics Report 2026: the full four-workload report this spoke feeds into.
- Prompt Caching Economics for Finance: the mechanics of Anthropic, OpenAI, and Gemini caching, breakpoints, and failure modes.
- Token Cost Optimizer Cache Amortization: the cache-write-tax break-even on a single model.
References
1. Anthropic. "Pricing." platform.claude.com, verified 2026-05-26. https://platform.claude.com/docs/en/about-claude/pricing
2. Google. "Gemini Developer API pricing." ai.google.dev, verified 2026-05-26. https://ai.google.dev/gemini-api/docs/pricing
Verified engine output
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | claude-opus-4-7 |
| cache_hit_rate | 0 |

| Output | Value |
|---|---|
| model › id | claude-opus-4-7 |
| model › provider | anthropic |
| model › name | Claude Opus 4.7 |
| model › input usd per mtoken | 5 |
| model › output usd per mtoken | 25 |
| model › cache write usd per mtoken | 6.25 |
| model › cache read usd per mtoken | 0.5 |
| model › context window | 1000000 |
| model › notes | Flagship reasoning model — 1M context. |
| effective cost per call | 0.25 |
| cost per idea | 1.65 |
| cost per validated trade | 3.3 |
| cost per day | 33 |
| cost per month | 990 |
| cost per year | 12045 |
Computed live at build time.
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | claude-opus-4-7 |
| cache_hit_rate | 0.5 |

| Output | Value |
|---|---|
| model › id | claude-opus-4-7 |
| model › provider | anthropic |
| model › name | Claude Opus 4.7 |
| model › input usd per mtoken | 5 |
| model › output usd per mtoken | 25 |
| model › cache write usd per mtoken | 6.25 |
| model › cache read usd per mtoken | 0.5 |
| model › context window | 1000000 |
| model › notes | Flagship reasoning model — 1M context. |
| effective cost per call | 0.16 |
| cost per idea | 1.056 |
| cost per validated trade | 2.112 |
| cost per day | 21.12 |
| cost per month | 633.6 |
| cost per year | 7708.8 |
Computed live at build time.
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | claude-opus-4-7 |
| cache_hit_rate | 0.9 |

| Output | Value |
|---|---|
| model › id | claude-opus-4-7 |
| model › provider | anthropic |
| model › name | Claude Opus 4.7 |
| model › input usd per mtoken | 5 |
| model › output usd per mtoken | 25 |
| model › cache write usd per mtoken | 6.25 |
| model › cache read usd per mtoken | 0.5 |
| model › context window | 1000000 |
| model › notes | Flagship reasoning model — 1M context. |
| effective cost per call | 0.088 |
| cost per idea | 0.5808 |
| cost per validated trade | 1.1616 |
| cost per day | 11.616 |
| cost per month | 348.48 |
| cost per year | 4239.84 |
Computed live at build time.
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | claude-sonnet-4-6 |
| cache_hit_rate | 0 |

| Output | Value |
|---|---|
| model › id | claude-sonnet-4-6 |
| model › provider | anthropic |
| model › name | Claude Sonnet 4.6 |
| model › input usd per mtoken | 3 |
| model › output usd per mtoken | 15 |
| model › cache write usd per mtoken | 3.75 |
| model › cache read usd per mtoken | 0.3 |
| model › context window | 500000 |
| model › notes | Best price/performance for bulk research loops. |
| effective cost per call | 0.15 |
| cost per idea | 0.99 |
| cost per validated trade | 1.98 |
| cost per day | 19.8 |
| cost per month | 594 |
| cost per year | 7227 |
Computed live at build time.
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | claude-sonnet-4-6 |
| cache_hit_rate | 0.7 |

| Output | Value |
|---|---|
| model › id | claude-sonnet-4-6 |
| model › provider | anthropic |
| model › name | Claude Sonnet 4.6 |
| model › input usd per mtoken | 3 |
| model › output usd per mtoken | 15 |
| model › cache write usd per mtoken | 3.75 |
| model › cache read usd per mtoken | 0.3 |
| model › context window | 500000 |
| model › notes | Best price/performance for bulk research loops. |
| effective cost per call | 0.0744 |
| cost per idea | 0.49104 |
| cost per validated trade | 0.98208 |
| cost per day | 9.8208 |
| cost per month | 294.624 |
| cost per year | 3584.592 |
Computed live at build time.
| Input | Value |
|---|---|
| input_tokens_per_call | 40000 |
| output_tokens_per_call | 2000 |
| calls_per_idea | 6 |
| retry_rate | 0.1 |
| ideas_per_day | 20 |
| validation_rate | 0.5 |
| model_id | gemini-3-5-flash |
| cache_hit_rate | 0 |

| Output | Value |
|---|---|
| model › id | gemini-3-5-flash |
| model › provider | |
| model › name | Gemini 3.5 Flash |
| model › input usd per mtoken | 1.5 |
| model › output usd per mtoken | 9 |
| model › context window | 1000000 |
| model › notes | Frontier agent-tier at Flash speed — not a budget model (output ~3.6x Gemini 2.5 Flash). |
| effective cost per call | 0.078 |
| cost per idea | 0.5148 |
| cost per validated trade | 1.0296 |
| cost per day | 10.296 |
| cost per month | 308.88 |
| cost per year | 3758.04 |
Computed live at build time.
Frequently asked questions
- Does prompt caching make Claude Opus 4.7 cheaper than Gemini 3.5 Flash for finance agents?
  No, not on this output-heavy loop at any cache rate up to 90%. The Token Cost Optimizer takes Opus 4.7 from $990.00/month at 0% cache to $348.48/month at 90% cache, but Gemini 3.5 Flash uncached costs $308.88/month on the same workload. Caching discounts input only; Opus's $25/Mtok output rate alone accounts for $198.00/month, and the input that remains at 90% cache keeps the total above the Gemini line.
- What cache hit rate does Claude Sonnet 4.6 need to beat Gemini 3.5 Flash on cost?
  About 70%. Sonnet 4.6 starts at $594.00/month uncached and falls to $294.62/month at a 70% cache hit rate, just under Gemini 3.5 Flash's $308.88/month. Below roughly 70% cache, Gemini is cheaper; above it, cached Sonnet wins.
- Why does caching help input-heavy workloads more than agent loops?
  Caching discounts repeated input tokens (Anthropic reads at 0.1x base input) and never touches output. An agent loop generates a lot of output per call, so a large part of its bill is uncacheable. A full-context extraction job is input-dominated, so a given cache hit rate buys a bigger absolute saving there.
- How do I measure my real prompt-cache hit rate?
  Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens on every response. Aggregate them over 24 hours and compute cache reads divided by total input tokens (input_tokens plus cache_creation_input_tokens plus cache_read_input_tokens) to get the blended hit rate, then enter that rate in the Token Cost Optimizer to price your actual workload.
- Is prompt caching ever a reason to choose a more expensive model?
  Caching is a reason to run a model you already want more cheaply, not a reason to pick a pricier model over a budget one on cost. On this output-heavy loop, even a 90% cache hit does not bring Opus 4.7 under Gemini 3.5 Flash, because caching cannot reach the output bill; Sonnet 4.6 crosses under only at about a 70% cache hit rate.