A 55% cache hit rate cuts a research loop's monthly bill by 37%, and the break-even is a cache-write-tax calculation most teams never run. Take a loop with 12k input tokens and 800 output tokens per call, 6 calls per idea, an 8% retry rate, 8 ideas per day, and a 50% validation rate on Claude Sonnet 4.6. At a 55% cache hit rate the Token Cost Optimizer returns effectiveCostPerCall $0.0302, cost per idea $0.196, cost per validated trade $0.391, and cost per month $46.94. At a 13% cache hit rate (the engineering folk-rule for "break-even" on cache writes), monthly cost rises to $68.10, 45% more than at 55%. At 0% it is $74.65. The hit rate below which caching becomes a net cost depends on the cache-write multiplier, which on Anthropic models is 1.25× the input rate.
TL;DR
Same workload, three cache regimes:
| Cache hit rate | Cost per call | Cost per idea | Cost per validated trade | Cost per month |
|---|---|---|---|---|
| 0% | $0.0480 | $0.311 | $0.622 | $74.65 |
| 13% | $0.0438 | $0.284 | $0.567 | $68.10 |
| 55% | $0.0302 | $0.196 | $0.391 | $46.94 |
A batch-API discount (24-hour SLA) on the same call shape, run at 50 jobs/day and priced without caching, drops cost per day from $2.40 to $1.20, a flat 50% saving that stacks with any cache regime. Caching and batching are orthogonal cost levers; both apply.
The cache-write tax
Anthropic's published prompt-caching model has two costs:
- Cache write: 1.25× the standard input rate. Charged on the first call where the cached prefix is established.
- Cache read: 0.1× the standard input rate. Charged on every subsequent call that hits the cached prefix.
For Claude Sonnet 4.6 (input $3/Mtok): cache write = $3.75/Mtok; cache read = $0.30/Mtok. The break-even for a single cache-write to pay back depends on how many subsequent reads hit it. The arithmetic:
break_even_reads = (cache_write_rate − standard_input_rate) / (standard_input_rate − cache_read_rate)
= ($3.75 − $3.00) / ($3.00 − $0.30)
= $0.75 / $2.70
= 0.278 reads per write, which rounds up to 1
So a single cache-read hit more than repays the write premium: after one read the net saving is $2.70 − $0.75 = $1.95 per Mtok of cached prefix, which is already cheaper than no cache. Each further read saves another $2.70 per Mtok of cached input; the savings accumulate linearly.
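The per-prefix arithmetic above can be sketched in a few lines. This is a minimal illustration using the Sonnet 4.6 rates quoted in this section; the function names are hypothetical and are not part of the engine.

```python
import math

def break_even_reads(input_rate: float, write_rate: float, read_rate: float) -> float:
    """Cache reads needed per cache write for the write premium to be repaid."""
    write_premium = write_rate - input_rate    # extra paid once, when the prefix is written
    saving_per_read = input_rate - read_rate   # saved on every later call that hits the prefix
    return write_premium / saving_per_read

def net_saving_per_mtok(reads: int, input_rate: float, write_rate: float, read_rate: float) -> float:
    """Net USD saved per Mtok of cached prefix after one write followed by `reads` hits."""
    return reads * (input_rate - read_rate) - (write_rate - input_rate)

# Rates in USD per million tokens, as quoted above for Claude Sonnet 4.6.
ratio = break_even_reads(3.00, 3.75, 0.30)
print(round(ratio, 3), math.ceil(ratio))
print(round(net_saving_per_mtok(1, 3.00, 3.75, 0.30), 2))
```

The first line prints the fractional break-even and its ceiling; the second prints the net saving after a single read.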
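The often-cited "13% break-even" is an average-cost rule of thumb: the blended hit rate across all calls at which total cache-write premiums equal total cache-read savings. Below that rate caching is a net cost; above it, a net saving. The engine's output at 13% ($68.10 per month) is 8.8% below the 0% baseline ($74.65), not zero net. The reported figures are reproduced by billing hit tokens at the read rate and missed tokens at the standard input rate, so the engine's monthly numbers do not include a write premium on misses and will overstate the saving to the extent that misses pay that premium.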
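Where the blended break-even lands depends on how many missed tokens actually pay the write premium. The sketch below is an illustrative model, not the engine's: it assumes hit tokens bill at the read multiplier, missed tokens bill at the standard rate, and a fraction `write_fraction` of missed tokens additionally pays the write premium. `write_fraction` is a hypothetical parameter introduced here for illustration.

```python
def break_even_hit_rate(write_mult: float, read_mult: float, write_fraction: float) -> float:
    """Blended hit rate h at which cached cost equals uncached cost.

    Solves h*read_mult + (1 - h)*(1 + premium) = 1 for h, where
    premium = (write_mult - 1) * write_fraction is the average extra
    cost per missed token (an assumption of this sketch).
    """
    premium = (write_mult - 1.0) * write_fraction
    return premium / (1.0 + premium - read_mult)

# Anthropic multipliers from this section: write 1.25x, read 0.10x.
for wf in (0.25, 0.5, 1.0):
    print(wf, round(break_even_hit_rate(1.25, 0.10, wf), 3))
```

Under these assumptions the break-even moves from roughly 6.5% to roughly 21.7% as `write_fraction` goes from 0.25 to 1.0, which is one reason a single folk-rule number should be treated as approximate.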
Why 55% cache hit rate is the right design target
The engine's canonical input uses a 55% cache hit rate. At that rate, monthly cost is $46.94, 37% below the no-cache baseline ($74.65). In the engine's arithmetic cost is linear in hit rate, so every additional 10 points saves the same amount, about $5.04 per month or 6.75% of the no-cache baseline: lifting from 55% to 65% takes the bill from $46.94 to about $41.90. The returns diminish on the effort side, because each further point of hit rate is harder to reach.
That shape comes from the fraction of input tokens that are not cacheable in principle. In a research loop the system prompt (instructions, schema, examples) caches with near-100% hit rate; the per-idea body (the actual document or query) caches with ~30% hit rate (only if re-queried the same day); the per-call dynamic context caches with ~5% hit rate. A 55% blended hit rate is representative of workflows with a stable system prompt and modest body re-use.
The architectural decision is not "what cache hit rate to target" but "what fraction of the input is dedicated to stable prompt sections." A workflow where 70% of input tokens come from a stable system prompt will clear a 55% blended cache rate without further effort. A workflow where the system prompt is short and 90% of input is per-call dynamic context will struggle to clear 20% even with aggressive caching.
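The blended rate is a token-weighted average of the per-segment hit rates. A minimal sketch, using the approximate per-segment hit rates quoted above; the token shares in both example workflows are hypothetical values chosen for illustration.

```python
def blended_hit_rate(segments: dict[str, tuple[float, float]]) -> float:
    """Token-weighted cache hit rate.

    `segments` maps a segment name to (share_of_input_tokens, segment_hit_rate).
    Shares must sum to 1.
    """
    total_share = sum(share for share, _ in segments.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"token shares must sum to 1, got {total_share}")
    return sum(share * hit for share, hit in segments.values())

# Hypothetical token shares; hit rates follow the ~100% / ~30% / ~5% figures above.
stable_heavy = {
    "system_prompt": (0.70, 1.00),
    "idea_body": (0.20, 0.30),
    "dynamic_context": (0.10, 0.05),
}
dynamic_heavy = {
    "system_prompt": (0.05, 1.00),
    "idea_body": (0.05, 0.30),
    "dynamic_context": (0.90, 0.05),
}

print(round(blended_hit_rate(stable_heavy), 3))
print(round(blended_hit_rate(dynamic_heavy), 3))
```

With these shares the stable-heavy workflow lands well above 55% and the dynamic-heavy one stays below 20%, matching the qualitative claim in the paragraph above.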
The cost-per-validated-trade frame
The engine's output decomposes cost into four units:
- effectiveCostPerCall, the all-in per-API-call cost including cache, output, and retry.
- costPerIdea, multiplied by calls_per_idea (6 in the canonical run), adjusted for retry rate (8%).
- costPerValidatedTrade, divided by validation_rate (50% in the canonical run) so it represents the cost of producing one trade-ready output (half of ideas are validated, so each validated trade costs twice the per-idea cost).
- costPerDay and costPerMonth, multiplied by ideas_per_day (8) and a 30-day month (240 ideas/month).
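For the canonical run: $0.0302 per call × 6.48 calls (6 base × 1.08 retry factor) = $0.196 per idea. $0.196 / 0.50 validation rate = $0.391 per validated trade. $0.196 × 8 ideas/day = $1.565/day. $1.565 × 30 days = $46.94/month. Each figure is rounded for display; the chain carries the unrounded values, so the displayed steps do not multiply out to the last digit.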
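The chain above can be reproduced end to end. The sketch below is a reconstruction inferred from the verified output, not the engine's source: it assumes hit tokens bill at the cache-read rate and missed tokens at the standard input rate, which matches the reported figures at 0%, 13%, and 55%. Function and key names are hypothetical.

```python
# Claude Sonnet 4.6 rates in USD per million tokens, from the verified engine output.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00
CACHE_READ_RATE = 0.30

def effective_cost_per_call(input_tokens: int, output_tokens: int, hit_rate: float) -> float:
    """Per-call USD cost: hits at the read rate, misses at the standard input rate."""
    hit_tokens = input_tokens * hit_rate
    miss_tokens = input_tokens - hit_tokens
    micro_usd = (hit_tokens * CACHE_READ_RATE
                 + miss_tokens * INPUT_RATE
                 + output_tokens * OUTPUT_RATE)
    return micro_usd / 1_000_000

def loop_costs(hit_rate: float, calls_per_idea: int = 6, retry_rate: float = 0.08,
               ideas_per_day: int = 8, validation_rate: float = 0.5,
               days_per_month: int = 30) -> dict[str, float]:
    """Roll the per-call cost up to idea, validated trade, day, and month."""
    per_call = effective_cost_per_call(12_000, 800, hit_rate)
    per_idea = per_call * calls_per_idea * (1 + retry_rate)
    per_day = per_idea * ideas_per_day
    return {
        "per_call": per_call,
        "per_idea": per_idea,
        "per_validated_trade": per_idea / validation_rate,
        "per_day": per_day,
        "per_month": per_day * days_per_month,
    }

for hit_rate in (0.0, 0.13, 0.55):
    costs = loop_costs(hit_rate)
    print(f"{hit_rate:.0%}: call ${costs['per_call']:.4f}, month ${costs['per_month']:.2f}")
```

Running it for the three hit rates gives the per-call and per-month columns of the TL;DR table.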
The cost-per-validated-trade is the most actionable unit. If the strategy makes $5 expected profit per validated trade, a $0.391 cost-per-validated-trade is acceptable (7.8% of expected profit). If the cost-per-validated-trade exceeds 50% of expected profit, the workflow is uneconomic regardless of how clean the cache architecture is.
The batch-vs-realtime axis
The engine's companion Batch vs Realtime Cost Calculator runs the same call shape (12k input, 800 output) at 50 jobs/day with a 24h deadline on Claude Sonnet 4.6 and returns realtime $2.40/day and batch $1.20/day, a saving of $1.20/day or $36 over a 30-day month. The batch eligibility test passes because the 24h deadline meets the published Anthropic batch SLA.
The batching saving is orthogonal to caching: a workflow that runs at 55% cache hit rate AND batch-mode would land at roughly $23.50/month, a 69% reduction from the no-cache real-time baseline. The provider-comparison output in the engine shows Haiku at lower absolute cost and Opus at higher; the cross-vendor comparison is built into the engine's response.
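The combined figure is a straight multiplication. A minimal sketch, assuming the 50% batch discount applies uniformly to the already-cached monthly bill, as the paragraph above does; the constant and function names are hypothetical.

```python
NO_CACHE_REALTIME_USD = 74.65   # monthly, 0% cache hit rate, real-time
CACHED_REALTIME_USD = 46.94     # monthly, 55% cache hit rate, real-time
BATCH_DISCOUNT = 0.50           # flat batch-mode discount quoted in this article

def with_batch(monthly_usd: float, discount: float = BATCH_DISCOUNT) -> float:
    """Apply a flat batch discount to a monthly bill."""
    return monthly_usd * (1 - discount)

combined = with_batch(CACHED_REALTIME_USD)
reduction = 1 - combined / NO_CACHE_REALTIME_USD
print(f"${combined:.2f}/month, {reduction:.1%} below the no-cache real-time baseline")
```

Whether the two discounts multiply exactly this way in a given provider's billing is worth confirming against an actual invoice before relying on the combined number.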
Where the engine is conservative
The engine assumes cache reads cost a fixed 10% of input rate (Anthropic) or 50% (OpenAI). Production cache hit-rate measurement is noisy: a workflow's true hit rate depends on TTL, prompt-shape stability, and concurrent-call patterns. The engine's input is a single blended number; real workflows vary day-to-day.
The retry_rate input (8% canonical) bundles all failure modes: rate-limit retries, tool-error retries, and validation-fail retries. Production retry rates can be 3% for stable workflows or 25% for adversarial or high-temperature settings. Per-idea cost scales with (1 + retry_rate), so doubling the canonical rate from 8% to 16% raises per-idea cost by about 7.4% (the multiplier moves from 1.08 to 1.16).
The validation_rate input (50% canonical) is the fraction of ideas that pass downstream review. For a well-prompted workflow this is 70–85%; for an exploratory workflow it can be 20–30%. The cost-per-validated-trade is highly sensitive to validation_rate — halving it doubles the cost-per-validated-trade. Track validation_rate as a first-class metric.
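The two sensitivities can be checked directly. A minimal sketch built on the canonical per-call cost of $0.03018; the function names are hypothetical, and the formulas follow the unit definitions given earlier in this article.

```python
def cost_per_idea(per_call_usd: float, calls_per_idea: int, retry_rate: float) -> float:
    """Per-idea cost: base calls scaled by the retry multiplier (1 + retry_rate)."""
    return per_call_usd * calls_per_idea * (1 + retry_rate)

def cost_per_validated_trade(per_idea_usd: float, validation_rate: float) -> float:
    """Per-idea cost spread over the fraction of ideas that pass validation."""
    if not 0 < validation_rate <= 1:
        raise ValueError("validation_rate must be in (0, 1]")
    return per_idea_usd / validation_rate

PER_CALL_USD = 0.03018  # canonical effective cost per call at 55% cache hit rate

base = cost_per_idea(PER_CALL_USD, 6, 0.08)
doubled_retry = cost_per_idea(PER_CALL_USD, 6, 0.16)
print(f"retry 8% -> 16%: per-idea cost +{doubled_retry / base - 1:.1%}")

for validation_rate in (0.25, 0.50, 0.80):
    trade_cost = cost_per_validated_trade(base, validation_rate)
    print(f"validation {validation_rate:.0%}: ${trade_cost:.3f} per validated trade")
```

The retry sensitivity is mild because it enters as (1 + retry_rate); the validation sensitivity is steep because it enters as a divisor.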
Connects to
- Token Cost Prompt Cache vs Distill vs RAG — the broader cache-architecture comparison.
- Caching Strategies for LLM Pipelines 2026 — system-prompt, body, and dynamic-context cache patterns.
- Inference Cost Attribution per Trade — how to amortize per-call cost to per-trade unit economics.
- Token Cost Optimizer — engine endpoint.
- Batch vs Realtime Cost Calculator — orthogonal batch-mode savings.
- Agent Cost Envelope Calculator — the agent-loop cost model the optimizer feeds.
References
- Anthropic. "Prompt caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed 2026-05-21.
- Anthropic. "Batch API." docs.anthropic.com/en/docs/build-with-claude/batch-processing, accessed 2026-05-21.
- OpenAI. "Prompt caching." platform.openai.com/docs/guides/prompt-caching, accessed 2026-05-21.
- OpenAI. "Batch API." platform.openai.com/docs/guides/batch, accessed 2026-05-21.
- Pope, R., Douglas, S., Chowdhery, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. https://arxiv.org/abs/2211.05102. Reference for the technical foundation of prompt-cache reuse.
Verified engine output
| Input | Value |
|---|---|
| input_tokens_per_call | 12000 |
| output_tokens_per_call | 800 |
| calls_per_idea | 6 |
| retry_rate | 0.08 |
| ideas_per_day | 8 |
| validation_rate | 0.5 |
| cache_hit_rate | 0.55 |

| Model field | Value |
|---|---|
| model › id | claude-sonnet-4-6 |
| model › provider | anthropic |
| model › name | Claude Sonnet 4.6 |
| model › input usd per mtoken | 3 |
| model › output usd per mtoken | 15 |
| model › cache write usd per mtoken | 3.75 |
| model › cache read usd per mtoken | 0.3 |
| model › context window | 500000 |
| model › notes | Best price/performance for bulk research loops. |

| Output | Value (USD) |
|---|---|
| effective cost per call | 0.03018 |
| cost per idea | 0.1955664 |
| cost per validated trade | 0.3911328 |
| cost per day | 1.5645312 |
| cost per month | 46.935936 |
| cost per year | 571.053888 |
Computed live at build time.
Frequently asked questions
- Why is the break-even cache hit rate above 0%?
- The cache-write multiplier is 1.25× the input rate, so the call that establishes the cache costs more than a non-cached call, and enough later reads must hit it to recover that premium. The 13% folk-rule is approximate; true break-even sits at 8–10% depending on the cache-write to input-rate ratio.
- Does the engine model OpenAI cache pricing correctly?
- Yes. OpenAI bills cache reads at 0.5× the input rate (vs Anthropic's 0.1×) and cache writes at 1.0× (vs Anthropic's 1.25×), so break-even is lower but the absolute savings ceiling is smaller.
- How do I measure cache hit rate in production?
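- Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens in each response's usage block. Aggregate over 24h and divide cache-read tokens by total input tokens (input_tokens + cache_creation_input_tokens + cache_read_input_tokens).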
- What's the relationship between retry rate and validation rate?
- Independent inputs but correlated in practice: high retry rate suggests prompt instability, which often produces lower validation rate. Optimize jointly.
- Should I prefer batch-mode over caching?
- Both. They are orthogonal levers; cache amortizes the system-prompt portion of every call, batch-mode is a 50% discount on real-time calls with a 24-hour SLA.
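For the production hit-rate question above, the aggregation can be sketched over logged usage records. This is a minimal example that assumes each record is a plain dict holding the usage fields named in that answer plus input_tokens (tokens billed at the standard rate); the sample records are invented for illustration, and how the usage blocks get logged is left to the caller.

```python
from typing import Iterable, Mapping

def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Token-weighted cache hit rate over a set of per-response usage records."""
    records = list(usages)  # materialize so the records can be scanned more than once
    read = sum(u.get("cache_read_input_tokens") or 0 for u in records)
    created = sum(u.get("cache_creation_input_tokens") or 0 for u in records)
    uncached = sum(u.get("input_tokens") or 0 for u in records)
    total = read + created + uncached
    return read / total if total else 0.0

# Illustrative records: one call writes a 10k-token prefix, two later calls read it.
sample = [
    {"input_tokens": 2000, "cache_creation_input_tokens": 10000, "cache_read_input_tokens": 0},
    {"input_tokens": 2000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10000},
    {"input_tokens": 2000, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 10000},
]
print(f"{cache_hit_rate(sample):.1%}")
```

Weighting by tokens rather than by calls keeps the result comparable to the blended cache_hit_rate input the engine expects.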