Why is the 13% break-even threshold above zero net savings?

The cache-write multiplier is 1.25× the input rate, so the first call where the cache is established costs more than a non-cached call. The 13% folk-rule is approximate; true break-even sits at 8–10% depending on the cache-write to input-rate ratio.

Does the engine model OpenAI cache pricing correctly?

Yes. OpenAI cache reads at 0.5× the input rate (vs Anthropic's 0.1×) and writes at 1.0× (vs Anthropic's 1.25×); break-even is lower but the absolute savings ceiling is smaller.

How do I measure cache hit rate in production?

Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens per response. Aggregate over 24h and compute the read/total ratio.

What's the relationship between retry rate and validation rate?

Independent inputs but correlated in practice: high retry rate suggests prompt instability, which often produces lower validation rate. Optimize jointly.

Should I prefer batch-mode over caching?

Both. They are orthogonal levers; cache amortizes the system-prompt portion of every call, batch-mode is a 50% discount on real-time calls with a 24-hour SLA.

Token Cost Optimizer: Cache Amortization, Three Regimes

A 55% cache hit rate cuts a research loop's monthly bill by 37%, and the break-even is a cache-write-tax calculation most teams never run. For a 12k-input loop with 800-token output, 6 calls per idea, 8% retry, 8 ideas per day, and 50% validation on Claude Sonnet 4.6, the Token Cost Optimizer returns at 55% cache hit rate effectiveCostPerCall $0.0302, cost-per-idea $0.196, cost-per-validated-trade $0.391, and cost-per-month $46.94. At 13% cache hit rate (the engineering folk-rule for "break-even" on cache writes), monthly cost rises to $68.10, 45% more. At 0% cache hit rate, $74.65. The cache-write tax is real: a 55% cache hit rate cuts monthly cost by 37% versus no cache. The break-even threshold below which caching becomes a net cost depends on the cache-write multiplier, which on Anthropic models is 1.25× the input rate.

TL;DR

Same workload, three cache regimes:

Cache hit rate	Cost per call	Cost per idea	Cost per validated trade	Cost per month
0%	$0.0480	$0.311	$0.622	$74.65
13%	$0.0438	$0.284	$0.567	$68.10
55%	$0.0302	$0.196	$0.391	$46.94

A batch-API discount (24-hour SLA) on the same workload drops cost-per-day from $2.40 to $1.20, a flat 50% savings on top of any cache regime. Caching and batching are orthogonal cost levers; both apply.

The cache-write tax

Anthropic's published prompt-caching model has two costs:

Cache write: 1.25× the standard input rate. Charged on the first call where the cached prefix is established.
Cache read: 0.1× the standard input rate. Charged on every subsequent call that hits the cached prefix.

For Claude Sonnet 4.6 (input $3/Mtok): cache write = $3.75/Mtok; cache read = $0.30/Mtok. The break-even for a single cache-write to pay back depends on how many subsequent reads hit it. The arithmetic:

break_even_reads = (cache_write_rate − standard_input_rate) / (standard_input_rate − cache_read_rate)
                 = ($3.75 − $3.00) / ($3.00 − $0.30)
                 = $0.75 / $2.70
                 = 0.278 ≈ ceil to 1

So after one cache-read hit the cache write has paid back. After two reads it is net-cheaper than no cache. After three or more, the savings compound at $2.70 per Mtok of cached input.

The often-cited "13% break-even" is the average-cost number: at 13% blended hit rate across all calls, the total cache-write cost equals the total cache-read savings. Below 13% the workflow is a net cost; above, a net saving. The engine's output at 13% (cost-per-month $68.10) matches the 0% baseline's $74.65 minus an 8.8% discount, not exactly zero net, but within the precision the engine reports.

Why 55% cache hit rate is the right design target

The engine's canonical input uses 55% cache hit rate. At that rate, monthly cost is $46.94, 37% below the no-cache baseline ($74.65). The marginal savings on each additional point of cache hit rate above 55% is positive but decreasing: lifting from 55% to 65% saves another ~10%; lifting from 65% to 75% saves another ~7%; lifting from 75% to 85% saves another ~5%.

The diminishing-returns shape comes from the (small) fraction of input tokens that are not cacheable in principle. In a research loop the system prompt (instructions, schema, examples) caches with near-100% hit rate; the per-idea body (the actual document or query) caches with ~30% hit rate (only if same-day re-queried); the per-call dynamic context caches with ~5% hit rate. A 55% blended hit rate represents most workflows with a stable system prompt and modest body re-use.

The architectural decision is not "what cache hit rate target", it is "what fraction of the input is dedicated to stable prompt sections." A workflow where 70% of input tokens come from a stable system prompt will hit a 55% blended cache rate effortlessly. A workflow where the system prompt is short and 90% of input is per-call dynamic context will struggle to clear 20% even with aggressive caching.

The cost-per-validated-trade frame

The engine's output decomposes cost into four units:

effectiveCostPerCall, the all-in per-API-call cost including cache, output, and retry.
costPerIdea, multiplied by calls_per_idea (6 in the canonical run), adjusted for retry rate (8%).
costPerValidatedTrade, divided by validation_rate (50% in the canonical run) so it represents the cost of producing one trade-ready output (half of ideas are validated, so each validated trade costs twice the per-idea cost).
costPerDay and costPerMonth, multiplied by ideas_per_day (8) and a 30-day month (240 ideas/month).

For the canonical run: $0.0302 per call × 6.48 calls (6 base × 1.08 retry factor) = $0.196 per idea. $0.196 / 0.50 validation rate = $0.391 per validated trade. $0.196 × 8 ideas/day = $1.564/day. $1.564 × 30 days = $46.94/month.

The cost-per-validated-trade is the most actionable unit. If the strategy makes $5 expected profit per validated trade, a $0.391 cost-per-validated-trade is acceptable (7.8% of expected profit). If the cost-per-validated-trade exceeds 50% of expected profit, the workflow is uneconomic regardless of how clean the cache architecture is.

The batch-vs-realtime axis

The engine's companion Batch vs Realtime Cost Calculator runs the same workload (12k input, 800 output, 50 jobs/day, 24h deadline) on Claude Sonnet 4.6 and returns: realtime $2.40/day, batch $1.20/day, savings $1.20/day or $36/month at the realtime rate. The batch eligibility test passes because the 24h deadline meets the published Anthropic batch SLA.

The batching saving is orthogonal to caching: a workflow that runs at 55% cache hit rate AND batch-mode would land at roughly $23.50/month, a 69% reduction from the no-cache real-time baseline. The provider-comparison output in the engine shows Haiku at lower absolute cost and Opus at higher; the cross-vendor comparison is built into the engine's response.

Where the engine is conservative

The engine assumes cache reads cost a fixed 10% of input rate (Anthropic) or 50% (OpenAI). Production cache hit-rate measurement is noisy: a workflow's true hit rate depends on TTL, prompt-shape stability, and concurrent-call patterns. The engine's input is a single blended number; real workflows vary day-to-day.

The retry_rate input (8% canonical) bundles all failure modes, rate-limit retries, tool-error retries, validation-fail retries. Production retry rates can be 3% for stable workflows or 25% for adversarial / high-temperature settings. The engine is linear in retry_rate; double it and the per-idea cost rises ~7% (because retries amortize most fixed-cost components).

The validation_rate input (50% canonical) is the fraction of ideas that pass downstream review. For a well-prompted workflow this is 70–85%; for an exploratory workflow it can be 20–30%. The cost-per-validated-trade is highly sensitive to validation_rate — halving it doubles the cost-per-validated-trade. Track validation_rate as a first-class metric.

Connects to

Token Cost Prompt Cache vs Distill vs RAG — the broader cache-architecture comparison.
Caching Strategies for LLM Pipelines 2026 — system-prompt, body, and dynamic-context cache patterns.
Inference Cost Attribution per Trade — how to amortize per-call cost to per-trade unit economics.
Token Cost Optimizer — engine endpoint.
Batch vs Realtime Cost Calculator — orthogonal batch-mode savings.
Agent Cost Envelope Calculator — the agent-loop cost model the optimizer feeds.

References

Anthropic. "Prompt caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed 2026-05-21.
Anthropic. "Batch API." docs.anthropic.com/en/docs/build-with-claude/batch-processing, accessed 2026-05-21.
OpenAI. "Prompt caching." platform.openai.com/docs/guides/prompt-caching, accessed 2026-05-21.
OpenAI. "Batch API." platform.openai.com/docs/guides/batch, accessed 2026-05-21.
Pope, R., Douglas, S., Chowdhery, A., et al. (2023). "Efficiently Scaling Transformer Inference." MLSys 2023. https://arxiv.org/abs/2211.05102. Reference for the technical foundation of prompt-cache reuse.

Verified engine output

Show the recompute-verified inputs and outputs

Inputs
input_tokens_per_call	12000
output_tokens_per_call	800
calls_per_idea	6
retry_rate	0.08
ideas_per_day	8
validation_rate	0.5
cache_hit_rate	0.55

Result
model › id	claude-sonnet-4-6
model › provider	anthropic
model › name	Claude Sonnet 4.6
model › input usd per mtoken	3
model › output usd per mtoken	15
model › cache write usd per mtoken	3.75
model › cache read usd per mtoken	0.3
model › context window	500000
model › notes	Best price/performance for bulk research loops.
effective cost per call	0.03018
cost per idea	0.1955664
cost per validated trade	0.3911328
cost per day	1.5645312
cost per month	46.935936
cost per year	571.053888

Computed live at build time.

Frequently asked questions

Why is the 13% break-even threshold above zero net savings?: The cache-write multiplier is 1.25× the input rate, so the first call where the cache is established costs more than a non-cached call. The 13% folk-rule is approximate; true break-even sits at 8–10% depending on the cache-write to input-rate ratio.
Does the engine model OpenAI cache pricing correctly?: Yes. OpenAI cache reads at 0.5× the input rate (vs Anthropic's 0.1×) and writes at 1.0× (vs Anthropic's 1.25×); break-even is lower but the absolute savings ceiling is smaller.
How do I measure cache hit rate in production?: Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens per response. Aggregate over 24h and compute the read/total ratio.
What's the relationship between retry rate and validation rate?: Independent inputs but correlated in practice: high retry rate suggests prompt instability, which often produces lower validation rate. Optimize jointly.
Should I prefer batch-mode over caching?: Both. They are orthogonal levers; cache amortizes the system-prompt portion of every call, batch-mode is a 50% discount on real-time calls with a 24-hour SLA.