TL;DR

Prompt caching turns the dominant cost line in finance LLM workloads, namely re-reading the same system prompt or the same filing corpus dozens of times per day, from "always full price" into "1× write, then many cheap reads". Anthropic caches blocks for five minutes on a rolling TTL, with an extended one-hour option on some tiers; OpenAI ships automatic prefix caching with a shorter TTL; Gemini offers explicit context caching with a user-set TTL. On nightly filing sweeps the delta lands in the 50 to 90 percent range on input cost when the pattern is right. Get it wrong, meaning a timestamp in the cached block, the user query before the corpus, or a block under the minimum size, and hit rate collapses to zero. The fix is structural, not a flag.

The cost line caching actually moves

Finance workloads have a specific shape: a large, stable context (a filing, a schema, a tool-definitions block, a style guide) gets combined with a small, variable question. Without caching, every call pays full input price on the stable part. A 25K-token 10-K asked four questions in a sweep pays for 100K tokens of input where only 25K is new information. Caching collapses the repeated 75K to a cache-read price that runs roughly one-tenth of the normal input rate on Anthropic and roughly one-quarter on Gemini, with OpenAI giving a smaller but friction-free automatic discount, as of 2026-04 published rates[1][2][3].

The economics are bluntest at the extremes. A practitioner running one analyst agent against one 10-K, once, will save nothing and may pay more (the write premium has no subsequent reads to amortize). A practitioner running 50 question-answer cycles against a cached earnings-call transcript in a single five-minute window will see the input bill on those reads drop toward 10 percent of the uncached equivalent. Between those endpoints sits a practical rule that governs whether caching is worth the wiring effort: reuse at least three times within one TTL.

The practitioner question is not "should caching be enabled" but "is the prompt laid out so the cache can actually hit?" Most misses are structural: volatile content placed above the stable block, block sizes below provider minimums, or a TTL that expires between the write and the next read. The three providers differ in the exact TTL, the minimum block, and how aggressive the read discount is, but the failure modes are identical across all three.

How caching actually works

Each provider implements caching with the same core idea (reuse the compute of a prefix that has not changed) and different semantics for TTL, minimum block size, and how hits are declared. The numbers below reflect 2026-04 published rates. Providers update pricing, and the exact multipliers below should be verified against the linked docs before any budget decision.

Anthropic prompt caching

Anthropic's API exposes caching via explicit cache_control breakpoints attached to message blocks[1]. Up to four breakpoints per request are allowed, which permits tiered reuse (system block cached for one hour, tool block cached for five minutes, per-session memory cached for five minutes).

  • TTL: default five minutes, rolling, so every read resets the clock. An extended one-hour TTL is available on eligible tiers at a higher cache-write multiplier.
  • Pricing: cache-write runs at roughly 1.25× the standard input rate for the five-minute TTL and roughly 2× for the one-hour TTL. Cache-read runs at roughly 0.1× the standard input rate. On Sonnet 4.6 at $3/MT input, cache-read lands near $0.30/MT, which is the same order of magnitude as the cheapest frontier model's full input price.
  • Hit semantics: exact prefix match. Everything from message[0] up to the breakpoint must byte-match a prior request's prefix. One character different, whether a timestamp, a UUID, or a reordered tool, and the cache misses silently.
  • Minimum block size: 1024 tokens for the larger Claude models (Sonnet, Opus), 2048 tokens for Haiku as of 2026-04. Below the minimum, cache_control is silently ignored.
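The layout is easiest to see in code. A minimal sketch, assuming the anthropic Python SDK; the model ID and file name are placeholders, not confirmed identifiers:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

filing_text = open("acme_10k.txt").read()  # placeholder stable corpus; must exceed 1024 tokens

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID for the Sonnet 4.6 cited below
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a filing analyst. Answer from the filing only.\n\n"
            + filing_text,
            # Breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Only the variable question sits below the breakpoint.
    messages=[{"role": "user", "content": "Extract revenue by segment."}],
)

# cache_creation_input_tokens counts the write; cache_read_input_tokens counts hits.
print(response.usage)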

OpenAI prompt caching

OpenAI ships caching automatically on inputs above a threshold (currently 1024 tokens) on supported models[2]. There is no cache_control parameter; the API inspects the prefix and returns cache-hit metrics in the response.

  • TTL: shorter than Anthropic's, typically 5 to 10 minutes of inactivity, with the exact number undocumented and subject to change. During peak load the effective TTL can drop further.
  • Pricing: cache-read runs at roughly 0.5× the standard input rate on most current models, with no separate cache-write charge. Less aggressive than Anthropic's 0.1× read discount, but zero friction to enable.
  • Hit semantics: prefix match on the request. The response includes cached_tokens in the usage object, so hit rate is observable per call.
  • Minimum block size: ~1024 tokens. Below the threshold, no caching is attempted.
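Because caching is automatic, the only integration work is ordering the request and watching the hit counter. A sketch, assuming the openai Python SDK; the model ID and prompt contents are placeholders:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

filing_text = open("acme_10k.txt").read()  # placeholder stable corpus
stable_prefix = "You are a filing analyst. Answer from the filing only.\n\n" + filing_text

response = client.chat.completions.create(
    model="gpt-4o",  # any caching-enabled model
    messages=[
        {"role": "system", "content": stable_prefix},  # stable block first
        {"role": "user", "content": "Extract revenue by segment."},  # volatile tail
    ],
)

# Zero on the first call; positive on later calls that share the prefix.
print(response.usage.prompt_tokens_details.cached_tokens)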

Gemini context caching

Gemini exposes caching as a first-class resource: the caller creates a CachedContent object with an explicit TTL, then references it by handle on subsequent calls[3].

  • TTL: user-specified. Default one hour, extendable. Storage is billed per minute of retention.
  • Pricing: cache-read runs at roughly 0.25× the standard input rate on Gemini 2.5 Pro, plus a per-hour storage charge per million cached tokens.
  • Hit semantics: handle-based, not prefix-based. A cache reference is explicit; there is no silent-miss-on-string-mismatch failure mode.
  • Minimum block size: ~4096 tokens on Gemini 2.5 Pro as of 2026-04. The handle-based model makes the minimum the most consequential constraint, because a small block cannot be cached at all.
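A sketch of the handle-based flow, assuming the google-generativeai Python SDK; the pinned model version and file name are illustrative, not the 2.5 Pro cited above:

import datetime
import os

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

filing_text = open("acme_10k.txt").read()  # must clear the ~4096-token floor

# The cache is an explicit resource with a TTL sized to the sweep.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # caching requires a pinned model version
    display_name="acme-10k",
    system_instruction="You are a filing analyst. Answer from the filing only.",
    contents=[filing_text],
    ttl=datetime.timedelta(hours=1),  # storage billed per minute of retention
)

# Subsequent calls reference the handle; no prefix matching is involved.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Extract revenue by segment.")
print(response.text)

cache.delete()  # stop the storage meter as soon as the sweep ends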

The three designs trade off differently. Anthropic's cache_control gives the tightest control and the largest read discount but the strictest prefix-match requirement. OpenAI's automatic caching is friction-free but the discount is smaller and TTL shorter. Gemini's handle-based caching eliminates the prefix-match trap but adds a storage line to the bill and a high minimum block size.

| Provider  | TTL                       | Cache-write              | Cache-read   | Min block | Hit model                          |
|-----------|---------------------------|--------------------------|--------------|-----------|------------------------------------|
| Anthropic | 5 min rolling / 1 hr opt. | ~1.25× / ~2× input       | ~0.1× input  | 1024 tok  | Explicit breakpoints, prefix match |
| OpenAI    | ~5–10 min, undocumented   | none                     | ~0.5× input  | ~1024 tok | Automatic, prefix match            |
| Gemini    | User-set (default 1 hr)   | standard input + storage | ~0.25× input | ~4096 tok | Handle-based                       |

The four design patterns that maximize hit rate

Four layout patterns cover nearly every finance workload. They are structural rules, not performance hints. A call that violates them will not hit the cache regardless of which provider is used.

a. Stable system prompt first, volatile user query last

The single most important rule. The request must be ordered [stable-block, variable-block] and nothing else. Any volatile token above the stable block invalidates the prefix. A practitioner who puts "Current UTC time: 2026-04-23 14:32:07" into the system prompt for logging convenience has just disabled caching for every downstream call. The observed bill looks normal, the hit-rate field in the response reads zero, and the practitioner debugs their code for an hour before noticing the offending line. Log the timestamp in metadata, never in the prompt itself.
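A before-and-after sketch of the trap; the only change is where the timestamp lives:

import datetime

SYSTEM_PROMPT = "You are a filing analyst. Answer from the filing only."

# Wrong: a fresh timestamp makes every request a unique prefix; zero hits, forever.
bad_system = (
    f"Current UTC time: {datetime.datetime.now(datetime.timezone.utc).isoformat()}\n"
    + SYSTEM_PROMPT
)

# Right: the prompt stays byte-identical; the timestamp travels in the request log.
log_entry = {"sent_at": datetime.datetime.now(datetime.timezone.utc).isoformat()}
good_system = SYSTEM_PROMPT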

b. Tool definitions in the cached block

Tool schemas for agents are large (a single schema for a SEC filing extractor can run 1200 tokens) and stable across a session. They belong inside the cache breakpoint. On Anthropic specifically, tools appear before messages in the canonical prefix, so a cache_control on the last tool or on the system message captures them automatically. Reordering or editing a tool schema mid-session invalidates the cache; lock the tool registry before opening a cached loop, and treat schema edits as deploys rather than runtime adjustments.
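A sketch of the placement, assuming the anthropic SDK and the client from the earlier sketch; the schema is an illustrative stand-in for a real extractor definition:

TOOLS = [
    # ... earlier tool schemas in the locked registry ...
    {
        "name": "extract_filing_metric",  # illustrative; a real schema runs ~1200 tokens
        "description": "Pull a named metric from the loaded SEC filing.",
        "input_schema": {
            "type": "object",
            "properties": {"metric": {"type": "string"}},
            "required": ["metric"],
        },
        # Breakpoint on the last tool: the entire tool block above it is cached.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID
    max_tokens=1024,
    tools=TOOLS,
    messages=[{"role": "user", "content": "Extract revenue by segment."}],
)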

c. Filing corpus cached, per-query retrieval snippet in tail

For a 10-K or 10-Q extraction loop, the filing itself is the stable block. The per-question text (the retrieval snippet pulled from a vector index, the instruction "extract revenue by segment", the output schema) goes in the tail. A sweep running four questions against one filing pays 1× cache-write on the filing plus 3× cache-read plus 4× the small tail. The Financial Document Token Estimator computes the breakeven point for a given filing size and question count.

d. Breakpoints pattern on Anthropic

Up to four cache_control breakpoints per request permit tiered reuse. A typical finance agent uses:

  1. Breakpoint after the system prompt (stable across all sessions).
  2. Breakpoint after the tool definitions (stable across a deploy).
  3. Breakpoint after the per-filing corpus (stable across a sweep over one filing).
  4. Tail: the variable user question, uncached.

Each breakpoint carries an independent TTL. A one-hour TTL on breakpoint 1 and five-minute TTLs on breakpoints 2 and 3 minimize cache-write costs while keeping the hit rate near 100% for an active loop.
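A sketch of the tiering, reusing names from the earlier sketches; the ttl field on cache_control selects the one-hour tier, and eligibility on a given account is an assumption to verify. The tool breakpoint from pattern b slots in via tools=TOOLS and is omitted here for brevity:

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Breakpoint 1: one-hour TTL for the cross-session block.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": filing_text,
            # Breakpoint 3: default five-minute TTL covers one filing's sweep.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # Tail: the variable question, deliberately uncached.
    messages=[{"role": "user", "content": "Extract revenue by segment."}],
)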

Worked example: a 500-filing nightly sweep

The shape: one practitioner runs a nightly sweep asking four questions of 500 10-Ks on Claude Sonnet 4.6. Each filing is 25K input tokens, each question adds 500 tokens of tail, each answer returns 400 output tokens. The sweep runs 22 business days per month.

Uncached baseline: every one of the 2000 calls pays full input on 25.5K tokens.

Cached version: for each filing, one cache-write on 25K tokens plus three cache-reads on the same 25K tokens, plus four uncached tails of 500 tokens. The tails are below the 1024-token minimum and not cached, which is fine because they are meant to be variable anyway.

The Python below computes the monthly total from published rates and returns the delta.

# Sonnet 4.6 rates, USD per 1M tokens, 2026-04 published
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00
CACHE_WRITE_RATE = 3.75   # ~1.25x input for 5-min TTL
CACHE_READ_RATE = 0.30    # ~0.1x input

FILINGS = 500
QUESTIONS_PER_FILING = 4
FILING_TOKENS = 25_000
TAIL_TOKENS = 500
OUTPUT_TOKENS = 400
BUSINESS_DAYS = 22

def cost_per_million(tokens, rate):
    return tokens * rate / 1_000_000

def uncached_nightly():
    calls = FILINGS * QUESTIONS_PER_FILING
    input_cost = cost_per_million(
        calls * (FILING_TOKENS + TAIL_TOKENS), INPUT_RATE
    )
    output_cost = cost_per_million(calls * OUTPUT_TOKENS, OUTPUT_RATE)
    return input_cost + output_cost

def cached_nightly():
    # Per filing: 1 cache-write on filing, 3 cache-reads on filing,
    # 4 uncached tails, 4 outputs.
    write = cost_per_million(FILING_TOKENS, CACHE_WRITE_RATE)
    read = cost_per_million(3 * FILING_TOKENS, CACHE_READ_RATE)
    tails = cost_per_million(
        QUESTIONS_PER_FILING * TAIL_TOKENS, INPUT_RATE
    )
    outs = cost_per_million(
        QUESTIONS_PER_FILING * OUTPUT_TOKENS, OUTPUT_RATE
    )
    per_filing = write + read + tails + outs
    return per_filing * FILINGS

uncached_night = uncached_nightly()
cached_night = cached_nightly()
saving = 1 - cached_night / uncached_night

print(f"uncached  : [USD] {uncached_night:,.2f} per night")
print(f"cached    : [USD] {cached_night:,.2f} per night")
print(f"saving    : {saving:.1%}")
print(f"uncached m: [USD] {uncached_night * BUSINESS_DAYS:,.2f}")
print(f"cached m  : [USD] {cached_night * BUSINESS_DAYS:,.2f}")

Running the numbers: uncached lands at $165 per nightly sweep and $3,630 per month. Cached lands near $73 per sweep and $1,609 per month. The saving is roughly 56 percent on the total bill and roughly 60 percent on the input-only line (outputs are unaffected by caching). A larger filing corpus or a higher questions-per-filing ratio pushes the input-only saving toward the 90 percent ceiling set by the 0.1× read rate; a small corpus with one question per filing pushes it below zero (a cache-write with no subsequent reads is pure premium).

Two sensitivities are worth calling out explicitly. First, the output cost is untouched by caching. If the workload is output-heavy (long generations, extensive chain-of-thought, large structured extraction schemas returned verbatim), the ceiling on savings drops. A sweep that returns 4000-token answers instead of 400-token answers has roughly the same absolute cache saving in dollars but a much smaller relative reduction on the total bill. Second, the breakeven point on questions-per-filing depends on the write premium. At Anthropic's 1.25× five-minute premium, two questions per filing already beats uncached; at the 2× one-hour premium, the breakeven moves to three. Practitioners running very short per-filing loops should stay on the five-minute tier and use keep-alive reads rather than paying the one-hour write multiplier.
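The breakeven arithmetic in the last paragraph reduces to one inequality: caching wins once write + (n − 1) × read costs less than n × input, where n is questions per filing and the block size cancels out. A quick check in the same Python:

import math

def breakeven_questions(input_rate, write_rate, read_rate):
    # Smallest n with write + (n - 1) * read < n * input; block size cancels.
    return math.ceil((write_rate - read_rate) / (input_rate - read_rate))

print(breakeven_questions(3.00, 3.75, 0.30))  # 2 on the 1.25x five-minute tier
print(breakeven_questions(3.00, 6.00, 0.30))  # 3 on the 2x one-hour tier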

The Batch vs Real-Time Cost Calculator and the Token Cost Optimizer compose with this math: a cached block run through the batch API stacks both discounts on the input line.

Failure modes

Four traps account for nearly every reported zero-hit-rate bug. Each is structural and each has a specific fix.

TTL expiry mid-loop

A sweep that pauses 6 minutes between filings on Anthropic's 5-minute TTL will re-write the cache every filing. Fix: insert a no-op keep-alive read every 3–4 minutes (a short request that references the breakpoint), or switch to the 1-hour TTL tier and accept the higher write premium. On Gemini, set an explicit TTL long enough to cover the full loop.
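A sketch of the keep-alive guard, with hypothetical names; the helper wraps the anthropic client from the earlier sketches:

import time

CACHE_TTL_SECONDS = 5 * 60
KEEPALIVE_MARGIN = 90  # refresh when within ~1.5 minutes of expiry

last_touch = time.monotonic()

def touch_cache_if_stale(client, cached_system_blocks):
    """Reset the rolling TTL with a tiny read against the same prefix."""
    global last_touch
    if time.monotonic() - last_touch < CACHE_TTL_SECONDS - KEEPALIVE_MARGIN:
        return
    # Any request sharing the cached prefix counts as a read and resets the clock;
    # a one-word tail and max_tokens=1 keep the keep-alive itself nearly free.
    client.messages.create(
        model="claude-sonnet-4-6",  # placeholder ID
        max_tokens=1,
        system=cached_system_blocks,  # must be byte-identical to the cached prefix
        messages=[{"role": "user", "content": "ping"}],
    )
    last_touch = time.monotonic()

Every real call should also update last_touch, since it resets the TTL for free.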

Accidentally volatile system prompt

A timestamp, a request ID, a session UUID, a rotating A/B flag: any of these at the top of the prompt invalidates the entire prefix for every downstream request. The fix is a discipline rule: no volatile content in any block above a cache_control marker. Variable metadata goes in the tail or in request headers, never in the prompt body. This is the single most common failure mode in practice and is silent, because the API does not warn. The only observable signal is that the cache-read counter in the usage object (cache_read_input_tokens on Anthropic, cached_tokens on OpenAI) reads zero call after call while the billing dashboard keeps charging full input rate.

Blocks below the minimum

A 600-token system prompt with cache_control is a no-op. The API ignores the marker and bills standard input. The fix is to combine small blocks (merge system prompt and tool descriptions under one breakpoint) or acknowledge that a small block is not cacheable and remove the marker. The Financial Document Token Estimator reports token counts for prospective cache blocks so the minimum check is explicit before a deploy.
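The check is cheap to automate. A sketch, assuming the anthropic SDK's count_tokens endpoint and the client from the earlier sketches; candidate_block holds the prospective cached string:

MIN_CACHE_TOKENS = 1024  # 2048 on Haiku

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",  # placeholder ID
    system=[{"type": "text", "text": candidate_block}],
    messages=[{"role": "user", "content": "x"}],  # a message is required to count
)
if count.input_tokens < MIN_CACHE_TOKENS:
    raise ValueError(
        f"cache block is {count.input_tokens} tokens; below "
        f"{MIN_CACHE_TOKENS} the cache_control marker is a silent no-op"
    )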

User content inside the cached block

A practitioner who interpolates the user's question into the system template (f"System: ... Question: {query} ...") and then marks the template as cached has created a block that is volatile by construction. Every distinct query is a distinct prefix and the cache hit rate is zero by definition. Fix: the cached block is a static string; the user question is a separate message appended after the breakpoint. Code review for cached endpoints should check that the string containing cache_control is a module-level constant, never an f-string or template.format() call.
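The reviewable version of that rule, as a sketch reusing filing_text from the earlier sketches:

# Right: the cached string is a module-level constant, byte-identical on every call.
CACHED_SYSTEM = (
    "You are a filing analyst. Answer from the filing only.\n\n" + filing_text
)

def ask(client, query: str):
    return client.messages.create(
        model="claude-sonnet-4-6",  # placeholder ID
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": CACHED_SYSTEM,  # never an f-string over the query
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The question rides in the tail, below the breakpoint.
        messages=[{"role": "user", "content": query}],
    )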

When caching does not help

Caching is not free. The write premium plus the storage or TTL management overhead creates a breakeven point below which standard input is cheaper.

  • One-off analyses: a single call against a novel corpus pays the write premium with no subsequent reads to amortize it.
  • Exploratory work with frequent prompt edits: every edit invalidates the prefix; the cache is continuously rewritten and never hit.
  • Small corpora below the minimum block: Gemini's 4096-token floor in particular rules out a lot of short-document work.
  • High variance in the system prompt: A/B-testing prompts, per-user prompt customization, or rotating style instructions all defeat caching. If the prompt changes more often than it is reused, the math does not work.

A practical rule: caching pays when the reuse count is at least three within one TTL. Below that, standard input is cheaper. Above that, the saving scales roughly linearly with reuse count up to the point where output cost dominates. The Token Cost Optimizer takes reuse count as an input and shows the breakeven line.

One more consideration belongs on this list: prompt-level A/B testing for output quality. A setup that runs two prompt variants for every question to measure which produces better extractions will never hit the cache on either variant unless each variant is held stable across enough calls to amortize its own write. The cleaner pattern is to fix the prompt for a day or a week at a time and run a holdout on the winning variant, rather than interleaving variants at the call level. Caching and online prompt A/B testing are fundamentally at odds; pick one.


Footnotes

  1. Anthropic (2026). "Prompt caching." Anthropic API documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching (accessed 2026-04-23). Rates and TTLs cited as of this access date; verify against current docs before budget decisions.

  2. OpenAI (2026). "Prompt caching." OpenAI API documentation. https://platform.openai.com/docs/guides/prompt-caching (accessed 2026-04-23).

  3. Google (2026). "Context caching." Gemini API documentation. https://ai.google.dev/gemini-api/docs/caching (accessed 2026-04-23).