How the Fallback Chain Simulator works
The Fallback Chain Simulator models the pattern most production LLM agents eventually reach for: a primary provider that handles the happy path, a first fallback that catches rate limits and latency blowups, and an optional second fallback for full-provider outages. It reports how often the whole chain makes it inside your deadline, how much you pay per call on average, and where requests actually end up landing.
What the tool computes
For each Monte Carlo trial (default 1,000), the simulator walks the chain in order. For every leg it rolls two independent outcomes: a 429 / throttle response, and a latency sample drawn from an exponential distribution. A leg "succeeds" when it does not throttle and the cumulative elapsed time stays at or under your deadline. If a leg fails, control passes to the next leg and the already-spent time is carried forward. If no leg succeeds in time, the trial is a failure and contributes the deadline value to the latency distribution.
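The per-trial walk described above can be sketched in TypeScript. This is an illustrative sketch, not the actual sim-engine.ts code; the `Leg` shape and `runTrial` name are assumptions, and the per-call cost is precomputed into a single number for brevity:

```typescript
// Illustrative sketch of one Monte Carlo trial. Leg/runTrial are hypothetical names.
interface Leg {
  rate429: number;      // probability of a throttle response, 0–0.3
  p99Ms: number;        // user-supplied tail latency for this leg
  p50RefMs: number;     // published p50 reference for the model
  costPerCall: number;  // token cost billed when the call is actually made
}

function runTrial(legs: Leg[], deadlineMs: number, rand: () => number) {
  let elapsed = 0;
  let cost = 0;
  for (const leg of legs) {
    if (rand() < leg.rate429) {
      elapsed += 50;      // throttled: ~50 ms overhead, zero tokens billed
      continue;           // fall through to the next leg
    }
    const mean = Math.max(leg.p50RefMs, leg.p99Ms / Math.log(100));
    const latency = -Math.log(1 - rand()) * mean; // inverse-CDF exponential sample
    elapsed += latency;
    cost += leg.costPerCall;
    if (elapsed <= deadlineMs) return { success: true, elapsed, cost };
  }
  // No leg finished in time: the trial records the deadline as its latency.
  return { success: false, elapsed: deadlineMs, cost };
}
```

Passing `rand` in explicitly mirrors the tool's use of a seeded RNG for reproducible runs.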
Outputs: overall success rate, p50 / p95 / p99 latency, aggregate and per-call cost across the trial set, number of trials that degraded to fallback 1 or fallback 2, and a share-of-successes bar chart per provider. A short recommendation flags whether the current primary is the best cost-per-successful-call leg given the inputs — a common configuration error is picking a cheap but unreliable primary whose failures push most traffic (and most cost) onto the fallback.
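The percentile readouts can be computed with the nearest-rank method over the per-trial latency samples (failed trials contribute the deadline value, as noted above). A minimal sketch; the actual engine may interpolate differently:

```typescript
// Nearest-rank percentile over the trial latency samples. Illustrative only.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, rank)];
}
```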
Inputs and assumptions
- Provider + model per leg. Anthropic, OpenAI, and Google model lists are hardcoded in `sim-engine.ts` with published per-token pricing as of 2026-04.
- 429 rate (0–30%). Probability of a rate-limit / throttle response on that leg. A 429 costs no tokens but adds ~50 ms of overhead before falling through.
- p99 latency (ms). The user-supplied long tail. The simulator derives an exponential mean as `max(p50_ref, p99 / ln(100))`, so a small p99 never pushes the mean below the published p50 reference for that model.
- Deadline (ms). End-to-end budget for the full chain. The default of 1000 ms reflects a typical interactive agent step.
- Trials. Monte Carlo sample size. Higher trials shrink estimator variance; the default 1,000 is enough for stable success-rate and p95/p99 readings.
- Input / output tokens. Applied identically to every leg that is actually invoked. A 429-throttled leg bills zero tokens.
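The mean derivation in the p99 input above follows from the exponential distribution, where p99 = mean × ln(100) and P(latency ≤ t) = 1 − exp(−t / mean). A sketch under that assumption (helper names are illustrative, not the engine's API):

```typescript
// Exponential tail math behind the p99 input. Illustrative helper names.
// For Exp(mean): P(latency <= t) = 1 - exp(-t / mean), so p99 = mean * ln(100).
function exponentialMean(p50RefMs: number, p99Ms: number): number {
  return Math.max(p50RefMs, p99Ms / Math.log(100)); // clamp so mean >= published p50
}

function probWithinDeadline(meanMs: number, deadlineMs: number): number {
  return 1 - Math.exp(-deadlineMs / meanMs); // exponential CDF at the deadline
}
```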
Formulas
```
for each trial:
    elapsed = 0; cost = 0
    for leg in [primary, fallback1, fallback2?]:
        if uniform() < leg.rate_429:
            elapsed += 50ms        # throttled: overhead only, no tokens billed
            continue               # fall through to the next leg
        mean = max(model.p50, leg.p99 / ln(100))
        latency = -ln(1 - uniform()) * mean    # inverse-CDF exponential sample
        elapsed += latency
        cost += leg.input_tokens * price_in + leg.output_tokens * price_out
        if elapsed <= deadline:
            return success
    return failure
```
Recommendation logic
Each leg gets an expected cost-per-successful-call score:

    cost_per_success = tokens × price / P(success)
    P(success) = (1 − rate_429) × P(latency ≤ deadline)

The recommendation flags the leg with the lowest score; if it is not the current primary, it suggests swapping. This is a per-leg comparison, not an end-to-end optimization; the Monte Carlo result above still reflects the chain you actually configured.
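Under the exponential latency model, P(latency ≤ deadline) has a closed form, so the score can be computed directly. A sketch with illustrative names (not the exact engine code); prices here are assumed to be in dollars per token:

```typescript
// Expected cost per successful call for a single leg, per the formula above.
// Illustrative names; priceIn/priceOut are $/token.
function costPerSuccess(
  inputTokens: number, outputTokens: number,
  priceIn: number, priceOut: number,
  rate429: number, meanLatencyMs: number, deadlineMs: number,
): number {
  const callCost = inputTokens * priceIn + outputTokens * priceOut;
  const pInTime = 1 - Math.exp(-deadlineMs / meanLatencyMs); // exponential CDF
  const pSuccess = (1 - rate429) * pInTime;
  return callCost / pSuccess;
}
```

A cheap leg with a high 429 rate scores worse here than its sticker price suggests, which is exactly the misconfiguration the recommendation is meant to catch.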
Limitations
- Independent failures. The simulator treats each leg's 429 and latency as independent draws. Real outages are correlated (a Cloudflare or DNS event can take out more than one provider at once). Your measured fallback rate during a genuine incident will be worse than the tool predicts.
- Stationary rates. 429 probability is a constant across trials. In production, rate-limit pressure spikes with your own traffic — your effective 429 is load-dependent.
- Exponential latency. Good enough for the tail shape, but real API latency distributions are usually bimodal (fast path + retry path). Use observed p99 from your own telemetry, not a vendor brochure number.
- No streaming credit. A call that streams partial tokens before the deadline is treated here as a full failure if total latency exceeds the deadline.
- Pricing drift. Per-token rates reflect published pricing as of 2026-04. Re-check the vendor page before committing a budget.
- Planning tool, not investment advice. This calculates infrastructure economics, not portfolio decisions.
Related articles
- Rate-limit design for LLM research loops — where the fallback-chain pattern comes from and why naive retries make throttling worse.
- Observability for LLM trading agents — which latency and error-rate metrics you actually need to feed into this simulator.
- Bounded-cost agentic research loops — the cost side of the chain: how fallback selection changes your $/validated-idea budget.
Changelog
- 2026-04-23 — Initial release. Three providers, up to two fallbacks, deterministic mulberry32 RNG for reproducible trial runs.