aifinhub

How the Fallback Chain Simulator works

The Fallback Chain Simulator models the pattern most production LLM agents eventually reach for: a primary provider that handles the happy path, a first fallback that catches rate limits and latency blowups, and an optional second fallback for full-provider outages. It reports how often the whole chain makes it inside your deadline, how much you pay per call on average, and where requests actually end up landing.

What the tool computes

For each Monte Carlo trial (default 1,000), the simulator walks the chain in order. For every leg it first rolls for a 429 / throttle response; a throttled leg adds a fixed 50 ms to the elapsed time, incurs no token cost, and hands off to the next leg. Otherwise it draws a latency sample from an exponential distribution, adds it to the elapsed time, and charges that leg's token cost. A leg "succeeds" when it does not throttle and the cumulative elapsed time stays at or under your deadline. If a leg fails, control passes to the next leg, and the time already spent (plus any cost already charged) is carried forward. If no leg succeeds in time, the trial is a failure and contributes the deadline value to the latency distribution.

Outputs: overall success rate, p50 / p95 / p99 latency, aggregate and per-call cost across the trial set, number of trials that degraded to fallback 1 or fallback 2, and a share-of-successes bar chart per provider. A short recommendation flags whether the current primary is the best cost-per-successful-call leg given the inputs — a common configuration error is picking a cheap but unreliable primary whose failures push most traffic (and most cost) onto the fallback.
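As a rough illustration of how these outputs could be derived from per-trial records, here is a minimal sketch. It assumes each trial is stored as a tuple `(succeeded, latency_ms, cost, leg_index)`; that record shape, the function names, and the nearest-rank percentile convention are assumptions made for this example, not the tool's actual internals.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile (an assumed convention), with q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(results, leg_names):
    """Aggregate trial records into the headline outputs.

    results: list of (succeeded, latency_ms, cost, leg_index or None)
    leg_names: names in chain order, primary first
    """
    n = len(results)
    latencies = sorted(r[1] for r in results)
    # Index of the leg that served each successful trial.
    wins = [r[3] for r in results if r[0]]
    total_cost = sum(r[2] for r in results)
    return {
        "success_rate": len(wins) / n,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "total_cost": total_cost,
        "cost_per_call": total_cost / n,
        # Trials that were served by a fallback rather than the primary.
        "degraded_trials": {
            name: wins.count(i) for i, name in enumerate(leg_names) if i > 0
        },
        # Fraction of successful trials served by each leg.
        "share_of_successes": {
            name: (wins.count(i) / len(wins) if wins else 0.0)
            for i, name in enumerate(leg_names)
        },
    }
```

Failed trials stay in the latency list at the deadline value, so a low success rate shows up as p95 / p99 pinned to the deadline.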

Inputs and assumptions

Formulas

for each trial:
  elapsed = 0; cost = 0
  for leg in [primary, fallback1, fallback2?]:
    if uniform() < leg.rate_429:
      elapsed += 50ms
      continue
    mean      = max(model.p50, leg.p99 / ln(100))
    latency   = -ln(1 - uniform()) * mean   # exponential
    elapsed  += latency
    cost     += leg.input_tokens * price_in
              + leg.output_tokens * price_out
    if elapsed <= deadline:
      return success
  return failure
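
The pseudocode above can be turned into a runnable sketch along these lines. The `Leg` dataclass, its field names, the per-token price unit, and the seeded `random.Random` are assumptions for illustration. For simplicity the p50 floor is read from the leg itself, whereas the pseudocode takes it from the model.

```python
import math
import random
from dataclasses import dataclass

THROTTLE_PENALTY_MS = 50.0  # fixed time lost to a 429, as in the pseudocode


@dataclass
class Leg:
    name: str
    rate_429: float      # probability of a throttle response, 0..1
    p50_ms: float        # median latency, used as a floor on the mean
    p99_ms: float        # tail latency
    input_tokens: int
    output_tokens: int
    price_in: float      # assumed unit: currency per input token
    price_out: float     # assumed unit: currency per output token


def run_trial(legs, deadline_ms, rng):
    """Walk the chain once; return (succeeded, latency_ms, cost, leg_index)."""
    elapsed = 0.0
    cost = 0.0
    for i, leg in enumerate(legs):
        if rng.random() < leg.rate_429:
            elapsed += THROTTLE_PENALTY_MS  # throttled: lose time, pay nothing
            continue
        # For an exponential distribution, p99 = mean * ln(100).
        mean = max(leg.p50_ms, leg.p99_ms / math.log(100))
        elapsed += rng.expovariate(1.0 / mean)
        # The call is charged even if it finishes after the deadline.
        cost += leg.input_tokens * leg.price_in + leg.output_tokens * leg.price_out
        if elapsed <= deadline_ms:
            return True, elapsed, cost, i
    # No leg finished in time: record the deadline as the trial's latency.
    return False, deadline_ms, cost, None


def simulate(legs, deadline_ms, trials=1000, seed=0):
    """Run independent trials and return the list of per-trial records."""
    rng = random.Random(seed)
    return [run_trial(legs, deadline_ms, rng) for _ in range(trials)]
```

Passing a fixed `seed` makes runs reproducible, which helps when comparing two chain configurations against the same random draws.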

Recommendation logic

Each leg gets an expected cost-per-successful-call score: cost_per_success = call_cost / P(success), where call_cost = input_tokens × price_in + output_tokens × price_out and P(success) = (1 − rate_429) × P(latency ≤ deadline). The recommendation flags the leg with the lowest score; if it is not the current primary, it suggests swapping. This is a per-leg comparison, not an end-to-end optimization: the Monte Carlo result above still reflects the chain you actually configured.
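
A minimal sketch of this scoring follows. It assumes the same exponential latency model as the simulation, so P(latency ≤ deadline) = 1 − exp(−deadline / mean). The dictionary keys and function names are hypothetical, and each leg is evaluated in isolation with the full deadline available to it.

```python
import math


def p_success(rate_429, p50_ms, p99_ms, deadline_ms):
    """(1 - rate_429) * P(latency <= deadline) under an exponential model."""
    mean = max(p50_ms, p99_ms / math.log(100))
    p_in_time = 1.0 - math.exp(-deadline_ms / mean)  # exponential CDF
    return (1.0 - rate_429) * p_in_time


def cost_per_success(leg, deadline_ms):
    """Expected spend per successful call if this leg were used on its own."""
    call_cost = (leg["input_tokens"] * leg["price_in"]
                 + leg["output_tokens"] * leg["price_out"])
    p = p_success(leg["rate_429"], leg["p50_ms"], leg["p99_ms"], deadline_ms)
    return math.inf if p == 0 else call_cost / p


def recommend(legs, deadline_ms):
    """Return (name of the lowest-scoring leg, all scores keyed by name)."""
    scores = {leg["name"]: cost_per_success(leg, deadline_ms) for leg in legs}
    best = min(scores, key=scores.get)
    return best, scores
```

If the returned name differs from the first leg in the chain, that corresponds to the "suggest swapping" case described above.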

Limitations

Pricing sources


Planning estimates only — not financial, tax, or investment advice.