TL;DR

Trading apps with LLM-augmented research need explicit latency budgets per call, expressed as P50/P95/P99, not single-point averages. The math is unforgiving. A multi-step agent with three model calls at "1 second average each" routinely produces a P99 end-to-end latency near 18 seconds once tool-call retries and cold-cache writes compound. This article walks through the queue-theory bound that determines tail behaviour, the per-call budget sheet for a typical research-then-execute pipeline, the Anthropic and OpenAI tail-latency numbers as published in their docs through April 2026, and the architecture patterns (speculative parallel execution, async pre-warm, hard cancellation) that hold P99 inside a usable bound. Run your own pipeline numbers through the Agent Cost Envelope Calculator and stress-test the execution path with the Execution Simulator.

Why averages lie about LLM latency

Most teams quote LLM latency as "about a second" or "around 800ms TTFT." Those numbers describe the median of a long-tailed distribution, and the median is the wrong number for trading. A trade that takes 1 second 50% of the time and 12 seconds 1% of the time is a 12-second trade for the purpose of stop-loss exposure, slippage budgeting, and catastrophic-failure planning.

The Anthropic API status reports posted through 2026 show Claude 4.7 Sonnet TTFT median around 350ms and P99 above 4500ms during steady operation. The OpenAI status page for GPT-5 family models similarly shows P99 latencies four to twelve times the median during normal load, with multi-second spikes during regional traffic shifts. Google's Gemini 2.5 docs publish a P99 latency target around 6× the P50 for the Pro tier.

The tail dominates compounded multi-step pipelines. If each call independently has a P99 at 6× its median, a three-step chain exceeds some call's P99 on roughly 3% of runs, so the slowest stage at the chain's end-to-end P99 is drawn from nearer its 99.7th percentile, closer to 8× its median than 6×: the bad call lands somewhere in the chain almost certainly once the chain is long enough. Treat the median as the optimistic-case figure for capacity planning. Treat the P95 as the budget. Treat the P99 as the hard timeout.

The queue-theory bound

LLM serving sits on a queue. Each request joins a worker pool with finite capacity, waits for the next free worker, occupies that worker for a generation duration, and leaves. The queue has a known mathematical structure: an M/G/c queue, meaning roughly Poisson arrivals, a general (in practice heavy-tailed) service-time distribution, and c parallel workers.

The relevant result for capacity planning is that as utilization rises toward 1, the expected queueing delay grows like 1/(1 - ρ), where ρ is the worker utilization fraction: the multiplier on the variance-weighted service time is roughly 3× at 70% utilization, 10× at 90%, and 20× at 95%. Provider-side autoscaling smooths this, but autoscaling has its own response time, and during traffic spikes the queue sits in the bad regime for minutes before new workers come online.
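
To make the blow-up concrete, here is a minimal sketch using Kingman's heavy-traffic approximation (reference 4). It takes a single-server view of what is really an M/G/c pool, and the 0.8-second service time and variability coefficients are assumed illustration values, not provider measurements.

    # Sketch: Kingman's heavy-traffic approximation (reference 4) for the
    # expected queueing delay. Single-server view of an M/G/c pool; the
    # 1/(1 - rho) blow-up has the same shape. Service time and variability
    # figures are assumed illustration values, not measurements.
    def kingman_wait(rho: float, mean_service_s: float,
                     ca2: float = 1.0, cs2: float = 1.0) -> float:
        # ca2, cs2: squared coefficients of variation of inter-arrival and
        # service times; a heavy-tailed generation time means cs2 >> 1.
        return (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * mean_service_s

    for rho in (0.5, 0.7, 0.9, 0.95):
        print(f"utilization {rho:.0%}: expected wait ~{kingman_wait(rho, 0.8):.1f}s "
              "on top of a 0.8s call")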

What this means for the pipeline budget: a model call that takes 800ms inside the worker can show 4–8 seconds wall-clock during the morning US-market open, when LLM API usage spikes. Trading windows are exactly the moments when latency is at its worst. The mitigation is not to demand a tighter SLA from the provider (none of the providers offer one for chat-completion APIs through 2026); it is to architect around the queue.

A worked latency budget for a research-then-execute pipeline

Consider a typical pipeline. The agent receives a market event, runs three retrieval-and-reasoning steps, produces a structured trade decision, and submits the order. Each step is one LLM call plus optional tool calls.

Stage                                Median    P95      P99
Step 1: ingest, classify event       450ms     1.6s     3.8s
Step 1 tool call: market data API    80ms      250ms    800ms
Step 2: retrieval + analysis         700ms     2.8s     5.4s
Step 2 tool call: filing fetch       150ms     450ms    1.4s
Step 3: decision synthesis           550ms     2.1s     4.6s
Order submission (broker REST)       120ms     380ms    1.2s
End-to-end                           2.05s     7.6s     17.2s

The P99 is 17 seconds for a pipeline whose median is 2 seconds. That is the gap that bites in production. A retail equity order that is "obvious" at the top of the book is no longer obvious 17 seconds later, and the slippage you eat on the rare slow run can wipe out the edge from the fast ones.

The end-to-end row above is the simple sum of the per-stage figures, which overcounts the tail: a run in which every stage draws its P99 simultaneously is far rarer than 1 in 100, so treat the 17.2 seconds as the conservative ceiling for timeout planning. A tighter approximation, when stages are serial and roughly independent, is the P99 of the bottleneck stage plus the sum of the P95s of the others, which lands near 10 seconds for this pipeline. The number to actually budget against comes from a Monte Carlo run over the per-stage distributions, sketched below.
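
A minimal sketch of that Monte Carlo, assuming each stage is lognormal and fitting the two lognormal parameters to the median and P95 columns above; the distributional choice is an assumption, so swap in your instrumented samples where you have them.

    import numpy as np

    # (median_s, p95_s) per serial stage, from the budget table above.
    stages = {
        "step1_classify":  (0.45, 1.6),
        "step1_tool":      (0.08, 0.25),
        "step2_retrieval": (0.70, 2.8),
        "step2_tool":      (0.15, 0.45),
        "step3_synthesis": (0.55, 2.1),
        "order_submit":    (0.12, 0.38),
    }

    Z95 = 1.645                      # standard-normal 95th-percentile quantile
    rng = np.random.default_rng(0)
    n = 100_000

    total = np.zeros(n)
    for median, p95 in stages.values():
        mu = np.log(median)                    # lognormal median = exp(mu)
        sigma = np.log(p95 / median) / Z95     # fit sigma from the stage P95
        total += rng.lognormal(mu, sigma, n)   # serial stages add

    for q in (50, 95, 99):
        print(f"end-to-end P{q}: {np.percentile(total, q):.2f}s")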

Per-stage budget allocation

Once the end-to-end target is set (say, 8 seconds for a non-time-critical research pipeline; 2 seconds for a fast-execution pipeline), the per-stage budget falls out of the math.

The constraint that matters: the sum of stage medians must stay under the end-to-end P95 target divided by a safety factor. A factor of 3 is the working number; the empirical ratio between end-to-end P95 and sum-of-medians in published API latency data sits around 3×. So an 8-second P95 target permits a sum of medians around 2.7 seconds; a 2-second P95 target permits roughly 700ms, which forces architectural compression (fewer steps, smaller models, or parallelism).

The compression options, ordered by cost:

Drop a step. The cheapest gain. If two of the three reasoning steps can collapse into one prompt with a chain-of-thought scratchpad inside, the latency budget drops by one full stage and the queue-theory tail compresses.

Use a smaller model for early stages. Classification and routing rarely need Opus-tier reasoning. A Haiku-tier or 4o-mini-tier model at 200ms median frees most of the budget for the final synthesis step, which is the call that needs the most capability.

Parallelize independent calls. If steps 2A and 2B do not depend on each other (a filing analysis and a price-history analysis, for example), run them concurrently. The end-to-end median is then max(A, B) instead of A + B, and the P99 improves by roughly 1.7× rather than the naive 2×, because the parallel max is dominated by whichever leg drew the bad sample.
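
A minimal sketch of that fan-out, with hypothetical run_filing_analysis and run_price_history coroutines standing in for whatever steps 2A and 2B actually call; the lognormal sleeps only simulate long-tailed call latency.

    import asyncio, random

    # Hypothetical stand-ins for two independent LLM-plus-tool steps; the
    # sleeps simulate long-tailed call latency (~0.7s median).
    async def run_filing_analysis(event: dict) -> dict:
        await asyncio.sleep(random.lognormvariate(-0.36, 0.9))
        return {"filing_view": "..."}

    async def run_price_history(event: dict) -> dict:
        await asyncio.sleep(random.lognormvariate(-0.36, 0.9))
        return {"price_view": "..."}

    async def step2(event: dict) -> tuple[dict, dict]:
        # Both results are needed downstream, so wall-clock is max(A, B),
        # not A + B.
        filing, prices = await asyncio.gather(
            run_filing_analysis(event), run_price_history(event))
        return filing, prices

    # asyncio.run(step2({"ticker": "XYZ"}))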

The Agent Cost Envelope Calculator lets you plug per-stage models and observed latency distributions and outputs the end-to-end P50/P95/P99 along with the model-cost envelope at each percentile, so you can see the cost-latency tradeoff explicitly.

Tail-killing patterns

Three architectural patterns reliably keep P99 inside a usable bound.

Hard cancellation with a fallback model. Set a hard timeout per stage at the P95 budget. On timeout, abort the in-flight request and route to a smaller, faster model whose median is well inside the remaining budget. The cost of the wasted call is small relative to the cost of breaching the end-to-end timeout. This is the single highest-impact pattern; it converts the long tail of one model into the median of another.
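
A minimal sketch of the cancellation path, assuming call_model(model, prompt) is your own async wrapper around the provider client; the model names and the 2.8-second budget are placeholders.

    import asyncio

    # Assumed async wrapper around your provider client; not a real SDK call.
    async def call_model(model: str, prompt: str) -> str: ...

    async def stage_with_fallback(prompt: str,
                                  primary: str = "large-model",        # placeholder
                                  fallback: str = "small-fast-model",  # placeholder
                                  budget_s: float = 2.8) -> str:
        try:
            # Hard per-stage timeout set at the stage's P95 budget.
            return await asyncio.wait_for(call_model(primary, prompt), budget_s)
        except asyncio.TimeoutError:
            # wait_for cancels the in-flight request; eat the wasted call and
            # convert the primary model's tail into the fallback's median.
            return await call_model(fallback, prompt)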

Speculative parallel execution. When the same query can be answered by two providers with uncorrelated latency tails, fire both and take the first response. The P99 of min(X, Y), where X and Y are independent draws from the same distribution, is the P90 of X: the run is slow only when both legs are slow. Against provider tails running several times the median, that is a 6× to 8× tail compression for a 2× cost. This is worth doing on the critical-path call only, not every call.
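
A minimal sketch of the hedged request (the pattern from reference 5), again with assumed call_provider_a and call_provider_b wrappers.

    import asyncio

    # Assumed async wrappers that send the same query to two providers.
    async def call_provider_a(prompt: str) -> str: ...
    async def call_provider_b(prompt: str) -> str: ...

    async def hedged_call(prompt: str) -> str:
        tasks = [asyncio.create_task(call_provider_a(prompt)),
                 asyncio.create_task(call_provider_b(prompt))]
        # Take whichever leg answers first; cancel the loser so the slow
        # request does not keep burning tokens.
        done, pending = await asyncio.wait(tasks,
                                           return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()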

Async pre-warm of caches. The provider-side prompt cache penalizes the first call (cache-write premium, 1.25× for Anthropic 5-minute TTL). For predictable workloads (the morning research pass that always starts with the same system prompt, tool block, and watchlist filings), fire a warmup call 30 seconds before the user-triggered call. The user-triggered call lands on a warm cache, and the median latency drops by 100–400ms, which is most of the difference between a 1.5-second median and a sub-1-second one.

The warmup is not free. It pays the cache-write cost without adding signal. The math works when the warmup is amortized over multiple subsequent calls within the TTL; at one warmup per session of 6+ calls, the per-call cost adjustment is in the noise.
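
The amortization, as a sketch. The 1.25× write premium is the Anthropic figure cited above; the 20,000-token prefix and the session lengths are assumed values to vary against your own workload.

    # Sketch of the warmup amortization. The 1.25x write premium is the
    # Anthropic cache-write figure cited above; prefix size and session
    # lengths are assumptions.
    def warmup_overhead_per_call(prefix_tokens: int, calls_per_session: int) -> float:
        # The warmup call pays the cache-write premium on the shared prefix
        # and contributes no signal; amortize it over the session's calls.
        return 1.25 * prefix_tokens / calls_per_session

    for n in (1, 3, 6, 12):
        print(f"{n:>2} calls/session: ~{warmup_overhead_per_call(20_000, n):,.0f} "
              "input-token-equivalents of warmup overhead per call")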

What the providers actually publish

A summary of what is documented through April 2026, useful as a baseline before you measure your own production traffic.

Anthropic. The status page at status.anthropic.com publishes rolling 30-day latency percentiles per model. Claude 4.7 Sonnet, normal operation, reports a TTFT median around 350ms and P99 around 4500ms. The streaming-token latency (the time between successive tokens after the first) sits around 25-40ms median. Anthropic does not publish a chat-completion SLA for tail latency.

OpenAI. The status page at status.openai.com posts incident reports with latency context but does not publish standing percentile numbers. Community measurements through Q1 2026 cluster GPT-5 TTFT median around 600ms, P99 around 5-7 seconds. Streaming-token latency around 30ms median. The Realtime API, used for sub-second voice round-trips, has a separate tail profile that is not directly comparable.

Google. Gemini 2.5 Pro docs publish a P99 latency target around 6× the P50, with no specific number guaranteed. The 2.5 Flash tier has materially lower latency variance and is the right default for routing-and-classification stages.

What none of the providers publish, and what you have to measure: the latency distribution under your specific traffic mix, region, and workload shape. The published numbers are useful for sanity-checking the order of magnitude. The numbers that drive the architecture are your numbers, instrumented.

Streaming vs non-streaming, and when to care

Streaming token responses change the latency conversation in two specific cases.

For user-facing surfaces (a research-assistant chat, a research note that the trader reads as it generates), TTFT is the perceptual metric. A 350ms TTFT feels fast even if the full generation takes 8 seconds; the user sees output start, scrolls, reads. Total generation duration matters less than the gap before any output appears.

For agent surfaces (the model output is parsed by code, not a human), streaming buys you nothing on the critical path. The downstream code waits for a complete, valid output before it acts. Streaming adds parsing complexity (you have to maintain a partial-output buffer and decide when to commit) without reducing the wall-clock until the next stage starts. Most production trading agents do not stream; the simpler synchronous code path is the right default.

The third case where streaming matters: structured output validation. Anthropic's tool-use, OpenAI's response_format=json_schema, and Gemini's response_mime_type all enforce schema at generation time, but the validation only completes when generation completes. If the model generates a 500-token JSON output, the schema-valid output appears at the end of generation, not before. Plan the latency budget around the full generation duration, not TTFT.

Operationally, the test that matters is whether the trader's experience improves with streaming or only the apparent responsiveness improves. Many "streaming reduces latency" claims confuse perceived responsiveness with actual time-to-decision. For the latter, the wall-clock to a complete, parseable output is the only number that drives the architecture.

Latency vs cost vs accuracy: the trilemma

Tightening the latency budget always trades against one of cost or accuracy.

A faster model has lower latency but lower benchmark accuracy. A larger context window with more retrieved snippets improves accuracy but stretches generation time. Speculative parallel execution doubles cost. Hard cancellation throws away the work already done on the calls it aborts.

The right framing is to fix the accuracy floor (whatever the strategy needs to be edge-positive after costs) and minimize cost subject to a latency P95 target. That phrasing is testable: you can measure accuracy on a held-out eval set, you can measure cost from the API bills, and you can measure P95 from instrumentation. The temptation is to leave one of the three soft and tune the other two; that is how you arrive at a system that is fast and cheap and quietly wrong.

For trading specifically, the order of priorities I have arrived at after a year of operating retail-scale agents is: accuracy floor first, latency P95 second, cost third. Cost is the easiest to fix later via batch APIs and prompt-cache wiring; latency and accuracy are structural and have to be designed in.

References

  1. Anthropic. "Claude API Reference: Latency and Streaming." docs.anthropic.com, accessed April 2026. Documents TTFT and streaming-token latency for the 4.x model family along with the prompt-cache write-vs-read pricing structure.
  2. OpenAI. "API Status and Performance." status.openai.com and platform.openai.com/docs, accessed April 2026. Documents the chat completion latency reporting model and the absence of a published tail SLA.
  3. Google. "Gemini API: Latency and Throughput." ai.google.dev/gemini-api/docs, accessed April 2026. Documents the 2.5 Pro and 2.5 Flash latency targets and the context-cache mechanism.
  4. Kingman, J. F. C. (1961). "The Single Server Queue in Heavy Traffic." Mathematical Proceedings of the Cambridge Philosophical Society, 57(4), 902-904. The canonical M/G/1 heavy-traffic result behind the 1/(1-ρ) wait-time scaling.
  5. Dean, J., & Barroso, L. A. (2013). "The Tail at Scale." Communications of the ACM, 56(2), 74-80. The Google paper that established the architectural patterns (hedged requests, speculative execution, tied requests) for tail-latency control in distributed services. The patterns transfer directly to LLM pipelines.