TL;DR

LLM research loops hit rate limits at small scale. Finance workloads are bursty (earnings-day storms, batch triage sweeps, intraday re-scores on news) and the naive fix, retry-with-exponential-backoff, wastes hours and still drops work at the tail. Three primitives handle it end-to-end: a token bucket per provider that tracks requests-per-minute and tokens-per-minute separately, a fallback chain that reroutes to a second or third provider when the primary is saturated, and graceful degradation that shrinks the prompt or downshifts the model when every provider is capped. Roughly one hundred fifty lines of Python implements all three cleanly. The rest of this piece is the code and the decision rules that wrap it.

Why this is a distinct problem

An ingestion loop hitting a market-data vendor and a research loop hitting Anthropic look similar on the outside: both are rate-limited, both need retries, both run under a scheduler. They diverge in three places that matter.

First, the unit of cost is different. Market-data vendors meter requests. LLM providers meter requests and tokens, and the token meter saturates first on finance workloads. A single 10-K prompt can consume fifty thousand input tokens; at Anthropic's Tier 2 limits a practitioner hits the TPM ceiling after a handful of calls while the RPM counter is still in the single digits.

Second, the failure is deadline-bound. A missed market-data bar can be re-fetched hours later. A missed earnings-call triage at 21:05 UTC is worthless by 21:30. The rate limiter has to cooperate with a deadline; a retry policy alone cannot.

Third, fallback is meaningful. Three providers sell substitutable frontier models. Switching from Anthropic to OpenAI to Gemini is a real option when the primary says 429 and retry-after exceeds the deadline. Market-data vendors are not substitutable the same way.

The pattern below is the market-data ingestion token bucket adapted to these three differences.

The rate-limit landscape in 2026

Published limits change quarterly. The shapes below are stable as of 2026-04; verify against the vendor dashboards before relying on specific numbers.

Anthropic

Anthropic publishes tiered quotas: Free, Tier 1, Tier 2, Tier 3, Tier 4. Each tier publishes three separate counters per model: requests-per-minute, input-tokens-per-minute, and output-tokens-per-minute. Daily token caps exist on lower tiers and lift at Tier 3+. Quota recovers as a rolling one-minute window, not a top-of-minute reset; a burst at 14:00:30 does not free capacity at 14:01:00, it frees incrementally as each event ages past sixty seconds.

A 429 from Anthropic returns a JSON body with type: rate_limit_error and a retry-after header in seconds. The header is advisory and generally accurate to within a second or two.[1]

OpenAI

OpenAI also publishes tiered quotas (Free, Tier 1 through Tier 5) with separate RPM and TPM counters per model. Organization-level and project-level quotas stack; the lower of the two binds. The retry-after header is present on 429 responses. Rolling-window recovery matches Anthropic's shape.

One practical difference: OpenAI's TPM counter includes both input and output tokens in a single pool, where Anthropic tracks them separately. A workload that is heavy on output (long structured extractions) saturates OpenAI's TPM faster than equivalent calls to Anthropic.
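
The difference is easy to see with assumed per-call numbers:

# Illustrative workload: 10 extraction calls, 5k input + 3k output tokens each.
calls, tokens_in, tokens_out = 10, 5_000, 3_000
pooled = calls * (tokens_in + tokens_out)        # OpenAI-style single pool: 80,000 TPM consumed
split = (calls * tokens_in, calls * tokens_out)  # Anthropic-style: (50_000 input TPM, 30_000 output TPM)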

Google Gemini

Gemini publishes RPM, TPM, and requests-per-day caps. Gemini's API allows burst capacity above the steady-state RPM for short windows, which suits spiky finance workloads but makes capacity planning less predictable. A 429 returns a RESOURCE_EXHAUSTED status; retry-after is not always populated, so the client must fall back to a configured default backoff.
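
A defensive client handles the missing header with a configured default. A minimal sketch; the fifteen-second default is an assumption to tune per tier and workload:

GEMINI_DEFAULT_BACKOFF_S = 15.0  # assumption: tune per workload and tier

def gemini_retry_after(headers: dict) -> float:
    # RESOURCE_EXHAUSTED responses do not always carry retry-after.
    raw = headers.get("retry-after")
    return float(raw) if raw else GEMINI_DEFAULT_BACKOFF_S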

What every provider gets wrong the same way

All three count tokens with their own tokenizer, not the client's estimator. A client-side len(text) // 4 approximation underestimates by ten to fifteen percent on English prose, and by more on SEC filings, which contain tables, XBRL tags, and code-like structures. The token bucket must budget defensively: reserve fifteen to twenty percent headroom on TPM.
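
A defensive estimator along those lines, as a sketch; the divisor, headroom, and output budget are assumptions to calibrate against the provider's reported usage:

def estimate_tokens(text: str, output_budget: int = 2048, headroom: float = 0.20) -> int:
    # chars // 4 approximates English prose; pad by headroom for tables and
    # XBRL, and budget the output up front so long responses cannot blow the window.
    return int((len(text) // 4) * (1 + headroom)) + output_budget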

Primitive 1: Token bucket per provider

A research loop needs two buckets per provider, one for requests and one for tokens, and both must be consulted before every call. The token bucket must also accept a pre-call estimate so it can decline the request before the HTTP call wastes a socket.

import asyncio
import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class RateLimiter:
    """
    Per-provider rate limiter tracking both RPM and TPM as rolling windows.
    acquire(estimated_tokens) blocks until both counters have headroom.
    """
    rpm_limit: int
    tpm_limit: int
    period: float = 60.0
    _req_events: deque = field(default_factory=deque)
    _tok_events: deque = field(default_factory=deque)
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def acquire(self, estimated_tokens: int) -> None:
        if estimated_tokens > self.tpm_limit:
            # A single request larger than the whole TPM budget can never clear;
            # without this guard the wait loop below would spin forever.
            raise ValueError("estimated_tokens exceeds tpm_limit")
        async with self._lock:
            while True:
                now = time.monotonic()
                self._evict(now)
                req_used = len(self._req_events)
                tok_used = sum(t for _, t in self._tok_events)

                req_ok = req_used < self.rpm_limit
                tok_ok = tok_used + estimated_tokens <= self.tpm_limit
                if req_ok and tok_ok:
                    self._req_events.append(now)
                    self._tok_events.append((now, estimated_tokens))
                    return

                sleep_for = self._next_free(now, estimated_tokens)
                await asyncio.sleep(max(sleep_for, 0.05))

    def _evict(self, now: float) -> None:
        cutoff = now - self.period
        while self._req_events and self._req_events[0] <= cutoff:
            self._req_events.popleft()
        while self._tok_events and self._tok_events[0][0] <= cutoff:
            self._tok_events.popleft()

    def _next_free(self, now: float, need: int) -> float:
        waits = []
        if len(self._req_events) >= self.rpm_limit:
            waits.append(self._req_events[0] + self.period - now)
        tok_used = sum(t for _, t in self._tok_events)
        if tok_used + need > self.tpm_limit:
            running = tok_used
            for ts, t in self._tok_events:
                running -= t
                if running + need <= self.tpm_limit:
                    waits.append(ts + self.period - now)
                    break
        return min(waits) if waits else 0.1

Two details matter. The _lock serializes the check-then-append so concurrent coroutines cannot both pass the gate on a near-full bucket. The _next_free helper computes the smallest wait that will actually free capacity, avoiding the pathology of sleeping one second at a time against a sixty-second window.

Usage is one line per call:

await limiter.acquire(estimated_tokens=prompt_tokens + 2048)

The 2048 is a budget for the output. Underestimating output tokens is the single most common way to blow through TPM. The limiter looks fine for ninety seconds, then four long responses arrive in one window and the next call hits 429.
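
One mitigation worth sketching: once the provider reports actual usage, book any overage against the window so the next acquire sees true consumption. This is a hypothetical helper, not part of the limiter above, and it reaches into the limiter's deque for brevity:

def record_overage(limiter: RateLimiter, estimated: int, actual: int) -> None:
    # Append the difference when the response overran the estimate, so the
    # rolling window reflects what the provider actually metered.
    if actual > estimated:
        limiter._tok_events.append((time.monotonic(), actual - estimated))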

Primitive 2: Provider fallback chain

When Anthropic returns 429 and retry-after exceeds the research deadline, the loop needs an alternate path. The fallback chain routes the same logical request through the next provider in the ordered list, with a provider-specific prompt.

The trap: prompts are not portable. Anthropic's system prompt pattern, OpenAI's response_format for structured output, and Gemini's systemInstruction all differ. A fallback that shoves the Anthropic prompt into an OpenAI client breaks in two ways: the model will produce a differently-shaped response, and prompt caching hit rates collapse because the cache key is provider-specific. The fallback must carry an equivalent prompt object per provider.

from dataclasses import dataclass
from typing import Protocol, Any

class Provider(Protocol):
    name: str
    limiter: RateLimiter
    async def call(self, prompt: Any, estimated_tokens: int) -> dict: ...

@dataclass
class ProviderRequest:
    """Equivalent prompt object per provider; filled at build time."""
    anthropic: dict
    openai: dict
    gemini: dict
    estimated_tokens: int

class RateLimitError(Exception):
    def __init__(self, retry_after: float):
        self.retry_after = retry_after

class FallbackChain:
    def __init__(self, providers: list[Provider], deadline_s: float):
        self.providers = providers
        self.deadline = time.monotonic() + deadline_s

    async def dispatch(self, req: ProviderRequest) -> dict:
        last_err = None
        for p in self.providers:
            remaining = self.deadline - time.monotonic()
            if remaining <= 0:
                break
            prompt = getattr(req, p.name)
            try:
                await asyncio.wait_for(
                    p.limiter.acquire(req.estimated_tokens),
                    timeout=remaining,
                )
                return await p.call(prompt, req.estimated_tokens)
            except RateLimitError as e:
                if e.retry_after > remaining:
                    last_err = e
                    continue
                await asyncio.sleep(e.retry_after)
                try:
                    # Re-acquire budget for the retry: the first reservation has
                    # partially aged out of the window during the sleep.
                    await asyncio.wait_for(
                        p.limiter.acquire(req.estimated_tokens),
                        timeout=max(self.deadline - time.monotonic(), 0.1),
                    )
                    return await p.call(prompt, req.estimated_tokens)
                except Exception as inner:
                    last_err = inner
                    continue
            except Exception as e:  # also catches asyncio.TimeoutError from wait_for
                last_err = e
                continue
        raise RuntimeError(f"all providers exhausted: {last_err}")

The chain is deadline-aware. Each provider gets the budget that remains, not a fixed timeout. When retry-after exceeds the remaining deadline, the chain moves on rather than burning the whole window on one provider.

Ordering matters. Put the provider with the highest-quality output first and the cheapest one last; practitioners on a tight budget sometimes invert this, routing most traffic through the cheapest provider and reserving a premium fallback for critical-path calls. The Fallback Chain Simulator models the cost and hit-rate tradeoffs for specific workload mixes.

Primitive 3: Graceful degradation

When all providers are saturated or the deadline has shrunk below the p95 latency, the loop has three levers:

  1. Drop optional context. A peer-comparison prompt that normally includes three comparable filings can run on the target filing alone. A news-aware re-score can skip the news bundle.
  2. Downshift the model. Opus to Sonnet saves roughly eighty percent on output-token cost; Sonnet to Haiku saves another sixty percent, at the price of accuracy on harder prompts. Whether the downshift is acceptable depends on the downstream use.
  3. Defer. Skip the call entirely and schedule the work for the next quiet window. Acceptable for batch triage, not for deadline-bound calls like pre-open earnings ranking.

The decision rule:

def choose_degradation(
    deadline_remaining: float,
    p95_latency: float,
    saturation: float,
    task_priority: str,
) -> str:
    """
    saturation: max(rpm_used / rpm_limit, tpm_used / tpm_limit) across providers.
    task_priority: "critical" | "standard" | "batch".
    """
    if deadline_remaining > p95_latency and saturation < 0.8:
        return "full"
    if deadline_remaining > p95_latency * 0.5 and saturation < 0.95:
        return "drop_optional_context"
    if task_priority == "critical":
        return "downshift_model"
    return "defer"

The thresholds are defaults. A practitioner running earnings-day workloads typically tightens them: saturation < 0.7 triggers the first degradation, anything above 0.9 triggers a downshift even for standard-priority work. Tracking p95 latency requires the observability layer described in observability for LLM trading agents.
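
Nothing above implements lever 2. A minimal sketch of a model ladder; every id except claude-sonnet-4-6 is a placeholder to fill from current model listings:

MODEL_LADDER = [
    "claude-opus-<current>",   # placeholder id
    "claude-sonnet-4-6",
    "claude-haiku-<current>",  # placeholder id
]

def downshift(prompt: dict) -> dict:
    # Move one rung down the ladder; leave the rest of the prompt untouched.
    idx = MODEL_LADDER.index(prompt["model"]) if prompt["model"] in MODEL_LADDER else 0
    return {**prompt, "model": MODEL_LADDER[min(idx + 1, len(MODEL_LADDER) - 1)]}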

Putting it together: earnings-day burst

Twenty filings to triage in two hours, 50k input tokens each, a 2k output target per call. At Anthropic Tier 2 published rates as of 2026-04 (roughly 80k TPM on Sonnet; verify current), the raw workload fits when calls are paced, but a naive burst exhausts the per-minute budget inside fifteen minutes and stalls the loop.
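
Back-of-envelope pacing under the assumed 80k TPM figure:

per_call = 50_000 + 2_048             # input tokens plus output budget
refill_s = per_call * 60 / 80_000     # ~39 s for the window to free one call's tokens
min_runtime_min = 20 * refill_s / 60  # ~13 min of pure token pacing, if nothing bursts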

import asyncio, time, os
from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

ANTHROPIC_LIMITER = RateLimiter(rpm_limit=50, tpm_limit=80_000)
OPENAI_LIMITER = RateLimiter(rpm_limit=500, tpm_limit=200_000)

anthropic_client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

class AnthropicProvider:
    name = "anthropic"
    limiter = ANTHROPIC_LIMITER
    async def call(self, prompt, est):
        try:
            r = await anthropic_client.messages.create(**prompt)
            return {"provider": "anthropic", "text": r.content[0].text}
        except Exception as e:
            # SDK rate-limit errors carry the httpx response; retry-after lives
            # in its headers, not on a retry_after attribute of the exception.
            resp = getattr(e, "response", None)
            if resp is not None and getattr(resp, "status_code", None) == 429:
                raise RateLimitError(float(resp.headers.get("retry-after", 15.0)))
            raise

class OpenAIProvider:
    name = "openai"
    limiter = OPENAI_LIMITER
    async def call(self, prompt, est):
        try:
            r = await openai_client.chat.completions.create(**prompt)
            return {"provider": "openai", "text": r.choices[0].message.content}
        except Exception as e:
            # Same shape as the Anthropic SDK: read retry-after off the response.
            resp = getattr(e, "response", None)
            if resp is not None and getattr(resp, "status_code", None) == 429:
                raise RateLimitError(float(resp.headers.get("retry-after", 15.0)))
            raise

def build_request(filing_text: str, include_peers: bool) -> ProviderRequest:
    system = "Extract risk factors as JSON. Output only JSON."
    user = filing_text if not include_peers else filing_text + "\n\nPeers:\n..."
    est = (len(user) // 3) + 2048  # defensive estimate
    return ProviderRequest(
        anthropic={
            "model": "claude-sonnet-4-6",
            "max_tokens": 2048,
            "system": system,
            "messages": [{"role": "user", "content": user}],
        },
        openai={
            "model": "gpt-4.1",
            "max_tokens": 2048,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            "response_format": {"type": "json_object"},
        },
        gemini={},  # omitted for brevity
        estimated_tokens=est,
    )

async def triage(filings: list[str]) -> list[dict]:
    chain = FallbackChain(
        providers=[AnthropicProvider(), OpenAIProvider()],
        deadline_s=2 * 3600,
    )
    p95 = 8.0  # assumed p95 call latency in seconds; measure, don't hardcode

    def _sat(lim: RateLimiter) -> float:
        # Busier of the two counters, matching the choose_degradation docstring.
        return max(
            len(lim._req_events) / lim.rpm_limit,
            sum(t for _, t in lim._tok_events) / lim.tpm_limit,
        )

    async def one(f, idx):
        sat = max(_sat(ANTHROPIC_LIMITER), _sat(OPENAI_LIMITER))
        deadline_rem = chain.deadline - time.monotonic()
        mode = choose_degradation(deadline_rem, p95, sat, "standard")
        if mode == "defer":
            return {"idx": idx, "status": "deferred"}
        req = build_request(f, include_peers=(mode == "full"))
        try:
            r = await chain.dispatch(req)
            return {"idx": idx, "status": "ok", **r}
        except Exception as e:
            return {"idx": idx, "status": "failed", "error": str(e)}

    results = await asyncio.gather(*[one(f, i) for i, f in enumerate(filings)])
    return results

A run against a realistic twenty-filing batch with both providers configured typically completes inside the two-hour window with zero deferrals; the same batch against Anthropic alone with no fallback or degradation will stall at filing twelve or thirteen and recover only after the rolling window clears, adding roughly forty-five minutes to the tail.

Anti-patterns

  • Exponential backoff without jitter. Why it bites: many workers recover in the same millisecond, hit the API as a thundering herd, and all get 429 again. Correction: add full jitter, sleep = random.uniform(0, 2 ** attempt * base); a sketch follows this list.
  • Counting only requests, not tokens. Why it bites: TPM saturates first on finance workloads while the RPM counter still says green. Correction: always track both counters and estimate tokens defensively with fifteen percent headroom.
  • Retrying non-retryable errors. Why it bites: 401/403/404 and invalid-model errors burn quota on guaranteed failures. Correction: whitelist retryable statuses, 408, 429, 500, 502, 503, 504 only.
  • Falling back to a model with a different output schema. Why it bites: the primary returns JSON with field risk_factors, the fallback returns risks, and the downstream parser breaks silently. Correction: validate response shape against a shared schema before accepting.
  • Sharing one limiter across providers. Why it bites: one provider's 429 blocks unrelated providers. Correction: one limiter instance per provider per counter.
  • Ignoring retry-after. Why it bites: client backoff diverges from the server window and amplifies the 429 cascade. Correction: honor the header as a lower bound, not an upper bound.
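
The jitter correction in the first item, spelled out; a minimal sketch per Brooker (2015), where base and cap are assumptions to tune:

import random

def full_jitter(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    # Full jitter: sample uniformly from [0, min(cap, base * 2**attempt)]
    # so recovered workers spread out instead of stampeding together.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))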

Each of these shows up in incident retros with predictable frequency. The 5 Failure Modes of LLM Trading Agents article covers the adjacent operational failures.

References

  • OpenAI, "Rate limits" (platform.openai.com/docs/guides/rate-limits, retrieved 2026-04). Tier structure, RPM and TPM counters, organization vs project quotas.
  • Google, "Gemini API rate limits" (ai.google.dev/gemini-api/docs/rate-limits, retrieved 2026-04). Burst capacity and RESOURCE_EXHAUSTED response shape.
  • Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. Chapter 21, "Handling Overload," covers token-bucket reasoning and load shedding at scale.
  • Brooker, M. (2015). "Exponential Backoff And Jitter." AWS Architecture Blog. Canonical treatment of jitter strategies for retry storms.
  • Nygard, M. T. (2018). Release It! Design and Deploy Production-Ready Software. 2nd ed. Pragmatic Bookshelf. Bulkheads, circuit breakers, and degradation patterns.

Footnotes

  1. Anthropic, "Rate limits" (docs.anthropic.com, retrieved 2026-04). Tier table and retry-after semantics. Verify current values against the vendor dashboard before relying on specific numbers.