TL;DR
Agentic research loops without explicit cost ceilings routinely produce Sonnet bills in four digits per week. The failure mode is mundane: an agent stuck in a clarification loop, or a retrieval step that re-reads the same 10-K eight times searching for a number that was never in the filing. Three gates prevent it. Gate 1 is a hard token budget per research idea, chosen so per-idea cost times ideas-per-day equals the daily budget target. Gate 2 is a step-count cap that terminates pathological loops regardless of per-call cost. Gate 3 is a cost-convergence check that halts the loop when the Nth step fails to move the posterior belief by more than epsilon. Combined, they cap cost-per-idea at a predictable number without limiting quality on ideas the model finishes quickly. The worked implementations below total about 125 lines of Python.
The runaway-cost failure mode
The canonical incident: an agent loop runs overnight, burns through $340 of Sonnet inference, and produces zero trades. Inspection of the trace shows two agents stuck in a tool-use ping-pong, with one requesting clarification while the other returns a tool result that prompts another clarification. The logical state never advances. Each round costs roughly 9K input tokens (context grows with each turn) and 800 output tokens. Thirty-two rounds in, the per-idea cost has blown through the intended envelope by 12x.
A second common shape: retrieval loops that cannot find a specific number because the number does not exist in the document. An agent asked to "find the segment revenue for the datacenter business in the 10-K" will re-read the filing under different chunking strategies, each pass costing 40K-60K input tokens, until step-count exhaustion or a timeout stops it. The filing groups revenue differently than the prompt assumed. The loop "works" in the sense that every tool call succeeds and every response is well-formed JSON, but the information simply is not there.
A third shape is subtler: the agent converges on an answer at step 3, then spends steps 4 through 12 producing elaborations that do not change the posterior probability at all. The final trade decision would have been identical with 75% less cost.
All three failure modes share a property: the agent has no mechanism to notice that further computation is unlikely to change the output. Without an external gate, the loop runs until it hits a framework timeout, which is almost always too late from a cost perspective. The per-call economics are covered in The Token-Cost Reality of LLM Trading Research, which supplies the baseline Sonnet-per-idea math that this article's budgets reference.
Gate 1: Hard token budget
The first defense is a fixed per-idea token ceiling. The ceiling comes from simple division: the daily inference budget divided by the target ideas-per-day gives cost-per-idea; cost-per-idea divided by the effective blended price per token gives the token budget.
A concrete anchor: a solo operator targeting $200/month on Sonnet 4.6 with 50% cache-hit rate and 10 ideas per trading day gets roughly $1.00 per idea, or about 40K total tokens at blended input + output pricing. The budget is a hard stop, not a soft guideline. When the running sum exceeds the ceiling, the wrapper raises and the orchestrator decides: escalate to a human, mark the idea as inconclusive, or fall back to a cheaper model for a final low-cost synthesis.
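A minimal sketch of that division. The blended price is an input, not a constant; the roughly $25 per million tokens used here is simply what the $1.00-per-idea, 40K-token anchor above implies, and should be recomputed from your own cache-hit rate and input/output mix.

```python
def token_budget_per_idea(
    monthly_budget_usd: float,
    trading_days_per_month: int,
    ideas_per_day: int,
    blended_usd_per_mtok: float,
) -> int:
    """Per-idea token ceiling implied by a daily budget target."""
    cost_per_idea = monthly_budget_usd / trading_days_per_month / ideas_per_day
    return int(cost_per_idea / blended_usd_per_mtok * 1_000_000)

# The $200/month anchor above, assuming ~20 trading days per month and the
# ~$25/MTok blended price implied by the $1.00 / 40K-token figure.
budget = token_budget_per_idea(200, 20, 10, 25.0)  # -> 40_000
```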
```python
from dataclasses import dataclass, field

from anthropic import Anthropic


class BudgetExceeded(Exception):
    pass


@dataclass
class BudgetedClient:
    inner: Anthropic
    max_tokens: int  # hard per-idea ceiling, input + output combined
    used: int = 0
    calls: int = 0
    trace: list = field(default_factory=list)

    def create(self, **kwargs):
        # Pre-call check: refuse to issue another call once the ceiling is hit.
        if self.used >= self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used} >= cap {self.max_tokens}"
            )
        resp = self.inner.messages.create(**kwargs)
        # Post-call charge: count both input and output tokens actually used.
        spent = resp.usage.input_tokens + resp.usage.output_tokens
        self.used += spent
        self.calls += 1
        self.trace.append({"call": self.calls, "spent": spent})
        return resp
```
The wrapper checks the budget before each call, issues the call, updates the running sum, and keeps a minimal trace. Charging tokens after the call is correct because it captures both input and output; the pre-call check is what stops the loop from issuing yet another call once the ceiling is breached. A call that is itself larger than the remaining budget still runs, so tighten the gate with a pre-call estimate if the prompt template has a predictable ceiling.
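A usage sketch wiring the derived ceiling into the wrapper. The prompt and the fallback policy in the except branch are illustrative; the policy decision belongs to the orchestrator, as described above.

```python
from anthropic import Anthropic

prompt = "Assess the earnings-surprise thesis for ticker XYZ."  # hypothetical idea
client = BudgetedClient(inner=Anthropic(), max_tokens=40_000)

try:
    resp = client.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
except BudgetExceeded:
    # Escalate to a human, mark the idea inconclusive, or run one cheap
    # synthesis pass on a smaller model -- an orchestrator policy choice.
    result = {"status": "inconclusive", "tokens_used": client.used}
```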
Gate 2: Step-count cap
Token budgets catch total spend. They do not catch the specific pathology where a loop runs many cheap calls that each add marginal cost but never converge. A step-count cap is the simplest defense: an absolute limit on the number of tool-use rounds.
For a retrieval-then-synthesis agent, 8 to 12 rounds is usually plenty; loops that exceed that ceiling are almost always pathological rather than deeply thoughtful. Anthropic's tool-use documentation recommends treating step limits as a first-class safety control, not an emergency stop.1
```python
class StepLimitExceeded(Exception):
    pass


class StepCap:
    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> None:
        # Called once per tool-use round, not once per model call.
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(
                f"exceeded {self.max_steps} tool rounds"
            )
```
The cap is deliberately dumb. It does not try to detect loops by content (detecting that is itself a research problem); it just counts. The counter increments on every tool-use round, not on every model call, because a single round can include multiple parallel tool requests. Pair the cap with a log entry when it triggers so the post-mortem has enough signal to tell real pathology from a step limit set too tight.
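A minimal pairing of the cap with that log entry, so the post-mortem can distinguish a genuinely stuck loop from a cap set too tight. The logger name and message format are illustrative.

```python
import logging

logger = logging.getLogger("research.gates")

cap = StepCap(max_steps=10)
try:
    cap.tick()  # invoked once per tool-use round inside the agent loop
except StepLimitExceeded as exc:
    # Record enough context to judge later whether the loop was pathological
    # or simply needed a few more rounds.
    logger.warning("step cap fired: %s after %d rounds", exc, cap.steps)
    raise
```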
Gate 3: Cost-convergence check
The interesting gate. The intuition: if the agent's current-best posterior belief at step N is indistinguishable from its belief at step N-1, step N+1 is unlikely to produce a different belief. Halt the loop and return the current state.
Implementing this requires the agent to emit a structured posterior after each step: at minimum a probability and a one-sentence thesis. Each round of tool-use, the model is asked to request the next tool call and, on the same turn, to restate its current estimate of the outcome. After each step, the wrapper compares the new posterior to the previous one and stops the loop when the change is below a threshold.
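One way to phrase that contract in the system prompt. The field name matches the current_posterior convention used later in this article; the exact wording and schema are assumptions, not anything the API enforces.

```python
# Appended to the research agent's system prompt (wording is illustrative).
POSTERIOR_INSTRUCTION = """
On every turn, after any tool request, restate your current view as a JSON
block with exactly this shape:
{"current_posterior": {"prob": <number between 0 and 1>, "thesis": "<one sentence>"}}
"""

# What one assistant turn might then carry alongside its tool_use block
# (values are hypothetical):
# {"current_posterior": {"prob": 0.62,
#   "thesis": "Datacenter segment growth likely beats street estimates."}}
```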
Two distance metrics are practical. KL divergence is the information-theoretic gold standard for comparing probability distributions.2 For a single binary outcome with probabilities p and q, KL(p || q) reduces to p * log(p/q) + (1-p) * log((1-p)/(1-q)). The simpler alternative is absolute difference |p - q|, which is cheaper to reason about and almost always adequate for a halt condition on a univariate posterior. KL becomes the right choice when the posterior is a full distribution over multiple outcomes, for instance when the agent emits a categorical over five bucketed return ranges.
```python
import math
from dataclasses import dataclass


@dataclass
class Posterior:
    prob: float
    thesis: str


class ConvergenceGate:
    def __init__(
        self,
        eps: float = 0.01,
        method: str = "abs",
        consecutive: int = 2,
    ):
        self.eps = eps
        self.method = method
        self.consecutive = consecutive
        self.history: list[Posterior] = []
        self.streak = 0

    def _distance(self, a: float, b: float) -> float:
        if self.method == "abs":
            return abs(a - b)
        # KL divergence for a binary outcome, with probabilities clamped
        # away from 0 and 1 to keep the logs finite.
        p, q = max(min(a, 1 - 1e-9), 1e-9), max(min(b, 1 - 1e-9), 1e-9)
        return (
            p * math.log(p / q)
            + (1 - p) * math.log((1 - p) / (1 - q))
        )

    def update(self, post: Posterior) -> bool:
        self.history.append(post)
        if len(self.history) < 2:
            return False
        d = self._distance(post.prob, self.history[-2].prob)
        # Converged only after `consecutive` sub-epsilon moves in a row.
        self.streak = self.streak + 1 if d < self.eps else 0
        return self.streak >= self.consecutive
```
Two design choices deserve attention. First, the gate requires consecutive converged steps, not a single one. A single stable step can be a coincidence; two or three in a row are a signal. Second, epsilon is workload-dependent. A research prompt that produces calibrated probabilities between 0.40 and 0.60 has a useful range of 0.20, so epsilon of 0.01 is a 5% relative change, which is tight but plausible. Prompts with wider probability ranges tolerate larger epsilon.
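A quick usage sketch with made-up per-step estimates, showing the gate firing only after two consecutive sub-epsilon moves.

```python
gate = ConvergenceGate(eps=0.01, method="abs", consecutive=2)

estimates = [0.50, 0.58, 0.61, 0.612, 0.613]  # hypothetical per-step posteriors
for step, p in enumerate(estimates, start=1):
    done = gate.update(Posterior(prob=p, thesis="placeholder"))
    print(step, p, done)
# Steps 4 and 5 each move the probability by less than 0.01, so update()
# returns True on step 5 and the orchestrator halts the loop.
```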
Tune by inspecting logs. If the gate never fires, epsilon is too tight or the prompt is genuinely producing new information each step. If the gate fires before the agent has finished retrieval, epsilon is too loose or the model is anchoring on its initial estimate, which is a separate pathology worth fixing upstream.
Practical combined pattern
All three gates compose into a single orchestrator. The pattern below runs a retrieval-analysis-synthesis loop with retrieval tools stubbed out for brevity; the important shape is the gate ordering and the fallback behavior when each one fires.
```python
def run_research(
    client: BudgetedClient,
    step_cap: StepCap,
    conv: ConvergenceGate,
    prompt: str,
    tools: list,
) -> dict:
    messages = [{"role": "user", "content": prompt}]
    posteriors: list[Posterior] = []
    termination = "step_cap"
    while True:
        # Gate 2: hard cap on tool-use rounds.
        try:
            step_cap.tick()
        except StepLimitExceeded:
            termination = "step_cap"
            break
        # Gate 1: hard token budget, enforced by the wrapper.
        try:
            resp = client.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=tools,
                messages=messages,
            )
        except BudgetExceeded:
            termination = "token_budget"
            break
        messages.append({"role": "assistant", "content": resp.content})
        # Gate 3: halt once the stated posterior stops moving.
        post = extract_posterior(resp)
        if post is not None:
            posteriors.append(post)
            if conv.update(post):
                termination = "converged"
                break
        if resp.stop_reason == "end_turn":
            termination = "end_turn"
            break
        tool_results = run_tools(resp.content, tools)
        messages.append({"role": "user", "content": tool_results})
    final = posteriors[-1] if posteriors else None
    return {
        "termination": termination,
        "posterior": final,
        "steps": step_cap.steps,
        "tokens": client.used,
        "calls": client.calls,
    }
```
The orchestrator logs why it stopped. That field is load-bearing for diagnostics: a healthy agent terminates on converged or end_turn most of the time, with occasional step_cap or token_budget on genuinely hard ideas. An agent terminating 80% on token_budget is telling the operator that the budget is too tight for the prompt, the prompt is too broad for the budget, or the tool surface is returning noise.
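A small sketch of that diagnostic: tally the termination field over a day's worth of run_research results. The results list is whatever store the orchestrator writes to; the healthy mix in the comment restates the guidance above, not measured data.

```python
from collections import Counter


def termination_mix(results: list[dict]) -> dict[str, float]:
    """Fraction of ideas ending in each termination reason."""
    counts = Counter(r["termination"] for r in results)
    total = sum(counts.values()) or 1
    return {reason: n / total for reason, n in counts.items()}

# A healthy day looks roughly like
# {"converged": 0.6, "end_turn": 0.25, "step_cap": 0.1, "token_budget": 0.05};
# 80% "token_budget" points at the budget, the prompt, or the tool surface.
```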
extract_posterior is workload-specific: parse the model output for a JSON block, a tagged span, or a tool call arg. The contract with the model lives in the system prompt: ask explicitly for a current_posterior field on every turn, so the convergence gate has something to chew on. One possible shape of the parser is sketched below.
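A sketch of one such parser, assuming the JSON-block contract from the Gate 3 section. The regex and field names follow this article's convention and are not enforced by the API; adapt them to whatever structure your prompt actually requests.

```python
import json
import re

POSTERIOR_RE = re.compile(
    r'\{[^{}]*"current_posterior"[^{}]*\{[^{}]*\}[^{}]*\}'
)


def extract_posterior(resp) -> Posterior | None:
    """Pull the last current_posterior block out of an assistant turn."""
    text = "".join(
        block.text for block in resp.content
        if getattr(block, "type", "") == "text"
    )
    matches = POSTERIOR_RE.findall(text)
    if not matches:
        return None
    try:
        payload = json.loads(matches[-1])["current_posterior"]
        return Posterior(prob=float(payload["prob"]), thesis=str(payload["thesis"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed block: treat as "no posterior this turn" rather than failing.
        return None
```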
| Termination reason | Typical cause | First fix to try |
|---|---|---|
| converged | Agent reached stable belief | Nothing; this is the happy path |
| end_turn | Model signaled finished | Nothing; also happy path |
| step_cap | Loop did not converge in N rounds | Inspect trace; raise cap to 15 if content looks productive |
| token_budget | Context grew past budget | Truncate older tool results or switch to a cheaper model |
| tool_error | Tool raised repeatedly | Fix upstream; this is not a gate failure |
When gates fight each other
The gates interact. A step-count cap that fires before the convergence gate ever does means the cap is too tight for the prompt's natural length; raise it in steps of 2 and watch whether convergence starts firing first. A token-budget gate that fires before step-cap means the budget is too small for the observed token consumption per step, which is usually a cache-miss problem or a context-growth problem. The fix is to truncate stale tool results from the message history or switch the synthesis step to a smaller model for that particular idea.
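A minimal sketch of the truncation fix, keeping only the most recent tool results at full length. It assumes tool results were appended as plain strings, as in the simplified orchestrator above; the cutoffs are illustrative and should be tuned against observed context growth.

```python
def truncate_stale_tool_results(
    messages: list[dict],
    keep_recent: int = 2,
    max_chars: int = 500,
) -> list[dict]:
    """Shorten older tool-result turns; keep the newest few intact."""
    # User-role turns after the original prompt are the tool-result turns.
    result_turns = [
        i for i, m in enumerate(messages[1:], start=1) if m["role"] == "user"
    ]
    stale = set(result_turns[:-keep_recent]) if keep_recent else set(result_turns)
    truncated = []
    for i, m in enumerate(messages):
        content = m["content"]
        if i in stale and isinstance(content, str) and len(content) > max_chars:
            m = {**m, "content": content[:max_chars] + " ...[truncated]"}
        truncated.append(m)
    return truncated
```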
A convergence gate that fires while the agent has clearly not finished real work is the most diagnostic failure. It usually means one of two things. First, the model is anchoring on its initial prior and producing superficially new text that does not move the probability, which is a prompt problem: ask the model to state new evidence on each turn rather than a restated thesis. Second, epsilon is too loose for the dynamic range of the posterior: tighten from 0.02 to 0.01 or require three consecutive converged steps instead of two.
When more than one gate is near its limit on the same idea, log every reason, not just the one that actually halted the loop. A trace where the step cap trips at step 10 while the token budget would have tripped at step 12 and convergence at step 11 is a clear signal that the prompt is underspecified: every gate thinks the loop is misbehaving, which usually means the loop genuinely is. The fix is upstream prompt engineering, not loosening gates.
Budget drift across a week is worth monitoring. Hitting the per-idea token budget on 5% of ideas in a day is healthy. Hitting it on 40% means the workload has shifted (model version, prompt change, a new tool returning larger payloads) and the envelope needs a recompute, not more headroom. The Agent Cost Envelope Calculator takes observed per-idea distributions and produces budget recommendations at target reject rates.
| Gate interaction | What it means | Action |
|---|---|---|
| Step-cap fires, convergence never does | Cap too tight or prompt underspecified | Raise cap; if still failing, simplify prompt |
| Token-budget fires, convergence never does | Context growth or cache miss | Truncate history; pin stable prefix |
| Convergence fires too early | Model anchoring or epsilon too loose | Tighten epsilon; require 3 consecutive |
| All three fire | Prompt is not actually researchable | Fix upstream; do not loosen gates |
Connects to
- Observability for LLM Trading Agents — gate triggers are only useful if you log and query them.
- Rate-Limit Design for LLM Research — cost gates and rate gates are complementary controls on the same pipeline.
- The Token-Cost Reality of LLM Trading Research — the pricing baseline that per-idea budgets are derived from.
- Building a Production-Grade Claude Agent for Finance — the end-to-end agent architecture into which these gates drop.
- 5 Failure Modes of LLM Trading Agents — runaway loops are one of the five; this article is the detailed fix.
- Agent Cost Envelope Calculator — turns observed per-idea cost distributions into gate parameter recommendations.
- Token Cost Optimizer — pricing table behind the per-idea budget derivation.
References
- Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley. Reference for entropy-based convergence diagnostics on empirical distributions.
- Anthropic (2026). "Prompt caching." https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching (accessed 2026-04-22). Relevant to budget math when the fixed prefix is cached.
- Kullback, S., & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics 22(1), pp. 79-86. The original KL-divergence paper, for practitioners who want the primary source.
Footnotes
1. Anthropic (2026). "Tool use with Claude — safety and step limits." https://docs.anthropic.com/en/docs/build-with-claude/tool-use (accessed 2026-04-22).
2. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley-Interscience. Chapter 2 (entropy, relative entropy) and Chapter 11 (information theory and statistics) define KL divergence and its use as a convergence measure.