TL;DR
Agentic research loops without explicit cost ceilings routinely produce Sonnet bills in four digits per week. The failure mode is mundane: an agent stuck in a clarification loop, or a retrieval step that re-reads the same 10-K eight times searching for a number that was never in the filing. Three gates prevent it. Gate 1 is a hard token budget per research idea, chosen so per-idea cost times ideas-per-day equals the daily budget target. Gate 2 is a step-count cap that terminates pathological loops regardless of per-call cost. Gate 3 is a cost-convergence check that halts the loop when the Nth step fails to move the posterior belief by more than epsilon. Combined, they cap cost-per-idea at a predictable number without limiting quality on ideas the model finishes quickly. The worked implementations below total about 125 lines of Python.
The runaway-cost failure mode
The canonical incident: an agent loop runs overnight, burns through $340 of Sonnet inference, and produces zero trades. Inspection of the trace shows two agents stuck in a tool-use ping-pong, with one requesting clarification while the other returns a tool result that prompts another clarification. The logical state never advances. Each round costs roughly 9K input tokens (context grows with each turn) and 800 output tokens. Thirty-two rounds in, the per-idea cost has blown through the intended envelope by 12x.
A second common shape: retrieval loops that cannot find a specific number because the number does not exist in the document. An agent asked to "find the segment revenue for the datacenter business in the 10-K" will re-read the filing under different chunking strategies, each pass costing 40K-60K input tokens, until step-count exhaustion or a timeout stops it. The filing groups revenue differently than the prompt assumed. The loop "works" in the sense that every tool call succeeds and every response is well-formed JSON, but the information simply is not there.
A third shape is subtler: the agent converges on an answer at step 3, then spends steps 4 through 12 producing elaborations that do not change the posterior probability at all. The final trade decision would have been identical with 75% less cost.
All three failure modes share a property: the agent has no mechanism to notice that further computation is unlikely to change the output. Without an external gate, the loop runs until it hits a framework timeout, which is almost always too late from a cost perspective. The per-call economics are covered in The Token-Cost Reality of LLM Trading Research, which supplies the baseline Sonnet-per-idea math that this article's budgets reference.
Gate 1: Hard token budget
The first defense is a fixed per-idea token ceiling. The ceiling comes from simple division: the daily inference budget divided by the target ideas-per-day gives cost-per-idea; cost-per-idea divided by the effective blended price per token gives the token budget.
A concrete anchor: a solo operator targeting $200/month on Sonnet 4.6 with 50% cache-hit rate and 10 ideas per trading day gets roughly $1.00 per idea, or about 40K total tokens at blended input + output pricing. The budget is a hard stop, not a soft guideline. When the running sum exceeds the ceiling, the wrapper raises and the orchestrator decides: escalate to a human, mark the idea as inconclusive, or fall back to a cheaper model for a final low-cost synthesis.
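A minimal sketch of that division. The blended price is an input, not a constant; the roughly $25 per million tokens used here is simply what the $1.00-per-idea, 40K-token anchor above implies, and should be recomputed from your own cache-hit rate and input/output mix.

```python
def token_budget_per_idea(
    monthly_budget_usd: float,
    trading_days_per_month: int,
    ideas_per_day: int,
    blended_usd_per_mtok: float,
) -> int:
    """Per-idea token ceiling implied by a daily budget target."""
    cost_per_idea = monthly_budget_usd / trading_days_per_month / ideas_per_day
    return int(cost_per_idea / blended_usd_per_mtok * 1_000_000)

# The $200/month anchor above, assuming ~20 trading days per month and the
# ~$25/MTok blended price implied by the $1.00 / 40K-token figure.
budget = token_budget_per_idea(200, 20, 10, 25.0)  # -> 40_000
```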
```python
from dataclasses import dataclass, field

from anthropic import Anthropic


class BudgetExceeded(Exception):
    pass


@dataclass
class BudgetedClient:
    inner: Anthropic
    max_tokens: int  # hard per-idea ceiling, input + output combined
    used: int = 0
    calls: int = 0
    trace: list = field(default_factory=list)

    def create(self, **kwargs):
        # Pre-call check: refuse to issue another call once the ceiling is hit.
        if self.used >= self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used} >= cap {self.max_tokens}"
            )
        resp = self.inner.messages.create(**kwargs)
        # Post-call charge: count both input and output tokens actually used.
        spent = resp.usage.input_tokens + resp.usage.output_tokens
        self.used += spent
        self.calls += 1
        self.trace.append({"call": self.calls, "spent": spent})
        return resp
```
The wrapper checks the budget before each call, issues the call, updates the running sum, and keeps a minimal trace. Charging tokens after the call is correct because it captures both input and output; the pre-call check is what stops the loop from issuing yet another call once the ceiling is breached. A call that is itself larger than the remaining budget still runs, so tighten the gate with a pre-call estimate if the prompt template has a predictable ceiling.
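A usage sketch wiring the derived ceiling into the wrapper. The prompt and the fallback policy in the except branch are illustrative; the policy decision belongs to the orchestrator, as described above.

```python
from anthropic import Anthropic

prompt = "Assess the earnings-surprise thesis for ticker XYZ."  # hypothetical idea
client = BudgetedClient(inner=Anthropic(), max_tokens=40_000)

try:
    resp = client.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
except BudgetExceeded:
    # Escalate to a human, mark the idea inconclusive, or run one cheap
    # synthesis pass on a smaller model -- an orchestrator policy choice.
    result = {"status": "inconclusive", "tokens_used": client.used}
```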
Gate 2: Step-count cap
Token budgets catch total spend. They do not catch the specific pathology where a loop runs many cheap calls that each add marginal cost but never converge. A step-count cap is the simplest defense: an absolute limit on the number of tool-use rounds.
For a retrieval-then-synthesis agent, 8 to 12 rounds is usually plenty; loops that exceed that ceiling are almost always pathological rather than deeply thoughtful. Anthropic's tool-use documentation recommends treating step limits as a first-class safety control, not an emergency stop.1
```python
class StepLimitExceeded(Exception):
    pass


class StepCap:
    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> None:
        # Called once per tool-use round, not once per model call.
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(
                f"exceeded {self.max_steps} tool rounds"
            )
```
The cap is deliberately dumb. It does not try to detect loops by content (detecting that is itself a research problem); it just counts. The counter increments on every tool-use round, not on every model call, because a single round can include multiple parallel tool requests. Pair the cap with a log entry when it triggers so the post-mortem has enough signal to tell real pathology from a step limit set too tight.
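A minimal pairing of the cap with that log entry, so the post-mortem can distinguish a genuinely stuck loop from a cap set too tight. The logger name and message format are illustrative.

```python
import logging

logger = logging.getLogger("research.gates")

cap = StepCap(max_steps=10)
try:
    cap.tick()  # invoked once per tool-use round inside the agent loop
except StepLimitExceeded as exc:
    # Record enough context to judge later whether the loop was pathological
    # or simply needed a few more rounds.
    logger.warning("step cap fired: %s after %d rounds", exc, cap.steps)
    raise
```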
Gate 3: Cost-convergence check
The interesting gate. The intuition: if the agent's current-best posterior belief at step N is indistinguishable from its belief at step N-1, step N+1 is unlikely to produce a different belief. Halt the loop and return the current state.
Implementing this requires the agent to emit a structured posterior after each step: at minimum a probability and a one-sentence thesis. Each round of tool-use, the model is asked to request the next tool call and, on the same turn, to restate its current estimate of the outcome. After each step, the wrapper compares the new posterior to the previous one and stops the loop when the change is below a threshold.
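One way to phrase that contract in the system prompt. The field name matches the current_posterior convention used later in this article; the exact wording and schema are assumptions, not anything the API enforces.

```python
# Appended to the research agent's system prompt (wording is illustrative).
POSTERIOR_INSTRUCTION = """
On every turn, after any tool request, restate your current view as a JSON
block with exactly this shape:
{"current_posterior": {"prob": <number between 0 and 1>, "thesis": "<one sentence>"}}
"""

# What one assistant turn might then carry alongside its tool_use block
# (values are hypothetical):
# {"current_posterior": {"prob": 0.62,
#   "thesis": "Datacenter segment growth likely beats street estimates."}}
```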
Two distance metrics are practical. KL divergence is the information-theoretic gold standard for comparing probability distributions.2 For a single binary outcome with probabilities p and q, KL(p || q) reduces to p * log(p/q) + (1-p) * log((1-p)/(1-q)). The simpler alternative is absolute difference |p - q|, which is cheaper to reason about and almost always adequate for a halt condition on a univariate posterior. KL becomes the right choice when the posterior is a full distribution over multiple outcomes, for instance when the agent emits a categorical over five bucketed return ranges.
```python
import math
from dataclasses import dataclass


@dataclass
class Posterior:
    prob: float
    thesis: str


class ConvergenceGate:
    def __init__(
        self,
        eps: float = 0.01,
        method: str = "abs",
        consecutive: int = 2,
    ):
        self.eps = eps
        self.method = method
        self.consecutive = consecutive
        self.history: list[Posterior] = []
        self.streak = 0

    def _distance(self, a: float, b: float) -> float:
        if self.method == "abs":
            return abs(a - b)
        # KL divergence for a binary outcome, with probabilities clamped
        # away from 0 and 1 to keep the logs finite.
        p, q = max(min(a, 1 - 1e-9), 1e-9), max(min(b, 1 - 1e-9), 1e-9)
        return (
            p * math.log(p / q)
            + (1 - p) * math.log((1 - p) / (1 - q))
        )

    def update(self, post: Posterior) -> bool:
        self.history.append(post)
        if len(self.history) < 2:
            return False
        d = self._distance(post.prob, self.history[-2].prob)
        # Converged only after `consecutive` sub-epsilon moves in a row.
        self.streak = self.streak + 1 if d < self.eps else 0
        return self.streak >= self.consecutive
```
Two design choices deserve attention. First, the gate requires consecutive converged steps, not a single one. A single stable step can be a coincidence; two or three in a row are a signal. Second, epsilon is workload-dependent. A research prompt that produces calibrated probabilities between 0.40 and 0.60 has a useful range of 0.20, so epsilon of 0.01 is a 5% relative change, which is tight but plausible. Prompts with wider probability ranges tolerate larger epsilon.
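A quick usage sketch with made-up per-step estimates, showing the gate firing only after two consecutive sub-epsilon moves.

```python
gate = ConvergenceGate(eps=0.01, method="abs", consecutive=2)

estimates = [0.50, 0.58, 0.61, 0.612, 0.613]  # hypothetical per-step posteriors
for step, p in enumerate(estimates, start=1):
    done = gate.update(Posterior(prob=p, thesis="placeholder"))
    print(step, p, done)
# Steps 4 and 5 each move the probability by less than 0.01, so update()
# returns True on step 5 and the orchestrator halts the loop.
```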
Tune by inspecting logs. If the gate never fires, epsilon is too tight or the prompt is genuinely producing new information each step. If the gate fires before the agent has finished retrieval, epsilon is too loose or the model is anchoring on its initial estimate, which is a separate pathology worth fixing upstream.
Practical combined pattern
All three gates compose into a single orchestrator. The pattern below runs a retrieval-analysis-synthesis loop with retrieval tools stubbed out for brevity; the important shape is the gate ordering and the fallback behavior when each one fires.
```python
def run_research(
    client: BudgetedClient,
    step_cap: StepCap,
    conv: ConvergenceGate,
    prompt: str,
    tools: list,
) -> dict:
    messages = [{"role": "user", "content": prompt}]
    posteriors: list[Posterior] = []
    termination = "step_cap"
    while True:
        # Gate 2: hard cap on tool-use rounds.
        try:
            step_cap.tick()
        except StepLimitExceeded:
            termination = "step_cap"
            break
        # Gate 1: hard token budget, enforced by the wrapper.
        try:
            resp = client.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=tools,
                messages=messages,
            )
        except BudgetExceeded:
            termination = "token_budget"
            break
        messages.append({"role": "assistant", "content": resp.content})
        # Gate 3: halt once the stated posterior stops moving.
        post = extract_posterior(resp)
        if post is not None:
            posteriors.append(post)
            if conv.update(post):
                termination = "converged"
                break
        if resp.stop_reason == "end_turn":
            termination = "end_turn"
            break
        tool_results = run_tools(resp.content, tools)
        messages.append({"role": "user", "content": tool_results})
    final = posteriors[-1] if posteriors else None
    return {
        "termination": termination,
        "posterior": final,
        "steps": step_cap.steps,
        "tokens": client.used,
        "calls": client.calls,
    }
```
The orchestrator logs why it stopped. That field is load-bearing for diagnostics: a healthy agent terminates on converged or end_turn most of the time, with occasional step_cap or token_budget on genuinely hard ideas. An agent terminating 80% on token_budget is telling the operator that the budget is too tight for the prompt, the prompt is too broad for the budget, or the tool surface is returning noise.
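A small sketch of that diagnostic: tally the termination field over a day's worth of run_research results. The results list is whatever store the orchestrator writes to; the healthy mix in the comment restates the guidance above, not measured data.

```python
from collections import Counter


def termination_mix(results: list[dict]) -> dict[str, float]:
    """Fraction of ideas ending in each termination reason."""
    counts = Counter(r["termination"] for r in results)
    total = sum(counts.values()) or 1
    return {reason: n / total for reason, n in counts.items()}

# A healthy day looks roughly like
# {"converged": 0.6, "end_turn": 0.25, "step_cap": 0.1, "token_budget": 0.05};
# 80% "token_budget" points at the budget, the prompt, or the tool surface.
```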
extract_posterior is workload-specific: parse the model output for a JSON block, a tagged span, or a tool call arg. The contract with the model lives in the system prompt: ask explicitly for a current_posterior field on every turn, so the convergence gate has something to chew on. One possible shape of the parser is sketched below.
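A sketch of one such parser, assuming the JSON-block contract from the Gate 3 section. The regex and field names follow this article's convention and are not enforced by the API; adapt them to whatever structure your prompt actually requests.

```python
import json
import re

POSTERIOR_RE = re.compile(
    r'\{[^{}]*"current_posterior"[^{}]*\{[^{}]*\}[^{}]*\}'
)


def extract_posterior(resp) -> Posterior | None:
    """Pull the last current_posterior block out of an assistant turn."""
    text = "".join(
        block.text for block in resp.content
        if getattr(block, "type", "") == "text"
    )
    matches = POSTERIOR_RE.findall(text)
    if not matches:
        return None
    try:
        payload = json.loads(matches[-1])["current_posterior"]
        return Posterior(prob=float(payload["prob"]), thesis=str(payload["thesis"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed block: treat as "no posterior this turn" rather than failing.
        return None
```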
| Termination reason | Typical cause | First fix to try |
|---|---|---|
| converged | Agent reached stable belief | Nothing; this is the happy path |
| end_turn | Model signaled finished | Nothing; also happy path |
| step_cap | Loop did not converge in N rounds | Inspect trace; raise cap to 15 if content looks productive |
| token_budget | Context grew past budget | Truncate older tool results or switch to a cheaper model |
| tool_error | Tool raised repeatedly | Fix upstream; this is not a gate failure |
When gates fight each other
The gates interact. A step-count cap that fires before the convergence gate ever does means the cap is too tight for the prompt's natural length; raise it in steps of 2 and watch whether convergence starts firing first. A token-budget gate that fires before step-cap means the budget is too small for the observed token consumption per step, which is usually a cache-miss problem or a context-growth problem. The fix is to truncate stale tool results from the message history or switch the synthesis step to a smaller model for that particular idea.
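A minimal sketch of the truncation fix, keeping only the most recent tool results at full length. It assumes tool results were appended as plain strings, as in the simplified orchestrator above; the cutoffs are illustrative and should be tuned against observed context growth.

```python
def truncate_stale_tool_results(
    messages: list[dict],
    keep_recent: int = 2,
    max_chars: int = 500,
) -> list[dict]:
    """Shorten older tool-result turns; keep the newest few intact."""
    # User-role turns after the original prompt are the tool-result turns.
    result_turns = [
        i for i, m in enumerate(messages[1:], start=1) if m["role"] == "user"
    ]
    stale = set(result_turns[:-keep_recent]) if keep_recent else set(result_turns)
    truncated = []
    for i, m in enumerate(messages):
        content = m["content"]
        if i in stale and isinstance(content, str) and len(content) > max_chars:
            m = {**m, "content": content[:max_chars] + " ...[truncated]"}
        truncated.append(m)
    return truncated
```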
A convergence gate that fires while the agent has clearly not finished real work is the most diagnostic failure. It usually means one of two things. First, the model is anchoring on its initial prior and producing superficially new text that does not move the probability, which is a prompt problem: ask the model to state new evidence on each turn rather than a restated thesis. Second, epsilon is too loose for the dynamic range of the posterior: tighten from 0.02 to 0.01 or require three consecutive converged steps instead of two.
When more than one gate is near its limit on the same idea, log every reason, not just the one that actually halted the loop. A trace where the step cap trips at step 10 while the token budget would have tripped at step 12 and convergence at step 11 is a clear signal that the prompt is underspecified: every gate thinks the loop is misbehaving, which usually means the loop genuinely is. The fix is upstream prompt engineering, not loosening gates.
Budget drift across a week is worth monitoring. Hitting the per-idea token budget on 5% of ideas in a day is healthy. Hitting it on 40% means the workload has shifted (model version, prompt change, a new tool returning larger payloads) and the envelope needs a recompute, not more headroom. The Agent Cost Envelope Calculator takes observed per-idea distributions and produces budget recommendations at target reject rates.
| Gate interaction | What it means | Action |
|---|---|---|
| Step-cap fires, convergence never does | Cap too tight or prompt underspecified | Raise cap; if still failing, simplify prompt |
| Token-budget fires, convergence never does | Context growth or cache miss | Truncate history; pin stable prefix |
| Convergence fires too early | Model anchoring or epsilon too loose | Tighten epsilon; require 3 consecutive |
| All three fire | Prompt is not actually researchable | Fix upstream; do not loosen gates |
Connects to
- Observability for LLM Trading Agents — gate triggers are only useful if you log and query them.
- Rate-Limit Design for LLM Research — cost gates and rate gates are complementary controls on the same pipeline.
- The Token-Cost Reality of LLM Trading Research — the pricing baseline that per-idea budgets are derived from.
- Building a Production-Grade Claude Agent for Finance — the end-to-end agent architecture into which these gates drop.
- 5 Failure Modes of LLM Trading Agents — runaway loops are one of the five; this article is the detailed fix.
- Agent Cost Envelope Calculator — turns observed per-idea cost distributions into gate parameter recommendations.
- Token Cost Optimizer — pricing table behind the per-idea budget derivation.
References
- Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley. Reference for entropy-based convergence diagnostics on empirical distributions.
- Anthropic (2026). "Prompt caching." https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching (accessed 2026-04-22). Relevant to budget math when the fixed prefix is cached.
- Kullback, S., & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics 22(1), pp. 79-86. The original KL-divergence paper, for practitioners who want the primary source.
Footnotes
1. Anthropic (2026). "Tool use with Claude — safety and step limits." https://docs.anthropic.com/en/docs/build-with-claude/tool-use (accessed 2026-04-22).
2. Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley-Interscience. Chapter 2 (entropy, relative entropy) and Chapter 11 (information theory and statistics) define KL divergence and its use as a convergence measure.