TL;DR
Multi-step research agents that append every retrieved document to one rolling context pay two costs. Token bills scale linearly with step count, and signal-to-noise degrades as the model attends less to the actual question buried among peer filings, macro notes, and stale tool outputs. The fix is a three-tier layered summary: raw documents live at the leaf and are fetched by retrieval ID, condensed two-page briefs sit at the intermediate layer with citations back to the leaf, and a short two-paragraph working memory carries the current posterior, open questions, and candidate actions. At each step the loop decides what stays raw, what collapses to a brief, and what drops entirely. Fidelity is preserved where it matters; everywhere else, tokens get cut.
Why raw context accumulation fails
A naive research loop fetches a filing, appends it to context, asks a question, appends the tool result, fetches the next filing, appends that, and continues. Two failure modes follow directly.
Per-call input grows linearly with step count, and cumulative input grows quadratically. Ten research steps each appending a 30,000-token 10-K produce a 300,000-token input on the final call. At Anthropic's published 2026-04 Sonnet 4.6 input rate of $3 per million tokens, that single terminal call costs roughly $0.90 in input alone, and every intermediate call pays for its own accumulated prefix. The cumulative input across a ten-step loop is not 300,000 tokens but the sum of the series 30K + 60K + 90K + ... + 300K, which is 1.65 million tokens — more than five times the terminal-call figure, or about $4.95 per loop. A hundred such loops per day is roughly $500 in input spend. Prompt caching flattens the curve for repeated prefixes[^1] but does not help when every step genuinely adds new raw material.
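A back-of-the-envelope sketch of that arithmetic, using only the figures quoted above (a 30,000-token filing per step, $3 per million input tokens); the function name is illustrative:

```python
# Raw-accumulation cost model: step k carries k documents' worth of prefix,
# so cumulative input is 30K * (1 + 2 + ... + steps) = 30K * steps * (steps + 1) / 2.
DOC_TOKENS = 30_000
USD_PER_TOKEN = 3 / 1_000_000  # $3 per million input tokens

def raw_accumulation_cost(steps: int) -> tuple[int, float]:
    """Return (cumulative input tokens, cumulative input cost in USD) for one loop."""
    total_tokens = DOC_TOKENS * steps * (steps + 1) // 2
    return total_tokens, total_tokens * USD_PER_TOKEN

print(raw_accumulation_cost(10))   # (1650000, 4.95): the ten-step figure above
print(raw_accumulation_cost(20))   # doubling the steps roughly quadruples the bill
```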
Attention dilutes across long context. Needle-in-a-haystack evaluations published by Kamradt[^2] and extended by subsequent long-context benchmarks show retrieval accuracy dropping in the middle third of very long contexts, even on models marketed as million-token-capable. Anthropic's own long-context guidance[^3] and Google's Gemini 2.5 Pro technical notes[^4] acknowledge the same pattern: a fact placed at position 200,000 of a 400,000-token input is harder to retrieve than the same fact placed at position 5,000. The model still returns an answer. It is just quietly worse. Liu and colleagues catalogued this as the "lost in the middle" effect across multiple models and task types. For multi-step research the implication is direct: the question being answered almost always sits early in the prompt, while the newest tool result sits at the end. Everything accumulated between them becomes attention-starved middle.
Both failure modes share a cause. Raw accumulation treats every retrieved byte as equally worth the model's attention. Real research tasks do not work that way.
The three-tier layered summary
The alternative is a retention hierarchy with explicit policies at each tier.
Leaf tier — raw documents, fetched on demand. Full 10-Ks, full earnings call transcripts, full legal filings. These live in a retrieval store indexed by a stable ID (CIK plus accession number for SEC EDGAR filings, for instance). The research loop never concatenates them into context by default. It requests specific sections via a retrieval tool when the brief at the intermediate tier signals a gap.
Intermediate tier — structured briefs, roughly 2,000 tokens per source document. Each brief follows a fixed schema: issuer identifier, reporting period, key financial figures as a small table, material risk factors as a bulleted list, management commentary quoted with source offsets, and a citations block mapping every claim back to a leaf-tier offset. Briefs are cheap to regenerate but usually cached; a brief written once feeds many downstream calls.
Working memory tier — two paragraphs, under 4,000 tokens. The current posterior (what the agent currently believes about the research question, with uncertainty), open questions the next step should address, and candidate actions with expected information value. This is the only tier that changes on every step. Everything else is cache-friendly.
The retention policy is explicit. Leaf documents stay in the store forever but enter context only when referenced. Briefs stay in context as long as the working memory mentions their source. Working memory is rewritten at each step from the previous version plus the latest tool result.
Decision rules for what stays in context
At each step the driver applies a fixed policy to every candidate piece of content.
| Content type | Keep raw | Summarize | Drop |
|---|---|---|---|
| Current target filing | yes | | |
| Adjacent peer filings | | yes | |
| Macro backdrop | | yes | |
| Previously explored dead ends | | | yes |
| Tool results from prior steps | context-dependent | context-dependent | if relevance below 5% |
| Citation metadata | yes | | |
| Model scratch reasoning | | | yes, after one step |
"Context-dependent" for prior tool results means the driver scores relevance against the current working memory: tool outputs cited by the current posterior stay raw, outputs cited only by superseded hypotheses collapse to a one-line note, outputs unreferenced for two consecutive steps drop entirely. Citation metadata is cheap and traceability-critical, so it always stays raw.
Dead ends are the highest-value drops. A research agent that explored "does SYNTHETIC_A have exposure to Argentine peso devaluation" and concluded no should not carry that 40,000-token exploration into the next step. A one-line note in working memory — "Argentine FX exposure: ruled out, see tool call 7" — is enough to prevent re-exploration while leaving a retrieval path if a later step resurrects the hypothesis. The leaf store still holds the original tool output; only the context bloat is gone.
Tool-output scoring is the hardest rule to get right. A cheap heuristic: tag every tool result with the open question it was run against, then score relevance by whether that question survives into the next working memory. Questions that get answered collapse their supporting tool outputs to a brief note. Questions that stay open keep the raw output for one more step. Questions that get dropped from the working memory entirely take their tool outputs with them.
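A minimal sketch of that heuristic, assuming a small `ToolRecord` bookkeeping structure and a two-step grace period (both are illustrative choices, not part of any library):

```python
from dataclasses import dataclass

@dataclass
class ToolRecord:
    question: str              # open question the tool call was run against
    output: str                # raw tool output, also persisted in the leaf store
    steps_unreferenced: int = 0

def retention_for(record: ToolRecord, next_open_questions: list[str]) -> str:
    """Return 'raw', 'note', or 'drop' for one prior tool output at the next step."""
    if record.question in next_open_questions:
        # The question survived into the next working memory: keep raw one more step.
        record.steps_unreferenced = 0
        return "raw"
    record.steps_unreferenced += 1
    if record.steps_unreferenced >= 2:
        # Unreferenced for two consecutive steps: drop from context entirely.
        # The leaf store still holds the output if a later step needs it.
        return "drop"
    # The question was answered or superseded: collapse to a one-line note.
    return "note"
```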
Runnable implementation
Two functions cover the core hygiene operations. The first collapses a raw document to a brief; the second prunes the working memory between steps.
```python
import json
from dataclasses import asdict, dataclass, field

from anthropic import Anthropic

client = Anthropic()


@dataclass
class Brief:
    """Intermediate tier: ~2,000-token structured summary of one leaf document."""
    source_id: str
    period: str
    figures: dict
    risks: list
    quotes: list
    citations: list


@dataclass
class WorkingMemory:
    """Top tier: the only state rewritten on every step, bounded at ~4,000 tokens."""
    posterior: str
    open_questions: list
    candidate_actions: list
    brief_refs: list = field(default_factory=list)


BRIEF_SCHEMA = {
    "source_id": "string, e.g. CIK+accession",
    "period": "reporting period, e.g. FY2025",
    "figures": "dict of metric -> value with unit",
    "risks": "list of material risk strings",
    "quotes": "list of {text, offset} from source",
    "citations": "list of {claim, source_offset}",
}


def summarize_to_brief(full_doc: str, source_id: str,
                       target_tokens: int = 2000) -> Brief:
    """Collapse a raw leaf document into a citation-preserving brief."""
    prompt = (
        "Summarize the document to the schema below. "
        f"Target length: {target_tokens} tokens. "
        "Every numerical claim must include a citation with the byte offset "
        "in the source. Do not omit citations. "
        "Return only a JSON object matching the schema.\n\n"
        f"Schema: {json.dumps(BRIEF_SCHEMA)}\n\n"
        f"Source ID: {source_id}\n\n"
        f"Document:\n{full_doc}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-6",
        max_tokens=target_tokens + 500,
        messages=[{"role": "user", "content": prompt}],
    )
    data = json.loads(resp.content[0].text)
    return Brief(**data)


def prune_working_memory(state: WorkingMemory,
                         latest_tool_output: str,
                         max_tokens: int = 4000) -> WorkingMemory:
    """Rewrite working memory from the previous state plus the newest tool output."""
    prompt = (
        "Update the working memory with the latest tool output. "
        "Rules: (1) rewrite posterior in under 200 words, "
        "(2) keep at most 5 open questions ranked by information value, "
        "(3) drop any brief_ref not cited by the posterior or open questions, "
        "(4) total output must fit in "
        f"{max_tokens} tokens. "
        "Return only a JSON object with the same fields as the current state.\n\n"
        f"Current state: {json.dumps(asdict(state))}\n\n"
        f"Latest tool output: {latest_tool_output}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-6",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    data = json.loads(resp.content[0].text)
    return WorkingMemory(**data)
```
The driver loop wires both together. On each step it calls the research tool, prunes working memory with the result, and then chooses whether to request a new leaf document or finalize the answer.
```python
def research_loop(question: str, max_steps: int = 8) -> str:
    """Drive the loop: pick an action, run it, prune working memory, repeat."""
    wm = WorkingMemory(
        posterior=f"Unknown. Question: {question}",
        open_questions=[question],
        candidate_actions=["fetch_primary_filing"],
    )
    briefs: dict[str, Brief] = {}  # intermediate tier, keyed by source ID
    for step in range(max_steps):
        action = pick_action(wm, briefs)
        if action.kind == "finalize":
            return compose_answer(wm, briefs)
        if action.kind == "fetch_leaf":
            # The raw document never enters the driver's context; only its brief does.
            doc = retrieve(action.source_id)
            briefs[action.source_id] = summarize_to_brief(doc, action.source_id)
            tool_out = f"Brief for {action.source_id} available."
        else:
            tool_out = run_tool(action, briefs)
        wm = prune_working_memory(wm, tool_out)
    return compose_answer(wm, briefs)
```
`pick_action`, `retrieve`, `run_tool`, and `compose_answer` are task-specific; what matters is the shape. The loop never passes raw documents through the driver's context. Briefs live in a dict keyed by source ID and are assembled into the final compose call only when needed.
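For concreteness, here is one hypothetical shape for the action object and selection policy the loop assumes; the `Action` dataclass, the `"primary_filing"` ID, and the greedy rules are illustrative assumptions, not part of the pattern:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                          # "fetch_leaf", "run_tool", or "finalize"
    source_id: str = ""                # leaf-store ID when kind == "fetch_leaf"
    tool: str = ""                     # tool name when kind == "run_tool"
    args: dict = field(default_factory=dict)

def pick_action(wm: WorkingMemory, briefs: dict) -> Action:
    """Greedy placeholder; a real driver would score candidate actions by
    expected information value against the open questions."""
    if not wm.open_questions:
        return Action(kind="finalize")
    if not briefs:
        # Nothing summarized yet: start with the primary filing (illustrative ID).
        return Action(kind="fetch_leaf", source_id="primary_filing")
    next_tool = wm.candidate_actions[0] if wm.candidate_actions else "search_filings"
    return Action(kind="run_tool", tool=next_tool,
                  args={"question": wm.open_questions[0]})
```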
Interaction with prompt caching
Hygiene and caching have aligned incentives. A brief of peer filings that sits in context across thirty research steps is an ideal caching candidate: written once at a 25% write premium, read thirty times at a 90% read discount. The economics of this pattern are covered in Prompt Caching Economics for Finance.
Raw filings, by contrast, are poor caching candidates unless the same filing is re-read within the five-minute cache window. A one-off 30,000-token filing read twice with more than five minutes between reads costs more with caching than without: the write premium is paid on each read and there is never a cache hit to discount. The hygiene rule — summarize peer filings to briefs, keep only the current target raw — is exactly the right input to the caching decision. Briefs become cache-resident, targets rotate, and leaves live outside context entirely.
The Financial Document Token Estimator and Token Cost Optimizer make the arithmetic concrete for a given corpus before committing to a hygiene policy. A typical finding: a 40-filing peer set at 25,000 tokens each is a million tokens raw, versus 80,000 tokens summarized to briefs. Cached as briefs at 2026-04 Sonnet 4.6 rates, the cost per read collapses to about $0.024 — well below the uncached raw-peers cost on any realistic step count.
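The arithmetic behind those numbers, as a sketch using the rates stated in this article (base input $3 per million tokens, 1.25x cache-write premium, 0.1x cache-read rate); the variable names are illustrative:

```python
# Per-step cost of carrying the peer set in context, raw vs summarized to briefs.
USD_PER_TOKEN = 3 / 1_000_000          # base input rate quoted above
CACHE_WRITE, CACHE_READ = 1.25, 0.10   # multipliers per the prompt-caching footnote

raw_peer_tokens = 40 * 25_000          # 1,000,000 tokens of raw peer filings
brief_peer_tokens = 40 * 2_000         # 80,000 tokens once collapsed to briefs

uncached_raw_per_step = raw_peer_tokens * USD_PER_TOKEN                # $3.00 every step
brief_cache_write = brief_peer_tokens * USD_PER_TOKEN * CACHE_WRITE    # $0.30, paid once
brief_cache_read = brief_peer_tokens * USD_PER_TOKEN * CACHE_READ      # $0.024 per later step

# Even the one-off cache-write step is a 10x saving over carrying raw peers,
# and every subsequent cached step is over 100x cheaper.
print(uncached_raw_per_step, brief_cache_write, brief_cache_read)
```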
Anti-patterns
Four recurring mistakes show up in agents that skip context hygiene.
Appending the full conversation history on every turn. The transcript grows linearly with step count, and most of it is stale reasoning the model does not need. The working memory tier exists precisely to replace the transcript with a reconstructed state summary.
Dropping source citations during summarization. A brief without citations is a liability: downstream the agent cannot verify claims, and the human reviewer has no trace back to primary documents. Citations are the cheapest possible part of a brief — a byte offset is a few dozen tokens — and must never be stripped. The 8-step LLM research prompt template bakes citation preservation into the schema.
Summarizing before the model has extracted anything useful. If a filing was just fetched and no question has been asked of it, summarizing it immediately throws away detail the next step might need. The correct order is: fetch raw, run the first extraction against the raw document, then collapse to brief. Premature summarization is a form of lossy preprocessing dressed up as hygiene.
Treating every intermediate tool output as required context. A tool that returned "no matching filings" two steps ago does not need to ride along. A one-line note in working memory suffices. Tool-output inflation is the most common single cause of runaway context.
Fidelity-vs-cost tradeoff
Different research shapes call for different hygiene profiles.
| Strategy | Fidelity | Token cost per step | When appropriate |
|---|---|---|---|
| Raw accumulation | High | High (linear in step count) | Short loops, single target document, under 5 steps |
| Eager summarization | Medium | Low (bounded by brief size) | Long loops, many peer documents, extraction-focused |
| Retrieval on demand | High where retrieved | Lowest (only pay for what loads) | Unlimited document corpus, citation-heavy output |
| Hybrid three-tier | High at leaf, bounded elsewhere | Medium, flat across steps | Default for production research agents |
The hybrid three-tier approach described above is the default for a reason: it gives raw fidelity on the current target, bounded cost on everything else, and an explicit drop policy for stale material. Pure retrieval on demand is cheaper but pushes more work onto the retrieval system and risks missing cross-document connections that benefit from carrying a few briefs in context. Pure eager summarization is cheapest but loses detail the agent later wants.
The right choice depends on loop length, corpus size, and whether the final output must cite primary sources. A three-step loop against a single filing can reasonably skip briefs entirely. A fifteen-step loop against a twenty-peer corpus cannot.
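A rough way to encode that choice as a default policy; the thresholds below are illustrative readings of the table above, not benchmarked values:

```python
def choose_hygiene_strategy(expected_steps: int, corpus_docs: int,
                            citation_heavy_output: bool) -> str:
    """Heuristic mapping of loop shape onto one of the four strategies above."""
    if expected_steps < 5 and corpus_docs <= 1:
        return "raw_accumulation"        # short loop against a single target document
    if corpus_docs > 100 and citation_heavy_output:
        return "retrieval_on_demand"     # effectively unlimited, citation-heavy corpus
    if expected_steps >= 10 and corpus_docs > 5 and not citation_heavy_output:
        return "eager_summarization"     # long, extraction-focused loop over many peers
    return "hybrid_three_tier"           # production default
```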
Bounded-Cost Agentic Research treats this same tradeoff from the cost-envelope side: set a hard token budget per loop and let the hygiene policy fit inside it. Inference Cost Attribution per Trade extends the accounting to the individual decision level.
Connects to
- Reading Financial Filings with LLMs 2026 — pillar guide that frames filing-based research.
- Prompt Caching Economics for Finance — cache-hit arithmetic for briefs vs leaves.
- Bounded-Cost Agentic Research — hard token budgets that force hygiene.
- Agent Memory Patterns for Finance — long-horizon memory beyond single loops.
- Inference Cost Attribution per Trade — attributing token spend to decisions.
- Token Cost Reality for LLM Trading Research — empirical cost of naive accumulation.
- 8-step LLM Research Prompt Template — citation-preserving research schema.
- Financial Document Token Estimator — measure raw vs brief token footprint.
- Token Cost Optimizer — cost modelling across hygiene policies.
- Agent Skill Tester — verify that summarization preserves key claims.
References
- Liu, N. F., Lin, K., Hewitt, J., et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics 12, pp. 157–173. Empirical characterization of position-dependent attention degradation.
- Hsieh, C.-P., Sun, S., Kriman, S., et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654. Extended long-context benchmark covering variable-difficulty retrieval and aggregation tasks.
- Bailey, D. H. & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio." Journal of Portfolio Management 40(5). Cited as an example of the kind of primary-source claim that hygiene must preserve citations back to.
Footnotes
[^1]: Anthropic (2026). "Prompt caching." Anthropic API documentation. Cache write premium 1.25x base input rate, cache read discount 0.1x base input rate, five-minute default TTL.

[^2]: Kamradt, G. (2023). "Needle In A Haystack — Pressure Testing LLMs." Public evaluation repository and blog post, documenting retrieval accuracy across context position for long-context models.

[^3]: Anthropic (2025). "Long context tips and best practices." Anthropic API documentation. Guidance on placement of critical facts within long inputs and recommended retrieval augmentation.

[^4]: Google DeepMind (2025). "Gemini 2.5 Pro technical report." Long-context evaluation sections on multi-needle retrieval and position-dependent accuracy.