TL;DR

LLM research agents that treat every request as a cold start produce inconsistent, expensive output. The fix is three memory tiers with distinct roles. Working memory holds the current request's state in process and dies when the request ends. Episodic memory is a queryable store of past research outputs, keyed by ticker, task, and date, with time-based and relevance-based eviction. Long-term memory is an append-only lesson library of methods that worked and failed, injected into the system prompt on every run. The common mistake is conflating them: stuffing prior requests into the context window, or letting lessons drift into an unbounded episodic blob. Below: runnable Python for each tier, a SQLite schema for episodic memory, and a 40-line integration that queries all three before a research call and writes back after.

Why memory matters in finance research specifically

Finance research carries unusually strong reference-class signals. A ticker analyzed yesterday does not need a full cold-start re-analysis today if the filings set and macro regime are unchanged. An earnings call studied last quarter frames what questions to ask on this quarter's transcript. A calibration lesson ("probability estimates on macro topics run systematically overconfident, shrink by 15%") should persist across every subsequent loop and get applied automatically.

A loop that forgets these signals pays three costs. First, cost in tokens: re-running a full 8-step research pack¹ for a ticker whose thesis hasn't moved burns $0.30-$1.20 per cycle at 2026-04 Sonnet 4.6 rates, for no incremental information. Second, cost in consistency: two cold starts a week apart produce probabilities that differ by 10-20 percentage points not because new evidence arrived but because the prompt hit a different slice of latent space. Third, cost in learning: every miscalibration is re-discovered rather than corrected, because there is no substrate for the correction to land on.

Three-tier memory also shifts how a research loop handles regime changes. When rates flip from cutting to holding, or when earnings season changes the volume of guidance language in the pipeline, the episodic store should register that its older entries are weaker evidence without being deleted. The lesson library should absorb anything the agent learns about behaving differently in the new regime. Neither shift should collapse into the working-memory scratch space, which resets on every call.

The three-tier split is the standard remedy. The production Claude agent for finance scaffold touches this briefly in its decision-log layer; this article unpacks all three tiers with runnable code and schemas. The CoALA architecture paper² formalizes the split in academic terms; the treatment below is the retail-scale operational version.

Tier 1: Working memory

Working memory is the scratch space of a single research request. Scope: one request lifecycle. Storage: in-process Python objects plus whatever sits in the model's context window for that call. Retrieval: trivial, just pass it as an argument. Eviction: automatic, when the request function returns.

The failure mode is stuffing prior requests into the context window under the heading "memory." Every unrelated prior dossier a model sees in its context adds input tokens, drags attention, and creates false reference classes ("ticker A's thesis also had weak guidance, so ticker B's probably does"). Working memory stays strictly inside the current request.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkingMemory:
    ticker: str
    task_type: str
    filings_loaded: list[str] = field(default_factory=list)
    subproblems_completed: list[str] = field(default_factory=list)
    partial_findings: dict[str, str] = field(default_factory=dict)
    posterior_so_far: Optional[float] = None
    token_budget_remaining: int = 0
    tool_calls_used: int = 0

    def note(self, subproblem: str, finding: str) -> None:
        self.subproblems_completed.append(subproblem)
        self.partial_findings[subproblem] = finding

    def within_budget(self) -> bool:
        # 1500 tokens are reserved for a final synthesis step; 8 tool calls
        # is a hard ceiling so a single request cannot loop indefinitely.
        return self.token_budget_remaining > 1500 and self.tool_calls_used < 8

That is the full contract: mutable state for this one call, no persistence, no cross-request read. When the request returns, the object is garbage-collected and the only thing that survives is what the request explicitly writes to episodic memory or to the lesson library.
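In use, the object lives exactly as long as the request that created it. A sketch of the guard loop, where fetch_filing and summarize are hypothetical stand-ins for real tool calls:

def run_request(ticker: str, budget: int) -> dict:
    # Everything below dies with this function's stack frame.
    wm = WorkingMemory(ticker=ticker, task_type="filing_review",
                       token_budget_remaining=budget)
    for filing in ["10-K", "latest 8-K"]:
        if not wm.within_budget():
            break  # degrade gracefully rather than overrun the budget
        text = fetch_filing(ticker, filing)      # hypothetical tool call
        wm.filings_loaded.append(filing)
        wm.note(filing, summarize(text))         # hypothetical tool call
        wm.tool_calls_used += 2
        wm.token_budget_remaining -= len(text) // 4  # rough token estimate
    # Only what is returned or explicitly persisted survives the request.
    return wm.partial_findings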

Tier 2: Episodic memory

Episodic memory is the interesting tier and the one most implementations get wrong. Scope: past research outputs, keyed by ticker + task + date. Storage: SQLite or JSONL plus a vector index for semantic query. Retrieval: two-step. First a structured query ("most recent research on SYNTHETIC_A from the past 30 days"), then a vector search for semantically related past analysis across similar tickers or question types. Eviction: time-based, size-based, or relevance-based, chosen deliberately.

Schema

A SQLite schema that has held up across several production retail loops:

CREATE TABLE IF NOT EXISTS episodic_memory (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker          TEXT    NOT NULL,
    task_type       TEXT    NOT NULL,
    created_at      TEXT    NOT NULL,
    question        TEXT    NOT NULL,
    answer          TEXT    NOT NULL,
    probability     REAL,
    confidence      REAL,
    source_citations TEXT,
    regime_tag      TEXT,
    read_count      INTEGER DEFAULT 0,
    last_read_at    TEXT,
    embedding_blob  BLOB
);

CREATE INDEX idx_epi_ticker_date ON episodic_memory (ticker, created_at);
CREATE INDEX idx_epi_task_date   ON episodic_memory (task_type, created_at);

source_citations is a JSON blob of filing URLs, page anchors, transcript timestamps. regime_tag (for example {"rates": "cutting", "vol": "low", "earnings_season": true}) matters because episodic entries recorded under one regime are weaker evidence under another. read_count and last_read_at drive relevance-based eviction: an entry nobody has retrieved in 90 days is probably dead weight.

The embedding is a 1024- or 1536-dimensional float32 vector (4-6KB per entry); SQLite stores it as a BLOB. For retail scale (10K-50K entries) brute-force cosine similarity across the whole set runs in 30-80 ms. Hosted vector DBs are unnecessary overhead at this volume; see the storage table below.
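Round-tripping a vector through the BLOB column takes a few lines with numpy; a minimal sketch (to_blob and from_blob are illustrative helpers, not part of the schema):

import numpy as np

def to_blob(vec: list[float]) -> bytes:
    # float32 halves storage versus float64: a 1536-dim vector is ~6KB.
    return np.asarray(vec, dtype=np.float32).tobytes()

def from_blob(blob: bytes) -> np.ndarray:
    # Zero-copy, read-only view over the stored bytes.
    return np.frombuffer(blob, dtype=np.float32)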

Retrieval and the should-rerun check

The canonical call into episodic memory on a new request is: is the prior analysis still valid, or does new evidence force a rerun?

import json, sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

def should_rerun(
    con: sqlite3.Connection,
    ticker: str,
    task_type: str,
    days_threshold: int = 14,
    new_signals: Optional[dict] = None,
) -> tuple[bool, Optional[dict]]:
    """Return (rerun_needed, prior_entry_or_None)."""
    new_signals = new_signals or {}
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days_threshold)).isoformat()
    row = con.execute(
        """SELECT id, created_at, question, answer, probability,
                  confidence, regime_tag
             FROM episodic_memory
            WHERE ticker = ? AND task_type = ? AND created_at >= ?
            ORDER BY created_at DESC LIMIT 1""",
        (ticker, task_type, cutoff),
    ).fetchone()
    if row is None:
        return True, None
    prior = dict(zip(
        ["id", "created_at", "question", "answer", "probability",
         "confidence", "regime_tag"], row))
    prior_regime = json.loads(prior["regime_tag"] or "{}")
    # Material-change heuristics:
    if new_signals.get("new_filing_since", False):
        return True, prior
    if new_signals.get("guidance_update_since", False):
        return True, prior
    if prior_regime.get("rates") != new_signals.get("rates"):
        return True, prior
    if prior["confidence"] is not None and prior["confidence"] < 0.55:
        return True, prior  # low-confidence prior is weak evidence; rerun
    con.execute(
        "UPDATE episodic_memory SET read_count = read_count + 1, "
        "last_read_at = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), prior["id"]),
    )
    con.commit()
    return False, prior

If should_rerun returns (False, prior), the loop short-circuits: reuse the prior answer, skip the expensive research call, advance to decision-and-sizing. A loop running 200 ticker-task pairs per week with a 40% reuse rate cuts research-layer LLM spend by 40% with no quality loss on unchanged reference classes.

The material-change heuristics above are deliberately conservative. Any new filing, any guidance update, any regime-tag mismatch, and any low-confidence prior all force a rerun. False positives (rerunning when the prior would have held) are cheap. False negatives (reusing a stale prior after material news) are dangerous because they propagate into sizing and execution. That asymmetry is why the default should lean toward rerunning.

Vector search for analogous cases

The second retrieval pattern is semantic: "before researching SYNTHETIC_A's guidance tone, retrieve the five most similar past guidance-tone analyses regardless of ticker." That is a cosine-similarity query over the embedding column, filtered to task_type = "guidance_tone", top-5 by similarity. The retrieved snippets enter the research call as few-shot examples of the analytical pattern, not as direct evidence about the current ticker.
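A minimal brute-force implementation of that query (top_k_similar is a hypothetical helper over the schema above, not an existing API):

import sqlite3
import numpy as np

def top_k_similar(con: sqlite3.Connection, query_vec, task_type: str,
                  k: int = 5) -> list[tuple[float, int, str]]:
    """Cosine top-k over the embedding column, filtered by task type."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    rows = con.execute(
        "SELECT id, answer, embedding_blob FROM episodic_memory "
        "WHERE task_type = ? AND embedding_blob IS NOT NULL",
        (task_type,),
    ).fetchall()
    scored = []
    for row_id, answer, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)
        sim = float(q @ (v / np.linalg.norm(v)))
        scored.append((sim, row_id, answer))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]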

Eviction

Three eviction policies, usually applied together; a combined sweep is sketched after the list:

  • Time-based: delete entries older than 180 days, or archive them to cold storage for post-mortem only.
  • Size-based: keep at most the 20 most-recent entries per (ticker, task_type) pair.
  • Relevance-based: delete entries with read_count = 0 and created_at older than 30 days. Nobody has retrieved them, so they are not informing current research.
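All three can run as a single nightly sweep. A sketch using the thresholds from the list above (the size-based step needs SQLite 3.25+ for window functions):

import sqlite3
from datetime import datetime, timedelta, timezone

def evict(con: sqlite3.Connection, max_age_days: int = 180,
          max_per_key: int = 20, unread_grace_days: int = 30) -> None:
    now = datetime.now(timezone.utc)
    age_cutoff = (now - timedelta(days=max_age_days)).isoformat()
    unread_cutoff = (now - timedelta(days=unread_grace_days)).isoformat()
    # Time-based: hard age limit (archive to cold storage first if desired).
    con.execute("DELETE FROM episodic_memory WHERE created_at < ?",
                (age_cutoff,))
    # Relevance-based: never retrieved and past the grace window.
    con.execute("DELETE FROM episodic_memory "
                "WHERE read_count = 0 AND created_at < ?",
                (unread_cutoff,))
    # Size-based: keep only the N most recent per (ticker, task_type).
    con.execute("""
        DELETE FROM episodic_memory WHERE id IN (
            SELECT id FROM (
                SELECT id, ROW_NUMBER() OVER (
                    PARTITION BY ticker, task_type
                    ORDER BY created_at DESC) AS rn
                  FROM episodic_memory)
             WHERE rn > ?)""", (max_per_key,))
    con.commit()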

Tier 3: Long-term memory (lesson library)

Long-term memory is the smallest tier by volume and the highest per-entry payoff. Scope: cross-request, cross-ticker patterns about what methods worked and what failed. Storage: append-only lessons.jsonl, also rendered into a markdown file for human review. Retrieval: either injected wholesale into the system prompt (feasible up to ~150 lessons) or retrieved contextually by tag. Eviction: append-only by design; human curation on a monthly cadence to retire superseded lessons.

Lessons are not facts about tickers. They are facts about the research process:

  • "Extraction from 10-K footnotes requires a two-pass approach: first pass identifies numeric tables, second pass resolves cross-references. One-pass extraction drops 20-30% of linked notes."
  • "Sentiment analysis on guidance language overcalls 'negative' on technology issuers because cautious language is idiomatic. Apply a +0.08 probability shift toward 'neutral' for technology tickers."
  • "Probability estimates on macro-dependent questions are overconfident by 15% on average. Apply isotonic calibration from the isotonic calibration fit before sizing."
  • "Research packs over 40K input tokens degrade extraction recall by ~12%. Cap research pack size and run multi-call synthesis if needed."
A minimal append-only implementation:

import json, time, uuid
from pathlib import Path

class LessonLibrary:
    def __init__(self, path: str = "memory/lessons.jsonl"):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.touch(exist_ok=True)

    def remember(self, context: str, outcome: str, takeaway: str,
                 tags: list[str]) -> str:
        lesson_id = str(uuid.uuid4())
        row = {
            "id": lesson_id,
            "at": time.time(),
            "context": context,
            "outcome": outcome,
            "takeaway": takeaway,
            "tags": tags,
            "active": True,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(row) + "\n")
        return lesson_id

    def recall(self, tags: list[str], limit: int = 30) -> list[dict]:
        active = []
        with self.path.open() as f:
            for line in f:
                row = json.loads(line)
                if not row.get("active"):
                    continue
                if any(t in row["tags"] for t in tags):
                    active.append(row)
        return active[-limit:]

The active flag enables soft-deletion without losing audit history: a human curator flips active = False on superseded lessons rather than deleting the line. The .recall(tags) output is rendered as a bulleted block and injected into the system prompt under a <learned_patterns> tag.
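A curation helper consistent with that flag might look like the sketch below. It rewrites the file in place, which is acceptable for the monthly human pass but not for the hot path (deactivate_lesson is illustrative, not a method of the class above):

import json
from pathlib import Path

def deactivate_lesson(path: str, lesson_id: str) -> None:
    """Flip active=False on one lesson, preserving the line for audit."""
    p = Path(path)
    rows = [json.loads(line) for line in p.read_text().splitlines() if line]
    for row in rows:
        if row["id"] == lesson_id:
            row["active"] = False
    p.write_text("".join(json.dumps(r) + "\n" for r in rows))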

Three rules keep the library useful. First, lessons are atomic claims, not paragraphs: one takeaway per entry. Second, every lesson cites the evidence that produced it (a decision-log ID, a calibration-set range, a counted failure). Third, the library is curated monthly: conflicts get resolved, duplicates merged, stale lessons deactivated. Without curation, the library grows into a second source of prompt noise.

Putting it together

A single research-loop step integrating all three tiers:

def research_step(
    con: sqlite3.Connection,
    lessons: LessonLibrary,
    client,  # anthropic.Anthropic()
    ticker: str,
    task_type: str,
    new_signals: dict,
    token_budget: int = 8000,
) -> dict:
    # Tier 2: episodic short-circuit
    rerun, prior = should_rerun(con, ticker, task_type,
                                days_threshold=14, new_signals=new_signals)
    if not rerun and prior is not None:
        return {"source": "cached", "prior_id": prior["id"],
                "probability": prior["probability"],
                "answer": prior["answer"]}

    # Tier 3: inject lessons for this task_type
    relevant = lessons.recall(tags=[task_type, "calibration", "extraction"])
    lesson_block = "\n".join(f"- {l['takeaway']}" for l in relevant)

    # Tier 1: working memory for this call
    wm = WorkingMemory(ticker=ticker, task_type=task_type,
                       token_budget_remaining=token_budget)

    system = (f"You are a finance research analyst. "
              f"Apply these learned patterns:\n<learned_patterns>\n"
              f"{lesson_block}\n</learned_patterns>")
    resp = client.messages.create(
        model="claude-sonnet-4-6-20260115",
        max_tokens=1500, temperature=0,
        system=[{"type": "text", "text": system,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user",
                   "content": json.dumps({"ticker": ticker,
                                          "task": task_type,
                                          "signals": new_signals})}],
    )
    result = json.loads(resp.content[0].text)

    # Write-back: append to episodic memory
    con.execute(
        """INSERT INTO episodic_memory
           (ticker, task_type, created_at, question, answer,
            probability, confidence, regime_tag)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (ticker, task_type,
         datetime.now(timezone.utc).isoformat(),
         result["question"], result["answer"],
         result["probability"], result["confidence"],
         json.dumps(new_signals)),
    )
    con.commit()

    # Conditional lesson: only when the outcome surprised the prior.
    if prior is not None:
        # A missing prior probability is treated as an uninformative 0.5;
        # `or 0.5` would wrongly remap a legitimate 0.0.
        prior_p = prior["probability"] if prior["probability"] is not None else 0.5
        if abs(result["probability"] - prior_p) > 0.25:
            lessons.remember(
                context=f"{task_type} on {ticker}, regime={new_signals}",
                outcome=f"probability shifted {prior_p} -> {result['probability']}",
                takeaway=f"Large shifts on {task_type} often follow {list(new_signals.keys())}",
                tags=[task_type, "regime-shift"],
            )
    return {"source": "fresh", "result": result}

That function is the contract. Before the model is called, episodic memory decides whether a call is needed at all, and the lesson library shapes how the model should think. After the model returns, the episodic store grows by one entry, and the lesson library grows only when the result was surprising enough to be worth remembering as a pattern. Cost is bounded because the cache hit on the system prompt pays for itself from the second call onward, and episodic short-circuits remove entire calls.

Storage choices

| Axis | SQLite + pgvector-compatible blob | Hosted vector DB (Pinecone, Weaviate) |
|---|---|---|
| Setup time | 15 minutes | 1-2 hours including auth, SDK, region |
| Cost at 10K entries | $0 (local disk, ~200MB) | $70-$150/month |
| Latency (cosine top-k=5) | 30-80 ms brute force | 20-50 ms network-bound |
| Scale ceiling | ~100K entries before brute force degrades | Billions |
| Backup | cp *.db | Vendor-specific export |
| Offline / vacation-proof | Yes | No (network dependency) |

For a retail loop under 50K episodic entries, SQLite plus a per-row embedding BLOB handles everything with zero operational overhead. A single-file database, periodic cp to an off-machine backup, and brute-force cosine similarity in well under 100 ms are enough until the loop scales past about 100K entries, by which point the cost profile and retrieval-latency requirements have usually shifted enough to justify moving. For bigger shops, pgvector on a managed Postgres gives a middle path: familiar SQL, embedding operators, horizontal scale, no separate vendor. Hosted vector DBs are the right answer only when scale or a multi-tenant requirement justifies the ongoing cost.

The agent cost envelope calculator models the full-loop economics including the token savings from episodic short-circuits and lesson-library injection via prompt caching. The trading system blueprinter generates the scaffold around these memory tiers.

References

  • LlamaIndex documentation. (2026). Memory and persistence modules. Reference for the retrieve-augment pattern used in Tier 2 vector search.
  • LangChain documentation. (2026). Memory abstractions: ConversationBufferMemory, VectorStoreRetrieverMemory. Contrasts with the three-tier split presented here; the library's memory types map roughly onto working and episodic, with no first-class lesson-library abstraction.
  • Pinecone. (2026). Vector database operations guide. Reference for the hosted-DB row of the storage table.
  • Weaviate. (2026). Embedded vs cloud deployment docs. Reference for mid-scale hybrid deployments.
  • pgvector project. (2026). pgvector README and operator documentation. Reference for the Postgres + embedding path.

Footnotes

  1. Anthropic. (2026). Prompt caching and the extended-context API. Published model documentation, 2026-04 rates. Primary source for the Sonnet 4.6 input/output pricing used in the cost claims above.

  2. Sumers, T., Yao, S., Narasimhan, K., & Griffiths, T. (2023). "Cognitive Architectures for Language Agents." arXiv:2309.02427. The CoALA paper formalizes working, episodic, and semantic/procedural memory for language agents; this article is the retail-operational version of that split.