TL;DR
An LLM trading agent that silently makes bad decisions is worse than one that crashes. Three patterns prevent silent failure. Trace-ID propagation binds research, decision, and execution into one lifecycle so any post-mortem can follow a single idea end-to-end. A structured log schema with per-step costs, model versions, prompt hashes, and confidences turns fuzzy logs into queryable evidence. A deterministic replay harness re-runs archived inputs against the current model to detect prompt drift, model drift, and non-determinism. Together these produce cost-per-idea and cost-per-validated-trade SQL that drives continuous improvement rather than blind retries. The patterns cost under 200 lines of Python plus a single SQLite file.
Why LLM agents break observability norms
Traditional microservice tracing assumes deterministic code paths. A request with the same inputs returns the same outputs; a stack trace pins down the failure. LLM agents break every assumption. Outputs are stochastic, model versions ship silently on the provider side, and a one-line prompt edit can shift a confidence score from 0.62 to 0.78 without any downstream signal.
A log entry saying {"event": "research_complete", "ticker": "SYNTHETIC_A", "action": "skip"} is worse than useless. A week later the operator cannot tell which prompt generated the skip, which model version ran, how many tokens it consumed, how long it took, or what the model actually returned. The five failure modes of LLM trading agents all start with logs that cannot reconstruct the moment of failure.
Three concrete examples where traditional logs fail:
- A calibration job shows that confidence-bucket 0.8–0.9 has dropped from a 68% win-rate to 54% over the past month. Without prompt_hash and model_version on each row, the operator cannot tell whether the prompt template changed, the model alias rolled forward to a new version, or the market regime shifted.
- A single idea costs $4.20 in research tokens, 40x the median. Without per-step cost and tool-call boundaries, the operator cannot tell whether one tool call returned 100k tokens of junk or the agent entered a retry loop.
- The agent decided skip on a trade that, three days later, would have paid 8%. Without the full structured output, the operator cannot reconstruct the reasoning to decide whether the skip was justified ex ante or a prompt bug.
The minimum viable observability for an LLM agent includes: a trace-ID that crosses process boundaries, a schema with prompt hash and model version on every step, and a replay mechanism that can re-execute the step against archived inputs. Everything below builds these three in order.
Pattern 1: Trace-ID propagation
Assign one trace_id per idea. Every downstream event carries it: the research call, the decision, the order placement, the fill confirmation, the realized PnL. The trace_id is the spine; spans are vertebrae.
Three fields per log line carry the structure:
| Field | Purpose |
|---|---|
| trace_id | Ties every event from research to outcome into one idea. UUIDv4. |
| span_id | Identifies a single step within the trace (one research call, one order). |
| parent_span_id | Parent span, so causality is preserved across async calls. |
The trace must cross process boundaries. A launchd-scheduled research job writes trace_id into its decision record; a separate execution job reads that decision and inherits the trace_id on every order-placement log line. HTTP calls to the broker propagate the trace_id in a header (x-trace-id) so the broker's own logs can be correlated later if the vendor supports it.
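A minimal sketch of that propagation, assuming an httpx client and a placeholder broker endpoint; the only load-bearing detail is the x-trace-id header carrying the ambient trace_id:

```python
import contextvars

import httpx

# Same ambient trace_id that the contextvars snippet below defines; repeated here
# so this sketch stands on its own.
_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="")


def place_order(order: dict, broker_url: str) -> dict:
    # broker_url and the order payload shape are hypothetical; the point is that
    # every outbound call carries x-trace-id so the broker's logs can be joined later.
    headers = {"x-trace-id": _trace_id.get()}
    resp = httpx.post(f"{broker_url}/orders", json=order, headers=headers, timeout=10.0)
    resp.raise_for_status()
    return resp.json()
```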
Python's contextvars module makes this clean inside one process. The context propagates automatically through asyncio tasks, so every await inside a request handler sees the same trace_id without passing it explicitly:
```python
import contextvars
import uuid
import time
import json
from contextlib import contextmanager

_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="")
_span_id: contextvars.ContextVar[str] = contextvars.ContextVar("span_id", default="")


def new_trace() -> str:
    """Start a fresh trace for a new idea: one trace_id per idea."""
    tid = uuid.uuid4().hex
    _trace_id.set(tid)
    _span_id.set("")
    return tid


@contextmanager
def span(name: str):
    """Open a span for one step and log its duration and parent on exit."""
    parent = _span_id.get()
    sid = uuid.uuid4().hex[:12]
    _span_id.set(sid)
    t0 = time.time()
    try:
        yield sid
    finally:
        elapsed_ms = int((time.time() - t0) * 1000)
        log_event(name, {"parent_span_id": parent, "elapsed_ms": elapsed_ms})
        _span_id.set(parent)  # restore the parent so sibling spans nest correctly


def log_event(event: str, fields: dict):
    """Emit one structured JSON log line stamped with the ambient trace and span."""
    rec = {
        "trace_id": _trace_id.get(),
        "span_id": _span_id.get(),
        "event": event,
        "ts": time.time(),
        **fields,
    }
    print(json.dumps(rec))
```
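A usage sketch for one research cycle, built on the helpers above (the ticker and event names are illustrative):

```python
def research_idea(ticker: str) -> str:
    # Hypothetical driver for one idea: one trace, one span per step.
    tid = new_trace()
    with span("research_call"):
        log_event("research_started", {"ticker": ticker})
        # ... call the model here; log tokens, cost, and confidence ...
    with span("decision"):
        log_event("decision_recorded", {"ticker": ticker, "action": "skip"})
    return tid  # also written to the decisions table so later jobs can inherit it
```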
Across process boundaries, the trace_id is serialized into the row that the next job reads. If the research job writes into a decisions table with trace_id as a column, the execution job SELECTs it and calls _trace_id.set(row.trace_id) before placing any orders. The same trace_id flows from prompt input to realized PnL.
One subtlety matters for launchd-scheduled agents. Jobs restart between cycles, so contextvars state does not persist. The trace_id must be re-set at the top of each job cycle from whatever row the job is currently processing. Treating the trace_id as a first-class database column (not an ambient log detail) is what makes the pattern survive process death and restarts. That discipline mirrors what heartbeats and circuit breakers impose on operational state generally.
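A sketch of that hand-off, using the helpers above and the decision table defined in Pattern 2 (the WHERE clause for selecting pending decisions is illustrative):

```python
import sqlite3


def execute_pending_decisions(db_path: str) -> None:
    # Hypothetical execution-job loop: re-adopt each idea's stored trace_id before
    # any order-placement log line is emitted, so the trace survives process death.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT trace_id, action, size FROM decision WHERE action != 'skip'"
    ).fetchall()
    for trace_id, action, size in rows:
        _trace_id.set(trace_id)  # inherit the trace across the process boundary
        with span("order_placement"):
            log_event("order_submitted", {"action": action, "size": size})
            # ... place the order via the broker client here ...
```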
Pattern 2: Structured log schema
Unstructured logs cannot be queried. The schema below is the minimum for SQL that answers real questions: cost per idea, win-rate by confidence bucket, latency distribution by model version. SQLite is sufficient for a single-operator agent running a few hundred ideas per day; Postgres or a column store is only needed when logs exceed a million rows per week.
Four tables. Every table has a trace_id column, and foreign keys between them capture the research-to-outcome lifecycle.
```sql
CREATE TABLE research_step (
    trace_id        TEXT NOT NULL,
    span_id         TEXT NOT NULL,
    model           TEXT NOT NULL,
    model_version   TEXT NOT NULL,
    prompt_hash     TEXT NOT NULL,
    tokens_in       INTEGER NOT NULL,
    tokens_out      INTEGER NOT NULL,
    cost_usd        REAL NOT NULL,
    latency_ms      INTEGER NOT NULL,
    confidence      REAL,
    output_json     TEXT NOT NULL,
    ts              REAL NOT NULL,
    PRIMARY KEY (trace_id, span_id)
);

CREATE TABLE decision (
    trace_id            TEXT PRIMARY KEY,
    research_trace_id   TEXT NOT NULL,
    action              TEXT NOT NULL,
    size                REAL NOT NULL,
    rationale_json      TEXT NOT NULL,
    invalidation_json   TEXT NOT NULL,
    ts                  REAL NOT NULL,
    FOREIGN KEY (research_trace_id) REFERENCES research_step(trace_id)
);

CREATE TABLE execution (
    trace_id            TEXT NOT NULL,
    decision_trace_id   TEXT NOT NULL,
    order_id            TEXT NOT NULL,
    fill_price          REAL,
    slippage            REAL,
    fees                REAL,
    ts                  REAL NOT NULL,
    PRIMARY KEY (trace_id, order_id),
    FOREIGN KEY (decision_trace_id) REFERENCES decision(trace_id)
);

CREATE TABLE outcome (
    trace_id                TEXT PRIMARY KEY,
    decision_trace_id       TEXT NOT NULL,
    realized_pnl            REAL NOT NULL,
    days_held               REAL NOT NULL,
    invalidation_triggered  INTEGER NOT NULL,
    ts                      REAL NOT NULL,
    FOREIGN KEY (decision_trace_id) REFERENCES decision(trace_id)
);
```
Each field earns its place. model_version captures the exact model string the provider returned (e.g. claude-sonnet-4-5-20250929), not a shorthand. Provider aliases like claude-sonnet-latest hide version jumps that invalidate calibration. prompt_hash is a SHA-256 of the rendered prompt text; any change to the system prompt or template invalidates comparison across rows, and the hash makes the invalidation queryable. tokens_in and tokens_out feed cost reconciliation against the provider invoice. cost_usd is computed at log time from the current pricing table, not hindsight-recomputed, so the cost of a decision is locked to the pricing in force when the decision was made. confidence is what the model returned in its structured output; output_json is the full raw output, stored verbatim so the replay harness has something to diff against.
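A sketch of the write path under those rules, assuming a hypothetical log_research_step helper, an illustrative static pricing table, and a response object with the Anthropic Messages API shape (resp.model, resp.usage.input_tokens, resp.usage.output_tokens, resp.content[0].text):

```python
import hashlib
import json
import sqlite3
import time

# Illustrative rates in USD per million input/output tokens; keeping them in code
# (and versioned) is what locks cost_usd to the pricing in force at log time.
PRICING = {"claude-sonnet-4-5-20250929": (3.0, 15.0)}
DEFAULT_RATES = (3.0, 15.0)


def log_research_step(conn: sqlite3.Connection, trace_id: str, span_id: str,
                      model: str, rendered_prompt: str, resp, latency_ms: int) -> None:
    """Write one research_step row from a completed model call."""
    in_rate, out_rate = PRICING.get(resp.model, DEFAULT_RATES)
    cost = resp.usage.input_tokens / 1e6 * in_rate + resp.usage.output_tokens / 1e6 * out_rate
    parsed = json.loads(resp.content[0].text)  # the model's structured output
    conn.execute(
        "INSERT INTO research_step VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            trace_id, span_id, model,
            resp.model,  # exact version string returned by the provider, not the request alias
            hashlib.sha256(rendered_prompt.encode()).hexdigest(),
            resp.usage.input_tokens, resp.usage.output_tokens,
            cost, latency_ms,
            parsed.get("confidence"),
            # Archive the rendered prompt with the raw output so the replay harness
            # can reconstruct the call from this row alone.
            json.dumps({"rendered_prompt": rendered_prompt, **parsed}),
            time.time(),
        ),
    )
    conn.commit()
```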
rationale_json in decision is the structured reasoning (the factors, the weights, the invalidation conditions) as the agent emitted them, not a summary. invalidation_json is the specific conditions under which the decision would be reversed before natural exit; these conditions are evaluated live against market data by a separate monitoring job, and outcome.invalidation_triggered captures whether any fired.
The outcome row is written only after a position closes, so any trace_id without an outcome row is an open position, a filtered idea that never executed, or a failed execution. That asymmetry is intentional: counting rows in each table by trace_id gives the funnel from idea to realized return.
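One way to read that funnel off the tables directly, as a sketch against the schema above (here 'skip' is the filtered action, matching the examples earlier):

```sql
SELECT
    (SELECT COUNT(DISTINCT trace_id) FROM research_step)      AS ideas_researched,
    (SELECT COUNT(*) FROM decision)                            AS decisions_recorded,
    (SELECT COUNT(*) FROM decision WHERE action != 'skip')     AS decisions_to_trade,
    (SELECT COUNT(DISTINCT decision_trace_id) FROM execution)  AS orders_placed,
    (SELECT COUNT(*) FROM outcome)                              AS positions_closed;
```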
Pattern 3: Replay harness
Given a trace_id, reconstruct the research step against the archived inputs and compare the new output to the stored one. The test is not "is the answer identical" (temperature noise makes that hopeless). The test is: does the decision survive re-execution under the current model?
The harness needs three archived inputs: the rendered prompt (stored in research_step.output_json or a separate research_input table; prompt_hash then verifies that a prompt re-rendered from the Git-tracked template matches the original), the market-state snapshot at the time of the call (tick data, order book, relevant fundamentals), and any tool-call results that the original run consumed.
```python
import hashlib
import json
import sqlite3
from dataclasses import dataclass

from anthropic import Anthropic


@dataclass
class ReplayResult:
    trace_id: str
    original_action: str
    replayed_action: str
    original_confidence: float
    replayed_confidence: float
    original_model_version: str
    replayed_model_version: str
    prompt_hash_match: bool
    action_match: bool
    cost_usd: float


def replay(trace_id: str, db_path: str) -> ReplayResult:
    """Re-run one archived research step against the current model and diff the result."""
    conn = sqlite3.connect(db_path)
    rs = conn.execute(
        "SELECT model, model_version, prompt_hash, output_json "
        "FROM research_step WHERE trace_id = ?",
        (trace_id,),
    ).fetchone()
    d = conn.execute(
        "SELECT action FROM decision WHERE research_trace_id = ?",
        (trace_id,),
    ).fetchone()

    original = json.loads(rs[3])
    rendered_prompt = original["rendered_prompt"]
    # Re-hash the archived prompt; a mismatch means the template changed since the run.
    current_hash = hashlib.sha256(rendered_prompt.encode()).hexdigest()

    client = Anthropic()
    resp = client.messages.create(
        model=rs[0],
        max_tokens=2048,
        temperature=0,
        messages=[{"role": "user", "content": rendered_prompt}],
    )
    replayed = json.loads(resp.content[0].text)

    in_toks = resp.usage.input_tokens
    out_toks = resp.usage.output_tokens
    # Sonnet 4.6 2026-04 published rates
    cost = in_toks / 1e6 * 3.0 + out_toks / 1e6 * 15.0

    return ReplayResult(
        trace_id=trace_id,
        original_action=d[0],
        replayed_action=replayed["action"],
        original_confidence=original.get("confidence", 0.0),
        replayed_confidence=replayed.get("confidence", 0.0),
        original_model_version=rs[1],
        replayed_model_version=resp.model,
        prompt_hash_match=(current_hash == rs[2]),
        action_match=(d[0] == replayed["action"]),
        cost_usd=cost,
    )
```
Three things the harness surfaces. First, prompt drift: prompt_hash_match == False means the prompt template changed since the original run, so the replay is testing a different question. Second, model drift: original_model_version != replayed_model_version means the provider shipped a new version under the same alias, and any calibration derived from old runs may be stale. Third, intrinsic non-determinism: identical prompt, identical model version, different action, which flags the need to investigate temperature, seed handling, or tool-call ordering.
Determinism is bounded. Temperature above zero guarantees variance, and even where a provider exposes a seed parameter, it is documented as best-effort rather than a hard guarantee. For reliable replay, research calls should run at temperature=0 and, where the provider supports it, a fixed seed, with the understanding that even then, infrastructure-level non-determinism (hardware, batching) introduces residual variance. The harness is most useful as a regression check: run it against the last 50 trace_ids after a prompt edit and flag any that flip decisions, as in the sketch below.
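A sketch of that regression check, driving the replay function above over the most recent trace_ids (the 50-trace window and the report format are arbitrary choices):

```python
def replay_regression(db_path: str, limit: int = 50) -> list[ReplayResult]:
    """Replay the most recent research steps and return the ones whose action flipped."""
    conn = sqlite3.connect(db_path)
    trace_ids = [row[0] for row in conn.execute(
        "SELECT trace_id FROM research_step GROUP BY trace_id ORDER BY MAX(ts) DESC LIMIT ?",
        (limit,),
    )]
    conn.close()

    flips = []
    for tid in trace_ids:
        result = replay(tid, db_path)
        if not result.action_match:
            flips.append(result)
            print(f"FLIP {tid}: {result.original_action} -> {result.replayed_action} "
                  f"(prompt_hash_match={result.prompt_hash_match}, "
                  f"model {result.original_model_version} -> {result.replayed_model_version})")
    return flips
```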
Replay costs real tokens. Run it selectively: on closed positions with unexpected outcomes, on a random 5% sample for drift monitoring, or on every trace when a prompt version ships. The Token Cost Optimizer and Agent Cost Envelope Calculator help size a replay budget that does not eat the research budget.
Cost-per-idea and cost-per-validated-trade SQL
Two queries answer most operational questions. The first aggregates research cost per idea:
```sql
SELECT
    r.trace_id,
    SUM(r.cost_usd) AS idea_cost_usd,
    COUNT(*) AS research_steps,
    MAX(r.ts) - MIN(r.ts) AS idea_wall_seconds,
    (SELECT action FROM decision d
     WHERE d.research_trace_id = r.trace_id) AS action
FROM research_step r
GROUP BY r.trace_id
ORDER BY idea_cost_usd DESC;
```
The second computes cost per validated trade, meaning ideas that actually executed and closed. This is the ratio that matters for unit economics: if cost-per-idea is $0.18 and 1 in 12 ideas becomes a closed trade, the research cost baked into each closed trade is $2.16 before fees and slippage.
```sql
WITH idea_costs AS (
    SELECT trace_id, SUM(cost_usd) AS c FROM research_step GROUP BY trace_id
),
closed AS (
    SELECT DISTINCT d.research_trace_id AS trace_id
    FROM decision d
    JOIN outcome o ON o.decision_trace_id = d.trace_id
),
totals AS (
    SELECT
        SUM(c) AS total_research_cost,
        COUNT(*) AS total_ideas,
        (SELECT COUNT(*) FROM closed) AS total_closed_trades
    FROM idea_costs
)
SELECT
    total_research_cost,
    total_ideas,
    total_closed_trades,
    total_research_cost / total_ideas AS cost_per_idea,
    total_research_cost / NULLIF(total_closed_trades, 0)
        AS cost_per_validated_trade
FROM totals;
```
Two derivatives fall out for free once these queries exist. Win-rate by prompt_hash shows whether the latest template is better or worse than the previous one. Latency percentiles by model_version flag when a provider silently degrades inference speed. Both come from adding GROUP BY to the joins above; no new logging is required.
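For example, win-rate by prompt_hash might look like the sketch below, where a 'win' is simply realized_pnl > 0 (this glosses over sizing and fees, and attributes each research step's prompt_hash to the closed trade it fed):

```sql
SELECT
    r.prompt_hash,
    COUNT(*) AS closed_trades,
    AVG(CASE WHEN o.realized_pnl > 0 THEN 1.0 ELSE 0.0 END) AS win_rate,
    AVG(o.realized_pnl) AS avg_pnl
FROM research_step r
JOIN decision d ON d.research_trace_id = r.trace_id
JOIN outcome o ON o.decision_trace_id = d.trace_id
GROUP BY r.prompt_hash
ORDER BY closed_trades DESC;
```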
A third query worth keeping on the dashboard tracks research cost as a fraction of realized PnL, bucketed by week. An agent whose research cost exceeds 15–20% of gross PnL is burning alpha on token spend; an agent whose research cost is under 3% is either extraordinarily efficient or under-researching. The ratio drifts week to week as both sides move, which is exactly why logging it continuously matters more than computing it once during a review.
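A weekly version of that ratio, as a sketch: costs are attributed to the week the position closed (one defensible convention among several), and the week bucket uses SQLite's strftime over the unix-epoch ts column:

```sql
SELECT
    strftime('%Y-%W', o.ts, 'unixepoch') AS week,
    SUM(ic.c) AS research_cost_usd,
    SUM(o.realized_pnl) AS gross_pnl,
    SUM(ic.c) / NULLIF(SUM(o.realized_pnl), 0) AS cost_to_pnl_ratio
FROM outcome o
JOIN decision d ON d.trace_id = o.decision_trace_id
JOIN (SELECT trace_id, SUM(cost_usd) AS c
      FROM research_step GROUP BY trace_id) ic
    ON ic.trace_id = d.research_trace_id
GROUP BY week
ORDER BY week;
```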
Anti-patterns
Four recurring mistakes that defeat the patterns above.
| Anti-pattern | Cost |
|---|---|
| Logging prompts in plain text, unhashed | Log bloat (a 50k-token prompt is ~200KB per row), privacy exposure if prompts contain user data, legal exposure on regulated datasets. Store the hash; archive the rendered prompt separately with access controls. |
| Using timestamps as trace IDs | Two concurrent ideas started in the same microsecond collide. UUIDv4 or ULID eliminates the collision class. |
| Skipping output_json because it is large | Replay becomes impossible. The full raw output is the only fixture for regression testing. Compress with zstd if disk pressure matters. |
| Logging only failures | Success cases cannot be reconstructed, so win-rate attribution across prompt versions breaks. Log every call at the same level of detail. |
The pattern that underlies all four: observability code that is optimized for the happy path defeats its own purpose. The logs exist to be queried during incidents, when the happy path has already failed.
A fifth anti-pattern worth calling out separately: mixing structured and freeform logs in the same stream. A pipeline that writes json.dumps(rec) on most lines but falls back to print("retrying...") on others forces every downstream query to tolerate malformed rows. Route human-readable status to stderr, route structured events to stdout, and ingest only stdout into the query tables. The production Claude agent architecture shows this split end-to-end.
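A minimal version of that split, where the emit helper below stands in for log_event from Pattern 1; only stdout is ingested into the query tables:

```python
import json
import sys


def status(msg: str) -> None:
    # Human-readable progress ("retrying...") goes to stderr and never reaches the tables.
    print(msg, file=sys.stderr)


def emit(event: str, fields: dict) -> None:
    # Structured events go to stdout, one JSON object per line, ready for ingestion.
    print(json.dumps({"event": event, **fields}), file=sys.stdout)
```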
Connects to
- Rate-Limit Design for LLM Research — avoids the retry storms that poison cost-per-idea metrics.
- Bounded-Cost Agentic Research — caps per-trace spend before the replay harness inherits the overage.
- Research Diary Schema for Auditable Agents — the human-readable counterpart to the machine schema above.
- Heartbeats, Watchdogs, Circuit Breakers for Trading — reliability sister piece; observability catches silent wrong-answers that watchdogs miss.
- 5 Failure Modes of LLM Trading Agents — every failure mode listed is easier to diagnose with trace-IDs and a replay harness.
- Production Claude Agent for Finance — end-to-end architecture that these patterns slot into.
- Trading System Blueprinter — scaffolds a new agent with the schema above wired in.
- Agent Cost Envelope Calculator — sizes replay and research budgets from idea-volume assumptions.
References
- Anthropic. (2026). Messages API Reference — Token Counting and Response Fields. docs.anthropic.com/en/api/messages. Defines usage.input_tokens and usage.output_tokens, used for the cost_usd computation.
- Anthropic. (2026). Tool Use Guide. docs.anthropic.com/en/docs/build-with-claude/tool-use. Structured output patterns that make output_json reliably parseable for the replay harness.
- SQLite Consortium. (2024). JSON Functions and Operators. sqlite.org/json1.html. Used when rationale_json and output_json need to be queried with JSON path expressions rather than treated as opaque blobs.
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio." Journal of Portfolio Management, 40(5). Not observability itself, but the multiple-testing correction that becomes tractable once prompt_hash makes every prompt variant a named trial.