TL;DR

When an LLM trading system produces a bad outcome, an ad-hoc retrospective is useless: human memory fails, logs get pruned, blame rotates around the room. A structured postmortem template captures the what, when, why, and fix in a reproducible form keyed to the trace ID that generated the decision. The twenty recurring failure modes below, extending the five-mode catalog in 5 Failure Modes of LLM Trading Agents, form the scanning checklist. Every completed postmortem is blameless, public within the operating team, append-only, and cross-referenced to a commit SHA for the fix. Runnable Python at the end enforces the append-only property; attempting to mutate a prior record raises an exception.

What counts as a postmortem-worthy event

Not every red day triggers a postmortem. The template exists to capture anomalies where cause is unclear or controls failed, not to relitigate normal variance. Six trigger conditions cover the ground most operators care about:

  • Realized loss exceeding the daily drawdown threshold. The common setting is one percent of deployed capital in a single session, or three percent rolling over five sessions, whichever fires first (a sketch of the numeric triggers follows this list).
  • Thesis invalidation without position exit. A trade opened on a thesis that later becomes demonstrably false, yet the system holds the position past the invalidation timestamp. The loss size is secondary; the control failure is the issue.
  • Agent execution the human operator would not have approved. Any trade where the reviewer, shown the same inputs the model saw, would have declined or sized differently. Captures distribution shift between operator intent and model behavior.
  • Token-budget overrun of five times or more. A research loop that was planned at 50,000 tokens and consumed 300,000. Cost failure mode; separate from correctness but often the earliest warning of a deeper control gap.
  • Prompt injection or schema violation detected in production. Any retrieved document that attempts tool coercion, any model output that bypasses the validator. Treated as a security event regardless of whether the trade was affected.
  • Provider outage or rate-limit cascade. Rate-limit responses that cause a fallback chain to degrade the model or skip a validation step. Output quality drops without an obvious marker in the equity curve.
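
The two numeric triggers reduce to a pair of comparisons. A minimal sketch, assuming PnL arrives as a signed percentage of deployed capital; the function and parameter names are illustrative, not part of any fixed interface:

```python
def drawdown_trigger(session_pnl_pct: float, rolling_five_pnl_pct: float) -> bool:
    """Fire on a 1% single-session loss or a 3% loss over five rolling
    sessions, whichever breaches first. Inputs are signed percentages."""
    return session_pnl_pct <= -1.0 or rolling_five_pnl_pct <= -3.0

def token_overrun_trigger(planned_tokens: int, consumed_tokens: int) -> bool:
    """Fire on a token-budget overrun of five times or more the planned spend."""
    return consumed_tokens >= 5 * planned_tokens
```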

A trigger does not mean the system was wrong. It means the event warrants a written record. Many postmortems conclude that the system behaved correctly and the control was appropriate. Those are still written.

The template

A postmortem is a structured record with seven fields, in this order, every time. Template rigidity is the point. Free-form retrospectives drift into narrative and blame; a form with seven slots produces a record that can be grep-searched six months later.

| Field | Content | Length |
| --- | --- | --- |
| Summary | One or two sentences, factual, no attribution | ≤ 40 words |
| Timeline | UTC timestamps for idea → decision → execution → outcome | 5–20 rows |
| Root cause | One of: cost, correctness, compliance, security, operational | 1 label |
| Contributing factors | Bullet list of system conditions | 3–5 items |
| Fix implemented | Specific change, linked to commit SHA | 1 paragraph |
| Prevention | How a similar failure would be detected earlier | 1 paragraph |
| References | trace_id, commit SHA, fix PR link | 3–6 links |

The summary states the event in neutral language. "Position opened in SYNTHETIC_A at 14:02 UTC was not closed on thesis invalidation at 15:47 UTC; realized loss 1.4 percent of deployed capital." Not "the bot blew up" and not "the operator failed to monitor."

The timeline runs in UTC, one row per significant event. Idea generation timestamp, research completion, decision timestamp, order placement, fill, invalidation signal (if any), exit. Every row carries the trace_id so a reader can pull the underlying log. Observability for LLM Trading Agents covers the trace schema the timeline rows reference.
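
In code, a timeline is just a list of rows, each carrying the trace_id. A sketch with illustrative timestamps and IDs, reusing the SYNTHETIC_A example from the summary field above:

```python
# All values are hypothetical; event labels follow the row types named above.
timeline = [
    {"utc": "2025-03-04T13:58:12Z", "event": "idea_generated",      "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T14:00:41Z", "event": "research_complete",   "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T14:01:55Z", "event": "decision",            "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T14:02:03Z", "event": "order_placed",        "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T14:02:04Z", "event": "fill",                "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T15:47:20Z", "event": "invalidation_signal", "trace_id": "tr-9f3a"},
    {"utc": "2025-03-04T16:31:09Z", "event": "exit",                "trace_id": "tr-9f3a"},
]
```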

Root cause classification is a single label drawn from a closed vocabulary of five: cost, correctness, compliance, security, operational. Every postmortem picks one. Cross-cutting failures pick the dominant category and list the others as contributing. The vocabulary is closed because a fixed taxonomy produces counts that mean something across a year of records.

Contributing factors is a list of three to five system conditions that made the failure possible. "The validator did not check schema version" is a contributing factor. "The operator was not watching" is not. See the blameless section below.

Fix implemented names the code, config, or prompt change and links the commit SHA. If no code change is made, the field reads "no change; documented and deferred" with a ticket link. Empty fixes are acceptable when the cost of the control exceeds the expected value of prevention.

Prevention is the forward-looking mirror of the fix. Not "the team will be more careful" but "the ingestion pipeline now emits a schema-version metric; alert fires on change." Detectability, not intent.

References close the loop: trace_id for log replay, commit SHA for the fix, PR link for the review record. If a follow-up task exists, its ticket ID goes here too.

The twenty-failure-mode scanning checklist

Every postmortem begins with a scan. The operator reads the twenty questions below and answers each with yes, no, or not-applicable. A yes on any item becomes a contributing factor, and potentially the root cause. The list extends the five-mode catalog in 5 Failure Modes of LLM Trading Agents with fifteen additional patterns observed across audit work.

  1. Price-blind leak. Did the research prompt contain the current price, a percentage move, or position-state language? Tooling: Price-Blind Auditor.
  2. Numeric fabrication. Did the model output any number that cannot be traced to a verified source document?
  3. Prompt drift. Did the system prompt, tool schema, or retrieval template change between the research run and the execution run?
  4. Token runaway. Did the loop hit its token budget, or exceed 2x the planned spend?
  5. Audit amnesia. Is the trace_id log complete, with prompts, tool calls, model outputs, and the executed trade all retrievable?
  6. Cache poisoning. Did the research read from a cache that contained content written by a prior compromised or degraded run?
  7. Tool-result injection. Did any retrieved news item, filing, or transcript contain text that attempted to steer the model (instruction-like language, suspicious URLs, schema-coercing tokens)? Tooling: Prompt Injection Tester.
  8. Rate-limit degradation. Did the provider return 429 or throttling responses that caused the agent to auto-downshift to a weaker model or skip a validation step?
  9. Fallback schema mismatch. If a fallback provider was invoked, did its response schema match the primary provider's, or did downstream validators silently accept degraded structure?
  10. Schema-version drift. Did the prompt or output schema change mid-batch without a version bump in the trace record?
  11. Unit or GAAP confusion. Did the agent misread thousands as units, millions as thousands, or mix accounting standards across reporting periods?
  12. Restatement blindness. Was prior-year data restated in a later filing, and did the agent use the original rather than the restated figure?
  13. Timestamp error. Was the news item, quote, or filing used in research stale by more than the strategy's decay window?
  14. Dedup failure. Did the research ingest many copies of the same wire story, inflating conviction by repetition rather than independent confirmation?
  15. Calibration drift. Did probability estimates shift systematically after a model upgrade, without the calibrator being retrained?
  16. Convergence gate miscalibration. Did the research loop halt on a weak convergence signal, or conversely, run long past a confident answer?
  17. After-hours boundary miss. Did the thesis fail to account for an earnings release or macro print that landed outside regular hours?
  18. Thinking-token tax. Did the agent expend extended-thinking tokens on a task that did not require them, inflating cost without improving output?
  19. Research-diary gap. Was the record of rejected ideas written, or does only the accepted idea survive?
  20. Cost attribution drift. Was the cache-write amortization computed correctly, or did a single write get charged against a single read instead of many?
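
Item 20 reduces to simple arithmetic: a cache write is paid once and should be spread over every read that reuses it. A sketch with illustrative cost figures (the multipliers vary by provider):

```python
def amortized_read_cost(write_cost: float, read_cost: float, n_reads: int) -> float:
    """Per-read cost with the one-time cache write spread across all reads."""
    return read_cost + write_cost / max(n_reads, 1)

# Illustrative numbers: a $0.50 cache write reused by 100 reads at $0.04 each.
# Correct attribution:  0.04 + 0.50 / 100 = $0.045 per read.
# Drifted attribution:  charging the write to a single read makes the first
# read cost $0.54 and the other 99 look artificially cheap.
```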

The scan is mechanical. It is not meant to feel clever. Most postmortems score one or two yes answers; the rest of the template explains the connection between those yes answers and the realized loss.
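
A sketch of the mechanical scan, assuming the twenty items are keyed by short labels; the labels and structure are illustrative, not a fixed schema:

```python
from enum import Enum

class Answer(str, Enum):
    YES = "yes"
    NO = "no"
    NA = "not-applicable"

# The twenty failure modes above, in checklist order.
CHECKLIST = [
    "price_blind_leak", "numeric_fabrication", "prompt_drift", "token_runaway",
    "audit_amnesia", "cache_poisoning", "tool_result_injection",
    "rate_limit_degradation", "fallback_schema_mismatch", "schema_version_drift",
    "unit_gaap_confusion", "restatement_blindness", "timestamp_error",
    "dedup_failure", "calibration_drift", "convergence_gate_miscalibration",
    "after_hours_boundary_miss", "thinking_token_tax", "research_diary_gap",
    "cost_attribution_drift",
]

def scan(answers: dict[str, Answer]) -> list[str]:
    """Require an answer for every item; return the yes items, which
    become contributing-factor candidates for the postmortem."""
    missing = [item for item in CHECKLIST if item not in answers]
    if missing:
        raise ValueError(f"scan incomplete: {missing}")
    return [item for item in CHECKLIST if answers[item] == Answer.YES]
```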

Blameless framing

The template enforces blamelessness structurally, not aspirationally. Every root cause must name a system change that would prevent recurrence. "Operator missed the invalidation signal" is not a valid root cause; "no automated invalidation monitor on open positions" is. The first sentence assigns blame to a person; the second names a missing control.

Google's SRE practice¹ and John Allspaw's Etsy-era writing on blameless postmortems (Allspaw 2012; see References) arrive at the same conclusion from different angles. Blame suppresses reporting. When operators fear being named, they either do not write the postmortem or write it to minimize their exposure. The institutional memory that would prevent the next incident disappears. Blamelessness is a procedural choice, not a sentiment; it is enforced by the template requiring a system-level cause.

A concrete translation rule:

| Blame-flavored phrasing | Blameless rewrite |
| --- | --- |
| The operator did not check the dashboard | No alerting on the dashboard metric |
| The reviewer approved a bad prompt | No automated regression test on the prompt change |
| The agent went off the rails | No max_steps ceiling; no convergence gate |
| Someone pushed the wrong config | No config-schema validation in the deploy pipeline |
| The on-call was asleep | No paging integration for this class of alert |

Every right-hand column entry is a ticket-able change. Every left-hand column entry is a personnel complaint. A postmortem that produces items from the right column is useful; one that produces items from the left is a morale cost without a control improvement.
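
The translation rule can be enforced mechanically with a crude lint over the root-cause and contributing-factor text. A sketch; the marker list is illustrative and would grow with use:

```python
import re

# Person-subject phrases that signal blame-flavored phrasing (illustrative list).
BLAME_MARKERS = re.compile(
    r"\b(operator|reviewer|on-call|someone|the team)\b",
    re.IGNORECASE,
)

def lint_blameless(text: str) -> list[str]:
    """Return blame-flavored phrases found in a postmortem field, if any."""
    return BLAME_MARKERS.findall(text)

# lint_blameless("The operator did not check the dashboard") -> ['operator']
# lint_blameless("No alerting on the dashboard metric")      -> []
```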

How to run the meeting

A postmortem meeting is thirty minutes, scheduled within seventy-two hours of the event. Longer delays degrade recall; shorter delays catch operators before they have had time to read the traces. The attendees are the system operator, the code owner of the affected module, and optionally a reviewer who was not involved. Three people, not five.

The agenda is the template itself. The facilitator reads each field aloud, the team fills it in, and the record is committed to a postmortems/ directory in the repository. No slides, no presentation. If a field cannot be filled in thirty minutes, the postmortem is marked incomplete pending trace review, and the meeting reconvenes within a week.

Prior postmortems are never silently rewritten. If new evidence emerges (a trace replay reveals a contributing factor missed at the meeting, or the fix turns out not to prevent the class of issue), the record is amended with a dated addendum. Amendments append; they never overwrite. The append-only rule exists for the same reason it exists in the trace log: a mutable postmortem history is no history at all.

The append-only log

The discipline is structural. Below is a short Python module that enforces it: postmortems are written as dataclasses, serialized to JSON lines, and any attempt to rewrite an existing record by its ID raises an exception. Amendments go into a separate file keyed to the parent ID.

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
from pathlib import Path
import json
import uuid

@dataclass
class Postmortem:
    """One record per template: the seven fields plus identity and timestamp."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    summary: str = ""
    timeline: list = field(default_factory=list)
    root_cause: str = ""  # cost | correctness | compliance | security | operational
    contributing_factors: list = field(default_factory=list)
    fix_commit_sha: str = ""
    prevention: str = ""
    trace_id: str = ""
    pr_link: str = ""

def append(pm: Postmortem, path: Path = Path("postmortems/log.jsonl")) -> None:
    """Append a new record; refuse to rewrite an existing ID."""
    path.parent.mkdir(parents=True, exist_ok=True)
    existing_ids = set()
    if path.exists():
        with path.open("r", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    existing_ids.add(json.loads(line)["id"])
    if pm.id in existing_ids:
        raise ValueError(f"postmortem {pm.id} already exists; use amend() instead")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(pm), ensure_ascii=False) + "\n")

def amend(parent_id: str, note: str, path: Path = Path("postmortems/amendments.jsonl")) -> None:
    """Record a dated addendum in a separate file keyed to the parent record."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "parent_id": parent_id,
        "amended_utc": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Usage is mechanical: instantiate a Postmortem, fill the fields during the meeting, call append. If a teammate later tries to "fix" the record in place, the ID collision raises and they are forced to use amend instead, which is what the audit trail demanded in the first place.
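
A usage sketch against the module above, with illustrative values (the root-cause label here is a judgment call for the SYNTHETIC_A example, and the SHA and trace_id are placeholders):

```python
pm = Postmortem(
    summary=("Position opened in SYNTHETIC_A at 14:02 UTC was not closed on "
             "thesis invalidation at 15:47 UTC; realized loss 1.4 percent."),
    root_cause="operational",
    contributing_factors=["no automated invalidation monitor on open positions"],
    fix_commit_sha="0000000",  # placeholder; real records link the actual SHA
    trace_id="tr-9f3a",        # illustrative
)
append(pm)

# A second append with the same record collides on the ID and raises:
# append(pm)  # ValueError: postmortem <id> already exists; use amend() instead
amend(pm.id, "trace replay surfaced a stale quote as an additional factor")
```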

References

  • Allspaw, J. (2012). "Blameless PostMortems and a Just Culture." Code as Craft (Etsy engineering blog), May 22, 2012.
  • Cichonski, P., Millar, T., Grance, T., & Scarfone, K. (2012). Computer Security Incident Handling Guide. NIST Special Publication 800-61 Revision 2, National Institute of Standards and Technology.
  • Dekker, S. (2014). The Field Guide to Understanding Human Error, 3rd edition, Ashgate. Foundational text on blameless investigation in high-consequence domains.
  • Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind Human Error, 2nd edition, Ashgate.

Footnotes

  1. Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (editors) (2016). Site Reliability Engineering: How Google Runs Production Systems, O'Reilly Media. Chapter 15, "Postmortem Culture: Learning from Failure."