TL;DR

Silent failure is the worst-case failure mode for any live trading system: the pipeline appears to be running, no error is thrown, but decisions are stale or duplicated or plain wrong. The three patterns that prevent silent failure — heartbeat, watchdog, circuit breaker — cost under 100 lines of code combined, run on launchd for free, and let a retail operator sleep at night. Below: what each pattern does, how to wire them together, and the minimal files an independent auditor would need to see.

The failure mode nobody expects

A trading script exits unexpectedly mid-market. No exception, no stack trace — a SIGKILL from the OOM killer, a network partition that left the event loop stuck, or a host that quietly rebooted. The next day the operator looks at yesterday's log and sees nothing unusual, because there is nothing unusual to see: the script simply stopped emitting.

The defense is not "better error handling" inside the script. The defense is an independent witness.

Pattern 1: Heartbeat

Every pipeline cycle writes a small file before exiting the cycle:

import json, time, pathlib

def write_heartbeat(status: str = "ok", extra: dict | None = None):
    """Record a last-alive signal; called at the end of every cycle."""
    hb = {"at": time.time(), "iso": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
          "status": status, **(extra or {})}
    path = pathlib.Path("data/heartbeat.json")
    path.parent.mkdir(parents=True, exist_ok=True)  # create data/ on first run
    path.write_text(json.dumps(hb))

The heartbeat file is the signal. If it is stale beyond the expected cycle interval, something is wrong. It does not matter whether the script crashed, hung, or is slowly corrupting state — a stale heartbeat catches all three failure classes.

Pattern 2: Watchdog

An independent process reads the heartbeat on its own schedule and decides whether to raise the alarm:

import json, time, subprocess, pathlib

MAX_AGE_SECONDS = 900  # 15 min during market hours

def check():
    hb_path = pathlib.Path("data/heartbeat.json")
    if not hb_path.exists():
        return alert("heartbeat missing")
    hb = json.loads(hb_path.read_text())
    age = time.time() - hb.get("at", 0)
    if age > MAX_AGE_SECONDS:
        return alert(f"heartbeat stale: {age:.0f}s > {MAX_AGE_SECONDS}s")
    if hb.get("status") != "ok":
        return alert(f"heartbeat status = {hb.get('status')}")

def alert(msg: str):
    # Notify the operator, then trip the circuit breaker so every
    # downstream layer halts on its next run.
    subprocess.run(["/usr/local/bin/notify-telegram.sh", msg])
    pathlib.Path("data/circuit.json").write_text(
        json.dumps({"paused": True, "reason": msg, "since": time.time()})
    )

if __name__ == "__main__":
    check()

The watchdog must be independent — a different launchd plist, a different cron entry, or (ideally) a different machine entirely. The whole point is to catch cases where the primary pipeline cannot catch itself.

Pattern 3: Circuit breaker

A state file that every pipeline layer honors:

import json, pathlib, sys

def check_circuit() -> bool:
    """True if green (proceed), False if paused."""
    path = pathlib.Path("data/circuit.json")
    if not path.exists():
        return True  # no file means no pause — the default is green
    state = json.loads(path.read_text())
    return not state.get("paused", False)

# At the top of every pipeline layer:
if not check_circuit():
    print("[layer] circuit paused, exiting", file=sys.stderr)
    sys.exit(2)

The circuit breaker state is deliberately a single file, not a database. Anyone with repo access can trip it manually in ~3 seconds. The watchdog trips it automatically. A legal review can trip it. A sponsor dispute can trip it. A bad market close can trip it.

Resume is deliberately manual. The file stays paused until a human edits it to { "paused": false }. This asymmetry — automatic pause, manual resume — is the whole point. You never want the system to auto-resume after a bad pause without a human looking at what broke.
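In code, the asymmetry looks like this — a sketch in which `trip()` is callable by any process, while resume is deliberately left as a manual file edit (the function name is illustrative):

```python
import json, time, pathlib

CIRCUIT = pathlib.Path("data/circuit.json")

def trip(reason: str):
    """Automatic pause: the watchdog, an operator, or a review may call this."""
    CIRCUIT.parent.mkdir(parents=True, exist_ok=True)
    CIRCUIT.write_text(json.dumps(
        {"paused": True, "reason": reason, "since": time.time()}))

# Resume is deliberately NOT a function the pipeline can call.
# A human reads the reason, fixes the cause, then edits the file by hand:
#   cat data/circuit.json
#   echo '{"paused": false}' > data/circuit.json

trip("manual: example pause")
```

Because there is no `resume()` in the codebase, nothing automated can flip the state back — the asymmetry is enforced by omission, not by policy.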

Wiring the three together

Production shape (launchd on macOS; systemd timers are the Linux equivalent):

com.rusty.trader-run           # every 5 min during market hours
  → runs the main pipeline
  → writes heartbeat.json at end of cycle
  → checks circuit.json at start; exits if paused

com.rusty.trader-watchdog      # every 15 min, 24/7
  → reads heartbeat.json; if stale, trips circuit + Telegram alert

com.rusty.trader-notify-resume # daily 09:00
  → reads circuit.json; if paused, Telegram-reminds operator

Three plists, under 100 lines of Python total. Total monthly cost: $0.
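The loop can be exercised end to end without launchd. A self-contained dry run, with the Telegram script swapped for a print so it runs anywhere — a sketch using the file names above, not the production watchdog:

```python
import json, time, pathlib

MAX_AGE_SECONDS = 900

def alert(msg: str):
    print(f"ALERT: {msg}")  # stands in for the Telegram script
    pathlib.Path("data/circuit.json").write_text(
        json.dumps({"paused": True, "reason": msg, "since": time.time()}))

def check():
    hb_path = pathlib.Path("data/heartbeat.json")
    if not hb_path.exists():
        return alert("heartbeat missing")
    hb = json.loads(hb_path.read_text())
    if time.time() - hb.get("at", 0) > MAX_AGE_SECONDS:
        return alert("heartbeat stale")
    if hb.get("status") != "ok":
        return alert(f"heartbeat status = {hb.get('status')}")

# Simulate a pipeline that died an hour ago, then run the watchdog once.
pathlib.Path("data").mkdir(exist_ok=True)
pathlib.Path("data/heartbeat.json").write_text(
    json.dumps({"at": time.time() - 3600, "status": "ok"}))
check()
```

After the run, `data/circuit.json` is paused with the staleness reason — the same state every pipeline layer would see at the top of its next cycle.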

The minimal files an auditor would want

If a regulator, a client, or a partner asked to audit your system, these are the files that would demonstrate it is operated responsibly:

  • data/heartbeat.json — timestamped last-alive signal
  • data/circuit.json — current operational state with reason + since timestamp
  • memory/decisions.jsonl — append-only decision log, one JSON per decision, never rewritten
  • plists/ — the launchd plist files themselves
  • scripts/watchdog.py — the independent-process watchdog

These are the evidence of responsible operation. Without them, "the system ran fine" is a claim; with them, it's a machine-readable fact.
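The decision log earns its "never rewritten" property by construction: open in append mode and write one JSON object per line. A sketch, assuming the `memory/decisions.jsonl` layout above and an illustrative record shape:

```python
import json, time, pathlib

LOG = pathlib.Path("memory/decisions.jsonl")

def log_decision(decision: dict):
    """Append one decision as a single JSON line; never rewrite the file."""
    LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {"at": time.time(), **decision}
    with LOG.open("a") as f:  # "a" = append-only: past lines stay untouched
        f.write(json.dumps(record) + "\n")

log_decision({"action": "hold", "symbol": "SPY", "reason": "circuit green"})
```

One line per decision keeps the file greppable and diff-friendly, and an auditor can verify no rewrites by checking that the timestamps are monotonically increasing.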

What this does not solve

  • Byzantine failures — a layer that writes heartbeat-ok while silently corrupting state. The heartbeat catches stop-failure; it doesn't catch wrong-answer failure. You still need per-layer invariants.
  • Network partitions that isolate the watchdog — if the watchdog lives on the same host as the pipeline and that host is partitioned, neither can alert. Mitigation: watchdog on a different host, or push heartbeat to an external service every N cycles.
  • Silent billing runaway in LLM layers — an LLM-driven loop that enters a retry storm can cost real money before the watchdog alarm window expires. Use the Token-Cost Optimizer to model a budget ceiling and add it as an invariant to the pipeline.
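For the billing-runaway case, the budget ceiling can be a pipeline invariant rather than advice. A sketch with an assumed running-cost file (`data/llm_cost.json`) and an illustrative $5/day ceiling — this is not the Token-Cost Optimizer itself:

```python
import json, time, pathlib

COST_FILE = pathlib.Path("data/llm_cost.json")
DAILY_BUDGET_USD = 5.00  # illustrative ceiling, not a recommendation

def add_cost(usd: float) -> float:
    """Accumulate today's LLM spend; trip the circuit if the ceiling is crossed."""
    today = time.strftime("%Y-%m-%d", time.gmtime())
    state = {"day": today, "usd": 0.0}
    if COST_FILE.exists():
        prev = json.loads(COST_FILE.read_text())
        if prev.get("day") == today:  # counter resets at midnight UTC
            state = prev
    state["usd"] += usd
    COST_FILE.parent.mkdir(parents=True, exist_ok=True)
    COST_FILE.write_text(json.dumps(state))
    if state["usd"] > DAILY_BUDGET_USD:
        pathlib.Path("data/circuit.json").write_text(json.dumps(
            {"paused": True, "reason": f"LLM spend ${state['usd']:.2f} over budget",
             "since": time.time()}))
    return state["usd"]
```

Called after every LLM request, this bounds a retry storm to one budget's worth of spend: the next layer to start sees the tripped circuit and exits.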

Connects to

  • Trading System Blueprinter — generates a starter scaffold with these three patterns wired in by default.
  • Finance MCP Directory — grade-A servers with idempotent order submission reduce the blast radius when the circuit trips mid-flight.
