TL;DR
Most retail strategies skip straight from a backtest into real money, and the live P&L curve answers whatever question the backtest ignored. The alternative is a three-stage deployment pipeline — backtest, paper, live — with explicit promotion gates between each stage and explicit demotion criteria inside the live stage. Below: the exact numeric thresholds for each gate, the seven metrics that should trigger an auto-pause in production, a file-based kill switch that a cron can enforce in under 30 lines of Python, and the rollback rules that keep a drawdown from turning into a crater.
The three stages, in one sentence each
- Backtest. Historical simulation on clean data with realistic costs. Measures whether the strategy could have worked.
- Paper. Live market data, live signals, simulated fills (or broker-paper fills). Measures whether the signal survives contact with current microstructure.
- Live. Real capital, real fills. Measures whether execution, latency, and slippage still leave an edge.
Each stage answers a different question. Collapsing them collapses the evidence.
Stage 1: backtest — what must be true to promote
A backtest is a hypothesis, not a strategy. To promote it to paper, it must clear four gates.
Gate 1.1 — Probability of backtest overfitting (PBO) below 0.5.
Bailey, Borwein, López de Prado, and Zhu (2014) define PBO as the probability that a strategy ranked top in-sample is no better than the median out-of-sample. Above 0.5 means your selection process is worse than random. The Backtest Overfitting Score tool computes this via combinatorially symmetric cross-validation (CSCV). Concrete threshold: PBO ≥ 0.5 is an automatic fail. PBO in [0.3, 0.5) is promotable only with a walk-forward confirmation (gate 1.2). PBO < 0.3 is clean.
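Those thresholds collapse into a three-way decision. A minimal sketch — the function name and return labels are mine, not part of any published tool:

```python
def pbo_gate(pbo: float, walkforward_confirmed: bool = False) -> str:
    """Gate 1.1 decision logic as stated above (illustrative helper)."""
    if pbo >= 0.5:
        return "fail"      # selection process worse than random
    if pbo >= 0.3:
        # gray zone: promotable only with a walk-forward confirmation (gate 1.2)
        return "promote" if walkforward_confirmed else "hold"
    return "promote"       # PBO < 0.3: clean
```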
Gate 1.2 — Walk-forward OOS Sharpe ≥ 0.8 × IS Sharpe.
Split the history into anchored or rolling walk-forward folds. Train on the in-sample (IS) portion of each fold, evaluate on the out-of-sample (OOS) portion. Aggregate the OOS trades into a single equity curve and compute its Sharpe. If that OOS Sharpe falls below 80% of the IS Sharpe, the strategy is learning the training set, not the market.
The 80% threshold is deliberately conservative. A 1.8 IS Sharpe degrading to 1.4 OOS is acceptable. A 1.8 IS degrading to 1.0 OOS is a red flag even though 1.0 is still "profitable" — you will lose another chunk to microstructure costs you didn't model.
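Assuming daily returns and a zero risk-free rate, gate 1.2 is a few lines. The function names are illustrative; the inputs are the IS returns and the single aggregated OOS equity curve's returns described above:

```python
from math import sqrt
from statistics import mean, stdev

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe, assuming daily returns and zero risk-free rate."""
    return mean(returns) / stdev(returns) * sqrt(periods_per_year)

def walkforward_gate(is_returns, oos_returns, floor=0.8):
    """Gate 1.2: aggregated OOS Sharpe must be >= floor * IS Sharpe."""
    sr_is, sr_oos = sharpe(is_returns), sharpe(oos_returns)
    return sr_oos >= floor * sr_is, sr_is, sr_oos
```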
Gate 1.3 — Deflated Sharpe Ratio > 0.5.
Bailey and López de Prado (2014) introduced the Deflated Sharpe Ratio (DSR) to correct for multiple-testing and non-normality. DSR > 0.5 roughly corresponds to "better than what you'd expect from trying N strategies at random, given the observed skew and kurtosis of returns." The intuition: if you tried 100 parameter combinations, your best one is expected to look decent even on random data. DSR accounts for that implicit search.
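The DSR is the Probabilistic Sharpe Ratio evaluated at the expected maximum Sharpe of N random trials, per Bailey and López de Prado (2014). A stdlib-only sketch — the parameter names are mine; skew and kurt are the sample skewness and raw kurtosis of the returns, and var_trials is the cross-trial variance of the Sharpe estimates:

```python
from math import sqrt, e
from statistics import NormalDist

def deflated_sharpe(sr_hat, n_obs, skew, kurt, n_trials, var_trials):
    """Probability that the true Sharpe exceeds the expected max Sharpe
    of n_trials random trials, given non-normal returns."""
    N = NormalDist()
    gamma = 0.5772156649015329  # Euler–Mascheroni constant
    # Expected maximum Sharpe under n_trials independent random trials
    sr0 = sqrt(var_trials) * ((1 - gamma) * N.inv_cdf(1 - 1 / n_trials)
                              + gamma * N.inv_cdf(1 - 1 / (n_trials * e)))
    # Probabilistic Sharpe Ratio evaluated at sr0
    denom = sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat ** 2)
    return N.cdf((sr_hat - sr0) * sqrt(n_obs - 1) / denom)
```

Note how the deflation bites: holding the observed Sharpe fixed, a larger number of trials raises the hurdle sr0 and drives the DSR down.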
Gate 1.4 — Code review confirms no future-peeking.
This is the gate everyone skips and the one that catches the worst failures. A concrete checklist:
# A checklist run on the backtest engine, not the strategy:
def future_peek_audit(engine):
    # 1. All features at time t computed from data with timestamp < t.
    # 2. Resampling/rolling windows use closed='left' (exclude current bar).
    # 3. Survivorship: universe at time t is the as-of universe at t, not today's.
    # 4. Corporate actions applied with their actual announcement date, not execution date.
    # 5. Fill prices are next-bar open or current-bar close with a documented slippage model —
    #    never a signal computed on the current-bar close filled at that same close.
    # 6. Train/test split respects time — no random shuffle on time-series data.
    # 7. Feature normalization uses only past statistics.
    ...
If any of these fail, the backtest is leaking. No amount of PBO or DSR will save it.
Stage 2: paper — what must be true to promote to live
Paper trading is the single most underrated stage in the pipeline. It's where you discover that your data vendor's timestamps are 400ms late, that your broker's order IDs collide under load, and that your clever opening-auction strategy gets filled at the worst tick in the print.
Gate 2.1 — Minimum 30 calendar days of live paper trading.
Thirty days is not arbitrary. It's the minimum window to observe: an FOMC meeting, an options expiry, an end-of-month rebalance, at least one macro surprise, and ~3 weekly cycles of liquidity patterns. If the paper phase is shorter, you have no sample of the corner cases that break strategies in live.
Gate 2.2 — Live-vs-paper fill-price divergence < 10 bps.
Log both your paper fill price (the broker's simulated fill or your own model) and the actual market print within 1 second on each side. Compute the mean absolute divergence across every paper order. If the mean exceeds 10 bps, the paper simulation is not representative and the live P&L will surprise you.
import statistics

def fill_divergence_bps(paper_fills, market_prints):
    """Mean and ~95th-percentile absolute divergence, in basis points."""
    diffs = []
    for px_paper, px_market in zip(paper_fills, market_prints):
        diffs.append(1e4 * abs(px_paper - px_market) / px_market)
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile
    return statistics.mean(diffs), statistics.quantiles(diffs, n=20)[-1]
Gate 2.3 — Broker rate-limit headroom measured under peak load.
Run a deliberate peak-load test during paper. Fire every order, cancel, and modify the strategy would send in its worst-case minute, and log the 429 rate from the broker. If you're already at 50% of the rate limit during paper, live (with reconnection storms, retries, and heartbeats) will breach it. Interactive Brokers and Alpaca both document their per-endpoint limits — respect them with 2× headroom.
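One way to score the peak-load test — a sketch assuming you logged every API call as a (epoch_seconds, http_status) pair; the function and field names are mine:

```python
from collections import Counter

def peak_load_report(events, limit_per_min):
    """events: non-empty list of (epoch_seconds, http_status) for every
    API call sent during the peak-load test. Gate 2.3 passes only if the
    worst minute stays at <= 50% of the broker's documented limit and no
    429s were observed."""
    per_minute = Counter(int(ts // 60) for ts, _ in events)
    worst = max(per_minute.values())
    rate_429 = sum(1 for _, s in events if s == 429) / len(events)
    return {"worst_minute": worst,
            "utilization": worst / limit_per_min,
            "rate_429": rate_429,
            "pass": worst / limit_per_min <= 0.5 and rate_429 == 0.0}
```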
Gate 2.4 — Heartbeat, watchdog, circuit-breaker all green for the full 30 days.
The reliability primitives (see Heartbeats, Watchdogs, and Circuit Breakers for Trading Systems) must be installed and green for the entire paper window. A silent 2-hour outage during paper means you don't actually know whether the strategy handles reconnection correctly. If any heartbeat gap exceeds your own SLA during paper, you restart the 30-day clock after fixing the cause.
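The "green for the full window" condition is checkable from the heartbeat log alone. A sketch, assuming heartbeat_ts is a sorted list of epoch seconds:

```python
def paper_window_valid(heartbeat_ts, sla_seconds, window_days=30):
    """Gate 2.4: every gap between consecutive heartbeats must stay under
    the SLA, and the log must cover the full paper window. A breach means
    the 30-day clock restarts after the cause is fixed."""
    gaps = [b - a for a, b in zip(heartbeat_ts, heartbeat_ts[1:])]
    no_breach = all(g <= sla_seconds for g in gaps)
    covered = (heartbeat_ts[-1] - heartbeat_ts[0]) >= window_days * 86400
    return no_breach and covered
```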
Stage 3: live — the seven auto-pause metrics
Once in live, seven metrics should be computed every trading day and compared to thresholds. Any breach auto-pauses the strategy.
- Daily P&L below 2nd percentile of backtest distribution. A one-day loss that would have been a tail event historically.
- Rolling 5-day P&L below 5th percentile. Short-horizon regime change.
- Live Sharpe (20-day rolling) below 50% of paper Sharpe. Signal decay or structural break.
- Drawdown exceeds 2× the 95th-percentile backtest drawdown. Drawdown envelope breach.
- Fill slippage (rolling 50 orders) exceeds 1.5× paper-phase slippage. Execution regime changed.
- Broker-reported error rate above 2% over 1 hour. Connectivity or auth degradation.
- Position correlation exceeds planned maximum. The ensemble collapsed into a single bet.
Each of these is independently falsifiable. The rule: any one of the seven triggers an auto-pause. Resuming is manual.
from dataclasses import dataclass

@dataclass
class LiveHealth:
    daily_pnl_pct: float
    rolling_5d_pnl_pct: float
    rolling_sharpe_20d: float
    paper_sharpe: float
    drawdown_pct: float
    backtest_dd_95: float
    slippage_bps_50: float
    paper_slippage_bps: float
    broker_error_rate_1h: float
    position_correlation: float

def should_auto_pause(h: LiveHealth, backtest_dist_p2: float, backtest_dist_p5: float,
                      correlation_cap: float) -> list[str]:
    breaches = []
    if h.daily_pnl_pct < backtest_dist_p2:
        breaches.append("daily_pnl_tail")
    if h.rolling_5d_pnl_pct < backtest_dist_p5:
        breaches.append("rolling_5d_tail")
    if h.rolling_sharpe_20d < 0.5 * h.paper_sharpe:
        breaches.append("sharpe_decay")
    if h.drawdown_pct > 2.0 * h.backtest_dd_95:
        breaches.append("drawdown_envelope")
    if h.slippage_bps_50 > 1.5 * h.paper_slippage_bps:
        breaches.append("slippage_regime")
    if h.broker_error_rate_1h > 0.02:
        breaches.append("broker_errors")
    if h.position_correlation > correlation_cap:
        breaches.append("correlation_collapse")
    return breaches
Rollback: the 20-day rule
Auto-pause halts new entries. Rollback is a step further: it flattens positions and demotes the strategy back to paper until the root cause is identified and fixed.
The explicit rollback rule: if rolling 20-trading-day live Sharpe falls below 50% of the observed paper Sharpe, or if drawdown exceeds 2× the 95th-percentile backtest drawdown, the strategy is demoted. Twenty days is long enough to distinguish luck from signal decay; 50% is aggressive enough to stop the bleed before it becomes existential; 2× drawdown means the live environment is materially different from the backtest distribution.
Demotion to paper is not a punishment — it's an information-preserving pause. The strategy keeps running on paper against live data, so you can observe whether the regression persists or reverses, and decide whether the fix is a code change, a parameter re-fit, or a retirement.
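The rollback rule reduces to a two-condition predicate; the constants mirror the text (50% of paper Sharpe, 2× the backtest 95th-percentile drawdown), and the function name is mine:

```python
def should_demote(live_sharpe_20d: float, paper_sharpe: float,
                  drawdown_pct: float, backtest_dd_95: float) -> bool:
    """The 20-day rollback rule: demote to paper if either condition fires.
    Drawdowns are expressed as positive fractions."""
    sharpe_breach = live_sharpe_20d < 0.5 * paper_sharpe
    dd_breach = drawdown_pct > 2.0 * backtest_dd_95
    return sharpe_breach or dd_breach
```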
The kill switch
The simplest kill switch that actually works:
# kill_switch.py — import this at the top of every order-placing function.
from pathlib import Path

KILL_FILE = Path.home() / ".trading" / "kill.flag"

def check_kill_switch():
    if KILL_FILE.exists():
        raise SystemExit(f"Kill switch engaged at {KILL_FILE}. Exiting.")

def engage(reason: str):
    KILL_FILE.parent.mkdir(parents=True, exist_ok=True)
    KILL_FILE.write_text(reason)
Pair it with a cron (or launchd job on macOS, systemd timer on Linux) that checks the file every 60 seconds and SIGTERMs the trading process if the flag appears:
# /etc/cron.d/trading-kill (every minute)
* * * * * trader [ -f ~/.trading/kill.flag ] && pkill -TERM -f trading_loop.py
Why file-based rather than a message queue or a signal? Three reasons:
- Unambiguous state. The flag either exists or doesn't. No partial delivery, no replay semantics.
- Unopinionated creation. Any process — a monitoring script, a Telegram bot handler, a manual touch from SSH — can create it. No client library, no shared secret.
- Durability across restarts. A restarted trader process will see the flag on the next check_kill_switch() call and exit immediately.
The thing a file-based switch is not: it's not a substitute for flatten-on-disconnect logic at the broker level. Many brokers let you set a server-side "close all on disconnect" policy. Set that too; treat the file-based switch as the local kill, the broker policy as the remote kill, and the auto-pause logic as the conditional kill. Defense in depth.
The demotion-is-cheap principle
The most common reason operators skip these gates is the emotional cost of demoting a strategy. You spent six weeks building it, you're excited about it, and pausing feels like failure. It's not. Demotion is the information-preserving version of "I don't yet know if this is working."
A useful frame: every stage of the pipeline exists to make a promotion decision cheap to reverse. The backtest → paper gate is cheap because you lose two minutes of wall clock. The paper → live gate is cheap because the only cost is opportunity cost during the 30-day window. The live auto-pause is cheap because you pause new entries — existing positions continue to resolve. Even the 20-day rollback to paper is cheap because the strategy keeps running, just against simulated capital.
The one gate that's expensive to reverse is skipping a gate. A strategy promoted straight from backtest to live will teach you what the paper stage would have, but it'll bill you for the lesson in real dollars.
The audit log
Every promotion and demotion event is a row in a table you keep forever:
CREATE TABLE strategy_events (
    strategy_id  TEXT NOT NULL,
    event_at     TEXT NOT NULL,  -- ISO 8601
    from_stage   TEXT NOT NULL,  -- backtest / paper / live / retired
    to_stage     TEXT NOT NULL,
    reason       TEXT NOT NULL,  -- human-readable
    metrics_json TEXT NOT NULL,  -- dump of the health metrics at the event
    operator     TEXT NOT NULL,  -- whoever / whatever triggered it
    PRIMARY KEY (strategy_id, event_at)
);
This table is how you learn. Six months in, you can filter to all to_stage = 'paper' events after a live stage and ask: of the strategies that got demoted, how many recovered? How many retired? What was the average time-to-diagnosis? That answer tells you whether your gates are calibrated or too loose.
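The recovery question is a self-join on the events table: a demotion "recovered" if a later paper-to-live row exists for the same strategy. A sketch with sqlite3 — the sample rows, strategy names, and dates are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE strategy_events (
    strategy_id TEXT NOT NULL, event_at TEXT NOT NULL,
    from_stage TEXT NOT NULL, to_stage TEXT NOT NULL,
    reason TEXT NOT NULL, metrics_json TEXT NOT NULL,
    operator TEXT NOT NULL, PRIMARY KEY (strategy_id, event_at))""")
con.executemany("INSERT INTO strategy_events VALUES (?,?,?,?,?,?,?)", [
    ("mom_1", "2024-03-01T00:00:00Z", "live", "paper", "sharpe_decay", "{}", "auto"),
    ("mom_1", "2024-04-15T00:00:00Z", "paper", "live", "recovered", "{}", "ops"),
    ("rev_2", "2024-03-10T00:00:00Z", "live", "paper", "drawdown_envelope", "{}", "auto"),
    ("rev_2", "2024-05-01T00:00:00Z", "paper", "retired", "no recovery", "{}", "ops"),
])
# Of the live -> paper demotions, how many later re-promoted to live?
recovered = con.execute("""
    SELECT COUNT(DISTINCT d.strategy_id)
    FROM strategy_events d
    JOIN strategy_events p
      ON p.strategy_id = d.strategy_id
     AND p.from_stage = 'paper' AND p.to_stage = 'live'
     AND p.event_at > d.event_at
    WHERE d.from_stage = 'live' AND d.to_stage = 'paper'
""").fetchone()[0]
print(recovered)  # → 1 (mom_1 recovered; rev_2 retired)
```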
Connects to
- Heartbeats, Watchdogs, and Circuit Breakers for Trading Systems — the reliability primitives that must be green for the full paper window.
- Did You Overfit? Running PBO and Deflated Sharpe on Your Backtest — the statistical pre-requisites for gate 1.1 and 1.3.
- Walk-Forward Validation: A Cookbook for Honest Backtests — how to construct the OOS Sharpe for gate 1.2.
- Trading System Blueprinter — generate a stage-gate deployment spec for a given strategy profile.
- Execution Simulator — quantify expected slippage before measuring actual slippage in paper.
- Backtest Overfitting Score — compute PBO and DSR in-browser.
References
- Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5).
- Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance.
- López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
- Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies. 2nd ed. Wiley.