TL;DR

A useful backtest report answers five questions, in order: was the edge real, was it persistent, was it cheap enough to trade, was the risk bearable, and could you tell yourself the story? The first is Sharpe + Deflated Sharpe + Probability of Backtest Overfitting (PBO). The second is walk-forward efficiency. The third is turnover-adjusted return. The fourth is max drawdown + skew/kurtosis + 3σ tail mass. The fifth is a per-trade log you can scroll without flinching. Skip any one and you are reading a sales deck, not a backtest.

Why this cheat sheet exists

Retail backtest reports range from thoughtful to marketing. The difference is which statistics get surfaced, which get buried, and which get omitted. This is a checklist for reading someone else's backtest (or your own) without getting fooled. Every item links to a browser tool that runs the calculation on a returns CSV; you can verify a report while you are reading it.

1 · Was the edge real?

The first question is whether the reported Sharpe would have been reachable by random chance, given how many candidates the author tested.

  • Sharpe ratio (annualized) — the headline. Anything below 0.5 is noise at daily frequencies; 1.0 is decent; above 2.0 requires scrutiny.
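  • Deflated Sharpe Ratio (DSR). Bailey & López de Prado's closed-form correction for multiple testing, non-normal returns, and sample length. If the author tested 50 strategies and reports the best one at Sharpe 2.1, the DSR is the probability that the true Sharpe is above zero once you account for how good the best of 50 zero-edge trials would be expected to look by luck alone. A DSR well below roughly 0.95 means the 2.1 is not distinguishable from selection bias. Use the Backtest Overfitting Score to compute it directly.
  • Probability of Backtest Overfitting (PBO) via Combinatorially-Symmetric Cross-Validation (CSCV). Measures how often the in-sample winner lands below the out-of-sample median across all train/test splits. PBO near 0.5 means picking the in-sample winner is no better than picking at random; above 0.5, the selection procedure is actively choosing strategies that underperform out of sample.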
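
As a rough illustration of the first two statistics, the sketch below computes an annualized Sharpe and a Deflated Sharpe Ratio from a list of per-period excess returns, following the published Bailey & López de Prado formula. The function names, the 252-period year, and the sharpe_variance input (the variance of per-period Sharpe ratios across all candidates the author tried) are assumptions made for this example, not the output format of any particular tool.

```python
import math
from statistics import NormalDist, mean, pstdev, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sharpe(returns):
    """Per-period Sharpe ratio; returns are assumed to be excess returns."""
    return mean(returns) / stdev(returns)


def annualized_sharpe(returns, periods_per_year=252):
    """Scale the per-period Sharpe by sqrt(periods per year)."""
    return sharpe(returns) * math.sqrt(periods_per_year)


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best per-period Sharpe among n_trials zero-edge strategies."""
    if n_trials < 2:
        return 0.0  # a single trial has no selection bias to correct
    nd = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(returns, n_trials, sharpe_variance):
    """Probability that the true Sharpe exceeds the luck-only benchmark."""
    sr = sharpe(returns)
    m, s = mean(returns), pstdev(returns)
    skew = mean(((r - m) / s) ** 3 for r in returns)
    kurt = mean(((r - m) / s) ** 4 for r in returns)  # raw kurtosis, normal = 3
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - benchmark) * math.sqrt(len(returns) - 1) / denom
    return NormalDist().cdf(z)
```

Calling deflated_sharpe(daily_returns, n_trials=50, sharpe_variance=0.001) on the same series with a larger n_trials always gives a lower number, which is the point: the more candidates were tried, the less the winner's Sharpe is worth.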

Red flag: a Sharpe above 1.5 reported without a Deflated Sharpe or PBO number is either naïve or evasive.
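
For readers who want to see what a PBO number is made of, here is a minimal CSCV sketch. It assumes you have the per-period returns of every candidate the author tested (trials, one list per candidate), uses Sharpe as the performance measure, and splits history into eight contiguous blocks; the names pbo_cscv and n_blocks are illustrative choices for this example.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def sharpe(returns):
    s = stdev(returns)
    return mean(returns) / s if s > 0 else 0.0


def pbo_cscv(trials, n_blocks=8):
    """Fraction of splits where the in-sample winner ranks at or below
    the out-of-sample median. trials: equal-length return series."""
    block_len = len(trials[0]) // n_blocks  # trailing remainder is dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    n_trials = len(trials)
    overfit, n_splits = 0, 0
    # Every way of choosing half the blocks as in-sample; the rest is OOS.
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [i for b in is_ids for i in blocks[b]]
        oos_rows = [i for b in range(n_blocks) if b not in is_ids
                    for i in blocks[b]]
        is_perf = [sharpe([t[i] for i in is_rows]) for t in trials]
        oos_perf = [sharpe([t[i] for i in oos_rows]) for t in trials]
        winner = max(range(n_trials), key=is_perf.__getitem__)
        # OOS rank of the in-sample winner: 1 = worst, n_trials = best.
        rank = sum(1 for p in oos_perf if p <= oos_perf[winner])
        omega = rank / (n_trials + 1)
        logit = math.log(omega / (1 - omega))
        if logit <= 0:  # winner is at or below the OOS median
            overfit += 1
        n_splits += 1
    return overfit / n_splits
```

With eight blocks there are 70 splits, which is usually enough to see whether the winner's rank holds up; more blocks give a smoother estimate at a combinatorial cost.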

2 · Was the edge persistent?

A strategy that wins in 2024 and loses in 2023 may simply be fitted to 2024. Walk-forward analysis slices history into rolling (or expanding) train/test windows and reports how the strategy performs when parameters are locked in on one window and evaluated on the next.

  • Walk-forward efficiency (WFE) — ratio of out-of-sample Sharpe to in-sample Sharpe. Above 0.5 is good; below 0.3 is overfit.
  • Per-window Sharpe stability — a single blow-up window can hide inside a respectable average. Scan the window-by-window table, not just the summary number.
  • Regime coverage — does the backtest span at least one material regime shift (e.g. rate cycle, volatility regime)? If it is three years of calm bull market, the out-of-sample performance in a crash is a pure extrapolation.

The Walk-Forward Validator generates both numbers and a concatenated OOS equity curve from a returns CSV.
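
The bookkeeping behind those numbers can be sketched in a few lines. This example assumes the returns series already comes from a strategy whose parameters were refit on each training window, uses rolling windows that advance by one test length, and defines WFE as mean out-of-sample Sharpe over mean in-sample Sharpe; the function names and the averaging choice are assumptions for illustration, and a given tool may aggregate differently.

```python
import math
from statistics import mean, stdev


def ann_sharpe(returns, periods_per_year=252):
    s = stdev(returns)
    return mean(returns) / s * math.sqrt(periods_per_year) if s > 0 else 0.0


def walk_forward(returns, train_len, test_len):
    """Return a list of (in-sample Sharpe, out-of-sample Sharpe) per window."""
    windows = []
    start = 0
    while start + train_len + test_len <= len(returns):
        train = returns[start:start + train_len]
        test = returns[start + train_len:start + train_len + test_len]
        windows.append((ann_sharpe(train), ann_sharpe(test)))
        start += test_len  # roll forward by one test window
    return windows


def walk_forward_efficiency(windows):
    """Mean OOS Sharpe divided by mean IS Sharpe."""
    is_mean = mean(w[0] for w in windows)
    oos_mean = mean(w[1] for w in windows)
    return oos_mean / is_mean if is_mean != 0 else float("nan")
```

Printing the windows list row by row is the window-by-window table; the single WFE figure is only the summary of it.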

3 · Was it cheap enough to trade?

Gross Sharpe is decorative. Net Sharpe — after commissions, slippage, borrow fees, tax-loss drag — is what you trade.

  • Turnover. Portfolio turns per year. Higher turnover multiplies per-trade frictions: at 500% annual turnover, a 0.05%-per-trade cost removes about 0.25% of annual return, and a 0.5% all-in cost removes about 2.5%. Fixed commissions make the drag larger for smaller portfolios.
  • Implementation-shortfall model — does the report include a slippage model, and is it conservative (e.g. half-spread + one tick for market orders)?
  • Borrow costs — for short positions, include the hard-to-borrow fee for the actual universe. For small-cap shorts, this frequently exceeds the gross edge.
  • Minimum scalable size — at what AUM does the strategy cease to be executable without moving the market? If the author does not say, assume the answer is uncomfortable.

Red flag: a "net Sharpe" that differs from gross Sharpe by less than 0.1 in a high-turnover strategy. The author forgot something.
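
A quick way to sanity-check a reported gross-to-net gap is to redo the friction arithmetic yourself. The sketch below assumes turnover is expressed as a multiple of portfolio value (5.0 means 500%) and cost as a fraction of traded value (0.0005 means 0.05%); the function names and the round_trip switch are hypothetical conveniences for this example.

```python
import math
from statistics import mean, stdev


def annual_cost_drag(annual_turnover, cost_per_trade, round_trip=False):
    """Annual return lost to per-trade costs, as a fraction of capital."""
    legs = 2 if round_trip else 1  # count buy and sell if turnover is one-way
    return annual_turnover * cost_per_trade * legs


def net_returns(gross_returns, turnover_per_period, cost_per_trade):
    """Subtract each period's trading cost from its gross return."""
    return [r - t * cost_per_trade
            for r, t in zip(gross_returns, turnover_per_period)]


def ann_sharpe(returns, periods_per_year=252):
    s = stdev(returns)
    return mean(returns) / s * math.sqrt(periods_per_year) if s > 0 else 0.0


# Example: 500% turnover at 0.05% per trade, spread evenly over 252 days.
gross = [0.011, -0.009] * 126
turnover = [5.0 / 252] * 252
net = net_returns(gross, turnover, 0.0005)
print(annual_cost_drag(5.0, 0.0005))       # cost drag as a fraction of capital
print(ann_sharpe(gross), ann_sharpe(net))  # net Sharpe is the lower of the two
```

If the report's own gap between gross and net Sharpe is much smaller than what this arithmetic gives for its stated turnover, some cost line is missing.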

4 · Was the risk bearable?

Sharpe bakes upside and downside variance into one number. The risk profile of a realistic portfolio needs more than that.

  • Max drawdown — as a percentage of capital. A 1.5-Sharpe strategy with 40% peak-to-trough drawdown is unholdable by most retail operators.
  • Calmar ratio — annualized return / max drawdown. Above 1.0 is solid; below 0.5 is punishing.
  • Skewness + excess kurtosis — a 1.5 Sharpe with −2.0 skew and 6.0 excess kurtosis is a 1.5 Sharpe that blows up one month in 24. Compute these directly via the Returns Distribution Analyzer.
  • 3σ tail mass — the fraction of observations more than three standard deviations from the mean. A normal distribution has ~0.27%. Strategies with >1% 3σ mass have a tail profile Sharpe cannot describe.
  • Time in drawdown — it is easier to hold a short deep drawdown than a long shallow one. A DD chart with long flat bottoms is a behavioural test the ratio table hides.
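
All of the statistics in this list can be computed from a plain list of per-period returns. The following is a minimal sketch, assuming simple (not log) returns compounded into an equity curve that starts at 1.0 and a 252-period year; the function names are illustrative and population moments are used for skew and kurtosis.

```python
from statistics import mean, pstdev


def equity_curve(returns):
    equity, curve = 1.0, []
    for r in returns:
        equity *= 1 + r
        curve.append(equity)
    return curve


def max_drawdown(returns):
    """Largest peak-to-trough loss as a fraction of the peak."""
    peak, worst = 1.0, 0.0
    for e in equity_curve(returns):
        peak = max(peak, e)
        worst = max(worst, 1 - e / peak)
    return worst


def longest_drawdown(returns):
    """Longest run of consecutive periods spent below the prior peak."""
    peak, run, longest = 1.0, 0, 0
    for e in equity_curve(returns):
        if e >= peak:
            peak, run = e, 0
        else:
            run += 1
            longest = max(longest, run)
    return longest


def calmar(returns, periods_per_year=252):
    """Annualized compound return divided by max drawdown."""
    cagr = equity_curve(returns)[-1] ** (periods_per_year / len(returns)) - 1
    dd = max_drawdown(returns)
    return cagr / dd if dd > 0 else float("inf")


def shape(returns):
    """Return (skewness, excess kurtosis, 3-sigma tail mass)."""
    m, s = mean(returns), pstdev(returns)
    z = [(r - m) / s for r in returns]
    skew = mean(x ** 3 for x in z)
    excess_kurt = mean(x ** 4 for x in z) - 3
    tail_mass = sum(1 for x in z if abs(x) > 3) / len(z)
    return skew, excess_kurt, tail_mass
```

Compare the tail_mass output against the roughly 0.27% a normal distribution would give, and read longest_drawdown alongside max_drawdown rather than instead of it.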

5 · Could you tell yourself the story?

The last check is qualitative and indispensable. A real edge comes with a testable explanation — why the market structurally leaves this alpha unclaimed, by whom, for how long.

  • Factor decomposition — when you regress the strategy's returns on market, size, value, momentum, quality, and low-volatility, how much alpha survives? An unexplained residual is the candidate edge; an explained return is just a factor in a trench coat.
  • Capacity story — who is on the other side of the trade, why are they taking it, and at what portfolio size does that counterparty run out?
  • Trade log legibility — the top-10 winners and top-10 losers: do they make sense in the story? Or are the winners all flukes on illiquid days?
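
The factor decomposition is an ordinary least-squares regression of strategy returns on factor returns, with the intercept as the candidate alpha. Below is a dependency-free sketch that solves the normal equations directly; the function name ols and the two made-up factor series are assumptions for illustration, and in practice you would use real factor return data and a numerical library that also reports standard errors.

```python
def ols(y, factors):
    """Regress y on an intercept plus the given factor series.
    Returns [alpha, beta_1, ..., beta_k]. Raises ZeroDivisionError
    if the factors are perfectly collinear."""
    n, k = len(y), len(factors) + 1
    X = [[1.0] + [f[i] for f in factors] for i in range(n)]
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]


# Hypothetical per-period factor returns and a strategy built from them.
market = [0.01, -0.02, 0.015, 0.0, -0.005, 0.02]
momentum = [0.005, 0.01, -0.01, 0.0, 0.02, -0.015]
strategy = [0.001 + 1.2 * m - 0.5 * q for m, q in zip(market, momentum)]

alpha, beta_mkt, beta_mom = ols(strategy, [market, momentum])
print(alpha, beta_mkt, beta_mom)  # per-period alpha and factor loadings
```

The alpha here is per period; whether it survives is a question of its size relative to its standard error, which this sketch does not compute.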

A backtest that scores perfectly on 1–4 but cannot tell a story for #5 is either a data artefact or an edge about to disappear.

Putting it together

A proper backtest report has the following structure, in order:

  1. One-paragraph story (what is the edge, why does it exist, who pays).
  2. Universe + period + rebalance frequency + commission model.
  3. Sharpe, Deflated Sharpe, PBO, walk-forward efficiency.
  4. Net of frictions: turnover, slippage, borrow costs.
  5. Drawdown chart + skew/kurtosis + 3σ tail mass.
  6. Top-10 winners + losers with human-readable descriptions.
  7. What would falsify the story (the kill criterion).

Missing #3, #5, or #7 is a red flag. Missing #1 is a marketing deck.

Tools referenced