TL;DR
A useful backtest report answers five questions, in order: Was the edge real? Was it persistent? Was it cheap enough to trade? Was the risk bearable? Could you tell yourself the story? The first is answered by Sharpe + Deflated Sharpe + Probability of Backtest Overfitting (PBO), the second by walk-forward efficiency, the third by turnover-adjusted return, the fourth by max drawdown + kurtosis + 3σ tail mass, and the fifth by a per-trade log you can scroll without flinching. Skip any one and you are reading a sales deck, not a backtest.
Why this cheat sheet exists
Retail backtest reports range from thoughtful to marketing. The difference is which statistics get surfaced, which get buried, and which get omitted. This is a checklist for reading someone else's backtest (or your own) without getting fooled. Every item links to a browser tool that runs the calculation on a returns CSV; you can verify a report while you are reading it.
1 · Was the edge real?
The first question is whether the reported Sharpe would have been reachable by random chance, given how many candidates the author tested.
- Sharpe ratio (annualized) — the headline. Anything below 0.5 is noise at daily frequencies; 1.0 is decent; above 2.0 requires scrutiny.
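- Deflated Sharpe Ratio (DSR) — Bailey & Lopez de Prado's closed-form correction for multiple testing. If the author tested 50 strategies and reports the best one at Sharpe 2.1, the DSR estimates the probability that the true Sharpe is above zero once you account for the best-of-50 selection, the sample length, and the skew and kurtosis of the returns. A low DSR means a zero-edge null could plausibly have produced that headline. Use the Backtest Overfitting Score to compute it directly.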
- Probability of Backtest Overfitting (PBO) via Combinatorially Symmetric Cross-Validation (CSCV). Measures how often the in-sample winner lands below the out-of-sample median. PBO near 0.5 means the selection procedure is no better than a coin flip; above 0.5 it is actively picking out-of-sample losers.
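The CSCV procedure can be sketched in a few lines. This is a minimal illustration, not the Backtest Overfitting Score's implementation: it assumes Sharpe as the ranking metric, an even `n_blocks`, and equal-length return series per candidate; `pbo_cscv` and its parameters are hypothetical names.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe (population std); 0.0 if the series is flat."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(strategy_returns, n_blocks=8):
    """Fraction of IS/OOS splits where the in-sample winner ranks at or
    below the out-of-sample median.

    strategy_returns: list of equal-length return series, one per candidate.
    n_blocks: even number of contiguous blocks (assumption of this sketch).
    """
    n_strats = len(strategy_returns)
    block_len = len(strategy_returns[0]) // n_blocks  # remainder is dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    logits = []
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in is_blocks
                   for i in blocks[b]]
        is_sr = [sharpe([s[i] for i in is_idx]) for s in strategy_returns]
        oos_sr = [sharpe([s[i] for i in oos_idx]) for s in strategy_returns]
        best = max(range(n_strats), key=is_sr.__getitem__)
        # Relative OOS rank of the IS winner, mapped into (0, 1).
        rank = sum(1 for v in oos_sr if v <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(1 for x in logits if x <= 0) / len(logits)
```

A candidate that dominates in every block yields a PBO of 0; a set of candidates that each shine in only one block pushes PBO toward 1, because whichever wins in-sample has nothing left out-of-sample.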
Red flag: a Sharpe above 1.5 reported without a Deflated Sharpe or PBO number is either naïve or evasive.
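To check a headline Sharpe against the number of candidates tested, the Deflated Sharpe calculation can be sketched as follows. This is a minimal version under stated assumptions: returns are per-period (not annualized), `n_trials` is at least 2, and `sharpe_std` (the standard deviation of per-period Sharpe across the tested candidates) is supplied by the reader. All function names are illustrative.

```python
import math
from statistics import NormalDist, mean, pstdev

EULER_GAMMA = 0.5772156649015329


def sharpe_and_moments(returns):
    """Per-period Sharpe, skewness and raw (non-excess) kurtosis."""
    mu, sd = mean(returns), pstdev(returns)
    z = [(r - mu) / sd for r in returns]
    n = len(z)
    skew = sum(v ** 3 for v in z) / n
    kurt = sum(v ** 4 for v in z) / n
    return mu / sd, skew, kurt


def annualized_sharpe(returns, periods_per_year=252):
    """Headline Sharpe, assuming a zero risk-free rate."""
    return sharpe_and_moments(returns)[0] * math.sqrt(periods_per_year)


def expected_max_sharpe(n_trials, sharpe_std):
    """Expected best per-period Sharpe among n_trials zero-edge candidates."""
    nd = NormalDist()
    return sharpe_std * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(returns, n_trials, sharpe_std):
    """Probability that the observed Sharpe beats the best-of-n null."""
    sr, skew, kurt = sharpe_and_moments(returns)
    sr0 = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(len(returns) - 1) / denom)
```

Holding the returns fixed, raising `n_trials` raises the benchmark `expected_max_sharpe` and lowers the DSR, which is the whole point: the same track record is less impressive when it is the best of 50 than the best of 2.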
2 · Was the edge persistent?
A strategy that wins 2024 and loses 2023 may be fitted to 2024. Walk-forward analysis slices history into rolling (or expanding) train/test windows and reports how the strategy performs when parameters are locked in on one window and evaluated on the next.
- Walk-forward efficiency (WFE) — ratio of out-of-sample Sharpe to in-sample Sharpe. Above 0.5 is good; below 0.3 is overfit.
- Per-window Sharpe stability — a single blow-up window can hide inside a respectable average. Scan the window-by-window table, not just the summary number.
- Regime coverage — does the backtest span at least one material regime shift (e.g. rate cycle, volatility regime)? If it is three years of calm bull market, the out-of-sample performance in a crash is a pure extrapolation.
The Walk-Forward Validator generates both numbers and a concatenated OOS equity curve from a returns CSV.
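For a returns series alone, the window slicing and the WFE ratio can be approximated as below. This is an evaluation-only sketch: it does not re-fit any parameters per window (that step belongs to the strategy, not the returns file), and `walk_forward`, `walk_forward_efficiency`, and their arguments are hypothetical names, not the Walk-Forward Validator's API.

```python
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe (population std); 0.0 if the series is flat."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def walk_forward(returns, train_len, test_len, expanding=False):
    """Return a list of (in-sample Sharpe, out-of-sample Sharpe) per window."""
    windows = []
    start = 0
    while start + train_len + test_len <= len(returns):
        train_start = 0 if expanding else start
        split = start + train_len
        train = returns[train_start:split]
        test = returns[split:split + test_len]
        windows.append((sharpe(train), sharpe(test)))
        start += test_len  # step forward by one test window
    return windows


def walk_forward_efficiency(windows):
    """Mean OOS Sharpe divided by mean IS Sharpe; NaN if IS is not positive."""
    is_avg = mean([w[0] for w in windows])
    oos_avg = mean([w[1] for w in windows])
    return oos_avg / is_avg if is_avg > 0 else float("nan")
```

Scanning `min(w[1] for w in windows)` alongside the ratio is a cheap way to catch the single blow-up window that a respectable average hides.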
3 · Was it cheap enough to trade?
Gross Sharpe is decorative. Net Sharpe — after commissions, slippage, borrow fees, tax-loss drag — is what you trade.
- Turnover — portfolio turns per year. Higher turnover multiplies per-trade frictions. 500% annual turnover at a 0.05%-per-trade cost eats 0.25% of annual return (2.5% at 0.5% per trade); more at smaller portfolios.
- Implementation-shortfall model — does the report include a slippage model, and is it conservative (e.g. half-spread + one tick for market orders)?
- Borrow costs — for short positions, include the hard-to-borrow fee for the actual universe. For small-cap shorts, this frequently exceeds the gross edge.
- Minimum scalable size — at what AUM does the strategy cease to be executable without moving the market? If the author does not say, assume the answer is uncomfortable.
Red flag: a "net Sharpe" that differs from gross Sharpe by less than 0.1 in a high-turnover strategy. The author forgot something.
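The gross-to-net arithmetic is simple enough to check by hand. A minimal sketch, assuming costs are a fixed fraction of traded notional and do not change volatility; the function names and the example figures are illustrative, not taken from any report.

```python
def annual_cost_drag(turnover, cost_per_trade):
    """Annual return lost to frictions.

    turnover: traded notional per year as a multiple of capital (5.0 = 500%).
    cost_per_trade: all-in cost as a fraction of traded notional (0.0005 = 0.05%).
    """
    return turnover * cost_per_trade


def net_sharpe(gross_return, volatility, turnover, cost_per_trade):
    """Annualized Sharpe after frictions, assuming a zero risk-free rate."""
    return (gross_return - annual_cost_drag(turnover, cost_per_trade)) / volatility


# Illustrative numbers: 12% gross return, 10% volatility, 20 turns a year.
gross = net_sharpe(0.12, 0.10, turnover=0.0, cost_per_trade=0.0005)
net = net_sharpe(0.12, 0.10, turnover=20.0, cost_per_trade=0.0005)
print(f"gross Sharpe {gross:.2f}, net Sharpe {net:.2f}")
```

With these inputs, 5 basis points per trade alone moves Sharpe by about 0.1, before slippage or borrow fees are counted, which is why a smaller gap in a high-turnover report deserves suspicion.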
4 · Was the risk bearable?
Sharpe bakes upside and downside variance into one number. The risk profile of a realistic portfolio needs more than that.
- Max drawdown — as a percentage of capital. A 1.5-Sharpe strategy with 40% peak-to-trough drawdown is unholdable by most retail operators.
- Calmar ratio — annualized return / max drawdown. Above 1.0 is solid; below 0.5 is punishing.
- Skewness + excess kurtosis — a 1.5 Sharpe with −2.0 skew and 6.0 excess kurtosis is a 1.5 Sharpe that blows up one month in 24. Compute these directly via the Returns Distribution Analyzer.
- 3σ tail mass — the fraction of observations more than three standard deviations from the mean. A normal distribution has ~0.27%. Strategies with >1% 3σ mass have a tail profile Sharpe cannot describe.
- Time in drawdown — it is easier to hold a short deep drawdown than a long shallow one. A DD chart with long flat bottoms is a behavioural test the ratio table hides.
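The risk statistics in this list can all be computed from the same per-period returns. A minimal sketch, assuming simple (not log) returns, population moments, and 252 periods per year by default; the function names are illustrative and not those of the Returns Distribution Analyzer.

```python
from statistics import mean, pstdev


def max_drawdown(returns):
    """Largest peak-to-trough loss as a fraction of the peak."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, 1 - equity / peak)
    return worst


def time_in_drawdown(returns):
    """Fraction of periods that end below the prior equity peak."""
    equity, peak, under = 1.0, 1.0, 0
    for r in returns:
        equity *= 1 + r
        if equity >= peak:
            peak = equity
        else:
            under += 1
    return under / len(returns)


def calmar(returns, periods_per_year=252):
    """Annualized compound return divided by max drawdown."""
    growth = 1.0
    for r in returns:
        growth *= 1 + r
    annual = growth ** (periods_per_year / len(returns)) - 1
    dd = max_drawdown(returns)
    return annual / dd if dd > 0 else float("inf")


def shape_stats(returns):
    """Skewness, excess kurtosis, and the share of |z| > 3 observations."""
    mu, sd = mean(returns), pstdev(returns)
    z = [(r - mu) / sd for r in returns]
    n = len(z)
    skew = sum(v ** 3 for v in z) / n
    excess_kurt = sum(v ** 4 for v in z) / n - 3
    tail_mass = sum(1 for v in z if abs(v) > 3) / n
    return skew, excess_kurt, tail_mass
```

Compare `tail_mass` against the roughly 0.27% a normal distribution would give; a series with one crash in a hundred quiet periods already sits at 1%.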
5 · Could you tell yourself the story?
The last check is qualitative and indispensable. A real edge comes with a testable explanation — why the market structurally leaves this alpha unclaimed, by whom, for how long.
- Factor decomposition — when you regress the strategy's returns on market, size, value, momentum, quality, and low-volatility, how much alpha survives? An unexplained residual is the candidate edge; an explained return is just a factor in a trench coat.
- Capacity story — who is on the other side of the trade, why are they taking it, and at what portfolio size does that counterparty run out?
- Trade log legibility — the top-10 winners and top-10 losers: do they make sense in the story? Or are the winners all flukes on illiquid days?
A backtest that scores perfectly on 1–4 but cannot tell a story for #5 is either a data artefact or an edge about to disappear.
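The factor decomposition is an ordinary least-squares regression of strategy returns on factor returns, with the intercept as the candidate alpha. A dependency-free sketch, assuming aligned per-period series and factors that are not collinear; `factor_regression` is a hypothetical helper, and in practice a numerical library would do the solve.

```python
def factor_regression(strategy, factors):
    """OLS of strategy returns on factor returns.

    strategy: list of per-period returns.
    factors: list of factor return series, each aligned with `strategy`.
    Returns [alpha, beta_1, ..., beta_k]; alpha is per period.
    """
    n = len(strategy)
    rows = [[1.0] + [f[i] for f in factors] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    a = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
         for i in range(k)]
    c = [sum(rows[r][i] * strategy[r] for r in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= factor * a[col][j]
            c[r] -= factor * c[col]
    # Back substitution.
    coefs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * coefs[j] for j in range(i + 1, k))
        coefs[i] = (c[i] - tail) / a[i][i]
    return coefs
```

If the intercept shrinks toward zero once market, size, value, momentum, quality, and low-volatility series are passed in as `factors`, the return was factor exposure rather than an unexplained edge.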
Putting it together
A proper backtest report has the following structure, in order:
1. One-paragraph story (what is the edge, why does it exist, who pays).
2. Universe + period + rebalance frequency + commission model.
3. Sharpe, Deflated Sharpe, PBO, walk-forward efficiency.
4. Net of frictions: turnover, slippage, borrow costs.
5. Drawdown chart + skew/kurtosis + 3σ tail mass.
6. Top-10 winners + losers with human-readable descriptions.
7. What would falsify the story (the kill criterion).
Missing #3, #5, or #7 is a red flag. Missing #1 is a marketing deck.
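The structure above doubles as a review checklist. A small sketch of how it might be encoded; the item numbers follow the order of the list, while the verdict labels and the "incomplete" case for the remaining items are assumptions of this example.

```python
REPORT_ITEMS = {
    1: "one-paragraph story",
    2: "universe, period, rebalance frequency, commission model",
    3: "Sharpe, Deflated Sharpe, PBO, walk-forward efficiency",
    4: "net of frictions: turnover, slippage, borrow costs",
    5: "drawdown chart, skew/kurtosis, 3-sigma tail mass",
    6: "top-10 winners and losers with descriptions",
    7: "kill criterion",
}
RED_FLAG_ITEMS = {3, 5, 7}


def review(present):
    """Return (verdict, missing item numbers) for a report.

    present: iterable of item numbers the report actually contains.
    """
    missing = sorted(set(REPORT_ITEMS) - set(present))
    if 1 in missing:
        verdict = "marketing deck"
    elif RED_FLAG_ITEMS & set(missing):
        verdict = "red flag"
    elif missing:
        verdict = "incomplete"  # assumption: not covered by the text above
    else:
        verdict = "complete"
    return verdict, missing


verdict, missing = review({1, 2, 3, 4, 6})
print(verdict, [REPORT_ITEMS[i] for i in missing])
```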
Tools referenced
- Backtest Overfitting Score — PBO + Deflated Sharpe on a returns CSV
- Walk-Forward Validator — rolling / expanding IS/OOS evaluation
- Risk-Adjusted Returns Calculator — Sharpe, Sortino, Calmar, Omega, alpha, beta, IR
- Returns Distribution Analyzer — skew, kurtosis, Jarque-Bera, 3σ tail mass
- Correlation Matrix Visualizer — cross-strategy redundancy before allocation