TL;DR
If you backtested many candidate strategies and picked the best Sharpe, the best strategy is probably overfit. Two complementary tests quantify how much: PBO via combinatorially symmetric cross-validation (how often the in-sample winner underperforms out of sample), and DSR, the Deflated Sharpe Ratio (whether the winner's Sharpe is statistically real given the number of trials and the return distribution's non-normality). A PBO above 0.5 or a DSR below 50% is a red flag. This article shows how to compute both in ~80 lines of Python, or upload your CSV directly to /tools/backtest-overfitting-score/.
The failure mode
You have 50 candidate strategies. You run each on 10 years of history. The best one has Sharpe 2.1. You write a breathless Reddit post about it. You deploy it live. Three months later it's flat.
This is one of the best-documented failure modes in retail algo trading, and it has a well-studied statistical signature: picking the best of many backtests almost always selects a lucky outlier rather than a real edge. The fix is not to stop testing many strategies; it's to measure the multiple-testing correction and discount the winner accordingly.
Test 1: Deflated Sharpe Ratio
Bailey & Lopez de Prado's 2014 Deflated Sharpe Ratio is the closed-form answer to "is this Sharpe real or lucky?"[^1] It takes the observed Sharpe and deflates it for:
- Selection bias — the number of strategies you tested to find this winner
- Non-normality — skewness and excess kurtosis of the return distribution
The formula:
DSR = Φ( (SR·√(T − 1) − E[max SR*]) / √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )

where SR is the observed per-period Sharpe, T is the number of observations, γ₃ is skewness, γ₄ is kurtosis (3 for a normal distribution), and E[max SR*] is the expected maximum of the N standardized trial Sharpes under the null that all N strategies have zero true Sharpe.
E[max SR*] is approximated by the expected maximum of N independent standard normals:

E[max SR*] ≈ (1 − γ) · Φ⁻¹(1 − 1/N) + γ · Φ⁻¹(1 − 1/(N·e))

where γ ≈ 0.5772 is the Euler–Mascheroni constant. (Bailey & Lopez de Prado's general form scales this benchmark by the standard deviation of the Sharpe estimates across all trials; comparing it against SR·√(T − 1), as above, uses that standard deviation's value under the null.)
In Python
```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def expected_max_sr(n):
    """Expected maximum of n standard normals (all-noise null)."""
    if n <= 1:
        return 0.0
    gamma = 0.5772156649  # Euler–Mascheroni constant
    return (1 - gamma) * norm.ppf(1 - 1/n) + gamma * norm.ppf(1 - 1/(n * np.e))

def deflated_sharpe(returns, n_strategies):
    """DSR of the winning strategy's per-period returns, given the trial count."""
    mean = returns.mean()
    std = returns.std(ddof=1)
    sr = mean / std  # per-period Sharpe (NOT annualized)
    t = len(returns)
    s3 = skew(returns, bias=False)
    s4 = kurtosis(returns, fisher=False, bias=False)  # raw kurtosis (normal = 3)
    e_max = expected_max_sr(n_strategies)
    denom = np.sqrt(1 - s3 * sr + ((s4 - 1) / 4) * sr ** 2)
    # sr * sqrt(t - 1) is the standardized Sharpe; deflate it by the expected
    # maximum across n_strategies all-noise trials.
    z = (sr * np.sqrt(t - 1) - e_max) / denom
    return norm.cdf(z)
```
DSR is a probability in [0, 1]. Values above ~95% mean the Sharpe is very likely real after correcting for selection bias; values below 50% mean the winner is probably a coincidence.
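As a quick sanity check on synthetic data (sizes and seed are arbitrary), run the winner of a pure-noise sweep through the function; the deflation should flag it:

```python
# 50 pure-noise strategies, ~4 years of daily observations each.
rng = np.random.default_rng(42)
noise = rng.normal(0.0, 0.01, size=(50, 1000))  # (strategies, observations)
srs = noise.mean(axis=1) / noise.std(axis=1, ddof=1)
winner = noise[np.argmax(srs)]

print(f"best in-sample Sharpe: {srs.max():.3f}")  # looks tempting
print(f"DSR: {deflated_sharpe(winner, n_strategies=50):.1%}")  # well below the ~95% bar
```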
Test 2: Probability of Backtest Overfitting (PBO)
Bailey, Borwein, Lopez de Prado & Zhu's Combinatorially Symmetric Cross-Validation (CSCV) is complementary to DSR[^2]. Where DSR asks "is the Sharpe real?", PBO asks "does the in-sample best strategy actually win out-of-sample?"
The algorithm:
- Split the observation axis into `2·S` equal chunks. `S = 8` is common.
- Enumerate all `C(2S, S)` ways to pick `S` chunks as in-sample (IS), leaving the other `S` as out-of-sample (OOS).
- For each combination:
  - Find the strategy with the highest IS Sharpe → call it `n*`.
  - Compute `n*`'s rank in the OOS Sharpes of all strategies.
  - Compute the logit of the fractional rank.
- PBO = the fraction of combinations where the logit is negative, i.e. the IS winner ranked below the OOS median.
In Python
```python
from itertools import combinations
import numpy as np

def pbo(returns, s=8):
    """returns shape: (N_strategies, T_observations)"""
    n, t = returns.shape
    chunk_size = t // (2 * s)
    chunks = [returns[:, i*chunk_size:(i+1)*chunk_size] for i in range(2*s)]
    below = 0
    total = 0
    for is_idx in combinations(range(2*s), s):
        is_set = set(is_idx)
        is_mat = np.concatenate([chunks[i] for i in is_idx], axis=1)
        oos_mat = np.concatenate([chunks[i] for i in range(2*s) if i not in is_set], axis=1)
        # Per-period Sharpe of every strategy, in-sample and out-of-sample
        is_sr = is_mat.mean(axis=1) / is_mat.std(axis=1, ddof=1)
        oos_sr = oos_mat.mean(axis=1) / oos_mat.std(axis=1, ddof=1)
        best = int(np.argmax(is_sr))  # in-sample winner
        oos_best = oos_sr[best]
        # Fractional OOS rank of the IS winner among the other strategies
        rank = np.sum(oos_sr < oos_best) / (n - 1)
        clamped = max(0.001, min(0.999, rank))  # keep the logit finite
        logit = np.log(clamped / (1 - clamped))
        if logit < 0:  # IS winner landed below the OOS median
            below += 1
        total += 1
    return below / total
```
For `S = 8` that is C(16, 8) = 12,870 combinations. In practice, 500 random combinations are enough for roughly ±4% precision. The browser tool uses exactly this sampling rule; see /methodology/backtest-overfitting-score/.
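A minimal sketch of that sampling (the function name and seed handling are mine, not the tool's): draw distinct random `S`-subsets and iterate over those instead of the full enumeration in `pbo`.

```python
import random

def sampled_splits(s=8, n_samples=500, seed=0):
    """Distinct random IS chunk-subsets in place of all C(2S, S) of them."""
    rng = random.Random(seed)
    splits = set()
    while len(splits) < n_samples:
        splits.add(tuple(sorted(rng.sample(range(2 * s), s))))
    return sorted(splits)

# In pbo, replace `combinations(range(2*s), s)` with `sampled_splits(s, 500)`.
```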
Interpreting the two together
| PBO | DSR | Interpretation |
|---|---|---|
| < 0.2 | > 95% | Edge appears to generalize. Still verify with forward paper-trading. |
| < 0.2 | 50–95% | IS / OOS consistency is good, but Sharpe statistical significance is marginal. Small edge or noisy returns. |
| 0.2–0.5 | > 95% | Sharpe is statistically real but cross-validation is uncertain. Possible regime change; investigate. |
| > 0.5 | < 50% | Classic overfit signature. Stop. |
What this doesn't catch
PBO + DSR are necessary but not sufficient. They do not detect:
- Survivor bias in the universe selection (testing only strategies that survived to the end of the dataset)
- Look-ahead bias from data leakage (using information not available at the decision time)
- Regime dependence — a strategy that worked in 2019–2022 may not work in 2024–2025, even if PBO / DSR look fine on the combined window
- Trading cost omission — backtest returns need to be after costs for DSR to mean anything for live trading
Rule of thumb: if your backtest does not deduct commissions, slippage, and market impact, a reassuring PBO or DSR on those returns overstates your live prospects.
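A minimal illustration of the adjustment (the 5 bps cost figure is a placeholder, and market impact is deliberately left out):

```python
def net_returns(gross, positions, cost_bps=5.0):
    """Deduct per-unit-turnover costs from gross returns.
    gross, positions: (N_strategies, T) arrays, positions in [-1, 1]."""
    turnover = np.abs(np.diff(positions, axis=1, prepend=0.0))
    return gross - turnover * cost_bps / 1e4
```

Run PBO and DSR on the net matrix, not the gross one.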
Run it on your backtest
/tools/backtest-overfitting-score/ accepts a wide-format CSV:
```
date,strategy_1,strategy_2,strategy_3
2020-01-02,0.0012,-0.0005,0.0003
2020-01-03,0.0041,0.0009,-0.0002
```
One row per observation, one column per candidate strategy. Returns as simple (non-log) daily returns. Computation runs entirely in the browser; your data never leaves your machine.
Load the synthetic demo first to see the shape, then upload your own. If PBO > 0.5, do not trade the winner.
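To run the same checks locally on such a file (the filename here is hypothetical), something like:

```python
import pandas as pd

df = pd.read_csv("returns.csv", index_col="date", parse_dates=True)
returns = df.to_numpy().T  # (N_strategies, T_observations), the shape pbo expects

is_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
winner = int(np.argmax(is_sr))
print(f"PBO: {pbo(returns):.2f}")
print(f"DSR of IS winner ({df.columns[winner]}): "
      f"{deflated_sharpe(returns[winner], n_strategies=returns.shape[0]):.1%}")
```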
Then: size calibrated, not full Kelly
If you pass PBO + DSR, the next question is how much to bet. On an estimated edge, full Kelly is too aggressive because it assumes your probability estimate is exact. Use fractional Kelly (quarter or eighth) and stress-test the drawdown distribution. The Fractional Kelly Sizer does exactly that with Monte Carlo paths.
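Not the Fractional Kelly Sizer's implementation, just a sketch of the stress-test idea: bootstrap the strategy's daily returns at a given leverage (standing in for the fractional-Kelly position size) and inspect the drawdown tail.

```python
def drawdown_tail(returns, leverage=0.25, n_paths=10_000, horizon=252, seed=0):
    """Bootstrap one-year paths at the given leverage; return the
    95th-percentile maximum drawdown across paths."""
    rng = np.random.default_rng(seed)
    paths = rng.choice(returns, size=(n_paths, horizon), replace=True)
    equity = np.cumprod(1.0 + leverage * paths, axis=1)
    running_max = np.maximum.accumulate(equity, axis=1)
    max_dd = (1.0 - equity / running_max).max(axis=1)
    return np.percentile(max_dd, 95)
```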
References
- Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
- White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126.
Footnotes

[^1]: Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107.

[^2]: Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70.