TL;DR

If you backtested many candidate strategies and picked the best Sharpe, the best strategy is probably overfit. Two complementary tests quantify how much: PBO, the Probability of Backtest Overfitting, computed via Combinatorially Symmetric Cross-Validation (how often the in-sample winner underperforms out of sample), and DSR, the Deflated Sharpe Ratio (whether the winner's Sharpe is statistically real given the number of trials and the return distribution's non-normality). A PBO above 0.5 or a DSR below 50% is a red flag. This article shows how to compute both in ~80 lines of Python, or upload your CSV directly to /tools/backtest-overfitting-score/.

The failure mode

You have 50 candidate strategies. You run each on 10 years of history. The best one has Sharpe 2.1. You write a breathless Reddit post about it. You deploy it live. Three months later it's flat.

This is among the best-documented failure modes in retail algo trading, and it has a well-studied statistical signature: picking the best of many backtests nearly always selects a lucky outlier rather than a real edge. The fix is not to stop testing many strategies; it's to measure the multiple-testing correction and discount the winner accordingly.

Test 1: Deflated Sharpe Ratio

Bailey & Lopez de Prado's 2014 Deflated Sharpe Ratio is the closed-form answer to "is this Sharpe real or lucky?" [1] It takes the observed Sharpe and deflates it for:

  1. Selection bias: the number of strategies you tested to find this winner
  2. Non-normality: skewness and kurtosis of the return distribution

The formula:

DSR = Φ( (SR − E[max SR*]) · √(T − 1) / √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )

where SR is the observed per-period Sharpe, T is the number of observations, γ₃ is the skewness of returns, γ₄ is the kurtosis of returns (3 under normality; note this is the full kurtosis, not the excess), and E[max SR*] is the expected maximum Sharpe under the null that all N strategies have zero true Sharpe.

E[max SR*] is approximated as:

E[max SR*] ≈ σ_SR · [ (1 − γ) · Φ⁻¹(1 − 1/N) + γ · Φ⁻¹(1 − 1/(N·e)) ]

where γ ≈ 0.5772 is the Euler–Mascheroni constant and σ_SR is the standard deviation of the Sharpe estimates across your N trials. The bracketed term is the expected maximum of N standard normals; σ_SR rescales it onto the same per-period scale as SR. If the trial Sharpes are no longer available, σ_SR ≈ 1/√T under the zero-Sharpe null. For N = 50 and T = 2,520 daily observations (ten years), the bracketed term is ≈ 2.28, so luck alone buys an expected best Sharpe of ≈ 2.28/√2520 ≈ 0.045 per day, about 0.72 annualized.

In Python

import numpy as np
from scipy.stats import norm, skew, kurtosis

def expected_max_sr(n):
    """Expected max of n iid standard-normal Sharpe estimates, in
    standardized units; rescale by the cross-trial SR std before use."""
    if n <= 1:
        return 0.0
    gamma = 0.5772156649  # Euler–Mascheroni constant
    return (1 - gamma) * norm.ppf(1 - 1/n) + gamma * norm.ppf(1 - 1/(n * np.e))

def deflated_sharpe(returns, n_strategies, var_trial_sr=None):
    returns = np.asarray(returns)
    sr = returns.mean() / returns.std(ddof=1)  # per-period Sharpe (NOT annualized)
    t = len(returns)
    s3 = skew(returns, bias=False)
    s4 = kurtosis(returns, fisher=False, bias=False)  # full kurtosis, 3 if normal
    # expected_max_sr() is in standardized units; rescale by the variance of
    # the Sharpe estimates across trials. Pass var_trial_sr if you still have
    # all N trial Sharpes; otherwise use ~1/T, its value under the null.
    if var_trial_sr is None:
        var_trial_sr = 1.0 / t
    sr0 = np.sqrt(var_trial_sr) * expected_max_sr(n_strategies)
    denom = np.sqrt(1 - s3 * sr + ((s4 - 1) / 4) * sr ** 2)
    z = ((sr - sr0) * np.sqrt(t - 1)) / denom
    return norm.cdf(z)
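
A quick sanity check with synthetic numbers: 50 trials, ten years of daily data, and a true edge chosen so the annualized Sharpe is near 2.1. All values here are illustrative, not from a real backtest:

# Synthetic check: does a Sharpe-2.1 winner survive an N=50 search?
rng = np.random.default_rng(42)
t, n_trials = 2520, 50                      # ~10 years of daily data
daily_edge = (2.1 / np.sqrt(252)) * 0.01    # mean return for SR_annual ≈ 2.1 at 1% vol
rets = rng.normal(loc=daily_edge, scale=0.01, size=t)

print(expected_max_sr(n_trials))            # ≈ 2.28 standardized units
print(deflated_sharpe(rets, n_trials))      # ≈ 1.0: this Sharpe survives the correction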

DSR returns a probability in [0, 1]. Values above ~95% mean the Sharpe is very likely real after correcting for the selection bias. Below 50%: probably a coincidence.

Test 2: Probability of Backtest Overfitting (PBO)

Bailey, Borwein, Lopez de Prado & Zhu's Combinatorially Symmetric Cross-Validation (CSCV) is complementary to DSR. [2] Where DSR asks "is the Sharpe real?", PBO asks "does the in-sample best strategy actually win out-of-sample?"

The algorithm:

  1. Split the observation axis into 2·S equal chunks. S = 8 is common.
  2. Enumerate all C(2S, S) ways to pick S chunks as in-sample (IS), leaving the other S as out-of-sample (OOS).
  3. For each combination:
    • Find the strategy with highest IS Sharpe → call it n*.
    • Compute n*'s rank in the OOS Sharpes of all strategies.
    • Compute the logit of the fractional rank.
  4. PBO = fraction of combinations where the logit is negative, i.e. the IS winner ranked below median OOS.

In Python

from itertools import combinations
import numpy as np

def pbo(returns, s=8):
    """returns shape: (N_strategies, T_observations).
    Observations past the last full chunk are dropped."""
    n, t = returns.shape
    chunk_size = t // (2 * s)
    assert chunk_size > 1, "too few observations for 2*s chunks"
    chunks = [returns[:, i*chunk_size:(i+1)*chunk_size] for i in range(2*s)]

    below = 0
    total = 0
    for is_idx in combinations(range(2*s), s):
        is_set = set(is_idx)
        is_mat = np.concatenate([chunks[i] for i in is_idx], axis=1)
        oos_mat = np.concatenate([chunks[i] for i in range(2*s) if i not in is_set], axis=1)

        is_sr = is_mat.mean(axis=1) / is_mat.std(axis=1, ddof=1)
        oos_sr = oos_mat.mean(axis=1) / oos_mat.std(axis=1, ddof=1)

        best = int(np.argmax(is_sr))            # in-sample winner n*
        # fractional rank of the winner's OOS Sharpe among all strategies
        rank = np.sum(oos_sr < oos_sr[best]) / (n - 1)
        clamped = max(0.001, min(0.999, rank))  # keep the logit finite
        logit = np.log(clamped / (1 - clamped))
        if logit < 0:
            below += 1
        total += 1
    return below / total

For S = 8, C(16, 8) = 12,870 combinations. In practice, sampling 500 random combinations is enough for about ±4% precision (two standard errors on a proportion); a sketch of that variant follows. The browser tool uses exactly this sampling rule; see /methodology/backtest-overfitting-score/.
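
A minimal sketch of that sampled variant (pbo_sampled, n_samples, and seed are my naming, not from the paper; the CSCV logic is unchanged):

import random
import numpy as np

def pbo_sampled(returns, s=8, n_samples=500, seed=0):
    """Monte Carlo CSCV: estimate PBO from random IS/OOS partitions."""
    n, t = returns.shape
    chunk_size = t // (2 * s)
    chunks = [returns[:, i*chunk_size:(i+1)*chunk_size] for i in range(2*s)]
    rng = random.Random(seed)
    below = 0
    for _ in range(n_samples):
        is_idx = set(rng.sample(range(2 * s), s))  # one random S-of-2S split
        is_mat = np.concatenate([chunks[i] for i in sorted(is_idx)], axis=1)
        oos_mat = np.concatenate([chunks[i] for i in range(2*s) if i not in is_idx], axis=1)
        is_sr = is_mat.mean(axis=1) / is_mat.std(axis=1, ddof=1)
        oos_sr = oos_mat.mean(axis=1) / oos_mat.std(axis=1, ddof=1)
        best = int(np.argmax(is_sr))
        rank = np.sum(oos_sr < oos_sr[best]) / (n - 1)
        below += rank < 0.5                        # same condition as logit < 0
    return below / n_samples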

Interpreting the two together

PBO        DSR       Interpretation
< 0.2      > 95%     Edge appears to generalize. Still verify with forward paper-trading.
< 0.2      50–95%    IS/OOS consistency is good, but the Sharpe's statistical significance is marginal. Small edge or noisy returns.
0.2–0.5    > 95%     Sharpe is statistically real but cross-validation is uncertain. Possible regime change; investigate.
> 0.5      < 50%     Classic overfit signature. Stop.

What this doesn't catch

PBO + DSR are necessary but not sufficient. They do not detect:

  • Survivor bias in the universe selection (testing only strategies that survived to the end of the dataset)
  • Look-ahead bias from data leakage (using information not available at the decision time)
  • Regime dependence — a strategy that worked in 2019–2022 may not work in 2024–2025, even if PBO / DSR look fine on the combined window
  • Trading cost omission — backtest returns need to be after costs for DSR to mean anything for live trading

Rule of thumb: if your backtest does not deduct commissions + slippage + market impact, a reassuring PBO on those gross returns is overstated good news. Rerun both tests on net returns.
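
A crude way to approximate that haircut before rerunning the tests; turnover and cost_bps are placeholder assumptions you should replace with estimates from your own fills:

def net_returns(gross, turnover=0.5, cost_bps=10.0):
    # Hypothetical haircut: subtract turnover-scaled round-trip costs
    # (commissions + slippage) from each period's gross return.
    # turnover = fraction of the book traded per period; cost_bps = round-trip
    # cost in basis points. Both defaults here are placeholders.
    return gross - turnover * cost_bps * 1e-4

Run PBO and DSR on the result, not on the gross series.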

Run it on your backtest

/tools/backtest-overfitting-score/ accepts a wide-format CSV:

date,strategy_1,strategy_2,strategy_3
2020-01-02,0.0012,-0.0005,0.0003
2020-01-03,0.0041,0.0009,-0.0002

One row per observation, one column per candidate strategy. Returns as simple (non-log) daily returns. Computation runs entirely in the browser; your data never leaves your machine.
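
The same file feeds the functions above directly; a short loader, assuming the CSV is saved as returns.csv:

import pandas as pd

df = pd.read_csv("returns.csv", parse_dates=["date"], index_col="date")
matrix = df.to_numpy().T                   # pbo() expects (N_strategies, T)
print("PBO:", pbo(matrix))

best = (df.mean() / df.std()).idxmax()     # full-sample Sharpe winner
print("DSR:", deflated_sharpe(df[best].to_numpy(), n_strategies=df.shape[1]))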

Load the synthetic demo first to see the shape, then upload your own. If PBO > 0.5, do not trade the winner.

Then: size it with fractional Kelly, not full Kelly

If you pass PBO + DSR, the next question is how much to bet. On an estimated edge, full Kelly is too aggressive because it assumes your probability estimate is exact. Use fractional Kelly — quarter or eighth — and stress-test the drawdown distribution. The Fractional Kelly Sizer does exactly that with Monte Carlo paths.
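
For intuition only, a minimal sketch of the Kelly fraction under a Gaussian continuous-time approximation (fractional_kelly is an illustrative name; the tool's Monte Carlo approach is more robust than this closed form):

def fractional_kelly(returns, fraction=0.25):
    # Gaussian approximation: full Kelly leverage is mean/variance of
    # per-period returns. Scale it down (quarter Kelly here) because both
    # moments are noisy estimates, not known quantities.
    mu = returns.mean()
    var = returns.var(ddof=1)
    return fraction * mu / var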

References

Beyond the two footnoted papers:

  • Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
  • White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126.

Footnotes

  1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107.

  2. Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70.