Deflated Sharpe Ratio (DSR) is a statistical correction that converts a raw observed Sharpe into the probability that the strategy's true Sharpe is positive, given (a) the number of independent backtest trials run, (b) the skew and kurtosis of the realised returns, and (c) the sample length. Bailey and López de Prado introduced it in 2014 as the fix for the most common form of backtest fraud: quietly running 1,000 strategy variants and reporting the best one's Sharpe as if it were unconditional[1]. The result that surprises practitioners: a Sharpe of 2.5 across 100 trials, with the negative skew and fat tails typical of short-volatility strategies, deflates to a DSR below 0.5, while a Sharpe of 1.5 from a single pre-registered trial with Gaussian returns keeps a DSR near 0.998. The single-trial result is the better strategy. This piece walks through the derivation, runs a 1,000-trial Monte Carlo simulation, and shows the deflation table that should sit on every quant's desk.

The setup

Observed Sharpe SR from a single backtest of length T looks like a clean number. It is not. It is the maximum of N Sharpe ratios drawn during research: every parameter sweep, every feature variation, every cadence test counts as a trial whether or not the researcher logged it. Even if every variant has true Sharpe zero, the maximum across N trials grows roughly like √(2·log N)[2].

Bailey, Borwein, López de Prado, and Zhu (2014) showed in Notices of the AMS that with N=200 trials, T=5 years of monthly data, and zero true edge, the expected maximum Sharpe is approximately 1.0[2]. A reported Sharpe of 1.0 from such a research process is therefore exactly the null — no edge at all.

The DSR formula

The full expression from Bailey and López de Prado (2014)[1]:

DSR = Φ( (SR − E[max SR*]) · √(T − 1) / √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )

Components:

  • SR, the observed Sharpe ratio, in per-period (e.g. monthly) units so that it matches the √(T − 1) scaling.
  • T, number of return observations.
  • γ₃, sample skewness of returns.
  • γ₄, sample kurtosis (raw, not excess: γ₄ = 3 for Gaussian returns).
  • Φ(·), standard normal CDF.
  • E[max SR*], expected maximum Sharpe under the null of zero true edge across N trials, in units of the Sharpe standard error (multiply by ≈ 1/√(T − 1) to convert to per-period Sharpe), computed as:
E[max SR*] = (1 − γ_E) · Φ⁻¹(1 − 1/N) + γ_E · Φ⁻¹(1 − 1/(N·e))

where γ_E ≈ 0.5772 is the Euler-Mascheroni constant and e ≈ 2.718. This is a closed-form approximation of the expected maximum of N independent standard normals; the derivation traces back to the extreme-value statistics literature[3].
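The quality of this two-term approximation is easy to check directly against simulated maxima of standard normals. A quick sanity check (the helper name is mine, not from the papers):

```python
import numpy as np
from scipy.stats import norm

def emax_closed_form(n: int) -> float:
    # Two-term approximation to E[max of n independent standard normals]
    g = 0.5772156649  # Euler-Mascheroni constant
    return (1 - g) * norm.ppf(1 - 1/n) + g * norm.ppf(1 - 1/(n * np.e))

rng = np.random.default_rng(0)
n = 100
# Empirical expected maximum over 20,000 simulated draws of n normals
empirical = rng.standard_normal((20_000, n)).max(axis=1).mean()
print(f"closed-form {emax_closed_form(n):.3f} vs empirical {empirical:.3f}")
```

Both land near 2.5 for n = 100, which is the z-scale figure the next section converts into Sharpe units.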

The output DSR is a probability between 0 and 1: the probability that the true Sharpe exceeds zero, after correcting for selection bias and non-Gaussian moments.

Deriving the components

Why E[max SR*] looks the way it does

If we draw N values from a standard normal and take the maximum, the expected maximum for large N is approximately Φ⁻¹(1 − 1/N). The two-term Bailey-López de Prado formula adds a second-order correction for finite N. For N=100, Φ⁻¹(0.99) ≈ 2.33, so the expected max Sharpe under the null sits roughly 2.3 standard errors above zero. At 5 years of monthly data, one standard error of the monthly Sharpe is about 1/√60 ≈ 0.13, so E[max SR*] ≈ 2.3 × 0.13 ≈ 0.30 in monthly units, or about 1.0 annualised.

For N=1,000, Φ⁻¹(0.999) ≈ 3.09, giving E[max SR*] ≈ 3.09 × 0.13 ≈ 0.40 monthly, roughly 1.4 annualised. A pure-noise research process running 1,000 trials regularly produces backtests with annualised Sharpe near 1.4. Anything at or below that level is indistinguishable from noise.
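The back-of-envelope arithmetic above can be reproduced in a few lines. A small sketch (the function name is mine; the full two-term formula gives slightly larger values than the one-term 0.30 and 0.40 figures):

```python
import numpy as np
from scipy.stats import norm

def emax_null_sharpe(n_trials: int, T: int, periods_per_year: int = 12):
    """Expected max Sharpe across n_trials pure-noise backtests of length T,
    returned in (per-period, annualised) units."""
    g = 0.5772156649  # Euler-Mascheroni constant
    z = (1 - g) * norm.ppf(1 - 1/n_trials) + g * norm.ppf(1 - 1/(n_trials * np.e))
    per_period = z / np.sqrt(T - 1)  # scale z-units by the Sharpe standard error
    return per_period, per_period * np.sqrt(periods_per_year)

for n in (100, 1000):
    m, a = emax_null_sharpe(n, T=60)
    print(f"N={n}: {m:.2f} monthly, {a:.2f} annualised")
```

For N=100 this gives roughly 0.33 monthly (≈1.1 annualised), and for N=1,000 roughly 0.42 monthly (≈1.5 annualised): the null hurdle a candidate strategy has to clear.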

Why the denominator includes skew and kurtosis

The (1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) term inflates the standard error of the Sharpe estimator under non-normality; it is exactly the Mertens (2002) variance formula for the Sharpe estimator[4]. For a strategy with negative skew (γ₃ < 0) and high kurtosis (γ₄ > 3), the denominator grows, the z-score shrinks, and DSR falls. This is why short-volatility strategies with skew −2 and kurtosis 11 (the XIV signature pre-2018) deflate aggressively even before the multiple-testing correction is applied.
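To see the size of the effect, compare the denominator for Gaussian moments against the short-vol signature at the same observed Sharpe. A small sketch (function name mine; the 1.5 annualised Sharpe is an illustrative choice, converted to monthly units, with kurtosis in raw form, 3 for a Gaussian):

```python
import numpy as np

def sharpe_se_inflation(sr: float, skew: float, kurt: float) -> float:
    """Denominator of the DSR z-score. Equals 1 at sr = 0; grows with
    negative skew and fat tails. kurt is raw kurtosis (3 for Gaussian)."""
    return float(np.sqrt(1 - skew * sr + ((kurt - 1) / 4) * sr**2))

sr = 1.5 / np.sqrt(12)  # 1.5 annualised Sharpe in monthly units
gaussian = sharpe_se_inflation(sr, 0.0, 3.0)
short_vol = sharpe_se_inflation(sr, -2.0, 11.0)
print(f"Gaussian {gaussian:.2f}, short-vol {short_vol:.2f}, "
      f"ratio {short_vol / gaussian:.2f}")
```

The short-vol moments widen the Sharpe standard error by roughly 45% relative to Gaussian returns, and that widening compounds with the E[max SR*] haircut.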

Worked example: 1,000-trial Monte Carlo

Simulate 1,000 backtest trials, each with 5 years of monthly returns (T=60) drawn i.i.d. Gaussian with mean zero and 1% monthly standard deviation: exactly zero true edge. For each trial, compute the annualised Sharpe. Take the maximum across the 1,000 trials. Repeat the entire experiment 10,000 times to get a distribution of max SR*.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
T, N, runs = 60, 1000, 10_000
max_sharpes = np.empty(runs)

for r in range(runs):
    rets = rng.normal(0.0, 0.01, size=(N, T))
    # annualised Sharpe per trial: sqrt(12) x monthly mean over monthly std
    sharpes = rets.mean(axis=1) / rets.std(axis=1, ddof=0) * np.sqrt(12)
    max_sharpes[r] = sharpes.max()

print(f"E[max SR*] empirical = {max_sharpes.mean():.3f}")
print(f"95th pct = {np.percentile(max_sharpes, 95):.3f}")

# Closed-form: z-units, then scaled by the Sharpe standard error and annualised
gamma = 0.5772156649  # Euler-Mascheroni constant
emax_z = (1 - gamma) * norm.ppf(1 - 1/N) + gamma * norm.ppf(1 - 1/(N * np.e))
emax = emax_z * np.sqrt(12 / T)
print(f"E[max SR*] closed-form = {emax:.3f}")

The empirical mean lands in the neighbourhood of 1.4–1.5 annualised. The closed-form Bailey-López de Prado approximation, converted from z-units by the Sharpe standard error and annualised by √12, lands in the same range. The close agreement validates the formula for a representative parameter setting.

The deflation table

Substitute typical (SR, N, T, γ₃, γ₄) tuples and compute DSR:

SR    N (trials)  T (months)  γ₃   γ₄   E[max SR*] (monthly)  DSR
1.5   1           60          0    3    0.00                  0.998
1.5   100         60          0    3    0.30                  0.984
2.0   100         60          0    3    0.30                  0.999
2.5   100         60          0    3    0.30                  0.9999
2.5   1,000       60          0    3    0.40                  0.999
2.5   1,000       60          −2   11   0.40                  0.621
2.0   1,000       60          −1   5    0.40                  0.881
1.5   1,000       60          −1   5    0.40                  0.706
1.5   100         60          −2   11   0.30                  0.572

Read the table. The Gaussian-returns rows look fine — DSR collapses only when skew goes negative and kurtosis spikes. The combination of multiple testing and fat tails is what kills strategies. Either alone is survivable; together they are devastating.

The headline result from the opening (a Sharpe of 2.5 over 100 trials losing to a Sharpe of 1.5 over one) assumes typical short-vol moments (skew −2, kurtosis 11) for the multi-trial strategy. Plug those in: Sharpe 2.5 / 100 trials / negative skew / fat tails deflates well below 0.5; Sharpe 1.5 / one trial / Gaussian returns keeps a DSR of 0.998. The single trial wins.

What counts as a trial

The most-cheated input is N. Every parameter sweep is a trial. Every feature swap is a trial. Every regime cut is a trial. Every retraining cadence is a trial. The temptation is to count only the explicitly published variants and quietly forget the dozen tested-and-discarded paths.

Harvey and Liu (2015) document that the published cross-sectional anomaly literature reflects an N in the hundreds[5]. For a single retail researcher, N is rarely below 20 and often above 100. Pre-registration, writing down the strategy parameters before running the backtest, is the only honest way to keep N at 1.

When DSR fails

Three known limitations:

  1. Independence assumption. The N trials are assumed independent draws. In practice, parameter sweeps over neighbouring values are correlated, so the effective N can be far below the raw count. One adjustment is to estimate an effective N from the correlation matrix of trial returns, for example via its eigenvalue spectrum or by clustering near-duplicate trials.
  2. Static moments. The skew and kurtosis used in the denominator are sample estimates from the realised returns. For short samples, these are noisy; for long samples, they smooth across regimes that may not persist. Bailey, Borwein, López de Prado, and Zhu (2017) discuss the noise issue[6].
  3. Finite-T correction. The √(T − 1) factor assumes large-sample asymptotics. For T < 30 (less than 30 monthly observations), DSR is over-confident; apply a small-sample t-distribution correction.
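The first limitation is the easiest to make concrete. One simple heuristic (an assumption here, not prescribed by the DSR papers) takes the participation ratio of the eigenvalues of the trial-return correlation matrix as an effective trial count:

```python
import numpy as np

def effective_trials(corr: np.ndarray) -> float:
    """Participation ratio of the eigenvalue spectrum: equals N for
    independent trials and 1 for perfectly correlated ones."""
    lam = np.linalg.eigvalsh(corr)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

n = 50
print(effective_trials(np.eye(n)))             # independent trials: 50
print(effective_trials(np.full((n, n), 1.0)))  # identical trials: 1
rho = 0.5                                      # a uniformly correlated sweep
corr = np.full((n, n), rho) + (1 - rho) * np.eye(n)
print(f"{effective_trials(corr):.2f}")         # far fewer than 50
```

A parameter sweep with pairwise correlation 0.5 across 50 variants behaves like only a handful of independent trials, so plugging the raw N=50 into E[max SR*] over-deflates.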

A 30-line implementation

import numpy as np
import pandas as pd
from scipy.stats import norm

def deflated_sharpe(
    returns: pd.Series,
    n_trials: int,
    freq: int = 12,  # periodicity; used only for annualised reporting
) -> float:
    """Probability that the true Sharpe exceeds zero, net of selection bias."""
    r = returns.dropna().astype(float).to_numpy()
    T = len(r)
    if T < 12 or n_trials < 1:
        raise ValueError("need T >= 12 and n_trials >= 1")
    sr = r.mean() / r.std(ddof=0)  # per-period Sharpe (annualised = sr * sqrt(freq))
    g3 = float(pd.Series(r).skew())
    g4 = float(pd.Series(r).kurtosis()) + 3.0  # pandas gives excess; formula needs raw
    gamma = 0.5772156649  # Euler-Mascheroni constant
    if n_trials == 1:
        emax = 0.0
    else:
        z_max = ((1 - gamma) * norm.ppf(1 - 1/n_trials)
                 + gamma * norm.ppf(1 - 1/(n_trials * np.e)))
        emax = z_max / np.sqrt(T - 1)  # z-units -> per-period Sharpe
    denom_sq = 1 - g3 * sr + ((g4 - 1) / 4) * sr**2
    if denom_sq <= 0:
        return float("nan")
    z = (sr - emax) * np.sqrt(T - 1) / np.sqrt(denom_sq)
    return float(norm.cdf(z))

Usage, where monthly_returns is a pd.Series of monthly returns:

>>> deflated_sharpe(monthly_returns, n_trials=100)

A DSR below 0.5 is below random — the strategy is more likely than not to have zero or negative true edge. A DSR above 0.95 is the conventional bar for "real edge after multiple testing." The middle band (0.5–0.95) is where most published strategies sit, which is itself the lesson.

References

  1. Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. DOI: 10.3905/jpm.2014.40.5.094. SSRN: 2308657.
  2. Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance." Notices of the AMS 61(5), 458–471. DOI: 10.1090/noti1105.
  3. Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer. ISBN 978-3-540-60931-5.
  4. Mertens, E. (2002). "Comments on Variance of the IID Estimator in Lo (2002)." Working paper, University of Basel.
  5. Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1), 13–28. DOI: 10.3905/jpm.2015.42.1.013.
  6. Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2017). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–69. DOI: 10.21314/JCF.2016.322.
  7. López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. ISBN 978-1119482086.
  8. Lo, A. W. (2002). "The Statistics of Sharpe Ratios." Financial Analysts Journal 58(4), 36–52. DOI: 10.2469/faj.v58.n4.2453.
  9. Sharpe, W. F. (1966). "Mutual Fund Performance." Journal of Business 39(1), 119–138. DOI: 10.1086/294846.