The short answer

The minimum annualized Sharpe ratio you must beat to claim real edge depends on how many strategy variants you tried (N) and how long your backtest is (T). The benchmark is zero at N=1, rises with N, and shrinks as T grows. At N=100 over 24 months it is 1.83; at 240 months it is 0.57.

There is no fixed threshold such as 1.0. Both inputs are under your control: the number of strategy variants you tried (N) and the length of your backtest (T). The table below gives the selection-bias benchmark SR_0, computed live from the Deflated Sharpe Ratio engine over a grid of N and T. Read off your cell: any observed Sharpe at or below that value is statistically indistinguishable from the best of N lucky coin-flips.

TL;DR

  • The benchmark SR_0 is the expected maximum Sharpe under the null that every one of your N trials has zero true edge.
  • It rises with the number of trials N and falls as the backtest gets longer (T).
  • At N = 1 it is 0.00: a single pre-registered hypothesis has no selection bias to clear.
  • At N = 100 trials over 24 months it is 1.83; over 240 months it drops to 0.57.
  • Short backtests punish search hardest. The same N is far more dangerous on 24 months of data than on 240.
  • Every cell is the direct output of the /deflated-sharpe-ratio/ engine, recomputed at build time. No number here is hand-typed.

The benchmark table

SR_0 is the minimum annualized Sharpe an observed strategy must exceed before its result is more than selection noise. Rows are the number of strategy variants tried (N); columns are the backtest sample length T in months. The grid uses monthly returns (periods_per_year = 12), zero skew, and a kurtosis of 3 (the normal-distribution value).

N \ T (months)      24      36      60     120     240
1                 0.00    0.00    0.00    0.00    0.00
10                1.14    0.92    0.71    0.50    0.35
50                1.64    1.33    1.03    0.72    0.51
100               1.83    1.48    1.14    0.80    0.57
500               2.20    1.79    1.38    0.97    0.68
1000              2.35    1.91    1.47    1.03    0.73

Values are SR_0 to two decimal places, annualized. A backtest with a higher observed Sharpe than its cell has cleared the selection-bias bar; one at or below it has not.

How to read this

Find the row for the number of strategy variants you tested before settling on the one you are reporting. That count is N, and it includes the silent ones: every parameter you swept, every entry rule you tried and dropped, every asset you screened. Then find the column closest to your backtest length in months. The cell is the Sharpe you have to beat.

Three worked reads:

  • You ran one named hypothesis on 5 years of data (N = 1, T = 60). SR_0 = 0.00. There is no multiple-testing correction to apply, so judge the Sharpe on its own standard error.
  • You swept 100 parameter sets on 2 years of data (N = 100, T = 24). SR_0 = 1.83. An observed Sharpe of 1.5 is below the bar, so it is consistent with pure luck across 100 tries.
  • You ran the same 100 sweeps on 20 years of data (N = 100, T = 240). SR_0 = 0.57. Now a 1.5 Sharpe sits well clear of the bar, because the longer sample tightens every estimate.

The table makes the trade-off concrete. Moving right (more data) lowers the bar; moving down (more trials) raises it.
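
The worked reads can be checked mechanically. The sketch below hard-codes the table as a lookup and applies the strict "must exceed" rule. `SR0_TABLE` and `clears_bar` are illustrative names chosen here, not part of the engine.

```python
# SR_0 values copied from the benchmark table (annualized, two decimals).
# Outer key: number of trials N. Inner key: backtest length T in months.
SR0_TABLE = {
    1:    {24: 0.00, 36: 0.00, 60: 0.00, 120: 0.00, 240: 0.00},
    10:   {24: 1.14, 36: 0.92, 60: 0.71, 120: 0.50, 240: 0.35},
    50:   {24: 1.64, 36: 1.33, 60: 1.03, 120: 0.72, 240: 0.51},
    100:  {24: 1.83, 36: 1.48, 60: 1.14, 120: 0.80, 240: 0.57},
    500:  {24: 2.20, 36: 1.79, 60: 1.38, 120: 0.97, 240: 0.68},
    1000: {24: 2.35, 36: 1.91, 60: 1.47, 120: 1.03, 240: 0.73},
}


def clears_bar(observed_sr: float, n_trials: int, months: int) -> bool:
    """True only if the observed Sharpe strictly exceeds its table cell."""
    return observed_sr > SR0_TABLE[n_trials][months]


print(clears_bar(1.5, 100, 24))   # False: 1.5 is below 1.83
print(clears_bar(1.5, 100, 240))  # True: 1.5 is above 0.57
```

The lookup only covers the grid points in the table; for an N or T in between, the bar has to be computed from the formula rather than read off.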

Why short backtests punish search hardest

The benchmark factors into two parts. The first is an extreme-value term that depends only on N: the expected maximum of N independent standard normals. The second is the standard error of a Sharpe estimate, which scales as the square root of one over the sample length. Multiply them and you get SR_0.

That second factor is the whole story behind the columns. The standard error of a Sharpe ratio shrinks slowly, as one over the square root of T. A 240-month backtest has a Sharpe standard error roughly a third of a 24-month one, because the square root of 240 over 24 is about 3.16. So the same trial count produces a benchmark about 3x lower at 240 months than at 24. Selection bias is not a fixed haircut. It is a haircut measured in units of estimation noise, and short samples are noisy.
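
The arithmetic is easy to verify. The sketch below compares the square-root-of-T rule of thumb with the sqrt(periods_per_year / (n - 1)) factor that appears in the methodology formula; `sharpe_se_factor` is an illustrative name.

```python
import math


def sharpe_se_factor(n_obs: int, periods_per_year: int = 12) -> float:
    """Annualization-adjusted standard-error factor: sqrt(ppy / (n - 1))."""
    return math.sqrt(periods_per_year / (n_obs - 1))


# Rule of thumb: noise scales as 1 / sqrt(T).
rule_of_thumb = math.sqrt(240 / 24)

# Same ratio using the n - 1 that the benchmark formula actually uses.
exact_ratio = sharpe_se_factor(24) / sharpe_se_factor(240)

print(round(rule_of_thumb, 2), round(exact_ratio, 2))  # 3.16 3.22
```

The second figure is the one the table reflects: 1.83 divided by 0.57 is about 3.2.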

This inverts the comfortable intuition. People assume a long backtest is safe and a short one is risky on its own terms. The deeper point is that the danger of searching is amplified on short data. Run 100 variants on 2 years and the luckiest one will look spectacular; run them on 20 years and luck has far less room to fabricate a high Sharpe.

The methodology

The benchmark is the Bailey and Lopez de Prado deflated-Sharpe construction [1]. Under the null that all N trials have zero true Sharpe, the expected maximum observed Sharpe is approximately:

SR_0 = E[max of N standard normals] * sqrt( periods_per_year / (n - 1) )

E[max of N standard normals] ~= (1 - g) * Z(1 - 1/N) + g * Z(1 - 1/(N*e))

where Z is the inverse standard-normal CDF, g is the Euler-Mascheroni constant (about 0.5772), n is the number of return observations, and periods_per_year annualizes the Sharpe. For the table, returns are monthly, so n equals T and periods_per_year is 12.
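
A minimal sketch of the two formulas, using only the Python standard library. This is not the engine's source; the function names are chosen here for illustration. One assumption is labeled in the code: at N = 1 the approximation is undefined, because Z(0) is minus infinity, so the sketch returns 0 to match the N = 1 row of the table.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_of_normals(n_trials: int) -> float:
    """Approximate E[max of N independent standard normals]."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if n_trials == 1:
        # Assumption: no selection across a single trial, so the bar is 0,
        # as in the table. The closed form below would need Z(0) = -inf.
        return 0.0
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * math.e)))


def sr0(n_trials: int, n_obs: int, periods_per_year: int = 12) -> float:
    """Annualized selection-bias benchmark for N trials and n observations."""
    return expected_max_of_normals(n_trials) * math.sqrt(
        periods_per_year / (n_obs - 1)
    )


# Rebuild the grid: one row per N, one column per T in months.
for n_trials in (1, 10, 50, 100, 500, 1000):
    row = [f"{sr0(n_trials, t):.2f}" for t in (24, 36, 60, 120, 240)]
    print(n_trials, row)
```

Calling `sr0(100, 24)` gives roughly 1.83 and `sr0(100, 240)` roughly 0.57, the two cells quoted in the TL;DR.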

The first factor grows in N but slowly: from N = 10 to N = 1000 the expected-maximum term roughly doubles, not 100-folds, because the normal distribution's thin tails make the expected maximum grow only about as fast as sqrt(2 ln N). The second factor is the Sharpe standard error, and it carries the dependence on sample length. SR_0 does not depend on the observed Sharpe at all; it is the bar, set before you look at your result.

A note on what this is. The table is a computed reference, not a measured backtest result. It tells you the threshold; it does not claim any strategy beat it. To test a real strategy against its own cell, including its skew and kurtosis, run it through the /deflated-sharpe-ratio/ engine, which also returns the probabilistic Sharpe and the deflated Sharpe.
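
For readers who want to see what that test involves, here is a hedged sketch of the probabilistic Sharpe ratio measured against a benchmark, following the formula in the cited paper. The engine's own code is not shown in this article, so the conventions are assumptions: both Sharpes are de-annualized by sqrt(periods_per_year), and `kurt` is raw kurtosis (3 for a normal). Under those assumptions the sketch agrees with the z and psr listed in the verified output for the N = 100, T = 24 cell. The function name is illustrative.

```python
import math
from statistics import NormalDist


def psr_against_benchmark(observed_sr, benchmark_sr, n_obs,
                          skew=0.0, kurt=3.0, periods_per_year=12):
    """Return (z, psr) for an annualized Sharpe tested against a benchmark.

    Assumed conventions: Sharpes are annualized on input, kurt is raw
    kurtosis, and n_obs is the number of return observations.
    """
    scale = math.sqrt(periods_per_year)
    sr = observed_sr / scale        # per-period observed Sharpe
    sr_star = benchmark_sr / scale  # per-period benchmark Sharpe
    # Non-normality adjustment to the variance of the Sharpe estimate.
    variance_term = 1 - skew * sr + (kurt - 1) / 4 * sr ** 2
    if variance_term <= 0:
        raise ValueError("skew and kurt imply a non-positive variance term")
    z = (sr - sr_star) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return z, NormalDist().cdf(z)


# Observed 1.5 against the N = 100, T = 24 benchmark from the table.
z, psr = psr_against_benchmark(1.5, 1.827892729957558, 24)
print(round(z, 4), round(psr, 4))
```

Negative skew or fat tails enlarge the variance term, which pulls z toward zero and makes the same observed Sharpe less convincing.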

Counting N honestly is the hard part

The table is only as honest as the N you plug into it. The common failure is undercounting trials. N is not "the number of strategies I wrote up." It is every distinct configuration the data saw before you picked a winner. A grid search over 5 entry thresholds, 4 exits, and 5 holding periods is 100 trials, even if you report one. Walk-forward folds, asset screens, and feature subsets all multiply in.
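
One way to keep the count honest is to enumerate the grid rather than estimate it. The sketch below uses made-up parameter values purely to show the multiplication; none of them come from a real strategy.

```python
from itertools import product

# Hypothetical search space: every value here is illustrative.
entry_thresholds = [0.5, 1.0, 1.5, 2.0, 2.5]
exit_rules = ["stop", "target", "time", "trailing"]
holding_periods = [1, 5, 10, 20, 60]

# Every combination the data saw counts as a trial, reported or not.
configs = list(product(entry_thresholds, exit_rules, holding_periods))
n_trials = len(configs)

# Screening the same grid across several assets multiplies the count again.
assets_screened = 3  # hypothetical
n_trials_with_screen = n_trials * assets_screened

print(n_trials, n_trials_with_screen)  # 100 300
```

Logging each configuration as it is run, instead of reconstructing the count afterwards, is the design choice that makes N defensible.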

If you cannot count N, you cannot use any deflation honestly, and the defensible move is to pre-register a small fixed set of named hypotheses before touching the data. That collapses N toward 1, where the benchmark is zero and the result stands on its own standard error. Bounding the trial budget in advance is the cheapest way to buy yourself a low bar.

Failure modes

  • Reporting the observed Sharpe without the bar. A 1.4 Sharpe means nothing until you state N and T. On the table it is a pass at N = 10, T = 24 and a fail at N = 500, T = 24.
  • Undercounting N. Forgetting the swept-and-dropped variants understates the bar and lets luck through.
  • Assuming a long backtest fixes search. Length helps, but the table shows N = 1000 on 240 months still demands 0.73, well above zero.
  • Treating SR_0 as the answer instead of the threshold. Clearing the bar is necessary, not sufficient; out-of-sample and live validation still apply.

Footnotes

  1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94-107. pm-research.com

Verified engine output

Recompute-verified inputs and outputs for every cell of the selection-bias benchmark SR_0 grid.

Inputs shared by every cell: observed_sr = 1.5, skew = 0, kurt = 3, periods_per_year = 12. The two inputs that vary are num_trials (N) and n (T, in months). In every cell the engine's "effective benchmark" output equals "max expected sr", so it is listed once.

N     T     psr                   z                      max expected sr      deflated sr
1     24    0.9764646864518596    1.9856628975878918     0                    1.5
1     36    0.9928470569777079    2.4494897427831783     0                    1.5
1     60    0.9992643127082023    3.180296482135858      0                    1.5
1     120   0.99999685555808      4.5166359162544865     0                    1.5
1     240   0.9999999999224058    6.4008927948707735     0                    1.5
10    24    0.6844072563511564    0.48005894671873295    1.1373561590173058   0.3626438409826942
10    36    0.8273860100746617    0.943885791914019      0.9219903585869584   0.5780096414130416
10    60    0.9530027373706471    1.674692531266699      0.7101243355735859   0.7898756644264141
10    120   0.9986981235686223    3.011031965385327      0.5000194764816396   0.9999805235183604
10    240   0.9999995084932247    4.895288844001613      0.3528267069421108   1.147173293057889
50    24    0.4243023109475148    -0.19089916747833546   1.6442081390377736   -0.14420813903777363
50    36    0.6075455355348297    0.2729276777169507     1.332866613227674    0.16713338677232592
50    60    0.8422466713437592    1.003734417069631      1.0265845074314273   0.47341549256857274
50    120   0.9903600567837493    2.3400738511882584     0.7228484115467027   0.7771515884532973
50    240   0.9999880096540286    4.224330729804546      0.5100605809576373   0.9899394190423627
100   24    0.3321238130191362    -0.4340562855103524    1.827892729957558    -0.327892729957558
100   36    0.5118750344019024    0.029770559684933494   1.4817693298537102   0.018230670146289762
100   60    0.7765452802430594    0.7605772990376134     1.1412705686514406   0.3587294313485594
100   120   0.9819995900016145    2.0969167331562413     0.8036022477670218   0.6963977522329782
100   240   0.9999656965263144    3.9811736117725283     0.567042581552976    0.932957418447024
500   24    0.17538112020017838   -0.9331117318003248    2.2048868160858275   -0.7048868160858275
500   36    0.31943301847666294   -0.4692848866050394    1.7873771290455527   -0.2873771290455527
500   60    0.6031548943326397    0.26152185274764106    1.3766521356342198   0.12334786436578016
500   120   0.9449630759403237    1.597861286866269      0.9693413472461175   0.5306586527538825
500   240   0.9997512223407629    3.4821181654825555     0.6839923873730049   0.8160076126269951
1000  24    0.12990752817147133   -1.1268285877789788    2.351223479937965    -0.851223479937965
1000  36    0.2536646490737231    -0.6630017425836934    1.9060039919765324   -0.40600399197653236
1000  60    0.5270296377563595    0.06780499676898694    1.4680194926086971   0.03198050739130287
1000  120   0.9198620262647988    1.4041444308876148     1.0336757964591385   0.46632420354086146
1000  240   0.9994961503534184    3.2884013095039015     0.7293884427796551   0.7706115572203449

Computed live at build time.

Frequently asked questions

What is the selection-bias benchmark Sharpe ratio?
It is the highest annualized Sharpe you would expect from pure luck after trying N strategy variants on a backtest of a given length, under the null that none of them has real edge. An observed Sharpe must exceed this benchmark before it counts as evidence of skill rather than the best of many random tries.
Why does the minimum Sharpe fall as the backtest gets longer?
Because the benchmark scales with the standard error of a Sharpe estimate, which shrinks as one over the square root of the sample length. A 240-month backtest has roughly a third the Sharpe estimation noise of a 24-month one, so the same number of trials produces a bar about three times lower on the longer sample.
How do I count the number of trials N for my own strategy?
Count every distinct configuration the data was tested against before you chose the reported one, including swept parameters, dropped entry and exit rules, screened assets, and feature subsets. It is almost always far larger than the number of strategies you write up, and undercounting it makes any deflation dishonest.