The Selection-Bias Sharpe Benchmark by Trials and Sample

The short answer

The minimum annualized Sharpe ratio you must beat to claim real edge depends on how many strategy variants you tried (N) and how long your backtest is (T). The benchmark is zero at N=1, rises with N, and shrinks as T grows. At N=100 over 24 months it is 1.83; at 240 months it is 0.57.

The minimum annualized Sharpe ratio you must beat to claim real edge is not a fixed number like 1.0. It depends on two things you control: how many strategy variants you tried (N) and how long your backtest is (T). The table below is the selection-bias benchmark SR_0, computed live from the Deflated Sharpe Ratio engine over a grid of N and T. Read off your cell, and any observed Sharpe at or below that value is statistically indistinguishable from the best of N lucky coin-flips.

TL;DR

The benchmark SR_0 is the expected maximum Sharpe under the null that every one of your N trials has zero true edge.
It rises with the number of trials N and falls as the backtest gets longer (T).
At N = 1 it is 0.00: a single pre-registered hypothesis has no selection bias to clear.
At N = 100 trials over 24 months it is 1.83; over 240 months it drops to 0.57.
Short backtests punish search hardest. The same N is far more dangerous on 24 months of data than on 240.
Every cell is the direct output of the /deflated-sharpe-ratio/ engine, recomputed at build time. No number here is hand-typed.

The benchmark table

SR_0 is the minimum annualized Sharpe an observed strategy must exceed before its result is more than selection noise. Rows are the number of strategy variants tried (N); columns are the backtest sample length T in months. The grid uses monthly returns (periods_per_year = 12), zero skew, and normal kurtosis.

N \ T (months)	24	36	60	120	240
1	0.00	0.00	0.00	0.00	0.00
10	1.14	0.92	0.71	0.50	0.35
50	1.64	1.33	1.03	0.72	0.51
100	1.83	1.48	1.14	0.80	0.57
500	2.20	1.79	1.38	0.97	0.68
1000	2.35	1.91	1.47	1.03	0.73

Values are SR_0 to two decimal places, annualized. A backtest with a higher observed Sharpe than its cell has cleared the selection-bias bar; one at or below it has not.

How to read this

Find the row for the number of strategy variants you tested before settling on the one you are reporting. That count is N, and it includes the silent ones: every parameter you swept, every entry rule you tried and dropped, every asset you screened. Then find the column closest to your backtest length in months. The cell is the Sharpe you have to beat.

Three worked reads:

You ran one named hypothesis on 5 years of data (N = 1, T = 60). SR_0 = 0.00. There is no multiple-testing correction to apply, so judge the Sharpe on its own standard error.
You swept 100 parameter sets on 2 years of data (N = 100, T = 24). SR_0 = 1.83. An observed Sharpe of 1.5 is below the bar, so it is consistent with pure luck across 100 tries.
You ran the same 100 sweeps on 20 years of data (N = 100, T = 240). SR_0 = 0.57. Now a 1.5 Sharpe sits well clear of the bar, because the longer sample tightens every estimate.

The table makes the trade-off concrete. Moving right (more data) lowers the bar; moving down (more trials) raises it.
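The three reads above can be sketched as a lookup. This is a minimal illustration, not the engine: the values are transcribed from the table, `lookup_bar` and `clears_bar` are hypothetical helper names, and only rows that appear in the table are accepted. Columns snap to the closest backtest length, as described above.

```python
# SR_0 values transcribed from the benchmark table (annualized, monthly returns).
COLUMNS = (24, 36, 60, 120, 240)  # backtest length T in months
TABLE = {
    1: (0.00, 0.00, 0.00, 0.00, 0.00),
    10: (1.14, 0.92, 0.71, 0.50, 0.35),
    50: (1.64, 1.33, 1.03, 0.72, 0.51),
    100: (1.83, 1.48, 1.14, 0.80, 0.57),
    500: (2.20, 1.79, 1.38, 0.97, 0.68),
    1000: (2.35, 1.91, 1.47, 1.03, 0.73),
}


def lookup_bar(num_trials: int, months: int) -> float:
    """Return the table cell for a tabulated N and the closest tabulated T."""
    if num_trials not in TABLE:
        raise KeyError(f"N={num_trials} is not a row of the table")
    nearest = min(COLUMNS, key=lambda t: abs(t - months))
    return TABLE[num_trials][COLUMNS.index(nearest)]


def clears_bar(observed_sr: float, num_trials: int, months: int) -> bool:
    """True only if the observed Sharpe is strictly above its cell."""
    return observed_sr > lookup_bar(num_trials, months)


for n_trials, months in ((1, 60), (100, 24), (100, 240)):
    bar = lookup_bar(n_trials, months)
    print(f"N={n_trials}, T={months}: bar={bar:.2f}, "
          f"1.5 clears it: {clears_bar(1.5, n_trials, months)}")
```

At N = 1 the check is trivially true for any positive Sharpe, which is why that case still has to be judged on its own standard error.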

Why short backtests punish search hardest

The benchmark factors into two parts. The first is an extreme-value term that depends only on N: the expected maximum of N independent standard normals. The second is the standard error of a Sharpe estimate, which scales as the square root of one over the sample length. Multiply them and you get SR_0.

That second factor is the whole story behind the columns. The standard error of a Sharpe ratio shrinks slowly, as one over the square root of T. A 240-month backtest has a Sharpe standard error roughly a third of a 24-month one: the square root of 240 over 24 is about 3.16, or about 3.22 with the n - 1 denominator the benchmark formula uses (the square root of 239 over 23). So the same trial count produces a benchmark about 3x lower at 240 months than at 24, which is the 1.83 versus 0.57 in the N = 100 row. Selection bias is not a fixed haircut. It is a haircut measured in units of estimation noise, and short samples are noisy.
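The arithmetic behind the column ratio is short enough to check directly. The sketch below compares plain square-root-of-T scaling with the n - 1 version used in the methodology formula; the variable names are illustrative only.

```python
import math

short_months, long_months = 24, 240

# Plain 1/sqrt(T) scaling of the Sharpe standard error.
naive_ratio = math.sqrt(long_months / short_months)

# Same ratio with the n - 1 denominator from the SR_0 formula.
exact_ratio = math.sqrt((long_months - 1) / (short_months - 1))

# Ratio of the two rounded table cells in the N = 100 row.
table_ratio = 1.83 / 0.57

print(f"sqrt(240/24) = {naive_ratio:.3f}")
print(f"sqrt(239/23) = {exact_ratio:.3f}")
print(f"1.83 / 0.57  = {table_ratio:.3f}")
```

All three land a little above 3, so "about 3x lower" holds whichever version you use.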

This inverts the comfortable intuition. People assume a long backtest is safe and a short one is risky on its own terms. The deeper point is that the danger of searching is amplified on short data. Run 100 variants on 2 years and the luckiest one will look spectacular; run them on 20 years and luck has far less room to fabricate a high Sharpe.
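A small Monte Carlo makes the point tangible. The sketch below is an illustration under stated assumptions, not the engine: it draws 100 zero-edge strategies with i.i.d. standard-normal monthly returns, keeps the best annualized Sharpe, and averages that over 50 repetitions for a 24-month and a 240-month sample. The function names, the seed, and the repetition count are arbitrary choices.

```python
import math
import random


def annualized_sharpe(returns, periods_per_year=12):
    """Sample mean over sample standard deviation, annualized."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def mean_best_sharpe(num_trials, months, num_sims, rng):
    """Average, over num_sims repetitions, of the best Sharpe among zero-edge trials."""
    total = 0.0
    for _ in range(num_sims):
        total += max(
            annualized_sharpe([rng.gauss(0.0, 1.0) for _ in range(months)])
            for _ in range(num_trials)
        )
    return total / num_sims


rng = random.Random(0)  # fixed seed so the run is repeatable
short_mean = mean_best_sharpe(100, 24, 50, rng)
long_mean = mean_best_sharpe(100, 240, 50, rng)
print(f"best of 100 lucky trials, 24 months:  {short_mean:.2f}")
print(f"best of 100 lucky trials, 240 months: {long_mean:.2f}")
```

Expect the short-sample figure to sit near the 1.83 cell and the long-sample figure near 0.57. The simulated short-sample value can land somewhat above the table, because with only 24 observations the volatility in the denominator is itself a noisy estimate.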

The methodology

The benchmark is the Bailey and Lopez de Prado deflated-Sharpe construction¹. Under the null that all N trials have zero true Sharpe, the expected maximum observed Sharpe is approximately:

SR_0 = E[max of N standard normals] x sqrt( periods_per_year / (n - 1) )

E[max of N standard normals] ~= (1 - g) * Z(1 - 1/N) + g * Z(1 - 1/(N*e))

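where Z is the inverse standard-normal CDF, g is the Euler-Mascheroni constant (about 0.5772), e is Euler's number (about 2.718), n is the number of return observations, and periods_per_year annualizes the Sharpe. For the table, returns are monthly, so n equals T and periods_per_year is 12. The approximation is undefined at N = 1, where Z(0) diverges; the engine reports SR_0 = 0 for a single trial, as the first table row shows.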

The first factor grows in N but slowly: from N = 10 to N = 1000 the expected-maximum term roughly doubles rather than growing 100-fold, because the expected maximum of N normals grows only on the order of the square root of ln N. The second factor is the Sharpe standard error, and it carries the dependence on sample length. SR_0 does not depend on the observed Sharpe at all; it is the bar, set before you look at your result.
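The two formulas above translate almost line for line into code. What follows is a minimal sketch using only the Python standard library, not the engine's actual source: the function names are hypothetical, and the N = 1 case is special-cased to zero to match the table, since Z(0) is not finite.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant g


def expected_max_of_normals(num_trials: int) -> float:
    """Approximate E[max of N standard normals]; taken as 0 for N = 1."""
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    if num_trials == 1:
        return 0.0  # Z(1 - 1/1) = Z(0) diverges, so a single trial is special-cased
    z = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z(1 - 1 / num_trials)
            + EULER_GAMMA * z(1 - 1 / (num_trials * math.e)))


def benchmark_sr(num_trials: int, n_obs: int, periods_per_year: int = 12) -> float:
    """Annualized selection-bias benchmark SR_0 for N trials and n observations."""
    return expected_max_of_normals(num_trials) * math.sqrt(periods_per_year / (n_obs - 1))


# Rebuild the grid: rows are N, columns are T in months.
for n_trials in (1, 10, 50, 100, 500, 1000):
    cells = [f"{benchmark_sr(n_trials, months):.2f}" for months in (24, 36, 60, 120, 240)]
    print(n_trials, *cells, sep="\t")

# Growth of the extreme-value term alone between N = 10 and N = 1000.
growth = expected_max_of_normals(1000) / expected_max_of_normals(10)
print(f"E[max] ratio, N=1000 vs N=10: {growth:.2f}")
```

The printed grid should agree with the table to two decimals, and the last line shows the "roughly doubles" claim: a hundredfold increase in trials moves the extreme-value term by a factor of about two.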

A note on what this is. The table is a computed reference, not a measured backtest result. It tells you the threshold; it does not claim any strategy beat it. To test a real strategy against its own cell, including its skew and kurtosis, run it through the /deflated-sharpe-ratio/ engine, which also returns the probabilistic Sharpe and the deflated Sharpe.
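For a sense of what the probabilistic Sharpe adds, here is a sketch of the standard probabilistic-Sharpe formula evaluated against a benchmark. The engine's source is not shown on this page, so treat this as an assumption about its method; it does reproduce the z and psr values in the verified output for the N = 1 and N = 10 cells at T = 24. The function name and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def psr_against_benchmark(observed_sr, benchmark, n_obs,
                          skew=0.0, kurt=3.0, periods_per_year=12):
    """Return (z, psr) for an annualized observed Sharpe versus an annualized benchmark."""
    # Work in per-period units, where the standard-error formula is defined.
    sr = observed_sr / math.sqrt(periods_per_year)
    sr0 = benchmark / math.sqrt(periods_per_year)
    # Non-normality adjustment: skew and kurtosis widen or narrow the standard error.
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return z, NormalDist().cdf(z)


# N = 1, T = 24: no selection-bias benchmark, so the bar is zero.
z1, psr1 = psr_against_benchmark(1.5, 0.0, 24)
# N = 10, T = 24: benchmark taken from the verified output for that cell.
z10, psr10 = psr_against_benchmark(1.5, 1.1373561590173058, 24)
print(f"N=1:  z={z1:.4f}, psr={psr1:.4f}")
print(f"N=10: z={z10:.4f}, psr={psr10:.4f}")
```

The same 1.5 Sharpe on 24 months goes from a probability near 0.98 with one trial to about 0.68 with ten, which is the table's message expressed as a probability instead of a threshold.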

Counting N honestly is the hard part

The table is only as honest as the N you plug into it. The common failure is undercounting trials. N is not "the number of strategies I wrote up." It is every distinct configuration the data saw before you picked a winner. A grid search over 5 entry thresholds, 4 exits, and 5 holding periods is 100 trials, even if you report one. Walk-forward folds, asset screens, and feature subsets all multiply in.
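One way to keep the count honest is to enumerate the grid instead of estimating it. The sketch below counts the example above with `itertools.product`; every parameter name and value is made up for illustration, and the asset-screen multiplier at the end is an assumed extra dimension, not something taken from the table.

```python
from itertools import product

# Hypothetical search space: 5 entry thresholds, 4 exit rules, 5 holding periods.
entry_thresholds = [0.5, 1.0, 1.5, 2.0, 2.5]
exit_rules = ["stop", "target", "time", "trailing"]
holding_periods = [1, 5, 10, 20, 60]

configs = list(product(entry_thresholds, exit_rules, holding_periods))
num_trials = len(configs)
print(f"grid configurations tested: {num_trials}")

# Any further dimension multiplies in. Assume the same grid was run on 3 asset screens.
asset_screens = ["large_cap", "small_cap", "all"]
num_trials_with_screens = num_trials * len(asset_screens)
print(f"with asset screens: {num_trials_with_screens}")
```

Logging the full list of configurations as you run them gives you N as a by-product, instead of a number reconstructed from memory after the fact.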

If you cannot count N, you cannot use any deflation honestly, and the defensible move is to pre-register a small fixed set of named hypotheses before touching the data. That collapses N toward 1, where the benchmark is zero and the result stands on its own standard error. Bounding the trial budget in advance is the cheapest way to buy yourself a low bar.
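Bounding the budget in advance can be done by inverting the benchmark: pick the Sharpe you realistically hope to observe, then find the largest N whose bar stays below it. The sketch below does this by linear scan, reusing the approximation from the methodology section. The function names and the search cap are hypothetical choices, and it assumes monthly returns.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant g


def benchmark_sr(num_trials, n_obs, periods_per_year=12):
    """Annualized SR_0 from the approximation above; 0 for a single trial."""
    if num_trials == 1:
        return 0.0
    z = NormalDist().inv_cdf
    e_max = ((1 - EULER_GAMMA) * z(1 - 1 / num_trials)
             + EULER_GAMMA * z(1 - 1 / (num_trials * math.e)))
    return e_max * math.sqrt(periods_per_year / (n_obs - 1))


def max_trial_budget(target_sr, n_obs, periods_per_year=12, cap=100_000):
    """Largest N (up to cap) whose benchmark stays strictly below target_sr."""
    budget = 0
    for num_trials in range(1, cap + 1):
        if benchmark_sr(num_trials, n_obs, periods_per_year) >= target_sr:
            break
        budget = num_trials
    return budget


# How many variants can you afford on 24 months if you hope to report a 1.5 Sharpe?
print(max_trial_budget(1.5, 24))
# And on 240 months if you hope to report a 0.5 Sharpe?
print(max_trial_budget(0.5, 240))
```

Both answers fall between the N = 10 and N = 50 rows of the table, which is a useful reminder of how small an honest trial budget is on realistic targets.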

Failure modes

Reporting the observed Sharpe without the bar. A 1.4 Sharpe means nothing until you state N and T. On the table it is a pass at N = 10, T = 24 and a fail at N = 500, T = 24.
Undercounting N. Forgetting the swept-and-dropped variants understates the bar and lets luck through.
Assuming a long backtest fixes search. Length helps, but the table shows N = 1000 on 240 months still demands 0.73, well above zero.
Treating SR_0 as the answer instead of the threshold. Clearing the bar is necessary, not sufficient; out-of-sample and live validation still apply.

Connects to

Deflated Sharpe in Low-Trial Regimes: the same engine at a fixed sample length, varying only N.
Selection Bias in LLM Strategy Research: how agentic search inflates the effective N.
Deflated Sharpe vs PBO on the Same Tape: two overfitting diagnostics compared head to head.
PBO Score on an Eight-Strategy Matrix: the combinatorial overfitting probability alternative.
Walk-Forward Window Sizing: how fold counts feed back into N.
Deflated Sharpe Ratio: test your own strategy against its cell.
Deflated Sharpe Ratio methodology: full input and output specification.

References

Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94-107. pm-research.com

Verified engine output

Selection-bias benchmark SR_0: N=1 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	1
periods_per_year	12

Result
psr	0.9764646864518596
z	1.9856628975878918
max expected sr	0
effective benchmark	0
deflated sr	1.5

Computed live at build time.

Selection-bias benchmark SR_0: N=1 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	1
periods_per_year	12

Result
psr	0.9928470569777079
z	2.4494897427831783
max expected sr	0
effective benchmark	0
deflated sr	1.5

Selection-bias benchmark SR_0: N=1 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	1
periods_per_year	12

Result
psr	0.9992643127082023
z	3.180296482135858
max expected sr	0
effective benchmark	0
deflated sr	1.5

Selection-bias benchmark SR_0: N=1 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	1
periods_per_year	12

Result
psr	0.99999685555808
z	4.5166359162544865
max expected sr	0
effective benchmark	0
deflated sr	1.5

Selection-bias benchmark SR_0: N=1 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	1
periods_per_year	12

Result
psr	0.9999999999224058
z	6.4008927948707735
max expected sr	0
effective benchmark	0
deflated sr	1.5

Selection-bias benchmark SR_0: N=10 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	10
periods_per_year	12

Result
psr	0.6844072563511564
z	0.48005894671873295
max expected sr	1.1373561590173058
effective benchmark	1.1373561590173058
deflated sr	0.3626438409826942

Selection-bias benchmark SR_0: N=10 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	10
periods_per_year	12

Result
psr	0.8273860100746617
z	0.943885791914019
max expected sr	0.9219903585869584
effective benchmark	0.9219903585869584
deflated sr	0.5780096414130416

Selection-bias benchmark SR_0: N=10 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	10
periods_per_year	12

Result
psr	0.9530027373706471
z	1.674692531266699
max expected sr	0.7101243355735859
effective benchmark	0.7101243355735859
deflated sr	0.7898756644264141

Selection-bias benchmark SR_0: N=10 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	10
periods_per_year	12

Result
psr	0.9986981235686223
z	3.011031965385327
max expected sr	0.5000194764816396
effective benchmark	0.5000194764816396
deflated sr	0.9999805235183604

Selection-bias benchmark SR_0: N=10 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	10
periods_per_year	12

Result
psr	0.9999995084932247
z	4.895288844001613
max expected sr	0.3528267069421108
effective benchmark	0.3528267069421108
deflated sr	1.147173293057889

Selection-bias benchmark SR_0: N=50 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	50
periods_per_year	12

Result
psr	0.4243023109475148
z	-0.19089916747833546
max expected sr	1.6442081390377736
effective benchmark	1.6442081390377736
deflated sr	-0.14420813903777363

Selection-bias benchmark SR_0: N=50 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	50
periods_per_year	12

Result
psr	0.6075455355348297
z	0.2729276777169507
max expected sr	1.332866613227674
effective benchmark	1.332866613227674
deflated sr	0.16713338677232592

Selection-bias benchmark SR_0: N=50 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	50
periods_per_year	12

Result
psr	0.8422466713437592
z	1.003734417069631
max expected sr	1.0265845074314273
effective benchmark	1.0265845074314273
deflated sr	0.47341549256857274

Selection-bias benchmark SR_0: N=50 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	50
periods_per_year	12

Result
psr	0.9903600567837493
z	2.3400738511882584
max expected sr	0.7228484115467027
effective benchmark	0.7228484115467027
deflated sr	0.7771515884532973

Selection-bias benchmark SR_0: N=50 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	50
periods_per_year	12

Result
psr	0.9999880096540286
z	4.224330729804546
max expected sr	0.5100605809576373
effective benchmark	0.5100605809576373
deflated sr	0.9899394190423627

Selection-bias benchmark SR_0: N=100 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	100
periods_per_year	12

Result
psr	0.3321238130191362
z	-0.4340562855103524
max expected sr	1.827892729957558
effective benchmark	1.827892729957558
deflated sr	-0.327892729957558

Selection-bias benchmark SR_0: N=100 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	100
periods_per_year	12

Result
psr	0.5118750344019024
z	0.029770559684933494
max expected sr	1.4817693298537102
effective benchmark	1.4817693298537102
deflated sr	0.018230670146289762

Selection-bias benchmark SR_0: N=100 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	100
periods_per_year	12

Result
psr	0.7765452802430594
z	0.7605772990376134
max expected sr	1.1412705686514406
effective benchmark	1.1412705686514406
deflated sr	0.3587294313485594

Selection-bias benchmark SR_0: N=100 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	100
periods_per_year	12

Result
psr	0.9819995900016145
z	2.0969167331562413
max expected sr	0.8036022477670218
effective benchmark	0.8036022477670218
deflated sr	0.6963977522329782

Selection-bias benchmark SR_0: N=100 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	100
periods_per_year	12

Result
psr	0.9999656965263144
z	3.9811736117725283
max expected sr	0.567042581552976
effective benchmark	0.567042581552976
deflated sr	0.932957418447024

Selection-bias benchmark SR_0: N=500 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	500
periods_per_year	12

Result
psr	0.17538112020017838
z	-0.9331117318003248
max expected sr	2.2048868160858275
effective benchmark	2.2048868160858275
deflated sr	-0.7048868160858275

Selection-bias benchmark SR_0: N=500 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	500
periods_per_year	12

Result
psr	0.31943301847666294
z	-0.4692848866050394
max expected sr	1.7873771290455527
effective benchmark	1.7873771290455527
deflated sr	-0.2873771290455527

Selection-bias benchmark SR_0: N=500 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	500
periods_per_year	12

Result
psr	0.6031548943326397
z	0.26152185274764106
max expected sr	1.3766521356342198
effective benchmark	1.3766521356342198
deflated sr	0.12334786436578016

Selection-bias benchmark SR_0: N=500 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	500
periods_per_year	12

Result
psr	0.9449630759403237
z	1.597861286866269
max expected sr	0.9693413472461175
effective benchmark	0.9693413472461175
deflated sr	0.5306586527538825

Selection-bias benchmark SR_0: N=500 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	500
periods_per_year	12

Result
psr	0.9997512223407629
z	3.4821181654825555
max expected sr	0.6839923873730049
effective benchmark	0.6839923873730049
deflated sr	0.8160076126269951

Selection-bias benchmark SR_0: N=1000 trials, T=24 months

Inputs
observed_sr	1.5
n	24
skew	0
kurt	3
num_trials	1000
periods_per_year	12

Result
psr	0.12990752817147133
z	-1.1268285877789788
max expected sr	2.351223479937965
effective benchmark	2.351223479937965
deflated sr	-0.851223479937965

Selection-bias benchmark SR_0: N=1000 trials, T=36 months

Inputs
observed_sr	1.5
n	36
skew	0
kurt	3
num_trials	1000
periods_per_year	12

Result
psr	0.2536646490737231
z	-0.6630017425836934
max expected sr	1.9060039919765324
effective benchmark	1.9060039919765324
deflated sr	-0.40600399197653236

Selection-bias benchmark SR_0: N=1000 trials, T=60 months

Inputs
observed_sr	1.5
n	60
skew	0
kurt	3
num_trials	1000
periods_per_year	12

Result
psr	0.5270296377563595
z	0.06780499676898694
max expected sr	1.4680194926086971
effective benchmark	1.4680194926086971
deflated sr	0.03198050739130287

Selection-bias benchmark SR_0: N=1000 trials, T=120 months

Inputs
observed_sr	1.5
n	120
skew	0
kurt	3
num_trials	1000
periods_per_year	12

Result
psr	0.9198620262647988
z	1.4041444308876148
max expected sr	1.0336757964591385
effective benchmark	1.0336757964591385
deflated sr	0.46632420354086146

Selection-bias benchmark SR_0: N=1000 trials, T=240 months

Inputs
observed_sr	1.5
n	240
skew	0
kurt	3
num_trials	1000
periods_per_year	12

Result
psr	0.9994961503534184
z	3.2884013095039015
max expected sr	0.7293884427796551
effective benchmark	0.7293884427796551
deflated sr	0.7706115572203449

Frequently asked questions

What is the selection-bias benchmark Sharpe ratio?

It is the highest annualized Sharpe you would expect from pure luck after trying N strategy variants on a backtest of a given length, under the null that none of them has real edge. An observed Sharpe must exceed this benchmark before it counts as evidence of skill rather than the best of many random tries.

Why does the minimum Sharpe fall as the backtest gets longer?

Because the benchmark scales with the standard error of a Sharpe estimate, which shrinks as one over the square root of the sample length. A 240-month backtest has roughly a third the Sharpe estimation noise of a 24-month one, so the same number of trials produces a bar about three times lower on the longer sample.

How do I count the number of trials N for my own strategy?

Count every distinct configuration the data was tested against before you chose the reported one, including swept parameters, dropped entry and exit rules, screened assets, and feature subsets. It is almost always far larger than the number of strategies you write up, and undercounting it makes any deflation dishonest.
