The Selection-Bias Sharpe Benchmark by Trials and Sample
A computed reference: the minimum annualized Sharpe needed to claim real edge, by strategy trials (N) and backtest length T. Short backtests punish search most.
The minimum annualized Sharpe ratio you must beat to claim real edge depends on how many strategy variants you tried (N) and how long your backtest is (T). The benchmark is zero at N=1, rises with N, and shrinks as T grows. At N=100 over 24 months it is 1.83; at 240 months it is 0.57.
The minimum annualized Sharpe ratio you must beat to claim real edge is not a fixed number like 1.0. It depends on two things you control: how many strategy variants you tried (N) and how long your backtest is (T). The table below is the selection-bias benchmark SR_0, computed by the Deflated Sharpe Ratio engine over a grid of N and T. Read off your cell: any observed Sharpe at or below that value is statistically indistinguishable from the best of N lucky coin-flips.
TL;DR
The benchmark SR_0 is the expected maximum Sharpe under the null that every one of your N trials has zero true edge.
It rises with the number of trials N and falls as the backtest gets longer (T).
At N = 1 it is 0.00: a single pre-registered hypothesis has no selection bias to clear.
At N = 100 trials over 24 months it is 1.83; over 240 months it drops to 0.57.
Short backtests punish search hardest. The same N is far more dangerous on 24 months of data than on 240.
Every cell is the direct output of the /deflated-sharpe-ratio/ engine, recomputed at build time. No number here is hand-typed.
The benchmark table
SR_0 is the minimum annualized Sharpe an observed strategy must exceed before its result is more than selection noise. Rows are the number of strategy variants tried (N); columns are the backtest sample length T in months. The grid uses monthly returns (periods_per_year = 12), zero skew, and normal kurtosis.
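| N \ T (months) | 24 | 36 | 60 | 120 | 240 |
|---|---|---|---|---|---|
| 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 10 | 1.14 | 0.92 | 0.71 | 0.50 | 0.35 |
| 50 | 1.64 | 1.33 | 1.03 | 0.72 | 0.51 |
| 100 | 1.83 | 1.48 | 1.14 | 0.80 | 0.57 |
| 500 | 2.20 | 1.79 | 1.38 | 0.97 | 0.68 |
| 1000 | 2.35 | 1.91 | 1.47 | 1.03 | 0.73 |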
Values are SR_0 to two decimal places, annualized. A backtest with a higher observed Sharpe than its cell has cleared the selection-bias bar; one at or below it has not.
How to read this
Find the row for the number of strategy variants you tested before settling on the one you are reporting. That count is N, and it includes the silent ones: every parameter you swept, every entry rule you tried and dropped, every asset you screened. Then find the column closest to your backtest length in months. The cell is the Sharpe you have to beat.
Three worked reads:
You ran one named hypothesis on 5 years of data (N = 1, T = 60). SR_0 = 0.00. There is no multiple-testing correction to apply, so judge the Sharpe on its own standard error.
You swept 100 parameter sets on 2 years of data (N = 100, T = 24). SR_0 = 1.83. An observed Sharpe of 1.5 is below the bar, so it is consistent with pure luck across 100 tries.
You ran the same 100 sweeps on 20 years of data (N = 100, T = 240). SR_0 = 0.57. Now a 1.5 Sharpe sits well clear of the bar, because the longer sample tightens every estimate.
The table makes the trade-off concrete. Moving right (more data) lowers the bar; moving down (more trials) raises it.
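The worked reads above can be sketched as a small lookup. This is a minimal illustration, not the engine itself: the names `SR0_TABLE`, `benchmark`, and `clears_bar` are hypothetical, the values are copied from the table, and the helper only accepts N and T that sit exactly on the grid.

```python
# SR_0 values copied from the benchmark table (rows: trials N, columns: T in months).
T_COLUMNS = (24, 36, 60, 120, 240)
SR0_TABLE = {
    1:    (0.00, 0.00, 0.00, 0.00, 0.00),
    10:   (1.14, 0.92, 0.71, 0.50, 0.35),
    50:   (1.64, 1.33, 1.03, 0.72, 0.51),
    100:  (1.83, 1.48, 1.14, 0.80, 0.57),
    500:  (2.20, 1.79, 1.38, 0.97, 0.68),
    1000: (2.35, 1.91, 1.47, 1.03, 0.73),
}


def benchmark(n_trials: int, months: int) -> float:
    """Return the tabulated SR_0 for an (N, T) pair that lies on the grid."""
    if n_trials not in SR0_TABLE or months not in T_COLUMNS:
        raise KeyError(f"(N={n_trials}, T={months}) is not a cell of the table")
    return SR0_TABLE[n_trials][T_COLUMNS.index(months)]


def clears_bar(observed_sharpe: float, n_trials: int, months: int) -> bool:
    """True only if the observed Sharpe is strictly above its cell."""
    return observed_sharpe > benchmark(n_trials, months)


# The second and third worked reads: the same 1.5 Sharpe, two sample lengths.
print(clears_bar(1.5, 100, 24))   # bar is 1.83
print(clears_bar(1.5, 100, 240))  # bar is 0.57
```

The comparison is strict because a Sharpe at or below the cell has not cleared the bar.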
Why short backtests punish search hardest
The benchmark factors into two parts. The first is an extreme-value term that depends only on N: the expected maximum of N independent standard normals. The second is the standard error of a Sharpe estimate, which scales as the square root of one over the sample length. Multiply them and you get SR_0.
That second factor is the whole story behind the columns. The standard error of a Sharpe ratio shrinks slowly, as one over the square root of T. A 240-month backtest has a Sharpe standard error roughly a third of a 24-month one, because the square root of 240 over 24 is about 3.16. So the same trial count produces a benchmark about one third as high at 240 months as at 24 (0.57 versus 1.83 at N = 100). Selection bias is not a fixed haircut. It is a haircut measured in units of estimation noise, and short samples are noisy.
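As a worked check of that ratio, here is a short derivation using the n - 1 denominator from the methodology formula, with n = T for monthly data. The factor that depends on N cancels, leaving only the sample-length terms:

```latex
\frac{SR_0(N,\,T=24)}{SR_0(N,\,T=240)}
  = \frac{\sqrt{12/(24-1)}}{\sqrt{12/(240-1)}}
  = \sqrt{\frac{239}{23}}
  \approx 3.22
```

This agrees with the table to rounding: at N = 100, 1.83 / 0.57 is about 3.2.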
This inverts the comfortable intuition. People assume a long backtest is safe and a short one is risky on its own terms. The deeper point is that the danger of searching is amplified on short data. Run 100 variants on 2 years and the luckiest one will look spectacular; run them on 20 years and luck has far less room to fabricate a high Sharpe.
The methodology
The benchmark is the Bailey and Lopez de Prado deflated-Sharpe construction [1]. Under the null that all N trials have zero true Sharpe, the expected maximum observed Sharpe is approximately:
SR_0 = E[max of N standard normals] * sqrt( periods_per_year / (n - 1) )
E[max of N standard normals] ~= (1 - g) * Z(1 - 1/N) + g * Z(1 - 1/(N*e))
where Z is the inverse standard-normal CDF, g is the Euler-Mascheroni constant (about 0.5772), n is the number of return observations, and periods_per_year annualizes the Sharpe. For the table, returns are monthly, so n equals T and periods_per_year is 12.
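The two formulas above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation, not the /deflated-sharpe-ratio/ engine's code; the function names are hypothetical. One assumption is labeled in the code: the quantile formula is undefined at N = 1 (it needs Z(0)), so the sketch returns 0 there to match the table's first row.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant g


def expected_max_normal(n_trials: int) -> float:
    """Approximate E[max of N standard normals] with the two-quantile formula."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if n_trials == 1:
        # Assumption: Z(1 - 1/N) = Z(0) diverges at N = 1, so special-case
        # a single trial to 0, matching the 0.00 row of the table.
        return 0.0
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * e)))


def sr0(n_trials: int, n_obs: int, periods_per_year: int = 12) -> float:
    """Annualized selection-bias benchmark for N trials and n observations."""
    return expected_max_normal(n_trials) * sqrt(periods_per_year / (n_obs - 1))


# Rebuild the grid: monthly returns, so n equals T and periods_per_year is 12.
for n in (1, 10, 50, 100, 500, 1000):
    row = "  ".join(f"{sr0(n, t):.2f}" for t in (24, 36, 60, 120, 240))
    print(f"N={n:<5d} {row}")
```

Note that `sr0` takes no observed Sharpe as input, which mirrors the point that the bar is set before you look at your result.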
The first factor grows in N, but slowly: from N = 10 to N = 1000 the expected-maximum term roughly doubles (from about 1.6 to about 3.3) even though N grows 100-fold, because the normal distribution's thin tails make the expected maximum grow only about as fast as sqrt(2 ln N). The second factor is the Sharpe standard error, and it carries the dependence on sample length. SR_0 does not depend on the observed Sharpe at all; it is the bar, set before you look at your result.
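To see how slowly the first factor grows, and to sanity-check the quantile approximation, one option is a small Monte Carlo comparison. The sketch below is an illustration under stated assumptions: the function names, the repetition count, and the seed are arbitrary choices made here, not part of the source method.

```python
import random
from math import e
from statistics import NormalDist, fmean

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant g


def approx_expected_max(n_trials: int) -> float:
    """Two-quantile approximation of E[max of N standard normals], for N > 1."""
    z = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * e)))


def simulated_expected_max(n_trials: int, reps: int = 2000, seed: int = 0) -> float:
    """Average, over `reps` draws, of the largest of N independent N(0, 1) values."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    return fmean(
        max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
        for _ in range(reps)
    )


for n in (10, 100, 1000):
    print(n, round(approx_expected_max(n), 3), round(simulated_expected_max(n), 3))
```

The simulated column is a noisy estimate, so small differences from the formula are expected; the point is that both columns grow far more slowly than N.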
A note on what this is. The table is a computed reference, not a measured backtest result. It tells you the threshold; it does not claim any strategy beat it. To test a real strategy against its own cell, including its skew and kurtosis, run it through the /deflated-sharpe-ratio/ engine, which also returns the probabilistic Sharpe and the deflated Sharpe.
Counting N honestly is the hard part
The table is only as honest as the N you plug into it. The common failure is undercounting trials. N is not "the number of strategies I wrote up." It is every distinct configuration the data saw before you picked a winner. A grid search over 5 entry thresholds, 4 exits, and 5 holding periods is 100 trials, even if you report one. Walk-forward folds, asset screens, and feature subsets all multiply in.
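The grid-search count above can be made explicit by enumerating the configurations rather than estimating them. In this sketch the parameter values and the three-asset screen are hypothetical placeholders; only the 5 x 4 x 5 shape comes from the example in the text.

```python
from itertools import product

# Hypothetical parameter grids with the 5 x 4 x 5 shape from the example.
entry_thresholds = (0.5, 1.0, 1.5, 2.0, 2.5)
exit_rules = ("time_stop", "trailing_stop", "profit_target", "signal_flip")
holding_periods = (1, 5, 10, 20, 60)

# Every combination the data saw counts as one trial, reported or not.
configurations = list(product(entry_thresholds, exit_rules, holding_periods))
n_trials = len(configurations)
print(n_trials)

# Screens multiply in: running the same grid on three (hypothetical) assets
# before picking one triples the count.
screened_assets = ("asset_a", "asset_b", "asset_c")
n_trials_with_screen = len(list(product(configurations, screened_assets)))
print(n_trials_with_screen)
```

Logging each configuration as it runs, instead of reconstructing the count afterwards, is one way to keep N from being undercounted.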
If you cannot count N, you cannot use any deflation honestly, and the defensible move is to pre-register a small fixed set of named hypotheses before touching the data. That collapses N toward 1, where the benchmark is zero and the result stands on its own standard error. Bounding the trial budget in advance is the cheapest way to buy yourself a low bar.
Failure modes
Reporting the observed Sharpe without the bar. A 1.4 Sharpe means nothing until you state N and T. On the table it is a pass at N = 10, T = 24 and a fail at N = 500, T = 24.
Undercounting N. Forgetting the swept-and-dropped variants understates the bar and lets luck through.
Assuming a long backtest fixes search. Length helps, but the table shows N = 1000 on 240 months still demands 0.73, well above zero.
Treating SR_0 as the answer instead of the threshold. Clearing the bar is necessary, not sufficient; out-of-sample and live validation still apply.
[1] Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94-107. pm-research.com
What is the selection-bias benchmark Sharpe ratio?
It is the highest annualized Sharpe you would expect from pure luck after trying N strategy variants on a backtest of a given length, under the null that none of them has real edge. An observed Sharpe must exceed this benchmark before it counts as evidence of skill rather than the best of many random tries.
Why does the minimum Sharpe fall as the backtest gets longer?
Because the benchmark scales with the standard error of a Sharpe estimate, which shrinks as one over the square root of the sample length. A 240-month backtest has roughly a third the Sharpe estimation noise of a 24-month one, so the same number of trials produces a bar about three times lower on the longer sample.
How do I count the number of trials N for my own strategy?
Count every distinct configuration the data was tested against before you chose the reported one, including swept parameters, dropped entry and exit rules, screened assets, and feature subsets. It is almost always far larger than the number of strategies you write up, and undercounting it makes any deflation dishonest.