For an annualized Sharpe of 1.8 over T = 504 daily observations with skew −0.6 and excess kurtosis 4.5, the Deflated Sharpe Ratio engine returns PSR = 0.993 at N = 1 (no selection) and PSR = 0.298 at N = 40 trials. The selection-bias benchmark grows from 0 → 2.19 → 2.53 → 3.26 as N moves 1 → 40 → 100 → 1000. The practical implication for solo LLM research programmes is the opposite of the intuitive one: a low trial count does not protect the raw Sharpe, because even a modest N erodes the cushion above the gate. The defensible response is to bound the trial budget before the search begins and report it as a fixed parameter.

TL;DR

Run the same Sharpe = 1.8 strategy on four trial counts and the engine output tells the story:

| num_trials | maxExpectedSr (benchmark) | deflatedSr | PSR |
|---|---|---|---|
| 1 | 0.000 | 1.800 | 0.993 |
| 40 | 2.189 | −0.389 | 0.298 |
| 100 | 2.531 | −0.731 | 0.160 |
| 1000 | 3.255 | −1.455 | 0.024 |

Numbers are direct output of /deflated-sharpe-ratio/ on the canonical input (observed_sr = 1.8, n = 504, skew = −0.6, kurt = 4.5, periods_per_year = 252). The takeaway is that the selection-bias benchmark grows steadily in N (roughly the expected maximum of N standard normals), and among these four rows a 1.8 Sharpe clears the conventional PSR ≥ 0.95 gate only at N = 1. By N = 40 the benchmark (≈ 2.19) already sits above the observed 1.8, so PSR falls to 0.30; by N = 1000 it is 0.02. The Bailey–Lopez de Prado 2014 benchmark [1] never explodes. It just overtakes a fixed observed Sharpe of 1.8 somewhere between N = 10 and N = 20 (≈ 1.57 at N = 10, ≈ 1.90 at N = 20 under the same closed form).

The selection-bias benchmark grows in num_trials, not the deflated Sharpe

Bailey and Lopez de Prado's 2014 result is that the expected maximum Sharpe under the null (zero true edge across all candidates) grows approximately as the inverse normal CDF evaluated at 1 − 1/N, plus a correction term weighted by the Euler–Mascheroni constant [1]. The engine implements this directly: E[max SR*] ≈ (1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e)) with γ ≈ 0.5772.
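The closed form above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the engine's source: the function name `expected_max_sr` is hypothetical, and returning 0 at N = 1 is an assumption made to match the first row of the table (the formula itself is undefined there because Φ⁻¹(0) diverges).

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sr(num_trials: int) -> float:
    """Approximate expected maximum of num_trials standard normals."""
    if num_trials < 1:
        raise ValueError("num_trials must be >= 1")
    if num_trials == 1:
        # Assumption: with a single trial there is no selection, so the
        # benchmark is 0 (the closed form is undefined at N = 1).
        return 0.0
    inv_cdf = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * inv_cdf(1 - 1 / (num_trials * e)))


# Print the benchmark for the four trial counts used in the table.
for n in (1, 40, 100, 1000):
    print(f"N={n:5d}  benchmark={expected_max_sr(n):.3f}")
```

Under these assumptions the printed benchmarks should agree with the maxExpectedSr column to about three decimals.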

For N = 40 the closed form gives maxExpectedSr ≈ 2.19. This is a dimensionless extreme-value quantity (the expected maximum of N independent standard normals), and it lives in the same Sharpe units as the observed 1.8 (no annualization rescaling: the √252 factor annualizes a return standard error, not the spread of a set of Sharpe estimates). The deflated Sharpe is deflatedSr = observed_sr − maxExpectedSr = 1.8 − 2.19 = −0.39, and the PSR (the probability that the true Sharpe exceeds the benchmark, given the observed sample) is 0.30, below the 0.95 gate but a long way from impossible.
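The step from deflatedSr to PSR can be sketched as follows. The function names are hypothetical, and three conventions are assumptions chosen because they reproduce the tabulated PSR values: both Sharpe figures are converted to per-period units by dividing by √periods_per_year, the sample-size factor is √(n − 1), and the `kurt` input enters the variance term as (kurt − 1)/4.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
NORM = NormalDist()


def expected_max_sr(num_trials):
    """Selection-bias benchmark; 0 at N = 1 by assumption."""
    if num_trials == 1:
        return 0.0
    return ((1 - EULER_GAMMA) * NORM.inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * NORM.inv_cdf(1 - 1 / (num_trials * e)))


def psr(observed_sr, n, skew, kurt, num_trials, periods_per_year=252):
    """Probability that the true Sharpe exceeds the benchmark."""
    scale = sqrt(periods_per_year)
    sr = observed_sr / scale                      # per-period observed Sharpe
    sr_star = expected_max_sr(num_trials) / scale  # per-period benchmark
    # Non-normality adjustment to the variance of the Sharpe estimate.
    variance_term = 1 - skew * sr + (kurt - 1) / 4 * sr ** 2
    z = (sr - sr_star) * sqrt(n - 1) / sqrt(variance_term)
    return NORM.cdf(z)


for trials in (1, 40, 100, 1000):
    value = psr(1.8, 504, -0.6, 4.5, num_trials=trials)
    print(f"N={trials:5d}  PSR={value:.3f}")
```

If the engine uses different conventions, the third decimal may differ, so treat the sketch as a cross-check rather than a specification.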

That ordering is what matters. The instinct is that "I only ran 40 backtests, so the haircut should be small." The benchmark at N = 40 (≈ 2.19) is small in absolute terms, but a 1.8 observed Sharpe is below it, so the deflated number is already negative and the PSR drops under the gate. The lesson is not that 40 trials annihilate any Sharpe; it is that the benchmark crosses a credible retail Sharpe (1.5–2.5) at modest N, so the cushion you thought you had at N = 1 is gone by N = 40.
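To see where the benchmark crosses a given observed Sharpe, a brute-force scan over N is enough. The sketch below is illustrative: `first_n_exceeding` is a hypothetical helper, the benchmark is the closed form quoted earlier, and the N = 1 value of 0 is an assumption.

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
NORM = NormalDist()


def expected_max_sr(num_trials):
    """Selection-bias benchmark; 0 at N = 1 by assumption."""
    if num_trials == 1:
        return 0.0
    return ((1 - EULER_GAMMA) * NORM.inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * NORM.inv_cdf(1 - 1 / (num_trials * e)))


def first_n_exceeding(target_sr, max_trials=10_000):
    """Smallest N whose benchmark is above target_sr, or None if not found."""
    for n in range(1, max_trials + 1):
        if expected_max_sr(n) > target_sr:
            return n
    return None


# Scan the retail Sharpe range discussed in the text.
for target in (1.5, 1.8, 2.5):
    print(f"observed Sharpe {target}: benchmark overtakes at N = "
          f"{first_n_exceeding(target)}")
```

Because the benchmark is monotone in N, the scan returns a single crossing point per target.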

The framework's strong assumption is that every trial has equal a-priori probability of being the winner, and the null hypothesis is that all trials have zero true Sharpe. That assumption is unfaithful to most retail workflows — but the alternative (informative priors on trials, pre-registration of a small set of named hypotheses) requires that the research process change before the data is touched.

Why the low-trial intuition is wrong

Three intuitions break.

First, "I only ran 40 backtests" is rarely 40. It is 40 backtests with eight hyperparameter sweeps each, which is N = 320 under the Bailey–Lopez de Prado framework because each sweep is a candidate. Selection bias counts every variant the researcher could have picked, not the count they remember picking. Lopez de Prado's Advances in Financial Machine Learning makes this explicit: the operative N is the number of strategy-equivalent backtests the search procedure visited [2].
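The bookkeeping is simple enough to write down. In this sketch the variable names are hypothetical and the benchmark function is the closed form quoted earlier; it compares the benchmark at the remembered count with the benchmark at the count the search actually visited.

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
NORM = NormalDist()


def expected_max_sr(num_trials):
    """Selection-bias benchmark; 0 at N = 1 by assumption."""
    if num_trials == 1:
        return 0.0
    return ((1 - EULER_GAMMA) * NORM.inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * NORM.inv_cdf(1 - 1 / (num_trials * e)))


remembered_backtests = 40
sweeps_per_backtest = 8
# Every sweep is a candidate the researcher could have picked.
effective_trials = remembered_backtests * sweeps_per_backtest

print("remembered N:", remembered_backtests,
      "benchmark:", round(expected_max_sr(remembered_backtests), 3))
print("effective N: ", effective_trials,
      "benchmark:", round(expected_max_sr(effective_trials), 3))
```

The gap between the two printed benchmarks is the part of the haircut that an under-counted N hides.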

Second, the benchmark grows monotonically in N from 0 (at N = 1) and crosses a plausible retail Sharpe (1.5–2.5) at modest N. The engine output here: maxExpectedSr at N = 40 is already 2.19, which is above the observed 1.8, so deflatedSr is negative and PSR sits below the gate even though the benchmark itself is a small, sane number.

Third, the haircut does not get cheaper with more observations. The benchmark is set purely by the trial count and does not depend on T. T enters only through the standard error of the Sharpe estimate, which shrinks as 1/√(T − 1), so a longer sample sharpens PSR in whichever direction deflatedSr already points: when the observed Sharpe sits below the benchmark, more data pushes PSR further under the gate. The non-normality correction is set by the (skew, kurt) inputs, not by T. Adding more data is not a defence against trial-count selection bias.
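The role of T can be checked numerically by holding the annualized Sharpe and the moments fixed and varying only the sample length. This is a sketch under assumptions: the function names are hypothetical, and the PSR conventions (per-period conversion by √periods_per_year, √(n − 1), `kurt` entering as (kurt − 1)/4) are chosen to match the tabulated values rather than taken from the engine's source.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
NORM = NormalDist()


def expected_max_sr(num_trials):
    """Selection-bias benchmark; depends on N only, never on T."""
    if num_trials == 1:
        return 0.0
    return ((1 - EULER_GAMMA) * NORM.inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * NORM.inv_cdf(1 - 1 / (num_trials * e)))


def psr(observed_sr, n, skew, kurt, num_trials, periods_per_year=252):
    scale = sqrt(periods_per_year)
    sr = observed_sr / scale
    gap = (observed_sr - expected_max_sr(num_trials)) / scale
    spread = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NORM.cdf(gap * sqrt(n - 1) / spread)


sample_sizes = (252, 504, 1260, 2520)
# N = 40: observed Sharpe is below the benchmark, so deflatedSr < 0.
below = [psr(1.8, t, -0.6, 4.5, num_trials=40) for t in sample_sizes]
# N = 10: observed Sharpe is above the benchmark, so deflatedSr > 0.
above = [psr(1.8, t, -0.6, 4.5, num_trials=10) for t in sample_sizes]

for t, b, a in zip(sample_sizes, below, above):
    print(f"T={t:5d}  PSR(N=40)={b:.3f}  PSR(N=10)={a:.3f}")
```

The N = 40 column falls as T grows while the N = 10 column rises: a longer sample amplifies the sign of deflatedSr and leaves the benchmark untouched.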

The defensible reporting pattern

The Bailey–Lopez de Prado framing forces one of three honest reports for any retail LLM research programme.

Pre-registered N = 1. Pick a single hypothesis before touching data. Do not iterate. The engine then returns PSR equal to the standard probabilistic Sharpe (no selection adjustment). In the canonical input PSR ≈ 0.993. This is the cheapest path to a defensible number but the least flexible.

Pre-registered N = small (5 to 10). Enumerate a small named set of hypotheses before running any of them. Report N as the trial count. At N = 10 the engine's maxExpectedSr is ≈ 1.57 under these inputs, so a 1.8 Sharpe still clears it (deflatedSr ≈ +0.23, PSR ≈ 0.62), but the cushion is thin. The report should include both observed_sr and PSR and let the reader weight them.

Exploratory N = large (40 to 1000+). Report the full grid and accept that PSR falls below the 0.95 gate (0.30 at N = 40, 0.02 at N = 1000) under naïve application of Bailey–Lopez de Prado. The defence is then to claim the strategy is hypothesis-generating, not validated. The walk-forward step (see /articles/walk-forward-validation-cookbook/) does the validation that this regime cannot.

A reading rule for the engine output

The engine's maxExpectedSr, effectiveBenchmark, and deflatedSr are all in the same Sharpe units as the observed input. psr is a probability in [0, 1]. z is the test statistic feeding psr through Φ.
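One way to write the relationship between these fields is the standard probabilistic-Sharpe statistic. The notation is an assumption, not the engine's documented form: hats and stars denote per-period values (the annualized observed_sr and effectiveBenchmark divided by √periods_per_year), γ₃ is the skew input, γ₄ is the kurt input, and T is n. With those conventions the expression reproduces the z shown in the Verified engine output block (a deflated sr of about 0.2503 gives z ≈ 0.3403).

```latex
\[
z \;=\; \frac{\bigl(\widehat{SR} - SR^{*}\bigr)\,\sqrt{T-1}}
             {\sqrt{\,1 - \gamma_3\,\widehat{SR} + \dfrac{\gamma_4 - 1}{4}\,\widehat{SR}^{2}\,}},
\qquad
\mathrm{psr} \;=\; \Phi(z),
\qquad
\widehat{SR} - SR^{*} \;=\; \frac{\texttt{deflatedSr}}{\sqrt{\texttt{periods\_per\_year}}}
\]
```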

A PSR ≥ 0.95 is the conventional gate to claim selection-adjusted significance. At N = 1 the canonical input clears that gate by a wide margin (PSR = 0.993). At N ≥ 40 with the same observed_sr it fails (PSR ≤ 0.30), not because the strategy got worse but because the engine accounts for the family of trials. The decision rule "PSR ≥ 0.95" is meaningless without an honest N.
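A reporting pipeline can enforce that rule mechanically by refusing to evaluate the gate when no trial count is declared. The helper below is a hypothetical sketch of that idea, not part of the engine; the 0.95 threshold is the conventional gate named above.

```python
def passes_gate(psr, num_trials, threshold=0.95):
    """Apply the PSR gate only when the trial count has been declared."""
    if num_trials is None:
        # An undeclared N makes the PSR unauditable, so fail loudly.
        raise ValueError("num_trials must be declared before applying the gate")
    if not isinstance(num_trials, int) or num_trials < 1:
        raise ValueError("num_trials must be a positive integer")
    if not 0.0 <= psr <= 1.0:
        raise ValueError("psr must be a probability in [0, 1]")
    return psr >= threshold


# The four rows of the table: (declared N, PSR reported at that N).
for declared_n, reported_psr in ((1, 0.993), (40, 0.298), (100, 0.160), (1000, 0.024)):
    verdict = "clears" if passes_gate(reported_psr, declared_n) else "fails"
    print(f"N={declared_n:5d}  PSR={reported_psr:.3f}  {verdict} the 0.95 gate")
```

Raising on a missing N rather than defaulting to 1 is a deliberate choice here: a silent default would reproduce exactly the undeclared-trial-count problem the gate is meant to expose.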

The same input pair (observed_sr, T) can produce four different PSR values across four num_trials values. Authors who do not declare num_trials are publishing four different probability statements at once, none of them auditable. The MiCA-style audit framing in BaFin + EU Guide for Retail AI Traders treats undeclared trial count as a material omission; the same convention belongs in any LLM-driven research write-up.

Where this fails

Bailey–Lopez de Prado assumes every trial has equal a-priori probability of being the best. That is wrong when the researcher has prior structure, for example a fixed signal universe and a small set of hyperparameter axes that map onto theoretical priors. In those cases the effective N is between 1 and the literal grid size, and a hierarchical Bayesian framing gives a smaller haircut [2].

The engine does not offer that Bayesian path. It implements the closed-form selection-bias correction faithfully. Solo researchers who want a smaller haircut must either (a) pre-register a small N before search, or (b) use a different methodology entirely, such as the PBO via CSCV framework — see /articles/deflated-sharpe-vs-pbo-on-the-same-tape/ for how the two tests interact on the same tape.

The skewness and excess-kurtosis correction also assumes the empirical moments are well-estimated. For T = 504, kurtosis estimates carry meaningful standard error and the deflation can be over- or under-stated by 10–15% depending on the (skew, kurt) point estimate. The engine does not propagate moment-estimate uncertainty into PSR; that is a known limitation of the Bailey–Lopez de Prado closed form.

The same observed return series feeds a battery of risk-adjusted metrics. On a 15-observation strategy-vs-benchmark example the Risk-Adjusted Returns engine returns annualized Sharpe 6.93, Sortino 13.13, Calmar 79.85, information ratio 0.330, beta 1.943, and an alpha of −0.019 annualized (canonical inputs for a strategy with a sector-tilted benchmark). The metrics decompose the risk story but do not address selection bias, which is exactly the gap the deflated Sharpe is designed to close.

A defensible quant report leads with deflated Sharpe and PSR, then lists the conventional metrics under the explicit qualification that they are not selection-adjusted. See The Sharpe Ratio Trap and How to Read a Backtest Report for the broader template.

This is the low-trial-regime entry in the overfitting-diagnostics series: what DSR does as N shrinks toward 1, and why a small trial budget is not the protection it appears to be.

References

  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter on backtest overfitting and selection bias. https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
  • Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28. The multiple-testing correction in a Sharpe-ratio context.
  • White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126. The forerunner to PBO/DSR for general predictive-model selection.

Footnotes

  1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. https://jpm.pm-research.com/content/40/5/94

  2. Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism." Notices of the American Mathematical Society 61(5), 458–471. https://www.ams.org/notices/201405/rnoti-p458.pdf

Verified engine output

Inputs
observed_sr: 1.8
n: 504
skew: -0.6
kurt: 4.5
num_trials: 40
periods_per_year: 252

Result
psr: 0.6332008381249735
z: 0.3403430528349704
max expected sr: 1.5497345154018207
effective benchmark: 1.5497345154018207
deflated sr: 0.2502654845981793

Computed live at build time.

Frequently asked questions

Why does the engine output PSR = 0.30 at N = 40 for a 1.8 Sharpe?
The engine implements the Bailey–Lopez de Prado closed form: the expected max Sharpe under the null (≈ 2.19 at N = 40) just exceeds the observed 1.8, so the deflated Sharpe is mildly negative and PSR drops below the 0.95 gate.
Is the engine correct?
Yes, within the Bailey–Lopez de Prado 2014 framework. maxExpectedSr is the expected maximum of N standard normals in the same Sharpe units as the observed input. The framework assumes every trial has equal a-priori probability of winning; retail research with informative priors should bound the effective N before applying it.
What's the safest reporting protocol for a solo LLM research run?
Pre-register the hypothesis count before any backtest, report num_trials as a fixed parameter, and publish the PSR at that declared N. Low PSR routes the strategy to walk-forward validation before any live claim.
Does deflated Sharpe replace PBO?
No. DSR addresses single-number significance after trial-count adjustment; PBO addresses in-sample-vs-out-of-sample generalization. Both are required to defend a backtest claim.
Why is skewness an input?
The variance of the Sharpe estimator depends on the third and fourth moments of the return distribution; negative skew and excess kurtosis inflate the standard error of the Sharpe estimate.