aifinhub

Methodology · Tool · Last updated 2026-04-20

How Backtest Overfitting Score works

How the Backtest Overfitting Score computes PBO and Deflated Sharpe Ratio from a returns CSV.

Scope

The tool answers a specific question: among a basket of candidate strategies you backtested, how likely is the winner's edge to generalize? It does this without requiring new out-of-sample data — it uses combinatorial splits of the observations you already have.

It does not:

  • detect data-mining bias across separate model selection rounds not reflected in the returns table,
  • correct for survivor bias in the universe selection,
  • identify overfit parameters within a single strategy — it compares candidates against each other.

Input format

Wide CSV, one column per candidate strategy, one row per observation (typically daily):

date,strategy_1,strategy_2,strategy_3,...
2020-01-02,0.0012,-0.0005,0.0003,...
2020-01-03,0.0041,0.0009,-0.0002,...

Returns are interpreted as simple returns (not log). The date column is optional. All computation runs client-side in the browser; the file never leaves the device.
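A minimal sketch of a parser for this format. The names `parseReturnsCsv` and `ReturnsTable` are illustrative, not the tool's actual internals; the date column is detected by checking whether the first body cell parses as a number.

```typescript
interface ReturnsTable {
  names: string[];     // strategy column names
  series: number[][];  // one array of simple returns per strategy
}

function parseReturnsCsv(text: string): ReturnsTable {
  const rows = text.trim().split(/\r?\n/).map((line) => line.split(","));
  const header = rows[0];
  const body = rows.slice(1);
  // Treat the first column as an (optional) date column if it is not numeric.
  const hasDate = body.length > 0 && Number.isNaN(Number(body[0][0]));
  const start = hasDate ? 1 : 0;
  const names = header.slice(start);
  const series = names.map((_, j) => body.map((row) => Number(row[j + start])));
  return { names, series };
}
```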

Sharpe ratio

For each strategy, daily Sharpe is mean(r) / stdev(r) on the simple-return series. Annualized by √252. No risk-free rate subtracted (assumed effectively zero for the cross-validation window; adjust inputs if material).
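The computation above can be sketched as follows (population stdev is an assumption here; the tool does not state whether it uses the sample or population estimator):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdev(xs: number[]): number {
  // population standard deviation (divide by n, not n - 1)
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
}

function dailySharpe(returns: number[]): number {
  // no risk-free rate subtracted, matching the text above
  return mean(returns) / stdev(returns);
}

function annualizedSharpe(returns: number[]): number {
  return dailySharpe(returns) * Math.sqrt(252);
}
```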

Deflated Sharpe Ratio (DSR)

Per Bailey & Lopez de Prado (2014):

DSR = Φ( (SR − E[max SR*]) · √(T − 1) /
            √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )

where:

  • SR = observed per-period Sharpe
  • T = number of observations
  • γ₃ = skewness of returns
  • γ₄ = kurtosis of returns (3 under normality, so the last denominator term reduces to SR²/2)
  • E[max SR*] = expected maximum Sharpe under the null of zero true Sharpe across N trials, approximated by √V · [(1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e))], where V is the variance of the N Sharpe estimates and γ is the Euler–Mascheroni constant (0.5772…).

DSR is a probability in [0, 1]. Values above ~0.95 indicate the observed Sharpe is likely genuine after correcting for selection bias (the number of strategies tried) and for the non-normality of the return distribution.
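The DSR pipeline can be sketched as below. `normCdf` uses the Abramowitz–Stegun erf approximation and `normInv` a plain bisection; both are stand-ins for whatever numerics the tool actually uses, and all function names are illustrative.

```typescript
const EULER_GAMMA = 0.5772156649015329;

// Φ via the Abramowitz–Stegun 7.1.26 erf approximation (abs. error < 1.5e-7).
function normCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Φ⁻¹ by bisection on normCdf — slow but dependable for a sketch.
function normInv(p: number): number {
  let lo = -10, hi = 10;
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

// E[max SR*] under the null across nTrials candidates; srVariance is the
// cross-trial variance of the Sharpe estimates.
function expectedMaxSharpe(nTrials: number, srVariance: number): number {
  return Math.sqrt(srVariance) *
    ((1 - EULER_GAMMA) * normInv(1 - 1 / nTrials) +
      EULER_GAMMA * normInv(1 - 1 / (nTrials * Math.E)));
}

// DSR per the formula above; kurt is plain kurtosis (3 under normality).
function deflatedSharpe(sr: number, t: number, skew: number, kurt: number,
                        srMax: number): number {
  const num = (sr - srMax) * Math.sqrt(t - 1);
  const den = Math.sqrt(1 - skew * sr + ((kurt - 1) / 4) * sr * sr);
  return normCdf(num / den);
}
```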

PBO via CSCV

Per Bailey, Borwein, Lopez de Prado & Zhu (2016):

  1. Partition the observation axis into 2s equal chunks.
  2. Enumerate all C(2s, s) ways to choose s chunks as in-sample (IS). The remaining s form out-of-sample (OOS). The tool samples at most 500 combinations for browser-compute bounds.
  3. For each combination:
    1. Compute Sharpe of every candidate over the IS chunks → pick the best (n*).
    2. Compute Sharpe of every candidate over the OOS chunks.
    3. Find n*'s fractional rank r ∈ (0, 1) among the OOS Sharpes (rank divided by N + 1, so the logit stays finite).
    4. Compute logit λ = log(r / (1 − r)).
  4. PBO = fraction of combinations where λ < 0 — i.e. the IS-best candidate ranks below the OOS median.

Intuition: if the IS-best is genuinely good, it should tend to rank above median OOS. A PBO near 0.5 means OOS rank is a coin flip — the IS-best was probably lucky. Near 0.0 = robust. Above 0.5 = worse than random (negative selection).
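The CSCV loop above can be sketched as follows. This is a simplification: every combination is drawn by a random shuffle rather than enumerated, mirroring the 500-combination sampling cap, and `pboCscv` and its rank convention are illustrative assumptions rather than tool internals.

```typescript
function sharpeOf(xs: number[]): number {
  const m = xs.reduce((a, b) => a + b, 0) / xs.length;
  const v = xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length;
  return m / Math.sqrt(v);
}

function pboCscv(series: number[][], numChunks: number, maxCombos = 500,
                 rand: () => number = Math.random): number {
  const t = series[0].length;
  const per = Math.floor(t / numChunks);
  // chunk c covers [c*per, (c+1)*per); the last chunk absorbs any remainder
  const chunkBounds = (c: number): [number, number] =>
    [c * per, c === numChunks - 1 ? t : (c + 1) * per];
  const half = numChunks / 2;
  let below = 0;
  for (let combo = 0; combo < maxCombos; combo++) {
    // random half-subset of chunks as in-sample (sampling, not enumeration)
    const order = Array.from({ length: numChunks }, (_, i) => i);
    for (let i = order.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      [order[i], order[j]] = [order[j], order[i]];
    }
    const isSet = new Set(order.slice(0, half));
    const slice = (sIdx: number, inSample: boolean): number[] => {
      const out: number[] = [];
      for (let c = 0; c < numChunks; c++) {
        if (isSet.has(c) !== inSample) continue;
        const [lo, hi] = chunkBounds(c);
        for (let k = lo; k < hi; k++) out.push(series[sIdx][k]);
      }
      return out;
    };
    const isSharpes = series.map((_, i) => sharpeOf(slice(i, true)));
    const oosSharpes = series.map((_, i) => sharpeOf(slice(i, false)));
    const best = isSharpes.indexOf(Math.max(...isSharpes));
    const rank = oosSharpes.filter((x) => x <= oosSharpes[best]).length;
    const r = rank / (series.length + 1);   // fractional rank in (0, 1)
    if (Math.log(r / (1 - r)) < 0) below++; // λ < 0 ⇔ below the OOS median
  }
  return below / maxCombos;
}
```

With two identical candidates the IS-best always ties the OOS median from above, so PBO comes out 0 regardless of the split.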

Parameter: S

S (user-configurable 4–16) is the half-partition count s above: the series is split into 2S chunks. Higher S gives finer cross-validation but combinatorially more splits. Default: S = 8 → 16 chunks, C(16, 8) = 12,870 combinations — sampled down to 500.
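As a quick sanity check on the combination counts quoted above, C(2s, s) can be computed directly (the `choose` helper is illustrative, not tool code):

```typescript
// Binomial coefficient via the multiplicative formula; the running product
// stays near-integral, so a final round absorbs float noise.
function choose(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return Math.round(result);
}
```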

Verdict bands

PBO       Verdict
< 20%     Low overfitting risk — edge appears to generalize
20–50%    Moderate overfitting risk — further OOS testing advised
≥ 50%     High overfitting risk — classic signature of a lucky fit
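The banding is a plain threshold map; a sketch (labels as in the table, half-open thresholds assumed):

```typescript
// Map a PBO estimate in [0, 1] to the verdict band above.
function verdict(pbo: number): string {
  if (pbo < 0.2) return "Low overfitting risk";
  if (pbo < 0.5) return "Moderate overfitting risk";
  return "High overfitting risk";
}
```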

Assumptions + limitations

  1. Independent observations within chunks. The tool does not adjust for autocorrelation inside a chunk; heavily serially correlated returns will show inflated Sharpe ratios and underestimated variance.
  2. Stationarity across chunks. Structural breaks (regime changes) can make PBO look either better or worse than reality. Inspect a rolling Sharpe plot of each strategy externally.
  3. Equal chunk size. The last chunk absorbs any remainder when T is not divisible by 2S.
  4. Sampling for large S. Combinations are capped at 500 via random shuffle, so PBO is a Monte Carlo estimate; the standard error of a proportion from 500 samples is at most √(0.25/500) ≈ 2.2 percentage points.
  5. No transaction costs. The tool uses the returns you provide; if they are gross, conclusions apply to gross Sharpe.
  6. DSR E[max] is an approximation. Bailey-Lopez de Prado give a tight asymptotic form; with N < 10 the approximation error grows.
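For point 4, the sampling-error bound is the standard binomial standard error (a worked check, not tool code):

```typescript
// Standard error of a proportion p estimated from n Monte Carlo draws.
function proportionStdError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// Worst case is p = 0.5 with the 500-combination cap:
const maxSe = proportionStdError(0.5, 500); // ≈ 0.0224, about 2.2 percentage points
```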

Reproducibility

Source at src/components/tools/backtest-overfitting/BacktestOverfitting.tsx in the aifinhub repository. All math is client-side JavaScript. The same CSV + same S parameter will yield the same Sharpe and DSR values deterministically; PBO will vary run-to-run within a few percentage points due to sampling. Run 3–5 times for tighter bounds.

References

  • Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107.
  • Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70.
  • Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
  • White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126.
  • Mertens, E. (2002). "Variance of the IID estimator in Lo (2002)." Working paper.

Changelog

  • 2026-04-20 — Initial release. PBO via CSCV with 500-combination sampling cap. DSR with Bailey-Lopez de Prado E[max SR*] approximation.
Planning estimates only — not financial, tax, or investment advice.