aifinhub

Methodology · Tool · Last updated 2026-04-20

How Backtest Overfitting Score works

How the Backtest Overfitting Score computes PBO and Deflated Sharpe Ratio from a returns CSV.

Scope

The tool answers a specific question: among a basket of candidate strategies you backtested, how likely is the winner's edge to generalize? It does this without requiring new out-of-sample data — it uses combinatorial splits of the observations you already have.

It does not:

  • detect data-mining bias across separate model selection rounds not reflected in the returns table,
  • correct for survivor bias in the universe selection,
  • identify overfit parameters within a single strategy — it compares candidates against each other.

Input format

Wide CSV, one column per candidate strategy, one row per observation (typically daily):

date,strategy_1,strategy_2,strategy_3,...
2020-01-02,0.0012,-0.0005,0.0003,...
2020-01-03,0.0041,0.0009,-0.0002,...

Returns are interpreted as simple returns (not log). The date column is optional. All computation runs client-side in the browser; the file never leaves the device.
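A minimal sketch of a parser for this format. The names `parseReturnsCsv` and `ReturnsTable` are illustrative, not the tool's actual internals; the date column is detected by checking whether the first body cell parses as a number.

```typescript
interface ReturnsTable {
  names: string[];     // strategy column names
  series: number[][];  // one array of simple returns per strategy
}

function parseReturnsCsv(text: string): ReturnsTable {
  const rows = text.trim().split(/\r?\n/).map((line) => line.split(","));
  const header = rows[0];
  const body = rows.slice(1);
  // Treat the first column as an (optional) date column if it is not numeric.
  const hasDate = body.length > 0 && Number.isNaN(Number(body[0][0]));
  const start = hasDate ? 1 : 0;
  const names = header.slice(start);
  const series = names.map((_, j) => body.map((row) => Number(row[j + start])));
  return { names, series };
}
```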

Sharpe ratio

For each strategy, daily Sharpe is mean(r) / stdev(r) on the simple-return series. Annualized by √252. No risk-free rate subtracted (assumed effectively zero for the cross-validation window; adjust inputs if material).
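The computation above can be sketched as follows (population stdev is an assumption here; the tool does not state whether it uses the sample or population estimator):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdev(xs: number[]): number {
  // population standard deviation (divide by n, not n - 1)
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
}

function dailySharpe(returns: number[]): number {
  // no risk-free rate subtracted, matching the text above
  return mean(returns) / stdev(returns);
}

function annualizedSharpe(returns: number[]): number {
  return dailySharpe(returns) * Math.sqrt(252);
}
```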

Deflated Sharpe Ratio (DSR)

Per Bailey & Lopez de Prado (2014):

DSR = Φ( (SR − E[max SR*]) · √(T − 1) /
            √(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) )

where:

  • SR = observed per-period Sharpe
  • T = number of observations
  • γ₃ = skewness of returns
  • γ₄ = kurtosis of returns (3 under normality, so the last denominator term reduces to SR²/2)
  • E[max SR*] = expected maximum Sharpe under the null of zero true Sharpe across N trials, approximated by √V · [(1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e))], where V is the variance of the N Sharpe estimates and γ is the Euler–Mascheroni constant (0.5772…).

DSR is a probability in [0, 1]. Values above ~0.95 indicate the observed Sharpe is likely genuine after correcting for selection bias (the number of strategies tried) and for the non-normality of the return distribution.
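The DSR pipeline can be sketched as below. `normCdf` uses the Abramowitz–Stegun erf approximation and `normInv` a plain bisection; both are stand-ins for whatever numerics the tool actually uses, and all function names are illustrative.

```typescript
const EULER_GAMMA = 0.5772156649015329;

// Φ via the Abramowitz–Stegun 7.1.26 erf approximation (abs. error < 1.5e-7).
function normCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * z);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-z * z);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Φ⁻¹ by bisection on normCdf — slow but dependable for a sketch.
function normInv(p: number): number {
  let lo = -10, hi = 10;
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

// E[max SR*] under the null across nTrials candidates; srVariance is the
// cross-trial variance of the Sharpe estimates.
function expectedMaxSharpe(nTrials: number, srVariance: number): number {
  return Math.sqrt(srVariance) *
    ((1 - EULER_GAMMA) * normInv(1 - 1 / nTrials) +
      EULER_GAMMA * normInv(1 - 1 / (nTrials * Math.E)));
}

// DSR per the formula above; kurt is plain kurtosis (3 under normality).
function deflatedSharpe(sr: number, t: number, skew: number, kurt: number,
                        srMax: number): number {
  const num = (sr - srMax) * Math.sqrt(t - 1);
  const den = Math.sqrt(1 - skew * sr + ((kurt - 1) / 4) * sr * sr);
  return normCdf(num / den);
}
```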

PBO via CSCV

Per Bailey, Borwein, Lopez de Prado & Zhu (2016):

  1. Partition the observation axis into 2s equal chunks.
  2. Enumerate all C(2s, s) ways to choose s chunks as in-sample (IS). The remaining s form out-of-sample (OOS). The tool samples at most 500 combinations for browser-compute bounds.
  3. For each combination:
    1. Compute Sharpe of every candidate over the IS chunks → pick the best (n*).
    2. Compute Sharpe of every candidate over the OOS chunks.
    3. Find n*'s fractional rank r ∈ (0, 1) among the OOS Sharpes (rank divided by N + 1, so the logit stays finite).
    4. Compute logit λ = log(r / (1 − r)).
  4. PBO = fraction of combinations where λ < 0 — i.e. the IS-best candidate ranks below the OOS median.

Intuition: if the IS-best is genuinely good, it should tend to rank above median OOS. A PBO near 0.5 means OOS rank is a coin flip — the IS-best was probably lucky. Near 0.0 = robust. Above 0.5 = worse than random (negative selection).
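The CSCV loop above can be sketched as follows. This is a simplification: every combination is drawn by a random shuffle rather than enumerated, mirroring the 500-combination sampling cap, and `pboCscv` and its rank convention are illustrative assumptions rather than tool internals.

```typescript
function sharpeOf(xs: number[]): number {
  const m = xs.reduce((a, b) => a + b, 0) / xs.length;
  const v = xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length;
  return m / Math.sqrt(v);
}

function pboCscv(series: number[][], numChunks: number, maxCombos = 500,
                 rand: () => number = Math.random): number {
  const t = series[0].length;
  const per = Math.floor(t / numChunks);
  // chunk c covers [c*per, (c+1)*per); the last chunk absorbs any remainder
  const chunkBounds = (c: number): [number, number] =>
    [c * per, c === numChunks - 1 ? t : (c + 1) * per];
  const half = numChunks / 2;
  let below = 0;
  for (let combo = 0; combo < maxCombos; combo++) {
    // random half-subset of chunks as in-sample (sampling, not enumeration)
    const order = Array.from({ length: numChunks }, (_, i) => i);
    for (let i = order.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      [order[i], order[j]] = [order[j], order[i]];
    }
    const isSet = new Set(order.slice(0, half));
    const slice = (sIdx: number, inSample: boolean): number[] => {
      const out: number[] = [];
      for (let c = 0; c < numChunks; c++) {
        if (isSet.has(c) !== inSample) continue;
        const [lo, hi] = chunkBounds(c);
        for (let k = lo; k < hi; k++) out.push(series[sIdx][k]);
      }
      return out;
    };
    const isSharpes = series.map((_, i) => sharpeOf(slice(i, true)));
    const oosSharpes = series.map((_, i) => sharpeOf(slice(i, false)));
    const best = isSharpes.indexOf(Math.max(...isSharpes));
    const rank = oosSharpes.filter((x) => x <= oosSharpes[best]).length;
    const r = rank / (series.length + 1);   // fractional rank in (0, 1)
    if (Math.log(r / (1 - r)) < 0) below++; // λ < 0 ⇔ below the OOS median
  }
  return below / maxCombos;
}
```

With two identical candidates the IS-best always ties the OOS median from above, so PBO comes out 0 regardless of the split.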

Parameter: S

S (user-configurable 4–16) is the half-partition count s above: the series is split into 2S chunks. Higher S gives finer cross-validation but combinatorially more splits. Default: S = 8 → 16 chunks, C(16, 8) = 12,870 combinations — sampled down to 500.
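As a quick sanity check on the combination counts quoted above, C(2s, s) can be computed directly (the `choose` helper is illustrative, not tool code):

```typescript
// Binomial coefficient via the multiplicative formula; the running product
// stays near-integral, so a final round absorbs float noise.
function choose(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - k + i)) / i;
  return Math.round(result);
}
```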

Verdict bands

PBO       Verdict
< 20%     Low overfitting risk — edge appears to generalize
20–50%    Moderate overfitting risk — further OOS testing advised
≥ 50%     High overfitting risk — classic signature of a lucky fit
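The banding is a plain threshold map; a sketch (labels as in the table, half-open thresholds assumed):

```typescript
// Map a PBO estimate in [0, 1] to the verdict band above.
function verdict(pbo: number): string {
  if (pbo < 0.2) return "Low overfitting risk";
  if (pbo < 0.5) return "Moderate overfitting risk";
  return "High overfitting risk";
}
```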

Assumptions + limitations

  1. Independent observations within chunks. The tool does not adjust for autocorrelation inside a chunk; heavily serially correlated returns will show inflated Sharpe ratios and underestimated variance.
  2. Stationarity across chunks. Structural breaks (regime changes) can make PBO look either better or worse than reality. Inspect a rolling Sharpe plot of each strategy externally.
  3. Equal chunk size. The last chunk absorbs any remainder when T is not divisible by 2S.
  4. Sampling for large S. Combinations are capped at 500 via random shuffle, so PBO is a Monte Carlo estimate; the standard error of a proportion from 500 samples is at most √(0.25/500) ≈ 2.2 percentage points.
  5. No transaction costs. The tool uses the returns you provide; if they are gross, conclusions apply to gross Sharpe.
  6. DSR E[max] is an approximation. Bailey-Lopez de Prado give a tight asymptotic form; with N < 10 the approximation error grows.
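For point 4, the sampling-error bound is the standard binomial standard error (a worked check, not tool code):

```typescript
// Standard error of a proportion p estimated from n Monte Carlo draws.
function proportionStdError(p: number, n: number): number {
  return Math.sqrt((p * (1 - p)) / n);
}

// Worst case is p = 0.5 with the 500-combination cap:
const maxSe = proportionStdError(0.5, 500); // ≈ 0.0224, about 2.2 percentage points
```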

Reproducibility

Source at src/components/tools/backtest-overfitting/BacktestOverfitting.tsx in the aifinhub repository. All math is client-side JavaScript. The same CSV + same S parameter will yield the same Sharpe and DSR values deterministically; PBO will vary run-to-run within a few percentage points due to sampling. Run 3–5 times for tighter bounds.

References

  • Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107.
  • Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70.
  • Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
  • White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126.
  • Mertens, E. (2002). "Variance of the IID estimator in Lo (2002)." Working paper.

Changelog

  • 2026-04-20 — Initial release. PBO via CSCV with 500-combination sampling cap. DSR with Bailey-Lopez de Prado E[max SR*] approximation.
Planning estimates only — not financial, tax, or investment advice.