Methodology · Tool · Last updated 2026-04-20
How Backtest Overfitting Score works
How the Backtest Overfitting Score computes PBO and Deflated Sharpe Ratio from a returns CSV.
Scope
The tool answers a specific question: among a basket of candidate strategies you backtested, how likely is the winner's edge to generalize? It does this without requiring new out-of-sample data — it uses combinatorial splits of the observations you already have.
It does not:
- detect data-mining bias across separate model selection rounds not reflected in the returns table,
- correct for survivor bias in the universe selection,
- identify overfit parameters within a single strategy — it compares candidates against each other.
Input format
Wide CSV, one column per candidate strategy, one row per observation (typically daily):
date,strategy_1,strategy_2,strategy_3,...
2020-01-02,0.0012,-0.0005,0.0003,...
2020-01-03,0.0041,0.0009,-0.0002,...
Returns are interpreted as simple returns (not log).
The date column is optional. All computation runs
client-side in the browser; the file never leaves the device.
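The wide-CSV layout above can be parsed in a few lines. This is an illustrative sketch, not the tool's actual source; `parseReturnsCsv` is a hypothetical helper, and the only format assumptions are the ones stated above (optional leading date column, one numeric column per strategy):

```typescript
// Hypothetical parser for the wide-CSV format described above.
// Detects an optional leading "date" column and returns one numeric
// series per strategy column.
function parseReturnsCsv(text: string): Map<string, number[]> {
  const rows = text.trim().split(/\r?\n/).map((line) => line.split(","));
  const header = rows[0];
  const hasDate = header[0].toLowerCase() === "date"; // date column is optional
  const start = hasDate ? 1 : 0;
  const series = new Map<string, number[]>();
  for (let c = start; c < header.length; c++) series.set(header[c], []);
  for (const row of rows.slice(1)) {
    for (let c = start; c < header.length; c++) {
      series.get(header[c])!.push(parseFloat(row[c])); // simple returns, not log
    }
  }
  return series;
}
```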
Sharpe ratio
For each strategy, daily Sharpe is mean(r) / stdev(r) on the
simple-return series. Annualized by √252. No risk-free
rate subtracted (assumed effectively zero for the cross-validation
window; adjust inputs if material).
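As a concrete sketch of the computation just described (the document does not state whether the standard deviation uses n or n − 1 in the denominator; this sketch assumes the sample form, n − 1):

```typescript
// Daily Sharpe as described above: mean(r) / stdev(r) on simple returns,
// no risk-free adjustment. Sample stdev (n - 1) is an assumption.
function dailySharpe(r: number[]): number {
  const n = r.length;
  const mean = r.reduce((a, b) => a + b, 0) / n;
  const varSum = r.reduce((a, b) => a + (b - mean) ** 2, 0);
  const stdev = Math.sqrt(varSum / (n - 1));
  return mean / stdev;
}

// Annualized by sqrt(252), per the text.
const annualizedSharpe = (r: number[]): number => dailySharpe(r) * Math.sqrt(252);
```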
Deflated Sharpe Ratio (DSR)
Per Bailey & Lopez de Prado (2014):
DSR = Φ( (SR − E[max SR*]) · √(T − 1) /
√(1 − γ₃·SR + ((γ₄ − 1)/4)·SR²) ) where:
- SR = observed per-period Sharpe
- T = number of observations
- γ₃ = skewness of returns
- γ₄ = kurtosis of returns (3 for a Gaussian; the (γ₄ − 1)/4 term follows Mertens' variance expansion)
- E[max SR*] = expected maximum Sharpe under the null of zero true Sharpe across N trials, approximated by (1 − γ)·Φ⁻¹(1 − 1/N) + γ·Φ⁻¹(1 − 1/(N·e)), where γ is the Euler–Mascheroni constant (0.5772…)
DSR returns a probability in [0, 1]. Values above roughly 0.95 indicate the observed Sharpe is likely real after correcting for selection bias (the number of strategies tested) and the return distribution's non-normality.
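The formula above can be sketched end to end. This is not the tool's source: `normalCdf` uses the Abramowitz–Stegun erf approximation, `normalPpf` inverts it by bisection, and the E[max SR*] term is implemented exactly as quoted (note the full Bailey–Lopez de Prado form additionally scales it by the cross-trial standard deviation of the Sharpe estimates):

```typescript
// Standard normal CDF via the Abramowitz & Stegun 7.1.26 erf
// approximation (absolute error below ~1.5e-7).
function normalCdf(x: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2);
  const poly =
    (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t) + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-(x * x) / 2);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Phi^{-1} by bisection on normalCdf: slow but dependency-free.
function normalPpf(p: number): number {
  let lo = -10, hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

const EULER_GAMMA = 0.5772156649015329;

// E[max SR*] under the null, per the approximation quoted above.
function expectedMaxSharpe(nTrials: number): number {
  return (1 - EULER_GAMMA) * normalPpf(1 - 1 / nTrials) +
         EULER_GAMMA * normalPpf(1 - 1 / (nTrials * Math.E));
}

// DSR. gamma3 = skewness, gamma4 = kurtosis (3 for a Gaussian).
function deflatedSharpe(sr: number, t: number, gamma3: number, gamma4: number, nTrials: number): number {
  const sr0 = expectedMaxSharpe(nTrials);
  const denom = Math.sqrt(1 - gamma3 * sr + ((gamma4 - 1) / 4) * sr * sr);
  return normalCdf(((sr - sr0) * Math.sqrt(t - 1)) / denom);
}
```

By construction, an observed Sharpe exactly equal to E[max SR*] yields DSR = Φ(0) = 0.5: no evidence beyond what pure selection would produce.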
PBO via CSCV
Per Bailey, Borwein, Lopez de Prado & Zhu (2016):
- Partition the observation axis into 2S equal chunks (S is the user parameter described below).
- Enumerate all C(2S, S) ways to choose S chunks as in-sample (IS); the remaining S chunks form the out-of-sample (OOS) set. The tool samples at most 500 combinations to bound browser compute.
- For each combination:
  - Compute the Sharpe of every candidate over the IS chunks and pick the best (n*).
  - Compute the Sharpe of every candidate over the OOS chunks.
  - Find n*'s fractional rank r ∈ [0, 1] among the OOS Sharpes.
  - Compute the logit λ = log(r / (1 − r)).
- PBO = the fraction of combinations where λ < 0, i.e. where the IS-best ranks below the OOS median.
Intuition: if the IS-best is genuinely good, it should tend to rank above median OOS. A PBO near 0.5 means OOS rank is a coin flip — the IS-best was probably lucky. Near 0.0 = robust. Above 0.5 = worse than random (negative selection).
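The CSCV loop above can be sketched as follows. Function and parameter names are illustrative, not the tool's source; the fractional rank here uses (rank + 1)/(N + 1) to keep the logit finite, which is one reasonable convention and an assumption of this sketch:

```typescript
// candidates[i] = strategy i's full return series.
// chunkIds[k]   = observation indices belonging to chunk k (2S chunks total).
// combos        = sampled sets of S chunk ids to treat as in-sample.
function pboCscv(candidates: number[][], chunkIds: number[][], combos: number[][]): number {
  const sharpe = (r: number[]): number => {
    const m = r.reduce((a, b) => a + b, 0) / r.length;
    const v = r.reduce((a, b) => a + (b - m) ** 2, 0) / (r.length - 1);
    return m / Math.sqrt(v);
  };
  // Concatenate candidate i's returns over the given chunks.
  const gather = (i: number, ids: number[]): number[] =>
    ids.flatMap((k) => chunkIds[k]).map((t) => candidates[i][t]);

  let below = 0;
  for (const is of combos) {
    const oos = chunkIds.map((_, k) => k).filter((k) => !is.includes(k));
    const isSharpes = candidates.map((_, i) => sharpe(gather(i, is)));
    const nStar = isSharpes.indexOf(Math.max(...isSharpes)); // IS-best candidate
    const oosSharpes = candidates.map((_, i) => sharpe(gather(i, oos)));
    // Fractional rank of n* among OOS Sharpes, kept strictly inside (0, 1).
    const rank = oosSharpes.filter((s) => s < oosSharpes[nStar]).length;
    const r = (rank + 1) / (candidates.length + 1);
    const lambda = Math.log(r / (1 - r)); // logit
    if (lambda < 0) below++; // IS-best ranked below the OOS median
  }
  return below / combos.length;
}
```

A strategy that dominates in every split gets λ > 0 in every combination, so PBO = 0; a lucky IS winner that is mediocre OOS pushes PBO toward 0.5 and beyond.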
Parameter: S
S (user-configurable 4–16) controls partition count.
Higher S = finer cross-validation but more combinations.
Default: 8 → 2S = 16 chunks, C(16, 8) = 12,870
combinations — sampled to 500.
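A quick way to see why the cap matters: C(2S, S) can be computed with the multiplicative binomial formula (`choose` is a hypothetical helper, not the tool's source), and it grows fast in S:

```typescript
// Multiplicative binomial coefficient; each partial product is itself
// a binomial coefficient, so intermediate values stay exact in doubles
// for the sizes used here.
function choose(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return Math.round(c);
}
```

At the default S = 8, choose(16, 8) gives the 12,870 combinations quoted above; at the maximum S = 16, choose(32, 16) exceeds 6 × 10⁸, which is why enumeration is replaced by sampling.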
Verdict bands
| PBO | Verdict |
|---|---|
| < 20% | Low overfitting risk — edge appears to generalize |
| 20–50% | Moderate overfitting risk — further OOS testing advised |
| ≥ 50% | High overfitting risk — classic signature of a lucky fit |
Assumptions + limitations
- Independent observations within chunks. The tool does not account for autocorrelation inside a chunk; heavily serially correlated returns will show inflated Sharpe and understated variance.
- Stationarity across chunks. Structural breaks (regime changes) can make PBO look either better or worse than reality. Inspect a rolling Sharpe plot of each strategy externally.
- Equal chunk size. The last chunk absorbs any remainder when T is not divisible by 2S.
- Sampling for large S. Combinations are capped at 500 via random shuffle; PBO is a Monte Carlo estimate with sampling error roughly 1/√500 ≈ 4.5 pp.
- No transaction costs. The tool uses the returns you provide; if they are gross, conclusions apply to gross Sharpe.
- DSR E[max] is an approximation. Bailey and Lopez de Prado give a tight asymptotic form; with N < 10 the approximation error grows.
Reproducibility
Source at src/components/tools/backtest-overfitting/BacktestOverfitting.tsx
in the aifinhub repository. All math is client-side JavaScript. The
same CSV + same S parameter will yield the same Sharpe
and DSR values deterministically; PBO will vary run-to-run within a
few percentage points due to sampling. Run 3–5 times for tighter
bounds.
References
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107.
- Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70.
- Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
- White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126.
- Mertens, E. (2002). "Variance of the IID estimator in Lo (2002)." Working paper.
Changelog
- 2026-04-20 — Initial release. PBO via CSCV with 500-combination sampling cap. DSR with Bailey-Lopez de Prado E[max SR*] approximation.