Two overfitting tests reject the same backtest winner for different reasons, which is exactly why a defensible report runs both. On the Backtest Overfitting Score sample tape (8 candidate strategies × 252 observations of noisy returns), the engine returns PBO = 0.726 (well above the 0.5 overfitting threshold) across 500 sampled CSCV combinations, with best strategy index 6 ("config_7", annualized Sharpe 1.22). The Deflated Sharpe Ratio on a representative input (observed Sharpe 1.6, n = 504, skew −0.4, kurt 4.0, num_trials = 200) returns PSR = 0.311, deflatedSr = −0.357, maxExpectedSr = 1.957. Both tests reject the apparent winner, and they reject it on different grounds: PBO on out-of-sample rank decay, the deflated Sharpe on multiple-testing-adjusted significance.
TL;DR
Same overfit tape (8 candidate strategies of noisy returns), two tests, two failure signatures:
| Test | What it measures | Output | Verdict |
|---|---|---|---|
| PBO (CSCV) | In-sample vs out-of-sample rank consistency | 0.726 | Reject: the in-sample winner is below median out-of-sample > 72% of the time |
| Deflated Sharpe | Sharpe significance after num_trials adjustment | PSR = 0.311 | Reject: the observed Sharpe is below the selection-bias benchmark |
PBO answers "does the in-sample best generalize?" DSR answers "is the headline Sharpe statistically real after correcting for trials?" Both reject the canonical overfit example; an overfit example built differently could fail one test and pass the other. That complementary information is the entire point of running both.
What each test captures
PBO via Combinatorially-Symmetric Cross-Validation (CSCV):
- Split the observation axis into 16 chunks (S = 8 means 2S = 16 chunks).
- Enumerate combinations of 8 chunks as in-sample (IS), 8 as out-of-sample (OOS).
- For each combination, find the IS Sharpe winner across the 8 strategies; check its OOS Sharpe rank.
- PBO = fraction of combinations where the IS winner ranks below OOS median.
PBO = 0.726 on the sample tape means: out of 500 CSCV combinations, about 363 had the IS winner ranking below OOS median. The IS-winner-out-of-sample-loser pattern is the textbook overfit signature, and a 72.6% rate is severe.
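The CSCV steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the function names `sharpe` and `pbo_cscv`, the near-equal chunking rule, and the rank convention (relative rank = rank / (N + 1), "below median" when under 0.5) are assumptions chosen for the example, and the parameters `s`, `max_combos`, and `seed` only mirror the input names shown in the verified output. It is not expected to reproduce the engine's 0.726, since the sampled combinations will differ.

```python
import itertools
import random


def sharpe(xs):
    """Per-period Sharpe ratio; returns 0.0 for a zero-variance series."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean / var ** 0.5 if var > 0 else 0.0


def pbo_cscv(returns, s=8, max_combos=500, seed=42):
    """Estimate PBO from `returns`, a list of equal-length per-strategy return lists."""
    n_strats, n_obs = len(returns), len(returns[0])
    n_chunks = 2 * s
    # Split the observation axis into 2S contiguous chunks of near-equal size.
    bounds = [round(i * n_obs / n_chunks) for i in range(n_chunks + 1)]
    chunks = [range(bounds[i], bounds[i + 1]) for i in range(n_chunks)]

    # Every way to pick S chunks as in-sample; subsample when there are too many.
    combos = list(itertools.combinations(range(n_chunks), s))
    if len(combos) > max_combos:
        combos = random.Random(seed).sample(combos, max_combos)

    below_median = 0
    for is_chunks in combos:
        chosen = set(is_chunks)
        is_idx = [t for c in range(n_chunks) if c in chosen for t in chunks[c]]
        oos_idx = [t for c in range(n_chunks) if c not in chosen for t in chunks[c]]
        is_sr = [sharpe([r[t] for t in is_idx]) for r in returns]
        oos_sr = [sharpe([r[t] for t in oos_idx]) for r in returns]
        winner = max(range(n_strats), key=is_sr.__getitem__)
        # OOS rank of the IS winner: 1 = worst, n_strats = best.
        rank = sum(x <= oos_sr[winner] for x in oos_sr)
        if rank / (n_strats + 1) < 0.5:  # IS winner sits below the OOS median
            below_median += 1
    return below_median / len(combos)


# Usage on a hypothetical pure-noise tape (8 strategies x 252 observations).
rng = random.Random(0)
noise_tape = [[rng.gauss(0.0, 0.01) for _ in range(252)] for _ in range(8)]
pbo = pbo_cscv(noise_tape)
```

A strategy with a genuine, persistent edge wins in-sample and stays on top out-of-sample in every split, so its PBO under this sketch is 0; pure noise has no such anchor.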
Deflated Sharpe (Bailey–Lopez de Prado 2014):
- Take the observed Sharpe, sample size T, skew, kurt, and num_trials.
- Compute the expected max Sharpe under the null (zero true edge across all candidates).
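- Compute the z-score: (observed Sharpe − expected max Sharpe under null) × √(T−1) / √(1 − skew × SR + ((kurt − 1) / 4) × SR²), where SR is the observed Sharpe and all Sharpe values are expressed per period rather than annualized.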
- PSR = Φ(z). Reject the strategy if PSR < 0.95.
On the canonical input, the engine returns maxExpectedSr = 1.957 (the selection-bias benchmark for 200 trials, in the same Sharpe units as the observed input), deflatedSr = −0.357, PSR = 0.311. The observed Sharpe (1.6) is below the engine's benchmark of 1.96, so PSR sits well under the 0.95 gate and the strategy is rejected.
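The DSR steps above can be written out as a short closed-form computation. The sketch below is a hedged reconstruction, not the engine's code: the function names are hypothetical, the expected-maximum approximation is the standard Euler-Mascheroni blend of two normal quantiles, and the cross-trial variance of per-period Sharpe estimates is assumed to be 1/(n−1). That last assumption is not stated anywhere in the text; it is used here because, under it, the sketch lands on the values listed under Verified engine output (psr ≈ 0.311, z ≈ −0.493, max expected sr ≈ 1.957).

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_z(num_trials):
    """Approximate expected maximum of `num_trials` standard normals (needs >= 2)."""
    if num_trials < 2:
        raise ValueError("num_trials must be at least 2 for this approximation")
    nd = NormalDist()
    return ((1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / num_trials)
            + EULER_GAMMA * nd.inv_cdf(1 - 1 / (num_trials * e)))


def deflated_sharpe(observed_sr, n, skew, kurt, num_trials, periods_per_year=252):
    """Return benchmark, deflated Sharpe, z-score and PSR for an annualized Sharpe."""
    ann = sqrt(periods_per_year)
    sr_pp = observed_sr / ann  # per-period Sharpe
    # ASSUMPTION: cross-trial variance of per-period Sharpe estimates is 1/(n-1).
    benchmark_pp = expected_max_z(num_trials) / sqrt(n - 1)
    # Standard error correction for skewness and (non-excess) kurtosis.
    denom = sqrt(1 - skew * sr_pp + (kurt - 1) / 4 * sr_pp ** 2)
    z = (sr_pp - benchmark_pp) * sqrt(n - 1) / denom
    return {
        "max_expected_sr": benchmark_pp * ann,
        "deflated_sr": observed_sr - benchmark_pp * ann,
        "z": z,
        "psr": NormalDist().cdf(z),
    }


result = deflated_sharpe(observed_sr=1.6, n=504, skew=-0.4, kurt=4.0, num_trials=200)
```

Raising `num_trials` pushes the benchmark up and PSR down for the same observed Sharpe, which is the deflation the test is named for.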
Why the two answer different questions
PBO and DSR measure different aspects of overfitting:
- PBO measures generalization. Does the strategy that wins in-sample also win out-of-sample, or does it fall to a different strategy? PBO catches strategies that are "best at fitting the in-sample tape" without being "best at predicting the OOS tape." This is overfitting of the selection procedure.
- DSR measures Sharpe inflation. Given the number of trials, how much of the observed Sharpe is consistent with pure luck? DSR catches strategies whose Sharpe looks good only because the researcher tested many alternatives. This is overfitting of the test statistic.
A strategy can pass DSR (Sharpe is robust to trial count) but fail PBO (the same in-sample winner under-performs OOS). A strategy can pass PBO (the in-sample winner generalizes) but fail DSR (the Sharpe is not statistically distinguishable from luck once you account for trial count).
For the sample tape, both fail: the tape is 8 candidate strategies of noisy returns with no genuine edge, which is the textbook overfit-by-search setup — picking the in-sample best of 8 noise series produces an apparent winner that does not generalize. Real backtests fail in mixed patterns: some fail DSR but pass PBO (the in-sample winner does generalize, but the Sharpe is too small to defend after trial correction), some fail PBO but pass DSR (the headline Sharpe is real but the selection procedure is unstable).
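The overfit-by-search setup is easy to stage. The sketch below is illustrative only: it builds a hypothetical tape of 8 pure-noise strategies (the seed, the 1% daily volatility, and the half/half split are arbitrary assumptions, not the engine's sample tape), picks the in-sample Sharpe winner on the first half, and reports how that same strategy did on the second half.

```python
import random
from math import sqrt


def annualized_sharpe(xs, periods_per_year=252):
    """Annualized Sharpe ratio using the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean / sd * sqrt(periods_per_year)


rng = random.Random(7)  # arbitrary seed, for repeatability only
# Hypothetical tape: 8 strategies x 252 observations with zero true edge.
tape = [[rng.gauss(0.0, 0.01) for _ in range(252)] for _ in range(8)]

half = 126
is_sharpes = [annualized_sharpe(r[:half]) for r in tape]
oos_sharpes = [annualized_sharpe(r[half:]) for r in tape]
winner = max(range(8), key=is_sharpes.__getitem__)

print(f"IS winner: strategy {winner}, "
      f"IS Sharpe {is_sharpes[winner]:.2f}, OOS Sharpe {oos_sharpes[winner]:.2f}")
```

Because every series is noise, the in-sample winner's out-of-sample Sharpe is unrelated to its in-sample figure; rerunning with other seeds shows the winner changing from run to run.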
The four-quadrant decision table
A defensible quant report runs both and reports the four-quadrant verdict:
| PBO | DSR (PSR) | Interpretation |
|---|---|---|
| < 0.2 | > 0.95 | Edge appears to generalize; size up cautiously. |
| < 0.2 | 0.50–0.95 | IS/OOS consistency is good but Sharpe significance is marginal. Small edge or noisy returns. |
| 0.2–0.5 | > 0.95 | Sharpe is statistically real but cross-validation is uncertain. Investigate regime dependence. |
| > 0.5 | < 0.5 | Classic overfit signature. Stop. |
The sample tape lands in the fourth row: PBO = 0.726 > 0.5 and PSR = 0.311 < 0.5. Stop.
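The table translates directly into a lookup. The helper below is a hypothetical sketch (the name `joint_verdict` is not from the engine); it encodes only the four rows listed and returns an explicit fallback for combinations the table does not cover, rather than guessing a verdict for them.

```python
def joint_verdict(pbo, psr):
    """Map a (PBO, PSR) pair onto the rows of the decision table."""
    if pbo < 0.2 and psr > 0.95:
        return "Edge appears to generalize; size up cautiously."
    if pbo < 0.2 and 0.50 <= psr <= 0.95:
        return "IS/OOS consistency is good but Sharpe significance is marginal."
    if 0.2 <= pbo <= 0.5 and psr > 0.95:
        return "Sharpe is statistically real but cross-validation is uncertain."
    if pbo > 0.5 and psr < 0.5:
        return "Classic overfit signature. Stop."
    # The table lists four rows only; everything else needs case-by-case review.
    return "Not covered by the table; review the two tests individually."


# PBO from the sample tape, PSR from the verified engine output table.
verdict = joint_verdict(pbo=0.726, psr=0.311)
```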
When the two disagree: the diagnostic value
The interesting cases are off-diagonal. A strategy with PBO 0.15 (passes generalization) and DSR PSR 0.40 (fails Sharpe significance) has a small but real edge that needs more data: the IS-OOS pattern is consistent, but the absolute Sharpe is too small to defend at the current trial count. The fix: extend the tape, not redo the search.
A strategy with PBO 0.65 (fails generalization) and DSR PSR 0.96 (passes Sharpe significance) is one where the headline Sharpe is statistically real but the specific candidate was picked unstably. The fix: re-run the search with a different random seed and see whether the same candidate emerges. If a different candidate wins, both candidates are suspect; if the same candidate wins repeatedly, the PBO false-alarm explanation deserves weight (rare, but possible for strategies with very stable parameter sensitivities).
The cost of running both
PBO via CSCV with S = 8 enumerates C(16, 8) = 12,870 combinations. The engine samples 500 of these, which gives a standard error of about 0.02 on the PBO estimate (roughly ±0.04 at two standard errors); full enumeration takes about 25× as long and removes the sampling error altogether. For a 252-observation tape with 8 strategies, the 500-combination run completes in well under a second on a typical browser.
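The counts and the sampling precision can be checked with a few lines of arithmetic. The sketch below assumes the 500 sampled combinations behave like independent draws, so the binomial standard error applies; sampling without replacement from a finite set would make the true figure slightly smaller.

```python
from math import comb, sqrt

total_combos = comb(16, 8)   # ways to choose 8 in-sample chunks out of 16
sampled = 500
pbo_hat = 0.726              # PBO reported for the sample tape

# Binomial standard error of a proportion estimated from `sampled` draws.
std_error = sqrt(pbo_hat * (1 - pbo_hat) / sampled)
cost_ratio = total_combos / sampled  # relative cost of full enumeration

print(total_combos, round(std_error, 3), round(cost_ratio, 1))
```

Even two standard errors below the estimate, PBO stays well above the 0.5 threshold, so sampling error does not change the verdict on this tape.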
DSR is a closed-form computation: one normal-CDF evaluation plus a small amount of arithmetic. Sub-millisecond.
Together the two tests cost essentially nothing in compute. There is no defensible reason to skip either; the gap they leave when run together is the regime-change risk, which is what walk-forward validation addresses (see /articles/walk-forward-window-sizing-decision/).
Where the joint approach breaks
Both tests assume the candidate set is fixed before testing. A workflow that adds strategies after seeing initial results (the most common retail pattern) has an effective trial count larger than the literal candidate count. Neither test can detect this; the discipline of pre-registering the candidate set is the only defence.
Both tests assume returns are stationary across the tape. A regime switch mid-tape (volatility regime change, structural break in fundamental conditions) makes the OOS rank a property of the regime, not the strategy. PBO will flag the regime fragility as overfitting; DSR will flag the inflated mid-tape Sharpe as luck. Neither distinguishes overfitting from genuine regime change; additional tests (see The Sharpe Ratio Trap) are required.
The PBO test's sensitivity to the chunking parameter S is moderate; S = 4 (8 chunks) is too coarse for tapes under 100 observations; S = 16 (32 chunks) is overkill for tapes above 1000 observations. The canonical S = 8 is appropriate for tapes between 100 and 1000 observations.
Related in this series
This is the head-to-head entry in the overfitting-diagnostics series: both tests run on a single return matrix to show they reject the same winner for different reasons. Read alongside:
- Deflated Sharpe Ratio: Derivation and Worked Example: the series pillar, covering the DSR formula, its extreme-value origins, and the deflation table.
- PBO Explained for LLM Strategies: the PBO formulation in depth and why LLM searches inflate it.
- Deflated Sharpe in Low-Trial Regimes: how DSR behaves as the trial budget shrinks toward 1.
- Did You Overfit? PBO + Deflated Sharpe: the runnable ~80-line how-to that computes both from a CSV.
- PBO Score on an Eight-Strategy Matrix: a fully worked PBO computation on a fixed 8-strategy candidate matrix.
Connects to
- Did You Overfit? PBO and Deflated Sharpe: the runnable Python tutorial implementing both tests in 80 lines.
- Deflated Sharpe Derivation Worked Example: step-by-step derivation of the DSR formula.
- Backtest Overfitting in LLM Strategies: PBO Explained: LLM-strategy context for the PBO test.
- Deflated Sharpe Ratio: engine endpoint.
- Backtest Overfitting Score: engine endpoint.
- Walk-Forward Validator: companion test for regime-change risk.
References
- Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70. SSRN abstract https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. https://jpm.pm-research.com/content/40/5/94
- Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism." Notices of the American Mathematical Society 61(5), 458–471. https://www.ams.org/notices/201405/rnoti-p458.pdf
- Lopez de Prado, M. (2018). Advances in Financial Machine Learning, chapter on backtest statistics. https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
- Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28.
Verified engine output
Backtest Overfitting Score inputs:

| Input | Value |
|---|---|
| s | 8 |
| max_combos | 500 |
| seed | 42 |
| names (8 items) | [...] |
| returns (8 items) | [...] |

Backtest Overfitting Score outputs:

| Output | Value |
|---|---|
| n strategies | 8 |
| n observations | 252 |
| strategies (8 items) | [...] |
| pbo | 0.726 |
| combinations tested | 500 |
| s | 8 |
| best strategy index | 6 |
| best strategy name | config_7 |

Computed live at build time.
Deflated Sharpe Ratio inputs:

| Input | Value |
|---|---|
| observed_sr | 1.6 |
| n | 504 |
| skew | -0.4 |
| kurt | 4 |
| num_trials | 200 |
| periods_per_year | 252 |

Deflated Sharpe Ratio outputs:

| Output | Value |
|---|---|
| psr | 0.3108853828672815 |
| z | -0.49334226197195025 |
| max expected sr | 1.9574636000454002 |
| effective benchmark | 1.9574636000454002 |
| deflated sr | -0.3574636000454001 |

Computed live at build time.
Frequently asked questions
- Why does the PBO test sample only 500 combinations instead of all 12,870?
- Sampling 500 produces a standard error of around 0.02, below the 0.05 decision-threshold sensitivity. Full enumeration takes about 25× longer and only removes sampling error that is already below that sensitivity, which is not worth it for retail-scale tapes.
- Can DSR PSR exceed 1.0?
- No — it is a probability bounded in [0, 1]. On well-behaved inputs the engine returns values in (0, 1); on degenerate inputs it can return 0 or 1 at the boundaries.
- What's a good PBO?
- Below 0.2 is conventional for a well-behaved strategy. 0.2–0.5 is the gray zone needing walk-forward validation. Above 0.5 is the classic overfit signature.
- Can a strategy fail PBO and DSR independently?
- Yes. PBO can fail when the IS rank is unstable but absolute Sharpe is fine; DSR can fail when absolute Sharpe is too small relative to trial count. The four-quadrant table distinguishes these patterns.
- Are there other tests I should run?
- Walk-forward validation catches regime change; survivor-bias audit catches universe-selection issues; trading-cost audit catches the 'looks good before costs' mode. The two tests are necessary but not sufficient.