Eight candidate strategies tested on 252 daily observations of a documented synthetic equity-return matrix produce a probability of backtest overfitting (PBO) of 0.726 on 500 CSCV combinations. The in-sample best strategy — config_7, the highest annualised Sharpe at 1.22 — under-performs the out-of-sample median in roughly 73% of the tested splits. The Backtest Overfitting Score tool's verdict on this scenario is unambiguous: this is overfitting. And every one of the eight deflated Sharpes returns 0, so even the in-sample leader has no statistically significant edge after the multiple-testing correction. High PBO plus zero deflated Sharpe is the textbook discard signal. The full run, including all eight per-strategy statistics, is in the Verified engine output section below.

TL;DR

  • Scenario: 8 strategies × 252 daily observations × 500 CSCV combinations, seed 42.
  • Best in-sample strategy: config_7 (annualised Sharpe 1.22 — the highest in the table).
  • PBO = 0.726: the in-sample winner ranks below the OOS median in ~73% of tested combinations.
  • All 8 deflated Sharpes return 0 — the test detects no statistically significant edge after deflation.
  • The two results together: the in-sample winner does not generalise, and no candidate has provable edge. Classic overfit — discard.

The scenario

The Backtest Overfitting Score engine was run on the published contract sample matrix: 8 strategies × 252 daily observations of synthetic returns. The eight series have small, mixed means and similar standard deviations near 0.009. Their annualised Sharpes span a realistic range, with several negative and the best a modest 1.22:

| Strategy | Annualised Sharpe (live engine output) |
| --- | --- |
| config_1 | −0.27 |
| config_2 | −1.04 |
| config_3 | 0.86 |
| config_4 | −1.97 |
| config_5 | 0.31 |
| config_6 | 1.00 |
| config_7 | 1.22 (best in-sample) |
| config_8 | 1.16 |

None of these Sharpes is implausible — they are the spread you would expect from eight noisy candidate strategies over a single year of daily data. The engine returns deflated Sharpe = 0 for every one of them. (The exact per-strategy mean, stdev, skew, kurtosis, and Sharpe values are rendered live in the Verified engine output block below.)
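
For readers who want to see how a daily series turns into an annualised Sharpe, here is a minimal sketch. It assumes a zero risk-free rate, the sample standard deviation, and 252 periods per year; the function name `annualised_sharpe` and the toy series are illustrative, not the engine's code or data.

```python
import math
import statistics


def annualised_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample standard deviation, scaled by sqrt(periods per year).

    Assumes a zero risk-free rate (an assumption of this sketch).
    """
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)  # sample stdev, n - 1 denominator
    if stdev == 0:
        raise ValueError("returns have zero variance; Sharpe is undefined")
    return mean / stdev * math.sqrt(periods_per_year)


# Toy series for illustration only, not the engine's data.
example = [0.010, -0.005, 0.002, 0.007, -0.003]
print(round(annualised_sharpe(example), 2))
```

The square-root scaling is why a small daily mean can look respectable once annualised, which is part of what the deflation step is meant to correct for.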

What PBO = 0.726 means

The CSCV algorithm splits the 252-observation series into 16 contiguous blocks (s = 8, so 2s = 16 blocks of ~15 observations each), then enumerates ways to assign 8 blocks to in-sample and 8 to out-of-sample. There are $\binom{16}{8} = 12{,}870$ such splits; the engine sampled 500 of them with a seeded shuffle. In each split it finds the strategy with the highest in-sample mean, then checks where that strategy ranks out of sample. PBO is the fraction of splits where the in-sample winner lands below the out-of-sample median.
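
The block arithmetic above can be checked with a short sketch. Two details are assumptions here, since the text does not specify them: the remainder rows are spread over the first blocks, and the 500 splits are taken from a seeded shuffle of the full enumeration. The function names `block_bounds` and `sample_splits` are hypothetical, and the sampled splits will not match the engine's.

```python
import itertools
import math
import random


def block_bounds(n_obs, n_blocks):
    """Return (start, end) index pairs for contiguous, near-equal blocks."""
    base, extra = divmod(n_obs, n_blocks)
    bounds, start = [], 0
    for i in range(n_blocks):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more row
        bounds.append((start, start + size))
        start += size
    return bounds


def sample_splits(n_blocks, max_combos, seed):
    """Pick up to max_combos in-sample block sets; the complement is out-of-sample."""
    combos = list(itertools.combinations(range(n_blocks), n_blocks // 2))
    random.Random(seed).shuffle(combos)
    return combos[:max_combos]


bounds = block_bounds(252, 16)
splits = sample_splits(16, 500, seed=42)
print(math.comb(16, 8))                                # 12870
print(sorted({end - start for start, end in bounds}))  # [15, 16]
print(len(splits))                                     # 500
```

Because 252 is not divisible by 16, the blocks cannot all be the same length; under the assumption above, twelve blocks hold 16 rows and four hold 15.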

Here PBO = 0.726. In nearly three-quarters of the splits, the strategy that looked best in-sample under-performed out of sample. For independent strategies with real edge, PBO < 0.2 is the threshold for "the in-sample winner generalises." This scenario sits far on the wrong side of that line: the in-sample ranking is not predictive of out-of-sample ranking.
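
To make the definition concrete, the following is a minimal sketch of a PBO computation along the lines described above: rank by in-sample mean, then count how often the winner falls strictly below the out-of-sample median. It is not the engine's implementation. The function name `pbo`, the handling of remainder rows, and the shuffle-based sampling are assumptions, so it should not be expected to reproduce 0.726 on the contract matrix.

```python
import itertools
import random
import statistics


def pbo(returns, s=8, max_combos=500, seed=42):
    """Fraction of sampled splits where the in-sample winner is below the OOS median.

    returns: list of per-strategy return lists, all of the same length.
    """
    n_obs, n_blocks = len(returns[0]), 2 * s
    base, extra = divmod(n_obs, n_blocks)
    edges = [0]
    for i in range(n_blocks):
        edges.append(edges[-1] + base + (1 if i < extra else 0))
    blocks = [range(edges[i], edges[i + 1]) for i in range(n_blocks)]

    # Assumption: sample splits by shuffling the full enumeration with a seed.
    combos = list(itertools.combinations(range(n_blocks), s))
    random.Random(seed).shuffle(combos)
    tested = combos[:max_combos]

    below = 0
    for in_sample in tested:
        chosen = set(in_sample)
        is_rows = [t for b in range(n_blocks) if b in chosen for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in chosen for t in blocks[b]]
        is_mean = [statistics.fmean([r[t] for t in is_rows]) for r in returns]
        oos_mean = [statistics.fmean([r[t] for t in oos_rows]) for r in returns]
        winner = max(range(len(returns)), key=is_mean.__getitem__)
        if oos_mean[winner] < statistics.median(oos_mean):
            below += 1
    return below / len(tested)


# Toy input: eight pure-noise series, not the contract matrix.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.009) for _ in range(252)] for _ in range(8)]
print(pbo(noise))
```

A useful sanity check on any such implementation is a matrix with one overwhelmingly dominant strategy: it should win every in-sample half and stay above the median out of sample, giving a PBO of 0.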

Why this result is the honest, common case

A high PBO on a matrix of broadly similar candidate strategies is exactly what you should expect when none of the candidates has durable edge. With eight noisy series and only 252 observations, whichever strategy happens to lead in any given in-sample window is largely luck, and luck does not repeat out of sample. The engine is detecting that directly: rank the candidates in-sample, and the leader has worse than coin-flip odds of finishing even in the top half out of sample.

A real-world equivalent is eight parameter variants of the same momentum signal, each fit slightly differently. The one with the best in-sample Sharpe is the one that best fit the in-sample noise — and it pays for that fit out of sample. PBO = 0.726 is the number that catches this before the capital does.

What the deflated Sharpe says

The engine returns deflatedSharpe = 0 for every strategy. The Deflated Sharpe Ratio [1] tests whether the in-sample Sharpe is statistically significant given:

  1. The number of trials (8 in this case).
  2. The skew and kurtosis of the return distribution.
  3. The sample size (252 observations).

A deflated Sharpe of 0 means the test cannot reject the null that the true Sharpe is zero, after adjusting for selection bias and non-normality. The best annualised Sharpe of 1.22 looks respectable on paper; the deflated value of 0 says it does not survive multiple-testing correction [2], the same data-snooping problem that White's reality check formalises [3].
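
The three inputs listed above map onto the published Deflated Sharpe Ratio formula, sketched below using only the Python standard library. This follows the formula in Bailey and Lopez de Prado's paper as a probability between 0 and 1, with all Sharpes in per-observation (not annualised) units and kurtosis in its non-excess form (3 for a normal distribution). The function names and toy inputs are hypothetical, and the engine's own implementation may differ in units or conventions, so this sketch need not reproduce its values.

```python
import math
from statistics import NormalDist, variance

EULER_GAMMA = 0.5772156649015329
_norm = NormalDist()


def expected_max_sharpe(trial_sharpes):
    """Expected maximum Sharpe across N zero-skill trials (requires N >= 2)."""
    n = len(trial_sharpes)
    spread = math.sqrt(variance(trial_sharpes))  # dispersion of Sharpes across trials
    return spread * (
        (1 - EULER_GAMMA) * _norm.inv_cdf(1 - 1 / n)
        + EULER_GAMMA * _norm.inv_cdf(1 - 1 / (n * math.e))
    )


def deflated_sharpe(sharpe, trial_sharpes, n_obs, skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe exceeds the expected maximum under the null.

    sharpe and trial_sharpes are per-observation Sharpes; kurtosis is non-excess.
    """
    sr0 = expected_max_sharpe(trial_sharpes)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return _norm.cdf((sharpe - sr0) * math.sqrt(n_obs - 1) / denom)


# Toy per-observation Sharpes for illustration only.
trials = [0.01, -0.02, 0.03, 0.00, 0.05]
print(deflated_sharpe(max(trials), trials, n_obs=252))
```

The benchmark `expected_max_sharpe` grows with both the number of trials and the spread of their Sharpes, which is how the test penalises picking the best of many candidates.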

The two results in combination: PBO = 0.726 (the in-sample winner does not generalise) and deflated Sharpe = 0 for all candidates (no real edge). The honest interpretation: these eight candidates are noise dressed as strategies, and the apparent leader is a selection artefact.

When PBO is informative

PBO is informative when the candidate strategies are genuinely independent and the sample is long enough to give the test power. Three checks before trusting the PBO number:

  1. Pairwise correlation matrix. Expect most off-diagonal values below 0.3 for truly diverse candidates. Highly correlated candidates make ranking artificially stable; that produces an artificially low PBO, the opposite failure mode from this scenario.
  2. Strategy diversity by construction. A momentum strategy and a mean-reversion strategy on the same universe are more independent than two momentum strategies with different lookback windows.
  3. Sample size scaling. PBO at n = 60 (the engine's hard floor) is barely informative; at n = 500+, the test has real power. This run's 252 observations give moderate power. The Walk-Forward Validator supplements with rolling out-of-sample windows.
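
The first check can be sketched in a few lines. The helper names `pearson` and `correlated_pairs` are hypothetical, and comparing the absolute correlation against 0.3 is an assumption of this sketch; the toy series are illustrative only.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlated_pairs(returns, names, threshold=0.3):
    """List strategy pairs whose absolute correlation is at or above the threshold."""
    flagged = []
    for i, j in itertools.combinations(range(len(returns)), 2):
        rho = pearson(returns[i], returns[j])
        if abs(rho) >= threshold:
            flagged.append((names[i], names[j], rho))
    return flagged


# Toy series: b is a scaled copy of a, c is uncorrelated with both.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
c = [1.0, -1.0, -1.0, 1.0]
print(correlated_pairs([a, b, c], ["a", "b", "c"]))
```

If most pairs are flagged, the candidate set is closer to one bet than to several, and a low PBO on it deserves suspicion.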

How PBO and Deflated Sharpe combine

Four combinations:

| PBO | Deflated Sharpe | Interpretation |
| --- | --- | --- |
| Low (< 0.2) | High (> 95%) | Edge appears real and generalises. Best case. |
| Low (< 0.2) | Low (< 50%) | Ranking stable, edge not significant. Often correlated candidates. |
| High (> 0.5) | High (> 95%) | Statistically real, doesn't generalise. Regime change suspect. |
| High (> 0.5) | Low (< 50%) | Classic overfit. Discard. This scenario (PBO 0.726, DSR 0). |

This run lands squarely in the bottom row: high PBO, zero deflated Sharpe. The in-sample winner does not survive out of sample and has no significant edge to begin with. It is the cleanest "do not deploy" signal the two tests produce together [4].
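
The four-way reading can be written as a small lookup, sketched here with a hypothetical function name `interpret`. The thresholds come from the table; the "Indeterminate" label for values that fall between the bands is an assumption of this sketch, since the table does not cover them. The deflated Sharpe is expressed as a probability between 0 and 1.

```python
def interpret(pbo, deflated_sharpe):
    """Map a (PBO, deflated Sharpe) pair onto the four-way reading.

    deflated_sharpe is a probability in [0, 1]; 0.95 corresponds to 95%.
    """
    low_pbo, high_pbo = pbo < 0.2, pbo > 0.5
    high_dsr, low_dsr = deflated_sharpe > 0.95, deflated_sharpe < 0.50

    if low_pbo and high_dsr:
        return "Edge appears real and generalises. Best case."
    if low_pbo and low_dsr:
        return "Ranking stable, edge not significant. Often correlated candidates."
    if high_pbo and high_dsr:
        return "Statistically real, doesn't generalise. Regime change suspect."
    if high_pbo and low_dsr:
        return "Classic overfit. Discard."
    # Assumption: values between the bands get no verdict.
    return "Indeterminate: between the bands."


print(interpret(0.726, 0.0))  # Classic overfit. Discard.
```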

Recommendations on this output

  1. Do not deploy config_7. It is the in-sample leader and the out-of-sample laggard; that is the definition of an overfit selection.
  2. Treat the whole candidate set as edgeless at this sample size. Deflated Sharpe = 0 across the board means none of the eight clears the multiple-testing bar.
  3. Diversify by construction, not by parameter. Eight variants of one signal are one bet, not eight. Add genuinely different strategy families.
  4. Run on real returns. A PBO computed on synthetic data is a sanity check on the test infrastructure and a teaching example, not a verdict on any live candidate.

Failure modes

  • Quoting annualised Sharpe without deflation. config_7's 1.22 looks fine; the deflated value (0) tells the real story.
  • Reading a high PBO as a data error. PBO = 0.726 is not a bug; it is the correct answer for an edgeless candidate set.
  • Quoting PBO without checking strategy independence. A very low PBO can be an artefact of correlated candidates, the inverse trap to this one.
  • Sample size too small. Under 60 observations the test refuses to run; under 200, the test has weak power even when assumptions hold.

FAQ

Is a high PBO always bad?

A high PBO always means the in-sample ranking does not carry over out of sample; what to conclude from that depends on the deflated Sharpe. Here, a PBO of 0.726 means the in-sample winner under-performs out of sample in most splits, and the zero deflated Sharpe reinforces it as a clear discard signal. Read the two together: high PBO with zero DSR is the textbook overfit, not a near-miss.

How many strategies do I need to test for PBO to be meaningful?

The CSCV algorithm runs on as few as 2 strategies, but the test is most informative with 5-20. With more than 20, the multiple-comparison adjustment in deflated Sharpe dominates and individual strategies' significance collapses.

What sample size does PBO need?

Minimum 60 observations (the engine's hard floor). For useful power to distinguish a PBO of 0.5 from a PBO of 0.2, plan for 500+ observations. This run's 252 daily observations give moderate power; daily data on a 2-year window is stronger, and monthly data needs 5+ years just to reach the 60-observation floor.
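
The sample-size guidance quoted in this article can be collected into a small guard. The function name `power_band` is hypothetical, and treating everything from 200 up to 499 observations as "moderate" is an inference from the figures given (252 is moderate, under 200 is weak, 500+ has real power), not a documented engine rule.

```python
def power_band(n_obs):
    """Rough power label for a given number of observations."""
    if n_obs < 60:
        raise ValueError("fewer than 60 observations: below the engine's hard floor")
    if n_obs < 200:
        return "weak"
    if n_obs < 500:
        return "moderate"  # assumption: band inferred from the article's figures
    return "real power"


for label, n_obs in [("1 year daily", 252), ("2 years daily", 504), ("5 years monthly", 60)]:
    print(label, n_obs, power_band(n_obs))
# 1 year daily 252 moderate
# 2 years daily 504 real power
# 5 years monthly 60 weak
```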

This is the worked-matrix entry in the overfitting-diagnostics series: a single fixed 8-strategy candidate matrix carried all the way through a PBO computation.

Footnotes

  1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. jpm.pm-research.com

  2. Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 41(4), 13–28. pm-research.com

  3. White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126. jstor.org

  4. Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2017). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–69. papers.ssrn.com

Verified engine output

Backtest Overfitting Score: 8 strategies × 252 synthetic daily observations, s=8, 500 combinations, seed 42

Inputs
  s: 8
  max_combos: 500
  seed: 42
  returns: (8 items) [...]
  names: (8 items) [...]

Result
  n strategies: 8
  n observations: 252
  strategies: (8 items) [...]
  pbo: 0.726
  combinations tested: 500
  s: 8
  best strategy index: 6
  best strategy name: config_7

Computed live at build time.