PBO Score on an Eight-Strategy Matrix

Eight candidate strategies tested on 252 daily observations of a documented synthetic equity-return matrix produce a probability of backtest overfitting (PBO) of 0.726 on 500 CSCV combinations. The in-sample best strategy — config_7, the highest annualised Sharpe at 1.22 — under-performs the out-of-sample median in roughly 73% of the tested splits. The Backtest Overfitting Score tool's verdict on this scenario is unambiguous: this is overfitting. And not one of the eight deflated Sharpes clears the gate; the best, config_7, reaches only 0.41 (a 41% probability the edge is real), so even the in-sample leader is not statistically significant after the multiple-testing correction. High PBO plus a deflated Sharpe well below the 0.95 bar is the textbook discard signal. The full run, including all eight per-strategy statistics, is in the Verified engine output section below.

TL;DR

Scenario: 8 strategies × 252 daily observations × 500 CSCV combinations, seed 42.
Best in-sample strategy: config_7 (annualised Sharpe 1.22 — the highest in the table).
PBO = 0.726: the in-sample winner ranks below the OOS median in ~73% of tested combinations.
All 8 deflated Sharpes fall below the 0.95 gate — the best is config_7 at 0.41, so no candidate has a statistically significant edge after deflation.
The two results together: the in-sample winner does not generalise, and no candidate has provable edge. Classic overfit — discard.

The scenario

The Backtest Overfitting Score engine took the published contract sample matrix: 8 strategies × 252 daily observations of synthetic returns. The eight series have small, mixed means and similar standard deviations near 0.009. Their annualised Sharpes span a realistic range — several negative, the best a modest 1.22:

Strategy	Annualised Sharpe (live engine output)
config_1	−0.27
config_2	−1.04
config_3	0.86
config_4	−1.97
config_5	0.31
config_6	1.00
config_7	1.22 (best in-sample)
config_8	1.16

None of these Sharpes is implausible; they are the spread you would expect from eight noisy candidate strategies over a single year of daily data. The engine returns a deflated Sharpe below the 0.95 gate for every one of them. The highest, config_7, is just 0.41. (The exact per-strategy mean, stdev, skew, kurtosis, and Sharpe values are rendered live in the Verified engine output block below.)
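The annualised Sharpe figures in the table can be approximated from a daily return series along the following lines. This is a minimal sketch, assuming a zero risk-free rate, a sample (n − 1) standard deviation, and a 252-day annualisation factor. The engine's exact conventions are not shown in this article, and `annualised_sharpe` is a hypothetical helper name.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualisation factor for daily data


def annualised_sharpe(daily_returns, periods_per_year=TRADING_DAYS):
    """Mean over sample stdev, scaled by sqrt(periods per year).

    Assumes a zero risk-free rate; the engine's convention may differ.
    """
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)  # sample stdev (n - 1 denominator)
    return mean / stdev * math.sqrt(periods_per_year)


# Hypothetical series: alternating +1.0% / -0.8% days, stdev close to 0.009
example = [0.010, -0.008] * 126
print(round(annualised_sharpe(example), 2))
```

A mean daily return of only 0.1% against a 0.9% daily standard deviation already annualises to a Sharpe well above 1, which is why Sharpes like those in the table are easy to obtain by chance over a single year.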

What PBO = 0.726 means

The CSCV (combinatorially symmetric cross-validation) algorithm splits the 252-observation series into 16 contiguous blocks (s = 8, so 2s = 16 blocks; 252 / 16 = 15.75, so roughly 15 observations each), then enumerates the ways to assign 8 blocks to in-sample and the remaining 8 to out-of-sample. There are $\binom{16}{8} = 12{,}870$ such splits; the engine sampled 500 of them with a seeded shuffle. In each split it finds the strategy with the highest in-sample mean, then checks where that strategy ranks out of sample. PBO is the fraction of splits where the in-sample winner lands below the out-of-sample median.
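The procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the engine's code: `pbo_cscv` is a hypothetical name, the trailing remainder after block division is assumed to be dropped, the selection metric is assumed to be the mean return, and Python's `random.Random(seed).shuffle` stands in for the engine's seeded shuffle. It will therefore not reproduce the 0.726 reported here.

```python
import itertools
import random


def pbo_cscv(returns, s=8, max_combos=500, seed=42):
    """Estimate PBO. `returns` is a list of equal-length per-strategy return lists."""
    n_strat = len(returns)
    n_blocks = 2 * s
    block_len = len(returns[0]) // n_blocks  # assumption: remainder is dropped
    if block_len == 0:
        raise ValueError("not enough observations for 2s blocks")
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    # All C(2s, s) choices of in-sample blocks, sampled via a seeded shuffle
    combos = list(itertools.combinations(range(n_blocks), s))
    random.Random(seed).shuffle(combos)
    combos = combos[:max_combos]

    below_median = 0
    for in_blocks in combos:
        in_set = set(in_blocks)
        is_idx = [i for b in in_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in in_set for i in blocks[b]]
        is_mean = [sum(r[i] for i in is_idx) / len(is_idx) for r in returns]
        oos_mean = [sum(r[i] for i in oos_idx) / len(oos_idx) for r in returns]

        # In-sample winner, then its relative rank out of sample
        winner = max(range(n_strat), key=is_mean.__getitem__)
        beaten = sum(1 for m in oos_mean if m < oos_mean[winner])
        relative_rank = (beaten + 1) / (n_strat + 1)
        if relative_rank < 0.5:  # winner landed below the OOS median
            below_median += 1
    return below_median / len(combos)


# Hypothetical input: eight pure-noise series with stdev 0.009, 252 observations
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.009) for _ in range(252)] for _ in range(8)]
print(pbo_cscv(noise))
```

Dividing the rank by the number of strategies plus one keeps the relative rank strictly between 0 and 1, so "below the median" reduces to a comparison against 0.5.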

Here PBO = 0.726. In nearly three-quarters of the splits, the strategy that looked best in-sample under-performed out of sample. For independent strategies with real edge, PBO < 0.2 is the threshold for "the in-sample winner generalises." This scenario sits far on the wrong side of that line: the in-sample ranking is not predictive of out-of-sample ranking.

Why this result is the honest, common case

A high PBO on a matrix of broadly similar candidate strategies is exactly what you should expect when none of the candidates has durable edge. With eight noisy series and only 252 observations, whichever strategy happens to lead in any given in-sample window is largely luck, and luck does not repeat out of sample. The engine is detecting that directly: rank the candidates in-sample, and the leader is worse than a coin flip to finish even in the top half out of sample.

A real-world equivalent is eight parameter variants of the same momentum signal, each fit slightly differently. The one with the best in-sample Sharpe is the one that best fit the in-sample noise — and it pays for that fit out of sample. PBO = 0.726 is the number that catches this before the capital does.

What the deflated Sharpe says

The engine returns a deflated Sharpe below the 0.95 gate for every strategy; the best, config_7, reaches only 0.41. The Deflated Sharpe Ratio¹ tests whether the in-sample Sharpe is statistically significant given:

The number of trials (8 in this case).
The skew and kurtosis of the return distribution.
The sample size (252 observations).

A deflated Sharpe below the gate means the test cannot reject the null that the true Sharpe is zero, after adjusting for selection bias and non-normality. The best annualised Sharpe of 1.22 looks respectable on paper; its deflated value of 0.41 (a 41% probability the edge is real, well short of the 0.95 bar) says it does not survive multiple-testing correction², the same data-snooping problem White's reality check formalises³.
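For readers who want to see the mechanics, the sketch below follows the Deflated Sharpe Ratio formula from Bailey and Lopez de Prado¹: the observed per-period Sharpe is compared against the expected maximum Sharpe across the trials, then scaled by sample size and a skew/kurtosis correction. The function names are hypothetical, and the demo assumes normal returns (skew 0, kurtosis 3) and a sample variance across trials. The engine's actual per-strategy moments are not reproduced in this article, so the result should land below the 0.95 gate but need not match the engine's 0.41.

```python
import math
from statistics import NormalDist, variance

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_norm = NormalDist()


def expected_max_sharpe(trial_sharpes):
    """Benchmark Sharpe: expected maximum of N zero-edge trials (needs N >= 2)."""
    n = len(trial_sharpes)
    sd = math.sqrt(variance(trial_sharpes))  # dispersion of Sharpes across trials
    return sd * ((1 - EULER_GAMMA) * _norm.inv_cdf(1 - 1 / n)
                 + EULER_GAMMA * _norm.inv_cdf(1 - 1 / (n * math.e)))


def deflated_sharpe(sr, sr0, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the benchmark sr0.

    sr and sr0 are per-period (not annualised); kurt is raw kurtosis (3 = normal).
    """
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return _norm.cdf(z)


# Annualised Sharpes from the table above, converted to daily units
annual = [-0.27, -1.04, 0.86, -1.97, 0.31, 1.00, 1.22, 1.16]
daily = [sr / math.sqrt(252) for sr in annual]
sr0 = expected_max_sharpe(daily)
dsr_best = deflated_sharpe(max(daily), sr0, n_obs=252)  # skew/kurt assumed normal
print(round(dsr_best, 2))
```

The key point is that the benchmark `sr0` grows with both the number of trials and the spread of their Sharpes, so a wide spread across only eight candidates is enough to push the bar above the best observed Sharpe.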

The two results in combination: PBO = 0.726 (the in-sample winner does not generalise) and deflated Sharpe below the gate for all candidates (no real edge). The honest interpretation: these eight candidates are noise dressed as strategies, and the apparent leader is a selection artefact.

When PBO is informative

PBO is informative when the candidate strategies are genuinely independent and the sample is long enough to give the test power. Three checks before trusting the PBO number:

Pairwise correlation matrix. Expect most off-diagonal values to be below 0.3 for truly diverse candidates. Highly correlated candidates make the ranking artificially stable, which produces an artificially low PBO: the opposite failure mode from this scenario.
Strategy diversity by construction. A momentum strategy and a mean-reversion strategy on the same universe are more independent than two momentum strategies with different lookback windows.
Sample size scaling. PBO at n = 60 (the engine's hard floor) is barely informative; at n = 500+, the test has real power. This run's 252 observations give moderate power. The Walk-Forward Validator supplements PBO with rolling out-of-sample windows.

How PBO and Deflated Sharpe combine

Four combinations:

PBO	Deflated Sharpe (best candidate)	Interpretation
Low (< 0.2)	High (> 0.95)	Edge appears real and generalises. Best case.
Low (< 0.2)	Low (< 0.50)	Ranking stable, edge not significant. Often correlated candidates.
High (> 0.5)	High (> 0.95)	Statistically real, doesn't generalise. Regime change suspect.
High (> 0.5)	Low (< 0.50)	Classic overfit. Discard. This scenario (PBO 0.726, DSR 0.41 max).

This run lands squarely in the bottom row: high PBO, every deflated Sharpe below the gate (0.41 at best). The in-sample winner does not survive out of sample and has no significant edge to begin with. It is the cleanest "do not deploy" signal the two tests produce together⁴.
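The four-way reading can be expressed as a small lookup, which is handy when screening many candidate matrices. This is a sketch only: `classify` is a hypothetical function, the deflated Sharpe is expressed as a probability between 0 and 1, and the "Inconclusive" fallback for values between the table's bands is an addition made here rather than part of the engine's output.

```python
def classify(pbo, best_dsr, pbo_low=0.2, pbo_high=0.5, dsr_gate=0.95, dsr_low=0.50):
    """Map a (PBO, best deflated Sharpe) pair onto the four-row table."""
    pbo_band = "low" if pbo < pbo_low else "high" if pbo > pbo_high else None
    dsr_band = "high" if best_dsr > dsr_gate else "low" if best_dsr < dsr_low else None
    verdicts = {
        ("low", "high"): "Edge appears real and generalises",
        ("low", "low"): "Ranking stable, edge not significant",
        ("high", "high"): "Statistically real, doesn't generalise",
        ("high", "low"): "Classic overfit: discard",
    }
    # Values between the bands fall outside the table (fallback added for this sketch)
    return verdicts.get((pbo_band, dsr_band), "Inconclusive: outside the table's bands")


print(classify(0.726, 0.41))  # this run's PBO and best deflated Sharpe
```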

Recommendations on this output

Do not deploy config_7. It is the in-sample leader and the out-of-sample laggard; that is the definition of an overfit selection.
Treat the whole candidate set as edgeless at this sample size. No deflated Sharpe clears the 0.95 bar (0.41 at best), so none of the eight has a statistically significant edge.
Diversify by construction, not by parameter. Eight variants of one signal are one bet, not eight. Add genuinely different strategy families.
Run on real returns. Synthetic-based PBO is a sanity check on the test infrastructure and a teaching example, not a verdict on any live candidate.

Failure modes

Quoting annualised Sharpe without deflation. config_7's 1.22 looks fine; its deflated value of 0.41 tells the real story.
Reading a high PBO as a data error. PBO = 0.726 is not a bug; it is the correct answer for an edgeless candidate set.
Quoting PBO without checking strategy independence. A very low PBO can be an artefact of correlated candidates, the inverse trap to this one.
Sample size too small. Under 60 observations the test refuses to run; under 200, the test has weak power even when assumptions hold.

FAQ

Is a high PBO always bad?

A high PBO is always a warning, but what it warns of depends on the deflated Sharpe. Here, a PBO of 0.726 means the in-sample winner under-performs out of sample in most splits, and every deflated Sharpe falls short of the 0.95 gate: that combination is the textbook overfit, not a near-miss, and a clear discard signal. A high PBO alongside a deflated Sharpe above the gate would instead point to an edge that is statistically real but does not generalise, with regime change the usual suspect.

How many strategies do I need to test for PBO to be meaningful?

The CSCV algorithm runs on as few as 2 strategies, but the test is most informative with 5-20. With more than 20, the multiple-comparison adjustment in deflated Sharpe dominates and individual strategies' significance collapses.

What sample size does PBO need?

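Minimum 60 observations (the engine's hard floor). To distinguish a PBO of 0.5 from one of 0.2 with useful power, plan for 500+ observations. This run's 252 daily observations give moderate power; a 2-year daily window (about 504 observations) reaches the target, while monthly data needs 5 years just to reach the 60-observation floor and cannot realistically reach 500.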

This is the worked-matrix entry in the overfitting-diagnostics series: a single fixed 8-strategy candidate matrix carried all the way through a PBO computation. Read alongside:

Deflated Sharpe Ratio: Derivation and Worked Example: the series pillar, covering the DSR formula, its extreme-value origins, and the deflation table.
PBO Explained for LLM Strategies: the PBO formulation in depth and why LLM searches inflate it.
Deflated Sharpe in Low-Trial Regimes: how DSR behaves as the trial budget shrinks toward 1.
Deflated Sharpe vs PBO on the Same Tape: both tests on one return matrix and the two distinct failure signatures.
Did You Overfit? PBO + Deflated Sharpe: the runnable ~80-line how-to that computes both from a CSV.

Connects to

Did You Overfit? PBO and Deflated Sharpe: runnable tutorial on the same two tests.
Backtest Overfitting for LLM Strategies: PBO Explained: the LLM-strategy angle.
Walk-Forward Validation Cookbook: complementary regime-change test.
Deflated Sharpe Derivation: A Worked Example: the closed-form derivation.
Backtest Overfitting Score: upload your candidate matrix.
Backtest Overfitting Score methodology: full input/output specification.

References

1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. jpm.pm-research.com
2. Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1), 13–28. pm-research.com
3. White, H. (2000). "A Reality Check for Data Snooping." Econometrica 68(5), 1097–1126. jstor.org
4. Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70. papers.ssrn.com

Verified engine output

Backtest Overfitting Score — 8 strategies × 252 synthetic daily observations, s=8, 500 combinations, seed 42

Inputs
s	8
max_combos	500
seed	42
returns (8 items)	[...]
names (8 items)	[...]

Result
n strategies	8
n observations	252
strategies (8 items)	[...]
pbo	0.726
combinations tested	500
s	8
best strategy index	6
best strategy name	config_7

Computed live at build time.