Why does the PBO test sample only 500 combinations instead of all 12,870?

Sampling 500 combinations gives a standard error of about 0.02, below the 0.05 decision-threshold sensitivity. Full enumeration takes about 25× longer and only removes a sampling error that is already small, so it is not worth it for retail-scale tapes.

Can DSR PSR exceed 1.0?

No — it is a probability bounded in [0, 1]. On well-behaved inputs the engine returns values in (0, 1); on degenerate inputs it can return 0 or 1 at the boundaries.

For PBO, a value below 0.2 is conventional for a well-behaved strategy, 0.2–0.5 is the gray zone that needs walk-forward validation, and above 0.5 is the classic overfit signature.

Can a strategy fail PBO and DSR independently?

Yes. PBO can fail when the IS rank is unstable but absolute Sharpe is fine; DSR can fail when absolute Sharpe is too small relative to trial count. The four-quadrant table distinguishes these patterns.

Are there other tests I should run?

Walk-forward validation catches regime change; survivor-bias audit catches universe-selection issues; trading-cost audit catches the 'looks good before costs' mode. The two tests are necessary but not sufficient.

Deflated Sharpe vs PBO on the Same Tape

Two overfitting tests reject the same backtest winner for different reasons, which is exactly why a defensible report runs both. On the Backtest Overfitting Score sample tape (8 candidate strategies × 252 observations of noisy returns), the engine returns PBO = 0.726 (well above the 0.5 overfitting threshold), best strategy index 6 ("config_7", annualized Sharpe 1.22), with 500 CSCV combinations tested. The Deflated Sharpe Ratio on a representative observed Sharpe of 1.6, n = 504, skew −0.4, kurt 4.0, num_trials = 200 returns PSR = 0.311, deflatedSr = −0.357, maxExpectedSr = 1.957. Both tests reject the tape's apparent winner, and they reject it on different grounds: PBO on out-of-sample rank decay, the deflated Sharpe on multiple-testing-adjusted significance.

TL;DR

Same overfit tape (8 candidate strategies of noisy returns), two tests, two failure signatures:

Test	What it measures	Output	Verdict
PBO (CSCV)	In-sample vs out-of-sample rank consistency	0.726	Reject: the in-sample winner is below median out-of-sample > 72% of the time
Deflated Sharpe	Sharpe significance after num_trials adjustment	PSR = 0.311	Reject: the observed Sharpe is below the selection-bias benchmark

PBO answers "does the in-sample best generalize." DSR answers "is the headline Sharpe statistically real after correcting for trials." Both reject the canonical overfit example; tapes that are overfit in other ways can fail one test while passing the other. The complementary information is the entire point of running both.

What each test captures

PBO via Combinatorially-Symmetric Cross-Validation (CSCV):

Split the observation axis into 16 chunks (S = 8 means 2S = 16 chunks).
Enumerate combinations of 8 chunks as in-sample (IS), 8 as out-of-sample (OOS).
For each combination, find the IS Sharpe winner across the 8 strategies; check its OOS Sharpe rank.
PBO = fraction of combinations where the IS winner ranks below OOS median.

PBO = 0.726 on the sample tape means: out of 500 CSCV combinations, about 363 had the IS winner ranking below OOS median. The IS-winner-out-of-sample-loser pattern is the textbook overfit signature, and a 72.6% rate is severe.
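The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the function names, the (observations × strategies) array layout, the use of per-period Sharpe, and the choice to sample chunk sets independently (so a split can repeat) are all assumptions made for the example.

```python
import numpy as np
from itertools import combinations
from math import comb


def sharpe(block):
    """Per-period Sharpe of each column; 0 where a column has no variance."""
    mu = block.mean(axis=0)
    sd = block.std(axis=0, ddof=1)
    return np.divide(mu, sd, out=np.zeros_like(mu), where=sd > 0)


def pbo_cscv(returns, s=8, max_combos=500, seed=42):
    """returns: array of shape (n_observations, n_strategies)."""
    rng = np.random.default_rng(seed)
    n_chunks = 2 * s
    # Split the observation axis into 2S contiguous chunks of near-equal size.
    chunks = np.array_split(np.arange(returns.shape[0]), n_chunks)
    if comb(n_chunks, s) <= max_combos:
        splits = list(combinations(range(n_chunks), s))
    else:
        # Assumption: draw each IS chunk set at random rather than enumerating.
        splits = [rng.choice(n_chunks, size=s, replace=False)
                  for _ in range(max_combos)]
    below = 0
    for is_ids in splits:
        mask = np.zeros(n_chunks, dtype=bool)
        mask[list(is_ids)] = True
        is_rows = np.concatenate([c for c, m in zip(chunks, mask) if m])
        oos_rows = np.concatenate([c for c, m in zip(chunks, mask) if not m])
        winner = int(np.argmax(sharpe(returns[is_rows])))
        oos = sharpe(returns[oos_rows])
        # Count the split when the IS winner sits below the OOS median.
        if oos[winner] < np.median(oos):
            below += 1
    return below / len(splits)
```

Calling `pbo_cscv(matrix)` on a matrix of pure noise should give a value scattered around 0.5, while a matrix in which one strategy dominates every chunk should give 0.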

Deflated Sharpe (Bailey–Lopez de Prado 2014):

Take the observed Sharpe, sample size T, skew, kurt, and num_trials.
Compute the expected max Sharpe under the null (zero true edge across all candidates).
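Compute the z-score: z = (observed Sharpe − expected max Sharpe under the null) × √(T−1) / √(1 − skew × SR + ((kurt − 1) / 4) × SR²), where SR is the observed Sharpe expressed per period and the denominator corrects the Sharpe estimator's variance for non-normal returns.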
PSR = Φ(z). Reject the strategy if PSR < 0.95.

On the canonical input, the engine returns maxExpectedSr = 1.957 (the expected maximum Sharpe across 200 zero-edge trials, in the same Sharpe units as the observed input), deflatedSr = −0.357, z = −0.493, PSR = 0.311, matching the verified engine output at the end of this article. The observed Sharpe (1.6) is below the engine's selection-bias benchmark of 1.96, so PSR sits well under the 0.95 gate and the strategy is rejected.
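A short Python sketch of these steps follows. It is a reconstruction, not the engine's code, and it rests on three assumptions: the expected maximum uses the Bailey–Lopez de Prado extreme-value approximation, the benchmark is rescaled to annualized units by √(periods_per_year / (n − 1)), and the z-score is computed on per-period Sharpe values. Those choices were picked because they reproduce the figures listed under "Verified engine output" (psr ≈ 0.311, z ≈ −0.493, max expected sr ≈ 1.957). The function name is hypothetical.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def deflated_sharpe(observed_sr, n, skew, kurt, num_trials, periods_per_year=252):
    """observed_sr is annualized; kurt is raw kurtosis (3 for a normal)."""
    if num_trials < 2:
        raise ValueError("the extreme-value approximation needs num_trials >= 2")
    nd = NormalDist()
    # Expected maximum of num_trials standard normals (approximation).
    e_max = ((1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / num_trials)
             + EULER_GAMMA * nd.inv_cdf(1 - 1 / (num_trials * e)))
    # Assumption: rescale the benchmark to annualized Sharpe units.
    max_expected_sr = e_max * sqrt(periods_per_year / (n - 1))
    # Assumption: the significance test runs on per-period Sharpe values.
    sr = observed_sr / sqrt(periods_per_year)
    sr0 = max_expected_sr / sqrt(periods_per_year)
    z = (sr - sr0) * sqrt(n - 1) / sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return {
        "psr": nd.cdf(z),
        "z": z,
        "max_expected_sr": max_expected_sr,
        "deflated_sr": observed_sr - max_expected_sr,
    }


result = deflated_sharpe(1.6, 504, -0.4, 4.0, 200)
```

The whole computation is two inverse-CDF calls, one CDF call, and a handful of square roots, which is why the test is effectively free to run.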

Why the two answer different questions

PBO and DSR measure different aspects of overfitting:

PBO measures generalization. Does the strategy that wins in-sample also win out-of-sample, or does it fall to a different strategy? PBO catches strategies that are "best at fitting the in-sample tape" without being "best at predicting the OOS tape." This is overfitting of the selection procedure.
DSR measures Sharpe inflation. Given the number of trials, how much of the observed Sharpe is consistent with pure luck? DSR catches strategies whose Sharpe looks good only because the researcher tested many alternatives. This is overfitting of the test statistic.

A strategy can pass DSR (Sharpe is robust to trial count) but fail PBO (the same in-sample winner under-performs OOS). A strategy can pass PBO (the in-sample winner generalizes) but fail DSR (the Sharpe is not statistically distinguishable from luck once you account for trial count).

For the sample tape, both fail: the tape is 8 candidate strategies of noisy returns with no genuine edge, which is the textbook overfit-by-search setup — picking the in-sample best of 8 noise series produces an apparent winner that does not generalize. Real backtests fail in mixed patterns: some fail DSR but pass PBO (the in-sample winner does generalize, but the Sharpe is too small to defend after trial correction), some fail PBO but pass DSR (the headline Sharpe is real but the selection procedure is unstable).
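The best-of-noise effect is easy to simulate. The sketch below assumes i.i.d. zero-mean Gaussian returns, which is an illustrative choice and not necessarily how the sample tape was generated; the function name and defaults are hypothetical.

```python
import numpy as np


def best_of_noise(n_strategies=8, n_obs=252, periods_per_year=252,
                  n_reps=2000, seed=0):
    """Average annualized Sharpe of the in-sample best among zero-edge strategies."""
    rng = np.random.default_rng(seed)
    # Shape: (replications, observations, strategies); true edge is zero everywhere.
    r = rng.normal(0.0, 0.01, size=(n_reps, n_obs, n_strategies))
    sharpes = r.mean(axis=1) / r.std(axis=1, ddof=1) * np.sqrt(periods_per_year)
    # Pick the best strategy in each replication, then average across replications.
    return float(sharpes.max(axis=1).mean())
```

Under these assumptions the result should land near the expected maximum of 8 standard normals, roughly 1.4, so an in-sample best of about 1.2 on a one-year tape of 8 candidates is what pure noise would be expected to produce.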

The four-quadrant decision table

A defensible quant report runs both and reports the four-quadrant verdict:

PBO	DSR	Interpretation
< 0.2	> 0.95	Edge appears to generalize; size up cautiously.
< 0.2	0.50–0.95	IS/OOS consistency is good but Sharpe significance is marginal. Small edge or noisy returns.
0.2–0.5	> 0.95	Sharpe is statistically real but cross-validation is uncertain. Investigate regime dependence.
> 0.5	< 0.5	Classic overfit signature. Stop.

The sample tape lands in the fourth row: PBO = 0.726 > 0.5 and DSR PSR = 0.311 < 0.5. Stop.
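As a sketch, the table can be encoded as a lookup. The function name is hypothetical, and the handling of exact boundary values and of combinations the table does not list is an assumption of this example.

```python
def quadrant_verdict(pbo, psr):
    """Map (PBO, DSR PSR) onto the four rows of the decision table."""
    if pbo < 0.2 and psr > 0.95:
        return "Edge appears to generalize; size up cautiously."
    if pbo < 0.2 and 0.50 <= psr <= 0.95:
        return "IS/OOS consistency is good but Sharpe significance is marginal."
    if 0.2 <= pbo <= 0.5 and psr > 0.95:
        return "Sharpe is statistically real but cross-validation is uncertain."
    if pbo > 0.5 and psr < 0.5:
        return "Classic overfit signature. Stop."
    # The table lists four rows only; everything else needs a closer look.
    return "Not covered by the table; inspect both tests individually."
```

With the sample tape's PBO of 0.726 and any PSR below 0.5, `quadrant_verdict` returns the fourth-row verdict.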

When the two disagree: the diagnostic value

The interesting cases are off-diagonal. A strategy with PBO 0.15 (passes generalization) and DSR PSR 0.40 (fails Sharpe-significance) is a strategy with a small but real edge that needs more data: the IS-OOS pattern is consistent, but the absolute Sharpe is too small to defend at the current trial count. The fix: extend the tape, not redo the search.

A strategy with PBO 0.65 (fails generalization) and DSR PSR 0.96 (passes Sharpe-significance) is a strategy where the headline Sharpe is statistically real but the specific candidate was picked unstably. The fix: re-run the search with a different random seed, see if the same candidate emerges. If a different candidate wins, both candidates are suspect; if the same candidate wins repeatedly, the PBO false-alarm explanation deserves weight (rare, but possible for strategies with very stable parameter sensitivities).

The cost of running both

PBO via CSCV with S = 8 enumerates C(16, 8) = 12,870 combinations. The engine samples 500 of these, which gives a standard error of about 0.02 (roughly ±4 percentage points at two standard errors); full enumeration takes about 25× as long and removes the sampling error altogether. For a 252-observation tape with 8 strategies, the 500-combination run completes in well under a second on a typical browser.
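The combination count and the sampling error can be checked with a few lines of arithmetic. The sketch treats each sampled combination as an independent Bernoulli draw, which is an assumption about how the engine samples.

```python
from math import comb, sqrt

total = comb(16, 8)                    # all ways to pick 8 IS chunks out of 16
sampled = 500
pbo = 0.726
se = sqrt(pbo * (1 - pbo) / sampled)   # binomial standard error of the sampled PBO
cost_ratio = total / sampled           # relative cost of full enumeration

print(total, round(se, 4), round(cost_ratio, 1))
```

The standard error comes out just under 0.02, so a band of about ±4 percentage points corresponds to two standard errors.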

DSR is a closed-form computation: one normal-CDF evaluation plus a small amount of arithmetic. Sub-millisecond.

Together the two tests cost essentially nothing in compute. There is no defensible reason to skip either; the gap they leave when run together is the regime-change risk, which is what walk-forward validation addresses (see /articles/walk-forward-window-sizing-decision/).

Where the joint approach breaks

Both tests assume the candidate set is fixed before testing. A workflow that adds strategies after seeing initial results (the most common retail pattern) has an effective trial count larger than the literal candidate count. Neither test can detect this; the discipline of pre-registering the candidate set is the only defence.

Both tests assume returns are stationary across the tape. A regime switch mid-tape (volatility regime change, structural break in fundamental conditions) makes the OOS rank a property of the regime, not the strategy. PBO will flag the regime-fragility as overfitting; DSR will flag the inflated mid-tape Sharpe as luck. Neither distinguishes overfitting from genuine regime change, so additional tests (see The Sharpe Ratio Trap) are required.

The PBO test's sensitivity to the chunking parameter S is moderate; S = 4 (8 chunks) is too coarse for tapes under 100 observations; S = 16 (32 chunks) is overkill for tapes above 1000 observations. The canonical S = 8 is appropriate for tapes between 100 and 1000 observations.

This is the head-to-head entry in the overfitting-diagnostics series: both tests run on a single return matrix to show they reject the same winner for different reasons. Read alongside:

Deflated Sharpe Ratio: Derivation and Worked Example: the series pillar, covering the DSR formula, its extreme-value origins, and the deflation table.
PBO Explained for LLM Strategies: the PBO formulation in depth and why LLM searches inflate it.
Deflated Sharpe in Low-Trial Regimes: how DSR behaves as the trial budget shrinks toward 1.
Did You Overfit? PBO + Deflated Sharpe: the runnable ~80-line how-to that computes both from a CSV.
PBO Score on an Eight-Strategy Matrix: a fully worked PBO computation on a fixed 8-strategy candidate matrix.

Connects to

Did You Overfit? PBO and Deflated Sharpe: the runnable Python tutorial implementing both tests in 80 lines.
Deflated Sharpe Derivation Worked Example: step-by-step derivation of the DSR formula.
Backtest Overfitting in LLM Strategies: PBO Explained: LLM-strategy context for the PBO test.
Deflated Sharpe Ratio: engine endpoint.
Backtest Overfitting Score: engine endpoint.
Walk-Forward Validator: companion test for regime-change risk.

References

Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70. SSRN abstract https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253
Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94–107. https://jpm.pm-research.com/content/40/5/94
Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism." Notices of the American Mathematical Society 61(5), 458–471. https://www.ams.org/notices/201405/rnoti-p458.pdf
Lopez de Prado, M. (2018). Advances in Financial Machine Learning, chapter on backtest statistics. https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1), 13–28.

Verified engine output


PBO via CSCV on the 8-strategy overfit tape

Inputs
s	8
max_combos	500
seed	42
names (8 items)	[...]
returns (8 items)	[...]

Result
n strategies	8
n observations	252
strategies (8 items)	[...]
pbo	0.726
combinations tested	500
s	8
best strategy index	6
best strategy name	config_7

Computed live at build time.

Deflated Sharpe on a representative observed SR of 1.6

Inputs
observed_sr	1.6
n	504
skew	-0.4
kurt	4
num_trials	200
periods_per_year	252

Result
psr	0.3108853828672815
z	-0.49334226197195025
max expected sr	1.9574636000454002
effective benchmark	1.9574636000454002
deflated sr	-0.3574636000454001