TL;DR

The Probability of Backtest Overfitting (PBO) is the fraction of in-sample/out-of-sample partitions where the in-sample winning strategy performs worse than the median strategy out of sample. The threshold is straightforward: a PBO of 0.5 means the in-sample winner is, in expectation, no better than a random pick from the candidate pool, and anything above 0.5 means selection is actively anti-predictive. LLM-augmented backtests systematically inflate PBO relative to classical strategy searches, for three structural reasons: training-set leakage, prompt-encoded pattern bias, and unbounded candidate generation. This article walks through the formulation from Bailey, Borwein, López de Prado, and Zhu (2014), shows how to compute the score, demonstrates a worked example with 200 candidate strategies, and lays out the mitigation pattern that holds up under adversarial review. Run your own data through /backtest-overfitting-score/ and combine it with /walk-forward-validator/ before paper trading.

What classical OOS testing gives you, and what it leaves on the table

The standard discipline before LLMs touched a backtest was in-sample fitting, out-of-sample evaluation on a held-out window, and walk-forward to test parameter stability across a rolling fit.

IS/OOS catches the obvious case where parameters were fit to noise. If your strategy returns a Sharpe of 2.4 on 2014 to 2020 and 0.3 on 2021 to 2023, the OOS gap is the diagnosis. The fix is simpler parameters or a different signal. Walk-forward extends this by repeating the IS/OOS cut at every step, so you see whether the OOS gap is a one-off artifact of the chosen split date or a persistent feature of the strategy.
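A minimal sketch of the rolling cut, assuming daily prices and a fit callable that performs the parameter search on the in-sample window; the function and window names here are illustrative, not the /walk-forward-validator/ API:

import numpy as np

def walk_forward(prices, fit, is_len=756, oos_len=252):
    """
    prices: 1-D array of daily prices
    fit: callable that takes an in-sample price window and returns a strategy
         function mapping a price window to daily strategy returns
    Returns one (is_sharpe, oos_sharpe) pair per fold.
    """
    def sharpe(r):
        return np.sqrt(252) * r.mean() / r.std(ddof=1)

    folds, start = [], 0
    while start + is_len + oos_len <= len(prices):
        is_win = prices[start:start + is_len]
        oos_win = prices[start + is_len:start + is_len + oos_len]
        strat = fit(is_win)                    # parameters are chosen on the IS window only
        folds.append((sharpe(strat(is_win)), sharpe(strat(oos_win))))
        start += oos_len                       # roll forward by one OOS window
    return folds

A persistent gap between the IS and OOS columns across folds is the walk-forward signal described above.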

These controls are necessary but not sufficient. The hole they miss is selection bias from the candidate pool. If you tested fifty crossover strategies, fifty mean-reversion strategies, and fifty pairs trades, then kept the one with the best walk-forward Sharpe, that single number is biased upward. The expected maximum of fifty noise draws is positive even when the true edge of every strategy is zero.

Walk-forward on the winning strategy alone tells you nothing about how many siblings it had. PBO closes that gap by checking whether the in-sample winner from a random partition also ranks near the top out of sample, or regresses to the middle of the pack.

The PBO formulation in plain English

The 2014 paper by Bailey, Borwein, López de Prado, and Zhu (The Probability of Backtest Overfitting, Journal of Computational Finance) defines PBO through Combinatorially-Symmetric Cross-Validation (CSCV). The procedure is mechanical.

Start with a returns matrix: rows are time, columns are candidate strategies, T observations, N strategies. Chop the time axis into 2S equal-length blocks (S is typically 8, giving 16 blocks; for ten years of daily data, blocks of roughly 150 trading days each). Enumerate every way to split the 2S blocks into two halves of S blocks each: C(2S, S) partitions, 12,870 for S = 8. One half is in-sample (IS), the other out-of-sample (OOS).

For each partition, compute every strategy's Sharpe ratio on the IS half and OOS half independently. Find the strategy with the highest IS Sharpe; call it n*. Rank n*'s Sharpe within the OOS distribution of all N strategies. If n* ranked in the top half of OOS, the IS winner generalized; if in the bottom half, it failed to generalize.

PBO is the fraction of partitions where the IS winner ranked at or below the OOS median:

PBO = P(λ_n* ≤ 0)

where λ_n* is the logit of n*'s fractional rank in the OOS Sharpe distribution. A logit of zero corresponds to the median, so the inequality is the formal statement of "the IS winner ranked at or below the OOS median."

Computing PBO: the matrix you need

In code, the algorithm is roughly forty lines.

from itertools import combinations
import numpy as np

def pbo(returns, s=8, sample=None, seed=42):
    """
    returns: np.array shape (N_strategies, T_observations)
    s: half-block count; total partitions = C(2s, s)
    sample: if int, randomly draw that many partitions instead of enumerating
    """
    n, t = returns.shape
    chunk = t // (2 * s)                     # observations per block; any remainder is dropped
    blocks = [returns[:, i * chunk:(i + 1) * chunk] for i in range(2 * s)]

    # Every way of choosing s of the 2s blocks as the in-sample half.
    all_partitions = list(combinations(range(2 * s), s))
    if sample is not None and sample < len(all_partitions):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(all_partitions), size=sample, replace=False)
        partitions = [all_partitions[i] for i in idx]
    else:
        partitions = all_partitions

    below = 0
    for is_idx in partitions:
        is_set = set(is_idx)
        is_mat = np.concatenate([blocks[i] for i in is_idx], axis=1)
        oos_mat = np.concatenate(
            [blocks[i] for i in range(2 * s) if i not in is_set], axis=1
        )

        # Per-strategy Sharpe on each half. Annualization is omitted because
        # only the OOS rank of the IS winner matters, not the absolute level.
        is_sr = is_mat.mean(axis=1) / is_mat.std(axis=1, ddof=1)
        oos_sr = oos_mat.mean(axis=1) / oos_mat.std(axis=1, ddof=1)

        winner = int(np.argmax(is_sr))                     # n*: best IS Sharpe
        rank = (oos_sr < oos_sr[winner]).sum() / (n - 1)   # fractional OOS rank of n*
        clamped = max(0.001, min(0.999, rank))             # keep the logit finite
        logit = np.log(clamped / (1 - clamped))
        if logit <= 0:                                     # n* at or below the OOS median
            below += 1

    return below / len(partitions)

For S = 8 the full enumeration is 12,870 partitions; on a 200-strategy, 2,500-day matrix that runs in under a minute on modern hardware. For larger N or longer T, sampling 500 partitions gives a PBO estimate accurate to ±4 percent. The browser implementation at /backtest-overfitting-score/ uses this exact sampling rule.
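A usage sketch on synthetic noise data, with strategy count and horizon mirroring the matrix described above; since the candidates have zero true edge, the score should land near 0.5:

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(200, 2500))  # 200 strategies x ~10 years of daily returns

print(pbo(noise, s=8, sample=500))               # pure noise: expect a value near 0.5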

What the 0.5 threshold means, and what it does not mean

A PBO of 0.5 says the in-sample winner of a random partition has a 50/50 chance of ranking below the OOS median: the same odds as picking a strategy at random. If the selection process adds zero information, expect PBO around 0.5.

PBO above 0.5 is worse than random. It indicates negative information transfer from IS to OOS, usually because the IS winner is a strategy whose in-sample fit was driven by features that reversed out of sample. PBO below 0.5 indicates positive information transfer: in-sample winning predicts OOS performance better than chance.

Interpretation buckets:

PBO < 0.2: In-sample winners generalize well. The selection process adds real information.
PBO 0.2 to 0.4: Modest selection edge. Combine with the PBO confidence interval before deploying.
PBO 0.4 to 0.6: Indistinguishable from random selection. Stop.
PBO > 0.6: Anti-selection. The in-sample winner is systematically a worse OOS bet than the median.

What PBO does not tell you: it does not bound the expected OOS Sharpe of the winner. A PBO of 0.1 on a candidate pool of three trivial strategies tells you only that the best of the three tends to rank above the OOS median; its absolute OOS Sharpe could still be 0.2. Pair PBO with the Deflated Sharpe Ratio for the absolute-performance question.

PBO also does not catch survivorship bias in the universe (testing only stocks that survived the period), look-ahead leakage in feature construction, or non-stationary regime shifts. It is one statistical control inside a larger validation flow, not a single-number verdict.

Why LLM trading strategies have inflated PBO

LLM-generated and LLM-tuned strategies sit at higher PBO than classical strategy searches for three compounding reasons.

1. The model has already seen the future. Pre-training data for any commercial LLM as of 2026 includes financial commentary, academic backtest papers, retail trading blogs, and earnings transcripts dated through the model's cutoff. Ask a model to propose a momentum strategy on USD/JPY for the 2018 to 2023 window and it has already read post-hoc analyses describing what worked in that exact window. The strategy is not generated from a hypothesis; it is retrieved from a corpus that includes the answer. This is data leakage at the strategy-design level, not the price level.

2. The prompt encodes pattern-matched bias. Even with a careful prompt, instructions like "find a stable signal" or "minimize drawdown" prime the model toward strategies over-represented in the training corpus, and corpus over-representation correlates with strategies that worked historically. The class of generated strategies is non-uniform with respect to the actual hypothesis space, in a direction that flatters in-sample performance.

3. The candidate pool is enormous. With a classical strategy search, the candidate count is bounded by what the researcher manually specifies. With LLM-driven generation it balloons by orders of magnitude: ten variants per iteration, ten iterations, three time windows produces 300 candidates from a single prompt. Selection bias grows with the size of the pool, and PBO rises with it: the same edge that produced a PBO of 0.3 on 30 candidates routinely produces 0.55 on 300.

The combined effect is multiplicative, not additive. The same backtest run with a human-designed list of 30 versus an LLM-generated list of 300 can shift PBO from 0.25 to 0.65 with no change in underlying signal quality. This is the single most important diagnostic for any LLM-augmented research stack and the one most often skipped in production.

A worked example: 200 candidates, 10 blocks, 252 IS/OOS partitions

A synthetic example calibrated to make the math concrete. N = 200 candidate strategies generated by an LLM-driven loop, T = 2,520 daily observations (ten years), S = 5 (10 blocks of 252 days each, C(10, 5) = 252 partitions).

Case A: the true Sharpe of every candidate is zero. IS Sharpes are normally distributed with mean zero. The IS winner's OOS Sharpe is independent of its IS Sharpe under the null, so the expected OOS rank is the median and PBO converges to 0.5.

Case B: half the candidates have a true Sharpe of 0.3, half have zero. The IS winner is, in expectation, drawn from the positive half, and its OOS rank is biased upward. Simulation gives PBO around 0.18.

Case C, with LLM contamination: 20 of the 200 candidates were lifted, knowingly or not, from training data describing strategies that worked specifically in the test window. These 20 have an inflated IS Sharpe because they encode the answer, and a true OOS Sharpe of zero. The IS winner is now drawn disproportionately from the contaminated 20, whose OOS rank reverts to the population median or below. Simulation gives PBO around 0.42, even though 90% of the candidate pool is honest.

The lesson: even a small fraction of contaminated candidates can drag PBO from clean territory into the borderline-rejection band. The contamination does not need to be deliberate, or even known to the researcher.
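A minimal simulation sketch of Cases A and B using the pbo function above; Case C is omitted because it requires an explicit model of how the leaked candidates overfit the window. The volatility and drift values here are illustrative assumptions:

rng = np.random.default_rng(1)
n, t, vol = 200, 2520, 0.01

# Case A: every candidate is pure noise (true Sharpe 0); PBO should land near 0.5.
case_a = rng.normal(0.0, vol, size=(n, t))

# Case B: half the candidates carry a true annualized Sharpe of about 0.3;
# the implied daily drift is (0.3 / sqrt(252)) * daily vol.
case_b = case_a.copy()
case_b[: n // 2] += 0.3 * vol / np.sqrt(252)

print(pbo(case_a, s=5))   # expect a value near 0.5
print(pbo(case_b, s=5))   # expect a value below Case A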

Mitigations

Sorted by cost.

Pre-registered strategy spec. Write down the strategy structure, parameter ranges, and evaluation window before the LLM sees any data. Have the model generate strategies that fit the spec, evaluate on data the model was not asked to inspect, and reject any candidate that uses features not listed. This is the discipline equivalent of pre-registering an academic experiment, and it is the single highest-impact control. PBO drops by 10 to 20 percentage points in practice when the spec is tight and was written first.
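A sketch of what enforcing the spec can look like in code; the spec fields, the candidate dictionary layout, and the violates_spec helper are hypothetical illustrations, not a prescribed format:

# Hypothetical pre-registered spec, written down before the LLM sees any data.
SPEC = {
    "universe": ["ES", "NQ"],
    "allowed_features": {"close", "volume", "atr_14"},
    "param_ranges": {"lookback": (10, 120), "z_entry": (0.5, 3.0)},
    "eval_window": ("2015-01-01", "2024-12-31"),
}

def violates_spec(candidate, spec=SPEC):
    """Reject any LLM-generated candidate that steps outside the pre-registered spec."""
    if not set(candidate["features"]) <= spec["allowed_features"]:
        return True                    # uses a feature that was never registered
    for name, value in candidate["params"].items():
        bounds = spec["param_ranges"].get(name)
        if bounds is None or not bounds[0] <= value <= bounds[1]:
            return True                # unregistered parameter or out-of-range value
    return False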

Cap on candidate count. Hard ceiling on strategies tested per research loop. A reasonable default is 30 to 50 candidates per project, generated in two or three rounds of LLM brainstorming, with each round seeing only the previous round's failure analysis (not its successes). The cap forces the researcher to engage with each candidate rather than treating the search as a brute-force scan.

Reduce-to-features rather than reduce-to-strategies. Instead of generating 200 strategies and picking the best, ask the LLM to generate 20 candidate features and combine them into a single ensemble model with regularization. The multiplicity correction now applies to features (a much smaller pool) rather than strategies, and regularization handles within-model selection. This is the pattern that survives adversarial review most consistently.
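A sketch of the reduce-to-features pattern, assuming the LLM-proposed features have already been computed into a matrix; scikit-learn's Lasso is used here as one possible regularizer, not the only option:

import numpy as np
from sklearn.linear_model import Lasso

def feature_ensemble(features, fwd_returns, alpha=1e-4):
    """
    features: (T, K) matrix of K LLM-proposed features, K small (say 20)
    fwd_returns: (T,) next-period returns to predict
    One regularized model replaces a 200-strategy selection tournament: the
    multiplicity now lives in K features, and the L1 penalty handles
    within-model selection.
    """
    model = Lasso(alpha=alpha)
    model.fit(features, fwd_returns)
    signal = model.predict(features)   # in-sample signal; judge it OOS via walk-forward and PBO
    return model, signal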

The three combine. Pre-registered spec narrows the hypothesis space, candidate cap bounds the multiplicity, and feature reduction shifts the search from a high-N strategy pool to a low-N feature pool. Each control independently reduces PBO; together they typically halve it.

The three-gate validation flow

PBO is the second of three gates: walk-forward, PBO, and paper trading.

Gate one: walk-forward. Run the strategy through /walk-forward-validator/ on a rolling or anchored schedule. The output is a stitched OOS equity curve and a per-fold parameter table. Reject if the OOS Sharpe is less than 0.4 times the IS Sharpe, or if the best-fit parameters jump discontinuously from fold to fold.

Gate two: PBO. Run the candidate pool (every variant tested, not just the winner) through /backtest-overfitting-score/. Reject if PBO is above 0.4. Be honest about the pool: every variant the LLM proposed counts, including the ones you discarded after a quick eyeball.

Gate three: paper trading and probability calibration. Trade the strategy on paper for at least 60 trading days, sized per the LLM-generated probability distribution. Run the resulting probability/outcome pairs through /calibration-dojo/. Reject if the model's stated probabilities are miscalibrated by more than 10 percentage points; miscalibration is what kills Kelly sizing in production.
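The three rejection thresholds can be collapsed into a single check once the inputs have been computed with the tools above; the function and argument names in this sketch are illustrative:

def passes_gates(is_sharpe, oos_sharpe, pbo_score, calibration_gap_pp):
    """
    Gate one: OOS Sharpe must reach at least 0.4 times the IS Sharpe.
    Gate two: PBO over the full candidate pool must not exceed 0.4.
    Gate three: paper-trading calibration gap must stay within 10 percentage points.
    """
    gate_one = is_sharpe > 0 and oos_sharpe / is_sharpe >= 0.4
    gate_two = pbo_score <= 0.4
    gate_three = calibration_gap_pp <= 10
    return gate_one and gate_two and gate_three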

A strategy that passes all three is not guaranteed to make money. It is guaranteed to be free of the three best-documented retail failure modes: in-sample overfit, selection bias, and probability miscalibration. That clears the runway for the strategy to fail honestly on regime change or execution cost, not on a statistical artifact you should have caught at validation.

References

  1. Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39 to 70. The CSCV formulation and the original PBO derivation.
  2. Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance." Notices of the American Mathematical Society 61(5). The accessible companion paper.
  3. Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), 94 to 107. The complementary absolute-Sharpe test.
  4. López de Prado, M. (2018). Advances in Financial Machine Learning, Wiley. Chapter 11 covers backtest overfitting in the ML context, with explicit notes on why naive CV breaks on time series.