Selection Bias in LLM Strategy Research

An LLM that proposes ten strategies and lets the researcher pick the one with the best backtest is a high-throughput selection-bias generator. After deflation, the in-sample winner's Sharpe is roughly 50-60% of its raw value, even before considering the LLM's own search bias¹. Bailey & Lopez de Prado (2014) formalised the correction; Harvey & Liu (2015) developed a complementary multiple-testing haircut for reported Sharpe ratios². The combined effect on LLM-discovery pipelines is that the typical retail "I found a Sharpe-2 strategy with Claude" claim deflates to roughly Sharpe 0.5-0.8 after honest correction.

TL;DR

An LLM proposing N strategies and a human picking the best is industrialised data mining.
Deflated Sharpe correction reduces the in-sample winner's apparent edge by 40-60% on typical N.
Multiple-testing correction (Harvey-Liu / Bonferroni / Holm) adds another deflation layer.
Combined: published "Sharpe 2.0 LLM-discovered strategies" are typically closer to a true Sharpe of 0.5-0.8 after correction.
The fix is not to test fewer strategies; it is to measure the multiple-testing burden and discount the winner.

The failure mode

The pipeline:

Prompt an LLM: "Propose 10 mean-reversion strategies on US large-cap equities."
Run each backtest on the same historical window.
Pick the strategy with the highest Sharpe.
Publish or deploy.

The pipeline reproduces, at higher throughput, the same selection bias that has plagued retail backtesting for decades³. The only thing the LLM changes is the rate of strategy generation. The statistical correction is the same: every additional strategy tested raises the bar that the in-sample winner must clear to be statistically real.

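The Deflated Sharpe Ratio¹ formalises the correction. For $N$ strategies tested on a series of length $T$, with observed per-period (non-annualised) Sharpe $SR$, sample skew $\gamma_3$ and kurtosis $\gamma_4$ (raw, not excess, so a Normal series has $\gamma_4 = 3$), the DSR is a confidence level between 0 and 1 that the observed Sharpe exceeds what the best of $N$ zero-edge strategies would be expected to show:

DSR = Φ( (SR − E[max SR*]) · √(T−1)
       / √(1 − γ_3·SR + ((γ_4 − 1)/4)·SR²) )

where $E[\max SR^*]$ is the expected maximum Sharpe under the null that all $N$ strategies have zero true edge. It equals the standard deviation of the Sharpe estimates across trials multiplied by the expected maximum of $N$ standard Normal draws, which for non-trivial $N$ grows like $\sqrt{2 \ln N}$. The bar that the winning Sharpe must clear therefore scales with the square root of the log of the strategy count.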
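The growth of the bar with $N$ can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, and it assumes the Euler-Mascheroni approximation to the expected maximum of $N$ independent standard Normal draws, scaled by a cross-trial standard deviation of Sharpe estimates that the caller supplies.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def expected_max_z(n_trials: int) -> float:
    """Approximate expected maximum of n_trials independent N(0, 1) draws."""
    if n_trials < 2:
        return 0.0  # a single trial carries no selection effect
    inv = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv(1 - 1 / n_trials)
            + EULER_GAMMA * inv(1 - 1 / (n_trials * math.e)))


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Null benchmark Sharpe: z-score bar times cross-trial SD of Sharpe estimates.

    sr_std is an input the caller must estimate; it is an assumption here.
    """
    return sr_std * expected_max_z(n_trials)


if __name__ == "__main__":
    # Compare the approximation with the asymptotic sqrt(2 ln N) growth rate.
    for n in (10, 100, 1000):
        print(n, round(expected_max_z(n), 3), round(math.sqrt(2 * math.log(n)), 3))
```

The asymptotic $\sqrt{2 \ln N}$ form overstates the bar at small $N$, which is why the approximation above is usually preferred for strategy counts in the tens or hundreds.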

A worked correction

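Take 10 candidate strategies tested on 1,000 daily observations, with observed best annualised Sharpe 2.0, skew -0.4 and raw kurtosis 5.0 (the convention the (γ_4 − 1)/4 term expects). The formula works in per-period units, so convert first: SR = 2.0 / √252 ≈ 0.126 per day.

E[max z] ≈ (1 − 0.5772) · Φ⁻¹(1 − 1/10)
         + 0.5772 · Φ⁻¹(1 − 1/(10·e))
       ≈ 0.4228 · 1.2816 + 0.5772 · 1.7893 ≈ 1.575

This is the expected maximum of 10 standard Normal draws. To express it as a Sharpe, multiply by the standard deviation of the Sharpe estimates across trials. Assuming that standard deviation takes its value under the null, 1/√(T−1) = 1/√999 ≈ 0.0316:

E[max SR*] ≈ 1.575 · 0.0316 ≈ 0.050

So the bar that a winner needs to clear under the null is ~0.050 per-period (annualised ≈ 0.79) before counting any non-normality correction.

denom = √(1 − (-0.4)·SR + ((5 − 1)/4)·SR²)
      = √(1 + 0.4·SR + SR²)

For SR = 0.126 (per-period): denom = √(1 + 0.0504 + 0.0159) = √1.0663 ≈ 1.033.

DSR = Φ( (0.126 − 0.050) · √999 / 1.033 )
    = Φ( 0.0762 · 31.61 / 1.033 )
    = Φ( 2.33 )
    ≈ 0.99

The correction returns DSR ≈ 0.99: the observed annualised Sharpe of 2.0 survives the deflation. The reason: 1,000 observations is enough sample to detect this magnitude of edge against the multiple-testing burden of 10 candidates.

Same correction at 100 observations, where the null bar rises to 1.575 / √99 ≈ 0.158 and sits above the observed 0.126:

DSR = Φ( (0.126 − 0.158) · √99 / 1.033 ) = Φ( −0.31 ) ≈ 0.38

No longer passing. What drives DSR is the per-period Sharpe relative to the sample length: at SR 0.05 per-period (annualised ≈ 0.8, versus 0.126 for the annualised 2.0 above), the same 1,000 observations give DSR ≈ 0.5, and a larger N or a shorter sample pushes it lower.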
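The arithmetic is easy to get wrong by hand, mainly through unit mix-ups between annualised and per-period Sharpe. The sketch below is one way to script the calculation under stated assumptions: the function name and signature are hypothetical, the Sharpe passed in is per-period, kurtosis is raw (Normal = 3), and the cross-trial standard deviation of Sharpe estimates defaults to its null value 1/√(T−1) when the caller does not supply one.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def deflated_sharpe(sr, n_obs, n_trials, skew=0.0, kurt=3.0, sr_std=None):
    """Deflated Sharpe Ratio as a probability in [0, 1].

    sr      : observed per-period Sharpe of the selected strategy
    n_obs   : number of return observations T
    n_trials: number of strategies considered N
    kurt    : raw kurtosis (3.0 for a Normal series)
    sr_std  : SD of Sharpe estimates across trials; ASSUMED to be
              1 / sqrt(T - 1) (its value under the null) when omitted
    """
    norm = NormalDist()
    if sr_std is None:
        sr_std = 1 / math.sqrt(n_obs - 1)
    if n_trials > 1:
        z_max = ((1 - EULER_GAMMA) * norm.inv_cdf(1 - 1 / n_trials)
                 + EULER_GAMMA * norm.inv_cdf(1 - 1 / (n_trials * math.e)))
    else:
        z_max = 0.0  # one trial: no multiple-testing bar
    sr_benchmark = sr_std * z_max
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom)


if __name__ == "__main__":
    daily_sr = 2.0 / math.sqrt(252)  # annualised 2.0 converted to per-day
    for t in (1000, 250, 100):
        print(t, round(deflated_sharpe(daily_sr, t, 10, skew=-0.4, kurt=5.0), 3))
```

In practice `sr_std` should be estimated from the Sharpe ratios of all trials actually run; the null default is only a fallback when those are unavailable.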

When the LLM adds bias the formula misses

The Deflated Sharpe handles the explicit multiple-testing burden ("we tested N strategies and picked the best"). It does not handle two LLM-specific biases:

1. Implicit prior selection

The LLM's training data already includes thousands of mean-reversion variants. The N strategies it proposes are not independent draws from strategy-space; they are draws from the space of "strategies the LLM has seen work in its training corpus." The effective N for the deflation is much larger than the 10 explicit candidates — closer to the LLM's training-corpus size.

There is no closed-form correction for this. A practical adjustment: assume the effective $N$ is 10x the explicit count (100 rather than 10), which raises $E[\max SR^*]$ by roughly $\sqrt{\ln 100 / \ln 10} \approx 1.4$x under the $\sqrt{2 \ln N}$ approximation.
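To see how sensitive the bar is to the choice of multiplier, the ratio can be computed directly. This is a rough sketch using only the asymptotic $\sqrt{2 \ln N}$ growth rate; the function name and the multipliers tried are illustrative assumptions, not calibrated values.

```python
import math


def bar_inflation(n_explicit: int, multiplier: float) -> float:
    """Ratio of the sqrt(2 ln N) bar at the effective N to the bar at the explicit N.

    Requires n_explicit > 1, since ln(1) = 0 makes the ratio undefined.
    """
    if n_explicit <= 1:
        raise ValueError("n_explicit must be greater than 1")
    n_effective = n_explicit * multiplier
    return math.sqrt(math.log(n_effective) / math.log(n_explicit))


if __name__ == "__main__":
    # How much the null bar rises when 10 explicit candidates are treated
    # as 10x or 100x as many effective candidates.
    for m in (10, 100):
        print(m, round(bar_inflation(10, m), 2))
```

Because the growth is logarithmic, even a crude guess at the multiplier moves the bar far less than the guess itself varies.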

2. Iterative refinement

The pipeline often runs: propose 10, backtest, ask the LLM to refine the best 3, repeat. Each iteration is an additional selection step. The Harvey-Liu multiple-testing correction² handles iterative search but requires the researcher to count every candidate that was considered at any stage. In practice, no one counts honestly. The iteration adds a hidden bias factor of 2-5x on top of the explicit N⁴.

The four primitives that contain selection bias

A defensible LLM-discovery pipeline encodes four primitives:

1. Pre-register the test

Before running any backtest, write down: the exact strategy class to be tested, the universe, the sample window, the metrics that will be computed, and the decision rule that will determine "this strategy works." This is a research-diary discipline; the Research Diary Schema walks through the format.

A pre-registered test cannot be HARKed (hypothesised after results known). The deflation correction applies to the registered candidate count, not to the implicit count.
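One lightweight way to make a pre-registration tamper-evident is to freeze it in a record and store its hash before any backtest runs. The sketch below is an illustration only: the `PreRegistration` class and its field names are hypothetical and are not the Research Diary Schema itself.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Immutable record of what will be tested, written before any backtest."""
    strategy_class: str
    universe: str
    sample_start: str
    sample_end: str
    metrics: tuple
    decision_rule: str
    n_candidates: int  # the N that will later be used for deflation

    def fingerprint(self) -> str:
        """SHA-256 of the canonical JSON form; changes if any field changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    reg = PreRegistration(
        strategy_class="mean reversion",
        universe="US large-cap equities",
        sample_start="2015-01-01",
        sample_end="2022-12-31",
        metrics=("sharpe", "pbo", "dsr"),
        decision_rule="PBO <= 0.5 and DSR >= 0.5",
        n_candidates=10,
    )
    print(reg.fingerprint())
```

Recording the fingerprint somewhere timestamped (a commit, an email to yourself) is what makes a later change to the decision rule detectable.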

2. Compute PBO and DSR every time

The Backtest Overfitting Score tool computes both PBO (Probability of Backtest Overfitting)⁵ via Combinatorially Symmetric Cross-Validation (CSCV) and the Deflated Sharpe Ratio. Every candidate strategy goes through both tests; the winner is discarded if either fails. Specifically:

PBO > 0.5: discard, the in-sample winner is unlikely to be the OOS winner.
DSR < 0.5: discard, the Sharpe is not statistically real after deflation.

The two tests in combination catch most candidates that look great in-sample and fail OOS.
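For readers who want to see the mechanics, the following is a simplified sketch of CSCV-style PBO plus the two-threshold gate. It is not the Backtest Overfitting Score tool's implementation: the function names, the default of 8 blocks, and the use of the per-period Sharpe as the performance metric are assumptions made for illustration.

```python
import itertools
import math
import random
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe ratio; 0.0 when the series has no variance."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks=8):
    """Fraction of train/test splits where the in-sample winner ranks
    at or below the out-of-sample median (logit of relative rank <= 0)."""
    n_strats = len(returns_by_strategy)
    size = len(returns_by_strategy[0]) // n_blocks
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    logits = []
    # Every way of choosing half the blocks as in-sample; the rest are OOS.
    for in_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        in_idx = [t for b in in_ids for t in blocks[b]]
        out_idx = [t for b in range(n_blocks) if b not in in_ids for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in in_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[t] for t in out_idx]) for r in returns_by_strategy]
        best = max(range(n_strats), key=lambda i: is_sr[i])
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])  # 1 = worst, N = best
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(1 for x in logits if x <= 0) / len(logits)


def passes_gate(pbo, dsr, max_pbo=0.5, min_dsr=0.5):
    """Apply both discard rules: PBO > max_pbo or DSR < min_dsr fails."""
    return pbo <= max_pbo and dsr >= min_dsr


if __name__ == "__main__":
    rng = random.Random(0)
    # Ten pure-noise strategies: there is no real edge to find.
    noise = [[rng.gauss(0, 0.01) for _ in range(240)] for _ in range(10)]
    print("PBO on noise:", pbo_cscv(noise))
```

The number of splits grows combinatorially with `n_blocks`, so values much above 16 become slow in pure Python.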

3. Out-of-sample validate before deploy

A held-out window that the LLM and the researcher have never seen is the only honest validation step. The window must be (1) long enough to give statistical power and (2) genuinely held out — no peeking during strategy refinement. The Walk-Forward Validator automates the discipline.
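The "no peeking" rule can be enforced in code rather than by willpower. The sketch below is a hypothetical helper, not the Walk-Forward Validator: it splits a return series chronologically and lets the held-out portion be scored exactly once.

```python
class OneShotHoldout:
    """Chronological split whose held-out tail can be evaluated only once."""

    def __init__(self, returns, holdout_len):
        if not 0 < holdout_len < len(returns):
            raise ValueError("holdout_len must be between 1 and len(returns) - 1")
        # Development data: everything before the held-out tail.
        self.development = list(returns[:-holdout_len])
        self._holdout = list(returns[-holdout_len:])
        self._used = False

    def evaluate(self, metric):
        """Apply metric to the held-out returns; a second call raises."""
        if self._used:
            raise RuntimeError("holdout already consumed; it is no longer out-of-sample")
        self._used = True
        return metric(self._holdout)


if __name__ == "__main__":
    data = [0.001 * ((i % 7) - 3) for i in range(500)]  # placeholder returns
    split = OneShotHoldout(data, holdout_len=126)       # roughly 6 months of days
    print(len(split.development), "development observations")
    print("OOS mean return:", split.evaluate(lambda r: sum(r) / len(r)))
```

The guard is only as strong as the discipline around it, since the raw data can still be loaded elsewhere, but it turns an accidental second look into an explicit error.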

4. Document the search

Every candidate strategy that was considered, including the ones that didn't make it to backtesting, goes in the research diary. The honest deflation N is the total considered count, not the explicit backtested count.
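A minimal way to keep that count honest is to log every candidate at the moment it is considered, whether or not it is ever backtested. The `SearchLog` class and its field names below are hypothetical, offered as a sketch of the bookkeeping rather than a prescribed diary format.

```python
from dataclasses import dataclass, field


@dataclass
class SearchLog:
    """Append-only record of every candidate considered during a search."""
    entries: list = field(default_factory=list)

    def record(self, name: str, stage: str, backtested: bool) -> None:
        """Log a candidate, including ones rejected before any backtest."""
        self.entries.append({"name": name, "stage": stage, "backtested": backtested})

    @property
    def deflation_n(self) -> int:
        """Total considered count: the N to use when deflating."""
        return len(self.entries)

    @property
    def backtested_n(self) -> int:
        """Explicit backtested count: smaller than, or equal to, deflation_n."""
        return sum(1 for e in self.entries if e["backtested"])


if __name__ == "__main__":
    log = SearchLog()
    for i in range(10):
        log.record(f"proposal-{i}", stage="round-1", backtested=i < 6)
    for i in range(3):
        log.record(f"refined-{i}", stage="round-2", backtested=True)
    print("backtested:", log.backtested_n, "N for deflation:", log.deflation_n)
```

Refinement rounds add to the same log, so iterative prompting raises the deflation N automatically instead of silently.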

Why the LLM-specific bias matters

Without correction, an LLM-discovery pipeline produces apparent Sharpes that look 1.5-2x better than the real edge. Because the Kelly fraction scales linearly with the Sharpe at a given volatility, a trader who sizes on the apparent Sharpe is sizing 1.5-2x too aggressively. On an intended quarter-Kelly bet, a 2x overstatement means an effective half-Kelly position, with twice the planned exposure and a much thinner margin against the over-betting regime that the Kelly literature warns against⁶.
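The sizing arithmetic can be made explicit. This sketch assumes the continuous-time Kelly approximation, in which full-Kelly leverage is mean excess return divided by variance, equivalently per-period Sharpe divided by per-period volatility; the function names and the example numbers are illustrative assumptions.

```python
def kelly_fraction(sharpe_per_period: float, vol_per_period: float) -> float:
    """Full-Kelly leverage f* = mu / sigma^2 = SR / sigma (continuous-time approximation)."""
    return sharpe_per_period / vol_per_period


def effective_kelly_multiple(intended_multiple: float, sharpe_overstatement: float) -> float:
    """Multiple of true full Kelly actually deployed when sizing on an inflated Sharpe.

    intended_multiple   : e.g. 0.25 for quarter-Kelly
    sharpe_overstatement: apparent Sharpe divided by true Sharpe, at equal volatility
    """
    return intended_multiple * sharpe_overstatement


if __name__ == "__main__":
    true_sr, apparent_sr, vol = 0.06, 0.12, 0.01  # per-period values, illustrative
    intended = 0.25 * kelly_fraction(apparent_sr, vol)  # what the trader deploys
    true_full = kelly_fraction(true_sr, vol)            # what full Kelly really is
    print("deployed / true full Kelly:", intended / true_full)
    for k in (1.5, 2.0, 4.0):
        print(k, effective_kelly_multiple(0.25, k))
```

Under this approximation a 4x overstatement turns an intended quarter-Kelly bet into a full-Kelly bet, leaving no safety margin at all.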

The Bailey-Lopez de Prado framework explicitly addresses this: "the proliferation of computational power and the rise of machine learning have made false discoveries the norm in quantitative finance, not the exception." The framework's prescription — measure the multiple-testing burden, deflate the apparent Sharpe — is the only defensible posture for LLM-discovery pipelines.

What this means for retail

For a retail trader using an LLM to propose trading strategies:

Assume the LLM proposes ~100 strategies even when you see 10. Pre-register the explicit candidate set.
Run PBO + DSR on every backtest. Discard candidates that fail either.
Validate on a held-out window that the LLM has not seen. Use a recent window (last 6 months) and pretend it does not exist during strategy refinement.
Size on the OOS Sharpe, not the in-sample Sharpe.

The combination is more work than running the LLM ten times and picking the best. It is also the only honest workflow.

Failure modes

Treating the LLM's first-pass output as independent candidates. The LLM has already selected; the candidates are not independent draws.
Skipping the deflation step because "the sample is large." Sample size cancels some of the deflation but not all. Always compute DSR.
Iterating after seeing OOS results. Once OOS contaminates the strategy refinement, it is no longer OOS. Treat OOS as a one-shot test.
Not counting "fast eliminations" in the candidate count. An LLM-proposed strategy that you discarded after reading the proposal is still a candidate for deflation purposes.

FAQ

Should I use fewer LLM-proposed strategies?

No. Use as many as you want, but count them honestly and deflate accordingly. The Deflated Sharpe correction scales with $\sqrt{\ln N}$ — slowly. Testing 100 strategies is not dramatically worse than testing 10 from a deflation standpoint.

Is there a way to use an LLM for strategy discovery without selection bias?

Use the LLM for idea generation, not for strategy selection. Have the LLM propose a strategy class (e.g., "mean reversion on opening range") and define a single parameterisation. Test that one strategy without further LLM-driven refinement. The N for deflation is then 1, not the LLM's proposal count.

How does this connect to PBO?