Overfitting

Overfitting occurs when model parameters are tuned so tightly to historical data that the strategy describes random structure rather than persistent signal. The hallmark: in-sample Sharpe is excellent, out-of-sample Sharpe collapses. The mechanism is multiple testing: every parameter combination explored is a hypothesis tested, and with enough trials, random noise produces strategies that look profitable purely by chance.

By Orbyd Editorial · AI Fin Hub Team

Why it matters

Most retail backtests, and a meaningful fraction of institutional ones, are overfit. The strategies look great until they go live, at which point live Sharpe typically lands at 0.3 to 0.5 times the backtested Sharpe, a haircut of 50 to 70 percent. Diagnosing overfitting is often more valuable than designing a new strategy.

How it works

Hold out a true out-of-sample period and never look at it during development. Use walk-forward validation, fitting on past data and scoring only on the period that follows it. Compute the Probability of Backtest Overfitting (PBO): the share of in-sample/out-of-sample splits in which the parameter combination ranked best in-sample lands in the bottom half of the out-of-sample ranking. Apply Bailey and López de Prado's deflated Sharpe ratio to penalize multiple testing. Treat any strategy whose deflated Sharpe collapses to near zero as overfit until proven otherwise.
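The PBO step can be sketched with combinatorially symmetric cross-validation (CSCV). This is a minimal illustration using only the Python standard library, not a reference implementation: the function names `sharpe` and `pbo_cscv`, the choice of 8 blocks, and the simulated noise returns are all assumptions made for the example.

```python
import itertools
import math
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample std); 0.0 if std is zero."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks=8):
    """Estimate PBO via CSCV.

    returns_by_strategy: equal-length return series, one per parameter
    combination. n_blocks must be even; leftover rows are ignored.
    """
    n_obs = len(returns_by_strategy[0])
    size = n_obs // n_blocks
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]
    n_strats = len(returns_by_strategy)
    below_median = 0
    n_splits = 0
    # Every way of choosing half the blocks as in-sample (IS).
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in oos_ids for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in is_rows]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[t] for t in oos_rows]) for r in returns_by_strategy]
        best = max(range(n_strats), key=is_sr.__getitem__)
        # Relative out-of-sample rank of the IS winner, strictly in (0, 1).
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logit = math.log(omega / (1.0 - omega))
        if logit <= 0:  # IS winner is at or below the OOS median
            below_median += 1
        n_splits += 1
    return below_median / n_splits


# Hypothetical data: 20 pure-noise "strategies", 240 periods each.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.01) for _ in range(240)] for _ in range(20)]
pbo_noise = pbo_cscv(noise)
print(f"PBO on pure noise: {pbo_noise:.2f}")
```

A PBO near or above 0.5 suggests that picking the in-sample winner carries no out-of-sample advantage; a value near zero suggests the ranking persists.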

Example

Ten thousand random strategies on equity returns

Strategies tested: 10,000
Best in-sample Sharpe: 2.4
Same strategy, out-of-sample Sharpe: 0.3
Deflated Sharpe: 0.1 (not significant)

The 'best' strategy from a 10,000-strategy search has an in-sample Sharpe that looks impressive and a deflated Sharpe that says you found random luck. Live, it would most likely disappoint and, after costs, could easily lose money.
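The effect in this example can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the calculation behind the figures above: every "strategy" is pure Gaussian noise with zero mean, the search uses 2,000 trials rather than 10,000 to keep the run fast, and the names `annualized_sharpe` and `best_of_random_search` are hypothetical. Its numbers will not match the table exactly.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumed number of daily returns per year


def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio from daily returns (risk-free rate ignored)."""
    sd = statistics.stdev(daily_returns)
    return statistics.fmean(daily_returns) / sd * math.sqrt(TRADING_DAYS)


def best_of_random_search(n_strategies, n_days=TRADING_DAYS, seed=0):
    """Pick the strategy with the best in-sample Sharpe among pure-noise
    strategies and report how that same strategy does out-of-sample."""
    rng = random.Random(seed)
    best_is, best_oos = -math.inf, None
    for _ in range(n_strategies):
        # Zero true edge: both halves are independent noise.
        in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sr_is = annualized_sharpe(in_sample)
        if sr_is > best_is:
            best_is = sr_is
            best_oos = annualized_sharpe(out_sample)
    return best_is, best_oos


best_is, best_oos = best_of_random_search(2000)
print(f"best in-sample Sharpe:      {best_is:.2f}")
print(f"same strategy out-of-sample: {best_oos:.2f}")
```

Because no strategy here has any real edge, whatever in-sample Sharpe the winner shows is entirely a product of the search, and its out-of-sample Sharpe is just one more draw from noise.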

Key Takeaways

1. Every parameter you tune is a hypothesis you test; multiple testing is real.

2. Deflated Sharpe is the cleanest single-number defense against overfitting claims.

3. If you can't reproduce the strategy on a held-out, never-touched dataset, it's overfit.
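The held-out dataset in the last takeaway combines naturally with walk-forward validation. A minimal sketch of the index bookkeeping, assuming observations are ordered in time; the function name `walk_forward_splits` and the window sizes are hypothetical choices for illustration.

```python
def walk_forward_splits(n_obs, train_size, test_size, holdout_size):
    """Yield (train, test) index ranges for rolling walk-forward validation.

    The final holdout_size observations are never yielded, so they stay
    untouched until the strategy is frozen.
    """
    usable = n_obs - holdout_size
    start = 0
    while start + train_size + test_size <= usable:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one test window


# Hypothetical sizes: 1,000 observations, last 200 locked away.
splits = list(walk_forward_splits(1000, train_size=500, test_size=100,
                                  holdout_size=200))
holdout = range(800, 1000)
for train, test in splits:
    print(f"train {train.start}-{train.stop - 1}, "
          f"test {test.start}-{test.stop - 1}")
```

Each test window sits strictly after its training window, and no window reaches into the holdout range, which is scored once at the very end.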

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

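There's no clean threshold for how many trials are too many, but the deflated-Sharpe penalty grows with the number of trials, roughly in proportion to the square root of log(N_trials), and scales with how widely Sharpe estimates vary across trials. As a rough guide, twenty trials cuts about 0.3 off your Sharpe expectation under the null, and a thousand trials cuts about 0.6. Track and report N_trials honestly.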
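How the penalty grows with the number of trials can be sketched with the expected-maximum approximation used in Bailey and López de Prado's deflated Sharpe ratio. The code below is a hedged illustration: the function names are hypothetical, `sharpe_std` (the standard deviation of Sharpe estimates across trials) is an input you must estimate yourself, and Sharpe values are per-period rather than annualized.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected maximum Sharpe across n_trials independent
    trials when the true Sharpe is zero. Requires n_trials >= 2."""
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_std,
                    skew=0.0, kurtosis=3.0):
    """Probability that the observed per-period Sharpe beats the
    expected-maximum benchmark, adjusting for sample length and
    non-normal returns (kurtosis=3.0 means normal tails)."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1.0 - skew * sharpe
                      + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Penalty in units of sharpe_std for a few trial counts.
for n in (20, 100, 1000, 10000):
    print(n, round(expected_max_sharpe(n, sharpe_std=1.0), 2))
```

Multiplying the printed values by your own estimate of `sharpe_std` gives the Sharpe you should expect from the best of N trials even when nothing works. A deflated Sharpe well below 0.95 means the observed result is not distinguishable from that benchmark.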

Planning estimates only — not financial, tax, or investment advice.