A respectable 0.600 walk-forward efficiency can hide a single window where out-of-sample Sharpe goes deeply negative, and that window is the one that sizes your risk. On a 50-observation equity-curve segment with is_len = 25, oos_len = 10, step = 5, the Walk-Forward Validator returns nWindows = 4, mean in-sample Sharpe 1.089, mean out-of-sample Sharpe 0.653, and efficiency = 0.600. The window-by-window breakdown shows the failure mode the aggregate hides: window 3 has IS Sharpe 3.63 and OOS Sharpe −3.88. That single window justifies a stricter sizing rule than the means suggest, and it points to a decision procedure that keys window sizes to signal half-life rather than to tape length.
TL;DR
For a strategy whose alpha persists for h periods, pick train window = max(10·h, 100 trades), OOS = max(5·h, 30 trades), and step = h. The 50-observation canonical run above (is_len = 25, oos_len = 10, step = 5) implies an assumed half-life of 5 periods (step = h), but the four windows the engine produced show an OOS-Sharpe range from −3.88 to +4.64, which is not the variance of a 5-period-half-life signal; it is the variance of a noise-dominated tape. Decision: shorten step to 2, lengthen IS to at least 50 (which needs a tape longer than 50 observations), or stop walk-forward and route the data through PBO via CSCV (see /articles/deflated-sharpe-vs-pbo-on-the-same-tape/).
The window-by-window read
The engine returns four windows for is_len = 25 / oos_len = 10 / step = 5 on the canonical 50-bar segment:
| Window | IS [start,end) | OOS [start,end) | IS Sharpe | OOS Sharpe | OOS return |
|---|---|---|---|---|---|
| 0 | [0, 25) | [25, 35) | −0.221 | +2.090 | +0.0141 |
| 1 | [5, 30) | [30, 40) | −0.044 | +4.643 | +0.0314 |
| 2 | [10, 35) | [35, 45) | +0.995 | −0.239 | −0.0020 |
| 3 | [15, 40) | [40, 50) | +3.627 | −3.881 | −0.0183 |
Mean IS Sharpe = 1.089. Mean OOS Sharpe = 0.653. Efficiency (mean OOS / mean IS) = 0.600. The aggregate looks like a strategy that loses 40% of its IS edge OOS, annoying but survivable. The window-by-window read tells a sharper story: windows 0 and 1 produced positive OOS Sharpe with negative IS Sharpe (the strategy appeared dead in-sample then revived OOS); window 3 produced an IS Sharpe of 3.63 and an OOS Sharpe of −3.88 (the strongest IS signal collapsed hardest OOS).
That pattern is the canonical fingerprint of regime-fragile alpha. The aggregate efficiency = 0.6 hides it; the per-window OOS Sharpe spread of (−3.88, +4.64) does not.
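The aggregates can be recomputed from the per-window table in a few lines. The sketch below uses the rounded Sharpe values from the table, so the last digit may differ slightly from the engine's full-precision output; the variable names are illustrative, not the engine's field names.

```python
from statistics import mean

# Per-window Sharpe ratios copied from the table above (rounded to 3 decimals).
is_sharpe = [-0.221, -0.044, 0.995, 3.627]
oos_sharpe = [2.090, 4.643, -0.239, -3.881]

mean_is = mean(is_sharpe)
mean_oos = mean(oos_sharpe)

# Efficiency as defined on this page: ratio of the means, not mean of the ratios.
efficiency = mean_oos / mean_is

# The spread is the number the aggregate hides.
oos_spread = (min(oos_sharpe), max(oos_sharpe))

print(f"mean IS = {mean_is:.3f}, mean OOS = {mean_oos:.3f}, efficiency = {efficiency:.3f}")
print(f"OOS Sharpe spread = {oos_spread}")
```

Printing the spread next to the efficiency keeps the single collapsing window visible in any summary line.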
The decision rule
Three inputs drive the window sizing:
- Signal half-life h, the number of periods over which the predictive feature decays to half its initial coefficient in a rolling regression. For a momentum signal on daily data h is typically 5–15 days; for a mean-reversion signal h is 2–5 days. The Cointegration Half-Life Solver computes h directly for pairs and ratios.
- Trade count per window: the OOS window must contain at least 30 trades for the OOS Sharpe to be statistically meaningful at all. For a strategy trading once per period that means oos_len ≥ 30.
- Step size: small enough to give the validator at least nWindows = 8 windows. With fewer than that, the standard error of the mean OOS Sharpe is too large for the mean to be a useful decision input.
The rule:
is_len = max(10 * h, 100)
oos_len = max(5 * h, 30)
step = h
n_wind = floor((T − is_len − oos_len) / step) + 1, target ≥ 8
For a daily strategy with h = 5 days and T = 504 days: is_len = 100, oos_len = 30, step = 5, nWindows = floor((504 − 130) / 5) + 1 = 74 + 1 = 75. That is a defensible walk-forward configuration; 75 OOS windows of 30 trades each give a mean OOS Sharpe with standard error around 0.18 if the per-window estimates are independent (which they are not, but the rough bound holds).
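The rule translates directly into a small helper. This is a minimal sketch, assuming one trade per period so that the 100-trade and 30-trade floors can be read as period counts; `size_windows` and `count_windows` are hypothetical names, not part of the engine.

```python
def count_windows(total_periods: int, is_len: int, oos_len: int, step: int) -> int:
    """Number of full IS + OOS pairs that fit when sliding by `step`."""
    usable = total_periods - is_len - oos_len
    if usable < 0 or step <= 0:
        return 0
    return usable // step + 1  # integer division floors the partial last step


def size_windows(h: int, total_periods: int) -> dict:
    """Half-life-keyed sizing, assuming one trade per period."""
    is_len = max(10 * h, 100)
    oos_len = max(5 * h, 30)
    step = h
    return {
        "is_len": is_len,
        "oos_len": oos_len,
        "step": step,
        "n_windows": count_windows(total_periods, is_len, oos_len, step),
    }


daily = size_windows(h=5, total_periods=504)
canonical_n = count_windows(total_periods=50, is_len=25, oos_len=10, step=5)
print(daily)        # the h = 5, T = 504 example
print(canonical_n)  # the 50-observation canonical run
```

The integer division matters: 374 / 5 is 74.8, and only 74 full steps fit after the first window.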
The canonical input (is_len = 25, oos_len = 10, step = 5, T = 50) yields nWindows = 4 and fails all three numeric checks: is_len = 25 is below 100, oos_len = 10 is below 30, and nWindows = 4 is below 8. That is why the per-window OOS Sharpe spread is so wide: the validator is fitting noise within each window.
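The three checks can be expressed as a pre-flight guard run before any walk-forward result is reported. The sketch below is an illustration only: `failed_checks` and its `trades_per_period` parameter are hypothetical, and the thresholds are the ones stated in the rule above.

```python
def failed_checks(total_periods, is_len, oos_len, step, trades_per_period=1.0):
    """Return a list of the rule's numeric checks that a configuration fails."""
    failures = []
    if is_len * trades_per_period < 100:
        failures.append("IS window holds fewer than 100 trades")
    if oos_len * trades_per_period < 30:
        failures.append("OOS window holds fewer than 30 trades")
    usable = total_periods - is_len - oos_len
    n_windows = usable // step + 1 if usable >= 0 else 0
    if n_windows < 8:
        failures.append(f"only {n_windows} windows, target is at least 8")
    return failures


canonical = failed_checks(total_periods=50, is_len=25, oos_len=10, step=5)
daily = failed_checks(total_periods=504, is_len=100, oos_len=30, step=5)
print(canonical)  # three failures
print(daily)      # empty list
```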
Why the mean OOS Sharpe is the wrong headline
The Walk-Forward Validator's efficiency = 0.600 looks like a clean summary. It is a ratio of two means taken over four windows, two of which had negative IS Sharpe. The ratio (mean OOS / mean IS) gets pulled around by sign flips in ways that are easy to miss.
Three complementary numbers are more honest.
The OOS Sharpe distribution. The four-window read above gives an empirical distribution: −3.88, −0.24, +2.09, +4.64. Keep that distribution and report its 25th and 75th percentiles alongside the median. The 50th-percentile OOS Sharpe is +0.93, but the 25th percentile, taken as the midpoint of the two lowest values, is −2.06. The Markov drawdown analyzer (see /articles/drawdown-markov-and-recovery-tail/) consumes exactly this kind of distribution.
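With only four points the quartile convention matters. The sketch below assumes the median-of-halves convention, which reproduces the −2.06 and +0.93 quoted above; other interpolation conventions give different quartiles on a sample this small, which is itself a reason to print the raw values too.

```python
from statistics import median

# Per-window OOS Sharpe values from the table (rounded to 3 decimals).
oos_sharpe = sorted([2.090, 4.643, -0.239, -3.881])

half = len(oos_sharpe) // 2
q1 = median(oos_sharpe[:half])    # median of the lower half
q2 = median(oos_sharpe)           # overall median
q3 = median(oos_sharpe[-half:])   # median of the upper half

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}")
```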
The OOS-IS rank correlation. Spearman correlation between window-level IS Sharpe and OOS Sharpe answers "does in-sample success predict out-of-sample success." On the canonical four windows: IS = [−0.22, −0.04, 0.99, 3.63], OOS = [2.09, 4.64, −0.24, −3.88]. Spearman ρ = −0.80 (strong anti-correlation: IS ranks [1,2,3,4] map to OOS ranks [3,4,2,1], Σd² = 18, ρ = 1 − 6·18/(4·15) = −0.80), which means the strongest in-sample window was the weakest out-of-sample — a textbook regime flip.
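The rank arithmetic above can be checked with a short pure-Python sketch. It assumes no tied values, which holds for these four windows; `ranks` and `spearman_rho` are illustrative helpers rather than engine functions.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via the sum of squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


is_sharpe = [-0.22, -0.04, 0.99, 3.63]
oos_sharpe = [2.09, 4.64, -0.24, -3.88]

rho = spearman_rho(is_sharpe, oos_sharpe)
print(ranks(is_sharpe), ranks(oos_sharpe), round(rho, 2))
```

With four windows the coefficient can only take a handful of values, so it is best read as a direction rather than a precise estimate.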
The window-count adequacy check. nWindows = 4 is below the rule's threshold of 8. The Walk-Forward Validation Visualiser presents this as a configuration warning when num_windows is below 6; on this configuration both engines agree that the tape is too short to use walk-forward at all.
The 100-trade floor
The "≥ 100 trades in the IS window" half of the rule is the load-bearing one for retail LLM strategies. LLM-driven signals tend to be sparse: a position is opened only when the model emits a sufficiently confident forecast. A 100-day IS window with a 10% trade frequency produces 10 trades, which is not enough to estimate IS Sharpe with any standard error worth printing.
The defensive workflow is to size the IS window in trade count, not period count. If the strategy emits trades at 5% frequency, is_len_periods = 100 / 0.05 = 2000 periods (about eight years of daily data). That is more data than most retail backtests carry. The honest report is then "walk-forward is not feasible at this trade frequency; use combinatorially-symmetric cross-validation instead" — see Did You Overfit? PBO and Deflated Sharpe for the CSCV alternative.
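Sizing in trade count is a one-line conversion. The sketch below is a hedged illustration: `is_len_for_trades` and `expected_trades` are hypothetical helpers, trade frequency is assumed constant over the tape, and 252 trading days per year is an assumption used only for the year estimate.

```python
import math
from fractions import Fraction


def is_len_for_trades(min_trades: int, trade_frequency: float) -> int:
    """Periods needed for the IS window to hold `min_trades` trades on average."""
    # Fraction(str(...)) avoids binary floating-point rounding before the ceil.
    freq = Fraction(str(trade_frequency))
    return math.ceil(min_trades / freq)


def expected_trades(periods: int, trade_frequency: float) -> int:
    """Average trade count in a window of `periods` at a constant frequency."""
    return math.floor(periods * Fraction(str(trade_frequency)))


sparse_is_len = is_len_for_trades(100, 0.05)
years = sparse_is_len / 252  # assumes 252 trading days per year
short_window_trades = expected_trades(100, 0.10)

print(sparse_is_len, round(years, 1), short_window_trades)
```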
Anchored vs rolling
Anchored walk-forward fixes is_start = 0 for every window, growing the training set. Rolling walk-forward (the engine default) slides the IS window of fixed length. For a stationary signal the two are equivalent in expectation; for a non-stationary signal anchored over-weights early data and rolling drops it.
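The difference between the two schemes is only where each IS window starts. The generator below is a minimal sketch, not the engine's implementation; `make_windows` and its `anchored` flag are hypothetical names chosen for illustration.

```python
def make_windows(total, is_len, oos_len, step, anchored=False):
    """Return (is_start, is_end, oos_start, oos_end) tuples, end-exclusive."""
    if step <= 0:
        raise ValueError("step must be positive or the window never advances")
    windows = []
    is_end = is_len
    while is_end + oos_len <= total:
        # Anchored keeps the first observation; rolling keeps a fixed IS length.
        is_start = 0 if anchored else is_end - is_len
        windows.append((is_start, is_end, is_end, is_end + oos_len))
        is_end += step
    return windows


rolling = make_windows(50, 25, 10, 5)
anchored = make_windows(50, 25, 10, 5, anchored=True)
print(rolling)
print(anchored)
```

On the canonical configuration both schemes share the same OOS slices; only the training data behind each slice differs.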
The choice has to be declared. The Walk-Forward Validation Cookbook (/articles/walk-forward-validation-cookbook/) walks through the trade-offs end to end. The canonical run on this page is rolling: the verified inputs list mode = rolling. An anchored variant keeps is_start = 0 while the IS/OOS boundary still advances by step > 0 (a step of 0 would never produce a second window); the engine's input schema documents the anchored variant.
Where the rule breaks
The half-life-keyed rule assumes h is stable. For a regime-fragile alpha, h itself drifts: a momentum signal that decayed in 8 days during 2019–2021 might decay in 3 days in 2022–2024. Walk-forward configured to the 8-day half-life then under-samples regimes; configured to the 3-day half-life it over-fits noise in the slow regime.
The defence is to estimate h on rolling windows, not on the full tape. A regime-switching half-life model (Markov-switching ARMA on the residual) is the right next step; the Drawdown Recovery Markov engine consumes the regime-switched moments directly.
The rule also assumes returns are independent across windows. Step-size overlap between consecutive windows breaks that assumption; the engine's nWindows = 4 with step = 5 / oos_len = 10 means each pair of consecutive OOS windows shares 5 observations. Reporting standard error of the mean OOS Sharpe under the assumption of independence is dishonest at step < oos_len. The correction is a block-bootstrap with block size = oos_len.
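A moving-block bootstrap along those lines can be sketched with the standard library. Everything here is an assumption made for illustration: the 50 returns on this page are elided, so the series is synthetic; Sharpe is computed per period as mean over sample standard deviation with no annualization; and the function names are hypothetical.

```python
import random
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe: mean / sample stdev (no annualization, an assumption)."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def mean_oos_sharpe(returns, is_len, oos_len, step):
    """Mean Sharpe over the OOS slices of a rolling walk-forward."""
    scores = []
    start = is_len
    while start + oos_len <= len(returns):
        scores.append(sharpe(returns[start:start + oos_len]))
        start += step
    return mean(scores)


def block_bootstrap_se(returns, is_len, oos_len, step, n_boot=500, seed=0):
    """Std. error of the mean OOS Sharpe from resampling blocks of length oos_len."""
    rng = random.Random(seed)
    n, block = len(returns), oos_len
    n_blocks = -(-n // block)  # ceiling division
    stats = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(n_blocks):
            s = rng.randrange(0, n - block + 1)  # random block start
            resampled.extend(returns[s:s + block])
        stats.append(mean_oos_sharpe(resampled[:n], is_len, oos_len, step))
    return stdev(stats)


# Synthetic stand-in for the 50-observation tape.
data_rng = random.Random(42)
returns = [data_rng.gauss(0.0005, 0.01) for _ in range(50)]

overlap = max(0, 10 - 5)  # observations shared by consecutive OOS windows
se = block_bootstrap_se(returns, is_len=25, oos_len=10, step=5)
print(f"overlap = {overlap}, bootstrap SE of mean OOS Sharpe = {se:.3f}")
```

Resampling whole blocks preserves the within-block dependence that the independence assumption ignores, at the cost of a noisy estimate when the tape holds only a few blocks.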
Connects to
- Walk-Forward Validation: A Cookbook — anchored vs rolling, the four parameters, runnable Python template.
- Walk-Forward Validation Pitfalls for LLM Strategies — LLM-specific failure modes including trade-frequency sparsity.
- Did You Overfit? PBO and Deflated Sharpe — the CSCV alternative when walk-forward is infeasible.
- Walk-Forward Validator — engine endpoint.
- Walk-Forward Validation Visualiser — companion visualiser.
References
- Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter on walk-forward validation and combinatorially-symmetric cross-validation. https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086
- Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2016). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39–70. SSRN abstract https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253
- Harvey, C. R., & Liu, Y. (2015). "Backtesting." Journal of Portfolio Management 42(1), 13–28.
- Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies, 2nd ed. Wiley. Anchored vs rolling walk-forward conventions.
- Aronson, D. R. (2007). Evidence-Based Technical Analysis. Wiley. Bootstrap and randomization tests for trading-rule validation.
Verified engine output
Inputs

| Input | Value |
|---|---|
| returns | [...] (50 items) |
| is_len | 25 |
| oos_len | 10 |
| step | 5 |
| mode | rolling |

Windows

| index | is start | is end | oos start | oos end | is sharpe | oos sharpe | oos return |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 25 | 25 | 35 | -0.22147921511019814 | 2.0897655072803043 | 0.01411157646538208 |
| 1 | 5 | 30 | 30 | 40 | -0.043948848702680766 | 4.642601732123999 | 0.03139877858806628 |
| 2 | 10 | 35 | 35 | 45 | 0.994853256422037 | -0.2385824989138949 | -0.0020461796084789707 |
| 3 | 15 | 40 | 40 | 50 | 3.6270655970069248 | -3.8808219831393376 | -0.018331354770414254 |

Summary

| Output | Value |
|---|---|
| n windows | 4 |
| mean is sharpe | 1.0891226974040207 |
| mean oos sharpe | 0.6532406893377675 |
| efficiency | 0.5997861314384503 |
Computed live at build time.
Frequently asked questions
- Why does the engine produce nWindows = 4 from a 50-observation tape with is_len = 25, oos_len = 10, step = 5?
- Because (50 − 25 − 10) / 5 + 1 = 4. The validator slides the IS window by step until no full IS + OOS pair fits within the tape; the nWindows ≥ 8 threshold rejects this configuration as under-sampled.
- What does efficiency = 0.600 actually mean?
- It is mean OOS Sharpe / mean IS Sharpe = 0.6532 / 1.0891 = 0.5998. The metric is sensitive to sign flips in either mean and should be reported alongside the per-window OOS Sharpe distribution and the IS-OOS rank correlation.
- How does this interact with the deflated Sharpe?
- Walk-forward returns a per-window OOS Sharpe distribution; deflated Sharpe returns a single number with a selection-bias adjustment. They answer complementary questions and a defensible report runs both.
- Should I use anchored or rolling walk-forward?
- Default to rolling. Anchored is appropriate when there is theoretical reason to weight all earlier data equally; for LLM-driven signals on non-stationary tapes, rolling is the safer default.
- What's the minimum number of windows before mean OOS Sharpe is meaningful?
- Eight is the floor; sixteen gives the mean a standard error worth quoting. Below eight the mean is dominated by single-window variance and the configuration is best reported as exploratory.