TL;DR

Synthetic data is not a substitute for real out-of-sample validation, but it is the cheapest way to scaffold a backtest pipeline. Use it to unit-test the plumbing, stress-test the risk estimator against fat tails and vol clustering that the historical sample did not happen to contain, and cross-check whether a strategy's apparent edge survives when the data-generating process is known. The progression from simplest to most realistic: geometric Brownian motion (log-normal returns; fine for wiring), GARCH(1,1) (adds vol clustering; Bollerslev 1986), regime-switching (two-state Markov for bull/bear regimes; Hamilton 1989), and a Gaussian copula for pairs (joint distribution for multi-asset tests, with a strong warning on understated tail dependence; Embrechts, McNeil & Straumann 2002). Each comes with a working Python template and its honest limits.

Why scaffold with synthetic data

Three uses where synthetic data is the right tool:

  1. Pipeline wiring. Before wasting real-data iterations, verify that the event loop, fill simulator, equity-curve assembly, and reporting code produce the correct numbers on a known input. Feed GBM paths through the pipeline, verify that the empirical Sharpe converges to the analytical value as sample size grows.
  2. Risk-estimator stress tests. Historical data contains only the regimes that happened to occur. If a strategy has run only through low-volatility tape, its risk model has never been tested against 2008 or 2020. Synthetic data lets you inject worse tails than history provided and verify the risk layer catches them.
  3. Overfitting diagnostics. If a strategy shows Sharpe 2 on a synthetic series known to have zero true edge, the research pipeline is fitting noise. Run the entire optimisation loop against pure-noise synthetic data before trusting the result on real data; a toy version of this null test is sketched just below.
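A minimal sketch of that null test, using a toy moving-average-crossover optimisation as the stand-in strategy (the crossover rule, the lookback grid, and the data size here are illustrative assumptions, not part of any real pipeline):

import numpy as np
import pandas as pd

def sma_crossover_sharpe(rets: pd.Series, fast: int, slow: int) -> float:
    # in-sample annualised Sharpe of a long-only SMA-crossover rule
    px = (1 + rets).cumprod()
    signal = (px.rolling(fast).mean() > px.rolling(slow).mean()).astype(float)
    strat = signal.shift(1).fillna(0.0) * rets
    return float(strat.mean() / strat.std() * np.sqrt(252))

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(0.0, 0.01, size=2520))   # 10 years of zero-edge returns

# grid-search the lookbacks; any "edge" found here is pure selection bias
grid = [(f, s) for f in (5, 10, 20, 50) for s in (60, 100, 150, 200)]
best = max(sma_crossover_sharpe(noise, f, s) for f, s in grid)
print(f"best in-sample Sharpe on zero-edge data: {best:.2f}")

If the research pipeline's own optimisation loop produces a similarly inflated number on this kind of input, the edge it reports on real data deserves the same suspicion.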

Three uses where synthetic data is the wrong tool:

  1. Parameter selection. Tuning a strategy against a generator tunes it to the generator's assumptions, not to the market.
  2. Final validation. A backtest on synthetic series says nothing about out-of-sample performance on real data.
  3. Justifying real capital. Never fund a strategy whose edge has been demonstrated only on synthetic data; see "The one rule that matters" below.

Level 1: Geometric Brownian motion

The standard log-normal model behind Black-Scholes:

dS_t = μ · S_t · dt + σ · S_t · dW_t

Discretised:

log(S_{t+1} / S_t) ~ N((μ − σ²/2) · dt, σ² · dt)

GBM generates returns that are independent, identically distributed, and Gaussian. Properties:

  • No volatility clustering. Yesterday's big move has no effect on today's expected magnitude.
  • No fat tails. Tail probabilities drop off as in a normal distribution.
  • Stationary mean and variance. Parameters do not shift.

What GBM is good for: wiring a pipeline, generating "null" data for overfitting diagnostics, benchmarking an analytical result against Monte Carlo.

What GBM is bad for: anything tail-risk related. A VaR estimator tuned on GBM will silently understate tail risk on real data by 2–3×.

import numpy as np
import pandas as pd

def gbm_returns(
    n_days: int,
    mu_annual: float = 0.08,
    sigma_annual: float = 0.16,
    dt: float = 1 / 252,
    seed: int | None = None,
) -> pd.Series:
    """Daily log returns under geometric Brownian motion (i.i.d. Gaussian)."""
    rng = np.random.default_rng(seed)
    drift = (mu_annual - 0.5 * sigma_annual ** 2) * dt
    shock = sigma_annual * np.sqrt(dt) * rng.standard_normal(n_days)
    log_rets = drift + shock
    return pd.Series(log_rets, name="log_return")
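A quick wiring check for use 1 above, assuming the gbm_returns defaults: for i.i.d. GBM log returns the annualised Sharpe (mean over standard deviation of log returns, no risk-free adjustment) is approximately (μ − σ²/2) / σ = (0.08 − 0.0128) / 0.16 ≈ 0.42, and the empirical estimate should converge to it as the sample grows.

def annualised_sharpe(log_rets: pd.Series, periods: int = 252) -> float:
    # mean / std of per-period log returns, annualised; no risk-free adjustment
    return float(log_rets.mean() / log_rets.std() * np.sqrt(periods))

analytical = (0.08 - 0.5 * 0.16 ** 2) / 0.16          # ≈ 0.42 with the defaults above
for n in (252, 2_520, 252_000):
    est = annualised_sharpe(gbm_returns(n, seed=42))
    print(f"n={n:>7}  empirical={est:5.2f}  analytical={analytical:5.2f}")

If the empirical number does not tighten around the analytical one as n grows, the bug is in the pipeline, not the market.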

Level 2: GARCH(1,1)

Bollerslev's Generalised Autoregressive Conditional Heteroskedasticity model (Bollerslev 1986)1 is the simplest formulation that captures the most prominent stylised fact of real returns: volatility clusters. High-vol days are followed by high-vol days; calm days are followed by calm days.

r_t     = σ_t · z_t,           z_t ~ N(0, 1)
σ²_t    = ω + α · r²_{t-1} + β · σ²_{t-1}

Parameters:

  • ω — long-run variance floor (positive, small).
  • α — reaction to yesterday's shock (typical 0.05–0.15 on equity indices).
  • β — persistence of volatility (typical 0.80–0.92).

The constraint α + β < 1 enforces mean-reverting volatility. Sums close to 1 (e.g. α + β = 0.99) produce near-integrated, long-memory volatility.
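As a worked example with the template defaults below (ω = 1e-6, α = 0.08, β = 0.90), the unconditional daily variance is ω / (1 − α − β) = 1e-6 / 0.02 = 5e-5, i.e. a daily volatility of about 0.7% and an annualised volatility of roughly 11%, reasonable for a calm equity index.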

GARCH captures clustering but still generates Gaussian conditional returns. Real equity returns have conditional tails fatter than Gaussian, which is why practitioners fit GARCH with Student-t innovations in production. For scaffolding purposes, Gaussian innovations are sufficient.

def garch_returns(
    n_days: int,
    omega: float = 1e-6,
    alpha: float = 0.08,
    beta: float = 0.90,
    seed: int | None = None,
) -> pd.Series:
    """Daily returns from a GARCH(1,1) process with Gaussian innovations."""
    assert alpha + beta < 1, "non-stationary: alpha + beta must be < 1"
    rng = np.random.default_rng(seed)
    long_run_var = omega / (1 - alpha - beta)
    sigma2 = np.zeros(n_days)
    rets = np.zeros(n_days)
    sigma2[0] = long_run_var          # start at the unconditional variance
    rets[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n_days):
        # variance recursion: reacts to yesterday's shock, decays toward the long-run level
        sigma2[t] = omega + alpha * rets[t - 1] ** 2 + beta * sigma2[t - 1]
        rets[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return pd.Series(rets, name="return")
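For scaffolding, the Gaussian version above is enough. If you do want the fatter conditional tails mentioned earlier, a minimal Student-t variant of the same recursion looks like this (the ν default and the unit-variance rescaling are assumptions, not fitted values):

def garch_t_returns(
    n_days: int,
    omega: float = 1e-6,
    alpha: float = 0.08,
    beta: float = 0.90,
    nu: float = 6.0,
    seed: int | None = None,
) -> pd.Series:
    """GARCH(1,1) daily returns with Student-t innovations."""
    assert alpha + beta < 1, "non-stationary: alpha + beta must be < 1"
    assert nu > 2, "need nu > 2 for finite innovation variance"
    rng = np.random.default_rng(seed)
    scale = np.sqrt((nu - 2) / nu)     # rescale t draws to unit variance
    sigma2 = np.zeros(n_days)
    rets = np.zeros(n_days)
    sigma2[0] = omega / (1 - alpha - beta)
    rets[0] = np.sqrt(sigma2[0]) * rng.standard_t(nu) * scale
    for t in range(1, n_days):
        sigma2[t] = omega + alpha * rets[t - 1] ** 2 + beta * sigma2[t - 1]
        rets[t] = np.sqrt(sigma2[t]) * rng.standard_t(nu) * scale
    return pd.Series(rets, name="return")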

Validate a fit by computing autocorrelation of squared returns — real equity data shows positive autocorrelation out to ~50 lags, and GARCH(1,1) reproduces that signature. If a strategy's backtest on real data looks nothing like its backtest on GARCH-generated data of similar mean/variance, the strategy is relying on sequential dependence (trend or mean reversion in returns themselves, not just in volatility).
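A sketch of that autocorrelation check, comparing the clustering signature of the two generators defined so far (the sample size and lag count are arbitrary choices):

def squared_return_acf(rets: pd.Series, lags: int = 50) -> pd.Series:
    # autocorrelation of squared returns at lags 1..lags
    sq = pd.Series(np.asarray(rets) ** 2)
    return pd.Series({k: sq.autocorr(lag=k) for k in range(1, lags + 1)})

gbm_acf = squared_return_acf(gbm_returns(5_000, seed=1))
garch_acf = squared_return_acf(garch_returns(5_000, seed=1))
print(gbm_acf.head(5).round(3))    # ~0 at every lag: no clustering
print(garch_acf.head(5).round(3))  # positive and slowly decaying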

Level 3: Regime-switching

Hamilton's regime-switching model (Hamilton 1989)2 posits that returns are drawn from one of K regimes, with a Markov chain governing transitions. The two-state case (bull / bear) is the retail default:

state s_t ∈ {bull, bear}, evolves per transition matrix P
r_t | s_t = bull  ~ N(μ_bull,  σ²_bull)
r_t | s_t = bear  ~ N(μ_bear,  σ²_bear)

Typical equity-index parameters fit on post-war US data:

Regime   μ (daily)   σ (daily)   P(stay)
Bull     +0.0005     0.008       0.99
Bear     −0.0010     0.022       0.96

The stay-probabilities produce regimes that last on the order of months to years, matching observed business-cycle structure.
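Concretely, the expected duration of a regime is 1 / (1 − P(stay)): about 100 trading days (roughly five months) for the bull state and 25 days for the bear state with the table values, with a geometric distribution around those means, so occasional regimes run far longer.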

Regime switching is the lightest-weight way to produce synthetic data with qualitatively different tape periods — useful for testing whether a strategy relies on one regime. A trend-following strategy that profits in bull regimes and loses in bear regimes will show this split cleanly on synthetic data with a known generator.

def regime_switching_returns(
    n_days: int,
    params=(
        {"mu": 0.0005, "sigma": 0.008},   # bull
        {"mu": -0.001, "sigma": 0.022},   # bear
    ),
    transition=np.array([[0.99, 0.01],    # bull -> [bull, bear]
                         [0.04, 0.96]]),  # bear -> [bull, bear]
    seed: int | None = None,
) -> pd.DataFrame:
    """Daily returns from a K-state Markov regime-switching model."""
    rng = np.random.default_rng(seed)
    K = len(params)
    assert transition.shape == (K, K)
    state = 0                              # start in the bull regime
    states = np.zeros(n_days, dtype=int)
    rets = np.zeros(n_days)
    for t in range(n_days):
        p = params[state]
        rets[t] = rng.normal(p["mu"], p["sigma"])
        states[t] = state
        state = rng.choice(K, p=transition[state])   # draw tomorrow's regime
    return pd.DataFrame({"return": rets, "regime": states})

Level 4: Gaussian copula for pair trading

Pair-trading, statistical arbitrage, and any multi-asset backtest need joint return series with controlled dependence. The simplest parametric joint distribution is the Gaussian copula: two series whose marginals can be anything, coupled by a correlation matrix applied in the Gaussian-quantile space.

Draw z = (z_1, …, z_d) ~ N(0, Σ)       # multivariate normal with correlation matrix Σ
Convert to uniforms:   u_i = Φ(z_i)     # componentwise standard-normal CDF
Apply the marginals:   r_i = F_i^{-1}(u_i)

F_i is any univariate distribution — Gaussian, Student-t, or empirical.

Warning (the important one). The Gaussian copula understates joint tail dependence. Two assets with pairwise correlation 0.7 under a Gaussian copula will show benign joint-tail behaviour (simultaneous extreme losses are rare in simulation). Real markets do not behave that way — correlations rise toward 1 in tail events, a phenomenon the Gaussian copula cannot represent. This is the mathematical error at the heart of the 2007-2008 CDO pricing failure; see Embrechts, McNeil, and Straumann (2002)3 for the formal critique.

For pair-trading scaffolding where tail co-movement is not the central concern, Gaussian copula is fine. For any test where the joint tail matters — basket risk, portfolio VaR, factor crowding — use a Student-t copula with low degrees of freedom (typically ν ≈ 4–8) or an explicit tail-dependence model.

from scipy.stats import norm

def gaussian_copula_pair(
    n_days: int,
    rho: float = 0.7,
    sigma_a: float = 0.012,
    sigma_b: float = 0.010,
    seed: int | None = None,
) -> pd.DataFrame:
    """Two correlated daily return series via a Gaussian copula."""
    rng = np.random.default_rng(seed)
    # correlate two standard normals with the Cholesky factor of the correlation matrix
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    z = rng.standard_normal((n_days, 2)) @ L.T
    # copula step: map to uniforms, then push through each marginal's inverse CDF.
    # With Gaussian marginals the round trip is just a rescaling; swap a fat-tailed
    # ppf in here for non-Gaussian marginals.
    u = norm.cdf(z)
    return pd.DataFrame({
        "ret_a": norm.ppf(u[:, 0]) * sigma_a,
        "ret_b": norm.ppf(u[:, 1]) * sigma_b,
    })
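For the tail-sensitive cases above, a minimal Student-t copula counterpart (a sketch; ν = 5 is an assumed default, and the Gaussian marginals are kept so the two generators differ only in their dependence structure):

from scipy.stats import t as student_t

def t_copula_pair(
    n_days: int,
    rho: float = 0.7,
    nu: float = 5.0,
    sigma_a: float = 0.012,
    sigma_b: float = 0.010,
    seed: int | None = None,
) -> pd.DataFrame:
    """Two correlated daily return series via a Student-t copula."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    z = rng.standard_normal((n_days, 2)) @ L.T
    # dividing both components by the same chi-square draw creates multivariate-t
    # dependence, which retains the joint tail dependence the Gaussian copula lacks
    w = rng.chisquare(nu, size=(n_days, 1)) / nu
    u = student_t.cdf(z / np.sqrt(w), df=nu)
    return pd.DataFrame({
        "ret_a": norm.ppf(u[:, 0]) * sigma_a,
        "ret_b": norm.ppf(u[:, 1]) * sigma_b,
    })

Counting days on which both series sit below their own 1st percentiles makes the difference visible: at the same ρ the t-copula generator produces noticeably more simultaneous tail losses than gaussian_copula_pair.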

Validation: does synthetic data match real data?

Whatever generator you pick, validate it against real returns before trusting its output. Four cheap checks, with a code sketch after the list:

  1. Sharpe ratio of generated returns matches the design target within noise.
  2. Autocorrelation of absolute returns reproduces real-data volatility clustering (GBM: zero; GARCH: decaying; real data: decaying slowly, 50+ lags).
  3. Tail ratio (95th-percentile gain / 5th-percentile loss) — real equity index data is near 1; option-income strategies are below 1; option-buying is above 1.
  4. QQ plot of empirical returns vs Gaussian quantiles — real equity daily returns deviate noticeably in the tails beyond ±2σ.

If the synthetic series fails any of these checks against the target, fix the generator before running the strategy on it.
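A sketch of checks 1–3 as throwaway helpers; the GARCH series here is only an example input, and acceptance thresholds are left to the reader. Check 4 is easiest with a quantile-quantile scatter (scipy.stats.probplot is one option).

def tail_ratio(rets: pd.Series) -> float:
    # 95th-percentile gain over the magnitude of the 5th-percentile loss
    return float(abs(rets.quantile(0.95)) / abs(rets.quantile(0.05)))

def abs_return_acf(rets: pd.Series, lag: int = 1) -> float:
    # autocorrelation of absolute returns at the given lag (clustering proxy)
    return float(rets.abs().autocorr(lag=lag))

sample = garch_returns(5_000, seed=3)
print("annualised Sharpe:", round(float(sample.mean() / sample.std() * np.sqrt(252)), 2))
print("tail ratio:", round(tail_ratio(sample), 2))
print("abs-return ACF, lag 1:", round(abs_return_acf(sample, 1), 2))
print("abs-return ACF, lag 20:", round(abs_return_acf(sample, 20), 2))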

The one rule that matters

Synthetic-first is safe. Synthetic-only is dangerous.

Use synthetic data to build confidence in the pipeline, the risk layer, and the overfitting diagnostics. Once the pipeline is trustworthy, move to real data for parameter selection and validation. Never commit real capital to a strategy whose performance has been demonstrated only on synthetic series: the generator is by construction a simplification, and the strategy may be exploiting exactly the structure the generator imposed.

References

  • Engle, R. F. (1982). "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation." Econometrica 50(4). The ARCH predecessor to GARCH.
  • Cont, R. (2001). "Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues." Quantitative Finance 1(2), pp. 223–236. The standard reference for which generator features matter.
  • McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative Risk Management (2nd ed.), Princeton University Press. Chapter 7 on copulas, including tail-dependence properties.
  • Black, F., & Scholes, M. (1973). "The Pricing of Options and Corporate Liabilities." Journal of Political Economy 81(3). GBM as the foundation of the continuous-time option-pricing framework.

Footnotes

  1. Bollerslev, T. (1986). "Generalized Autoregressive Conditional Heteroskedasticity." Journal of Econometrics 31(3), pp. 307–327.

  2. Hamilton, J. D. (1989). "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle." Econometrica 57(2), pp. 357–384.

  3. Embrechts, P., McNeil, A. J., & Straumann, D. (2002). "Correlation and Dependence in Risk Management: Properties and Pitfalls." In Risk Management: Value at Risk and Beyond (ed. Dempster), Cambridge University Press. The canonical critique of naive Gaussian dependence assumptions.