TL;DR

Synthetic data is not a substitute for real out-of-sample validation, but it is the cheapest way to scaffold a backtest pipeline. Use it to unit-test the plumbing, stress-test the risk estimator against fat tails and vol clustering that the historical sample did not happen to contain, and cross-check whether a strategy's apparent edge survives when the data-generating process is known. The progression from simplest to most realistic: geometric Brownian motion (log-normal returns; fine for wiring), GARCH(1,1) (adds vol clustering; Bollerslev 1986), regime-switching (two-state Markov for bull/bear regimes; Hamilton 1989), and a Gaussian copula for pairs (joint distribution for multi-asset tests, with a strong warning on understated tail dependence; Embrechts, McNeil & Straumann 2002). Each comes with a working Python template and its honest limits.

Why scaffold with synthetic data

Three uses where synthetic data is the right tool:

  1. Pipeline wiring. Before wasting real-data iterations, verify that the event loop, fill simulator, equity-curve assembly, and reporting code produce the correct numbers on a known input. Feed GBM paths through the pipeline, verify that the empirical Sharpe converges to the analytical value as sample size grows.
  2. Risk-estimator stress tests. Historical data contains only the regimes that happened to occur. If a strategy has run only through low-volatility tape, its risk model has never been tested against 2008 or 2020. Synthetic data lets you inject worse tails than history provided and verify the risk layer catches them.
  3. Overfitting diagnostics. If a strategy shows Sharpe 2 on a synthetic series known to have zero true edge, the research pipeline is fitting noise. Run the entire optimisation loop against pure-noise synthetic data before trusting the result on real data; a toy version of this null test is sketched just below.
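A minimal sketch of that null test, using a toy moving-average-crossover optimisation as the stand-in strategy (the crossover rule, the lookback grid, and the data size here are illustrative assumptions, not part of any real pipeline):

import numpy as np
import pandas as pd

def sma_crossover_sharpe(rets: pd.Series, fast: int, slow: int) -> float:
    # in-sample annualised Sharpe of a long-only SMA-crossover rule
    px = (1 + rets).cumprod()
    signal = (px.rolling(fast).mean() > px.rolling(slow).mean()).astype(float)
    strat = signal.shift(1).fillna(0.0) * rets
    return float(strat.mean() / strat.std() * np.sqrt(252))

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(0.0, 0.01, size=2520))   # 10 years of zero-edge returns

# grid-search the lookbacks; any "edge" found here is pure selection bias
grid = [(f, s) for f in (5, 10, 20, 50) for s in (60, 100, 150, 200)]
best = max(sma_crossover_sharpe(noise, f, s) for f, s in grid)
print(f"best in-sample Sharpe on zero-edge data: {best:.2f}")

If the research pipeline's own optimisation loop produces a similarly inflated number on this kind of input, the edge it reports on real data deserves the same suspicion.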

Three uses where synthetic data is the wrong tool:

  1. Parameter selection. Tuning a strategy against a generator tunes it to the generator's assumptions, not to the market.
  2. Final validation. A backtest on synthetic series says nothing about out-of-sample performance on real data.
  3. Justifying real capital. Never fund a strategy whose edge has been demonstrated only on synthetic data; see "The one rule that matters" below.

Level 1: Geometric Brownian motion

The standard log-normal model behind Black-Scholes:

dS_t = μ · S_t · dt + σ · S_t · dW_t

Discretised:

log(S_{t+1} / S_t) ~ N((μ − σ²/2) · dt, σ² · dt)

GBM generates returns that are independent, identically distributed, and Gaussian. Properties:

  • No volatility clustering. Yesterday's big move has no effect on today's expected magnitude.
  • No fat tails. Tail probabilities drop off as in a normal distribution.
  • Stationary mean and variance. Parameters do not shift.

What GBM is good for: wiring a pipeline, generating "null" data for overfitting diagnostics, benchmarking an analytical result against Monte Carlo.

What GBM is bad for: anything tail-risk related. A VaR estimator tuned on GBM will silently understate tail risk on real data by 2–3×.

import numpy as np
import pandas as pd

def gbm_returns(
    n_days: int,
    mu_annual: float = 0.08,
    sigma_annual: float = 0.16,
    dt: float = 1 / 252,
    seed: int | None = None,
) -> pd.Series:
    """Daily log returns under geometric Brownian motion (i.i.d. Gaussian)."""
    rng = np.random.default_rng(seed)
    drift = (mu_annual - 0.5 * sigma_annual ** 2) * dt
    shock = sigma_annual * np.sqrt(dt) * rng.standard_normal(n_days)
    log_rets = drift + shock
    return pd.Series(log_rets, name="log_return")
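A quick wiring check for use 1 above, assuming the gbm_returns defaults: for i.i.d. GBM log returns the annualised Sharpe (mean over standard deviation of log returns, no risk-free adjustment) is approximately (μ − σ²/2) / σ = (0.08 − 0.0128) / 0.16 ≈ 0.42, and the empirical estimate should converge to it as the sample grows.

def annualised_sharpe(log_rets: pd.Series, periods: int = 252) -> float:
    # mean / std of per-period log returns, annualised; no risk-free adjustment
    return float(log_rets.mean() / log_rets.std() * np.sqrt(periods))

analytical = (0.08 - 0.5 * 0.16 ** 2) / 0.16          # ≈ 0.42 with the defaults above
for n in (252, 2_520, 252_000):
    est = annualised_sharpe(gbm_returns(n, seed=42))
    print(f"n={n:>7}  empirical={est:5.2f}  analytical={analytical:5.2f}")

If the empirical number does not tighten around the analytical one as n grows, the bug is in the pipeline, not the market.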

Level 2: GARCH(1,1)

Bollerslev's Generalised Autoregressive Conditional Heteroskedasticity model (Bollerslev 1986)1 is the simplest formulation that captures the most prominent stylised fact of real returns: volatility clusters. High-vol days are followed by high-vol days; calm days are followed by calm days.

r_t     = σ_t · z_t,           z_t ~ N(0, 1)
σ²_t    = ω + α · r²_{t-1} + β · σ²_{t-1}

Parameters:

  • ω — long-run variance floor (positive, small).
  • α — reaction to yesterday's shock (typical 0.05–0.15 on equity indices).
  • β — persistence of volatility (typical 0.80–0.92).

The constraint α + β < 1 enforces mean-reverting volatility. Sums close to 1 (e.g. α + β = 0.99) produce near-integrated, long-memory volatility.
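As a worked example with the template defaults below (ω = 1e-6, α = 0.08, β = 0.90), the unconditional daily variance is ω / (1 − α − β) = 1e-6 / 0.02 = 5e-5, i.e. a daily volatility of about 0.7% and an annualised volatility of roughly 11%, reasonable for a calm equity index.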

GARCH captures clustering but still generates Gaussian conditional returns. Real equity returns have conditional tails fatter than Gaussian, which is why practitioners fit GARCH with Student-t innovations in production. For scaffolding purposes, Gaussian innovations are sufficient.

def garch_returns(
    n_days: int,
    omega: float = 1e-6,
    alpha: float = 0.08,
    beta: float = 0.90,
    seed: int | None = None,
) -> pd.Series:
    """Daily returns from a GARCH(1,1) process with Gaussian innovations."""
    assert alpha + beta < 1, "non-stationary: alpha + beta must be < 1"
    rng = np.random.default_rng(seed)
    long_run_var = omega / (1 - alpha - beta)
    sigma2 = np.zeros(n_days)
    rets = np.zeros(n_days)
    sigma2[0] = long_run_var          # start at the unconditional variance
    rets[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n_days):
        # variance recursion: reacts to yesterday's shock, decays toward the long-run level
        sigma2[t] = omega + alpha * rets[t - 1] ** 2 + beta * sigma2[t - 1]
        rets[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return pd.Series(rets, name="return")
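For scaffolding, the Gaussian version above is enough. If you do want the fatter conditional tails mentioned earlier, a minimal Student-t variant of the same recursion looks like this (the ν default and the unit-variance rescaling are assumptions, not fitted values):

def garch_t_returns(
    n_days: int,
    omega: float = 1e-6,
    alpha: float = 0.08,
    beta: float = 0.90,
    nu: float = 6.0,
    seed: int | None = None,
) -> pd.Series:
    """GARCH(1,1) daily returns with Student-t innovations."""
    assert alpha + beta < 1, "non-stationary: alpha + beta must be < 1"
    assert nu > 2, "need nu > 2 for finite innovation variance"
    rng = np.random.default_rng(seed)
    scale = np.sqrt((nu - 2) / nu)     # rescale t draws to unit variance
    sigma2 = np.zeros(n_days)
    rets = np.zeros(n_days)
    sigma2[0] = omega / (1 - alpha - beta)
    rets[0] = np.sqrt(sigma2[0]) * rng.standard_t(nu) * scale
    for t in range(1, n_days):
        sigma2[t] = omega + alpha * rets[t - 1] ** 2 + beta * sigma2[t - 1]
        rets[t] = np.sqrt(sigma2[t]) * rng.standard_t(nu) * scale
    return pd.Series(rets, name="return")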

Validate a fit by computing autocorrelation of squared returns — real equity data shows positive autocorrelation out to ~50 lags, and GARCH(1,1) reproduces that signature. If a strategy's backtest on real data looks nothing like its backtest on GARCH-generated data of similar mean/variance, the strategy is relying on sequential dependence (trend or mean reversion in returns themselves, not just in volatility).
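A sketch of that autocorrelation check, comparing the clustering signature of the two generators defined so far (the sample size and lag count are arbitrary choices):

def squared_return_acf(rets: pd.Series, lags: int = 50) -> pd.Series:
    # autocorrelation of squared returns at lags 1..lags
    sq = pd.Series(np.asarray(rets) ** 2)
    return pd.Series({k: sq.autocorr(lag=k) for k in range(1, lags + 1)})

gbm_acf = squared_return_acf(gbm_returns(5_000, seed=1))
garch_acf = squared_return_acf(garch_returns(5_000, seed=1))
print(gbm_acf.head(5).round(3))    # ~0 at every lag: no clustering
print(garch_acf.head(5).round(3))  # positive and slowly decaying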

Level 3: Regime-switching

Hamilton's regime-switching model (Hamilton 1989)2 posits that returns are drawn from one of K regimes, with a Markov chain governing transitions. The two-state case (bull / bear) is the retail default:

state s_t ∈ {bull, bear}, evolves per transition matrix P
r_t | s_t = bull  ~ N(μ_bull,  σ²_bull)
r_t | s_t = bear  ~ N(μ_bear,  σ²_bear)

Typical equity-index parameters fit on post-war US data:

Regime   μ (daily)   σ (daily)   P(stay)
Bull     +0.0005     0.008       0.99
Bear     −0.0010     0.022       0.96

The stay-probabilities produce regimes that last on the order of months to years, matching observed business-cycle structure.
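Concretely, the expected duration of a regime is 1 / (1 − P(stay)): about 100 trading days (roughly five months) for the bull state and 25 days for the bear state with the table values, with a geometric distribution around those means, so occasional regimes run far longer.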

Regime switching is the lightest-weight way to produce synthetic data with qualitatively different tape periods — useful for testing whether a strategy relies on one regime. A trend-following strategy that profits in bull regimes and loses in bear regimes will show this split cleanly on synthetic data with a known generator.

def regime_switching_returns(
    n_days: int,
    params=(
        {"mu": 0.0005, "sigma": 0.008},   # bull
        {"mu": -0.001, "sigma": 0.022},   # bear
    ),
    transition=np.array([[0.99, 0.01],    # bull -> [bull, bear]
                         [0.04, 0.96]]),  # bear -> [bull, bear]
    seed: int | None = None,
) -> pd.DataFrame:
    """Daily returns from a K-state Markov regime-switching model."""
    rng = np.random.default_rng(seed)
    K = len(params)
    assert transition.shape == (K, K)
    state = 0                              # start in the bull regime
    states = np.zeros(n_days, dtype=int)
    rets = np.zeros(n_days)
    for t in range(n_days):
        p = params[state]
        rets[t] = rng.normal(p["mu"], p["sigma"])
        states[t] = state
        state = rng.choice(K, p=transition[state])   # draw tomorrow's regime
    return pd.DataFrame({"return": rets, "regime": states})

Level 4: Gaussian copula for pair trading

Pair-trading, statistical arbitrage, and any multi-asset backtest need joint return series with controlled dependence. The simplest parametric joint distribution is the Gaussian copula: two series whose marginals can be anything, coupled by a correlation matrix applied in the Gaussian-quantile space.

Draw z = (z_1, …, z_d) ~ N(0, Σ)       # multivariate normal with correlation matrix Σ
Convert to uniforms:   u_i = Φ(z_i)     # componentwise standard-normal CDF
Apply the marginals:   r_i = F_i^{-1}(u_i)

F_i is any univariate distribution — Gaussian, Student-t, or empirical.

Warning (the important one). The Gaussian copula understates joint tail dependence. Two assets with pairwise correlation 0.7 under a Gaussian copula will show benign joint-tail behaviour (simultaneous extreme losses are rare in simulation). Real markets do not behave that way — correlations rise toward 1 in tail events, a phenomenon the Gaussian copula cannot represent. This is the mathematical error at the heart of the 2007-2008 CDO pricing failure; see Embrechts, McNeil, and Straumann (2002)3 for the formal critique.

For pair-trading scaffolding where tail co-movement is not the central concern, Gaussian copula is fine. For any test where the joint tail matters — basket risk, portfolio VaR, factor crowding — use a Student-t copula with low degrees of freedom (typically ν ≈ 4–8) or an explicit tail-dependence model.

from scipy.stats import norm

def gaussian_copula_pair(
    n_days: int,
    rho: float = 0.7,
    sigma_a: float = 0.012,
    sigma_b: float = 0.010,
    seed: int | None = None,
) -> pd.DataFrame:
    """Two correlated daily return series via a Gaussian copula."""
    rng = np.random.default_rng(seed)
    # correlate two standard normals with the Cholesky factor of the correlation matrix
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    z = rng.standard_normal((n_days, 2)) @ L.T
    # copula step: map to uniforms, then push through each marginal's inverse CDF.
    # With Gaussian marginals the round trip is just a rescaling; swap a fat-tailed
    # ppf in here for non-Gaussian marginals.
    u = norm.cdf(z)
    return pd.DataFrame({
        "ret_a": norm.ppf(u[:, 0]) * sigma_a,
        "ret_b": norm.ppf(u[:, 1]) * sigma_b,
    })
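For the tail-sensitive cases above, a minimal Student-t copula counterpart (a sketch; ν = 5 is an assumed default, and the Gaussian marginals are kept so the two generators differ only in their dependence structure):

from scipy.stats import t as student_t

def t_copula_pair(
    n_days: int,
    rho: float = 0.7,
    nu: float = 5.0,
    sigma_a: float = 0.012,
    sigma_b: float = 0.010,
    seed: int | None = None,
) -> pd.DataFrame:
    """Two correlated daily return series via a Student-t copula."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    z = rng.standard_normal((n_days, 2)) @ L.T
    # dividing both components by the same chi-square draw creates multivariate-t
    # dependence, which retains the joint tail dependence the Gaussian copula lacks
    w = rng.chisquare(nu, size=(n_days, 1)) / nu
    u = student_t.cdf(z / np.sqrt(w), df=nu)
    return pd.DataFrame({
        "ret_a": norm.ppf(u[:, 0]) * sigma_a,
        "ret_b": norm.ppf(u[:, 1]) * sigma_b,
    })

Counting days on which both series sit below their own 1st percentiles makes the difference visible: at the same ρ the t-copula generator produces noticeably more simultaneous tail losses than gaussian_copula_pair.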

Validation: does synthetic data match real data?

Whatever generator you pick, validate it against real returns before trusting its output. Four cheap checks, with a code sketch after the list:

  1. Sharpe ratio of generated returns matches the design target within noise.
  2. Autocorrelation of absolute returns reproduces real-data volatility clustering (GBM: zero; GARCH: decaying; real data: decaying slowly, 50+ lags).
  3. Tail ratio (95th-percentile gain / 5th-percentile loss) — real equity index data is near 1; option-income strategies are below 1; option-buying is above 1.
  4. QQ plot of empirical returns vs Gaussian quantiles — real equity daily returns deviate noticeably in the tails beyond ±2σ.

If the synthetic series fails any of these checks against the target, fix the generator before running the strategy on it.
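A sketch of checks 1–3 as throwaway helpers; the GARCH series here is only an example input, and acceptance thresholds are left to the reader. Check 4 is easiest with a quantile-quantile scatter (scipy.stats.probplot is one option).

def tail_ratio(rets: pd.Series) -> float:
    # 95th-percentile gain over the magnitude of the 5th-percentile loss
    return float(abs(rets.quantile(0.95)) / abs(rets.quantile(0.05)))

def abs_return_acf(rets: pd.Series, lag: int = 1) -> float:
    # autocorrelation of absolute returns at the given lag (clustering proxy)
    return float(rets.abs().autocorr(lag=lag))

sample = garch_returns(5_000, seed=3)
print("annualised Sharpe:", round(float(sample.mean() / sample.std() * np.sqrt(252)), 2))
print("tail ratio:", round(tail_ratio(sample), 2))
print("abs-return ACF, lag 1:", round(abs_return_acf(sample, 1), 2))
print("abs-return ACF, lag 20:", round(abs_return_acf(sample, 20), 2))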

The one rule that matters

Synthetic-first is safe. Synthetic-only is dangerous.

Use synthetic data to build confidence in the pipeline, the risk layer, and the overfitting diagnostics. Once the pipeline is trustworthy, move to real data for parameter selection and validation. Never commit real capital to a strategy whose performance has been demonstrated only on synthetic series: the generator is by construction a simplification, and the strategy may be exploiting exactly the structure the generator imposed.

References

  • Engle, R. F. (1982). "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation." Econometrica 50(4). The ARCH predecessor to GARCH.
  • Cont, R. (2001). "Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues." Quantitative Finance 1(2), pp. 223–236. The standard reference for which generator features matter.
  • McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative Risk Management (2nd ed.), Princeton University Press. Chapter 7 on copulas, including tail-dependence properties.
  • Black, F., & Scholes, M. (1973). "The Pricing of Options and Corporate Liabilities." Journal of Political Economy 81(3). GBM as the foundation of the continuous-time option-pricing framework.

Footnotes

  1. Bollerslev, T. (1986). "Generalized Autoregressive Conditional Heteroskedasticity." Journal of Econometrics 31(3), pp. 307–327.

  2. Hamilton, J. D. (1989). "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle." Econometrica 57(2), pp. 357–384.

  3. Embrechts, P., McNeil, A. J., & Straumann, D. (2002). "Correlation and Dependence in Risk Management: Properties and Pitfalls." In Risk Management: Value at Risk and Beyond (ed. Dempster), Cambridge University Press. The canonical critique of naive Gaussian dependence assumptions.