Methodology · Generator · Last updated 2026-04-22
How the Synthetic Market Data Generator works
What the Synthetic Market Data Generator actually simulates, where each model comes from, and when synthetic data is dangerous.
1. Geometric Brownian Motion (GBM)
GBM is the classical continuous-time model underlying the Black–Scholes framework.
Prices are assumed log-normal with constant drift μ and constant volatility σ.
In discrete form, one-step price updates use:
S_{t+1} = S_t · exp((μ − 0.5·σ²)·Δt + σ·√Δt · Z), Z ~ N(0, 1)
where Δt = 1 / (trading days per year). Log-returns are i.i.d. Normal with
mean (μ − 0.5σ²)·Δt and variance σ²·Δt.
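A minimal sketch of this one-step update in code, assuming 252 steps per year (`gbmStep` and `gbmPath` are illustrative names, not the tool's actual API):

```javascript
// One GBM step: S_{t+1} = S_t * exp((mu - 0.5*sigma^2)*dt + sigma*sqrt(dt)*z)
function gbmStep(s, mu, sigma, dt, z) {
  return s * Math.exp((mu - 0.5 * sigma * sigma) * dt + sigma * Math.sqrt(dt) * z);
}

// Full path from a supplied array of independent N(0,1) draws.
function gbmPath(s0, mu, sigma, dt, zs) {
  const path = [s0];
  for (const z of zs) path.push(gbmStep(path[path.length - 1], mu, sigma, dt, z));
  return path;
}

const dt = 1 / 252; // one trading day
const path = gbmPath(100, 0.07, 0.2, dt, [0.5, -1.2, 0.0]);
```

With z = 0 the one-day log-return collapses to the drift term (μ − 0.5σ²)·Δt, which makes a convenient unit-test case.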
Assumptions:
- Log-normal returns (no fat tails).
- Constant volatility (no clustering).
- No jumps, no regime shifts, no autocorrelation.
- Continuous trading — no microstructure effects.
Limitations: GBM underestimates kurtosis by 10× or more in equity indices, grossly understates drawdown probabilities during crises, and cannot produce the volatility-clustering stylized fact universally present in real returns.
2. GARCH(1,1) — Bollerslev 1986
GARCH(1,1) is the workhorse model for volatility clustering. Daily log-return r_t = σ_t · Z_t
with conditional variance evolving as:
σ²_t = ω + α · r²_{t−1} + β · σ²_{t−1}
Stationarity requires α + β < 1. The unconditional variance is
ω / (1 − α − β). Higher β → longer vol memory; higher
α → faster vol response to shocks.
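The recursion can be sketched as follows (an illustrative helper seeded at the unconditional variance, not the tool's internal code):

```javascript
// GARCH(1,1): sigma2_t = omega + alpha * r^2_{t-1} + beta * sigma2_{t-1}
// Returns daily log-returns r_t = sqrt(sigma2_t) * z_t for supplied N(0,1) draws.
function garchPath(omega, alpha, beta, zs) {
  let sigma2 = omega / (1 - alpha - beta); // start at the unconditional variance
  const returns = [];
  for (const z of zs) {
    const r = Math.sqrt(sigma2) * z;
    returns.push(r);
    sigma2 = omega + alpha * r * r + beta * sigma2; // variance for the next day
  }
  return returns;
}
```

Note the check against the text: with ω = 2·10⁻⁵, α = 0.1, β = 0.85, the unconditional variance is 2·10⁻⁵ / 0.05 = 4·10⁻⁴, i.e. a long-run daily σ of 2%.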
Reasonable parameter heuristics for equity-like data:
- ω ≈ 2·10⁻⁵ — implies long-run daily σ around 2% when α + β ≈ 0.95.
- α ≈ 0.08 – 0.12 — news-response speed.
- β ≈ 0.82 – 0.90 — vol persistence.
- α + β ≈ 0.94 – 0.97 for equities; closer to 1 = slower mean-reversion of vol.
Limitations: Innovations are still Gaussian by default (tails are slightly fatter than GBM thanks to clustering, but still underestimate crash probabilities versus a Student-t variant). Asymmetric leverage effects (vol rising more on down-moves) need GJR-GARCH / EGARCH, not GARCH(1,1).
3. Regime switching — Hamilton 1989
Two-state Markov switching assumes the process flips between a bull regime (positive drift,
low vol) and a bear regime (negative drift, high vol), with constant per-period transition
probabilities P(bull→bear) and P(bear→bull). Each day, we draw a
Bernoulli to decide whether to switch, then sample the day's log-return from a GBM-style
normal using the current regime's (μ, σ).
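A sketch of the per-day loop, with the uniform and normal sources injected as callbacks so the example stays deterministic (names and parameter shapes are illustrative):

```javascript
// Two-state Markov switching: each day, first decide whether to flip regime,
// then draw the day's log-return from the current regime's (mu, sigma).
// `rand` supplies Uniform[0,1) draws, `normal` supplies N(0,1) draws.
function regimePath(params, nDays, rand, normal) {
  const { bull, bear, pBullToBear, pBearToBull } = params;
  const dt = 1 / 252;
  let inBull = true;
  const logReturns = [];
  for (let t = 0; t < nDays; t++) {
    const pSwitch = inBull ? pBullToBear : pBearToBull;
    if (rand() < pSwitch) inBull = !inBull; // Bernoulli regime flip
    const { mu, sigma } = inBull ? bull : bear;
    logReturns.push((mu - 0.5 * sigma * sigma) * dt + sigma * Math.sqrt(dt) * normal());
  }
  return logReturns;
}
```

Pinning `rand` above both switch probabilities keeps the path in the bull regime, which makes the drift term easy to assert in tests.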
Hamilton (1989) originally fit this to GNP data; it has since been applied to equities,
FX, and credit spreads. The stationary distribution over regimes depends only on the
transition rates: π_bull = P(bear→bull) / (P(bull→bear) + P(bear→bull)).
Limitations:
- Two regimes is a strong simplification — real data often has 3+ regimes (bull, bear, choppy, crisis).
- Transition probabilities are constant; in reality they depend on macro state, vol, policy.
- Regimes are unobserved — calibrating them from data requires MLE / Hamilton filter, which this tool doesn't do.
4. Gaussian copula — correlated pairs
The tool simulates two GBM processes sharing a Gaussian copula with correlation ρ.
Cholesky factorization of the 2×2 correlation matrix gives:
L = [[1, 0], [ρ, √(1 − ρ²)]]
Then for each step we draw ε₁, ε₂ ~ N(0, 1) independent, compute
z₁ = ε₁ and z₂ = ρ·ε₁ + √(1−ρ²)·ε₂, and feed z₁, z₂
into the GBM increments for A and B respectively.
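In code, the correlation step is just the Cholesky multiply (an illustrative helper):

```javascript
// Map two independent N(0,1) draws (eps1, eps2) to correlated draws (z1, z2)
// using the 2x2 Cholesky factor L = [[1, 0], [rho, sqrt(1 - rho^2)]].
function correlate(rho, eps1, eps2) {
  return [eps1, rho * eps1 + Math.sqrt(1 - rho * rho) * eps2];
}
```

z₂ keeps unit variance because ρ² + (1 − ρ²) = 1, while Cov(z₁, z₂) = ρ.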
The Gaussian copula is the default in most risk systems because it's trivially fittable — you only need a correlation matrix. But it has a well-known failure mode:
Tail dependence is zero. A Gaussian copula implies that in the limit of extreme moves, assets become uncorrelated. Real asset pairs show the opposite — correlations spike to near 1 during crashes. This was the core of the 2008 GFC critique; Embrechts, McNeil, and Straumann (2002) had warned years earlier that the Gaussian copula was an exceptionally poor choice for joint-tail modeling. Credit-derivative books priced with Gaussian copulas during 2004–2007 systematically underestimated default-correlation risk. For genuine joint-tail modeling, use a t-copula or Clayton copula.
When synthetic data is reasonable
- Backtest-infrastructure scaffolding. Wire up your data-ingestion, feature-engineering, sizing, and PnL-accounting pipeline against synthetic bars before you touch real data. Swap in real data once plumbing works.
- Risk-estimator unit tests. Feed GARCH paths with known σ into your vol-targeting code and verify it clips positions correctly.
- Drawdown-distribution sanity checks. Run 1,000 seeds at your expected drift/vol, plot the distribution of Max DD — that tells you whether your stop-loss is set against realistic variation or against a single lucky path.
- Load testing. Generate 10 years × 500 tickers of GBM bars to stress-test your storage layer without buying vendor data.
- Reproducible bug reports. Seed + parameters → identical series on any machine. Attach the seed to your GitHub issue.
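The drawdown-distribution check above reduces to a small helper (`maxDrawdown` is illustrative, not part of the tool):

```javascript
// Largest peak-to-trough decline of a price path, as a fraction of the peak.
function maxDrawdown(prices) {
  let peak = prices[0];
  let worst = 0;
  for (const p of prices) {
    if (p > peak) peak = p; // new running high
    worst = Math.max(worst, (peak - p) / peak);
  }
  return worst;
}
```

Collecting `maxDrawdown` over many seeded paths gives the distribution the bullet describes, rather than a single lucky path.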
When synthetic data is dangerous
- Final strategy validation. A strategy that works on synthetic data tells you nothing about whether it will work on real data. Real markets have microstructure, overnight gaps, earnings events, circuit breakers, fat tails, feedback loops from other algos — none of which these models contain.
- Calibrating risk budgets for live capital. Drawdowns in real markets are routinely 2–3× worse than what GBM or even GARCH produces, because real crashes involve liquidity collapses that no parametric model captures.
- Correlation-driven strategies. Pair trades and basket strategies calibrated on Gaussian-copula data will blow up precisely when correlations spike, because the model by construction forbids that.
- Anything involving options or tails. Option prices under GBM are Black-Scholes — which systematically under-prices out-of-the-money puts. If you're sizing against a tail estimate from synthetic data, you are not measuring the tail.
Implementation notes
- RNG. Uniform[0,1) samples come from mulberry32 when a seed is provided, or Math.random() otherwise. Normal variates via Box–Muller.
- Reproducibility. Any integer seed + parameter set fully determines the output. "Generate new path" increments an internal run-id that salts the seed, so you can sample multiple paths under the same nominal seed.
- Date axis. Exports use consecutive calendar days starting 2024-01-02. This is a display artifact — the simulation steps are "days" in the GBM Δt sense (252 per year by default).
- Everything runs client-side. No API calls, no uploads. Inputs and outputs never leave the browser.
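The RNG note above corresponds to the widely circulated mulberry32 implementation plus a standard Box–Muller transform; the tool's internal version may differ in detail, so treat this as a sketch:

```javascript
// mulberry32: a small 32-bit seeded PRNG returning Uniform[0,1) values.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Box-Muller: turn two Uniform[0,1) draws into one N(0,1) draw.
function boxMuller(rand) {
  const u1 = Math.max(rand(), Number.EPSILON); // guard against log(0)
  const u2 = rand();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}
```

The same integer seed always reproduces the same uniform sequence, which is what makes the seed-in-a-bug-report workflow possible.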
References
- Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3), 307–327.
- Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2), 357–384.
- Embrechts, P., McNeil, A. J., & Straumann, D. (2002). Correlation and dependence in risk management: properties and pitfalls. In: Risk Management: Value at Risk and Beyond, Cambridge University Press, 176–223.
- Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.
Changelog
- 2026-04-22 — Initial release with GBM, GARCH(1,1), 2-state regime switching, and Gaussian-copula pairs. Mulberry32 seeded PRNG. CSV + JSON export.