How to Backtest a Value-at-Risk Model
A value-at-risk number is a prediction: losses should exceed it only as often as the confidence level allows, and the breaches should be scattered, not bunched. Backtesting checks both. A VaR model that passes the frequency test but fails independence understates tail risk in stressed periods, which is the worst time to be wrong. This guide covers the two standard tests, what each one catches, and why the independence check matters more than the frequency count.
Guide Steps
Move through it in order
Each step focuses on one decision so you can keep momentum without losing the thread.
1. Define the breach indicator
For each period, mark a breach when the realized loss exceeds the VaR forecast for that period. This produces a sequence of zeros and ones, the breach indicator, which is the raw material for every VaR backtest. The VaR confidence level sets the expected breach rate: a 99 percent one-day VaR should be breached on about one percent of days. Everything downstream tests whether this sequence behaves the way the model claims.
Use out-of-sample VaR forecasts. Backtesting a VaR model on the same data it was fit on overstates how well it works, just like any other backtest.
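As a minimal sketch of the breach indicator, the snippet below assumes signed P&L (losses negative) and VaR quoted as a positive loss amount, with each forecast made before its period began. The function name `breach_indicator` and the sample numbers are illustrative, not taken from any particular library or data set.

```python
def breach_indicator(realized_pnl, var_forecasts):
    """Return 0/1 flags: 1 when the period's loss exceeds that period's VaR.

    Assumptions: realized_pnl is signed P&L (losses are negative) and
    var_forecasts holds VaR as a positive loss amount, forecast out of sample.
    """
    if len(realized_pnl) != len(var_forecasts):
        raise ValueError("P&L and VaR series must have the same length")
    return [1 if -pnl > var else 0 for pnl, var in zip(realized_pnl, var_forecasts)]


# Illustrative numbers only: five days of P&L against a 99 percent one-day VaR.
pnl = [0.4, -1.2, -2.9, 0.1, -0.3]
var_99 = [2.5, 2.5, 2.6, 2.8, 2.7]

breaches = breach_indicator(pnl, var_99)
breach_rate = sum(breaches) / len(breaches)
```

Here only the third day breaches, because the 2.9 loss exceeds the 2.6 forecast. A real backtest needs far more than five observations before the breach rate means anything.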
2. Run the Kupiec frequency test
The Kupiec proportion-of-failures test checks whether the observed breach rate matches the expected rate implied by the confidence level. Too many breaches means the model understates risk; too few means it overstates risk and ties up capital needlessly. The test produces a likelihood-ratio statistic and a p-value for the null hypothesis that the breach probability equals the expected rate; under that null the statistic is asymptotically chi-squared with one degree of freedom. It is the first gate: a model that fails frequency is mis-scaled before you even look at timing.
Too few breaches is a real failure too, not a free pass. An overly conservative VaR wastes capital and signals the model is not capturing the actual distribution.
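The frequency test can be sketched in a few lines using only the standard library. The function name `kupiec_pof` is hypothetical. The p-value uses the closed-form survival function of a chi-squared distribution with one degree of freedom, which is an asymptotic approximation and can be rough when the sample contains only a handful of breaches.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(breaches, expected_rate):
    """Kupiec proportion-of-failures likelihood-ratio test.

    breaches: sequence of 0/1 flags; expected_rate: 1 - confidence level.
    Returns (lr_statistic, p_value) against a chi-squared(1) reference.
    """
    n = len(breaches)
    x = sum(breaches)
    observed = x / n
    # Binomial log-likelihood at the expected rate and at the observed rate.
    ll_null = _xlogy(n - x, 1 - expected_rate) + _xlogy(x, expected_rate)
    ll_alt = _xlogy(n - x, 1 - observed) + _xlogy(x, observed)
    lr = max(0.0, -2.0 * (ll_null - ll_alt))
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Illustrative case: 8 breaches in 250 days against a 99 percent VaR.
lr_stat, p_val = kupiec_pof([1] * 8 + [0] * 242, expected_rate=0.01)
```

In this example the statistic comes out near 7.7, above the 5 percent critical value of about 3.84, so the model would be flagged as understating risk. The same function rejects in the other direction when breaches are far too rare.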
3. Run the Christoffersen independence test
Passing the frequency test is not enough, because breaches can occur at the right rate but bunch together in stressed periods. The Christoffersen test checks whether a breach today is independent of a breach yesterday. In its standard form it looks only one period back, so it catches breaches on consecutive days but can miss clustering at longer lags. If breaches cluster, the model has the right average but the wrong dynamics: it underestimates risk precisely when volatility spikes. Independence is the property that tells you the model holds up in the conditions you most need it to.
Clustered breaches mean your VaR is calm until it is suddenly very wrong, all at once. That is the failure mode that causes blowups.
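One way to sketch the independence test is to count the four day-to-day transitions in the breach sequence and compare a two-state Markov model against a model with a single breach probability. The function name and the two sample sequences below are illustrative assumptions. Both sequences contain five breaches in 250 days, so any difference in the result comes from timing alone.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def christoffersen_independence(breaches):
    """First-order independence test on a 0/1 breach sequence.

    Returns (lr_statistic, p_value) against a chi-squared(1) reference.
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, curr in zip(breaches[:-1], breaches[1:]):
        counts[(prev, curr)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]

    # Breach probability after a quiet day, after a breach day, and overall.
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)

    ll_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    lr = max(0.0, -2.0 * (ll_null - ll_alt))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-squared(1) survival function
    return lr, p_value


# Same breach count, different timing (illustrative sequences).
clustered = [1 if 100 <= i < 105 else 0 for i in range(250)]
scattered = [1 if i in (20, 70, 120, 170, 220) else 0 for i in range(250)]

lr_clustered, p_clustered = christoffersen_independence(clustered)
lr_scattered, p_scattered = christoffersen_independence(scattered)
```

The clustered sequence is rejected decisively while the scattered one is not, even though a frequency test would treat the two identically.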
4. Combine frequency and independence
The Christoffersen conditional-coverage test combines both properties into one: correct breach frequency and independent breaches. Its statistic is the sum of the frequency and independence likelihood ratios and is asymptotically chi-squared with two degrees of freedom. A model must pass both to be trustworthy. Passing only frequency means the average is right but the timing is dangerous; passing only independence is meaningless if the rate is wrong. Read the combined test as the headline result, then use the individual tests to diagnose which property failed when it does.
When the combined test fails, look at the components to localize the problem. Frequency failure is a scaling issue; independence failure is a dynamics issue, and they need different fixes.
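A sketch of the combined test, under the assumption that it is computed directly from the transition counts: the null model uses the expected breach rate for every day, and the alternative is the two-state Markov model. The function name `conditional_coverage` and the sample sequences are hypothetical. The p-value uses the closed-form survival function of a chi-squared distribution with two degrees of freedom.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) is 0."""
    return 0.0 if x == 0 else x * math.log(y)


def conditional_coverage(breaches, expected_rate):
    """Joint test of breach frequency and first-order independence.

    Returns (lr_statistic, p_value) against a chi-squared(2) reference.
    """
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, curr in zip(breaches[:-1], breaches[1:]):
        counts[(prev, curr)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]

    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0

    # Null: every day breaches with the expected rate, regardless of yesterday.
    ll_null = _xlogy(n00 + n10, 1 - expected_rate) + _xlogy(n01 + n11, expected_rate)
    # Alternative: breach probability depends on whether yesterday breached.
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    lr = max(0.0, -2.0 * (ll_null - ll_alt))
    p_value = math.exp(-lr / 2.0)  # chi-squared(2) survival function
    return lr, p_value


# Illustrative sequences: five breaches in 250 days against a 99 percent VaR.
clustered = [1 if 100 <= i < 105 else 0 for i in range(250)]
scattered = [1 if i in (20, 70, 120, 170, 220) else 0 for i in range(250)]

lr_cc_clustered, p_cc_clustered = conditional_coverage(clustered, 0.01)
lr_cc_scattered, p_cc_scattered = conditional_coverage(scattered, 0.01)
```

With these inputs the clustered sequence fails the joint test and the scattered one passes. When the joint test fails, running the frequency and independence tests separately shows which property is responsible.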
5. Act on the diagnosis
A frequency failure usually means rescaling the VaR or revisiting the distributional assumption behind it. An independence failure usually means the model is not capturing volatility clustering, which calls for a model that updates its risk estimate as volatility changes rather than a static one. Either way, the backtest does not just pass or fail; it points at what to fix. Re-run the tests after the fix on fresh data to confirm the correction held.
Independence failures point toward a volatility-aware model. A static VaR that ignores changing volatility will keep clustering breaches no matter how you rescale it.
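To make "volatility-aware" concrete, here is one possible sketch: an exponentially weighted moving average (EWMA) of squared returns feeding a normal-quantile VaR. This is a stand-in chosen for brevity, not a method the guide prescribes. The function name, the `warmup` seeding, the normal distribution, and the 0.94 decay (the value commonly associated with RiskMetrics for daily data) are all assumptions.

```python
import math
from statistics import NormalDist


def ewma_var_forecasts(returns, confidence=0.99, decay=0.94, warmup=20):
    """One-step-ahead VaR forecasts from an EWMA variance estimate.

    The variance is seeded with the mean squared return over the first
    `warmup` observations. forecasts[i] is the VaR for returns[warmup + i]
    and uses only returns observed before that period.
    """
    if len(returns) <= warmup:
        raise ValueError("need more returns than the warmup length")
    z = NormalDist().inv_cdf(confidence)  # normal quantile, an assumption
    variance = sum(r * r for r in returns[:warmup]) / warmup
    forecasts = []
    for r in returns[warmup:]:
        # Forecast first, then update with the return that was just realized.
        forecasts.append(z * math.sqrt(variance))
        variance = decay * variance + (1 - decay) * r * r
    return forecasts


# Illustrative returns: a calm stretch, one large loss, then calm again.
returns = [0.01, -0.01] * 15 + [-0.05] + [0.01] * 3
forecasts = ewma_var_forecasts(returns)
```

The forecast jumps on the day after the large loss and then decays, which is the behavior a static VaR lacks. Whether it actually removes the clustering is an empirical question, so the resulting forecasts should go back through the same tests on fresh data.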
Common Mistakes
The misses that undo good inputs
Testing only breach frequency
A model can breach at exactly the right rate while bunching all the breaches into stressed periods. Frequency alone misses this, and clustered breaches are the failure mode that causes losses to arrive all at once.
Treating too few breaches as success
An overly conservative VaR that rarely breaches is not capturing the real distribution; it ties up capital and signals the model is mis-specified, which the frequency test correctly flags as a failure.
Backtesting VaR on in-sample data
Evaluating a VaR model on the data it was calibrated to overstates its accuracy. The breach behavior must be tested on out-of-sample forecasts to reflect how the model performs on unseen markets.
Sources & References
- Techniques for Verifying the Accuracy of Risk Measurement Models — Paul H. Kupiec, Federal Reserve (1995)
- Evaluating Interval Forecasts — Peter F. Christoffersen, International Economic Review (1998)