A 30-day VaR backtest at 99% confidence with 7 observed exceptions, against an expected exception count of 0.3, returns a Kupiec LR of 32.34 and a joint LR (Kupiec + Christoffersen) of 32.34 with a joint p-value of 9.5e-8. The VaR Backtest (Kupiec & Christoffersen) tool rejects the VaR model decisively. The bootstrap alternative, which builds an empirical null distribution of the LR by simulation, agrees on the rejection but is slower to compute. For published VaR validation under MAR/Basel III framing, both tests have value; Kupiec is the formal default.
TL;DR
- 30-day, 99%-confidence VaR backtest. Expected exceptions: 0.3. Observed: 7.
- Kupiec LR: 32.34. Kupiec p-value: 1.3e-8. Model is rejected at any standard α.
- Christoffersen LR: 0. No evidence of serial dependence in exceptions (good).
- Joint LR: 32.34. Joint p-value: 9.5e-8. Combined rejection.
- Bootstrap variant agrees but needs 1,000+ resamples to converge.
- For Basel III / EBA validation, Kupiec is the named default; the bootstrap is supplementary.
The scenario
A daily VaR model produces 99%-confidence loss estimates for 30 trading days. The realised P&L violates the VaR threshold on 7 days, against an expected 0.3 violations under a correctly-calibrated 99% VaR. The VaR Backtest returns:
| Metric | Value |
|---|---|
| Observations | 30 |
| Exceptions observed | 7 |
| Observed exception rate | 23.3% |
| Expected exception rate | 1.0% |
| Kupiec LR statistic | 32.34 |
| Kupiec p-value | 1.30e-8 |
| Christoffersen LR (independence) | 0.00 |
| Christoffersen p-value | 1.00 |
| Joint LR (Kupiec + Christoffersen) | 32.34 |
| Joint p-value | 9.50e-8 |
The Kupiec test rejects the VaR model at any standard significance level. The Christoffersen independence test does not reject independence, so there is no evidence that the violations cluster in time; they are simply too frequent in aggregate.
What Kupiec actually tests
The Kupiec proportion-of-failures test computes the likelihood ratio [1]:
$$
\mathrm{LR}_{\text{Kupiec}} = -2 \ln\!\left( \frac{p_{\text{expected}}^{\,k}\,(1 - p_{\text{expected}})^{\,n-k}}{p_{\text{observed}}^{\,k}\,(1 - p_{\text{observed}})^{\,n-k}} \right)
$$
where $k$ is the observed exception count, $n$ is the number of observations, $p_{\text{expected}}$ is the modelled exception rate, and $p_{\text{observed}} = k/n$. The statistic is approximately $\chi^2(1)$ under the null that the VaR model is correctly calibrated.
For our scenario, 7 exceptions out of 30 is far above the 0.3 expected. The LR is 32.34, well past the $\chi^2(1)$ critical value of 6.63 at $\alpha = 0.01$.
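The arithmetic can be reproduced in a few lines. The following is a minimal sketch that assumes only the likelihood-ratio formula above; the function names `kupiec_lr` and `chi2_sf_df1` are illustrative and are not the tool's API. The $\chi^2(1)$ tail probability uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, so only the standard library is needed.

```python
import math

def kupiec_lr(n, k, p_expected):
    """Kupiec proportion-of-failures LR for k exceptions in n observations."""
    def log_lik(p):
        # Binomial log-likelihood; the combinatorial term cancels in the ratio.
        # Zero-count terms are skipped so that 0 * log(0) is treated as 0.
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - p)
        return ll

    p_observed = k / n
    return -2.0 * (log_lik(p_expected) - log_lik(p_observed))

def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-square(1) variable."""
    return math.erfc(math.sqrt(x / 2.0))

lr = kupiec_lr(n=30, k=7, p_expected=0.01)
p_value = chi2_sf_df1(lr)
print(f"LR = {lr:.2f}, p-value = {p_value:.2e}")
# LR = 32.34, p-value = 1.30e-08
```

Both figures agree with the Kupiec rows of the scenario table.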
What Christoffersen adds
Christoffersen's independence test asks whether exceptions cluster [2]. It estimates the transition probabilities between non-exception and exception days and tests the null that an exception is equally likely whatever happened the day before. Our scenario has Christoffersen LR = 0, so there is no evidence of clustering: the tool finds the 7 exceptions spread through the 30-day window rather than concentrated in one storm.
The joint Kupiec + Christoffersen test adds the two statistics into a single statistic that is $\chi^2(2)$ under the null. For our scenario, joint LR = 32.34 (because Christoffersen contributed 0), and the joint p-value of 9.5e-8 still rejects.
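A sketch of the independence statistic, assuming the usual first-order Markov formulation of Christoffersen's test. The function names are illustrative, and the `clustered_hits` sequence is hypothetical: it is not the scenario's hit sequence, and it deliberately places all 7 exceptions on consecutive days to show what a clearly non-zero LR looks like. The last lines recompute the scenario's joint p-value from the reported joint LR, using $P(\chi^2_2 > x) = e^{-x/2}$.

```python
import math

def xlogy(count, prob):
    # count * log(prob), treating a zero count as contributing 0.
    return count * math.log(prob) if count > 0 else 0.0

def christoffersen_lr(hits):
    """Independence LR from first-order transition counts (needs len(hits) >= 2)."""
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits, hits[1:]):
        n[prev][curr] += 1  # n[i][j]: days in state i followed by a day in state j
    (n00, n01), (n10, n11) = n
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0  # P(exception | none yesterday)
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0  # P(exception | exception yesterday)
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)      # pooled rate over all transitions
    ll_null = xlogy(n00 + n10, 1.0 - pi) + xlogy(n01 + n11, pi)
    ll_alt = (xlogy(n00, 1.0 - pi01) + xlogy(n01, pi01)
              + xlogy(n10, 1.0 - pi11) + xlogy(n11, pi11))
    return -2.0 * (ll_null - ll_alt)

def chi2_sf_df2(x):
    """Upper-tail probability P(X > x) for a chi-square(2) variable."""
    return math.exp(-x / 2.0)

# Hypothetical 30-day sequence with all 7 exceptions on consecutive days.
clustered_hits = [0] * 10 + [1] * 7 + [0] * 13
lr_ind = christoffersen_lr(clustered_hits)

# Joint statistic for the scenario in the text: Kupiec LR plus Christoffersen LR of 0.
joint_lr = 32.33833117288029 + 0.0
joint_p = chi2_sf_df2(joint_lr)
print(f"clustered LR_ind = {lr_ind:.2f}, scenario joint p = {joint_p:.2e}")
```

The clustered sequence produces an independence LR well above the $\chi^2(1)$ critical value of 6.63, while the scenario's joint p-value comes out at about 9.5e-8, in line with the table.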
The bootstrap alternative
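The bootstrap approach [3]:
- Take the observed exception indicator sequence (0/1 for each day) and compute its Kupiec LR.
- Generate $N$ alternative sequences under the null. Resampling the observed indicators with replacement reproduces the observed exception rate rather than the null rate, so the null sequences are drawn as independent Bernoulli($p_{\text{expected}}$) indicators of the same length (a parametric bootstrap).
- For each sample, compute the Kupiec LR.
- The bootstrap p-value is the fraction of simulated LRs that equal or exceed the observed LR.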
For our scenario, the bootstrap would also reject: the observed LR of 32.34 lies far beyond any plausible null distribution. The advantage of the bootstrap over parametric Kupiec is robustness to sample size. For $n < 250$ observations, the $\chi^2(1)$ asymptotic approximation underlying Kupiec can be inaccurate, whereas the bootstrap p-value reflects the actual small-sample distribution of the statistic, up to simulation error.
For our $n = 30$, the bootstrap cannot resolve p-values below roughly $1/N$, so with 1,000 resamples it would report $p < 0.001$ rather than reproduce the parametric 1.3e-8. At conventional decision thresholds, both reject. The reason to run the bootstrap as a supplement is that it confirms the rejection does not depend on the asymptotic approximation.
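The simulation idea can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: null sequences are drawn as independent Bernoulli(`p`) indicators (a Monte Carlo test, or parametric bootstrap) rather than resampled from the observed hits, and the function names, the seed, and the default of 2,000 simulations are all choices made for the example. Because $n = 30$ is small, the null distribution can also be enumerated exactly over the 31 possible exception counts, which is included for comparison.

```python
import math
import random

def kupiec_lr(n, k, p):
    """Kupiec LR for k exceptions in n days against null rate p."""
    def log_lik(q):
        ll = 0.0
        if k > 0:
            ll += k * math.log(q)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - q)
        return ll
    return -2.0 * (log_lik(p) - log_lik(k / n))

def simulated_p_value(n, k_obs, p, n_sims=2000, seed=0):
    """Monte Carlo p-value: share of null sequences with LR >= observed LR."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    lr_obs = kupiec_lr(n, k_obs, p)
    extreme = 0
    for _ in range(n_sims):
        # Number of exceptions in one simulated sequence under the null.
        k_sim = sum(rng.random() < p for _ in range(n))
        if kupiec_lr(n, k_sim, p) >= lr_obs:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_sims + 1)

def exact_p_value(n, k_obs, p):
    """Exact null probability of an LR at least as large as observed."""
    lr_obs = kupiec_lr(n, k_obs, p)
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(n + 1)
               if kupiec_lr(n, k, p) >= lr_obs - 1e-12)

p_sim = simulated_p_value(n=30, k_obs=7, p=0.01)
p_exact = exact_p_value(n=30, k_obs=7, p=0.01)
print(f"simulated p-value = {p_sim:.1e}, exact p-value = {p_exact:.2e}")
```

The simulated estimate is floored near `1 / (n_sims + 1)`, so it confirms the rejection without pinning down the tail probability; the enumeration gives the small-sample value directly and lands in the same order of magnitude as the parametric p-value.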
When each test is right
Kupiec
- Basel III / EBA-named default for VaR backtest validation [4].
- Quick and parametric; no resampling.
- Well-understood asymptotic distribution.
Cons:
- Asymptotic distribution slow to converge at small samples ($n < 250$).
- Tests only proportion, not timing of exceptions.
Christoffersen
- Catches clustered exceptions (storm-cluster pattern).
- Combined with Kupiec for full coverage.
Cons:
- Requires more observations to test independence meaningfully.
- Has weak power against subtle clustering patterns.
Bootstrap
- Small-sample p-values, accurate up to simulation error.
- Sidesteps the asymptotic-approximation issue.
- Can test custom test statistics that parametric tests cannot.
Cons:
- Slower (1,000+ resamples).
- Less standardised — regulators expect Kupiec / Christoffersen.
For BaFin / EBA-supervised internal-model approval, the regulator expects Kupiec and Christoffersen [5]. The bootstrap is supplementary: useful in research and as a robustness check, but not a substitute for the named tests.
The empirical interpretation
The scenario's 7 exceptions in 30 days at 99% VaR is roughly 23x the expected rate. That is not "model is slightly miscalibrated"; that is "model is wrong." Possible causes:
- The VaR threshold is too tight. A 99% threshold should produce roughly 2.5 exceptions per year on a 250-day window. 7 in 30 days suggests a misspecified threshold.
- The P&L distribution is fatter-tailed than the model assumes. Normal-based VaR consistently under-states fat-tail exceedance rates.
- A regime change has occurred. The model was calibrated on a calmer regime and the recent 30 days fall in a different regime.
The fix is not to tweak the Kupiec parameters; it is to refit the VaR model with a fatter-tailed distribution (Student-t, GARCH residuals, empirical) and re-validate.
Sample size matters
At $n = 30$, the test has just enough power to reject this extreme misspecification. For more subtle misspecifications, sample size requirements grow:
| Misspecification | Sample needed for 80% power |
|---|---|
| 23x expected rate (this scenario) | 25-30 days |
| 5x expected rate | 100-200 days |
| 2x expected rate | 500+ days |
| 1.5x expected rate | 2,000+ days |
For Basel III internal-model validation, the named requirement is 250 days minimum, which gives meaningful power against ~3x misspecifications. Below that sample size, the test has weak power against modest deviations from calibration.
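Power figures of this kind can be checked by enumeration rather than simulation. The sketch below assumes a 5% test size (the $\chi^2(1)$ critical value 3.841) and uses the asymptotic critical value throughout; the table above does not state its significance level, so the numbers it prints need not match the table exactly. Function names are illustrative.

```python
import math

def kupiec_lr(n, k, p):
    """Kupiec LR for k exceptions in n days against null rate p."""
    def log_lik(q):
        ll = 0.0
        if k > 0:
            ll += k * math.log(q)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - q)
        return ll
    return -2.0 * (log_lik(p) - log_lik(k / n))

def binom_pmf(n, k, p):
    # Computed in log space so that large n does not overflow.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def kupiec_power(n, p_null, p_true, critical_value=3.841):
    """Probability that the test rejects when the true exception rate is p_true."""
    return sum(binom_pmf(n, k, p_true)
               for k in range(n + 1)
               if kupiec_lr(n, k, p_null) > critical_value)

# Power against a true rate of 2% (twice the modelled 1%) at several sample sizes.
for n_days in (30, 250, 1000):
    power = kupiec_power(n_days, p_null=0.01, p_true=0.02)
    print(f"n = {n_days:4d}: power = {power:.3f}")
```

Enumerating over exception counts keeps the result deterministic, which makes it easier to compare sample sizes than a simulated estimate would be.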
Failure modes
- Quoting Kupiec without Christoffersen. The combined test catches more failure modes than either alone.
- Treating non-rejection as model validation. Non-rejection means "not enough evidence to reject," not "model is correct."
- Skipping bootstrap supplementation at small samples. Parametric Kupiec at $n < 100$ relies on an asymptotic approximation; the bootstrap does not.
- Ignoring the regime-change diagnosis. A failed VaR backtest is often a regime-change signal, not a calibration bug.
FAQ
Why is Christoffersen LR = 0 in this example?
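The tool finds no evidence that the 7 exceptions cluster within the 30-day window. Christoffersen's test estimates the transition matrix between exception and non-exception states; the LR is 0 when the estimated probability of an exception is the same after an exception day as after a non-exception day, and the test then does not reject independence. An LR of exactly 0 may also reflect a degenerate-case fallback in an implementation (for example, an empty transition count), so it is worth checking the hit sequence before reading it as positive evidence of independence.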
Does Basel III require both tests?
Basel III internal-model-approach validation references the Kupiec test by name and recommends supplementing with Christoffersen-type independence checks [4]. Both should be in a defensible VaR validation report.
Should I use the bootstrap if my sample is small?
Yes, as a supplement. Report both Kupiec and bootstrap p-values when $n < 200$. If they agree, the rejection is firm. If they disagree, investigate; a breakdown of the asymptotic approximation at small samples is the usual cause.
Connects to
- Returns Distribution: Fat Tails in an Equity Portfolio: distributional diagnostics underpinning VaR.
- Sharpe Ratio Trap: performance metric framing.
- How to Read a Backtest Report: VaR backtests in the broader report.
- Synthetic Data: GARCH vs GBM for Backtesting: GARCH residuals as a VaR input.
- VaR Backtest (Kupiec & Christoffersen): re-run on your model.
- VaR Backtest methodology: full input/output specification.
References
1. Kupiec, P. H. (1995). "Techniques for Verifying the Accuracy of Risk Measurement Models." Journal of Derivatives 3(2), 73–84. pm-research.com
2. Christoffersen, P. F. (1998). "Evaluating Interval Forecasts." International Economic Review 39(4), 841–862. jstor.org
3. Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall. routledge.com
4. BCBS (2019). "Minimum capital requirements for market risk." bis.org
5. EBA (2024). "Guidelines on internal models." eba.europa.eu
Verified engine output
Inputs
| Input | Value |
|---|---|
| confidence_level | 0.99 |
| var_series (30 items) | [...] |
| pnl (30 items) | [...] |
Outputs
| Output | Value |
|---|---|
| exceptions | 7 |
| observations | 30 |
| observed rate | 0.23333333333333334 |
| expected rate | 0.010000000000000009 |
| kupiec lr | 32.33833117288029 |
| kupiec pvalue | 1.29902661960557e-8 |
| christoffersen lr | 0 |
| christoffersen pvalue | 1 |
| joint lr | 32.33833117288029 |
| joint pvalue | 9.502122144677827e-8 |
| hits (30 items) | [...] |
Computed live at build time.