Kupiec vs Bootstrap for VaR Validation

A 30-day VaR backtest at 99% confidence with 7 observed exceptions against an expected exception count of 0.3 returns a Kupiec LR of 32.34 (p-value 1.3e-8) and a joint LR (Kupiec + Christoffersen) of 32.34 with a joint p-value of 9.5e-8. The VaR Backtest (Kupiec & Christoffersen) tool rejects the VaR model decisively. The bootstrap alternative, which builds an empirical distribution of the LR statistic by resampling, agrees on the rejection but converges more slowly. For published VaR validation under MAR/Basel III framing, both tests have value; Kupiec is the formal default.

TL;DR

30-day, 99%-confidence VaR backtest. Expected exceptions: 0.3. Observed: 7.
Kupiec LR: 32.34. Kupiec p-value: 1.3e-8. Model is rejected at any standard α.
Christoffersen LR: 0. No serial dependence in exceptions (good).
Joint LR: 32.34. Joint p-value: 9.5e-8. Combined rejection.
Bootstrap variant agrees but needs 1,000+ resamples to converge.
For Basel III / EBA validation, Kupiec is the named default; the bootstrap is supplementary.

The scenario

A daily VaR model produces 99%-confidence loss estimates for 30 trading days. The realised P&L violates the VaR threshold on 7 days, against an expected 0.3 violations under a correctly-calibrated 99% VaR. The VaR Backtest returns:

Metric	Value
Observations	30
Exceptions observed	7
Observed exception rate	23.3%
Expected exception rate	1.0%
Kupiec LR statistic	32.34
Kupiec p-value	1.30e-8
Christoffersen LR (independence)	0.00
Christoffersen p-value	1.00
Joint LR (Kupiec + Christoffersen)	32.34
Joint p-value	9.50e-8

The Kupiec test rejects the VaR model at any standard significance level. The Christoffersen independence test does not reject independence, which means the violations are not clustered in time, only too frequent in aggregate.

What Kupiec actually tests

The Kupiec proportion-of-failures test computes the likelihood ratio¹:

$$
\text{LR}_{\text{kupiec}} = -2 \ln\left( \frac{p_{\text{expected}}^{k} \, (1 - p_{\text{expected}})^{n-k}}{p_{\text{observed}}^{k} \, (1 - p_{\text{observed}})^{n-k}} \right)
$$

where $k$ is the observed exception count, $n$ is the number of observations, $p_{\text{expected}}$ is the modelled exception rate, and $p_{\text{observed}} = k/n$. The statistic is approximately $\chi^2(1)$ under the null that the VaR model is correctly calibrated.

For our scenario, 7 exceptions out of 30 is far above the 0.3 expected. The LR is 32.34, well past the $\chi^2(1)$ critical value of 6.63 at $\alpha = 0.01$.
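The calculation can be checked by hand with a few lines of Python. This is a minimal sketch using only the standard library; the function names `kupiec_lr` and `chi2_sf_1dof` are illustrative and are not the tool's API. The p-value uses the identity that the $\chi^2(1)$ survival function equals $\operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def kupiec_lr(k: int, n: int, p_expected: float) -> float:
    """Kupiec proportion-of-failures likelihood ratio (illustrative helper)."""
    def loglik(p: float) -> float:
        # Binomial log-likelihood without the combinatorial term.
        # Terms with a zero count are skipped, i.e. 0 * log(0) is treated as 0.
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - p)
        return ll

    p_observed = k / n
    return -2.0 * (loglik(p_expected) - loglik(p_observed))

def chi2_sf_1dof(x: float) -> float:
    """Survival function of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

lr = kupiec_lr(k=7, n=30, p_expected=0.01)
p_value = chi2_sf_1dof(lr)
print(f"Kupiec LR = {lr:.2f}, p-value = {p_value:.2e}")
```

With the scenario's inputs this should agree with the table values (32.34 and 1.30e-8) up to rounding.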

What Christoffersen adds

Christoffersen's independence test asks whether exceptions cluster². The test computes the transition probabilities between non-exception and exception days and tests independence under the null. Our scenario has Christoffersen LR = 0, meaning there is no evidence of clustering. The 7 exceptions are spread through the 30-day window rather than concentrated in one storm.

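The joint Kupiec + Christoffersen test combines both into a single $\chi^2(2)$ statistic. For our scenario, the joint LR is 32.34 because Christoffersen contributes 0. The joint p-value (9.5e-8) is larger than the Kupiec p-value (1.3e-8) because the same statistic is now referred to a distribution with two degrees of freedom, but it still rejects.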
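As a rough illustration of the independence test, the sketch below counts first-order transitions in a 0/1 exception sequence and forms the likelihood ratio. The article's actual hit sequence is not shown, so the `clustered` sequence here is hypothetical, and the function names are illustrative rather than the tool's API. The joint p-value uses the fact that the $\chi^2(2)$ survival function is $e^{-x/2}$.

```python
import math

def christoffersen_lr(hits: list[int]) -> float:
    """Independence LR against a first-order Markov alternative (needs len(hits) >= 2)."""
    # n[i][j] counts days in state i that are followed by a day in state j.
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1

    def xlogy(count: int, prob: float) -> float:
        # count * log(prob), skipping zero counts so 0 * log(0) is treated as 0.
        return count * math.log(prob) if count > 0 else 0.0

    n0 = n[0][0] + n[0][1]
    n1 = n[1][0] + n[1][1]
    pi01 = n[0][1] / n0 if n0 else 0.0    # P(exception | no exception the day before)
    pi11 = n[1][1] / n1 if n1 else 0.0    # P(exception | exception the day before)
    pi = (n[0][1] + n[1][1]) / (n0 + n1)  # pooled exception probability

    ll_null = xlogy(n[0][0] + n[1][0], 1 - pi) + xlogy(n[0][1] + n[1][1], pi)
    ll_alt = (xlogy(n[0][0], 1 - pi01) + xlogy(n[0][1], pi01)
              + xlogy(n[1][0], 1 - pi11) + xlogy(n[1][1], pi11))
    return -2.0 * (ll_null - ll_alt)

def chi2_sf_2dof(x: float) -> float:
    """Survival function of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Hypothetical 30-day sequence: 7 exceptions on consecutive days.
clustered = [0] * 20 + [1] * 7 + [0] * 3
print(f"clustered LR_ind = {christoffersen_lr(clustered):.2f}")

# Joint test using the article's reported components (Kupiec LR plus an independence LR of 0).
joint_lr = 32.33833117288029 + 0.0
print(f"joint p-value = {chi2_sf_2dof(joint_lr):.2e}")
```

A tightly clustered sequence like the hypothetical one above produces a large independence LR, which is the pattern the article's scenario does not show.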

The bootstrap alternative

The bootstrap approach³:

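Take the observed exception indicator sequence (0/1 for each day) and compute its Kupiec LR.
Generate $N$ alternative sequences by resampling with the null imposed, so that each sequence has the modelled exception rate; plain resampling with replacement from the observed sequence would reproduce the observed rate instead.
For each sample, compute the Kupiec LR.
The bootstrap p-value is the fraction of resampled LRs that are at least as large as the observed LR.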

For our scenario, the bootstrap would also reject (observed LR 32.34 is at the extreme of any plausible null distribution). The advantage of the bootstrap over parametric Kupiec is robustness to sample size: for $n < 250$ observations, the $\chi^2(1)$ asymptotic approximation underlying Kupiec converges slowly, and the bootstrap p-value does not rely on it.

For our $n = 30$, the bootstrap p-value would be close to but not identical to the Kupiec parametric p-value. At conventional decision thresholds, both reject. The reason to run the bootstrap as a supplement is to confirm that the rejection does not depend on the $\chi^2(1)$ approximation.
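One way to realise the resampling steps is a parametric bootstrap: draw each day's indicator as a Bernoulli variable with the modelled exception rate, which imposes the null directly. The sketch below makes that assumption; it is not necessarily how the tool or any regulator implements it, and the names `kupiec_lr` and `bootstrap_p_value` are illustrative.

```python
import math
import random

def kupiec_lr(k: int, n: int, p_expected: float) -> float:
    """Kupiec LR; terms with a zero count are skipped (0 * log(0) treated as 0)."""
    def loglik(p: float) -> float:
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - p)
        return ll
    return -2.0 * (loglik(p_expected) - loglik(k / n))

def bootstrap_p_value(k_obs: int, n: int, p_expected: float,
                      n_resamples: int = 1000, seed: int = 0) -> float:
    """Share of null-simulated LRs at least as large as the observed LR."""
    rng = random.Random(seed)
    lr_obs = kupiec_lr(k_obs, n, p_expected)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Simulate n days under the null: each day is an exception with prob p_expected.
        k_sim = sum(rng.random() < p_expected for _ in range(n))
        if kupiec_lr(k_sim, n, p_expected) >= lr_obs - 1e-12:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_resamples + 1)

p_boot = bootstrap_p_value(k_obs=7, n=30, p_expected=0.01, n_resamples=2000, seed=1)
print(f"bootstrap p-value = {p_boot:.4f}")
```

Note that the resolution of this estimate is bounded below by $1/(N+1)$, so with a few thousand resamples it can confirm a rejection at conventional thresholds but cannot resolve p-values as small as the parametric 1.3e-8.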

When each test is right

Kupiec

Pros:

Basel III / EBA-named default for VaR backtest validation⁴.
Quick and parametric; no resampling.
Well-understood asymptotic distribution.

Cons:

Asymptotic distribution slow to converge at small samples ($n < 250$).
Tests only proportion, not timing of exceptions.

Christoffersen

Pros:

Catches clustered exceptions (storm-cluster pattern).
Combined with Kupiec for full coverage.

Cons:

Requires more observations to test independence meaningfully.
Has weak power against subtle clustering patterns.

Bootstrap

Pros:

Small-sample p-values that are accurate up to resampling error.
Sidesteps the asymptotic-approximation issue.
Can test custom test statistics that parametric tests cannot.

Cons:

Slower (1,000+ resamples).
Less standardised — regulators expect Kupiec / Christoffersen.

For BaFin / EBA-supervised internal-model-approval, the regulator expects Kupiec and Christoffersen⁵. The bootstrap is supplementary, useful in research and as a robustness check, but not a substitute for the named tests.

The empirical interpretation

The scenario's 7 exceptions in 30 days at 99% VaR is roughly 23x the expected rate. That is not "model is slightly miscalibrated"; that is "model is wrong." Possible causes:

The VaR threshold is too tight. A 99% threshold should produce roughly 2.5 exceptions per year on a 250-day window. 7 in 30 days suggests a misspecified threshold.
The P&L distribution is fatter-tailed than the model assumes. Normal-based VaR consistently under-states fat-tail exceedance rates.
A regime change has occurred. The model was calibrated on a calmer regime and the recent 30 days fall in a different regime.

The fix is not to tweak the Kupiec parameters; it is to refit the VaR model with a fatter-tailed distribution (Student-t, GARCH residuals, empirical) and re-validate.
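To put the "roughly 23x" figure in perspective, the arithmetic and the exact binomial tail probability can be checked directly. This is a small sketch under the assumption that exceptions are independent with the modelled 1% rate; `binomial_tail` is an illustrative helper, not part of the tool.

```python
import math

def binomial_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

n, p_expected, k_obs = 30, 0.01, 7

expected_count = n * p_expected        # expected exceptions in the window
ratio = k_obs / expected_count         # observed count relative to expected
tail = binomial_tail(k_obs, n, p_expected)

print(f"expected exceptions = {expected_count:.1f}")
print(f"observed / expected = {ratio:.1f}x")
print(f"P(at least {k_obs} exceptions | calibrated model) = {tail:.2e}")
```

The tail probability is of the same order as the Kupiec p-value, which is consistent with reading the result as a wrong model rather than bad luck.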

Sample size matters

At $n = 30$, the test has just enough power to reject this extreme misspecification. For more subtle misspecifications, sample size requirements grow:

Misspecification	Sample needed for 80% power
23x expected rate (this scenario)	25-30 days
5x expected rate	100-200 days
2x expected rate	500+ days
1.5x expected rate	2,000+ days

For Basel III internal-model validation, the named requirement is 250 days minimum, which gives meaningful power against ~3x misspecifications. Below that sample size, the test has weak power against modest deviations from calibration.
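Power figures of this kind can be estimated by simulation. The sketch below is one way to do it, assuming a two-sided Kupiec test at $\alpha = 0.05$ (the table does not state its significance level, so results may not match its figures exactly). The names `kupiec_power` and `CHI2_1_CRIT_5PCT` are illustrative.

```python
import math
import random

CHI2_1_CRIT_5PCT = 3.841  # chi-square(1) critical value at alpha = 0.05 (assumed level)

def kupiec_lr(k: int, n: int, p_expected: float) -> float:
    """Kupiec LR; terms with a zero count are skipped (0 * log(0) treated as 0)."""
    def loglik(p: float) -> float:
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1.0 - p)
        return ll
    return -2.0 * (loglik(p_expected) - loglik(k / n))

def kupiec_power(n: int, p_expected: float, multiplier: float,
                 n_sims: int = 2000, seed: int = 0) -> float:
    """Estimated rejection rate when the true exception rate is multiplier * p_expected."""
    rng = random.Random(seed)
    p_true = min(multiplier * p_expected, 1.0)
    rejections = 0
    for _ in range(n_sims):
        # Simulate one backtest window under the misspecified (true) rate.
        k = sum(rng.random() < p_true for _ in range(n))
        if kupiec_lr(k, n, p_expected) > CHI2_1_CRIT_5PCT:
            rejections += 1
    return rejections / n_sims

for mult, days in [(23.3, 30), (5, 150), (2, 500)]:
    print(f"{mult}x rate, n = {days}: power ~ {kupiec_power(days, 0.01, mult):.2f}")
```

Setting `multiplier=1` gives the test's actual rejection rate under a correctly calibrated model, which is a useful check on how far the small-sample size departs from the nominal $\alpha$.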

Failure modes

Quoting Kupiec without Christoffersen. The combined test catches more failure modes than either alone.
Treating non-rejection as model validation. Non-rejection means "not enough evidence to reject," not "model is correct."
Skipping bootstrap supplementation at small samples. Parametric Kupiec at $n < 100$ leans on an asymptotic approximation that has not yet converged; the bootstrap avoids it.
Ignoring the regime-change diagnosis. A failed VaR backtest is often a regime-change signal, not a calibration bug.

FAQ

Why is Christoffersen LR = 0 in this example?

The 7 exceptions are scattered through the 30-day window, not clustered. Christoffersen's test estimates the transition probabilities between exception and non-exception states; the LR is 0 when the estimated probability of an exception is the same after an exception day as after a non-exception day, so the test does not reject independence.

Does Basel III require both tests?

Basel III internal-model-approach validation references the Kupiec test by name and recommends supplementing with Christoffersen-type independence checks⁴. Both should be in a defensible VaR validation report.

Should I use the bootstrap if my sample is small?

Yes, as a supplement. Report both Kupiec and bootstrap p-values when $n < 200$. If they agree, the rejection is firm. If they disagree, investigate: a breakdown of the asymptotic approximation at small samples is the usual cause.

Connects to

Returns Distribution: Fat Tails in an Equity Portfolio: distributional diagnostics underpinning VaR.
Sharpe Ratio Trap: performance metric framing.
How to Read a Backtest Report: VaR backtests in the broader report.
Synthetic Data: GARCH vs GBM for Backtesting: GARCH residuals as a VaR input.
VaR Backtest (Kupiec & Christoffersen): re-run on your model.
VaR Backtest methodology: full input/output specification.

References

1. Kupiec, P. H. (1995). "Techniques for Verifying the Accuracy of Risk Measurement Models." Journal of Derivatives 3(2), 73–84. pm-research.com
2. Christoffersen, P. F. (1998). "Evaluating Interval Forecasts." International Economic Review 39(4), 841–862. jstor.org
3. Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall. routledge.com
4. BCBS (2019). "Minimum capital requirements for market risk." bis.org
5. EBA (2024). "Guidelines on internal models." eba.europa.eu

Verified engine output

30-day 99% VaR backtest, flat VaR threshold, 7 exceptions vs 0.3 expected

Inputs
confidence_level	0.99
var_series (30 items)	[...]
pnl (30 items)	[...]

Result
exceptions	7
observations	30
observed rate	0.23333333333333334
expected rate	0.010000000000000009
kupiec lr	32.33833117288029
kupiec pvalue	1.29902661960557e-8
christoffersen lr	0
christoffersen pvalue	1
joint lr	32.33833117288029
joint pvalue	9.502122144677827e-8
hits (30 items)	[...]

Computed live at build time.
