On two synthetic 60-day P&L series with identical exception counts (4 breaches of a 95% VaR) but different clustering, the VaR Backtest Kupiec-Christoffersen engine returns: Kupiec LR = 0.319, Kupiec p-value = 0.572 for both series (Kupiec sees only the breach count, not the order); Christoffersen LR = 6.524, p-value = 0.011 on the clustered series and Christoffersen LR = 0, p-value = 1.000 on the isolated-breaches series. The joint p-values: 0.033 (rejects) vs 0.853 (passes). A VaR model that passes Kupiec at p > 0.05 is not safe to deploy without also passing Christoffersen.

TL;DR

Two 60-day tapes, 4 VaR exceptions each, two breach patterns:

Breach pattern | Kupiec p-value | Christoffersen p-value | Joint p-value | Decision
Clustered (days 18, 19, 20, 42) | 0.572 (pass) | 0.011 (fail) | 0.033 (fail) | Reject the VaR model
Isolated (days 5, 22, 41, 55) | 0.572 (pass) | 1.000 (pass) | 0.853 (pass) | Pass, VaR model is consistent

Kupiec's frequency test is blind to clustering. Christoffersen's independence test catches what Kupiec misses. The combined test is the cheap default for any VaR claim.

What each test answers

The two tests answer separable questions:

Kupiec frequency test. "Is the number of breaches consistent with the stated confidence level?" For a 95% VaR over 60 days the expected exception count is 3 (5% × 60); the observed count is 4 in both canonical runs. The likelihood-ratio test asks whether the observed count is too high (or too low) given the binomial distribution. A p-value of 0.572 means "not significantly different from the expected count", so Kupiec passes.
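The frequency test can be reproduced in a few lines. The following is a minimal sketch of the textbook proportion-of-failures statistic, not the engine's own code; the function name `kupiec_pof` and its argument names are illustrative assumptions. It uses only the standard library, relying on the fact that the chi-squared survival function with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

def kupiec_pof(exceptions: int, observations: int, confidence_level: float):
    """Kupiec proportion-of-failures LR statistic and chi-squared(1) p-value."""
    p = 1.0 - confidence_level            # expected breach probability
    x, n = exceptions, observations
    pi_hat = x / n                        # observed breach rate

    def loglik(prob: float) -> float:
        # Binomial log-likelihood; a 0 * log(0) term is treated as 0.
        ll = 0.0
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - prob)
        if x > 0:
            ll += x * math.log(prob)
        return ll

    lr = -2.0 * (loglik(p) - loglik(pi_hat))
    # Survival function of chi-squared with 1 dof: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value

lr, p_value = kupiec_pof(exceptions=4, observations=60, confidence_level=0.95)
print(round(lr, 3), round(p_value, 3))   # 0.319 0.572
```

Only the breach count and the number of observations enter the function, which is why both canonical runs produce the same Kupiec result.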

Christoffersen independence test. "Are the breaches independent over time?" If they are independent, the conditional probability of a breach given that the previous day was a breach equals the conditional probability given no previous breach. A clustering signature (breaches that follow each other more often than chance) inflates the conditional-on-previous-breach probability, and the test detects it via a likelihood ratio.

The two tests measure different failure modes. Kupiec catches systematic mis-calibration (the model overstates or understates risk on average). Christoffersen catches regime-dependence (the model fails specifically when the regime changes). A VaR model that uses unconditional volatility forecasts often passes Kupiec but fails Christoffersen because it cannot adapt to vol clustering.

The clustered run

The clustered run has breaches on days 18, 19, 20, and 42 (zero-based positions in the 60-entry tape): three consecutive days followed by one isolated breach. The engine's hits array shows the pattern:

[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,
 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Kupiec sees 4 breaches in 60 obs → observedRate 6.7%, expectedRate 5%. The likelihood ratio is small (LR = 0.319) because the count is close to expectation. Kupiec passes.
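For readers who want to see where a hits array and an observed rate come from, here is a small sketch. The sign convention (VaR stored as a positive loss threshold, a breach being a P&L below minus VaR) is an assumption, as are the function name `hit_sequence` and the six-day toy tape; the engine's own convention is not stated here.

```python
def hit_sequence(pnl, var_series):
    """Return 1 where the realised loss exceeds the VaR forecast, else 0."""
    # Assumed sign convention: VaR is a positive loss threshold, losses are negative P&L.
    return [1 if p < -v else 0 for p, v in zip(pnl, var_series)]

# Hypothetical six-day tape, for illustration only.
pnl = [0.4, -1.2, -2.6, 0.1, -3.1, 0.8]
var_series = [2.0] * 6

hits = hit_sequence(pnl, var_series)
observed_rate = sum(hits) / len(hits)
print(hits, observed_rate)
```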

Christoffersen counts transitions: of the 4 breach days, 2 followed a breach the prior day (days 19 and 20 followed days 18 and 19 respectively). Each of the 4 breach days has a following day, and 2 of those following days are breaches, so the conditional probability of a breach given the previous day was a breach is 2/4 = 50%; the unconditional probability is 6.7%. The roughly 7× discrepancy is the LR's driver: LR = 6.524, p = 0.011. The test rejects independence at the 5% level.
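The transition counting can be sketched as follows. This is the textbook first-order Markov likelihood ratio, not the engine's source; the function name `christoffersen_independence` is hypothetical. One assumption is labeled in the code: when no breach follows a breach, the sketch returns LR = 0 so that it agrees with the output reported for the isolated run, whereas the unguarded formula would give a small positive value in that case.

```python
import math

def christoffersen_independence(hits):
    """First-order Markov independence LR statistic and chi-squared(1) p-value."""
    # n[i][j] counts transitions from state i on day t-1 to state j on day t.
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]

    # Assumption: report LR = 0 when no breach follows a breach, chosen to match
    # the reported isolated-run output; the unguarded formula would not give 0.
    if n11 == 0:
        return 0.0, 1.0

    def xlogy(count, prob):
        # count * log(prob), with 0 * log(0) treated as 0.
        return count * math.log(prob) if count > 0 else 0.0

    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0   # P(breach | no breach yesterday)
    pi11 = n11 / (n10 + n11)                           # P(breach | breach yesterday)
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)         # pooled breach probability

    ll_null = xlogy(n00 + n10, 1 - pi) + xlogy(n01 + n11, pi)
    ll_alt = (xlogy(n00, 1 - pi01) + xlogy(n01, pi01)
              + xlogy(n10, 1 - pi11) + xlogy(n11, pi11))
    lr = -2.0 * (ll_null - ll_alt)
    return lr, math.erfc(math.sqrt(max(lr, 0.0) / 2.0))

clustered = [1 if day in (18, 19, 20, 42) else 0 for day in range(60)]
lr, p_value = christoffersen_independence(clustered)
print(round(lr, 3), round(p_value, 3))   # 6.524 0.011
```

On the clustered tape the counts are n00 = 53, n01 = 2, n10 = 2, n11 = 2 over 59 transitions, which reproduces the reported statistic.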

The joint test combines both LR statistics and reports p = 0.033, failing at the 5% level. The VaR model is mis-specified: its exception rate is roughly correct on average but its breaches cluster, which is the regime-fragility signature.
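The reported joint figures are consistent with adding the two LR statistics and comparing the sum against a chi-squared distribution with 2 degrees of freedom, whose survival function has the closed form exp(-x / 2). The sketch below reproduces them under that assumption; `conditional_coverage` is an illustrative name, not an engine function.

```python
import math

def conditional_coverage(kupiec_lr: float, christoffersen_lr: float):
    """Joint LR statistic and chi-squared(2) p-value."""
    joint_lr = kupiec_lr + christoffersen_lr
    # Survival function of chi-squared with 2 dof has the closed form exp(-x / 2).
    return joint_lr, math.exp(-joint_lr / 2.0)

clustered_joint = conditional_coverage(0.3191039444813413, 6.524116855057301)
isolated_joint = conditional_coverage(0.3191039444813413, 0.0)
print(clustered_joint)   # joint p-value close to 0.033
print(isolated_joint)    # joint p-value close to 0.853
```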

The isolated-breaches run

Same exception count (4 in 60 obs), but breaches on days 5, 22, 41, 55, spread across the tape with no consecutive pair. Kupiec returns identical statistics (LR = 0.319, p = 0.572). Christoffersen counts 0 breach-after-breach transitions, so the tape carries no evidence of clustering; the engine reports LR = 0, p = 1.000.

The joint p-value is 0.853, so the VaR model passes both tests. The model is well-specified on this tape: the breach count is in the right ballpark and the breaches do not cluster, which is what a properly conditioned VaR forecast should produce.

This is the diagnostic case the engine is built to surface. Same exception count, two completely different conclusions. A reviewer who runs only Kupiec accepts both runs as equivalent and deploys a regime-fragile model.

Why Kupiec alone is insufficient

The Kupiec frequency test relies on the binomial distribution to compute the LR. It is blind to ordering by construction: the test statistic depends only on the count and the total number of observations, not on the sequence. Two tapes with the same (count, obs) get identical Kupiec verdicts regardless of how the breaches are distributed.
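Written out, the proportion-of-failures statistic makes the blindness explicit. The notation is the standard textbook one rather than anything taken from the engine: n observations, x breaches, and target breach probability p equal to one minus the confidence level.

```latex
LR_{uc} \;=\; -2 \ln
\frac{(1-p)^{\,n-x}\; p^{\,x}}
     {\left(1-\tfrac{x}{n}\right)^{n-x} \left(\tfrac{x}{n}\right)^{x}}
\;\sim\; \chi^2_{1} \quad \text{(asymptotically, under the null)}
```

Only x and n appear in the expression; the dates of the breaches do not, so any reordering of the same tape leaves the statistic unchanged.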

For a stationary independent-breach model, this is fine. For a regime-switching model (VaR forecasts that don't update fast enough through a regime change), the breaches will cluster and Kupiec will miss it. The empirical history of VaR backtesting after 2008 is largely the story of models that passed Kupiec on rolling 252-day windows but clustered breaches around regime switches [1].

Christoffersen's 1998 paper introduced the independence test precisely to close this gap [2]. The Basel Committee on Banking Supervision now requires the joint test (or an equivalent conditional-coverage test) for bank trading-book models. Retail VaR models rarely get this rigor; the engine's joint output is the cheap way to apply institutional discipline.

The decision rule

A defensible VaR backtest pipeline:

  1. Compute Kupiec LR and p-value. If p < 0.05, the model is mis-calibrated on frequency. Stop, fix the unconditional vol estimate.
  2. If Kupiec passes, compute Christoffersen LR and p-value. If p < 0.05, the model has regime-dependence; the unconditional vol is right but the conditional updating is too slow. Switch to GARCH(1,1) or a regime-switching framework.
  3. Compute the joint LR and p-value. If both individual tests pass but the joint fails, each statistic is moderately elevated and only their sum is large enough to reject, which neither test alone sees. This is rare but the engine surfaces it.
  4. If all three pass, accept the model, for now. Re-run on every new 60-90-day window.
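The four steps above can be sketched as a single function over the three p-values. The function name `backtest_decision`, the returned strings, and the default 5% threshold argument `alpha` are illustrative choices, not part of the engine.

```python
def backtest_decision(kupiec_p, christoffersen_p, joint_p, alpha=0.05):
    """Apply the four-step rule to the three p-values and return a verdict."""
    if kupiec_p < alpha:
        return "reject: frequency mis-calibrated (step 1)"
    if christoffersen_p < alpha:
        return "reject: breaches cluster (step 2)"
    if joint_p < alpha:
        return "reject: joint conditional coverage fails (step 3)"
    return "accept for now; re-run on the next window (step 4)"

# p-values from the two canonical runs.
print(backtest_decision(0.572, 0.011, 0.033))   # reject: breaches cluster (step 2)
print(backtest_decision(0.572, 1.000, 0.853))   # accept for now; re-run on the next window (step 4)
```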

For the canonical clustered run, the pipeline stops at step 2: the VaR forecast is too slow to update through the breach cluster on days 18-20. The fix is to use conditional vol (GARCH or EWMA with a faster decay).
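As an illustration of conditional vol, here is a minimal EWMA sketch. It assumes zero-mean, normally distributed returns and seeds the recursion with the first squared return; the function name `ewma_var`, the seeding choice, and the default decay of 0.94 (the classic RiskMetrics daily value) are assumptions for the example, and a lower decay reacts faster.

```python
import math
from statistics import NormalDist

def ewma_var(returns, confidence_level=0.95, decay=0.94):
    """One-step-ahead parametric VaR from an EWMA variance recursion."""
    z = NormalDist().inv_cdf(confidence_level)   # about 1.645 at 95%
    # Seed the recursion with the first squared return (an arbitrary choice).
    variance = returns[0] ** 2
    var_series = []
    for r in returns[1:]:
        # Forecast for this day uses only returns observed before it.
        var_series.append(z * math.sqrt(variance))
        variance = decay * variance + (1.0 - decay) * r ** 2
    return var_series

# Hypothetical returns: a large loss on the third day raises the next forecast.
forecasts = ewma_var([0.01, -0.01, -0.05, 0.01])
print(forecasts)
```

The forecast following the 5% loss is larger than the one before it, which is the adaptive behaviour an unconditional vol estimate lacks.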

Where the tests break

Both tests assume the VaR threshold is correctly specified ex ante. A model that re-fits VaR on the same data it backtests against is biased toward passing Kupiec; the engine cannot detect this. The cure is strict separation of estimation window from backtest window.
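One way to enforce that separation is to compute each day's VaR only from data strictly before that day. The sketch below does this with a rolling historical-simulation quantile; the function name `rolling_historical_var`, the window length, and the particular empirical-quantile convention (the k-th worst P&L in the window) are assumptions, and other quantile conventions are equally defensible.

```python
def rolling_historical_var(pnl, window, confidence_level=0.95):
    """Historical-simulation VaR for day t from the `window` days before t."""
    # Index of the tail observation; rounding guards against float noise in 1 - cl.
    k = int(round((1.0 - confidence_level) * window, 9))
    var_series = []
    for t in range(window, len(pnl)):
        history = sorted(pnl[t - window:t])   # day t itself is excluded
        var_series.append(-history[k])        # report VaR as a positive loss
    return var_series

# Hypothetical tape: a repeating pattern of losses from 0 to 9.
pnl = [-float(i % 10) for i in range(40)]
print(rolling_historical_var(pnl, window=20))
```

The backtest would then compare `pnl[window:]` against the returned series, so no day is scored against a threshold that was fitted on that day's own outcome.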

The Christoffersen independence test asks about first-order Markov dependence: does today's breach depend on whether the immediately previous day was a breach? Higher-order clustering (breaches that cluster over 5-day windows but not on consecutive days) is detected only weakly. For long-memory failure modes, augment the joint test with a duration-between-breaches distributional test (the Christoffersen-Pelletier 2004 extension).
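The raw input to a duration-based test is the list of gaps between consecutive breaches. The sketch below computes only those gaps, not the Christoffersen-Pelletier test itself; `breach_durations` is an illustrative name. Under independence with breach probability p the gaps are geometrically distributed with mean 1/p, which is 20 days at 95% VaR.

```python
def breach_durations(hits):
    """Gaps, in observations, between consecutive breaches."""
    breach_days = [day for day, hit in enumerate(hits) if hit]
    return [later - earlier for earlier, later in zip(breach_days, breach_days[1:])]

clustered = [1 if day in (18, 19, 20, 42) else 0 for day in range(60)]
isolated = [1 if day in (5, 22, 41, 55) else 0 for day in range(60)]

print(breach_durations(clustered))   # [1, 1, 22]
print(breach_durations(isolated))    # [17, 19, 14]
```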

The tests are designed for daily P&L and daily VaR. Intraday VaR backtests need to handle the intraday seasonality; the engine accepts an arbitrary pnl array but the breach-clustering interpretation depends on the unit of observation. For weekly VaR over 60 weeks, the engine still works; for monthly VaR with only 12-20 observations, the test power is too low to be meaningful.

References

  • Christoffersen, P., & Pelletier, D. (2004). "Backtesting Value-at-Risk: A Duration-Based Approach." Journal of Financial Econometrics 2(1), 84–108. Higher-order clustering extension.
  • Basel Committee on Banking Supervision (2019). "Minimum capital requirements for market risk." Bank for International Settlements. https://www.bis.org/bcbs/publ/d457.htm
  • McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative Risk Management, 2nd ed., Princeton UP. Chapter 7 covers VaR backtesting in depth.

Footnotes

  1. Kupiec, P. H. (1995). "Techniques for Verifying the Accuracy of Risk Measurement Models." Journal of Derivatives 3(2), 73–84. The Kupiec proportion-of-failures test. https://www.pm-research.com/content/iijderiv/3/2/73

  2. Christoffersen, P. F. (1998). "Evaluating Interval Forecasts." International Economic Review 39(4), 841–862. The independence and conditional-coverage tests. https://www.jstor.org/stable/2527341

Verified engine output

Clustered breaches (days 18, 19, 20, 42)
Inputs
confidence_level: 0.95
pnl (60 items): [...]
var_series (60 items): [...]
Result
exceptions: 4
observations: 60
observed rate: 0.06666666666666667
expected rate: 0.050000000000000044
kupiec lr: 0.3191039444813413
kupiec pvalue: 0.5721465616636887
christoffersen lr: 6.524116855057301
christoffersen pvalue: 0.01064218855574195
joint lr: 6.843220799538642
joint pvalue: 0.03265979723658785
hits (60 items): [...]

Computed live at build time.

Isolated breaches (days 5, 22, 41, 55)
Inputs
confidence_level: 0.95
pnl (60 items): [...]
var_series (60 items): [...]
Result
exceptions: 4
observations: 60
observed rate: 0.06666666666666667
expected rate: 0.050000000000000044
kupiec lr: 0.3191039444813413
kupiec pvalue: 0.5721465616636887
christoffersen lr: 0
christoffersen pvalue: 1
joint lr: 0.3191039444813413
joint pvalue: 0.8525256585763135
hits (60 items): [...]

Computed live at build time.

Frequently asked questions

Why is Christoffersen's p = 1.000 on the isolated run?
The tape contains no breach-after-breach transition, so there is no evidence of clustering and the engine reports an LR statistic of 0. p = 1.000 means no evidence against the null of independence, the cleanest possible pass.
What if I have fewer than 30 observations?
The engine refuses because the chi-squared approximation to the LR distribution is poor at small N. Extend the history, or move from monthly to weekly or daily VaR to get more observations; monthly VaR over a year lacks test power.
Is the joint test always more conservative than the individual tests?
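No. The joint statistic is the sum of the two LR statistics and is compared against a chi-squared distribution with 2 degrees of freedom (1 from each test). Because the 2-dof critical value at the 5% level (5.99) is higher than the 1-dof value (3.84), the joint test can pass when one individual test marginally fails and the other passes comfortably. Report all three p-values.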
How does this relate to expected shortfall (ES)?
Expected shortfall backtests need different tests (Acerbi-Szekely or Du-Escanciano). Kupiec-Christoffersen is for VaR breaches; run both for tail-risk model validation.
Should I use 95% VaR or 99% VaR for backtest?
99% targets a deeper tail but has worse test power at typical sample sizes, because far fewer breaches are expected (0.6 in 60 days at 99% versus 3 at 95%); 95% is the retail default, 99% the Basel default. The engine accepts confidence_level as input.