aifinhub
Risk & Portfolio Construction Guide

How to Calibrate Probability Forecasts

A forecast that says 70 percent should be right about 70 percent of the time. When it is not, the probabilities are miscalibrated, and any decision sized on them, from a bet to a risk limit, is mis-sized. Calibration is separate from accuracy: a model can rank cases well yet state wildly wrong probabilities. Measuring calibration, diagnosing it visually, correcting it, and verifying the correction without overfitting the fix are all covered below.

By AI Fin Hub Research · AI Fin Hub Team
Forecast Scoring Sandbox

Paste a forecast stream (probability + outcome) and see Brier score with full decomposition, log loss, reliability diagram, and bootstrap confidence.

Before You Start

Set up the inputs that make the next steps easier

A set of probabilistic forecasts paired with the binary outcomes that eventually occurred.
Enough forecasts that each probability range has a meaningful number of observations.
A held-out set of forecasts not used to fit the recalibration, so the fix can be tested honestly.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

  1. Score with the Brier score and log loss

    Start with proper scoring rules that reward honest probabilities. The Brier score is the mean squared error between the stated probability and the outcome (coded 0 or 1); log loss is the mean negative log of the probability assigned to what actually happened, so it penalizes confident wrong calls much more heavily. Both are minimized in expectation by reporting your true belief, which is why they are called proper. A single accuracy number cannot tell you whether the probabilities are honest; these scores can, and they give a baseline to improve against.

    Log loss punishes overconfidence severely. If your log loss looks far worse against a baseline than your Brier score does, a few confident wrong calls are probably driving it: you are stating extreme probabilities you have not earned.
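
As a minimal sketch of the two scores described in this step, assuming forecasts arrive as a list of probabilities and a matching list of 0/1 outcomes. The function names, the clipping constant `eps`, and the sample data are illustrative assumptions, not part of any particular library.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between stated probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the forecasts.

    Probabilities are clipped to [eps, 1 - eps] so that a forecast of
    exactly 0 or 1 does not produce log(0). The clip value is an assumption.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Made-up forecasts for illustration only; the last one is a confident miss.
probs = [0.9, 0.8, 0.7, 0.3, 0.99]
outcomes = [1, 1, 0, 0, 0]
print("Brier score:", brier_score(probs, outcomes))
print("Log loss:   ", log_loss(probs, outcomes))
```

To see the asymmetry in the tip: a wrong call at 0.7 costs about 0.49 in Brier terms and about 1.20 in log loss, while a wrong call at 0.99 costs about 0.98 and about 4.61. The Brier penalty roughly doubles; the log loss penalty nearly quadruples.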

  2. Decompose the Brier score

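    The Brier score splits into three terms: reliability (the calibration term), resolution, and uncertainty, with Brier = reliability - resolution + uncertainty. Resolution and uncertainty are sometimes combined into a single refinement term. The calibration term isolates how far your stated probabilities sit from the realized frequencies, separate from how well you discriminate between cases, which resolution measures. Uncertainty depends only on the base rate of the outcome, so no model can change it. This decomposition tells you whether a poor score comes from miscalibration you can fix with recalibration, or from weak discrimination that needs a better model. Treating the two problems the same wastes effort.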

    A model can have excellent discrimination and terrible calibration. The decomposition tells you which one to fix, which is not obvious from the headline score.
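
The decomposition in this step can be sketched as follows, using the common three-term form (Brier = reliability - resolution + uncertainty) computed over equal-width probability bins. The function name, the choice of ten bins, and the sample data are assumptions for illustration. The identity holds exactly only when all forecasts inside a bin share the same value; with wider bins it is an approximation.

```python
def brier_decomposition(probs, outcomes, n_bins=10):
    """Binned reliability, resolution, and uncertainty terms.

    reliability: weighted squared gap between mean forecast and realized
                 frequency per bin (lower is better; this is calibration).
    resolution:  weighted squared gap between per-bin frequency and the
                 overall base rate (higher is better; this is discrimination).
    uncertainty: base_rate * (1 - base_rate), fixed by the data.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n

    bins = {}
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins.setdefault(k, []).append((p, y))

    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k
        freq = sum(y for _, y in members) / n_k
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2

    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1.0 - base_rate),
    }


# Made-up data: only two distinct forecast values, so the identity is exact.
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 1, 1] + [1, 1, 1, 1, 0]
parts = brier_decomposition(probs, outcomes)
print(parts)
print("reconstructed Brier:",
      parts["reliability"] - parts["resolution"] + parts["uncertainty"])
```

A large reliability term with healthy resolution points to recalibration; a small resolution term points to the model itself.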

  3. Plot a reliability diagram

    Bin the forecasts by stated probability and plot the realized frequency in each bin (vertical axis) against the stated probability (horizontal axis). A perfectly calibrated model sits on the diagonal. Points below the diagonal mean the stated probabilities are too high in that range; points above mean they are too low. Overconfidence shows up as a curve flatter than the diagonal: below it at high stated probabilities and above it at low ones. Underconfidence is the reverse. The reliability diagram turns an abstract score into a picture of where the miscalibration lives, which often reveals that a model is well calibrated in the middle but overconfident at the extremes.

    Show the number of observations per bin. A bin with three forecasts that looks badly off may just be noise, not real miscalibration.
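
The numbers behind a reliability diagram can be produced as a small table before any plotting. The sketch below assumes equal-width bins and uses a hypothetical `min_count` threshold to flag sparse bins; both the function name and the threshold value are illustrative choices, and the data is made up.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """One row per non-empty bin:
    (bin_low, bin_high, count, mean_forecast, realized_frequency)."""
    # Per bin: [count, sum of forecasts, sum of outcomes]
    sums = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        sums[k][0] += 1
        sums[k][1] += p
        sums[k][2] += y

    rows = []
    for k, (count, p_sum, y_sum) in enumerate(sums):
        if count == 0:
            continue
        rows.append((k / n_bins, (k + 1) / n_bins, count,
                     p_sum / count, y_sum / count))
    return rows


# Made-up forecasts for illustration only.
probs = [0.05, 0.12, 0.15, 0.18, 0.55, 0.62, 0.91, 0.95, 0.97]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1, 0]

min_count = 3  # hypothetical threshold below which a bin is treated as noisy
for low, high, count, mean_p, freq in reliability_table(probs, outcomes):
    flag = "  (sparse)" if count < min_count else ""
    print(f"[{low:.1f}, {high:.1f})  n={count:<3d} "
          f"stated={mean_p:.2f}  realized={freq:.2f}{flag}")
```

Plotting `mean_forecast` against `realized_frequency` for each row, with the count annotated, gives the diagram itself.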

  4. Apply Platt scaling or isotonic regression

    Correct the miscalibration with a recalibration map fit on a calibration set. Platt scaling fits a logistic function and assumes the miscalibration has a simple sigmoid shape, which works well with limited data. Isotonic regression fits a flexible monotonic step function that can correct arbitrary shapes but needs more data and can overfit. Choose Platt when data is scarce and the distortion is smooth; choose isotonic when you have ample data and a complex distortion.

    Isotonic regression is more flexible but more prone to overfitting on small calibration sets. With limited data, Platt scaling is the safer default.
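
Both recalibration maps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: the Platt-style map here is a two-parameter logistic fit on the logit of the stated probability, solved with a few damped Newton steps, and it omits the target smoothing used in Platt's original formulation. The isotonic map uses the pool-adjacent-violators algorithm and returns a step function. All function names, the `damping` value, and the sample data are hypothetical.

```python
import bisect
import math


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # keep the logit finite at 0 and 1
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(probs, outcomes, n_iter=50, damping=1e-6):
    """Fit q = sigmoid(a * logit(p) + b) by Newton's method on log loss."""
    xs = [_logit(p) for p in probs]
    a, b = 1.0, 0.0  # start from the identity map
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, outcomes):
            q = _sigmoid(a * x + b)
            w = q * (1.0 - q)
            g_a += (q - y) * x
            g_b += q - y
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        h_aa += damping
        h_bb += damping
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0.0:
            break
        # Newton step: solve the 2x2 system H * step = gradient
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return lambda p: _sigmoid(a * _logit(p) + b)


def fit_isotonic(probs, outcomes):
    """Fit a non-decreasing step function by pooling adjacent violators."""
    agg = {}  # pool forecasts that share the same stated probability
    for x, y in zip(probs, outcomes):
        s = agg.setdefault(x, [0.0, 0])
        s[0] += y
        s[1] += 1

    blocks = []  # each block: [sum of outcomes, count, largest x covered]
    for x in sorted(agg):
        blocks.append([agg[x][0], agg[x][1], x])
        # Merge backwards while the previous block's mean is not lower
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            s, c, upper = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = upper

    uppers = [blk[2] for blk in blocks]
    values = [blk[0] / blk[1] for blk in blocks]

    def predict(p):
        i = min(bisect.bisect_left(uppers, p), len(values) - 1)
        return values[i]

    return predict


# Made-up overconfident forecasts: 0.9 comes true 70% of the time, 0.1 30%.
cal_probs = [0.9] * 10 + [0.1] * 10
cal_outcomes = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
platt = fit_platt(cal_probs, cal_outcomes)
iso = fit_isotonic(cal_probs, cal_outcomes)
print("Platt:    0.9 ->", round(platt(0.9), 3), " 0.1 ->", round(platt(0.1), 3))
print("Isotonic: 0.9 ->", round(iso(0.9), 3), " 0.1 ->", round(iso(0.1), 3))
```

The Platt-style map has two parameters regardless of sample size, which is why it tolerates small calibration sets. The isotonic map can create as many steps as the data supports, which is where the overfitting risk comes from.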

  5. Re-score on held-out data

    Recalibration is itself a fit, so it must be validated on data it did not see. Apply the recalibration map to a held-out set, re-score the Brier and log loss, and re-plot the reliability diagram. If the held-out scores improve and the diagram tightens to the diagonal, the fix is real. If they do not, the calibration map has overfit the calibration set, the same trap that affects any step that is fit and evaluated on the same data.

    If recalibration improves the calibration set but not the held-out set, the map is overfit. This is most common with isotonic regression on too little data.
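
One way to judge whether a held-out improvement is more than noise is a paired bootstrap on the per-forecast difference in squared error. The sketch below assumes you already have the raw and the recalibrated probabilities for the same held-out forecasts. The function name, the 2,000 resamples, the 95% percentile interval, and the sample data are illustrative assumptions.

```python
import random


def holdout_brier_gain(raw_probs, recal_probs, outcomes, n_boot=2000, seed=0):
    """Return (gain, (low, high)) on held-out data.

    gain is raw Brier minus recalibrated Brier, so positive means the
    recalibration helped. (low, high) is a 95% percentile interval from
    resampling forecasts with replacement, keeping each raw/recalibrated
    pair together.
    """
    n = len(outcomes)
    # Per-forecast difference in squared error (raw minus recalibrated)
    diffs = [(r - y) ** 2 - (c - y) ** 2
             for r, c, y in zip(raw_probs, recal_probs, outcomes)]
    gain = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot.append(sum(sample) / n)
    boot.sort()
    low = boot[int(0.025 * n_boot)]
    high = boot[min(int(0.975 * n_boot), n_boot - 1)]
    return gain, (low, high)


# Made-up held-out set: raw forecasts of 0.9, recalibrated to 0.7.
raw = [0.9] * 10
recal = [0.7] * 10
held_out_outcomes = [1] * 7 + [0] * 3
gain, (low, high) = holdout_brier_gain(raw, recal, held_out_outcomes)
print(f"Brier gain: {gain:.3f}  95% interval: [{low:.3f}, {high:.3f}]")
```

A positive gain with an interval that stays above zero is reasonable evidence the fix generalizes. An interval that straddles zero, as tends to happen with only a handful of held-out forecasts, means the data cannot yet distinguish a real correction from noise.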

Common Mistakes

The misses that undo good inputs

1. Confusing calibration with accuracy

A model can rank cases perfectly while stating wrong probabilities, or state honest probabilities while ranking poorly. Reporting only accuracy hides whether the numbers you size decisions on are trustworthy.

2. Reading a reliability diagram without bin counts

Sparse bins look badly miscalibrated from noise alone. Without the count per bin you cannot tell real distortion from sampling variation, and you may recalibrate away signal.

3. Fitting recalibration on the same data you score it on

Recalibration is a fit, so scoring it on its own training data overstates the improvement. The gain has to be confirmed on held-out forecasts or it may be overfitting rather than a real correction.

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

What does it mean for a forecast to be calibrated?

A forecast is calibrated when its stated probabilities match realized frequencies: across all the times you said 70 percent, the event happens about 70 percent of the time. Calibration is a property of the probabilities themselves, separate from how well the forecast discriminates between likely and unlikely cases. A calibrated forecast can be used directly to size decisions; a miscalibrated one mis-sizes everything built on it.

Planning estimates only — not financial, tax, or investment advice.