How to Calibrate Probability Forecasts
A forecast that says 70 percent should be right about 70 percent of the time. When it is not, the probabilities are miscalibrated, and any decision sized on them, from a bet to a risk limit, is mis-sized. Calibration is separate from accuracy: a model can rank cases well yet state wildly wrong probabilities. Measuring calibration, diagnosing it visually, correcting it, and verifying the correction without overfitting the fix are all covered below.
Guide Steps
Move through it in order
Each step focuses on one decision so you can keep momentum without losing the thread.
- 1
Score with the Brier score and log loss
Start with proper scoring rules, which reward honest probabilities. The Brier score is the mean squared error between the stated probability and the 0/1 outcome; log loss is the mean negative log of the probability assigned to what actually happened, so it penalizes confident wrong calls much more heavily. Both are minimized in expectation by reporting your true belief, which is why they are called proper. A single accuracy number cannot tell you whether the probabilities are honest; these scores can, and they give a baseline to improve against.
Log loss punishes overconfidence severely. The Brier score caps the penalty for any single forecast at 1, while log loss grows without bound as a wrong call approaches certainty. If your log loss looks far worse than your Brier score would suggest, you are stating extreme probabilities you have not earned.
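As a minimal sketch of the two scores described above, using only the Python standard library: the function names, the `eps` clipping constant, and the four-forecast data set are illustrative assumptions, not part of any particular library.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; eps clipping (an assumption) avoids log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

# Illustrative data: the last forecast is a confident miss (0.99 stated, event did not happen).
probs = [0.9, 0.8, 0.3, 0.99]
outcomes = [1, 1, 0, 0]

print(brier_score(probs, outcomes))  # about 0.28
print(log_loss(probs, outcomes))     # about 1.32
```

In this toy data the single confident miss contributes almost all of the log loss, while its Brier penalty stays below 1. That gap is the overconfidence signal the tip above describes.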
- 2
Decompose the Brier score
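The Brier score splits into three terms: reliability (the calibration term), resolution, and uncertainty, with Brier = reliability - resolution + uncertainty. An older two-term form folds the last two into a single refinement term. The calibration component isolates exactly how far your stated probabilities sit from the realized frequencies, separate from how well you discriminate between cases, which is what resolution measures. Uncertainty depends only on the base rate, so no forecast can change it. This decomposition tells you whether a poor score comes from miscalibration you can fix with recalibration, or from weak discrimination that needs a better model. Treating the two problems the same wastes effort.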
A model can have excellent discrimination and terrible calibration. The decomposition tells you which one to fix, which is not obvious from the headline score.
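The decomposition can be sketched as follows. This assumes the standard three-term (Murphy) form, Brier = reliability - resolution + uncertainty, where reliability is the calibration component. Grouping by exact forecast value is a simplification that makes the identity hold exactly; continuous forecasts would need to be binned first, which makes it approximate. The function name and data are illustrative.

```python
from collections import defaultdict

def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    reliability = 0.0  # calibration term: stated probability vs realized frequency
    resolution = 0.0   # discrimination term: realized frequency vs base rate
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty

# Illustrative data: the 0.2 forecasts come true 40% of the time, the 0.8 forecasts 80%.
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

rel, res, unc = murphy_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
print(rel, res, unc)            # about 0.02, 0.04, 0.24
print(brier, rel - res + unc)   # both about 0.22
```

A large reliability term points to recalibration; a small resolution term points to the model itself.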
- 3
Plot a reliability diagram
Bin the forecasts by stated probability and plot the realized frequency in each bin against the stated probability. A perfectly calibrated model sits on the diagonal. Points below the diagonal mean the stated probabilities in that range are too high, because the event happens less often than forecast; points above mean they are too low. Overconfidence, meaning probabilities pushed too far toward 0 or 1, shows up as points below the diagonal at the high end and above it at the low end. The reliability diagram turns an abstract score into a picture of where the miscalibration lives, which often reveals that a model is well calibrated in the middle but overconfident at the extremes.
Show the number of observations per bin. A bin with three forecasts that looks badly off may just be noise, not real miscalibration.
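A minimal sketch of the binning behind a reliability diagram, with the per-bin count carried alongside each point. The function name, equal-width bins, and sample data are assumptions for illustration; plotting is left out so the example runs without a graphics library.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Return one (mean stated prob, realized frequency, count) row per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        count = len(members)
        mean_p = sum(p for p, _ in members) / count
        freq = sum(y for _, y in members) / count
        rows.append((mean_p, freq, count))
    return rows

# Illustrative data with two equal-width bins.
probs = [0.1, 0.3, 0.7, 0.9, 1.0]
outcomes = [0, 1, 1, 1, 0]

for mean_p, freq, count in reliability_table(probs, outcomes, n_bins=2):
    print(f"stated {mean_p:.2f}  realized {freq:.2f}  n={count}")
```

Each row is one point on the diagram: the first value is the x coordinate, the second the y coordinate, and the count is what you would annotate next to the point.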
- 4
Apply Platt scaling or isotonic regression
Correct the miscalibration with a recalibration map fit on a calibration set. Platt scaling fits a logistic function and assumes the miscalibration has a simple sigmoid shape, which works well with limited data. Isotonic regression fits a flexible monotonic step function that can correct arbitrary shapes but needs more data and can overfit. Choose Platt when data is scarce and the distortion is smooth; choose isotonic when you have ample data and a complex distortion.
Isotonic regression is more flexible but more prone to overfitting on small calibration sets. With limited data, Platt scaling is the safer default.
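Both maps can be sketched in plain Python under some stated assumptions. The Platt-style fit here is an ordinary logistic regression on the logit of the stated probability, solved with Newton's method, and it omits the label smoothing used in Platt's original procedure. The isotonic fit uses pool-adjacent-violators and returns a pure step function with no interpolation between steps. All function names and the data are illustrative; a library implementation would normally be preferred in practice.

```python
import math

def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def _sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_platt(probs, outcomes, iters=50):
    """Fit q = sigmoid(a * logit(p) + b) by Newton's method; returns a callable."""
    xs = [_logit(p) for p in probs]
    a, b = 1.0, 0.0  # start from the identity map
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for x, y in zip(xs, outcomes):
            q = _sigmoid(a * x + b)
            w = q * (1 - q)
            ga += (q - y) * x
            gb += q - y
            haa += w * x * x
            hab += w * x
            hbb += w
        det = haa * hbb - hab * hab
        if abs(det) < 1e-12:
            break  # degenerate data, e.g. a single distinct forecast
        a -= (hbb * ga - hab * gb) / det
        b -= (haa * gb - hab * ga) / det
    return lambda p: _sigmoid(a * _logit(p) + b)

def fit_isotonic(probs, outcomes):
    """Fit a monotone step map with pool-adjacent-violators; returns a callable."""
    totals = {}
    for p, y in zip(probs, outcomes):
        s, c = totals.get(p, (0.0, 0))
        totals[p] = (s + y, c + 1)
    blocks = []  # each block: [sum of outcomes, count, largest forecast in block]
    for p in sorted(totals):
        s, c = totals[p]
        blocks.append([s, c, p])
        # Pool backwards while the previous block's mean is not below this one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   >= blocks[-1][0] / blocks[-1][1]):
            s2, c2, hi = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] = hi

    def recalibrate(p):
        for s, c, hi in blocks:
            if p <= hi:
                return s / c
        return blocks[-1][0] / blocks[-1][1]
    return recalibrate

# Illustrative calibration set (not separable, so the logistic fit is finite).
cal_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_outcomes = [0, 0, 1, 0, 0, 1, 1, 1]

platt = fit_platt(cal_probs, cal_outcomes)
iso = fit_isotonic(cal_probs, cal_outcomes)
print(platt(0.2), platt(0.8))
print(iso(0.15), iso(0.5), iso(0.85))
```

The contrast in flexibility is visible in the parameter counts: the Platt-style map has two numbers to estimate, while the isotonic map can have as many steps as the data supports, which is why it needs more data to be trusted.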
- 5
Re-score on held-out data
Recalibration is itself a fit, so it must be validated on data it did not see. Apply the recalibration map to a held-out set, recompute the Brier score and log loss, and re-plot the reliability diagram. If the held-out scores improve and the diagram tightens to the diagonal, the fix is real. If they do not, you have overfit the calibration map to the calibration set, which is the same selection trap that haunts every other fitted step.
If recalibration improves the calibration set but not the held-out set, the map is overfit. This is most common with isotonic regression on too little data.
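The held-out check can be sketched as a before-and-after comparison on both sets. Everything here is illustrative: `compare_on_split` is a hypothetical helper, and the `memorize` map is a deliberately overfit stand-in (it returns the outcome of the nearest calibration forecast) chosen to show what failure looks like, not a method anyone should use.

```python
def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def compare_on_split(recalibrate, cal, held_out):
    """Return {set name: (Brier before, Brier after)} for both data sets.

    cal and held_out are (probs, outcomes) pairs; recalibrate maps a
    stated probability to a corrected probability.
    """
    report = {}
    for name, (probs, outcomes) in (("calibration", cal), ("held_out", held_out)):
        before = brier_score(probs, outcomes)
        after = brier_score([recalibrate(p) for p in probs], outcomes)
        report[name] = (before, after)
    return report

# Illustrative data.
cal_probs, cal_outcomes = [0.2, 0.4, 0.6, 0.8], [1, 0, 0, 1]
test_probs, test_outcomes = [0.25, 0.45, 0.65, 0.75], [0, 0, 1, 1]

def memorize(p):
    """Deliberately overfit map: copy the outcome of the nearest calibration forecast."""
    nearest = min(range(len(cal_probs)), key=lambda i: abs(cal_probs[i] - p))
    return float(cal_outcomes[nearest])

report = compare_on_split(memorize, (cal_probs, cal_outcomes), (test_probs, test_outcomes))
for name, (before, after) in report.items():
    print(f"{name}: Brier {before:.4f} -> {after:.4f}")
```

On the calibration set the memorizing map drives the Brier score to zero, while on the held-out set the score gets worse than before recalibration. That pattern, improvement on one set and none on the other, is the overfitting signature described above.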
Common Mistakes
The misses that undo good inputs
Confusing calibration with accuracy
A model can rank cases perfectly while stating wrong probabilities, or state honest probabilities while ranking poorly. Reporting only accuracy hides whether the numbers you size decisions on are trustworthy.
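A small numeric illustration of the gap, assuming AUC as the ranking measure and the Brier score as the probability measure. The two forecast sets are invented for the example.

```python
def auc(probs, outcomes):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def brier_score(probs, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [0, 0, 0, 1, 1, 1]
# Perfect ordering, but every probability is far too high for a 50% base rate.
ranker = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95]
# No ranking ability at all, but the stated probability matches the base rate.
base_rate_only = [0.5] * 6

print(auc(ranker, outcomes), brier_score(ranker, outcomes))                  # 1.0, about 0.42
print(auc(base_rate_only, outcomes), brier_score(base_rate_only, outcomes))  # 0.5, 0.25
```

The perfect ranker scores worse on the Brier score than a forecast that carries no ranking information, because its stated probabilities are not usable for sizing.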
Reading a reliability diagram without bin counts
Sparse bins look badly miscalibrated from noise alone. Without the count per bin you cannot tell real distortion from sampling variation, and you may recalibrate away signal.
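One way to put a number on the noise in a sparse bin is an interval around the realized frequency. The sketch below uses the Wilson score interval, which is one common choice rather than something the text above prescribes; the function name and the example counts are assumptions.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial frequency."""
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# A bin stated at 0.70 where 1 of 3 events happened, versus 100 of 300.
sparse = wilson_interval(1, 3)
dense = wilson_interval(100, 300)
print(sparse)  # wide: roughly 0.06 to 0.79, still consistent with 0.70
print(dense)   # narrow: roughly 0.28 to 0.39, clearly below 0.70
```

Both bins show the same realized frequency of one third, but only the dense bin is evidence of real miscalibration against a stated 0.70.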
Fitting recalibration on the same data you score it on
Recalibration is a fit, so scoring it on its own training data overstates the improvement. The gain has to be confirmed on held-out forecasts or it may be overfitting rather than a real correction.
Sources & References
- Verification of Forecasts Expressed in Terms of Probability — Glenn W. Brier, Monthly Weather Review (1950)
- Predicting Good Probabilities with Supervised Learning — Niculescu-Mizil and Caruana, ICML (2005)