When should I use Platt scaling versus isotonic regression?

Use Platt scaling when the calibration data is limited and the miscalibration is a smooth, roughly sigmoid distortion, since the logistic fit has only two parameters and resists overfitting. Use isotonic regression when you have ample data and the distortion is complex or not sigmoid in shape, since its flexible step function can fit any monotonic mapping. Neither method can correct a non-monotonic distortion. The tradeoff is flexibility against overfitting risk, and that risk grows as the calibration set shrinks.

Why use the Brier score instead of just accuracy?

Accuracy only cares whether the most likely class was right and ignores the probability you assigned. The Brier score is a proper scoring rule that rewards stating your true probability and penalizes both overconfidence and underconfidence. Two forecasters can have identical accuracy while one is well calibrated and the other wildly overconfident; the Brier score distinguishes them and accuracy does not.
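As a rough illustration of that last point, the sketch below scores two hypothetical forecasters on the same ten made-up outcomes. The data, the function names, and the 0.5 decision threshold are assumptions chosen for the example, not part of any particular tool.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of forecasts whose most likely class matched the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error between the stated probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Hypothetical data: ten events, 7 of which occurred.
outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
measured = [0.7] * 10        # says 70 percent every time
overconfident = [0.99] * 10  # says 99 percent every time

for name, probs in [("measured", measured), ("overconfident", overconfident)]:
    print(name, accuracy(probs, outcomes), round(brier_score(probs, outcomes), 4))
```

Both forecasters predict the same class every time, so their accuracy is identical, but the overconfident one pays a larger squared-error penalty on the three events that did not occur.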

Do large language models need calibration in finance?

Often yes. When an LLM emits a confidence or probability, that number is frequently miscalibrated, commonly overconfident, and a finance decision sized on it will be mis-sized. Score the model's stated probabilities against outcomes with a reliability diagram, and if they diverge from the diagonal, apply a recalibration map before using the confidences to drive any sizing or gating.

Risk & Portfolio Construction Guide

How to Calibrate Probability Forecasts

A forecast that says 70 percent should be right about 70 percent of the time. When it is not, the probabilities are miscalibrated, and any decision sized on them, from a bet to a risk limit, is mis-sized. Calibration is separate from accuracy: a model can rank cases well yet state wildly wrong probabilities. Measuring calibration, diagnosing it visually, correcting it, and verifying the correction without overfitting the fix are all covered below.

8 min read · Published May 26, 2026

By AI Fin Hub Research · AI Fin Hub Team

Before You Start

Set up the inputs that make the next steps easier

A set of probabilistic forecasts paired with the binary outcomes that eventually occurred.

Enough forecasts that each probability range has a meaningful number of observations.

A held-out set of forecasts not used to fit the recalibration, so the fix can be tested honestly.

Guide Steps

Move through it in order

Each step focuses on one decision so you can keep momentum without losing the thread.

1

Score with the Brier score and log loss

Start with proper scoring rules that reward honest probabilities. The Brier score is the mean squared error between the stated probability and the outcome; log loss is the mean negative log of the probability assigned to what actually happened, so it penalizes confident wrong calls much more heavily. Both are minimized in expectation by reporting your true belief, which is why they are called proper. A single accuracy number cannot tell you whether the probabilities are honest; these scores can, and they give a baseline to improve against.

Log loss punishes overconfidence severely. If your log loss is far worse than your Brier score suggests, you are stating extreme probabilities you have not earned.
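A minimal sketch of both scores for binary outcomes follows. The function names and the `eps` clipping value are assumptions for illustration; the loop at the end shows how each score treats a single wrong call as the stated confidence rises.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between the stated probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; eps clips 0 and 1 so the log stays finite."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(outcomes)


# Penalty for one wrong call (the event did not happen) at rising confidence.
for p in (0.6, 0.9, 0.99):
    print(p, round(brier_score([p], [0]), 4), round(log_loss([p], [0]), 4))
```

The Brier penalty for a single forecast can never exceed 1, while the log loss penalty grows without bound as the stated probability approaches certainty on the wrong side.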
2

Decompose the Brier score

The Brier score splits into three terms: reliability (the calibration term), resolution, and uncertainty. The calibration term isolates exactly how far your stated probabilities sit from the realized frequencies, separate from how well you discriminate between cases, which resolution measures. Uncertainty depends only on the base rate of the outcome, so no forecast can change it. This decomposition tells you whether a poor score comes from miscalibration you can fix with recalibration, or from weak discrimination that needs a better model. Treating the two problems the same wastes effort.

A model can have excellent discrimination and terrible calibration. The decomposition tells you which one to fix, which is not obvious from the headline score.
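One common three-term form is the Murphy decomposition, in which the calibration term is usually called reliability and the discrimination term resolution, with Brier = reliability - resolution + uncertainty. The sketch below assumes forecasts are grouped by their exact stated value (with wider bins the identity is only approximate); the function name and data are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value.

    Brier = reliability - resolution + uncertainty holds exactly when every
    forecast in a group shares the same stated probability.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    reliability = 0.0  # calibration term: lower is better
    resolution = 0.0   # discrimination term: higher is better
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecasts with two stated probability levels.
probs = [0.9] * 5 + [0.1] * 5
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rel, res, unc = brier_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)
print(rel, res, unc, brier)
```

In this made-up data the events stated at 90 percent occur 60 percent of the time, so the reliability term is larger than it needs to be; a recalibration map targets that term and leaves resolution to the underlying model.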
3

Plot a reliability diagram

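Bin the forecasts by stated probability and plot the realized frequency in each bin (vertical axis) against the stated probability (horizontal axis). A perfectly calibrated model sits on the diagonal. Points below the diagonal mean the event happened less often than stated, so the forecasts in that range are too high; points above mean it happened more often than stated, so they are too low. Overconfidence shows up as points below the diagonal at high probabilities and above it at low probabilities. The reliability diagram turns an abstract score into a picture of where the miscalibration lives, which often reveals that a model is well calibrated in the middle but overconfident at the extremes.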

Show the number of observations per bin. A bin with three forecasts that looks badly off may just be noise, not real miscalibration.
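The binning step can be sketched as a small table builder that keeps the count next to each point. The function name, the equal-width bins, and the sample data are assumptions for illustration; the rows could be handed to any plotting library.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Return one row per non-empty bin: (mean stated prob, realized freq, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        count = len(members)
        mean_p = sum(p for p, _ in members) / count
        freq = sum(y for _, y in members) / count
        rows.append((mean_p, freq, count))
    return rows


# Hypothetical forecast stream.
probs = [0.05, 0.15, 0.15, 0.55, 0.55, 0.55, 0.55, 0.95, 0.95, 0.95]
outcomes = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
for mean_p, freq, count in reliability_table(probs, outcomes, n_bins=5):
    side = "below" if freq < mean_p else "on or above"
    print(f"stated {mean_p:.2f}  realized {freq:.2f}  n={count}  ({side} diagonal)")
```

With only three or four forecasts per bin, as in this toy data, the gaps from the diagonal are well within what sampling noise alone would produce, which is why the count belongs on the chart.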
4

Apply Platt scaling or isotonic regression

Correct the miscalibration with a recalibration map fit on a calibration set. Platt scaling fits a logistic function and assumes the miscalibration has a simple sigmoid shape, which works well with limited data. Isotonic regression fits a flexible monotonic step function that can correct arbitrary monotonic distortions but needs more data and can overfit. Choose Platt when data is scarce and the distortion is smooth; choose isotonic when you have ample data and a complex distortion.

Isotonic regression is more flexible but more prone to overfitting on small calibration sets. With limited data, Platt scaling is the safer default.
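Both maps can be sketched in plain Python. This is a simplified illustration under stated assumptions: the Platt fit here is a two-parameter logistic regression on the log-odds of the stated probability, trained by plain gradient descent and without the target smoothing used in Platt's original method; the isotonic fit is the pool-adjacent-violators algorithm. The names `fit_platt` and `fit_isotonic` and the sample data are hypothetical.

```python
import bisect
import math


def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def _sigmoid(z):
    return 1 / (1 + math.exp(-z))


def fit_platt(probs, outcomes, iters=2000):
    """Fit sigmoid(a * logit(p) + b) by gradient descent on the mean log loss."""
    xs = [_logit(p) for p in probs]
    n = len(xs)
    # Step size 1/L, where L bounds the curvature of the mean log loss.
    lr = 1 / (0.25 * (sum(x * x for x in xs) / n + 1))
    a, b = 1.0, 0.0  # start from the identity map
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda p: _sigmoid(a * _logit(p) + b)


def fit_isotonic(probs, outcomes):
    """Pool-adjacent-violators: a non-decreasing step function of stated prob."""
    pooled = {}  # forecasts with the same stated probability share one block
    for p, y in zip(probs, outcomes):
        s, c = pooled.get(p, (0, 0))
        pooled[p] = (s + y, c + 1)
    blocks = []  # each block: [sum of outcomes, count, lowest stated prob]
    for p in sorted(pooled):
        blocks.append([pooled[p][0], pooled[p][1], p])
        # Merge while the previous block's mean exceeds the latest block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c, _ = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    starts = [blk[2] for blk in blocks]
    values = [blk[0] / blk[1] for blk in blocks]

    def predict(p):
        i = max(bisect.bisect_right(starts, p) - 1, 0)
        return values[i]

    return predict


# Hypothetical overconfident forecasts: 5 percent calls occur 20 percent of
# the time, 95 percent calls occur 80 percent of the time.
probs = [0.05] * 10 + [0.95] * 10
outcomes = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
platt, iso = fit_platt(probs, outcomes), fit_isotonic(probs, outcomes)
print(round(platt(0.95), 3), iso(0.95))
```

Both maps pull the stated 95 percent toward the realized 80 percent here. The Platt map is smooth between the two levels, while the isotonic map is a step function that only knows the values it saw, which is part of why it needs more data.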
5

Re-score on held-out data

Recalibration is itself a fit, so it must be validated on data it did not see. Apply the recalibration map to a held-out set, re-score the Brier score and log loss, and re-plot the reliability diagram. If the held-out scores improve and the diagram tightens to the diagonal, the fix is real. If they do not, you have overfit the calibration map to the calibration set, the same trap that affects any fitted step scored on its own training data.

If recalibration improves the calibration set but not the held-out set, the map is overfit. This is most common with isotonic regression on too little data.
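The held-out check can be sketched as a small harness that fits on one set and scores on another. Everything here is an assumption for illustration: the simulated forecaster, the 50/50 split, and `fit_slope`, a one-parameter stand-in (a slope on the log-odds chosen by grid search) used in place of full Platt scaling or isotonic regression to keep the example short. Any function that takes probabilities and outcomes and returns a mapping could be passed to `validate` instead.

```python
import math
import random


def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def fit_slope(probs, outcomes):
    """Stand-in recalibrator: pick the log-odds slope with the best Brier score."""
    def apply(p, slope):
        z = slope * math.log(p / (1 - p))
        return 1 / (1 + math.exp(-z))

    slopes = [k / 20 for k in range(1, 41)]  # 0.05 to 2.0
    best = min(slopes, key=lambda s: brier_score([apply(p, s) for p in probs], outcomes))
    return lambda p: apply(p, best)


def validate(fit, calibration, held_out):
    """Fit on the calibration pairs only, then compare Brier on held-out pairs."""
    cal_p, cal_y = zip(*calibration)
    out_p, out_y = zip(*held_out)
    recalibrate = fit(cal_p, cal_y)
    before = brier_score(out_p, out_y)
    after = brier_score([recalibrate(p) for p in out_p], out_y)
    return before, after


# Simulated overconfident forecaster: states p, but the true frequency is
# pulled toward 50 percent.
rng = random.Random(0)
pairs = []
for _ in range(4000):
    p = rng.uniform(0.02, 0.98)
    true_p = 0.5 + 0.3 * (p - 0.5)
    pairs.append((p, 1 if rng.random() < true_p else 0))

calibration, held_out = pairs[:2000], pairs[2000:]
before, after = validate(fit_slope, calibration, held_out)
print(f"held-out Brier before {before:.4f}, after {after:.4f}")
```

The comparison that matters is the pair of held-out numbers. Scoring the fitted map on the calibration pairs themselves would flatter any recalibrator, and the more flexible the map, the more it would be flattered.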

Common Mistakes

The misses that undo good inputs

Confusing calibration with accuracy

A model can rank cases perfectly while stating wrong probabilities, or state honest probabilities while ranking poorly. Reporting only accuracy hides whether the numbers you size decisions on are trustworthy.

Reading a reliability diagram without bin counts

Sparse bins can look badly miscalibrated from noise alone. Without the count per bin you cannot tell real distortion from sampling variation, and you may end up fitting a correction to noise.

Fitting recalibration on the same data you score it on

Recalibration is a fit, so scoring it on its own training data overstates the improvement. The gain has to be confirmed on held-out forecasts or it may be overfitting rather than a real correction.

FAQ

Questions people ask next

The short answers readers usually want after the first pass.

What does it mean for a forecast to be calibrated?

A forecast is calibrated when its stated probabilities match realized frequencies: across all the times you said 70 percent, the event happens about 70 percent of the time. Calibration is a property of the probabilities themselves, separate from how well the forecast discriminates between likely and unlikely cases. A calibrated forecast can be used directly to size decisions; a miscalibrated one mis-sizes everything built on it.

Sources & References

Verification of Forecasts Expressed in Terms of Probability — Glenn W. Brier, Monthly Weather Review (1950)
Predicting Good Probabilities with Supervised Learning — Niculescu-Mizil and Caruana, ICML (2005)
