The Calibration Dojo returns 24 medium-difficulty general-knowledge calibration questions in a single session even when session_length_questions = 30 is requested: the engine ships the 24 it has on hand, so the question pool, not the request, is the binding constraint. The Forecast Scoring Sandbox scores its 200-forecast sample tape at Brier score 0.202, log-loss 0.588, reliability 0.009, resolution 0.055, uncertainty 0.250 and base rate 0.49. That profile describes a forecaster that is already well calibrated (the reliability term is near zero). Three calibration techniques (the human-in-the-loop Dojo, Platt scaling, isotonic regression) address what to do when reliability is not near zero. The choice among them depends on whether the miscalibration is monotone, how large the dataset is, and whether human intuition is the limiting factor.

TL;DR

Three calibration approaches for LLM probability outputs:

| Approach | What it does | When it wins |
|---|---|---|
| Calibration Dojo (human) | Trains a human to issue calibrated probabilities | When the miscalibration is in the human-in-the-loop layer |
| Platt scaling (parametric) | Fits a sigmoid f(p) = σ(a·p + b) to the (predicted, observed) pairs | When miscalibration is monotone and the dataset is small (<300 samples) |
| Isotonic regression (non-parametric) | PAV algorithm fits a monotone step function to the same pairs | When miscalibration is monotone but not sigmoid-shaped and the dataset is large (>500 samples) |

The Forecast Scoring Sandbox on its 200-forecast sample tape returns Brier 0.202, log-loss 0.588, reliability 0.009 (near zero, so this forecaster is well calibrated), resolution 0.055, uncertainty 0.250. When reliability is near zero there is nothing to fix; the three methods below are what you reach for when a different tape returns a high reliability term.

The Brier decomposition

Brier = reliability − resolution + uncertainty. On the sample run:

  • Brier = 0.202. The mean squared error of the probabilistic forecast.
  • Reliability = 0.009. Near zero. This term measures how far the forecaster's probability estimates are from the empirical frequencies in each bin; near zero means well calibrated. (This is the term the three methods below reduce when it is large.)
  • Resolution = 0.055. The forecaster does discriminate between bins; higher resolution is better, as it subtracts from the Brier score.
  • Uncertainty = 0.250. The baseline binary-outcome variance at the empirical base rate of 0.49 (close to a coin flip, so uncertainty is near its 0.25 maximum).

The decomposition checks out: 0.009 − 0.055 + 0.250 = 0.204, which matches the directly computed 0.202 up to rounding and binning error (the identity is exact only when every forecast in a bin carries the same probability). The log-loss of 0.588 is below the random baseline (−log 0.5 ≈ 0.693), consistent with a forecaster that beats a coin flip. The sample tape is well calibrated; the rest of this article is about what to do when it is not, that is, when a tape comes back with a reliability term an order of magnitude larger.
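As a rough illustration of how the three terms are computed, here is a minimal sketch that assumes equal-width bins and uses the mean forecast of each bin. The Sandbox's actual binning rule is not described here beyond n_bins = 10, so the function name `brier_decomposition` and the toy tape are illustrative assumptions, not engine code.

```python
from typing import List, Tuple


def brier_decomposition(
    probs: List[float], outcomes: List[int], n_bins: int = 10
) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) using equal-width bins."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    # Group forecast indices into equal-width bins; p == 1.0 goes in the last bin.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        bins[min(int(p * n_bins), n_bins - 1)].append(i)

    reliability = 0.0
    resolution = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(probs[i] for i in members) / len(members)
        freq = sum(outcomes[i] for i in members) / len(members)
        # Reliability: gap between stated probability and observed frequency.
        reliability += len(members) * (mean_p - freq) ** 2
        # Resolution: how far each bin's frequency sits from the base rate.
        resolution += len(members) * (freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Toy tape, made up for illustration (not the Sandbox's sample tape).
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
rel, res, unc = brier_decomposition(probs, outcomes)
brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

On this toy tape every forecast in a bin is identical, so reliability − resolution + uncertainty reproduces the directly computed Brier score; on a real tape with varied forecasts inside each bin, a small gap between the two is expected.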

Platt scaling on this signature

Platt scaling fits a logistic transformation:

calibrated_prob = sigmoid(a · raw_prob + b)

On a miscalibrated tape, Platt would fit (a, b) to minimise log-loss on the (raw, outcome) pairs. The two parameters are estimable from 50-100 observations; with 200 observations (the sample size here) the standard errors are small.

Platt's strengths:

  • Two parameters, fast to fit, low risk of over-fitting on small datasets.
  • Monotone by construction: if the raw probability is monotonically related to the outcome rate, Platt's sigmoid preserves that monotonicity.
  • Closed-form predictions: the calibrated probability is a simple function of the raw probability, so no lookup table is needed.

Platt's weaknesses:

  • Sigmoid shape is fixed. If the miscalibration is U-shaped (forecaster over-confident at both extremes, under-confident in the middle), Platt cannot fit it.
  • Two parameters cannot capture local non-linearity. If the forecaster is well-calibrated at low probabilities but over-confident only at high probabilities, Platt's sigmoid will smooth both regions and worsen the well-calibrated portion.

On a monotone miscalibration (say, a forecaster whose predicted probabilities sit systematically below the observed frequencies), Platt can absorb the distortion by adjusting the slope a and shifting the intercept b, and the post-Platt log-loss would be expected to drop.
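To make the fitting step concrete, here is a minimal sketch that estimates (a, b) in sigmoid(a · raw_prob + b) by Newton's method on the log-loss, using only the standard library. The function names, the iteration count, the tiny ridge term and the toy data are all assumptions chosen for illustration; a production fit would normally use an established library and a held-out calibration set.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(raw: List[float], outcomes: List[int], iters: int = 25) -> Tuple[float, float]:
    """Fit (a, b) in sigmoid(a * raw + b) by Newton's method on log-loss."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for p, y in zip(raw, outcomes):
            q = sigmoid(a * p + b)
            w = q * (1.0 - q)      # curvature of the log-loss at this point
            g_a += (q - y) * p     # gradient with respect to a
            g_b += q - y           # gradient with respect to b
            h_aa += w * p * p
            h_ab += w * p
            h_bb += w
        # Tiny ridge (an assumption) keeps the 2x2 solve well-defined.
        h_aa += 1e-12
        h_bb += 1e-12
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: subtract inverse(Hessian) @ gradient.
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def apply_platt(a: float, b: float, raw_prob: float) -> float:
    return sigmoid(a * raw_prob + b)


# Toy under-confident forecaster (made-up data): says 0.4 when the event
# happens 20% of the time, and 0.6 when it happens 80% of the time.
raw = [0.4] * 10 + [0.6] * 10
outcomes = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
a, b = fit_platt(raw, outcomes)
low, high = apply_platt(a, b, 0.4), apply_platt(a, b, 0.6)
```

With only two distinct raw values the two-parameter sigmoid can match both observed frequencies, so `low` and `high` land at roughly 0.2 and 0.8. The sketch needs at least two distinct raw values and non-separable outcomes to converge.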

Isotonic regression on this signature

Isotonic regression fits a monotone step function via the Pool Adjacent Violators (PAV) algorithm:

calibrated_prob = isotonic_step_function(raw_prob)

The step function is non-parametric: it has as many steps as the data supports and is constrained only by monotonicity.
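A compact sketch of the pooling idea follows, assuming right-closed steps and a plain lookup at prediction time. The names `fit_isotonic` and `apply_isotonic` and the six-point toy tape are hypothetical; library implementations typically interpolate between steps rather than returning a flat value, so treat this as an illustration of PAV rather than a reference implementation.

```python
from typing import List, Tuple


def fit_isotonic(raw: List[float], outcomes: List[int]) -> List[Tuple[float, float]]:
    """Pool Adjacent Violators: return (upper raw_prob bound, calibrated value) per step."""
    # First pool identical raw values so each distinct raw_prob gets one value.
    points: List[List[float]] = []   # [sum of outcomes, count, raw_prob]
    for p, y in sorted(zip(raw, outcomes)):
        if points and points[-1][2] == p:
            points[-1][0] += y
            points[-1][1] += 1
        else:
            points.append([float(y), 1.0, p])

    blocks: List[List[float]] = []   # [sum of outcomes, count, largest raw_prob]
    for s, c, hi in points:
        blocks.append([s, c, hi])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s2, c2, hi2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] = hi2
    return [(hi, s / c) for s, c, hi in blocks]


def apply_isotonic(steps: List[Tuple[float, float]], p: float) -> float:
    """Return the value of the first step whose upper bound covers p."""
    for upper, value in steps:
        if p <= upper:
            return value
    return steps[-1][1]   # above every bound: use the last step


# Toy tape, made up for illustration.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
steps = fit_isotonic(raw, outcomes)
values = [v for _, v in steps]
```

Here the raw values 0.2, 0.3 and 0.4 violate monotonicity (outcomes 1, 0, 0) and are pooled into one step with value 1/3, which is the behaviour the PAV name describes.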

Isotonic's strengths:

  • Non-parametric: adapts to whatever monotone shape the data implies.
  • Distribution-free: no Gaussian or sigmoid assumption.
  • Provably consistent: converges to the true calibration function as N grows.

Isotonic's weaknesses:

  • Data-hungry: needs roughly 500 or more observations for stable step boundaries.
  • Output is a lookup table: there is no closed form, and predictions require interpolating on the fitted steps.
  • Monotone only: cannot fit a U-shaped miscalibration, if one exists.

On a 200-observation tape (the sample size here), isotonic would fit, but the step boundaries would be poorly determined (low statistical precision). With 500 observations the step structure would resolve more cleanly. For LLM-probability calibration on production-scale traffic (10,000+ forecasts), isotonic is the safer default.

The Isotonic Calibration of LLM Forecasts article walks through the implementation in 40 lines of Python.

The Calibration Dojo: training the human-in-the-loop layer

The Calibration Dojo addresses a different failure mode. Where Platt and isotonic post-process machine probabilities, the Dojo trains a human to issue calibrated probability estimates.

The engine returns 24 questions per session: each statement plus the empirical truth value (binary), source citation, and rationale. The human predicts a probability for each; the engine scores the responses against the truth values using Brier and log-loss.
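The scoring step can be sketched in a few lines. This is an illustrative version, not the engine's code: the function name `score_session`, the clipping constant `eps` (used so that a confident miss yields a finite log-loss) and the four-question session are all assumptions.

```python
import math
from typing import List, Tuple


def score_session(
    predictions: List[float], truths: List[int], eps: float = 1e-6
) -> Tuple[float, float]:
    """Return (Brier score, log-loss) for one session of binary statements."""
    n = len(predictions)
    brier = sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n
    log_loss = 0.0
    for p, t in zip(predictions, truths):
        p = min(max(p, eps), 1.0 - eps)   # clip so log(0) never occurs
        log_loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return brier, log_loss / n


# Hypothetical four-question session (not engine output).
preds = [0.9, 0.7, 0.5, 0.2]
truths = [1, 1, 0, 0]
brier, ll = score_session(preds, truths)

# Reference point: answering 0.5 to everything.
coin_brier, coin_ll = score_session([0.5] * 4, truths)
```

Comparing a session against the always-0.5 reference (Brier 0.25, log-loss ≈ 0.693) gives the trainee a fixed baseline to beat from one session to the next.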

Over multiple sessions the human's reliability term decreases: they learn to issue 90% probability when 90% of past 90%-confidence statements were true, not when they "feel" 90% confident. Tetlock's superforecaster research and the Good Judgment Project both document that humans can become calibrated through this kind of structured practice [1].

The Dojo is the right tool when:

  • The human-in-the-loop layer (e.g., the analyst reviewing LLM-output trade signals) is the binding constraint on overall calibration.
  • The volume of decisions is too low to fit Platt or isotonic statistically.
  • Domain expertise is the missing input: a calibrated human can incorporate prior knowledge a machine cannot.

The decision tree

For LLM-probability calibration:

  1. Is the miscalibration in the LLM output, or in the human review layer? If LLM: Platt or isotonic. If human: Calibration Dojo.
  2. If LLM, is the dataset large (>500 obs)? Yes: isotonic. No: Platt.
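  3. Is the miscalibration monotone? Yes: proceed with the method chosen in step 2. No: neither; the miscalibration is U-shaped or worse, and a more flexible model is required. Beta calibration (see Kull et al. 2017) is a common next step for non-sigmoidal shapes, though its standard form is also monotone [2].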
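The three questions above can be written down as a small function. This is only a restatement of the rules of thumb in this article; the function name, argument names and return labels are hypothetical, and the 500-observation threshold is the article's heuristic rather than a hard statistical boundary.

```python
def choose_calibrator(miscalibrated_layer: str, n_obs: int, monotone: bool) -> str:
    """Encode the article's decision tree; thresholds are rules of thumb."""
    if miscalibrated_layer == "human":
        return "calibration_dojo"
    if not monotone:
        # Neither Platt nor isotonic fits a non-monotone distortion.
        return "more_flexible_model"
    return "isotonic" if n_obs > 500 else "platt"


# Example calls with made-up inputs.
small_tape = choose_calibrator("llm", 200, monotone=True)
large_tape = choose_calibrator("llm", 5000, monotone=True)
```

The monotonicity check is placed before the size check because a non-monotone distortion rules out both post-processing methods regardless of how much data is available.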

For most LLM-driven retail finance workflows, isotonic regression on 1,000–10,000 historical (predicted, outcome) pairs is the right default. The Calibration Dojo is a separate practice for the human-in-the-loop layer; it does not replace post-processing of LLM probabilities.

Where each method fails

Platt fails when the underlying miscalibration is non-sigmoidal. Common case: an LLM that is well-calibrated at p ∈ [0.4, 0.6] but over-confident at p ∈ [0.8, 1.0]. Platt's two parameters cannot capture a local distortion; the global sigmoid fit smooths both regions and may worsen the calibrated portion.

Isotonic fails on small datasets. Below 200 observations the PAV step boundaries are too noisy to be reliable; below 50 observations isotonic is essentially memorisation. Use Platt or a Bayesian Beta calibration instead.

The Calibration Dojo fails when the human is unwilling or unable to engage. The training requires sustained attention over multiple sessions; for a solo retail operator who treats forecasting as an occasional exercise, the Dojo's benefits decay quickly. The Dojo also does not transfer well to highly domain-specific contexts: practice on general-knowledge questions does not necessarily calibrate the human on, say, options-volatility forecasting.

References

  • Platt, J. C. (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." Advances in Large Margin Classifiers. The Platt-scaling reference.
  • Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities With Supervised Learning." ICML 2005. The isotonic-regression-for-calibration reference. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
  • Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78(1), 1–3. The Brier score's original formulation.

Footnotes

  1. Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishing. Empirical evidence that calibration is trainable through structured practice. https://goodjudgment.com/superforecasting/

  2. Kull, M., Silva Filho, T. M., & Flach, P. (2017). "Beta Calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers." Artificial Intelligence and Statistics 2017. https://proceedings.mlr.press/v54/kull17a.html

Verified engine output

Forecast scoring on the 200-forecast sample tape

Inputs
  n_bins: 10
  forecasts (200 items): [...]

Result
  brier › brier: 0.20235451785000014
  brier › reliability: 0.008955834573804455
  brier › resolution: 0.055053521956970226
  brier › uncertainty: 0.2499
  brier › base rate: 0.49
  brier › count: 200
  brier › bins used: 10
  log loss: 0.5879454705782917
  bins (10 items): [...]

Computed live at build time.

Frequently asked questions

Why is the Brier score 0.202 on the sample run?
The sample tape is a reasonably well-calibrated forecaster on a near-coin-flip base rate (0.49). The uncertainty term (0.250) dominates and the reliability term is near zero (0.009), so almost none of the Brier score is miscalibration.
Should I use Brier or log-loss as my primary calibration metric?
Both. Log-loss penalises confidently-wrong forecasts more aggressively; Brier is bounded and easier to interpret. Report both; log-loss is usually the more sensitive arbiter.
How often should I re-fit Platt or isotonic?
Monthly is typical for LLM workflows; more frequently if the LLM is being updated, less frequently if the pipeline is stable.
Can I combine all three approaches?
Yes. Isotonic regression post-processes the LLM probabilities, and a calibrated human then reviews the post-processed probabilities before they route to execution. That layered arrangement is what an institutional-grade pipeline looks like.
Is calibration the same as accuracy?
No. A 60%-accurate forecaster who always says 60% is calibrated; a 95%-accurate forecaster who always says 99% is over-confident and not calibrated.
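The two forecasters in the last answer can be worked through numerically. The sketch below assumes each forecaster always states a single probability for the outcome they predict, so the reliability term reduces to the squared gap between the stated probability and the hit rate; the function name is illustrative.

```python
from typing import Tuple


def constant_forecaster(stated: float, hit_rate: float) -> Tuple[float, float]:
    """Reliability and Brier score for a forecaster who always states one probability."""
    # Single bin: reliability is the squared gap between claim and frequency.
    reliability = (stated - hit_rate) ** 2
    # Brier: squared error when right (outcome 1) and when wrong (outcome 0).
    brier = hit_rate * (1.0 - stated) ** 2 + (1.0 - hit_rate) * stated ** 2
    return reliability, brier


rel_60, brier_60 = constant_forecaster(stated=0.60, hit_rate=0.60)
rel_99, brier_99 = constant_forecaster(stated=0.99, hit_rate=0.95)
```

The first forecaster has zero reliability error but a Brier score of 0.24; the second has a reliability error of 0.0016 yet a much lower Brier score of about 0.049. Calibration and accuracy are separate properties, and a good score needs both.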