TL;DR

LLMs emit probabilities that are systematically miscalibrated — a "70% confident" claim is right ~55% of the time on most finance-adjacent tasks. The cheapest and most robust fix is isotonic regression (PAV algorithm): a non-parametric monotonic transform from raw model score to calibrated probability. It takes ~40 lines of Python, requires no distributional assumptions, and converges with a few hundred dated (prediction, outcome) pairs. Below: the algorithm, the calibration-vs-accuracy distinction, and the cold-start shrinkage trick that keeps it honest when you have too few data points.

The problem

An LLM says "70% probability of X." You paper-trade the setup for six months. X actually occurs 54% of the time when the model said 70%. That 16-percentage-point gap is not an LLM failure — it's a calibration failure. The LLM's ordinal ranking is fine; its absolute probabilities are not.

Calibration and accuracy are orthogonal. A miscalibrated model can still rank events correctly (its 70% cases really are more likely than its 60% cases), and a well-calibrated model can still carry no edge: predicting 50% on a bag of fair coin flips is perfectly calibrated and perfectly useless.

For trading, calibration matters more than ranking. Position sizing in any Kelly-family rule requires an unbiased probability, not a relative score.
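The arithmetic makes the point concrete. For an even-money bet the Kelly fraction is f* = 2p - 1; a toy sketch (the helper name is mine; the 70%/54% numbers come from the example above):

```python
def kelly_fraction_even_odds(p):
    """Kelly fraction for an even-money bet: f* = 2p - 1 (floored at 0)."""
    return max(0.0, 2 * p - 1)

# Sizing off the model's stated 70% vs the observed 54% hit rate:
stated = kelly_fraction_even_odds(0.70)  # ~0.40: bet 40% of bankroll
actual = kelly_fraction_even_odds(0.54)  # ~0.08: bet 8% of bankroll
# The uncalibrated probability overbets by roughly a factor of five.
```

At realistic odds the formula changes, but the sensitivity to p does not.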

Isotonic regression (PAV algorithm)

Given a dataset of (raw_score, actual_outcome) pairs where outcome ∈ {0, 1}, isotonic regression finds the monotonic non-decreasing function that minimizes squared error from the raw scores to the observed frequencies.

The Pool Adjacent Violators (PAV) algorithm is the standard implementation:

def pool_adjacent_violators(pairs):
    """
    pairs: list of (score, outcome) tuples, outcome in {0, 1}.
    Returns a sorted list of (threshold, calibrated_probability).
    """
    # Sort by raw score ascending.
    pairs = sorted(pairs, key=lambda p: p[0])

    # Start: each point is its own bin.
    bins = [[x, y, 1] for (x, y) in pairs]  # [score, sum_outcomes, count]

    # Pool until monotonic non-decreasing. A merge can expose a new
    # violation to its left, so rescan until a full pass makes no change.
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(bins) - 1:
            if bins[i][1] / bins[i][2] > bins[i + 1][1] / bins[i + 1][2]:
                # Violation: merge.
                bins[i][1] += bins[i + 1][1]
                bins[i][2] += bins[i + 1][2]
                del bins[i + 1]
                changed = True
            else:
                i += 1

    # Emit (threshold, calibrated_prob) pairs.
    return [(b[0], b[1] / b[2]) for b in bins]

def calibrate(calibration_map, raw_score):
    """Look up the calibrated probability for a new raw_score."""
    # The map is sorted by threshold, so the last bin whose threshold
    # is <= raw_score wins. Scores below every threshold fall back to
    # a neutral 0.5.
    prob = 0.5
    for (threshold, p) in calibration_map:
        if raw_score >= threshold:
            prob = p
        else:
            break
    return prob

That's the algorithm in full. No parameters, no distributional assumption, no training loop.
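As the calibration map grows, the linear scan in the lookup can be swapped for a binary search; a sketch using the standard-library bisect module (calibrate_fast and the example map are illustrative, not part of the algorithm):

```python
import bisect

def calibrate_fast(calibration_map, raw_score):
    """Binary-search lookup; calibration_map is sorted by threshold."""
    thresholds = [t for (t, _) in calibration_map]
    # Index of the last threshold <= raw_score (or -1 if below all).
    i = bisect.bisect_right(thresholds, raw_score) - 1
    return calibration_map[i][1] if i >= 0 else 0.5  # neutral fallback

# Illustrative map, shaped like a pool_adjacent_violators result:
cal_map = [(0.2, 0.0), (0.4, 0.5), (0.8, 1.0)]
print(calibrate_fast(cal_map, 0.7))   # 0.5: falls in the middle bin
print(calibrate_fast(cal_map, 0.05))  # 0.5: below all thresholds
```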

The cold-start problem

With only 30 (prediction, outcome) pairs, isotonic regression produces bins of 1–5 elements each. Those bin-level frequencies are noisy. Raw PAV on sparse data will emit p = 1.0 or p = 0.0 for scores where you have one sample.

The fix is cold-start shrinkage — blend the raw PAV output toward a neutral prior as a function of sample count:

def calibrate_with_shrinkage(calibration_map, raw_score, total_n):
    raw = calibrate(calibration_map, raw_score)
    # Linear ramp: weight is near 0 for small total_n (output stays
    # close to the 0.5 prior) and reaches 1 at total_n >= 500, after
    # which the raw calibrated value passes through unchanged.
    weight = min(1.0, total_n / 500)
    return 0.5 + (raw - 0.5) * weight

This is the usual cold-start shrinkage pattern: the blend weight ramps linearly from 0 toward 1 as the calibration sample grows, so early output stays near the neutral prior instead of over-committing to whatever the first dozen runs looked like.
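To get a feel for the ramp, here is the same weighting isolated into a standalone helper (shrink is a hypothetical name; 500 is the cutoff used above):

```python
def shrink(raw, total_n, full_n=500):
    """Blend a raw calibrated probability toward the 0.5 prior."""
    weight = min(1.0, total_n / full_n)
    return 0.5 + (raw - 0.5) * weight

# A raw PAV output of 1.0 (say, one lucky sample in a bin) gets pulled in:
print(shrink(1.0, 10))   # 0.51: barely trusts the data
print(shrink(1.0, 250))  # 0.75: halfway
print(shrink(1.0, 500))  # 1.0: full trust
```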

Calibration vs accuracy — the practical test

Two separate questions, two separate checks:

Accuracy — does the model rank events correctly? Use ROC-AUC, which measures ranking alone. (The Brier score is less diagnostic here: it bundles ranking quality and calibration error into a single number.)

Calibration — do 70%-probability claims happen 70% of the time? Plot a reliability curve: for each decile of predicted probability, compute the actual frequency. A perfectly calibrated model plots on the diagonal.
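Computing the reliability curve takes a few lines; a sketch assuming you have a log of (predicted_probability, outcome) pairs (reliability_curve is a name I've made up):

```python
from collections import defaultdict

def reliability_curve(pairs, n_bins=10):
    """pairs: list of (predicted_prob, outcome) with outcome in {0, 1}.
    Returns (bin_midpoint, observed_frequency, count) per non-empty bin."""
    sums = defaultdict(lambda: [0, 0])  # bin index -> [sum_outcomes, count]
    for p, y in pairs:
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        sums[b][0] += y
        sums[b][1] += 1
    return [((b + 0.5) / n_bins, s / n, n) for b, (s, n) in sorted(sums.items())]

# Perfect calibration plots on the diagonal: midpoint ~= observed frequency.
pairs = [(0.75, 1)] * 3 + [(0.75, 0)]  # four ~75% claims, three came true
print(reliability_curve(pairs))  # [(0.75, 0.75, 4)]
```

Plot midpoint against observed frequency; systematic deviation from the diagonal is exactly what the PAV map corrects.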

The Calibration Dojo uses exactly this mechanic for user predictions. The pattern transfers directly to your LLM's outputs once you have a dated log of (probability, outcome).

When isotonic beats Platt + why

Platt scaling (fitting a two-parameter sigmoid to the raw scores) is the parametric alternative. It's smoother and more data-efficient, but it assumes the miscalibration has a shifted/scaled sigmoid shape. Isotonic makes no shape assumption; it just enforces monotonicity.

LLM miscalibration empirically isn't sigmoid-shaped. A common pattern: the model is roughly calibrated in the middle of the probability range and wildly overconfident at the extremes. Platt can't fit that shape without material error; isotonic just pools each extreme into its own flat segment.

Recommended default: use isotonic. Fall back to Platt only if sample size is genuinely tiny (n < 50) and you need a smoother output.
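For reference, Platt's fit is small enough to sketch inline: a two-parameter logistic regression on the raw score, trained by plain gradient descent on log loss (fit_platt, the learning rate, and the iteration count are all illustrative choices, not tuned values):

```python
import math

def fit_platt(pairs, lr=0.1, iters=2000):
    """Fit sigmoid(a * score + b) to (score, outcome) pairs by
    gradient descent on log loss. Returns the calibration function."""
    a, b = 1.0, 0.0
    n = len(pairs)
    for _ in range(iters):
        ga = gb = 0.0
        for s, y in pairs:
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

platt = fit_platt([(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)])
# The fitted sigmoid is strictly monotone, so ranking is preserved:
print(platt(0.1) < platt(0.5) < platt(0.9))  # True
```

The output is smooth and sigmoid-shaped by construction, which is exactly the constraint that fails when the true miscalibration isn't sigmoid-shaped.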

References

  • Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." (Corrects Sharpe ratios for selection bias; background on honest performance statistics.)
  • Platt, J. (1999). "Probabilistic Outputs for Support Vector Machines."
  • Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities With Supervised Learning."
  • Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). "Probabilistic forecasts, calibration and sharpness."