TL;DR
LLMs emit probabilities that are systematically miscalibrated — a "70% confident" claim is right ~55% of the time on most finance-adjacent tasks. The cheapest and most robust fix is isotonic regression (PAV algorithm): a non-parametric monotonic transform from raw model score to calibrated probability. It takes ~40 lines of Python, requires no distributional assumptions, and converges with a few hundred dated (prediction, outcome) pairs. Below: the algorithm, the calibration-vs-accuracy distinction, and the cold-start shrinkage trick that keeps it honest when you have too few data points.
The problem
An LLM says "70% probability of X." You paper-trade the setup for six months. X actually occurs 54% of the time when the model said 70%. That 16-percentage-point gap is not an LLM failure — it's a calibration failure. The LLM's ordinal ranking is fine; its absolute probabilities are not.
Calibration and accuracy are orthogonal. A miscalibrated model can still rank events correctly (its 70% cases really are more likely than its 60% cases), and a well-calibrated model can still be uninformative: predicting 50% on every coin flip is perfectly calibrated but has no skill.
For trading, calibration matters more than ranking. Position sizing in any Kelly-family rule requires an unbiased probability, not a relative score.
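To see why, take the simplest Kelly rule for a binary bet with net odds b, f* = p - (1 - p)/b. This is a stylized illustration, not a sizing recommendation, and it assumes an even-money payout (b = 1); the point is only that the fraction is linear in the probability, so a calibration error passes straight through to position size:

def kelly_fraction(p, b=1.0):
    # Kelly fraction for a binary bet with win probability p and net odds b.
    return p - (1.0 - p) / b

# The model says 70%, but its "70%" events actually occur 54% of the time.
naive = kelly_fraction(0.70)   # 0.40 of bankroll
honest = kelly_fraction(0.54)  # 0.08 of bankroll, five times smaller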
Isotonic regression (PAV algorithm)
Given a dataset of (raw_score, actual_outcome) pairs where outcome ∈ {0, 1}, isotonic regression finds the monotonic non-decreasing step function of the raw score that minimizes squared error against the observed outcomes.
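Written out, with the n scores sorted ascending and outcomes y_i in {0, 1}, the fit solves

    min over p_1 <= p_2 <= ... <= p_n of  sum_i (p_i - y_i)^2

where p_i is the calibrated probability assigned to the i-th score; the solution is piecewise constant, and those constant runs are exactly the pooled bins that PAV produces.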
The Pool Adjacent Violators (PAV) algorithm is the standard implementation:
def pool_adjacent_violators(pairs):
    """
    pairs: list of (score, outcome) tuples, outcome in {0, 1}.
    Returns a sorted list of (threshold, calibrated_probability).
    """
    # Sort by raw score ascending.
    pairs = sorted(pairs, key=lambda p: p[0])
    # Start: each point is its own bin.
    bins = [[x, y, 1] for (x, y) in pairs]  # [score, sum_outcomes, count]
    # Pool until monotonic non-decreasing.
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(bins) - 1:
            if bins[i][1] / bins[i][2] > bins[i + 1][1] / bins[i + 1][2]:
                # Violation: merge.
                bins[i][1] += bins[i + 1][1]
                bins[i][2] += bins[i + 1][2]
                del bins[i + 1]
                changed = True
            else:
                i += 1
    # Emit (threshold, calibrated_prob) pairs.
    return [(b[0], b[1] / b[2]) for b in bins]


def calibrate(calibration_map, raw_score):
    """Look up the calibrated probability for a new raw_score."""
    prob = 0.5
    for (threshold, p) in calibration_map:
        if raw_score >= threshold:
            prob = p
    return prob
That's the algorithm in full. No parameters, no distributional assumption, no training loop.
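A minimal end-to-end sketch of how the two functions fit together (the history values below are placeholders; in practice this is your dated log of past predictions and their resolved outcomes):

# Logged (raw_score, outcome) pairs from past, resolved predictions.
history = [(0.62, 1), (0.71, 0), (0.55, 1), (0.80, 1), (0.33, 0)]  # ...a few hundred of these

calibration_map = pool_adjacent_violators(history)

# New prediction from the model: raw 0.70 -> calibrated probability.
p = calibrate(calibration_map, 0.70)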
The cold-start problem
With only 30 (prediction, outcome) pairs, isotonic regression produces bins of 1–5 elements each. Those bin-level frequencies are noisy. Raw PAV on sparse data will emit p = 1.0 or p = 0.0 for scores where you have one sample.
The fix is cold-start shrinkage — blend the raw PAV output toward a neutral prior as a function of sample count:
def calibrate_with_shrinkage(calibration_map, raw_score, total_n):
    raw = calibrate(calibration_map, raw_score)
    # Shrink toward 0.5 when total_n is small: the weight ramps linearly
    # from ~0 at n=1 to 1.0 at n=500, where the raw PAV value passes through unchanged.
    weight = min(1.0, total_n / 500)
    return 0.5 + (raw - 0.5) * weight
This is the standard three-phase shrinkage pattern (cold-start, early, reliable): the weight ramps from near 0 toward 1 as the calibration sample grows, so the output moves from the neutral prior, through a blended regime, to the pure PAV estimate. Without it, early PAV output over-commits to whatever the first dozen runs looked like.
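A quick worked example, using a hypothetical calibration map (the thresholds and probabilities below are made up for illustration):

cal_map = [(0.30, 0.25), (0.60, 0.55), (0.80, 0.90)]

calibrate_with_shrinkage(cal_map, 0.85, total_n=50)   # 0.5 + (0.90 - 0.5) * 0.1 = 0.54
calibrate_with_shrinkage(cal_map, 0.85, total_n=500)  # 0.5 + (0.90 - 0.5) * 1.0 = 0.90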
Calibration vs accuracy — the practical test
Two separate questions, two separate checks:
- Accuracy: does the model rank events correctly? ROC-AUC measures ranking directly; the Brier skill score blends ranking quality with calibration.
- Calibration: do 70%-probability claims happen 70% of the time? Plot a reliability curve (sketch below): for each decile of predicted probability, compute the actual frequency of the outcome. A perfectly calibrated model plots on the diagonal.
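A minimal reliability-curve computation, as a sketch; it assumes a plain list of (predicted_probability, outcome) pairs, so plug in whatever your own log format is:

from collections import defaultdict

def reliability_curve(pairs, n_bins=10):
    """pairs: (predicted_probability, outcome) with outcome in {0, 1}."""
    bins = defaultdict(list)
    for p, y in pairs:
        idx = min(int(p * n_bins), n_bins - 1)  # decile index; clamps p = 1.0 into the top bin
        bins[idx].append((p, y))
    # For each populated bin: mean predicted probability vs. observed frequency.
    curve = []
    for idx in sorted(bins):
        ps, ys = zip(*bins[idx])
        curve.append((sum(ps) / len(ps), sum(ys) / len(ys), len(ys)))
    return curve  # (mean_prediction, observed_frequency, count) per bin

Plotting observed frequency against mean prediction gives the reliability curve: points above the diagonal (observed higher than predicted) are under-confident, points below it are over-confident.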
The Calibration Dojo uses exactly this mechanic for user predictions. The pattern transfers directly to your LLM's outputs once you have a dated log of (probability, outcome).
When isotonic beats Platt, and why
Platt scaling (a fitted sigmoid transform) is the parametric alternative. It is cheaper to fit and has only two parameters, but it assumes the miscalibration is a shifted and scaled sigmoid. Isotonic makes no shape assumption; it just enforces monotonicity.
Empirically, LLM miscalibration does not follow a sigmoid shape. A common pattern: the model is roughly calibrated in the middle of the probability range and wildly overconfident at the extremes. Platt cannot fit that shape without material error; isotonic simply pools the extreme bins down to their observed frequencies and leaves the middle alone.
Recommended default: use isotonic. Fall back to Platt only if sample size is genuinely tiny (n < 50) and you need a smoother output.
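If you do fall back to Platt, it amounts to a one-feature logistic regression on the raw score. Here is a sketch using scikit-learn; any logistic fit works, and classic Platt scaling also smooths the 0/1 targets slightly, which is omitted here:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(pairs):
    # pairs: (raw_score, outcome) with outcome in {0, 1}.
    X = np.array([[s] for s, _ in pairs])
    y = np.array([o for _, o in pairs])
    model = LogisticRegression(C=1e6)  # effectively unregularized
    model.fit(X, y)
    return lambda s: float(model.predict_proba([[s]])[0, 1])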
References
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio." (Discusses calibration correction in the context of Sharpe.)
- Platt, J. (1999). "Probabilistic Outputs for Support Vector Machines."
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities With Supervised Learning."
- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). "Probabilistic forecasts, calibration and sharpness."