Platt vs Temperature Scaling
Modern models, including large neural networks, are often miscalibrated: a stated 90 percent confidence does not mean the model is right nine times in ten. Post-hoc calibration addresses this by learning, on a held-out validation set, a transform that maps raw scores to honest probabilities. Platt scaling and temperature scaling are the two simplest parametric transforms. Platt scaling fits a logistic regression on the scores, with a slope and an intercept. Temperature scaling divides the logits by a single learned scalar before the softmax. The extra parameter in Platt scaling gives flexibility but can move the decision boundary, whereas temperature scaling is constrained to leave the predicted class and the ranking of classes untouched. This matrix compares them for calibrating financial-model and LLM confidences.
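As a compact reference, the two transforms can be written as follows. The symbols are notation chosen here for illustration, not taken from the original text: s is a raw score, a and b are the Platt slope and intercept, z is the logit vector, and T > 0 is the temperature.

```latex
\begin{align}
\text{Platt scaling:} \quad & \hat{p} = \sigma(a s + b) = \frac{1}{1 + e^{-(a s + b)}} \\
\text{Temperature scaling:} \quad & \hat{p}_k = \frac{e^{z_k / T}}{\sum_{j} e^{z_j / T}}, \qquad T > 0
\end{align}
```

With two classes and s = z_1 - z_0, temperature scaling reduces to sigma(s / T), which is Platt scaling with a = 1/T and the intercept fixed at 0. That fixed intercept is why it cannot shift the decision threshold.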
Platt Scaling
Fits a logistic transform, with a slope and an intercept, that maps raw scores to calibrated probabilities. Both parameters are learned on a held-out set.
Pros
- Two parameters let it both rescale confidence and shift the effective decision threshold
- Well-suited to binary classifiers and to SVM-style scores that need a probability mapping
- Can correct both over- and under-confidence and an offset bias simultaneously
- Long-established with broad library support and well-understood behavior
Cons
- Can move the decision boundary (and, when applied one-versus-rest, the argmax), altering accuracy as a side effect
- Two parameters need more calibration data to fit reliably than a single one
- Assumes a logistic relationship that may not match the true miscalibration shape
- Awkward to extend to many classes, where it is applied one-versus-rest
Best for: binary scores needing both rescaling and threshold adjustment, and probability mapping for margin-based classifiers.
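The Platt fit described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`sigmoid`, `fit_platt`), the lightly damped Newton solver, and the toy held-out data are all choices made here, and the sketch omits the target smoothing used in Platt's paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_platt(scores, labels, iters=100, damping=1e-6, tol=1e-10):
    """Fit p = sigmoid(a * s + b) by Newton steps on the log loss.

    Hypothetical helper: `damping` is added to the Hessian diagonal only to
    guard against a singular matrix; it does not change the optimum.
    """
    a, b = 1.0, 0.0  # start from the identity mapping on the raw score
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            ga += (p - y) * s  # gradient with respect to the slope
            gb += p - y        # gradient with respect to the intercept
            haa += w * s * s
            hab += w * s
            hbb += w
        haa += damping
        hbb += damping
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        a, b = a - da, b - db
        if abs(da) + abs(db) < tol:
            break
    return a, b


# Toy held-out set, made up for illustration. Raw scores are logits:
# at s = 2 the model is right 9 times in 10, at s = -2 the label is a coin flip.
scores = [2.0] * 10 + [-2.0] * 10
labels = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 5

a, b = fit_platt(scores, labels)
raw_p = sigmoid(-1.0)           # uncalibrated probability at score -1
cal_p = sigmoid(a * -1.0 + b)   # calibrated probability at the same score
print(f"slope={a:.3f} intercept={b:.3f} raw={raw_p:.3f} calibrated={cal_p:.3f}")
```

On this toy data the fitted intercept is positive, so a score of -1 that sat below 0.5 before calibration lands above 0.5 afterwards. That is the decision-boundary shift listed under the cons above.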
Temperature Scaling
Divides the logits by a single learned temperature before the softmax, softening or sharpening the confidence distribution without changing the ranking.
Pros
- A single parameter, so it fits reliably on small calibration sets and rarely overfits
- Preserves the ranking and the argmax, so accuracy is unchanged by construction
- The standard, often most effective method for modern multiclass neural networks
- Cheap to apply and trivial to reason about: one scalar softens every prediction
Cons
- Cannot fix class-specific or threshold miscalibration, since it scales everything uniformly
- One parameter is too rigid when miscalibration differs across classes or score regions
- Does not adjust the decision boundary, which is sometimes exactly what you need
- Assumes the miscalibration is a uniform over- or under-confidence, which is not always true
Best for: multiclass neural-network confidences where accuracy must be preserved and a single uniform softening suffices.
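Temperature scaling as described above can be sketched in a few lines of plain Python. The helper names (`log_softmax`, `nll`, `fit_temperature`, `argmax`), the search bounds, and the toy logits are assumptions made for this illustration. The temperature is found with a golden-section search over log T, which is one reasonable choice because the negative log-likelihood has a single minimum in T. Other optimizers would work as well.

```python
import math


def log_softmax(logits, T=1.0):
    """Log-probabilities of softmax(logits / T), computed stably."""
    z = [v / T for v in logits]
    m = max(z)
    log_total = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_total for v in z]


def nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    total = sum(-log_softmax(row, T)[y] for row, y in zip(logit_rows, labels))
    return total / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T within assumed bounds [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return math.exp((a + b) / 2)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy held-out set, made up for illustration: three classes, the model always
# puts a logit of 4 on its top class but is right only 8 times in 10.
rows, labels = [], []
for k in range(3):
    logits = [0.0, 0.0, 0.0]
    logits[k] = 4.0
    for y in [k] * 8 + [(k + 1) % 3, (k + 2) % 3]:
        rows.append(logits)
        labels.append(y)

T = fit_temperature(rows, labels)
conf_before = math.exp(max(log_softmax(rows[0], 1.0)))
conf_after = math.exp(max(log_softmax(rows[0], T)))
same_argmax = all(argmax(r) == argmax(log_softmax(r, T)) for r in rows)
print(f"T={T:.3f} confidence {conf_before:.3f} -> {conf_after:.3f}, argmax kept: {same_argmax}")
```

On this toy data the fitted temperature is greater than 1, which pulls the top-class confidence down toward the observed 80 percent accuracy while every predicted class stays the same.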
Decision Table
See the tradeoffs side by side
| Criterion | Platt Scaling | Temperature Scaling |
|---|---|---|
| Parameters fit | Two: slope and intercept | One: temperature |
| Changes argmax / accuracy | Can change it | Never, ranking preserved |
| Adjusts decision threshold | Yes | No |
| Calibration data needed | More | Less |
| Multiclass fit | One-versus-rest, awkward | Natural, single scalar |
| Best modern use | Binary scores | Multiclass neural networks |
Verdict
Choose by whether you must preserve accuracy and how many classes you have. For modern multiclass neural networks, temperature scaling is the default and frequently the most effective method. Its single parameter fits reliably on little data and rarely overfits, and by construction it leaves the ranking and argmax, and therefore accuracy, untouched; it only rescales confidence, typically softening overconfident probabilities. Reach for Platt scaling when you are calibrating a binary classifier or margin-based score and you genuinely want the extra flexibility to shift the decision threshold as well as rescale confidence, accepting that the two parameters need more calibration data and can move accuracy. Both are simple baselines: if neither captures the miscalibration shape, a non-parametric method like isotonic regression can beat them at the cost of more data and a higher overfitting risk. Whichever you pick, fit it on a held-out set the model never trained on, and verify the result with a reliability diagram rather than trusting the transform blindly.
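The reliability-diagram check recommended above can be sketched as a binning step: group predictions by confidence and compare the mean confidence in each bin with the observed accuracy. The function names, the equal-width binning, and the toy data below are assumptions made for this illustration. The expected calibration error (ECE) summary is an addition here, not something the text above prescribes.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (bin_index, count, mean_confidence, accuracy) for non-empty bins.

    For a multiclass model, `confidences` would be the top-class probability
    and `correct` a 0/1 flag for whether the argmax matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, ok))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        rows.append((i, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy."""
    n = len(confidences)
    return sum(
        count / n * abs(mean_conf - accuracy)
        for _, count, mean_conf, accuracy in reliability_bins(confidences, correct, n_bins)
    )


# Toy example, made up for illustration: ten predictions at 95 percent
# confidence, of which eight are correct.
confidences = [0.95] * 10
correct = [1] * 8 + [0] * 2

bins = reliability_bins(confidences, correct)
ece = expected_calibration_error(confidences, correct)
for idx, count, mean_conf, accuracy in bins:
    print(f"bin {idx}: n={count} confidence={mean_conf:.2f} accuracy={accuracy:.2f}")
print(f"ECE={ece:.3f}")
```

Plotting mean confidence against accuracy per bin gives the reliability diagram. A well-calibrated model sits near the diagonal, and comparing the bins before and after the transform shows whether calibration actually helped.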
Sources & References
- On Calibration of Modern Neural Networks — Guo, Pleiss, Sun, Weinberger, ICML (2017)
- Probabilistic Outputs for Support Vector Machines — John Platt, Advances in Large Margin Classifiers (1999)