Brier Score vs Log Loss
When an LLM or any other model outputs a probability rather than a label, you need a scoring rule that rewards calibrated honesty and cannot be gamed by hedging. Both the Brier score and log loss are proper scoring rules, meaning the expected score is optimized by reporting your true belief. They differ in how the penalty grows as a wrong prediction approaches certainty. The Brier penalty grows quadratically and stays bounded; the log loss penalty grows without bound and explodes near the confident-and-wrong corner. That difference decides which is appropriate for a given forecasting task. This page compares the two for scoring financial and agent forecasts.
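For reference, the two scores for binary outcomes can be written as below, followed by a short check that each is proper. The notation (p_i for the forecast probability, o_i for the 0/1 outcome, q for the forecaster's true belief) is an assumption introduced here, not taken from the page.

```latex
% Assumed notation: p_i = forecast probability that event i occurs,
% o_i \in \{0, 1\} = observed outcome, N = number of forecasts.
\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
\qquad
\mathrm{LL} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[\, o_i \ln p_i + (1 - o_i) \ln(1 - p_i) \,\bigr]

% Properness for one event with true probability q and reported probability p.
% Brier: expected score and its derivative in p.
\mathbb{E}[\mathrm{BS}] = q (1 - p)^2 + (1 - q) p^2,
\qquad
\frac{d}{dp}\,\mathbb{E}[\mathrm{BS}] = 2 (p - q) = 0 \iff p = q

% Log loss: expected score and its derivative in p.
\mathbb{E}[\mathrm{LL}] = -q \ln p - (1 - q) \ln(1 - p),
\qquad
\frac{d}{dp}\,\mathbb{E}[\mathrm{LL}] = -\frac{q}{p} + \frac{1 - q}{1 - p} = 0 \iff p = q
```

Both expected scores are convex in p, so the stationary point p = q is the minimum: reporting anything other than the true belief makes the expected score worse.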
The Brier score is the mean squared difference between predicted probabilities and observed outcomes. It is a bounded, proper scoring rule: lower is better and zero is perfect.
Pros
- Bounded between zero and one for binary outcomes, so a single bad forecast cannot dominate the average
- Interpretable as a mean squared error of probabilities, intuitive to reason about
- Decomposes cleanly into calibration and refinement components for diagnostics
- Robust to occasional overconfident errors, which it penalizes only quadratically
Cons
- Penalizes confident wrong predictions relatively gently, which may understate their real cost
- Less sensitive than log loss to differences among already-good probabilistic forecasts
- Quadratic shape does not match the information-theoretic cost of being surprised
- Can rate an overconfident model leniently when confident errors should be severe
Best for: bounded, interpretable scoring that is robust to outliers, and reporting forecast quality where one confident miss should not dominate.
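A minimal sketch of the Brier score for binary outcomes, assuming forecasts arrive as plain Python lists. The function name `brier_score` and the example stream are hypothetical and only illustrate the boundedness point made above.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical stream: nine reasonable forecasts plus one confident miss.
probs = [0.9] * 9 + [0.99]
outcomes = [1] * 9 + [0]

score = brier_score(probs, outcomes)

# Each forecast's squared error is at most 1, so a single miss can add
# at most 1/N to the average, no matter how confident it was.
max_single_contribution = 1.0 / len(probs)

print(f"Brier score: {score:.4f}")
print(f"Largest possible contribution of one forecast: {max_single_contribution:.4f}")
```

In this made-up stream the confident miss accounts for most of the score, but it still cannot push the average above its 1/N cap.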
Log loss (cross-entropy) is the negative log-likelihood of the observed outcomes under the predicted probabilities. It is an unbounded proper scoring rule that punishes confident errors severely.
Pros
- Punishes confident wrong predictions extremely hard, matching risk-sensitive use cases
- Information-theoretically grounded: it measures the average surprise of the observed outcomes under the forecast
- Highly sensitive among good forecasts, sharply rewarding better-calibrated probabilities
- Directly the objective most classifiers optimize, so it aligns metric with training
Cons
- Unbounded: a single confident, wrong prediction can blow up the average score
- Infinite penalty for assigning zero probability to an event that occurs, requiring clipping
- Harder to interpret on an absolute scale than the bounded Brier score
- Sensitive to outliers and to probabilities pushed near zero or one
Best for: risk-sensitive forecasting that must avoid confident errors, model training, and discriminating among already-good forecasters.
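The clipping requirement mentioned above can be sketched as follows. The function name `log_loss` and the default `eps=1e-15` are assumptions chosen for illustration; the right clipping threshold depends on the application, and it changes the reported score for forecasts at or near 0 and 1.

```python
import math
from typing import Sequence


def log_loss(probs: Sequence[float], outcomes: Sequence[int], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of 0/1 outcomes, with probabilities clipped to [eps, 1 - eps]."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    total = 0.0
    for p, o in zip(probs, outcomes):
        # Clip so that a forecast of exactly 0 or 1 never reaches log(0).
        p = min(max(p, eps), 1.0 - eps)
        # Probability the forecast assigned to what actually happened.
        p_observed = p if o == 1 else 1.0 - p
        total += -math.log(p_observed)
    return total / len(probs)


# A forecast of 0.0 for an event that occurs would be infinite without clipping;
# with clipping it is large but finite.
clipped_penalty = log_loss([0.0], [1])
print(f"Penalty for p=0 on an event that occurred (clipped): {clipped_penalty:.2f}")
```

Because the clipped penalty is set entirely by `eps`, it is worth reporting the threshold alongside any log loss figure.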
Decision Table
See the tradeoffs side by side
| Criterion | Brier Score | Log Loss (Cross-Entropy) |
|---|---|---|
| Penalty shape | Quadratic, bounded | Logarithmic, unbounded |
| Confident wrong prediction | Penalized gently | Penalized severely, can be infinite |
| Range (binary) | 0 to 1 | 0 to infinity |
| Outlier robustness | High | Low |
| Proper scoring rule | Yes | Yes |
| Interpretability | Mean squared error of probabilities | Expected surprise, less direct |
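To make the "penalty shape" row concrete, the sketch below computes the per-forecast penalty each rule assigns when an event forecast with a given confidence does not occur. The helper name `wrong_prediction_penalties` and the chosen confidence levels are illustrative assumptions.

```python
import math
from typing import Tuple


def wrong_prediction_penalties(confidence: float) -> Tuple[float, float]:
    """Return (Brier penalty, log loss penalty) when an event forecast at
    `confidence` does not occur. Requires 0 <= confidence < 1."""
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1)")
    brier = confidence ** 2              # (p - 0)^2, never exceeds 1
    logloss = -math.log(1.0 - confidence)  # grows without bound as confidence -> 1
    return brier, logloss


for c in (0.6, 0.9, 0.99, 0.999):
    brier, logloss = wrong_prediction_penalties(c)
    print(f"confidence={c:<6} brier={brier:.4f} log_loss={logloss:.4f}")
```

The Brier column flattens out just below 1 while the log loss column keeps climbing, which is the quadratic-versus-logarithmic contrast in the table.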
Verdict
Both are proper, so neither rewards hedging, and the choice is about how harshly you want to punish confident mistakes.

Use the Brier score when you want a bounded, interpretable number that no single overconfident miss can dominate. That makes it the safer default for reporting forecast quality and for comparing forecasters when robustness matters.

Use log loss when the cost of being confidently wrong is genuinely catastrophic, as in risk-sensitive financial forecasts where a model that says 99 percent and is wrong should be penalized far more than one that says 60 percent and is wrong. The Brier score separates those two misses by a factor of about 2.7 (0.98 versus 0.36); log loss separates them by a factor of about 5 (4.61 versus 0.92), and the gap keeps widening as confidence approaches 100 percent.

Two practical notes. First, log loss requires clipping probabilities away from zero and one, or it returns infinity. Second, because the two metrics can rank forecasters differently, report both when the decision is important rather than trusting a single number.
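The point that the two metrics can rank forecasters differently can be checked with a small constructed example. Everything here is hypothetical: forecaster A is sharp but suffers one confident miss, forecaster B is cautious throughout, and both are scored on the same ten made-up outcomes. No clipping is applied because every probability is strictly between 0 and 1.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_loss(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean negative log-likelihood; assumes every probability is strictly in (0, 1)."""
    return sum(-math.log(p if o == 1 else 1.0 - p) for p, o in zip(probs, outcomes)) / len(probs)


# Ten hypothetical events: nine occur, one does not.
outcomes = [1] * 9 + [0]
forecaster_a = [0.99] * 10  # sharp, so the one miss is a confident miss
forecaster_b = [0.70] * 10  # cautious on every event

scores = {
    "A": (brier_score(forecaster_a, outcomes), log_loss(forecaster_a, outcomes)),
    "B": (brier_score(forecaster_b, outcomes), log_loss(forecaster_b, outcomes)),
}

for name, (brier, ll) in scores.items():
    print(f"forecaster {name}: brier={brier:.4f} log_loss={ll:.4f}")

# Lower is better for both metrics.
best_by_brier = min(scores, key=lambda name: scores[name][0])
best_by_log_loss = min(scores, key=lambda name: scores[name][1])
print(f"best by Brier: {best_by_brier}, best by log loss: {best_by_log_loss}")
```

With these numbers the Brier score prefers A (about 0.098 versus 0.130) while log loss prefers B (about 0.441 versus 0.470), so which forecaster looks better depends entirely on the metric reported.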