A 15-forecast sample run through the Forecast Scoring Sandbox returns Brier 0.154, log-loss 0.470, and a base rate of 53.3%. Brier decomposes into reliability 0.154, resolution 0.249, and uncertainty 0.249. The resolution term equals the uncertainty term, so the two cancel and the reliability term carries the entire score: all of the measured error is miscalibration. That is the diagnostic, not the headline. The headline number says "this forecast set is mediocre"; the decomposition says "the forecaster does not know what they don't know."
TL;DR
- 15-forecast sandbox sample on binary outcomes.
- Brier score: 0.154 (lower is better; 0 is perfect, 0.25 is what a constant 50% forecast scores).
- Log loss: 0.470 (lower is better; ln(2)=0.693 is coin-flip).
- Base rate: 53.3% of outcomes were positive.
- Brier decomposition: reliability 0.154, resolution 0.249, uncertainty 0.249.
- Resolution equals uncertainty, so the two cancel and reliability (0.154) accounts for the whole Brier score; that is the calibration failure signal.
The scenario
The Forecast Scoring Sandbox takes a list of (predicted_probability, observed_outcome) pairs and returns Brier, log-loss, and the Murphy decomposition [1]. On a 15-forecast sample with a 53.3% base rate, the output is:
| Metric | Value |
|---|---|
| Brier score | 0.154 |
| Log loss | 0.470 |
| Reliability | 0.154 |
| Resolution | 0.249 |
| Uncertainty | 0.249 |
| Base rate | 0.533 |
| Count | 15 |
| Bins used | 14 |
Brier and log-loss in one paragraph
Brier score is the mean squared error of forecast probability versus realised outcome [2]. Log loss is the mean negative log-likelihood, which penalises confident wrong forecasts more heavily. Both are proper scoring rules: a forecaster minimises the expected score by reporting their true belief [3]. Neither is a calibration test on its own; both are aggregate score summaries.
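As a minimal sketch of the two definitions, assuming forecasts arrive as (probability, outcome) pairs with 0/1 outcomes: the function names, the clipping constant `eps`, and the five example forecasts are illustrative assumptions, not the sandbox's code or its 15-item sample.

```python
import math

def brier_score(pairs):
    """Mean squared error between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

def log_loss(pairs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for p, y in pairs:
        p = min(max(p, eps), 1 - eps)  # avoid log(0) on forecasts of exactly 0 or 1
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(pairs)

# Hypothetical forecasts, not the sandbox's sample.
example = [(0.9, 1), (0.7, 1), (0.4, 0), (0.2, 0), (0.6, 0)]
print(round(brier_score(example), 4), round(log_loss(example), 4))
```

The clipping step is a common practical guard; without it a single forecast of exactly 0 or 1 on the wrong side makes the log loss infinite.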
For a 53.3% base rate:
- A forecaster who says 50% on every prediction: Brier = 0.25, log loss = 0.693.
- A forecaster who matches base rate (53.3% on every prediction): Brier = 0.249, log loss = 0.691.
- A perfectly-calibrated forecaster who also has resolution: scores below both baselines above; as a rough guide, Brier < 0.20 and log loss < 0.55.
- Our scenario at Brier 0.154, log loss 0.470: better than base rate, but the decomposition tells us where the score comes from.
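The two constant-forecast baselines in this list can be reproduced with a short calculation. This is a sketch under the assumption that the 53.3% base rate corresponds to 8 positive outcomes out of 15; `constant_forecast_scores` is a hypothetical helper name.

```python
import math

def constant_forecast_scores(p, base_rate):
    """Brier and log loss when every forecast is p and a fraction base_rate of outcomes is 1."""
    brier = base_rate * (1 - p) ** 2 + (1 - base_rate) * p ** 2
    logloss = -(base_rate * math.log(p) + (1 - base_rate) * math.log(1 - p))
    return brier, logloss

base_rate = 8 / 15  # 53.3%, assuming 8 positives out of 15
for p in (0.5, base_rate):
    brier, logloss = constant_forecast_scores(p, base_rate)
    print(f"p={p:.3f}  Brier={brier:.3f}  log loss={logloss:.3f}")
```

Forecasting the base rate on every question gives a Brier equal to the uncertainty term, which is why 0.249 is the natural "no skill" reference for this sample.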
The Murphy decomposition
Murphy (1973) showed Brier decomposes as:
Brier = Reliability − Resolution + Uncertainty
where:
- Reliability measures how close the forecaster's stated probabilities are to the realised frequencies in each probability bucket. Zero is perfect calibration.
- Resolution measures how much the realised frequency in each probability bucket differs from the overall base rate. Higher resolution = more informative predictions; it can never exceed Uncertainty.
- Uncertainty is the variance of the observed outcomes — a property of the data, not the forecaster.
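Written out for binned forecasts, a standard form of the three terms is shown here. The symbols are notation introduced for this sketch rather than taken from the sandbox: $N$ forecasts fall into $K$ occupied buckets, bucket $k$ holds $n_k$ forecasts with mean stated probability $f_k$ and observed frequency $\bar{o}_k$, and $\bar{o}$ is the overall base rate.

$$
\text{Brier} \;=\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(f_k - \bar{o}_k)^2}_{\text{Reliability}} \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k \,(\bar{o}_k - \bar{o})^2}_{\text{Resolution}} \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{Uncertainty}}
$$

The identity is exact when every forecast in a bucket states the same probability; when a bucket pools different probabilities, a small within-bucket remainder separates the two sides.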
For our scenario: Brier 0.154 = Reliability 0.154 − Resolution 0.249 + Uncertainty 0.249.
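The Resolution and Uncertainty terms cancel, so the Reliability of 0.154 is the entire Brier. Resolution reaches its ceiling, which is Uncertainty, when the observed frequency in every occupied bucket is 0 or 1, meaning the buckets separate events that happened from events that did not; in that sense the predictions are informative. One caveat: 14 buckets are in use for 15 forecasts, so nearly every bucket holds a single forecast and its observed frequency is 0 or 1 by construction. What remains is a calibration gap: the forecaster's stated probabilities do not match the realised frequencies in their buckets.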
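A binned decomposition along these lines can be sketched as follows. The equal-width binning and the use of the mean stated probability per bin are assumptions; this article does not spell out the sandbox's exact binning rule, and `murphy_decomposition` is a hypothetical name.

```python
def murphy_decomposition(pairs, n_bins=20):
    """Binned reliability, resolution, and uncertainty for (probability, outcome) pairs."""
    n = len(pairs)
    base_rate = sum(y for _, y in pairs) / n
    bins = {}
    for p, y in pairs:
        # Equal-width bins on [0, 1]; p == 1.0 joins the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(k, []).append((p, y))
    reliability = resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k
        freq = sum(y for _, y in members) / n_k
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2
    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
        "base_rate": base_rate,
        "bins_used": len(bins),
    }

# Hypothetical forecasts where every occupied bin holds a single outcome.
result = murphy_decomposition([(0.9, 1), (0.8, 1), (0.3, 0), (0.1, 0)])
print(result)
```

In this toy input each forecast lands in its own bin, so resolution equals uncertainty and reliability equals the plain Brier score, the same pattern as in the 15-forecast output.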
What this means for a finance forecaster
A finance LLM that returns probabilities on directional calls (e.g., "65% chance EUR/USD closes above 1.08") needs both calibration and resolution to be useful:
- Calibrated, no resolution. Forecaster matches base rate on every bin. Probabilities are accurate as long-run averages, but say nothing about specific cases. Useless for trading.
- Resolved, no calibration. Forecaster's predictions vary meaningfully but the stated probabilities don't match realised frequencies. The shape of the predictions is right; the levels are off. Fixable with isotonic calibration.
- Both. The defensible target.
Our 15-sample scenario has resolution but lacks calibration. The fix is to run the forecasts through an isotonic calibration pass (see Isotonic Calibration for LLM Forecasts), which maps the model's raw probabilities to empirically realised frequencies.
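To make the idea concrete, here is a minimal pure-Python sketch of isotonic calibration using the pool-adjacent-violators algorithm. It is an illustration under stated assumptions, not the method of the linked article: `fit_isotonic` and `calibrate` are hypothetical names, the mapping is a step function, and the six training pairs are invented.

```python
def fit_isotonic(pairs):
    """Pool-adjacent-violators fit; returns non-decreasing (upper_raw_probability, calibrated) knots."""
    # Aggregate tied raw probabilities so each distinct value starts as one block.
    grouped = {}
    for p, y in pairs:
        s, c = grouped.get(p, (0.0, 0))
        grouped[p] = (s + y, c + 1)
    blocks = []  # each block: [sum of outcomes, count, largest raw probability]
    for p in sorted(grouped):
        s, c = grouped[p]
        blocks.append([s, c, p])
        # Merge backwards while the previous block's mean exceeds the new one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, c2, p_max = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] = p_max
    return [(p_max, s / c) for s, c, p_max in blocks]

def calibrate(p, knots):
    """Step-function lookup: value of the first block whose upper edge is >= p."""
    for upper, value in knots:
        if p <= upper:
            return value
    return knots[-1][1]

# Hypothetical resolved forecasts used to fit the calibrator.
history = [(0.2, 0), (0.3, 1), (0.4, 0), (0.6, 1), (0.7, 0), (0.9, 1)]
knots = fit_isotonic(history)
print(knots, calibrate(0.35, knots))
```

Two practical notes: the calibrator should be fitted on resolved forecasts that are separate from the ones being scored, and calibrated values of exactly 0 or 1 should be clipped before computing log loss.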
Sample-size caveats
15 observations is small. The Brier and log-loss estimates have wide standard errors at this sample size:
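- Brier stderr ≈ √(Brier × (1−Brier) / n) ≈ √(0.154 × 0.846 / 15) ≈ 0.093. This borrows the binomial-proportion formula as a rough heuristic; the standard error of the mean of the per-forecast squared errors is the more direct estimate when the raw forecasts are at hand.
- 95% CI on Brier: roughly [0.000, 0.34] after truncating at zero, wide enough that "no information" (Brier ≈ 0.25 on this base rate) is inside the CI.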
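The interval arithmetic can be checked with a few lines. This sketch shows the binomial-style heuristic from the list above next to an empirical alternative based on per-forecast squared errors; both function names are hypothetical, and the empirical version needs the raw (probability, outcome) pairs, which this article elides.

```python
import math

def heuristic_stderr(brier, n):
    """Binomial-style approximation: sqrt(B * (1 - B) / n)."""
    return math.sqrt(brier * (1 - brier) / n)

def empirical_stderr(pairs):
    """Standard error of the mean of per-forecast squared errors."""
    errors = [(p - y) ** 2 for p, y in pairs]
    n = len(errors)
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / (n - 1)  # sample variance
    return math.sqrt(variance / n)

se = heuristic_stderr(0.154, 15)
# Normal-approximation interval, truncated at zero because Brier cannot be negative.
low, high = max(0.0, 0.154 - 1.96 * se), 0.154 + 1.96 * se
print(f"stderr={se:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

With only 15 forecasts the normal approximation itself is shaky, so the interval is best read as an order-of-magnitude statement about the noise.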
For statistically reliable scoring, plan for 100+ forecasts minimum. Calibration improves with deliberate, scored practice rather than volume alone [4]. The Calibration Dojo provides a structured training environment for building up a 100-forecast calibration baseline before deploying any model output.
Log-loss vs Brier — when each is right
Brier and log-loss agree on the ordering of forecasters most of the time. They differ on:
- Confident wrong forecasts. A forecaster who says 99% on an event that fails takes a hit in both, but log-loss penalises more severely. For trading applications where a confident wrong call is disproportionately costly, log-loss is the right scoring rule.
- Probabilities near 0 or 1. Log-loss is unbounded when the probability assigned to the outcome that actually occurs approaches 0 (a forecast of $p \to 0$ on an event that happens, or $p \to 1$ on one that does not); Brier is bounded between 0 and 1. For applications where models occasionally output near-0 or near-1 probabilities, Brier is more numerically stable.
- Mean-zero comparisons. Brier is a quadratic; log-loss is logarithmic. For ranking close forecasters, Brier's quadratic gives finer resolution near small differences.
For finance applications targeting calibrated probability output, log-loss is the typical primary metric, with Brier as a sanity check.
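The difference on confident wrong forecasts is easy to see for a single forecast. The sketch below uses a hypothetical helper, `single_forecast_penalties`, and scores increasingly confident forecasts of an event that does not happen.

```python
import math

def single_forecast_penalties(p, outcome):
    """Brier and log-loss penalty for one forecast p on a 0/1 outcome."""
    p_outcome = p if outcome == 1 else 1 - p  # probability given to what actually happened
    return (1 - p_outcome) ** 2, -math.log(p_outcome)

for p in (0.6, 0.9, 0.99, 0.999):
    brier, logloss = single_forecast_penalties(p, outcome=0)
    print(f"p={p}: Brier={brier:.3f}  log loss={logloss:.3f}")
```

The Brier penalty saturates just below 1 while the log-loss penalty keeps growing, which is the behaviour the first two bullets describe.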
The reliability diagram
The engine is configured with 20 bins for the reliability diagram; 14 of them are occupied in this sample. The diagram plots mean predicted probability per bin against observed frequency per bin. Perfect calibration is the y = x line; the reliability term is the count-weighted mean squared deviation from that line.
For visual reading, look for systematic over- or under-prediction. A curve that sits above y = x means the forecaster systematically under-predicts (probabilities are lower than realised frequencies). A curve below means the forecaster over-predicts (probabilities are higher than realised frequencies). Both patterns are fixable with isotonic calibration.
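The points of such a diagram can be computed without any plotting library. This is a sketch assuming equal-width bins; `reliability_points` is a hypothetical name, and the sample uses 5 bins rather than 20 so that each bin holds more than one invented forecast.

```python
def reliability_points(pairs, n_bins=20):
    """Per-bin (mean predicted probability, observed frequency, count), lowest bin first."""
    bins = {}
    for p, y in pairs:
        k = min(int(p * n_bins), n_bins - 1)  # equal-width bins; p == 1.0 joins the last bin
        bins.setdefault(k, []).append((p, y))
    points = []
    for k in sorted(bins):
        members = bins[k]
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        points.append((mean_p, freq, len(members)))
    return points

# Hypothetical sample, not the sandbox's forecasts.
sample = [(0.15, 0), (0.1, 1), (0.3, 0), (0.35, 1), (0.7, 1), (0.75, 1), (0.9, 1), (0.95, 1)]
points = reliability_points(sample, n_bins=5)
for mean_p, freq, count in points:
    side = "above" if freq > mean_p else "below" if freq < mean_p else "on"
    print(f"predicted={mean_p:.2f} observed={freq:.2f} n={count} ({side} y = x)")
```

Every point in this toy sample sits above the diagonal, the under-prediction pattern. Reporting the count per bin alongside each point helps flag bins too thin to read.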
Production usage
For a production LLM finance forecast pipeline:
- Log every forecast and outcome. Run the model in production with full probability output; capture the eventual realised outcome.
- Run the scoring sandbox monthly. Brier and log-loss over the last 100+ forecasts.
- Decompose Brier monthly. Watch reliability, resolution, and uncertainty separately. Reliability drift is the calibration regression signal.
- Recalibrate quarterly. Re-fit the isotonic calibrator on the most recent 200+ forecasts.
The forecasting cycle is operational, not one-off. Models drift; markets shift regimes; the calibration that worked on Q1 data may not work on Q3 data.
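The logging and monthly scoring steps might be sketched like this. Everything here is an assumption for illustration: the `ForecastRecord` fields, the `score_recent` helper, the chronological ordering of records, and the default window of 100 are not a specification of the sandbox or of any particular pipeline.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastRecord:
    forecast_id: str               # hypothetical identifier
    probability: float             # stated probability of the positive outcome
    outcome: Optional[int] = None  # 0 or 1 once resolved; None while pending

def score_recent(records, window=100, eps=1e-15):
    """Brier and log loss over the most recent `window` resolved forecasts, or None if too few."""
    resolved = [r for r in records if r.outcome is not None]  # assumes chronological order
    if len(resolved) < window:
        return None
    recent = resolved[-window:]
    brier = sum((r.probability - r.outcome) ** 2 for r in recent) / window
    logloss = 0.0
    for r in recent:
        p = min(max(r.probability, eps), 1 - eps)  # clip to keep the logarithm finite
        logloss -= math.log(p if r.outcome == 1 else 1 - p)
    return {"brier": brier, "log_loss": logloss / window, "n": window}

# Hypothetical log with one unresolved forecast.
log = [ForecastRecord("a", 0.8, 1), ForecastRecord("b", 0.6), ForecastRecord("c", 0.3, 0)]
print(score_recent(log, window=2), score_recent(log, window=5))
```

Returning `None` below the window size is one way to keep undersized samples out of the monthly report.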
Failure modes
- Quoting Brier without the decomposition. Brier alone hides the calibration vs resolution question.
- Running on tiny samples. 15 observations gives a Brier with ±0.09 standard error. Plan for 100+.
- Treating log-loss as a probability. It is a negative log-likelihood, dimensionless, not interpretable in isolation. Use it as a relative ranking signal.
- Skipping the production logging step. Models cannot be re-calibrated against outcomes you did not log.
FAQ
Should I prefer Brier or log-loss?
For finance applications where confident wrong calls cost more, log-loss is the primary metric. Brier is the sanity check. Report both in any production calibration report.
Why is the resolution equal to uncertainty?
It means the observed frequencies across the forecaster's probability buckets vary as much as the outcomes themselves: the buckets separate events that happened from events that did not, so the forecasts are informative about specific cases, not just long-run averages. That is what you want; the remaining failure is calibration (reliability term), not resolution.
Can I trust the Brier number from 15 forecasts?
Not as a precise estimate. Treat it as a noisy point estimate with roughly ±0.09 standard error. Use it as the start of a forecasting cycle, not as a final number. Build to 100+ forecasts before publishing or trading on the Brier value.
Connects to
- Isotonic Calibration for LLM Forecasts: the fix when reliability is poor.
- Brier Scores and Log Loss for Forecasters: extended treatment.
- Bayesian Updating for LLM Forecasts: multi-source forecast combination.
- Calibration Drift for LLM Confidence Scores: when calibration ages.
- Forecast Scoring Sandbox: score your own forecasts.
- Forecast Scoring Sandbox methodology: full input/output specification.
References
1. Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology 12(4), 595–600. journals.ametsoc.org
2. Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78(1), 1–3. journals.ametsoc.org
3. Gneiting, T., & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association 102(477), 359–378. tandfonline.com
4. Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Verified engine output
| Field | Value |
|---|---|
| n_bins | 20 |
| forecasts (15 items) | [...] |
| brier › brier | 0.15444666666666668 |
| brier › reliability | 0.15441666666666667 |
| brier › resolution | 0.24888888888888885 |
| brier › uncertainty | 0.24888888888888888 |
| brier › base rate | 0.5333333333333333 |
| brier › count | 15 |
| brier › bins used | 14 |
| log loss | 0.47000646507046634 |
| bins (20 items) | [...] |
Computed live at build time.