Forecast Scoring Sandbox: Reading the Reliability Curve

A 15-forecast sample on the Forecast Scoring Sandbox returns Brier 0.154, log-loss 0.470, and a base rate of 53.3%. Brier decomposes into reliability 0.154, resolution 0.249, and uncertainty 0.249. The resolution term equals the uncertainty term, so the two cancel and the reliability term carries the entire Brier score: the binned forecasts separate the outcomes, but the stated probabilities miss the realised frequencies. That is the diagnostic, not the headline. The headline number says "this forecast set is mediocre"; the decomposition says "the forecaster does not know what they don't know."

TL;DR

15-forecast sandbox sample on binary outcomes.
Brier score: 0.154 (lower is better; 0 is perfect, 0.25 is coin-flip on 50% base rate).
Log loss: 0.470 (lower is better; ln(2)=0.693 is coin-flip).
Base rate: 53.3% of outcomes were positive.
Brier decomposition: reliability 0.154, resolution 0.249, uncertainty 0.249.
The resolution = uncertainty equality leaves reliability as the whole Brier score; that is the calibration failure signal.

The scenario

The Forecast Scoring Sandbox takes a list of (predicted_probability, observed_outcome) pairs and returns Brier, log-loss, and the Murphy decomposition¹. On a 15-forecast sample with a 53.3% base rate, the output is:

Metric	Value
Brier score	0.154
Log loss	0.470
Reliability	0.154
Resolution	0.249
Uncertainty	0.249
Base rate	0.533
Count	15
Bins used	14

Brier and log-loss in one paragraph

Brier score is the mean squared error of forecast probability versus realised outcome². Log loss is the mean negative log-likelihood, which penalises confident wrong forecasts more heavily. Both are proper scoring rules: a forecaster minimises the expected score by reporting their true belief³. Neither is a calibration test on its own; both are aggregate score summaries.
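As a minimal sketch of the two definitions, assuming binary outcomes coded as 0 or 1: the function names, the `eps` clipping value, and the toy data below are illustrative assumptions, not the sandbox's own implementation.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the realised outcomes.

    Clipping to [eps, 1 - eps] is an assumption made here to avoid log(0).
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)

# Hypothetical toy data, not the 15-forecast sandbox sample.
toy_probs = [0.9, 0.2, 0.7, 0.4]
toy_outcomes = [1, 0, 1, 1]
toy_brier = brier_score(toy_probs, toy_outcomes)
toy_logloss = log_loss(toy_probs, toy_outcomes)
```

On the toy data the 0.4 forecast on an event that occurred contributes most of both scores, and its share of the log loss is larger than its share of the Brier score.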

For a 53.3% base rate:

A forecaster who says 50% on every prediction: Brier = 0.25, log loss = 0.693.
A forecaster who matches base rate (53.3% on every prediction): Brier = 0.249, log loss = 0.691.
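A perfectly calibrated forecaster who also has resolution: Brier below 0.249 and log loss below 0.691, with the gap widening as resolution grows. As a rough guide, Brier < 0.20 and log loss < 0.55.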
Our scenario at Brier 0.154, log loss 0.470: better than base rate, but the decomposition tells us where the score comes from.
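The two constant-forecast baselines above can be reproduced with a short calculation. This is a sketch, assuming the 53.3% base rate corresponds to 8 positives out of 15; the helper name is hypothetical.

```python
import math

def constant_forecast_scores(q, base_rate):
    """Brier and log loss when every forecast is q and a fraction
    base_rate of the outcomes is 1."""
    brier = base_rate * (1 - q) ** 2 + (1 - base_rate) * q ** 2
    logloss = -(base_rate * math.log(q) + (1 - base_rate) * math.log(1 - q))
    return brier, logloss

base_rate = 8 / 15  # assumption: 53.3% base rate on 15 forecasts = 8 positives

coin = constant_forecast_scores(0.5, base_rate)
climatology = constant_forecast_scores(base_rate, base_rate)

print(round(coin[0], 3), round(coin[1], 3))                # 0.25 0.693
print(round(climatology[0], 3), round(climatology[1], 3))  # 0.249 0.691
```

The base-rate forecaster's Brier score is the uncertainty term, base_rate × (1 − base_rate), which is why 0.249 appears in both places.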

The Murphy decomposition

Murphy (1973) showed Brier decomposes as:

Brier = Reliability − Resolution + Uncertainty

where:

Reliability measures how close the forecaster's stated probabilities are to the realised frequencies in each probability bucket. Zero is perfect calibration.
Resolution measures how much the realised frequencies in each probability bucket differ from the overall base rate. Higher resolution = more informative predictions.
Uncertainty is the variance of the observed outcomes — a property of the data, not the forecaster.

For our scenario: Brier 0.154 = Reliability 0.154 − Resolution 0.249 + Uncertainty 0.249.

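The Resolution and Uncertainty terms cancel (to floating-point precision). Resolution reaches its maximum, the Uncertainty, only when the observed frequency in every used bin is 0 or 1, so the binned forecasts separate the outcomes completely. The Reliability of 0.154 is then the entire Brier: the forecaster's stated probabilities do not match the realised frequencies in their buckets. One caution applies. With 15 forecasts spread over 14 used bins, all but one bin holds a single forecast, and a single-forecast bin always has an observed frequency of 0 or 1, so part of this pattern comes from the binning rather than from forecaster skill.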
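A minimal sketch of the binned decomposition follows. It assumes equal-width bins on [0, 1], because the sandbox's actual binning rule is not stated here; the function name and toy data are hypothetical.

```python
def murphy_decomposition(probs, outcomes, n_bins=20):
    """Binned Murphy partition: returns (reliability, resolution, uncertainty).

    Assumes equal-width bins; the identity Brier = REL - RES + UNC is exact
    only when all forecasts inside a bin share the same probability.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = {}
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins.setdefault(k, []).append((p, y))
    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k
        freq = sum(y for _, y in members) / n_k
        reliability += n_k * (mean_p - freq) ** 2
        resolution += n_k * (freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty

# Hypothetical toy data, not the sandbox sample.
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 1]
fine = murphy_decomposition(probs, outcomes, n_bins=20)   # one forecast per bin
coarse = murphy_decomposition(probs, outcomes, n_bins=2)  # two forecasts per bin
```

With 20 bins each toy forecast sits alone in its bin, so resolution equals uncertainty and reliability equals the Brier score, the same pattern as the sandbox output. With 2 bins resolution drops below uncertainty, which shows how sensitive the partition is to bin count on small samples.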

What this means for a finance forecaster

A finance LLM that returns probabilities on directional calls (e.g., "65% chance EUR/USD closes above 1.08") needs both calibration and resolution to be useful:

Calibrated, no resolution. Forecaster matches base rate on every bin. Probabilities are accurate as long-run averages, but say nothing about specific cases. Useless for trading.
Resolved, no calibration. Forecaster's predictions vary meaningfully but the stated probabilities don't match realised frequencies. The shape of the predictions is right; the levels are off. Fixable with isotonic calibration.
Both. The defensible target.

Our 15-sample scenario has resolution but lacks calibration. The fix is to run the forecasts through an isotonic calibration pass (see Isotonic Calibration for LLM Forecasts), which maps the model's raw probabilities to empirically realised frequencies.
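To make the isotonic step concrete, here is a pure-Python sketch of the pool-adjacent-violators fit. It is an illustration under stated assumptions, not the calibrator the linked article describes: the function names and the step-function lookup are hypothetical, and a production pipeline would more likely use a library implementation such as scikit-learn's `IsotonicRegression`.

```python
def fit_isotonic(probs, outcomes):
    """Pool-adjacent-violators fit.

    Returns a list of (upper_raw_prob, calibrated_value) knots, sorted by
    raw probability, with non-decreasing calibrated values.
    """
    pairs = sorted(zip(probs, outcomes))
    blocks = []  # each block: [sum of outcomes, count, largest raw prob]
    for p, y in pairs:
        blocks.append([float(y), 1, p])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c, top = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = top
    return [(top, s / c) for s, c, top in blocks]

def calibrate(knots, p):
    """Step lookup: value of the first block whose upper edge is >= p."""
    for top, value in knots:
        if p <= top:
            return value
    return knots[-1][1]

# Hypothetical toy data, not the sandbox sample.
raw = [0.1, 0.3, 0.5, 0.6, 0.8, 0.9]
seen = [0, 1, 0, 1, 1, 1]
knots = fit_isotonic(raw, seen)
```

On the toy data the out-of-order pair at 0.3 and 0.5 is pooled into one block with calibrated value 0.5, so `calibrate(knots, 0.3)` returns 0.5. On a sample as small as 15 forecasts the fitted steps are coarse, which is one more reason to build a larger history before trusting the mapping.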

Sample-size caveats

15 observations is small. The Brier and log-loss estimates have wide standard errors at this sample size:

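Brier stderr ≈ √(Brier × (1−Brier) / n) ≈ √(0.154 × 0.846 / 15) ≈ 0.093. This is a rough binomial-style approximation, not an exact variance.
95% CI on Brier: roughly [0.000, 0.337], wide enough that "no information" (Brier = 0.249 for a forecaster who always states this base rate) is inside the CI.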
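The arithmetic can be checked with the sketch below. The first helper implements the rough approximation from the text; the second is an alternative that is an assumption of this example rather than something the sandbox reports: the standard error of the mean of the per-forecast squared errors, which needs the raw forecasts.

```python
import math

def brier_se_rough(brier, n):
    """Binomial-style approximation: sqrt(B * (1 - B) / n)."""
    return math.sqrt(brier * (1 - brier) / n)

def brier_se_sample(probs, outcomes):
    """Standard error of the mean of per-forecast squared errors."""
    errs = [(p - y) ** 2 for p, y in zip(probs, outcomes)]
    n = len(errs)
    mean = sum(errs) / n
    var = sum((e - mean) ** 2 for e in errs) / (n - 1)
    return math.sqrt(var / n)

se = brier_se_rough(0.154, 15)
ci_low = max(0.0, 0.154 - 1.96 * se)   # Brier cannot go below 0
ci_high = 0.154 + 1.96 * se
print(round(se, 3), round(ci_low, 3), round(ci_high, 3))  # 0.093 0.0 0.337

se_100 = brier_se_rough(0.154, 100)    # same Brier at n = 100
```

At the same Brier value, moving from 15 to 100 forecasts shrinks the rough standard error to about 0.036.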

For statistically reliable scoring, plan for 100+ forecasts minimum. Calibration improves with deliberate, scored practice rather than volume alone⁴. The Calibration Dojo provides a structured training environment for building up a 100-forecast calibration baseline before deploying any model output.

Log-loss vs Brier — when each is right

Brier and log-loss agree on the ordering of forecasters most of the time. They differ on:

Confident wrong forecasts. A forecaster who says 99% on an event that fails takes a hit in both, but log-loss penalises more severely. For trading applications where confident wrong calls cost more than confident right calls, log-loss is the right scoring rule.
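Probabilities near 0 or 1. Log-loss is unbounded as the probability assigned to the realised outcome approaches 0, that is, as $p \to 1$ on an event that fails or $p \to 0$ on one that occurs; Brier is bounded at 1 per forecast. For applications where models occasionally output near-0 or near-1 probabilities, Brier is more numerically stable.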
Mean-zero comparisons. Brier is a quadratic; log-loss is logarithmic. For ranking close forecasters, Brier's quadratic gives finer resolution near small differences.

For finance applications targeting calibrated probability output, log-loss is the typical primary metric, with Brier as a sanity check.
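The difference in how the two rules treat confident wrong forecasts is easy to see per forecast. The sketch below scores an event that did not happen at several stated probabilities; the helper name and the probability grid are illustrative assumptions.

```python
import math

def penalties(p, outcome):
    """Per-forecast Brier and log-loss contributions for stated probability p."""
    brier = (p - outcome) ** 2
    logloss = -math.log(p if outcome == 1 else 1.0 - p)
    return brier, logloss

# The event was forecast at probability p and did not happen (outcome 0).
for p in (0.6, 0.9, 0.99, 0.999):
    b, ll = penalties(p, 0)
    print(f"p={p}: brier={b:.3f}, log-loss={ll:.3f}")
```

The Brier contribution levels off just under 1, while the log-loss contribution keeps growing: above 4.6 at 99% and above 6.9 at 99.9%. A single such forecast can dominate a mean log loss over a small sample.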

The reliability diagram

The engine returns 20 bins for the reliability diagram. The diagram plots mean predicted probability per bin against observed frequency per bin. Perfect calibration is the y = x line; deviation from that line is the reliability term in numeric form.

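For visual reading, look for systematic over- or under-prediction. A curve that sits above y = x means the forecaster systematically under-predicts (probabilities are lower than realised frequencies). A curve below means over-prediction (probabilities are higher than realised frequencies). Over-confidence in the stricter sense shows up as a curve flatter than y = x: below the line at high probabilities and above it at low ones. All of these patterns are fixable with isotonic calibration.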
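The points on the diagram can be computed as in the sketch below, again assuming equal-width bins, since the engine's binning rule is not specified here. The function name and toy data are hypothetical.

```python
def reliability_points(probs, outcomes, n_bins=20):
    """Per-bin (mean predicted probability, observed frequency, count),
    lowest bin first. Empty bins are skipped."""
    bins = {}
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins.setdefault(k, []).append((p, y))
    points = []
    for k in sorted(bins):
        members = bins[k]
        n_k = len(members)
        mean_p = sum(p for p, _ in members) / n_k
        freq = sum(y for _, y in members) / n_k
        points.append((mean_p, freq, n_k))
    return points

# Hypothetical toy data: forecasts that run too high in both buckets.
toy_probs = [0.8] * 5 + [0.3] * 5
toy_outcomes = [1, 1, 0, 0, 1] + [0, 0, 0, 1, 0]
points = reliability_points(toy_probs, toy_outcomes, n_bins=5)
```

Both toy points have an observed frequency below the mean predicted probability (0.2 against 0.3, and 0.6 against 0.8), so they fall below y = x. Carrying the count with each point matters when reading the plot, because bins with one or two forecasts say little on their own.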

Production usage

For a production LLM finance forecast pipeline:

Log every forecast and outcome. Run the model in production with full probability output; capture the eventual realised outcome.
Run the scoring sandbox monthly. Brier and log-loss over the last 100+ forecasts.
Decompose Brier monthly. Watch reliability, resolution, and uncertainty separately. Reliability drift is the calibration regression signal.
Recalibrate quarterly. Re-fit the isotonic calibrator on the most recent 200+ forecasts.

The forecasting cycle is operational, not one-off. Models drift; markets shift regimes; the calibration that worked on Q1 data may not work on Q3 data.
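The logging and monthly scoring steps might look like the sketch below. Everything in it is a hypothetical illustration: the record fields, the function name, and the rule of returning `None` until enough forecasts have resolved are assumptions, not part of the sandbox.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastRecord:
    forecast_id: str
    prob: float                    # probability stated at forecast time
    outcome: Optional[int] = None  # 0 or 1, filled in once the event resolves

def rolling_brier(records, window=100):
    """Brier over the most recent `window` resolved forecasts.

    Returns None when fewer than `window` forecasts have resolved, so a
    thin sample is never reported as if it were a full one.
    """
    resolved = [r for r in records if r.outcome is not None]
    if len(resolved) < window:
        return None
    recent = resolved[-window:]
    return sum((r.prob - r.outcome) ** 2 for r in recent) / window

# Hypothetical log; "b" has not resolved yet and is skipped.
log = [
    ForecastRecord("a", 0.9, 1),
    ForecastRecord("b", 0.2),
    ForecastRecord("c", 0.6, 0),
    ForecastRecord("d", 0.7, 1),
]
recent_score = rolling_brier(log, window=2)  # scores "c" and "d"
too_few = rolling_brier(log, window=5)       # None: only 3 resolved
```

Keeping unresolved forecasts in the same log, with the outcome left empty, means the probability is recorded at forecast time and cannot be revised after the fact.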

Failure modes

Quoting Brier without the decomposition. Brier alone hides the calibration vs resolution question.
Running on tiny samples. 15 observations give a Brier with roughly ±0.09 standard error. Plan for 100+.
Treating log-loss as a probability. It is a negative log-likelihood, dimensionless, not interpretable in isolation. Use it as a relative ranking signal.
Skipping the production logging step. Models cannot be re-calibrated against outcomes you did not log.

FAQ

Should I prefer Brier or log-loss?

For finance applications where confident wrong calls cost more, log-loss is the primary metric. Brier is the sanity check. Report both in any production calibration report.

Why is the resolution equal to uncertainty?

It means the observed frequency in every used bin is 0 or 1, so the binned forecasts separate the outcomes instead of restating the long-run average. That is what you want, although with 15 forecasts in 14 used bins it is partly a binning artifact. The remaining failure is calibration (reliability term), not resolution.

Can I trust the Brier number from 15 forecasts?

Not as a precise estimate. Treat it as a noisy point estimate with roughly ±0.09 standard error. Use it as the start of a forecasting cycle, not as a final number. Build to 100+ forecasts before publishing or trading on the Brier value.

Connects to

Isotonic Calibration for LLM Forecasts: the fix when reliability is poor.
Brier Scores and Log Loss for Forecasters: extended treatment.
Bayesian Updating for LLM Forecasts: multi-source forecast combination.
Calibration Drift for LLM Confidence Scores: when calibration ages.
Forecast Scoring Sandbox: score your own forecasts.
Forecast Scoring Sandbox methodology: full input/output specification.

References

1. Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology 12(4), 595–600. journals.ametsoc.org
2. Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78(1), 1–3. journals.ametsoc.org
3. Gneiting, T., & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association 102(477), 359–378. tandfonline.com
4. Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

Verified engine output

15 binary forecasts, 20 reliability bins

Inputs
n_bins	20
forecasts (15 items)	[...]

Result
brier › brier	0.15444666666666668
brier › reliability	0.15441666666666667
brier › resolution	0.24888888888888885
brier › uncertainty	0.24888888888888888
brier › base rate	0.5333333333333333
brier › count	15
brier › bins used	14
log loss	0.47000646507046634
bins (20 items)	[...]

Computed live at build time.