Methodology · Tool · Last updated 2026-04-23
How the Forecast Scoring Sandbox works
How the Forecast Scoring Sandbox turns a stream of probabilistic predictions into Brier, log-loss, a Murphy decomposition, bootstrap confidence intervals, and a reliability diagram — entirely in your browser.
What the tool computes
You paste or upload a CSV with two columns: a predicted probability in [0, 1] and an observed binary outcome {0, 1}. From that stream the tool derives four things:
- The mean Brier score, the standard quadratic loss for probabilistic binary forecasts.
- The mean log loss (natural log), the scoring rule implied by cross-entropy.
- A three-way Murphy decomposition of Brier into reliability, resolution, and uncertainty.
- 95% percentile bootstrap confidence intervals around every metric.
The reliability diagram visualises the same data the decomposition is computed from: predicted probability (binned) on the x axis, observed frequency on the y axis, dot size proportional to sample count, and the perfect-calibration diagonal drawn for reference.
Inputs and assumptions
- The CSV may include a header row. Column names matching `prob`, `probability`, `predicted`, `p`, `pred`, or `forecast` are recognised for the probability column; `outcome`, `y`, `actual`, `observed`, or `result` for the outcome column.
- Separators are flexible: commas, tabs, semicolons, or whitespace. Lines starting with `#` are treated as comments.
- Rows with probabilities outside `[0, 1]` or outcomes other than `0` or `1` are silently dropped, so a single malformed row does not invalidate the run (see the parsing sketch after this list).
- The "Load 100-case demo" button fills the textarea with a synthetic, deterministic stream. It is generated with a seeded PRNG inside the engine module; no external data is fetched or implied.
- Bootstrap resampling is seeded per metric so identical inputs produce identical CIs across re-renders.
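For concreteness, the row-handling rule can be sketched in a few lines of TypeScript. This is an illustration of the behaviour described above, not the engine's actual source; `parseRow` is an illustrative name, and header detection and column matching are omitted.

```ts
// Parse one input line into [probability, outcome], or null if the
// row should be skipped (blank, comment, or malformed).
function parseRow(line: string): [number, number] | null {
  const t = line.trim();
  if (t === "" || t.startsWith("#")) return null;          // blank or comment line
  const parts = t.split(/[\s,;]+/);                        // comma, semicolon, tab, or space
  if (parts.length < 2) return null;
  const p = Number(parts[0]);
  const y = Number(parts[1]);
  if (!Number.isFinite(p) || p < 0 || p > 1) return null;  // probability outside [0, 1]
  if (y !== 0 && y !== 1) return null;                     // outcome must be exactly 0 or 1
  return [p, y];
}
```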
Formulas
Brier(p, y) = (1/N) · Σ (p_i − y_i)²
LogLoss(p, y) = −(1/N) · Σ [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
with p clamped to [1e-9, 1 − 1e-9]
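Both scores reduce to a few lines. A minimal TypeScript sketch of the two formulas above (the function names are illustrative, not the engine's):

```ts
const EPS = 1e-9; // clamp bound from the log-loss formula above

// Mean Brier score: average squared gap between forecast and outcome.
function brier(p: number[], y: number[]): number {
  return p.reduce((sum, pi, i) => sum + (pi - y[i]) ** 2, 0) / p.length;
}

// Mean log loss with natural log; p is clamped to [EPS, 1 - EPS] so a
// forecast of exactly 0 or 1 on the wrong side stays finite.
function logLoss(p: number[], y: number[]): number {
  let total = 0;
  for (let i = 0; i < p.length; i++) {
    const c = Math.min(Math.max(p[i], EPS), 1 - EPS);
    total += y[i] === 1 ? Math.log(c) : Math.log(1 - c);
  }
  return -total / p.length;
}
```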
Bin forecasts into K equal-width buckets over [0, 1].
For bin k with n_k observations, mean pred p̄_k, observed freq ō_k,
and overall base rate ȳ = mean(y_i):
reliability = Σ_k (n_k / N) · (p̄_k − ō_k)²
resolution = Σ_k (n_k / N) · (ō_k − ȳ)²
uncertainty = ȳ · (1 − ȳ)
Identity (Murphy 1973):
Brier = reliability − resolution + uncertainty

Reliability is the weighted squared gap between what you predicted and what actually happened, per bin; it is the component you drive down by calibrating. Resolution is the variation in observed frequency across bins; it is the component you drive up by genuinely discriminating, i.e. separating the occasions where the event happens from the occasions where it does not. Uncertainty is a property of the outcomes alone — a 50/50 problem has the most head-room; a 99/1 problem has almost none.
The "Check: rel − res + unc" row in the results table exposes the identity so you can see it hold to rounding error. It is a small honesty signal — the tool is not doing anything unusual behind the scenes.
Bootstrap methodology
Confidence intervals use the percentile bootstrap (Efron, 1979). For each resample we draw N forecasts with replacement from your uploaded stream, recompute the metric, and collect B resampled values. The reported 95% interval runs from the 2.5th to the 97.5th percentile of that distribution. B defaults to 1000, but you can crank it up to 10,000 if you want tighter percentile estimates on small samples.
The percentile bootstrap is a first-order method: it assumes your sample is representative and resamples within it. That is fine for most forecast streams but will understate uncertainty if your predictions are highly autocorrelated (e.g. daily weather forecasts over the same week). For correlated streams a block bootstrap is more honest; the sandbox does not implement one because most users will be feeding it de-correlated decision-grade forecasts.
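A sketch of the resampling loop. The text above only says the bootstrap is seeded per metric; the specific PRNG below (mulberry32) and the function names are assumptions for illustration:

```ts
// mulberry32: a tiny seedable PRNG, so identical inputs give identical CIs.
// (Assumed here; the article only specifies "seeded per metric".)
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap: resample (p, y) pairs with replacement B times,
// recompute the metric, and report the 2.5th/97.5th percentiles.
function bootstrapCI(
  p: number[],
  y: number[],
  metric: (p: number[], y: number[]) => number,
  B = 1000,
  seed = 1,
): [number, number] {
  const rand = mulberry32(seed);
  const N = p.length;
  const stats: number[] = [];
  for (let b = 0; b < B; b++) {
    const rp = new Array<number>(N);
    const ry = new Array<number>(N);
    for (let i = 0; i < N; i++) {
      const j = Math.floor(rand() * N); // drawing an index keeps p_j paired with its y_j
      rp[i] = p[j];
      ry[i] = y[j];
    }
    stats.push(metric(rp, ry));
  }
  stats.sort((a, b) => a - b);
  return [stats[Math.floor(0.025 * (B - 1))], stats[Math.ceil(0.975 * (B - 1))]];
}

// e.g. const [lo, hi] = bootstrapCI(p, y, brier);
```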
Why sparse bins are drawn lighter
A reliability diagram plots one point per bin. With 100 forecasts and 10 bins you get on average 10 per bin — but predictions tend to cluster, so some bins will hold one or two observations. The observed frequency in a two-obs bin is either 0.0, 0.5, or 1.0, and none of those three positions tells you anything about calibration in that region. Drawing every bin at the same visual weight invites the eye to over-interpret the noisy ones.
The sandbox marks bins with fewer than max(5, ceil(N / 50)) observations as "sparse" and fades them to a neutral grey. The same bins still contribute to the Brier, log-loss, and decomposition numbers above — you are only being told "do not read a calibration story from this single point." The binning is transparent: a small bar chart along the bottom shows per-bin count on the same x axis.
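The sparse-bin rule is small enough to state directly (a sketch; names illustrative):

```ts
// A bin is sparse when it holds fewer than max(5, ceil(N / 50)) points.
// Sparse bins are faded in the diagram but still count toward the scores.
function sparseFlags(binCounts: number[], N: number): boolean[] {
  const threshold = Math.max(5, Math.ceil(N / 50));
  return binCounts.map((count) => count < threshold);
}
// With N = 100 the threshold is max(5, 2) = 5; the rule only rises
// above 5 once the stream holds more than 250 forecasts.
```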
Limitations
- Binary outcomes only. Multi-class Brier and ranked probability scoring are out of scope; a future tool can handle them.
- Forecast streams are treated as iid. If yours are autocorrelated, bootstrap CIs will be narrower than they should be.
- The decomposition depends on the bin count K. With too few bins reliability hides inside individual bins; with too many it becomes noisy. The default is 10; adjust and watch the identity check move.
- This is an evaluation tool, not a recommender. It does not tell you whether your forecaster is "good enough" — only how to read the numbers.
Related articles
- Brier scores and log loss for forecasters — the primer this tool operationalises.
- Bayesian updating for LLM forecasts — where the probabilities you score should come from.
- Isotonic calibration for LLM forecasts — the post-processing step that drives reliability toward zero.
- Research diary schema that stays auditable — how to log forecasts so you can feed them straight into this sandbox.
Changelog
- 2026-04-23 — Initial release. Brier + decomposition + log loss + percentile bootstrap + reliability diagram with sparse-bin downweighting.