Forecast Scoring Sandbox

Paste a CSV of predictions to get the Brier score, log loss, the Murphy decomposition, bootstrap CIs, and a reliability diagram.

AI Fin Hub Research · Published Apr 23, 2026

Inputs: Paste + configure
Runtime: 1–15 s
Privacy: Client-side · no upload
API key: Not required

1 · Configure — paste or upload a forecast stream

Forecast CSV (columns: prob_predicted, outcome_observed)

Parsed 100 rows
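A minimal sketch of how a CSV in this format could be parsed and summarised. The function name parse_forecasts and its validation rules (probabilities in [0, 1], outcomes exactly 0 or 1) are assumptions for illustration; the sandbox's own parser may behave differently.

```python
import csv
import io

def parse_forecasts(text):
    """Parse CSV text with columns prob_predicted, outcome_observed.

    Hypothetical helper: returns (probs, outcomes) and raises ValueError
    on values outside the assumed valid ranges.
    """
    probs, outcomes = [], []
    reader = csv.DictReader(io.StringIO(text.strip()))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        p = float(row["prob_predicted"])
        y = int(row["outcome_observed"])
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"line {line_no}: probability {p} outside [0, 1]")
        if y not in (0, 1):
            raise ValueError(f"line {line_no}: outcome {y} is not 0 or 1")
        probs.append(p)
        outcomes.append(y)
    return probs, outcomes

sample = """prob_predicted,outcome_observed
0.80,1
0.25,0
0.60,1
0.10,0
"""
probs, outcomes = parse_forecasts(sample)
print(len(probs), "rows parsed")
print("prediction range:", min(probs), "to", max(probs))
print("base rate:", sum(outcomes) / len(outcomes))
```

The last two printed lines correspond to the "Prediction range" and "Base rate (observed)" readouts shown after parsing.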

Bootstrap resamples

Used for 95% CI on every metric.

Reliability bins

Bin count for decomposition + diagram.

Prediction range

0.071 → 0.947

Base rate (observed)

46.0%

2 · Results — scores, decomposition, 95% bootstrap CIs

Brier score

0.2084

95% CI [0.1698, 0.2521]

Lower is better · 0 = perfect

Log loss

0.6072

95% CI [0.5063, 0.7071]

Natural log · lower is better

Forecasts

100

Bins populated: 10/10

Uncertainty

0.2484

ȳ(1 − ȳ), with ȳ the observed base rate · the Brier score of always forecasting ȳ

Component	Value	95% CI	Reading
Reliability	0.0090	[0.0085, 0.0556]	Lower = better calibrated
Resolution	0.0494	[0.0320, 0.1033]	Higher = more discriminating
Uncertainty	0.2484	—	Irreducible from base rate
Check: rel − res + unc	0.2080	—	Should approximately equal Brier (0.2084); exact only if all forecasts in a bin are identical

BS = reliability − resolution + uncertainty (Murphy, 1973)

3 · Reliability diagram

X axis: predicted probability bin. Y axis: observed frequency of outcome = 1 in that bin. Dots on the diagonal are perfectly calibrated. Dots below the diagonal mean the forecasts in that bin ran too high (predicted probability exceeds observed frequency); dots above mean they ran too low. Sparse bins are drawn lighter: with few observations their y position is noisy, so a deviation from the diagonal is not informative on its own.
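The points behind such a diagram can be sketched as follows. This is an illustration under assumptions, not the sandbox's plotting code: the function name reliability_points, the choice to place each dot at the bin's mean prediction, and the sparse_below threshold of 5 are all hypothetical.

```python
def reliability_points(probs, outcomes, n_bins=10, sparse_below=5):
    """Return one dict per populated bin of an equal-width reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Assumed edge rule: p == 1.0 is placed in the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((p, y))

    points = []
    for k, members in enumerate(bins):
        if not members:
            continue  # empty bins produce no dot
        n_k = len(members)
        mean_pred = sum(p for p, _ in members) / n_k
        obs_freq = sum(y for _, y in members) / n_k
        if obs_freq < mean_pred:
            side = "below"  # forecasts ran higher than outcomes
        elif obs_freq > mean_pred:
            side = "above"  # forecasts ran lower than outcomes
        else:
            side = "on"
        points.append({
            "bin": k,
            "x": mean_pred,
            "y": obs_freq,
            "n": n_k,
            "side": side,
            "sparse": n_k < sparse_below,  # hypothetical cutoff for lighter dots
        })
    return points

probs = [0.15, 0.12, 0.18, 0.85, 0.88, 0.82]
outcomes = [0, 0, 1, 1, 1, 0]
points = reliability_points(probs, outcomes)
for pt in points:
    print(pt)
```

With only three observations per bin, both dots in this toy example would be flagged as sparse, which is the case where the distance from the diagonal should be read with caution.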

Formulas

Brier(p, y)     = mean( (p_i - y_i)^2 )
LogLoss(p, y)   = -mean( y_i·log(p_i) + (1-y_i)·log(1-p_i) )
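The two formulas above translate directly into code. This is a minimal sketch; the eps clipping in log_loss is an assumption added to avoid log(0) when a forecast is exactly 0 or 1, and the page does not say whether the sandbox clips.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood, natural log. eps clipping is an assumption."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

probs = [0.8, 0.25, 0.6, 0.1]
outcomes = [1, 0, 1, 0]
print(round(brier_score(probs, outcomes), 4))  # 0.0681
print(round(log_loss(probs, outcomes), 4))     # 0.2818
```

As a reference point, always forecasting 0.5 gives a Brier score of 0.25 and a log loss of ln 2 ≈ 0.6931, whatever the outcomes.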

Bin forecasts into K equal-width buckets of predicted probability.
For bin k with n_k observations, mean pred p̄_k, observed freq ō_k:

reliability  = Σ_k (n_k / N) · (p̄_k − ō_k)²
resolution   = Σ_k (n_k / N) · (ō_k − ȳ)²
uncertainty  = ȳ · (1 − ȳ)             where ȳ = mean(y_i)

Identity:   Brier = reliability − resolution + uncertainty
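The binned decomposition above can be sketched as below. The function name murphy_decomposition and the edge rule (a forecast of exactly 1.0 goes in the last bin) are assumptions; the sandbox may assign bin edges differently.

```python
def murphy_decomposition(probs, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) over equal-width bins."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    count = [0] * n_bins
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # assumed edge rule
        sum_p[k] += p
        sum_y[k] += y
        count[k] += 1

    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        weight = count[k] / n
        p_bar = sum_p[k] / count[k]  # mean prediction in bin k
        o_bar = sum_y[k] / count[k]  # observed frequency in bin k
        reliability += weight * (p_bar - o_bar) ** 2
        resolution += weight * (o_bar - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
print(round(rel, 4), round(res, 4), round(unc, 4))  # 0.0025 0.0625 0.25
print(round(rel - res + unc, 4), round(brier, 4))   # 0.19 0.19
```

In this example every forecast in a bin takes the same value, so the identity holds exactly. When forecasts vary within a bin, rel − res + unc can differ slightly from the Brier score.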

95% CI:  bootstrap-percentile over B resamples with replacement.
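A percentile bootstrap of this kind can be sketched as follows. The names bootstrap_ci, n_resamples and seed are hypothetical, and the simple index-based percentile rule is one convention among several; the sandbox may interpolate differently.

```python
import random

def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def bootstrap_ci(probs, outcomes, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(probs, outcomes).

    Resamples (forecast, outcome) pairs with replacement, recomputes the
    metric on each resample, and reads off the alpha/2 and 1 - alpha/2
    percentiles of the sorted results.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(probs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([probs[i] for i in idx],
                            [outcomes[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi

probs = [0.9, 0.7, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]
lo, hi = bootstrap_ci(probs, outcomes, brier_score)
print("Brier:", round(brier_score(probs, outcomes), 4))
print("95% CI:", round(lo, 4), round(hi, 4))
```

Passing a different metric function, such as one that returns only the reliability term, gives the interval for that component with the same resampling loop.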

See the methodology page for derivations, references, and why sparse bins are shown lighter.

Complementary tools

Calibration Dojo

Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track Brier score and reliability curve over time. All state stored locally.

Backtest Overfitting Score

Upload a backtest trade log and compute Probability of Backtest Overfitting (PBO), Deflated Sharpe Ratio, and the odds your edge survives live trading.

Fractional Kelly Sizer

Map conviction tiers to fractional Kelly bet sizes with a drawdown Monte Carlo simulator. Client-side. Private by default.