Forecast Scoring Sandbox
Paste a CSV of predictions to get the Brier score, log loss, Murphy decomposition, bootstrap 95% confidence intervals, and a reliability diagram.
- Inputs: Paste + configure
- Runtime: 1–15 s
- Privacy: Client-side · no upload
- API key: Not required
1 · Configure — paste or upload a forecast stream
Bootstrap resamples (B): used for the 95% CI on every metric.
Bins (K): bin count for the decomposition and the diagram.
Prediction range
0.071 → 0.947
Base rate (observed)
46.0%
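The summary values above (prediction range and observed base rate) can be reproduced from a pasted CSV with a few lines of Python. This is a minimal sketch, not the sandbox's own parser: the column names `prediction` and `outcome` and the helper `summarize_forecasts` are assumptions made for illustration, since the expected CSV schema is not stated here.

```python
import csv
import io


def summarize_forecasts(csv_text, pred_col="prediction", outcome_col="outcome"):
    """Parse a forecast CSV and return (predictions, outcomes, summary).

    The column names are hypothetical defaults, not a documented schema.
    """
    preds, outcomes = [], []
    for row in csv.DictReader(io.StringIO(csv_text.strip())):
        p = float(row[pred_col])
        y = int(row[outcome_col])
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"prediction outside [0, 1]: {p}")
        if y not in (0, 1):
            raise ValueError(f"outcome must be 0 or 1, got {y}")
        preds.append(p)
        outcomes.append(y)
    if not preds:
        raise ValueError("no forecasts found")
    summary = {
        "n": len(preds),
        "min_pred": min(preds),
        "max_pred": max(preds),
        # Observed base rate: share of forecasts whose outcome was 1.
        "base_rate": sum(outcomes) / len(outcomes),
    }
    return preds, outcomes, summary


sample = """prediction,outcome
0.10,0
0.80,1
0.65,1
0.30,0
"""
preds, outcomes, summary = summarize_forecasts(sample)
print(summary)
```

Validating the range of each prediction and the 0/1 coding of each outcome up front keeps later scoring steps from silently producing meaningless numbers.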
2 · Results — scores, decomposition, 95% bootstrap CIs
Brier score
0.2084
95% CI [0.1698, 0.2521]
Lower is better · 0 = perfect
Log loss
0.6072
95% CI [0.5063, 0.7071]
Natural log · lower is better
Forecasts
100
Bins populated: 10/10
Uncertainty
0.2484
ȳ(1 − ȳ) · Brier score of always forecasting the base rate ȳ
| Component | Value | 95% CI | Reading |
|---|---|---|---|
| Reliability | 0.0090 | [0.0085, 0.0556] | Lower = better calibrated |
| Resolution | 0.0494 | [0.0320, 0.1033] | Higher = more discriminating |
| Uncertainty | 0.2484 | — | Irreducible from base rate |
| Check: rel − res + unc | 0.2080 | — | Should approximately equal Brier (0.2084); exact only when all forecasts within a bin are identical |
BS = reliability − resolution + uncertainty (Murphy, 1973)
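The decomposition in the table can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the sandbox's source: the function name `murphy_decomposition` is hypothetical, bins are equal-width, and a forecast of exactly 1.0 is assigned to the last bin.

```python
def murphy_decomposition(preds, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty) using equal-width bins."""
    n = len(preds)
    base_rate = sum(outcomes) / n

    # Group (prediction, outcome) pairs by bin; p = 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((p, y))

    reliability = 0.0
    resolution = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing to either sum
        n_k = len(members)
        mean_pred = sum(p for p, _ in members) / n_k
        obs_freq = sum(y for _, y in members) / n_k
        reliability += (n_k / n) * (mean_pred - obs_freq) ** 2
        resolution += (n_k / n) * (obs_freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# Forecasts are identical within each bin, so the identity holds exactly.
preds = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 0, 1, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(preds, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)
print(rel, res, unc, rel - res + unc, brier)
```

When forecasts vary inside a bin, the three components no longer sum exactly to the Brier score, which is the kind of small gap seen in the check row (0.2080 against 0.2084).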
3 · Reliability diagram
X axis: predicted probability bin. Y axis: observed frequency of outcome=1 in that bin. Dots on the diagonal = perfectly calibrated. Dots below the diagonal = overconfident (predictions exceed observed). Dots above = underconfident. Sparse bins are drawn lighter: with few observations, their y position is noisy and the deviation from the diagonal is not informative on its own.
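The points on such a diagram can be computed as in the sketch below. It is an assumption-laden illustration rather than the sandbox's code: the function name `reliability_points` and the `sparse_below` threshold of 5 are hypothetical, since the cutoff the tool uses for drawing a bin lighter is not stated here.

```python
def reliability_points(preds, outcomes, n_bins=10, sparse_below=5):
    """Return one dict per populated bin for plotting a reliability diagram."""
    counts = [0] * n_bins
    pred_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(preds, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        counts[k] += 1
        pred_sums[k] += p
        hits[k] += y

    points = []
    for k in range(n_bins):
        if counts[k] == 0:
            continue  # nothing to plot for an empty bin
        points.append({
            "bin": (k / n_bins, (k + 1) / n_bins),   # x axis: probability bin
            "mean_pred": pred_sums[k] / counts[k],   # mean forecast in the bin
            "obs_freq": hits[k] / counts[k],         # y axis: share of outcome=1
            "n": counts[k],
            "sparse": counts[k] < sparse_below,      # hypothetical cutoff
        })
    return points


preds = [0.12, 0.15, 0.18, 0.81, 0.83, 0.85, 0.87, 0.89]
outcomes = [0, 0, 1, 1, 1, 1, 0, 1]
points = reliability_points(preds, outcomes)
for pt in points:
    print(pt)
```

A point with `obs_freq` below its predicted probability sits under the diagonal, which the text above reads as overconfidence; a point flagged `sparse` rests on too few observations for that reading to be trusted on its own.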
Formulas
Brier(p, y) = mean( (p_i - y_i)^2 )

LogLoss(p, y) = -mean( y_i·log(p_i) + (1-y_i)·log(1-p_i) )

Bin forecasts into K equal-width buckets of predicted probability. For bin k with n_k observations, mean pred p̄_k, observed freq ō_k:

reliability = Σ_k (n_k / N) · (p̄_k − ō_k)²

resolution = Σ_k (n_k / N) · (ō_k − ȳ)²

uncertainty = ȳ · (1 − ȳ), where ȳ = mean(y_i)

Identity: Brier = reliability − resolution + uncertainty

95% CI: bootstrap-percentile over B resamples with replacement.
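The two scores and the percentile bootstrap can be sketched in standard-library Python as below. This is one possible reading of the formulas, not the sandbox's implementation: the function names, the clipping constant `eps` (added so a forecast of exactly 0 or 1 does not hit log(0)), the nearest-rank percentile rule, and the fixed `seed` are all assumptions for illustration.

```python
import math
import random


def brier(preds, outcomes):
    """Mean squared difference between forecast and outcome."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)


def log_loss(preds, outcomes, eps=1e-15):
    """Mean negative log-likelihood, natural log, with clipped forecasts."""
    total = 0.0
    for p, y in zip(preds, outcomes):
        p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(preds)


def bootstrap_ci(metric, preds, outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample (p_i, y_i) pairs with replacement."""
    rng = random.Random(seed)
    n = len(preds)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([preds[i] for i in idx], [outcomes[i] for i in idx]))
    stats.sort()
    alpha = (1 - level) / 2
    # Nearest-rank percentiles of the sorted bootstrap statistics.
    lo = stats[round(alpha * (n_resamples - 1))]
    hi = stats[round((1 - alpha) * (n_resamples - 1))]
    return lo, hi


preds = [0.9, 0.8, 0.3, 0.2, 0.6, 0.4]
outcomes = [1, 1, 0, 0, 1, 0]
bs = brier(preds, outcomes)
ll = log_loss(preds, outcomes)
ci_lo, ci_hi = bootstrap_ci(brier, preds, outcomes, n_resamples=500)
print(bs, ll, (ci_lo, ci_hi))
```

Resampling forecast and outcome together as pairs keeps each prediction attached to its own result, so the interval reflects sampling variability in the forecast stream rather than shuffling outcomes across forecasts.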
See the methodology page for derivations, references, and why sparse bins are shown lighter.
Complementary tools
Users of this tool often explore
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track Brier score and reliability curve over time. All state stored locally.
Backtest Overfitting Score
Upload a backtest trade log and compute Probability of Backtest Overfitting (PBO), Deflated Sharpe Ratio, and the odds your edge survives live trading.
Fractional Kelly Sizer
Map conviction tiers to fractional Kelly bet sizes with a drawdown Monte Carlo simulator. Client-side. Private by default.