How to use Forecast Scoring Sandbox
Paste a forecast stream of (probability, outcome) pairs. The page computes Brier score with full decomposition, log loss, reliability diagram, and bootstrap confidence intervals — the diagnostics for whether a model is calibrated, lucky, or both.
What It Does
Use the calculator with intent
Quants and PMs running forecasting models who need to score them on calibration, not just hit rate — and need bootstrap CIs to tell skill from luck.
Interpreting Results
Brier decomposition is the headline: the score splits into reliability (calibration error; lower is better), resolution (informativeness; higher is better), and uncertainty (the outcome base rate's intrinsic variance). A high Brier score is bad; check whether a large reliability term or a small resolution term is the bigger contributor.
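The decomposition can be sketched in a few lines. A minimal Python version of the Murphy decomposition, assuming binary outcomes and equal-width probability bins (the function name and binning choice are mine for illustration, not the page's internals):

```python
import numpy as np

def brier_decomposition(p, y, n_bins=10):
    """Murphy decomposition: Brier ~= reliability - resolution + uncertainty.
    The identity is exact when all forecasts inside a bin are identical;
    with equal-width binning it is an approximation."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(p)
    base_rate = y.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    reliability = 0.0
    resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        p_bar = p[mask].mean()   # mean forecast in this bin
        y_bar = y[mask].mean()   # observed frequency in this bin
        reliability += n_b / n * (p_bar - y_bar) ** 2
        resolution += n_b / n * (y_bar - base_rate) ** 2

    brier = float(np.mean((p - y) ** 2))
    return brier, reliability, resolution, uncertainty
```

A perfectly calibrated but uninformative forecaster (always predicting the base rate) gets reliability 0 and resolution 0, so the Brier score equals the uncertainty term.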
Input Steps
Field by field
- 1
Upload data
Upload or enter probabilistic forecasts paired with realized outcomes (binary, categorical, or continuous).
- 2
Pick option
Pick a scoring rule: Brier (squared error), log loss (penalizes overconfidence harshly), or CRPS (for continuous distributions).
- 3
Read outputs
Read your aggregate score and the per-forecast breakdown. Compare against the base-rate benchmark: scoring below the benchmark suggests skill.
- 4
Read calibration plot
Read the calibration plot to see where you're miscalibrated (overconfident or underconfident at specific probability ranges).
- 5
Check sample size
Run at least 50 forecasts before drawing conclusions. Below 50, scores are dominated by sample noise.
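The steps above can be strung together end-to-end. A sketch with simulated data, assuming binary outcomes; the helper names and the constant base-rate benchmark are my illustration, not the page's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(p, y):
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))

def bootstrap_ci(p, y, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI on the Brier score: resample
    (forecast, outcome) pairs with replacement and re-score."""
    p, y = np.asarray(p), np.asarray(y)
    n = len(p)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(brier(p[idx], y[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Simulated, well-calibrated forecasts for illustration.
p = rng.uniform(0, 1, 200)
y = (rng.uniform(0, 1, 200) < p).astype(float)

score = brier(p, y)
base_rate = y.mean()
benchmark = brier(np.full_like(p, base_rate), y)  # constant base-rate forecast
lo, hi = bootstrap_ci(p, y)
# Claim skill only if the whole CI sits below the benchmark,
# not just the point estimate.
```

The benchmark here is the forecaster that always predicts the historical base rate; beating it on the point estimate alone is not enough, which is why the CI matters.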
Common Scenarios
Use realistic starting points
Earnings beat/miss predictions
Predictions
100
Outcome
binary
Uncertainty often dominates: the model is roughly calibrated (small reliability term) but not very informative (low resolution). Improve resolution before polishing calibration.
Macro event predictions
Predictions
50
Outcome
binary
Sample too small to distinguish skill from luck; the bootstrap CI on the Brier score is wide. Add more predictions before claiming an edge.
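To see why 50 predictions is too few, compare 95% bootstrap CI widths on the Brier score at n = 50 versus n = 500 with simulated, well-calibrated forecasts (a sketch; exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)

def brier_ci_width(n, n_boot=2000):
    """95% percentile-bootstrap CI width on the Brier score
    for n simulated, well-calibrated forecasts."""
    p = rng.uniform(0, 1, n)
    y = (rng.uniform(0, 1, n) < p).astype(float)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(float(np.mean((p[idx] - y[idx]) ** 2)))
    lo, hi = np.quantile(scores, [0.025, 0.975])
    return hi - lo

# The CI shrinks roughly like 1/sqrt(n): expect it to be
# about 3x narrower at n=500 than at n=50.
width_50 = brier_ci_width(50)
width_500 = brier_ci_width(500)
```

At n = 50 the interval typically spans much of the gap between a skilled score and the base-rate benchmark, which is exactly the "can't tell skill from luck" regime the scenario describes.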
Try These Tools
Run the numbers next
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track your Brier score and reliability curve over time.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
FAQ
Questions people ask next
The short answers readers usually want after the first pass.
Related Content
Keep the topic connected
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.