How to use Forecast Scoring Sandbox
Paste a forecast stream of (probability, outcome) pairs. The page computes Brier score with full decomposition, log loss, reliability diagram, and bootstrap confidence intervals — the diagnostics for whether a model is calibrated, lucky, or both.
What It Does
Use the calculator with intent
Quants and PMs running forecasting models who need to score them on calibration, not just hit rate — and need bootstrap CIs to tell skill from luck.
Interpreting Results
Brier decomposition is the headline: the score splits into reliability (calibration error; lower is better), resolution (informativeness; higher is better), and uncertainty (the outcome base rate's intrinsic variance). A high Brier score is bad; check whether a large reliability term or a small resolution term is the bigger contributor.
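The decomposition can be sketched in a few lines. A minimal Python version of the Murphy decomposition, assuming binary outcomes and equal-width probability bins (the function name and binning choice are mine for illustration, not the page's internals):

```python
import numpy as np

def brier_decomposition(p, y, n_bins=10):
    """Murphy decomposition: Brier ~= reliability - resolution + uncertainty.
    The identity is exact when all forecasts inside a bin are identical;
    with equal-width binning it is an approximation."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(p)
    base_rate = y.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    reliability = 0.0
    resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        p_bar = p[mask].mean()   # mean forecast in this bin
        y_bar = y[mask].mean()   # observed frequency in this bin
        reliability += n_b / n * (p_bar - y_bar) ** 2
        resolution += n_b / n * (y_bar - base_rate) ** 2

    brier = float(np.mean((p - y) ** 2))
    return brier, reliability, resolution, uncertainty
```

A perfectly calibrated but uninformative forecaster (always predicting the base rate) gets reliability 0 and resolution 0, so the Brier score equals the uncertainty term.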
Input Steps
Field by field
- 1
Upload data
Upload or enter probabilistic forecasts paired with realized outcomes (binary, categorical, or continuous).
- 2
Pick option
Pick a scoring rule: Brier (squared error), log loss (penalizes overconfidence harshly), or CRPS (for continuous distributions).
- 3
Read outputs
Read your aggregate score and the per-forecast breakdown. Compare against the base-rate benchmark: scoring below the benchmark suggests skill.
- 4
Read calibration plot
Read the calibration plot to see where you're miscalibrated (overconfident or underconfident at specific probability ranges).
- 5
Check sample size
Run at least 50 forecasts before drawing conclusions. Below 50, scores are dominated by sample noise.
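The steps above can be strung together end-to-end. A sketch with simulated data, assuming binary outcomes; the helper names and the constant base-rate benchmark are my illustration, not the page's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(p, y):
    return float(np.mean((np.asarray(p) - np.asarray(y)) ** 2))

def bootstrap_ci(p, y, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI on the Brier score: resample
    (forecast, outcome) pairs with replacement and re-score."""
    p, y = np.asarray(p), np.asarray(y)
    n = len(p)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(brier(p[idx], y[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Simulated, well-calibrated forecasts for illustration.
p = rng.uniform(0, 1, 200)
y = (rng.uniform(0, 1, 200) < p).astype(float)

score = brier(p, y)
base_rate = y.mean()
benchmark = brier(np.full_like(p, base_rate), y)  # constant base-rate forecast
lo, hi = bootstrap_ci(p, y)
# Claim skill only if the whole CI sits below the benchmark,
# not just the point estimate.
```

The benchmark here is the forecaster that always predicts the historical base rate; beating it on the point estimate alone is not enough, which is why the CI matters.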
Common Scenarios
Use realistic starting points
Earnings beat/miss predictions
Predictions
100
Outcome
binary
Uncertainty often dominates: the model is roughly calibrated (small reliability term) but not very informative (low resolution). Improve resolution before polishing calibration.
Macro event predictions
Predictions
50
Outcome
binary
Sample too small to distinguish skill from luck; the bootstrap CI on the Brier score is wide. Add more predictions before claiming an edge.
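To see why 50 predictions is too few, compare 95% bootstrap CI widths on the Brier score at n = 50 versus n = 500 with simulated, well-calibrated forecasts (a sketch; exact numbers vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)

def brier_ci_width(n, n_boot=2000):
    """95% percentile-bootstrap CI width on the Brier score
    for n simulated, well-calibrated forecasts."""
    p = rng.uniform(0, 1, n)
    y = (rng.uniform(0, 1, n) < p).astype(float)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(float(np.mean((p[idx] - y[idx]) ** 2)))
    lo, hi = np.quantile(scores, [0.025, 0.975])
    return hi - lo

# The CI shrinks roughly like 1/sqrt(n): expect it to be
# about 3x narrower at n=500 than at n=50.
width_50 = brier_ci_width(50)
width_500 = brier_ci_width(500)
```

At n = 50 the interval typically spans much of the gap between a skilled score and the base-rate benchmark, which is exactly the "can't tell skill from luck" regime the scenario describes.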
Try These Tools
Run the numbers next
Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; track your Brier score and reliability curve over time.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.
FAQ
Questions people ask next
The short answers readers usually want after the first pass.
Related Content
Keep the topic connected
Agent Skill Testing
Agent skill testing: the regression-test discipline for LLM-driven agents. What to test, how to score, and the difference between pass-rate and capability.
Model Drift
Model drift: when an LLM's behavior changes between calls, versions, or weeks. The monitoring stack that catches it before production breaks.