How is the Brier score calculated?

The Brier score is the mean squared error between a probability forecast and the outcome (coded 1 if the event happened, 0 if not). For a single forecast: (probability − outcome)². For a series: average across forecasts. Lower is better. A forecaster who always says 50% scores 0.25; a calibrated forecaster who also discriminates between outcomes scores below that.
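As a minimal sketch of that calculation (the function name and the sample forecasts are illustrative, not taken from the dojo itself):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Always saying 50% scores 0.25 no matter how the questions resolve.
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Hypothetical forecaster who leans the right way on each question.
sharp = brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0])

print(coin_flip, sharp)
```

The second forecaster scores lower (better) because each probability sits closer to what actually happened.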

What's the difference between calibration and resolution?

Calibration is whether your stated probabilities match observed frequencies. Resolution is whether you give different probabilities to questions that turn out differently (vs. 50% on everything). A forecaster who always says 50% on questions that resolve yes half the time is perfectly calibrated but has zero resolution, which makes them useless. The Brier score combines both.
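One standard way to see how the Brier score combines the two is the Murphy decomposition: Brier = reliability − resolution + uncertainty, where reliability is the calibration error term. The sketch below assumes forecasts are grouped by their exact stated probability; the function name and data are hypothetical, and this is not necessarily how the dojo computes its numbers.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by stated probability."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0  # calibration error: stated probability vs. observed hit rate
    resolution = 0.0   # how far each group's hit rate sits from the base rate
    for p, hits in groups.items():
        hit_rate = sum(hits) / len(hits)
        reliability += len(hits) * (p - hit_rate) ** 2
        resolution += len(hits) * (hit_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecaster: says 80% or 20%, and is right at exactly those rates.
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
rel, res, unc = brier_decomposition(forecasts, outcomes)
brier = rel - res + unc

# Always saying 50% on the same questions: calibrated, but zero resolution.
flat_rel, flat_res, flat_unc = brier_decomposition([0.5] * 10, outcomes)
```

Both forecasters have zero calibration error here; only the first has resolution, which is what pulls the score below 0.25.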

What's a common mistake when using Calibration Dojo?

Anchoring on round confidence levels (50% / 75% / 90%) because they're easy to remember. Use the slider — your calibration improves when you can express 62%.

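Another common mistake is quitting after a bad streak. Streaks happen; calibration is measured over hundreds of questions, not dozens.

How do I read the reliability diagram correctly?

Compare each point to the diagonal. Points above the diagonal mean you're underconfident at that confidence bucket; points below mean you're overconfident.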


How to use Calibration Dojo

Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; the page tracks Brier score and reliability curve over time so you can tell whether you're miscalibrated or just unlucky.

5 steps · Published May 12, 2026

By Orbyd Editorial · AI Fin Hub Team



What It Does

Use the calculator with intent

Built for traders and PMs who want to know whether their 70% confidence calls actually win 70% of the time, and what to do when they don't.

Interpreting Results

The reliability curve is the headline. Points above the diagonal mean you're underconfident at that bucket; points below mean you're overconfident. Most untrained forecasters cluster well below the line in the 70–90% confidence range, which is overconfidence.
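The points on a reliability curve can be sketched as a table of buckets: mean stated probability against observed hit rate. The ten equal-width buckets, the function name, and the sample data below are assumptions for illustration; the dojo's own bucketing may differ.

```python
def reliability_table(forecasts, outcomes, n_buckets=10):
    """Bucket forecasts by stated probability and compare with the observed hit rate."""
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_buckets), n_buckets - 1)  # p == 1.0 lands in the top bucket
        buckets[idx].append((p, o))

    rows = []
    for items in buckets:
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(o for _, o in items) / len(items)
        if hit_rate > mean_p:
            verdict = "underconfident"  # point sits above the diagonal
        elif hit_rate < mean_p:
            verdict = "overconfident"   # point sits below the diagonal
        else:
            verdict = "calibrated"
        rows.append((round(mean_p, 3), round(hit_rate, 3), len(items), verdict))
    return rows


# Hypothetical session: 80% calls hit 3 of 5, 30% calls hit 2 of 4.
forecasts = [0.8] * 5 + [0.3] * 4
outcomes = [1, 1, 0, 0, 1, 1, 1, 0, 0]
rows = reliability_table(forecasts, outcomes)
for mean_p, hit_rate, count, verdict in rows:
    print(f"stated {mean_p:.0%}  hit {hit_rate:.0%}  n={count}  {verdict}")
```

With only a handful of forecasts per bucket the verdicts are noisy, so treat them as a prompt to collect more data rather than a diagnosis.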

Input Steps

Field by field

1. Make probabilistic forecasts on questions with eventual ground truth (binary outcomes work best for entry-level practice).

2. Resolve forecasts as outcomes become known. The dojo computes Brier score, log score, calibration curve, and resolution.

3. Read the calibration plot: do your 70%-confident forecasts hit at 70%? If they hit at 50%, you're overconfident.

4. Compare to the base-rate benchmark. Beating 50/50 is necessary; the gap between your score and the benchmark is your skill.

5. Make at least 50 forecasts before drawing conclusions. Calibration estimates are noisy below that.
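The scoring and benchmark steps above can be sketched in a few functions. Several choices here are assumptions rather than the dojo's documented behavior: the log score is taken as mean negative log-likelihood (lower is better), the benchmark is a forecaster who always states the observed base rate, and noise is approximated with the binomial standard error of a hit rate. All function names are hypothetical.

```python
import math


def brier_score(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def log_score(forecasts, outcomes, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, o in zip(forecasts, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p if o == 1 else 1 - p)
    return total / len(forecasts)


def brier_skill_score(forecasts, outcomes):
    """1 - BS / BS_ref, where the reference always forecasts the observed base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0:
        raise ValueError("all outcomes identical; the benchmark is undefined")
    return 1 - brier_score(forecasts, outcomes) / reference


def hit_rate_standard_error(hit_rate, n):
    """Binomial standard error of an observed hit rate over n forecasts."""
    return math.sqrt(hit_rate * (1 - hit_rate) / n)


# Hypothetical resolved forecasts with a 50% base rate.
forecasts = [0.8] * 5 + [0.2] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
skill = brier_skill_score(forecasts, outcomes)  # positive means better than the benchmark
ll = log_score(forecasts, outcomes)

# How uncertain is a 70% hit rate measured on 10 vs. 50 forecasts?
se_10 = hit_rate_standard_error(0.7, 10)
se_50 = hit_rate_standard_error(0.7, 50)
print(f"skill={skill:.2f} log_score={ll:.3f} se_10={se_10:.3f} se_50={se_50:.3f}")
```

A skill score of 0 means you matched the benchmark and negative means you did worse. The standard errors show why small samples mislead: at 10 forecasts, a 70% hit rate is uncertain by roughly 14 percentage points either way.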

Common Scenarios

Use realistic starting points

First 20 questions, mixed topics

Mix: general knowledge

Expect overconfidence at 80–90%; most beginners are miscalibrated there. A Brier score above 0.25 is worse than answering 50% on every question.
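To see how overconfidence in that range can push a score past 0.25, here is a small sketch of the expected Brier score when your stated probability differs from your true hit rate. The function name and the 90% stated / 60% actual figures are hypothetical.

```python
def expected_brier(stated, true_hit_rate):
    """Expected Brier score when you say `stated` but are right `true_hit_rate` of the time."""
    return true_hit_rate * (1 - stated) ** 2 + (1 - true_hit_rate) * stated ** 2


# Saying 90% while being right only 60% of the time.
overconfident = expected_brier(0.9, 0.6)

# Saying 60% while being right 60% of the time.
honest = expected_brier(0.6, 0.6)

print(overconfident, honest)
```

With the same hit rate, the honest forecaster stays under 0.25 while the overconfident one lands well above it; the gap is the squared difference between the stated and the true rate.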

Domain-specific session (markets)

Mix: finance only

Calibration is domain-specific; you might be well-calibrated on markets and badly miscalibrated elsewhere. That's fine, just notice it.


FAQ

Questions people ask next

The short answers readers usually want after the first pass.

Calibration is how well a model's stated confidence matches its empirical hit rate. A 70%-confident forecast that's right 70% of the time is well-calibrated; a 90%-confident forecast that's right 60% of the time is overconfident. The dojo lets you self-train calibration using forecast feedback loops.
