How to use Calibration Dojo
Train your probabilistic intuition. Answer binary forecasting questions at any confidence level; the page tracks Brier score and reliability curve over time so you can tell whether you're miscalibrated or just unlucky.
What It Does
Use the calculator with intent
Built for traders and PMs who want to know whether their 70% confidence calls actually win 70% of the time, and what to do when they don't.
Interpreting Results
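The reliability curve is the headline. It plots your actual hit rate against your stated confidence, so perfect calibration sits on the diagonal. Points above the diagonal mean you're underconfident in that bucket; points below mean you're overconfident. Most untrained forecasters cluster well below the line in the 70–90% confidence range.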
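As a rough illustration of what sits behind a reliability curve, the sketch below groups forecasts into 10-point confidence buckets and compares the stated confidence in each bucket with the observed hit rate. The function name `reliability_table`, the bucket width, and the sample data are assumptions made for this example; they are not taken from the dojo itself.

```python
from collections import defaultdict

def reliability_table(forecasts):
    """Bucket (probability, outcome) pairs by stated confidence, in 10-point bins.

    Returns rows of (bucket_start_pct, n, mean_stated, hit_rate, gap),
    where gap = hit_rate - mean_stated. A negative gap means overconfident.
    """
    buckets = defaultdict(list)
    for p, outcome in forecasts:
        pct = round(p * 100)            # integer percent avoids float edge cases
        start = min(pct // 10, 9) * 10  # 100% falls into the 90-100 bucket
        buckets[start].append((p, outcome))

    rows = []
    for start in sorted(buckets):
        items = buckets[start]
        n = len(items)
        mean_stated = sum(p for p, _ in items) / n
        hit_rate = sum(o for _, o in items) / n
        rows.append((start, n, mean_stated, hit_rate, hit_rate - mean_stated))
    return rows

# Hypothetical forecast log: (stated probability, outcome as 1 or 0)
sample = [(0.7, 1), (0.7, 0), (0.7, 1), (0.7, 0),
          (0.9, 1), (0.9, 1), (0.9, 0), (0.9, 1)]

for start, n, stated, hit, gap in reliability_table(sample):
    label = "overconfident" if gap < 0 else "underconfident" if gap > 0 else "calibrated"
    print(f"{start}-{start + 10}%: n={n} stated={stated:.2f} hit={hit:.2f} -> {label}")
```

Each row is one point on the curve: stated confidence on the x-axis, hit rate on the y-axis. A negative gap puts the point below the diagonal.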
Input Steps
Field by field
1. Make probabilistic forecasts on questions with eventual ground truth (binary outcomes work best for entry-level practice).
2. Resolve forecasts as outcomes become known. The dojo computes Brier score, log score, calibration curve, and resolution.
3. Read the calibration plot: do your 70%-confident forecasts hit at 70%? If they hit at 50%, you're overconfident.
4. Compare to the base-rate benchmark. Always answering 50% scores a Brier of 0.25, so beating that is necessary but not sufficient; the gap between your score and the base-rate benchmark is your skill.
5. Make at least 50 forecasts before drawing conclusions. Calibration estimates are noisy below that.
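The scoring in the steps above can be sketched in a few lines. This is a minimal illustration, not the dojo's implementation. It assumes the log score is reported as mean negative log-likelihood (lower is better) and that the base-rate benchmark always forecasts the observed share of "yes" outcomes. The skill figure shown is the standard Brier skill score, a normalized version of the gap between your score and the benchmark. All function names and the sample data are hypothetical.

```python
import math

def brier_score(forecasts):
    """Mean squared error between stated probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

def log_score(forecasts, eps=1e-12):
    """Mean negative log-likelihood (lower is better); eps guards against log(0)."""
    total = 0.0
    for p, o in forecasts:
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p if o == 1 else 1 - p)
    return total / len(forecasts)

def brier_skill_score(forecasts):
    """1 - BS / BS_ref, where the reference always forecasts the observed base rate.

    Positive values beat the benchmark; zero or below means no demonstrated skill.
    """
    base_rate = sum(o for _, o in forecasts) / len(forecasts)
    reference = brier_score([(base_rate, o) for _, o in forecasts])
    if reference == 0:
        return float("nan")  # every outcome identical, so skill is undefined
    return 1 - brier_score(forecasts) / reference

# Hypothetical resolved forecasts: (stated probability, outcome)
resolved = [(0.8, 1), (0.7, 1), (0.6, 0), (0.9, 1), (0.3, 0), (0.2, 0)]

print(f"Brier score:       {brier_score(resolved):.3f}")
print(f"Log score:         {log_score(resolved):.3f}")
print(f"Brier skill score: {brier_skill_score(resolved):.3f}")
```

Using the in-sample base rate as the reference is a simplification; a stricter benchmark would use a base rate known before the questions resolved.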
Common Scenarios
Use realistic starting points
First 20 questions, mixed topics
Question count: 20
Mix: general knowledge
Expect overconfidence at 80–90%; most beginners are miscalibrated there. A Brier score above 0.25 is worse than answering 50% on every question.
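To get a feel for how little a 20-question session can say about any single confidence bucket, the sketch below computes the binomial standard error of an observed hit rate. It assumes forecasts are independent and uses a normal approximation for the 95% range, which is rough at small sample sizes. The function name and the sample sizes are illustrative only.

```python
import math

def hit_rate_standard_error(p_true, n):
    """Standard error of an observed hit rate over n independent forecasts
    that each come true with probability p_true (binomial model)."""
    return math.sqrt(p_true * (1 - p_true) / n)

# How wide is the plausible range around a true 80% hit rate?
for n in (5, 20, 50, 200):
    se = hit_rate_standard_error(0.8, n)
    print(f"n={n:>3}: hit rate 0.80 +/- {1.96 * se:.2f} (approx. 95% range)")
```

With only a handful of forecasts in the 80–90% bucket, a perfectly calibrated forecaster can still land well off the diagonal by chance, so read early sessions as a rough signal rather than a verdict.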
Domain-specific session (markets)
Question count: 30
Mix: finance only
Calibration is domain-specific: you might be well-calibrated on markets and badly miscalibrated elsewhere. That's fine; just notice it.
Try These Tools
Run the numbers next
Forecast Scoring Sandbox
Paste a forecast stream (probability + outcome) and see Brier score with full decomposition, log loss, reliability diagram, and bootstrap confidence.
Prompt Regression Tester
Run the same prompt against multiple models (Claude 4.5/4.6/4.7, GPT-5, Gemini 2.5) with your own keys. Diff outputs, score drift, catch regressions.