Calibration Dojo

Name: Calibration Dojo
Author: AI Fin Hub Research

Train probabilistic intuition. Binary forecasting questions at any confidence level; track Brier score + reliability curve over time. Browser-only. Free.

AI Fin Hub Research · Published Apr 20, 2026

Inputs: Paste + configure
Runtime: 1–15 s
Privacy: Client-side · no upload
API key: Not required

Education · Not investment advice. BaFin/EU framework. Past performance does not indicate future results.

Brier score

Lower is better · a forecaster who is always certain and always right scores 0.000 · always answering 50% scores 0.250

Reliability curve

Dot size ∝ number of answers in that decile. Diagonal = perfect calibration.

History persists in your browser's localStorage only.

How to use

Step-by-step

1. Make probabilistic forecasts on questions with eventual ground truth (binary outcomes work best for entry-level practice).
2. Resolve forecasts as outcomes become known. The dojo computes Brier score, log score, calibration curve, and resolution.
3. Read the calibration plot: do your 70%-confident forecasts hit at 70%? If they hit at 50%, you're overconfident.
4. Compare to the base-rate benchmark. Beating a constant 50% forecast is necessary but not sufficient; the gap between your score and the base-rate benchmark is your skill.
5. Make at least 50 forecasts before drawing conclusions. Calibration estimates are noisy below that.
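The calibration check in step 3 can be sketched in a few lines. This is a minimal illustration, not the dojo's own code: the function name `reliabilityCurve` and the `{ p, outcome }` record shape are assumptions made for the example, and the ten equal-width bins simply mirror the deciles mentioned under the reliability curve.

```javascript
// Group resolved binary forecasts into equal-width probability bins and
// compare the mean stated probability with the observed hit rate per bin.
// Hypothetical input shape: { p: probability in [0, 1], outcome: 0 or 1 }.
function reliabilityCurve(forecasts, bins = 10) {
  const buckets = Array.from({ length: bins }, () => ({ n: 0, sumP: 0, hits: 0 }));
  for (const { p, outcome } of forecasts) {
    // p = 1 would index past the last bin, so clamp it into the top bin.
    const i = Math.min(bins - 1, Math.floor(p * bins));
    buckets[i].n += 1;
    buckets[i].sumP += p;
    buckets[i].hits += outcome;
  }
  // Keep only bins that contain at least one forecast.
  return buckets
    .map((b, i) => ({
      bin: i,
      n: b.n,
      meanForecast: b.n ? b.sumP / b.n : null,
      hitRate: b.n ? b.hits / b.n : null,
    }))
    .filter((b) => b.n > 0);
}

// Four forecasts at 75% of which only two came true, plus one at 25%.
const sample = [
  { p: 0.75, outcome: 1 },
  { p: 0.75, outcome: 0 },
  { p: 0.75, outcome: 1 },
  { p: 0.75, outcome: 0 },
  { p: 0.25, outcome: 0 },
];
const curve = reliabilityCurve(sample);
console.log(curve);
```

In this sample the 75% bin has a hit rate of 50%, which is the overconfidence pattern described in step 3. With only a handful of answers per bin the hit rate is very noisy, which is why step 5 asks for at least 50 forecasts.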

For agents

Use in an agent

Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.

import { compute } from "https://aifinhub.io/engines/calibration-dojo.js";

Contract: /contracts/calibration-dojo.json

Questions people ask next

FAQ

What's calibration in this context?

How well a model's stated confidence matches its empirical hit rate. A 70%-confident forecast that's right 70% of the time is well-calibrated; a 90%-confident forecast that's right 60% of the time is overconfident. The dojo lets you self-train calibration using forecast feedback loops.

How is the Brier score calculated?

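The Brier score is the mean squared error between the probability forecast and the outcome (1 if the statement resolved true, 0 if not). For a single forecast: (probability − outcome)². For a series: the average across forecasts. Lower is better. A forecaster who always says 50% scores exactly 0.25; a well-calibrated forecaster whose probabilities also discriminate between outcomes scores below that, and how far below depends on how strong the discrimination is.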
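As a worked example of the formula, here is a minimal sketch. The function name `brierScore` and the `{ p, outcome }` record shape are assumptions for illustration; they are not taken from the dojo's engine module.

```javascript
// Mean squared error between stated probability and the 0/1 outcome.
// Hypothetical input shape: { p: probability in [0, 1], outcome: 0 or 1 }.
function brierScore(forecasts) {
  if (forecasts.length === 0) {
    throw new Error("need at least one resolved forecast");
  }
  let sum = 0;
  for (const { p, outcome } of forecasts) {
    sum += (p - outcome) ** 2;
  }
  return sum / forecasts.length;
}

// Always answering 50% costs 0.25 per question whatever the outcome.
const hedger = brierScore([
  { p: 0.5, outcome: 1 },
  { p: 0.5, outcome: 0 },
]);

// Certain and correct on every question gives the minimum of 0.
const oracle = brierScore([
  { p: 1, outcome: 1 },
  { p: 0, outcome: 0 },
]);

// A single 75% forecast that came true: (0.75 - 1)^2 = 0.0625.
const single = brierScore([{ p: 0.75, outcome: 1 }]);

console.log({ hedger, oracle, single });
```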

What's the difference between calibration and resolution?

Calibration is whether your stated probabilities match observed frequencies. Resolution is whether you give different probabilities to questions that end up resolving differently (vs. the same probability on everything). A forecaster who always states the base rate (50% when half the statements are true) is perfectly calibrated but has zero resolution, which makes them useless. Brier score combines both.
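One standard way to see how the Brier score combines the two is Murphy's decomposition: Brier = reliability − resolution + uncertainty, where reliability is the calibration error (lower is better) and uncertainty depends only on the base rate. The sketch below groups forecasts by their exact stated probability, in which case the identity holds exactly. The function name and record shape are assumptions for illustration, and whether the dojo computes its "Resolution" figure this way is not stated in the source.

```javascript
// Murphy decomposition of the Brier score for binary forecasts.
// Hypothetical input shape: { p: probability in [0, 1], outcome: 0 or 1 }.
function murphyDecomposition(forecasts) {
  const n = forecasts.length;
  const baseRate = forecasts.reduce((s, f) => s + f.outcome, 0) / n;

  // Group forecasts that share the same stated probability.
  const groups = new Map();
  for (const { p, outcome } of forecasts) {
    const g = groups.get(p) ?? { n: 0, hits: 0 };
    g.n += 1;
    g.hits += outcome;
    groups.set(p, g);
  }

  let reliability = 0; // gap between stated probability and hit rate
  let resolution = 0; // spread of hit rates around the base rate
  for (const [p, g] of groups) {
    const hitRate = g.hits / g.n;
    reliability += g.n * (p - hitRate) ** 2;
    resolution += g.n * (hitRate - baseRate) ** 2;
  }
  return {
    reliability: reliability / n,
    resolution: resolution / n,
    uncertainty: baseRate * (1 - baseRate),
  };
}

// Always states the base rate: calibrated, but no resolution.
const flat = murphyDecomposition(
  [1, 0, 0, 0].map((outcome) => ({ p: 0.25, outcome }))
);

// Calibrated and discriminating: 75% calls hit 3 of 4, 25% calls hit 1 of 4.
const sharp = murphyDecomposition([
  ...[1, 1, 1, 0].map((outcome) => ({ p: 0.75, outcome })),
  ...[1, 0, 0, 0].map((outcome) => ({ p: 0.25, outcome })),
]);

console.log({ flat, sharp });
```

Both forecasters have zero calibration error here, yet only the second earns a resolution term that pulls the Brier score below the uncertainty floor set by the base rate.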

How many forecasts before I get a stable calibration estimate?

At least 50, ideally 100+. The dojo shows the calibration curve (predicted probability vs. observed frequency in bins) — it's noisy below 50 forecasts. The methodology page documents the bootstrap CI on the curve.
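To get a feel for why small samples are noisy, the sketch below runs a generic percentile bootstrap on the hit rate of a single bin. It is an assumption-laden illustration, not the procedure from the methodology page: the function names, the 2,000 resamples, the 95% level, and the seeded generator (a small public-domain PRNG known as mulberry32, used only to make the run repeatable) are all choices made for this example.

```javascript
// Small seeded PRNG (mulberry32) so the example is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap for the hit rate of the outcomes in one bin.
// `outcomes` is an array of 0/1 values; returns a 95% interval.
function bootstrapHitRateCI(outcomes, resamples = 2000, rng = Math.random) {
  const n = outcomes.length;
  const rates = [];
  for (let r = 0; r < resamples; r++) {
    let hits = 0;
    // Draw n outcomes with replacement and record the resampled hit rate.
    for (let i = 0; i < n; i++) {
      hits += outcomes[Math.floor(rng() * n)];
    }
    rates.push(hits / n);
  }
  rates.sort((x, y) => x - y);
  return {
    lo: rates[Math.floor(0.025 * resamples)],
    hi: rates[Math.ceil(0.975 * resamples) - 1],
  };
}

// Same 70% observed hit rate, with 10 answers versus 100 answers.
const few = Array.from({ length: 10 }, (_, i) => (i < 7 ? 1 : 0));
const many = Array.from({ length: 100 }, (_, i) => (i < 70 ? 1 : 0));

const ciFew = bootstrapHitRateCI(few, 2000, mulberry32(1));
const ciMany = bootstrapHitRateCI(many, 2000, mulberry32(1));
console.log({ ciFew, ciMany });
```

The interval for 10 answers should come out several times wider than the one for 100, even though both bins show the same observed hit rate.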

Why does a low Brier score not always mean I'm right?

A Brier score is meaningful only relative to a benchmark. Forecasting 'AAPL up tomorrow' at 50% gives Brier 0.25 with zero skill. The dojo always reports your score against the benchmark of forecasting the base rate, so you can see whether your work is adding signal or just hitting the average.
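One common way to express a score relative to the base-rate benchmark is the Brier skill score, 1 − (your Brier score ÷ benchmark Brier score). The sketch below is an illustration under assumptions: the function names and record shape are hypothetical, the base rate is taken from the same sample for simplicity, and the source does not say whether the dojo reports the comparison in exactly this form.

```javascript
// Hypothetical input shape: { p: probability in [0, 1], outcome: 0 or 1 }.
function brierScore(forecasts) {
  let sum = 0;
  for (const { p, outcome } of forecasts) sum += (p - outcome) ** 2;
  return sum / forecasts.length;
}

// Skill relative to always forecasting the observed base rate.
// 0 means no better than the benchmark, 1 is a perfect score,
// and negative values mean worse than the benchmark.
function brierSkillScore(forecasts) {
  const baseRate =
    forecasts.reduce((s, f) => s + f.outcome, 0) / forecasts.length;
  const benchmark = brierScore(
    forecasts.map((f) => ({ p: baseRate, outcome: f.outcome }))
  );
  // If every outcome is identical the benchmark is already perfect.
  if (benchmark === 0) return null;
  return 1 - brierScore(forecasts) / benchmark;
}

// 50% on everything with a 50% base rate: Brier 0.25, zero skill.
const coinFlipper = brierSkillScore([
  { p: 0.5, outcome: 1 },
  { p: 0.5, outcome: 0 },
]);

// 75% calls hit 3 of 4 and 25% calls hit 1 of 4: Brier 0.1875 vs 0.25.
const skilled = brierSkillScore([
  ...[1, 1, 1, 0].map((outcome) => ({ p: 0.75, outcome })),
  ...[1, 0, 0, 0].map((outcome) => ({ p: 0.25, outcome })),
]);

console.log({ coinFlipper, skilled });
```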

Used in

Decision workflows that use this tool

Goal-driven flows that bundle this tool with adjacent ones.

Validate Your Strategy
Pressure-test a quant or LLM-augmented strategy before paper-trading or production.

Complementary tools

Returns Distribution Analyzer

Paste a returns CSV. Histogram, normal-overlay, QQ plot, skewness, excess kurtosis, Jarque-Bera test, tail-weight index. See why Sharpe alone misleads.

Backtest Overfitting Score

Upload a backtest trade log and compute Probability of Backtest Overfitting (PBO), Deflated Sharpe Ratio, and the odds your edge survives live trading.

Fractional Kelly Sizer

Map conviction tiers to fractional Kelly bet sizes with a drawdown Monte Carlo simulator. Client-side. Private by default.
