Calibration Dojo
Train probabilistic intuition. Binary forecasting questions at any confidence level; track Brier score + reliability curve over time. Browser-only. Free.
- Inputs: Paste + configure
- Runtime: 1–15 s
- Privacy: Client-side · no upload
- API key: Not required
- Methodology: documented on the methodology page
Brier score
Lower is better · perfect foresight = 0.000 · always answering 50% = 0.250
Reliability curve
Dot size ∝ number of answers in that decile. Diagonal = perfect calibration.
History persists in your browser's localStorage only.
How to use
Step-by-step
- 1. Make probabilistic forecasts on questions with eventual ground truth (binary outcomes work best for entry-level practice).
- 2. Resolve forecasts as outcomes become known. The dojo computes Brier score, log score, calibration curve, and resolution.
- 3. Read the calibration plot: do your 70%-confident forecasts hit at 70%? If they hit at 50%, you're overconfident.
- 4. Compare to the base-rate benchmark. Beating a constant 50% forecast is necessary but not sufficient; the gap between your score and the base-rate benchmark is your skill.
- 5. Make at least 50 forecasts before drawing conclusions. Calibration estimates are noisy below that.
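The reliability curve behind step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the dojo's own engine: the name `reliability_table` and the equal-width decile bins are assumptions chosen to match the "decile" wording on the chart.

```python
def reliability_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width bins and compare the mean
    stated probability with the observed hit rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, o))

    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # empty bins produce no dot on the curve
        rows.append({
            "bin": idx,
            "count": len(items),  # would drive the dot size
            "mean_forecast": sum(p for p, _ in items) / len(items),
            "hit_rate": sum(o for _, o in items) / len(items),
        })
    return rows


# Ten forecasts at 70%, five of which came true.
rows = reliability_table([0.7] * 10, [1, 0] * 5)
```

Here the single populated bin has a mean forecast near 0.7 and a hit rate of 0.5, which is the overconfident case described in step 3.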
For agents
Use in an agent
Same math, same result shape as the UI above — as a static ES module. No HTTP request, no auth, no rate limit.
import { compute } from "https://aifinhub.io/engines/calibration-dojo.js";
Contract: /contracts/calibration-dojo.json
Questions people ask next
FAQ
What's calibration in this context?
How well a forecaster's stated confidence matches their empirical hit rate. Forecasts made at 70% confidence that come true 70% of the time are well-calibrated; forecasts made at 90% confidence that come true only 60% of the time are overconfident. The dojo lets you train your own calibration through forecast feedback loops.
How is the Brier score calculated?
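Brier score is the mean squared error between probability forecast and outcome, where the outcome is coded 1 if the event happened and 0 if it did not. For a single forecast: (probability − outcome)². For a series: average across forecasts. Lower is better. A forecaster who always says 50% scores exactly 0.25; a well-calibrated forecaster who also discriminates between outcomes scores below 0.25, and how far below depends on how predictable the questions are.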
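A minimal Python sketch of the formula above. The function name `brier_score` is illustrative and is not the dojo's API.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 0, 1, 1]

# Always saying 50% scores 0.25 no matter what happens.
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)

# Confident forecasts that lean the right way score much lower (about 0.0375).
sharp = brier_score([0.9, 0.1, 0.8, 0.7], outcomes)
```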
What's the difference between calibration and resolution?
Calibration is whether your stated probabilities match observed frequencies. Resolution is whether you assign different probabilities to questions that turn out differently, rather than giving the same number to everything. A forecaster who always states the base rate (50% when half the questions resolve true) is perfectly calibrated but has zero resolution, so the forecasts carry no information. Brier score combines both.
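One standard way to separate the two is the Murphy decomposition, Brier = reliability − resolution + uncertainty, which is exact when forecasts are grouped by their distinct stated probabilities. The sketch below assumes that decomposition; whether the dojo's resolution readout uses the same definition is not confirmed here, and `murphy_decomposition` is a hypothetical helper name.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Split the Brier score into reliability (calibration error, lower is
    better), resolution (higher is better) and uncertainty (base-rate variance)."""
    n = len(probs)
    base_rate = sum(outcomes) / n

    # Group outcomes by the exact probability that was stated.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2

    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
    }


outcomes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

# Always 50% on a 50% base rate: calibrated, but zero resolution.
flat = murphy_decomposition([0.5] * 10, outcomes)

# 80% on the mostly-true half, 20% on the mostly-false half:
# still calibrated, and now resolution is positive.
sharp = murphy_decomposition([0.8] * 5 + [0.2] * 5, outcomes)
```

Both forecasters have a reliability term of zero, so the entire difference in their Brier scores (0.25 versus about 0.16) comes from resolution.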
How many forecasts before I get a stable calibration estimate?
At least 50, ideally 100+. The dojo shows the calibration curve (predicted probability vs. observed frequency in bins) — it's noisy below 50 forecasts. The methodology page documents the bootstrap CI on the curve.
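To see why small samples are noisy, a percentile bootstrap on the hit rate of a single bin is enough. This is a generic sketch and not necessarily the procedure on the methodology page; `bootstrap_hit_rate_ci`, the 2,000 resamples, and the fixed seed are all assumptions made for illustration.

```python
import random


def bootstrap_hit_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the observed hit rate
    of the 0/1 outcomes that fall into one calibration bin."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample the bin with replacement and record each resampled hit rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Same 70% hit rate, ten answers versus one hundred.
small_lo, small_hi = bootstrap_hit_rate_ci([1] * 7 + [0] * 3)
large_lo, large_hi = bootstrap_hit_rate_ci([1] * 70 + [0] * 30)
```

The interval from ten answers is much wider than the one from a hundred, so a dot built on a handful of answers says little about where it sits relative to the diagonal.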
Why does a low Brier score not always mean I'm right?
A Brier score is meaningful only relative to a benchmark. Forecasting 'AAPL up tomorrow' at 50% gives Brier 0.25 with zero skill. The dojo always reports your score against the benchmark of forecasting the base rate, so you can see whether your work is adding signal or just hitting the average.
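A common way to express the gap to the benchmark is the Brier skill score, 1 − (your Brier score ÷ benchmark Brier score). The sketch below assumes that definition and, for simplicity, takes the base rate from the same outcomes being scored; the dojo's exact reporting format is not specified on this page, and `brier_skill_score` is a hypothetical name.

```python
def brier_skill_score(probs, outcomes):
    """1 - BS / BS_ref, where BS_ref is the score of always forecasting
    the base rate. Positive means skill; zero or below means none."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n  # assumption: in-sample base rate

    def mse(forecasts):
        return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n

    reference = mse([base_rate] * n)
    if reference == 0:
        raise ValueError("all outcomes identical; skill score is undefined")
    return 1 - mse(probs) / reference


# Events that happen 20% of the time.
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

matches_base_rate = brier_skill_score([0.2] * 10, outcomes)  # about 0
always_half = brier_skill_score([0.5] * 10, outcomes)        # about -0.5625
```

On rare events a constant 50% forecast is worse than the base rate, which is why a raw Brier score of 0.25 or even lower says little on its own.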
Related deep dive
Read further
Long-form context behind the tool output.
- Methodology · Opinion · 8 min
The Price-Blind LLM Research Harness
Price-blind LLM research: most harnesses leak the current price and the model confabulates. The architectural fix and a 30-line Python scaffold.
- Tutorial · Runnable · 8 min
Conviction-Scaled Kelly Bet Sizing
Full Kelly is brutally unforgiving of over-estimation. Quarter-Kelly with a conviction-tier mapping and a per-trade cap is the defensible default.
- Tutorial · Runnable · 10 min
Calibrating LLM Forecasts with Isotonic Regression
LLM probabilities are systematically miscalibrated. Isotonic regression via PAV is the cheapest robust fix: 40 lines of Python, no distributional priors.
Complementary tools
Users of this tool often explore
Returns Distribution Analyzer
Paste a returns CSV. Histogram, normal-overlay, QQ plot, skewness, excess kurtosis, Jarque-Bera test, tail-weight index. See why Sharpe alone misleads.
Backtest Overfitting Score
Upload a backtest trade log and compute Probability of Backtest Overfitting (PBO), Deflated Sharpe Ratio, and the odds your edge survives live trading.
Fractional Kelly Sizer
Map conviction tiers to fractional Kelly bet sizes with a drawdown Monte Carlo simulator. Client-side. Private by default.