aifinhub

Worked example

Running the shipped calibration-dojo engine on the input below produces exactly this output. Continuous integration recomputes it against the engine bundle on every build, so these numbers cannot drift from the code.

Input

{
  "tool": "calibration_dojo",
  "session_length_questions": 20,
  "difficulty": "medium",
  "domain_filter": "general_knowledge"
}

Output

{
  "questions": [
    {
      "id": "q1",
      "statement": "The speed of light in vacuum is greater than 300,000 km/s.",
      "truth": false,
      "category": "physics",
      "source": "CODATA 2018 fundamental physical constants",
      "rationale": "The exact value is 299,792,458 m/s = 299,792.458 km/s, which is less than 300,000 km/s."
    },
    {
      "id": "q2",
      "statement": "The Eiffel Tower is taller than 300 meters including antennas.",
      "truth": true,
      "category": "geography",
      "source": "Société d'Exploitation de la Tour Eiffel",
      "rationale": "Height to tip including antennas is ~330 m."
    },
    {
      "id": "q3",
      "statement": "Over the last 50 years (to 2024), the US stock market has had more positive calendar years than negative calendar years.",
      "truth": true,
      "category": "finance-generic",
      "source": "S&P / Damodaran annual returns series",
      "rationale": "Roughly 73–76% of calendar years are positive over long rolling windows; far more ups than downs."
    },
    {
      "id": "q4",
      "statement": "The Great Wall of China is visible to the naked eye from low Earth orbit.",
      "truth": false,
      "category": "physics",
      "source": "NASA — multiple astronaut reports",
      "rationale": "It is generally not visible from LEO without aids; this is a common misconception debunked by NASA and astronauts including Chiao."
    },
    {
      "id": "q5",
      "statement": "A year on Venus is shorter than a day on Venus.",
      "truth": true,
      "category": "physics",
      "source": "NASA planetary fact sheet",
      "rationale": "Venusian orbital period is ~224.7 Earth days; a sidereal day on Venus is ~243 Earth days."
    },
    {
      "id": "q6",
      "statement": "Japan has more islands than Greece.",
      "truth": true,
      "category": "geography",
      "source": "Geospatial Information Authority of Japan 2023 recount",
      "rationale": "Japan's recount placed islands at ~14,125; Greece has ~1,400–6,000 depending on definition."
    },
    {
      "id": "q7",
      "statement": "The global median household income is above USD 15,000.",
      "truth": false,
      "category": "economics",
      "source": "Gallup / World Bank household income surveys",
      "rationale": "Global median is closer to $3,000–9,000 USD depending on methodology; nowhere near $15K."
    },
    {
      "id": "q8",
      "statement": "The first transatlantic submarine telegraph cable was laid before the American Civil War ended.",
      "truth": true,
      "category": "history",
      "source": "Smithsonian Institution archives",
      "rationale": "The first working transatlantic cable began service in August 1858; the Civil War ended April 1865."
    },
    {
      "id": "q9",
      "statement": "The Dead Sea is below sea level.",
      "truth": true,
      "category": "geography",
      "source": "Geological Survey of Israel",
      "rationale": "Approximately 430 meters below sea level — lowest land elevation on Earth."
    },
    {
      "id": "q10",
      "statement": "A bond with a higher yield-to-maturity always has a higher current price than a lower-yield bond of the same maturity and coupon.",
      "truth": false,
      "category": "finance-generic",
      "source": "Standard fixed-income textbook (e.g. Fabozzi)",
      "rationale": "Higher YTM corresponds to LOWER price (inverse relationship) for the same cash flows."
    },
    {
      "id": "q11",
      "statement": "Pluto is smaller than Mercury.",
      "truth": true,
      "category": "physics",
      "source": "NASA planetary fact sheet",
      "rationale": "Pluto diameter ~2,377 km; Mercury diameter ~4,880 km."
    },
    {
      "id": "q12",
      "statement": "The EU adopted the euro as a physical currency before the year 2000.",
      "truth": false,
      "category": "history",
      "source": "European Central Bank historical timeline",
      "rationale": "Euro banknotes and coins entered circulation on 1 January 2002. The euro existed as an accounting currency from 1 January 1999."
    },
    {
      "id": "q13",
      "statement": "In 2023, Norway's sovereign wealth fund crossed USD 1.5 trillion in assets.",
      "truth": true,
      "category": "finance-generic",
      "source": "Norges Bank Investment Management annual report",
      "rationale": "Norway's GPFG crossed $1.4T in late 2023, and $1.5T early 2024."
    },
    {
      "id": "q14",
      "statement": "Kelly criterion bet sizing is defined as f* = (bp − q) / b where b = odds, p = win probability, q = 1 − p.",
      "truth": true,
      "category": "finance-generic",
      "source": "Kelly 1956, 'A New Interpretation of Information Rate'",
      "rationale": "This is the original formula from Kelly's 1956 paper."
    },
    {
      "id": "q15",
      "statement": "MCP (Model Context Protocol) was first specified by OpenAI.",
      "truth": false,
      "category": "tech",
      "source": "Anthropic MCP specification repository",
      "rationale": "MCP was specified and open-sourced by Anthropic in November 2024."
    },
    {
      "id": "q16",
      "statement": "Over the last 100 years, large-cap US equities have had a positive real return in more than 60% of rolling 10-year windows.",
      "truth": true,
      "category": "finance-generic",
      "source": "Damodaran annualised returns series, 1928–present",
      "rationale": "Positive real-return share is closer to ~85%+ for rolling 10-year windows; the statement is comfortably true."
    },
    {
      "id": "q17",
      "statement": "At a 52% win rate and 1:1 odds, full Kelly recommends a 4% bet of bankroll per trade.",
      "truth": true,
      "category": "finance-generic",
      "source": "Direct Kelly formula: f* = (1 × 0.52 − 0.48) / 1 = 0.04",
      "rationale": "Substituting into Kelly's formula gives exactly 0.04 = 4%."
    },
    {
      "id": "q18",
      "statement": "The Antarctic Treaty, establishing Antarctica as a scientific preserve, entered force before 1970.",
      "truth": true,
      "category": "history",
      "source": "Secretariat of the Antarctic Treaty",
      "rationale": "Signed 1959, entered into force 23 June 1961."
    },
    {
      "id": "q19",
      "statement": "A Sharpe ratio of 2.0 on daily returns annualizes to roughly 31.7.",
      "truth": true,
      "category": "finance-generic",
      "source": "Standard Sharpe annualization: SR_annual ≈ SR_daily × √252 ≈ 31.75",
      "rationale": "2.0 × √252 ≈ 31.75."
    },
    {
      "id": "q20",
      "statement": "The Shannon entropy of a fair coin is 1 bit.",
      "truth": true,
      "category": "physics",
      "source": "Shannon 1948, 'A Mathematical Theory of Communication'",
      "rationale": "H(p=0.5, q=0.5) = −(0.5·log₂(0.5) + 0.5·log₂(0.5)) = 1 bit."
    },
    {
      "id": "q21",
      "statement": "The Deflated Sharpe Ratio was introduced by Bailey and Lopez de Prado in 2014.",
      "truth": true,
      "category": "finance-generic",
      "source": "Bailey, D. H., & Lopez de Prado, M. (2014). Journal of Portfolio Management 40(5).",
      "rationale": "DSR extends Sharpe with a multiple-testing + non-normality correction; original 2014 paper."
    },
    {
      "id": "q22",
      "statement": "Most benchmarks of mainstream LLMs on numeric extraction from 10-K filings report accuracy above 99%.",
      "truth": false,
      "category": "tech",
      "source": "Finance-LLM benchmark literature 2023–2025",
      "rationale": "Published extraction accuracy on numeric fields from full 10-Ks typically falls in the 80–95% range depending on task and model; 99%+ is not a realistic report."
    },
    {
      "id": "q23",
      "statement": "Options pricing via Black-Scholes assumes the underlying follows a geometric Brownian motion.",
      "truth": true,
      "category": "finance-generic",
      "source": "Black & Scholes 1973, Journal of Political Economy",
      "rationale": "Geometric Brownian motion (log-normal underlying, constant drift + volatility) is the canonical Black-Scholes assumption."
    },
    {
      "id": "q24",
      "statement": "Cointegration between two price series is the same as high Pearson correlation between them.",
      "truth": false,
      "category": "finance-generic",
      "source": "Engle & Granger 1987, Econometrica 55(2)",
      "rationale": "Cointegration is about a stationary linear combination of non-stationary series; two series can be highly correlated without being cointegrated, and vice versa."
    }
  ],
  "hint": "POST {op:'score', answers:[{probability, actual}, ...]} to score a session."
}
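The `hint` field above describes the scoring call. The sketch below shows one way to assemble that request body in Python. It is not the engine's own client and makes two assumptions: `actual` takes the statement's `truth` value encoded as 0 or 1, and `probability` is the user's stated probability that the statement is true. The endpoint path is not documented on this page, so the sketch only builds and serializes the body. `build_score_request` is a hypothetical helper name.

```python
import json


def build_score_request(questions, probabilities):
    """Pair each question's ground truth with the user's stated probability.

    Field names `op`, `answers`, `probability`, and `actual` come from the
    engine's hint; encoding `actual` as 0/1 is an assumption.
    """
    if len(questions) != len(probabilities):
        raise ValueError("one probability per question is required")
    answers = []
    for q, p in zip(questions, probabilities):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {q['id']} must lie in [0, 1]")
        answers.append({"probability": p, "actual": 1 if q["truth"] else 0})
    return {"op": "score", "answers": answers}


# Two questions from the worked example, with hypothetical user answers.
questions = [
    {"id": "q1", "truth": False},
    {"id": "q2", "truth": True},
]
body = build_score_request(questions, [0.7, 0.9])
print(json.dumps(body))
```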

Frequently asked questions

What does the Calibration Dojo methodology page document?
It documents how Calibration Dojo scores probabilistic forecasts: the Brier score formula, the reliability curve, question selection, and privacy. It states the formulas, assumptions, data sources, limitations, and reproducibility steps behind the Calibration Dojo, in the Finance category.
When was the Calibration Dojo methodology last reviewed?
This methodology was last reviewed on 2026-04-20. The matching tool is at https://aifinhub.io/calibration-dojo/.
Are the Calibration Dojo numbers reproducible?
Yes. This page embeds a worked example whose output is the verbatim result of running the shipped calibration-dojo engine on a fixed input; the embedded JSON is recomputed and diffed against the engine in CI, so the numbers cannot drift from the code.

Last updated 2026-04-20

How Calibration Dojo works

The assumptions, algorithms, and limitations behind the Calibration Dojo tool.

Scoring

The Brier score is the squared error between the predicted probability and the outcome, averaged across all answers:

Brier = (1/N) · Σᵢ (pᵢ − yᵢ)²

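where pᵢ is the user's stated probability that statement i is true and yᵢ ∈ {0, 1} is the actual outcome (1 if true, 0 if false). A perfect forecaster scores 0. Answering 50% on everything scores exactly 0.25, because (0.5 − yᵢ)² = 0.25 whichever way the statement resolves. Scores above 0.25, worse than that uninformative baseline, are possible if the user is systematically miscalibrated; the worst possible score is 1.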
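The formula maps directly onto a few lines of code. This is a minimal Python sketch, not the engine's own implementation. It assumes probabilities are floats in [0, 1] and outcomes are encoded as 0/1 integers, and it reproduces the perfect-forecast and constant-50% cases described above.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between stated probabilities p_i and outcomes y_i in {0, 1}."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)


print(brier_score([1.0, 0.0], [1, 0]))          # perfect forecasts: 0.0
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # constant 50%: 0.25
print(brier_score([0.9, 0.8], [0, 0]))          # confidently wrong: about 0.725
```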

Reliability curve

Predictions are grouped into 10 bins (deciles) by stated probability. For each bin, the x-axis plots the average predicted probability and the y-axis plots the fraction of that bin's statements that were actually true. A perfectly calibrated forecaster's points fall on the diagonal y = x.

Dot size on the chart is proportional to the number of answers in that decile.
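The binning can be sketched in Python as follows. This is an illustration rather than the chart's actual code. It assumes the deciles are equal-width bins over [0, 1] (0–10%, 10–20%, and so on), with p = 1.0 placed in the top bin. Each returned tuple holds one point's x value, y value, and answer count; the count is what would drive the dot size.

```python
from collections import defaultdict


def reliability_curve(probabilities, outcomes, n_bins=10):
    """Return (mean predicted, fraction true, count) for each non-empty bin.

    Assumption: bins are equal-width over [0, 1]; p = 1.0 falls in the top bin.
    """
    bins = defaultdict(list)
    for p, y in zip(probabilities, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # bin index 0..n_bins-1
        bins[k].append((p, y))
    curve = []
    for k in sorted(bins):
        pairs = bins[k]
        n = len(pairs)
        mean_p = sum(p for p, _ in pairs) / n      # x coordinate
        frac_true = sum(y for _, y in pairs) / n   # y coordinate
        curve.append((mean_p, frac_true, n))       # n drives the dot size
    return curve


curve = reliability_curve([0.15, 0.15, 0.85, 0.85, 0.95], [0, 1, 1, 1, 1])
print(curve)
```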

Question bank

The bank contains 20 evergreen binary statements covering physics, history, economics, geography, technology, and generic finance concepts. Each question has a public source and a short rationale, revealed after the user commits a probability. There are no real-time market questions or ticker-specific prompts, for compliance reasons and to keep the focus on education.

Questions are drawn without replacement until all have been seen; after that, selection reverts to random with replacement for continued practice.
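A small Python generator illustrates this two-phase draw order. It is a sketch under stated assumptions, not the tool's code: `question_stream` is a hypothetical name, `bank` is any list of question records, and passing a seeded `random.Random` instance is assumed only to make the example reproducible. Shuffling once and walking the shuffled list is equivalent to drawing without replacement until every question has been seen.

```python
import random
from itertools import islice


def question_stream(bank, rng=None):
    """Yield each question once in random order, then draw with replacement forever."""
    if not bank:
        return
    rng = rng or random.Random()
    order = list(bank)
    rng.shuffle(order)
    yield from order            # phase 1: without replacement
    while True:
        yield rng.choice(bank)  # phase 2: uniform draws with replacement


bank = [f"q{i}" for i in range(1, 21)]
stream = question_stream(bank, random.Random(0))
first_pass = list(islice(stream, 20))  # every question exactly once
practice = list(islice(stream, 5))     # repeats are now possible
print(first_pass, practice)
```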

Privacy

All answer history is stored in your browser's localStorage under a single key (aifinhub.calibration-dojo.v1). Nothing is transmitted to any server. Clearing your browser data, or pressing the "Clear local history" button, resets all progress. The site sets no cookies.

Limitations

  1. Small question bank. With only 20 questions, early Brier scores have high variance. Answer at least 30–50 questions (repeats are drawn at random) for a stable personal baseline.
  2. Domain coverage is deliberately generic, not finance-specific. For finance-specific calibration training, keep your own journal of real forecasts and outcomes; the tool's pattern (probability → outcome → Brier) transfers directly.
  3. No overconfidence detection beyond Brier. Fuller reliability diagnostics (resolution, refinement, ECE) are not surfaced in the UI, but they can be derived from the stored answer log if you export it.
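These diagnostics can be computed from an exported log in a few lines. The sketch below computes the Murphy (1973) decomposition (reliability, resolution, uncertainty) and the expected calibration error (ECE). The export format is not documented here, so the sketch assumes the log has been parsed into a list of (probability, actual) pairs with actual in {0, 1}, and `calibration_diagnostics` is a hypothetical helper. Answers are grouped by distinct stated probability rather than by decile. With that grouping, Brier = reliability − resolution + uncertainty holds exactly; decile binning would give slightly different ECE values.

```python
from collections import defaultdict


def calibration_diagnostics(answers):
    """Murphy (1973) decomposition plus ECE for (probability, actual) pairs."""
    n = len(answers)
    if n == 0:
        raise ValueError("empty answer log")
    groups = defaultdict(list)
    for p, y in answers:
        groups[p].append(y)  # group by distinct stated probability
    base_rate = sum(y for _, y in answers) / n
    reliability = resolution = ece = 0.0
    for p, ys in groups.items():
        n_k = len(ys)
        o_k = sum(ys) / n_k  # observed frequency of "true" in this group
        reliability += n_k * (p - o_k) ** 2 / n
        resolution += n_k * (o_k - base_rate) ** 2 / n
        ece += n_k * abs(p - o_k) / n
    return {
        "brier": sum((p - y) ** 2 for p, y in answers) / n,
        "reliability": reliability,                # lower is better
        "resolution": resolution,                  # higher is better
        "uncertainty": base_rate * (1 - base_rate),  # depends on outcomes only
        "ece": ece,
    }


log = [(0.9, 1), (0.9, 1), (0.9, 0), (0.6, 1), (0.6, 0), (0.2, 0)]
d = calibration_diagnostics(log)
print(d)
```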

References

  • Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78(1), 1–3.
  • Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology 12(4), 595–600.
  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.


Planning estimates only — not financial, tax, or investment advice.