TL;DR
Probabilistic forecasts need proper scoring rules: scoring systems whose expected value is minimized by reporting true beliefs. Brier score and log loss are the two canonical choices. Brier is the mean squared error between predicted probability and outcome, and it decomposes cleanly into reliability, resolution, and uncertainty terms. Log loss is the negative log-likelihood of the observations under the forecast distribution and punishes overconfident wrongness far harder: a 99% prediction that turns out wrong contributes about 4.6 to the sum; a 90% wrong prediction contributes only about 2.3. Both metrics fit in ten lines of Python and should accompany every LLM-produced probability forecast stored in a research diary.
Why proper scoring rules matter
A scoring rule maps a predicted probability p and an observed outcome o ∈ {0, 1} to a number. A rule is proper when the forecaster's expected score is optimized by reporting the true subjective probability, and strictly proper when that optimum is unique.
Improper rules are everywhere in casual prediction markets and internal dashboards. "One point for every correct above-50% call" is improper: a forecaster who believes the true probability is 0.55 loses nothing by reporting 0.99, because the reward is the same for any above-50% call that resolves correctly, and the inflated confidence makes the tracked "accuracy" look better. Incentives diverge from truthfulness.
Proper rules align incentives. If a forecaster is scored by Brier or log loss, the score-minimizing strategy is identical to honest reporting of subjective belief. This is a result of Savage (1971) on elicitation of personal probabilities, generalized by Gneiting and Raftery (2007) to the full class of strictly proper scoring rules.1 The practical consequence: when evaluating an LLM forecaster, the metric itself must be proper, or the tuning loop will drift toward overconfidence rather than calibration.
The property can be sanity-checked with a one-line argument. Suppose the true probability is q. Under Brier scoring, the expected score when reporting p is q(1 − p)² + (1 − q)p². Differentiating with respect to p and setting to zero gives p = q. Under log loss, the expected score is −q log(p) − (1 − q) log(1 − p); the same derivative trick gives p = q. Both rules bottom out at honest reporting and only there.
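The same check can be run numerically. A minimal sketch, assuming a true probability of q = 0.7: sweep the reported p over a grid and confirm that both expected scores bottom out at p = q.

```python
import numpy as np

q = 0.7                            # assumed true probability
p = np.linspace(0.01, 0.99, 981)   # grid of candidate reports

# expected scores under the two rules, as derived above
exp_brier = q * (1 - p) ** 2 + (1 - q) * p ** 2
exp_log_loss = -q * np.log(p) - (1 - q) * np.log(1 - p)

print(p[np.argmin(exp_brier)])     # ≈ 0.70
print(p[np.argmin(exp_log_loss)])  # ≈ 0.70
```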
Brier score
The Brier score is mean squared error between predicted probability and the 0/1 outcome:
BS = (1/N) Σ (p_i − o_i)²
Range: [0, 1]. Lower is better. A forecaster who always predicts 0.5 gets BS = 0.25 exactly, on any sequence, whatever the base rate. A perfect oracle gets BS = 0. Reporting a uniformly random probability against a balanced 50/50 stream gets BS ≈ 0.33.
```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    if probs.shape != outcomes.shape:
        raise ValueError("shape mismatch")
    if not np.all((probs >= 0) & (probs <= 1)):
        raise ValueError("probs must be in [0, 1]")
    if not np.all(np.isin(outcomes, [0, 1])):
        raise ValueError("outcomes must be 0 or 1")
    return float(np.mean((probs - outcomes) ** 2))

# naive 50/50 baseline
print(brier_score(np.full(1000, 0.5), np.random.binomial(1, 0.5, 1000)))
# 0.25 exactly: (0.5 - o)**2 = 0.25 for either outcome
```
The key property: if the forecaster holds belief q about the probability of an outcome, the expected Brier score is minimized at p = q. Any other report raises expected score.
Brier decomposition (reliability + resolution + uncertainty)
Brier score decomposes into three additive terms1 when predictions are binned:
BS = Reliability − Resolution + Uncertainty
- Reliability measures how far predicted probabilities are from observed frequencies inside each probability bin. A forecaster who says 0.80 on a set of events where the empirical hit rate is only 0.55 has poor reliability for that bin. Want low.
- Resolution measures how far conditional outcome frequencies in each bin stray from the overall base rate. High resolution means the forecaster's probability bins actually discriminate events from non-events. Want high.
- Uncertainty is the variance of the outcome base rate itself, ō(1 − ō), where ō is the overall hit rate. The forecaster cannot change this: it is an irreducible property of the event stream.
Computation, binned:
```python
import numpy as np
import pandas as pd

def brier_decomposition(probs, outcomes, n_bins: int = 10) -> dict:
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    N = len(probs)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)
    bins = np.linspace(0, 1, n_bins + 1)
    bin_idx = np.digitize(probs, bins[1:-1])
    reliability = 0.0
    resolution = 0.0
    per_bin = []
    for k in range(n_bins):
        mask = bin_idx == k
        nk = int(mask.sum())
        if nk == 0:
            continue
        pk = probs[mask].mean()     # mean forecast in bin
        ok = outcomes[mask].mean()  # empirical hit rate in bin
        reliability += (nk / N) * (pk - ok) ** 2
        resolution += (nk / N) * (ok - base_rate) ** 2
        per_bin.append({"bin": k, "n": nk, "p_mean": pk, "hit_rate": ok})
    bs_direct = float(np.mean((probs - outcomes) ** 2))
    bs_decomp = reliability - resolution + uncertainty
    return {
        "brier": bs_direct,
        "reliability": float(reliability),
        "resolution": float(resolution),
        "uncertainty": float(uncertainty),
        "decomposition_sum": float(bs_decomp),
        "bins": pd.DataFrame(per_bin),
    }

rng = np.random.default_rng(42)
probs = rng.beta(2, 2, 500)
outcomes = rng.binomial(1, probs * 0.9 + 0.05, 500)  # slightly miscalibrated
result = brier_decomposition(probs, outcomes, n_bins=10)
print({k: v for k, v in result.items() if k != "bins"})
```
The decomposition makes the story legible. A forecaster with BS = 0.22 and uncertainty = 0.25 is already beating base-rate noise. If the reliability term is 0.02 and resolution is 0.05, most of the improvement comes from discrimination: the probability bins genuinely separate events from non-events. If instead reliability is 0.05 and resolution is 0.08, the total is the same, but the forecaster is miscalibrated within bins while discriminating well, a different failure mode needing a different fix: recalibration rather than better signal.
Binned decomposition is an approximation. The true decomposition (Murphy 1973) is defined on the conditional distribution of outcomes given the forecast, and binning introduces a bias that decreases as bin counts grow. A rule of thumb for finance research loops: aim for at least 30 observations per bin before reading reliability numbers as signal rather than noise. Early in a research diary, ten bins on 100 forecasts will have bins with 3–5 observations each, and the apparent calibration pattern is mostly sampling variance. The bootstrap approach described below gives a way to price that uncertainty.
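A small guard that applies that rule of thumb to the decomposition output is sketched below; it assumes the `result` dictionary from the brier_decomposition example above, and the 30-observation threshold is the rule of thumb, not a standard.

```python
def thin_bins(decomp: dict, min_count: int = 30):
    """Bins whose reliability contribution should be read as noise, not signal."""
    bins = decomp["bins"]
    return bins[bins["n"] < min_count]

thin = thin_bins(result, min_count=30)
if not thin.empty:
    print(f"{len(thin)} of {len(result['bins'])} occupied bins have fewer than 30 observations")
    print(thin)
```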
Log loss
Log loss (also called cross-entropy or negative log-likelihood) is defined as:
LL = −(1/N) Σ [o_i log(p_i) + (1 − o_i) log(1 − p_i)]
Range: [0, ∞). Lower is better. A forecaster who always predicts 0.5 has log loss ln(2) ≈ 0.693 on any sequence.
The interesting property is how extremes are priced. A prediction of 0.99 that turns out right contributes −log(0.99) ≈ 0.010 to the sum; the same prediction turning out wrong contributes −log(0.01) ≈ 4.605. A 0.90 prediction that turns out wrong contributes −log(0.10) ≈ 2.303. Going from 90% to 99% confidence doubles the cost of being wrong.
```python
import numpy as np

def log_loss(probs: np.ndarray, outcomes: np.ndarray, eps: float = 1e-15) -> float:
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    o = np.asarray(outcomes, dtype=float)
    return float(-np.mean(o * np.log(p) + (1 - o) * np.log(1 - p)))

# wrong calls at different confidence
print(log_loss(np.array([0.99]), np.array([0])))  # 4.605
print(log_loss(np.array([0.90]), np.array([0])))  # 2.303
print(log_loss(np.array([0.60]), np.array([0])))  # 0.916
```
The clipping by eps is practical hygiene: a forecaster that reports 0.0 or 1.0 and is wrong produces −log(0) = ∞, which destroys the mean. In research settings, clip probabilities to [1e-15, 1 − 1e-15] before scoring.
Brier vs log loss: when to use which
| Metric | Penalizes overconfidence | Easy to decompose | Related to a standard objective |
|---|---|---|---|
| Brier | Moderately (quadratic) | Yes (reliability / resolution / uncertainty) | Yes (mean squared error) |
| Log loss | Severely (logarithmic) | Harder (no standard triple) | Yes (KL divergence from the true distribution) |
Brier is the right default when the goal is to tune specific failure modes. A forecaster whose reliability term dominates total Brier score needs recalibration (isotonic regression, Platt scaling, temperature scaling). A forecaster whose resolution term is near zero is giving out near-base-rate predictions and needs better signal, not better calibration. Brier tells the team where the failure lives.
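A minimal sketch of the isotonic option on synthetic data, using scikit-learn's IsotonicRegression and reusing the brier_score helper from above; the calibration/evaluation split is an assumption about how resolved forecasts would be partitioned in practice, and the full treatment lives in Isotonic Calibration for LLM Forecasts.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
true_p = rng.beta(2, 2, 400)
reported = np.clip(true_p ** 0.6, 0.01, 0.99)   # synthetic forecasts, pushed too high
outcomes = rng.binomial(1, true_p, 400)

# fit the monotone recalibration map on one slice, evaluate on the other
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(reported[:200], outcomes[:200])
recalibrated = iso.predict(reported[200:])

print(brier_score(reported[200:], outcomes[200:]))  # before recalibration
print(brier_score(recalibrated, outcomes[200:]))    # after: typically lower
```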
Log loss is the right default when miscalibration at the extremes is disproportionately costly, most obviously in any system where forecast confidence drives position sizing. A Kelly-fraction sizer (see Conviction-Scaled Kelly) that multiplies by reported probability will over-bet on overconfident wrong predictions. Log loss is the metric that already prices that risk; Brier merely notes it.
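A toy illustration of the sizing point, assuming an even-odds bet where the Kelly fraction is 2p − 1; the real sizing rule lives in Conviction-Scaled Kelly.

```python
def kelly_fraction_even_odds(p: float) -> float:
    # Kelly stake for a 1:1 payout: f* = p - (1 - p) = 2p - 1, floored at zero
    return max(0.0, 2 * p - 1)

print(kelly_fraction_even_odds(0.55))  # honest belief: stake 10% of bankroll
print(kelly_fraction_even_odds(0.99))  # overconfident report: stake 98%
```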
Both are proper. Neither is "more correct". Published evaluations of probabilistic climate and weather forecasts typically report Brier (Brier's own 1950 paper is the origin2). Machine-learning classifier evaluations typically report log loss because it falls naturally out of maximum-likelihood training. In an LLM-forecaster pipeline, both should be logged, and the team should pick one for the primary tuning signal based on whether extreme miscalibration is catastrophic or merely suboptimal.
Reliability diagrams
A reliability diagram plots predicted probability on the x-axis against observed frequency on the y-axis, one point per bin. The diagonal y = x is perfect calibration. Points below the line mean overconfident predictions (forecaster said 0.80, events happened 0.60 of the time); points above the line mean underconfident predictions.
```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, outcomes, n_bins: int = 10, savepath: str | None = None):
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.linspace(0, 1, n_bins + 1)
    bin_idx = np.digitize(probs, bins[1:-1])
    xs, ys, counts = [], [], []
    for k in range(n_bins):
        mask = bin_idx == k
        if mask.sum() == 0:
            continue
        xs.append(probs[mask].mean())
        ys.append(outcomes[mask].mean())
        counts.append(int(mask.sum()))
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 7),
                                   gridspec_kw={"height_ratios": [3, 1]})
    ax1.plot([0, 1], [0, 1], "k--", alpha=0.5, label="perfect")
    ax1.plot(xs, ys, "o-", label="forecaster")
    ax1.set_xlabel("predicted probability")
    ax1.set_ylabel("observed frequency")
    ax1.set_xlim(0, 1)
    ax1.set_ylim(0, 1)
    ax1.legend()
    ax2.bar(xs, counts, width=0.8 / n_bins)
    ax2.set_xlabel("predicted probability")
    ax2.set_ylabel("count")
    ax2.set_xlim(0, 1)
    plt.tight_layout()
    if savepath:
        plt.savefig(savepath, dpi=120)
    return fig

rng = np.random.default_rng(0)
probs = rng.beta(2, 5, 2000)
outcomes = rng.binomial(1, probs ** 1.3, 2000)  # slight overconfidence
reliability_diagram(probs, outcomes, n_bins=10, savepath="reliability.png")
```
The bottom histogram matters. A reliability point that sits far off the diagonal but has only three observations behind it is noise. Always plot the count per bin.
Reliability diagrams also make drift visible at a glance. A forecaster that was well-calibrated last quarter and now shows the high-probability bins pulled systematically below the diagonal has become overconfident, often because an upstream model was upgraded, or because the event-stream distribution shifted away from the forecaster's training. The same plot, overlaid for two time windows, catches this before the aggregate Brier score moves enough to trip any alert.
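A minimal sketch of the two-window overlay on synthetic data; the quarter labels and the shape of the drift are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def binned_curve(probs, outcomes, n_bins: int = 10, min_count: int = 5):
    """Binned (mean forecast, hit rate) pairs; bins below min_count are dropped as noise."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.digitize(probs, edges[1:-1])
    pts = [(probs[idx == k].mean(), outcomes[idx == k].mean())
           for k in range(n_bins) if (idx == k).sum() >= min_count]
    return zip(*pts)

rng = np.random.default_rng(5)
p_old, p_new = rng.beta(2, 2, 1500), rng.beta(2, 2, 1500)
o_old = rng.binomial(1, p_old)         # last quarter: calibrated
o_new = rng.binomial(1, p_new ** 1.4)  # this quarter: drifted, overconfident in the high bins

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot([0, 1], [0, 1], "k--", alpha=0.5, label="perfect")
for label, p, o in [("last quarter", p_old, o_old), ("this quarter", p_new, o_new)]:
    xs, ys = binned_curve(p, o)
    ax.plot(xs, ys, "o-", label=label)
ax.set_xlabel("predicted probability")
ax.set_ylabel("observed frequency")
ax.legend()
plt.savefig("reliability_drift.png", dpi=120)
```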
Putting it to work on finance forecasts
A minimal evaluation workflow for an LLM forecaster in a finance research loop:
- Collect forecasts from the research diary (see D4 on diary schemas). Each record has a market identifier, a resolution timestamp, the model identifier, the prompt hash, and a predicted probability.
- At resolution time, append the 0/1 outcome.
- Compute Brier, log loss, and Brier decomposition grouped by model, prompt hash, and task type.
- Plot reliability diagrams and compare runs before and after prompt revisions or model upgrades.
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
N = 100
true_p = rng.beta(3, 3, N)
outcomes = rng.binomial(1, true_p, N)
forecasts_model_a = np.clip(true_p + rng.normal(0, 0.05, N), 0.01, 0.99)
forecasts_model_b = np.clip(true_p ** 0.6, 0.01, 0.99)  # overconfident

def evaluate(name, probs, outcomes):
    bs = float(np.mean((probs - outcomes) ** 2))
    p = np.clip(probs, 1e-15, 1 - 1e-15)
    ll = float(-np.mean(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p)))
    decomp = brier_decomposition(probs, outcomes, n_bins=10)
    return {
        "model": name,
        "brier": round(bs, 4),
        "log_loss": round(ll, 4),
        "reliability": round(decomp["reliability"], 4),
        "resolution": round(decomp["resolution"], 4),
    }

report = pd.DataFrame([
    evaluate("A_calibrated", forecasts_model_a, outcomes),
    evaluate("B_overconfident", forecasts_model_b, outcomes),
])
print(report)
```
The resulting table makes the comparison legible: model B might have a lower Brier score despite worse calibration because its higher resolution offsets the reliability cost, but log loss typically penalizes its extreme predictions enough to flip the ranking. This is the moment where choosing Brier vs log loss as the primary optimization target matters.
The same table, run before and after a model upgrade, surfaces drift. If Brier rises 0.02 after swapping in a new model, and the entire increase lives in the reliability term, the fix is recalibration rather than retraining. See Isotonic Calibration for LLM Forecasts for the fitted post-hoc recalibration step.
Confidence intervals on the metrics themselves
A single Brier or log-loss number on 100 forecasts is noisy. Bootstrap resampling of (forecast, outcome) pairs gives a confidence interval without any distributional assumptions.
```python
import numpy as np

def bootstrap_ci(probs, outcomes, metric_fn, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    N = len(probs)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, N, N)
        stats[b] = metric_fn(probs[idx], outcomes[idx])
    lo = float(np.quantile(stats, alpha / 2))
    hi = float(np.quantile(stats, 1 - alpha / 2))
    return {"mean": float(stats.mean()), "lo": lo, "hi": hi}

rng = np.random.default_rng(1)
probs = rng.uniform(0, 1, 200)
outcomes = rng.binomial(1, probs, 200)
print(bootstrap_ci(probs, outcomes, brier_score))
print(bootstrap_ci(probs, outcomes, log_loss))
```
A typical output on 200 forecasts shows Brier 95% CI widths of 0.03–0.05, which means two models whose point Brier scores differ by 0.01 are statistically indistinguishable at that sample size. Any claim of "model B is better than model A" should reference CI overlap rather than point estimates alone. A standing rule in the research diary: never upgrade a prompt or swap a model based on a 50-forecast comparison.
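One way to make that rule concrete is to bootstrap the paired difference in scores, resampling the same indices for both models. The sketch below assumes the synthetic forecasts_model_a, forecasts_model_b, and outcomes from the evaluation example above.

```python
import numpy as np

def bootstrap_diff_ci(probs_a, probs_b, outcomes, metric_fn,
                      n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """CI on metric(A) - metric(B); negative means A scores better (lower) than B."""
    rng = np.random.default_rng(seed)
    N = len(outcomes)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, N, N)  # same resample for both models: paired comparison
        diffs[b] = metric_fn(probs_a[idx], outcomes[idx]) - metric_fn(probs_b[idx], outcomes[idx])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return {"mean_diff": float(diffs.mean()), "lo": float(lo), "hi": float(hi)}

print(bootstrap_diff_ci(forecasts_model_a, forecasts_model_b, outcomes, brier_score))
# If the interval straddles zero, the sample cannot separate the two models.
```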
Connects to
- D1: Eval Harness for Finance LLMs — the harness that produces the forecast-outcome pairs scored here.
- D2: Bayesian Updating on LLM Forecasts — how to update beliefs given a stream of scored predictions.
- D4: Research Diary Schema — the auditable log that stores each (prompt, forecast, outcome) tuple.
- Isotonic Calibration for LLM Forecasts — the recalibration step that targets the reliability term specifically.
- Conviction-Scaled Kelly — where log loss matters most, because confidence drives sizing.
- Calibration Dojo — practice tool for human forecasters aiming to match or beat LLM calibration.
- Forecast Scoring Sandbox — browser tool that computes all the metrics above on an uploaded CSV of forecast-outcome pairs.
References
- Savage, L. J. (1971). "Elicitation of Personal Probabilities and Expectations." Journal of the American Statistical Association 66(336), pp. 783–801. Foundational treatment of proper scoring rules as instruments for honest belief elicitation.
- Murphy, A. H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology 12(4), pp. 595–600. The reliability / resolution / uncertainty decomposition of the Brier score.
- DeGroot, M. H., & Fienberg, S. E. (1983). "The Comparison and Evaluation of Forecasters." The Statistician 32(1/2), pp. 12–22. Formalizes the relationship between calibration and refinement (the language that became reliability and resolution).
Footnotes
1. Gneiting, T., & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association 102(477), pp. 359–378. The modern reference for the full class of proper scoring rules and their decompositions.
2. Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review 78(1), pp. 1–3. The paper that introduced what is now called the Brier score, originally for weather forecasting.