TL;DR
LLM forecasts are noisy and systematically biased: overconfident on macro topics, underconfident on technical topics, and unstable across repeated runs. Bayesian updating treats the LLM output as one piece of evidence and combines it with a reference-class prior to produce a calibrated posterior. Beta-Binomial is the conjugate choice for binary forecasts; Normal-Inverse-Gamma handles continuous ones. The core implementation runs in roughly fifty lines of Python and needs only a historical base rate, a sample-size confidence parameter, and a log of past (prediction, outcome) pairs. Over time the prior absorbs realised outcomes and the system self-calibrates. The cost of this discipline is one extra update step per forecast; the payoff is that a single confidently wrong LLM call stops driving the book.
The setup: prior, likelihood, posterior
A Bayesian forecast is a three-part object. The prior encodes beliefs before any model is consulted and comes from a reference class of analogous past events. The likelihood encodes how informative a given piece of evidence is about the unknown. The posterior is the coherent combination of the two.
The LLM output is not the posterior. It is evidence — a single, noisy observation about the thing being forecast. Treating it as the final probability skips the step that turns it from a prompt artefact into something a Kelly-style sizer1 can safely consume.
In compact form:
posterior ∝ prior × likelihood
For recurring finance forecasts the prior is cheap to build from historical frequencies, and the likelihood is cheap to estimate from the LLM's own track record. Neither step requires exotic machinery.
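For intuition, the same identity spelled out for a two-hypothesis case. A minimal sketch; the prior is the base rate from the next section and the two likelihood values are purely illustrative:

```python
# Bayes' rule for a binary event. The prior comes from a reference class;
# the two likelihoods are illustrative placeholders for an evidence source.
prior = 0.42                       # base rate: P(beat)
lik_beat, lik_miss = 0.80, 0.30    # P(evidence | beat), P(evidence | miss)
posterior = (lik_beat * prior) / (
    lik_beat * prior + lik_miss * (1 - prior)
)
print(round(posterior, 3))  # 0.659 -- the evidence moves the 0.42 prior up
```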
Choosing the reference class
Reference-class forecasting traces to Kahneman and Tversky2 and was popularised in planning contexts. The mechanic is the same in markets: instead of asking "what will happen to this specific case," the forecaster first asks "what happened across N similar cases historically." That base rate is the prior.
Three common shapes:
- Binary. Will semiconductor issuer SYNTHETIC_A beat consensus EPS next quarter? Reference class = all semiconductor issuers in the prior twenty quarters (a few hundred filings). Base rate = fraction that beat consensus.
- Continuous. What is next-year EPS growth for SYNTHETIC_B? Reference class = empirical distribution of one-year EPS growth across issuers in the same sub-industry with similar revenue scale. Prior = mean and variance of that distribution.
- Multi-category. Which of five Federal Open Market Committee action categories (cut 50bps, cut 25bps, hold, hike 25bps, hike 50bps) is likely? Reference class = historical frequencies of FOMC actions under comparable CPI and unemployment regimes.
The reference class must be defined before any prediction is made, or the prior becomes hindsight-coloured. A practitioner running this loop in production freezes the reference-class definition in a config file and only revises it on a documented schedule, not mid-forecast. Tetlock's work on superforecasters3 emphasises this point — the best forecasters start from base rates and adjust, rather than starting from a narrative and stretching for a number.
| Forecast target | Reference class | Prior family |
|---|---|---|
| Will X occur? | Historical frequency of analogous events | Beta |
| Real-valued metric Y | Empirical distribution of analogous cases | Normal or Normal-Inverse-Gamma |
| Which of K outcomes? | Historical multinomial frequencies | Dirichlet |
| Rate or count per period | Historical count distribution | Gamma or Poisson-Gamma |
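One way to honour the freeze-before-forecast rule above is a checked-in constant that nothing mid-forecast can touch. A minimal sketch; every field name and value here is illustrative, not a prescribed schema:

```python
# reference_class.py -- frozen before any forecast is made and revised
# only on a documented schedule. All names and values are illustrative.
REFERENCE_CLASS = {
    "target": "eps_beat",
    "universe": "semiconductor_issuers",
    "window_quarters": 20,      # lookback defining the reference class
    "base_rate": 0.54,          # historical frequency within the class
    "pseudo_count": 50,         # prior strength committed in advance
    "frozen_on": "2024-01-02",
    "next_review": "2024-07-01",
}
```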
Beta-Binomial update (binary)
The Beta distribution is the conjugate prior for a Bernoulli likelihood. That means if the prior is Beta(α, β) and k successes are observed in n trials, the posterior is Beta(α + k, β + n - k). No integration, no MCMC, no sampling.
Translating the LLM case: the "trial" is the single LLM forecast, and the "success count" is a fractional number scaled by the model's historical reliability. If the LLM outputs probability p and has historical validation reliability w on this task family (w in [0, 1]), treat the LLM call as contributing w · p successes and w · (1 - p) failures to the posterior. The weight w is the effective number of "implied observations" the LLM vote is worth.
Eliciting α and β from a base rate plus sample-size confidence:
- Base rate μ from reference-class frequency (say 0.42).
- Sample-size confidence s (the "pseudo-count") — how many implied prior observations the analyst is willing to commit to. A weak prior might use s = 10; a strong one s = 100.
- α = μ · s, β = (1 - μ) · s.
Runnable Python:
```python
from dataclasses import dataclass, field
from scipy.stats import beta as beta_dist  # renamed to avoid shadowing self.beta

@dataclass
class BetaBinomialUpdater:
    alpha_prior: float
    beta_prior: float
    alpha: float = field(init=False)
    beta: float = field(init=False)

    def __post_init__(self):
        self.alpha = self.alpha_prior
        self.beta = self.beta_prior

    @classmethod
    def from_base_rate(cls, base_rate: float, pseudo_count: float):
        a = base_rate * pseudo_count
        b = (1.0 - base_rate) * pseudo_count
        return cls(alpha_prior=a, beta_prior=b)

    def update(self, llm_probability: float, llm_weight: float = 1.0):
        # Treat one LLM call as llm_weight implied observations.
        p = max(1e-6, min(1 - 1e-6, llm_probability))
        self.alpha += llm_weight * p
        self.beta += llm_weight * (1.0 - p)

    def record_outcome(self, outcome: int):
        # outcome in {0, 1}; updates the running prior with a real observation.
        self.alpha += outcome
        self.beta += 1 - outcome

    def posterior_mean_and_ci(self, conf: float = 0.90):
        mean = self.alpha / (self.alpha + self.beta)
        lo = beta_dist.ppf((1 - conf) / 2, self.alpha, self.beta)
        hi = beta_dist.ppf(1 - (1 - conf) / 2, self.alpha, self.beta)
        return mean, lo, hi
```
Worked walkthrough. Reference class says semiconductor issuers beat consensus 54% of the time (s = 50 gives α = 27, β = 23). An LLM run says the specific issuer has 0.72 probability of beating. Suppose the LLM's historical reliability weight on single-issuer EPS calls is 5 (five implied observations per call — conservative, since LLM calls correlate across runs). Posterior mean settles around 0.56 rather than 0.72, with a tighter credible interval than the prior alone. Ninety-percent credible intervals come free from scipy.stats.beta.ppf.
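Reproducing the walkthrough with the class above:

```python
updater = BetaBinomialUpdater.from_base_rate(base_rate=0.54, pseudo_count=50)
updater.update(llm_probability=0.72, llm_weight=5.0)   # one LLM call, weight 5
mean, lo, hi = updater.posterior_mean_and_ci(conf=0.90)
print(round(mean, 3))  # 0.556 -- the prior pulls the 0.72 call back
```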
Setting llm_weight honestly matters. An LLM reliability of "one implied observation" is conservative and appropriate when the model has not been benchmarked on this exact task family. The price-blind LLM research harness describes how to estimate reliability without leaking prices into the model context.
Normal-Inverse-Gamma update (continuous)
Real-valued forecasts — revenue growth, target price, next-quarter free cash flow — need a prior over both the unknown mean and the unknown variance. The Normal-Inverse-Gamma (NIG) distribution is the conjugate prior for a Normal likelihood with both parameters unknown. It tracks four hyperparameters: μ₀ (prior mean), κ₀ (mean pseudo-count), α₀ (variance shape), β₀ (variance scale).
Posterior update after observing x (treated as one noisy draw from the underlying distribution):
μ_n = (κ₀·μ₀ + x) / (κ₀ + 1)
κ_n = κ₀ + 1
α_n = α₀ + 0.5
β_n = β₀ + (κ₀ · (x - μ₀)² / (2 · (κ₀ + 1)))
Runnable Python:
```python
from dataclasses import dataclass, field
from math import sqrt
from scipy.stats import t as student_t

@dataclass
class NIGUpdater:
    mu0: float
    kappa0: float
    alpha0: float
    beta0: float
    mu: float = field(init=False)
    kappa: float = field(init=False)
    alpha: float = field(init=False)
    beta: float = field(init=False)

    def __post_init__(self):
        self.mu = self.mu0
        self.kappa = self.kappa0
        self.alpha = self.alpha0
        self.beta = self.beta0

    @classmethod
    def from_reference_class(cls, ref_mean: float, ref_var: float,
                             mean_pseudo: float = 10.0,
                             var_pseudo: float = 10.0):
        # Match moments so the prior mean of the variance,
        # beta0 / (alpha0 - 1), equals ref_var.
        a0 = var_pseudo / 2.0 + 1.0
        b0 = (a0 - 1.0) * ref_var
        return cls(mu0=ref_mean, kappa0=mean_pseudo, alpha0=a0, beta0=b0)

    def update(self, llm_forecast: float, llm_weight: float = 1.0):
        # Treat the LLM point estimate as llm_weight identical noisy draws.
        x = llm_forecast
        k0, m0 = self.kappa, self.mu
        self.mu = (k0 * m0 + llm_weight * x) / (k0 + llm_weight)
        self.kappa = k0 + llm_weight
        self.alpha += llm_weight / 2.0
        self.beta += llm_weight * k0 * (x - m0) ** 2 / (2.0 * (k0 + llm_weight))

    def record_outcome(self, x: float):
        self.update(x, llm_weight=1.0)

    def predictive_mean_and_ci(self, conf: float = 0.90):
        # Posterior predictive is Student-t with 2*alpha degrees of freedom.
        dof = 2.0 * self.alpha
        scale = sqrt(self.beta * (self.kappa + 1) / (self.alpha * self.kappa))
        lo = self.mu + scale * student_t.ppf((1 - conf) / 2, dof)
        hi = self.mu + scale * student_t.ppf(1 - (1 - conf) / 2, dof)
        return self.mu, lo, hi
```
The predictive distribution is Student-t, which matters: it has fatter tails than Normal, a feature not a bug for finance targets. Short-tailed Gaussian intervals on EPS growth underestimate the probability of hitting a reference-class outlier. Student-t credible intervals widen automatically when α is small (few implied observations), which is exactly the correct behaviour early in a calibration cycle.
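A usage sketch with illustrative reference-class moments (the 6% mean, 0.04 variance, and weight of 2 are assumptions for the example, not figures from any dataset):

```python
# Reference class: one-year EPS growth; moments are illustrative.
nig = NIGUpdater.from_reference_class(ref_mean=0.06, ref_var=0.04,
                                      mean_pseudo=10.0, var_pseudo=10.0)
nig.update(llm_forecast=0.15, llm_weight=2.0)  # LLM point estimate, modest weight
mean, lo, hi = nig.predictive_mean_and_ci(conf=0.90)
# mean lands near 0.075, between the prior (0.06) and the LLM (0.15);
# the Student-t interval stays wide while alpha is still small.
```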
LLM-as-likelihood: calibrating the LLM's own output
The LLM outputs a probability or point estimate, not a likelihood. Converting one to the other requires a historical mapping from "LLM said X" to "reality was Y." The track record is the calibration table.
For binary forecasts, the table has one row per probability bucket and two columns: predicted and realised. For continuous forecasts, it is a scatter of predicted vs. realised with a regression line (or isotonic fit) recording the systematic bias. The sister article on isotonic calibration of LLM forecasts covers the non-parametric mapping in detail; the Bayesian flow builds on top of it.
Runnable Python for maintaining a binary calibration table from outcome logs:
```python
import json
from collections import defaultdict
from pathlib import Path

class LLMCalibrationTable:
    def __init__(self, path: str, n_buckets: int = 10):
        self.path = Path(path)
        self.n_buckets = n_buckets
        self.buckets = defaultdict(lambda: {"n": 0, "wins": 0, "sum_p": 0.0})
        self._load()

    def _bucket_for(self, p: float) -> int:
        idx = int(p * self.n_buckets)
        return min(idx, self.n_buckets - 1)

    def record(self, llm_probability: float, outcome: int):
        b = self._bucket_for(llm_probability)
        self.buckets[b]["n"] += 1
        self.buckets[b]["wins"] += outcome
        self.buckets[b]["sum_p"] += llm_probability
        self._save()

    def reliability(self, llm_probability: float, min_n: int = 20) -> float:
        # Effective "implied observations" a new LLM call is worth.
        b = self._bucket_for(llm_probability)
        row = self.buckets.get(b)
        if not row or row["n"] < min_n:
            return 1.0  # weak default until the bucket has samples
        avg_pred = row["sum_p"] / row["n"]
        avg_real = row["wins"] / row["n"]
        # Smaller weight when predicted and realised diverge.
        gap = abs(avg_pred - avg_real)
        return max(0.25, 5.0 * (1.0 - 2.0 * gap))

    def _save(self):
        self.path.write_text(
            json.dumps({str(k): v for k, v in self.buckets.items()})
        )

    def _load(self):
        if self.path.exists():
            raw = json.loads(self.path.read_text())
            for k, v in raw.items():
                self.buckets[int(k)] = v
```
The table starts empty. Every outcome log entry calls record(prob, outcome), and the reliability weight used by BetaBinomialUpdater.update comes from reliability(prob). Until a bucket accumulates twenty outcomes, the call returns a weak default of 1.0. Past that, the weight rises when bucket-level prediction matches bucket-level reality and falls when they diverge. The shape of this logic is standard in adaptive-weight ensembles; Winkler's 1967 paper on the assessment of prior distributions4 sketches the underlying argument.
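A usage sketch of that loop (the file path and the logged pairs are illustrative):

```python
table = LLMCalibrationTable("calibration.json")
# After each outcome lands, log the (LLM probability, outcome) pair.
for p, y in [(0.70, 1), (0.65, 0), (0.72, 1)]:
    table.record(p, y)
# Weight the next LLM call; returns 1.0 until the bucket has 20 outcomes.
w = table.reliability(0.68)
```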
Pattern: LLM + reference class + recent outcomes
A three-source ensemble in a single recurring forecast loop:
```python
from dataclasses import dataclass

@dataclass
class ThreeSourceForecaster:
    ref_base_rate: float
    ref_pseudo_count: float
    calibration_table: LLMCalibrationTable

    def forecast(self, llm_probability: float,
                 recent_outcomes: list[int]) -> tuple[float, float, float]:
        # 1. Build the prior from the reference class.
        updater = BetaBinomialUpdater.from_base_rate(
            self.ref_base_rate, self.ref_pseudo_count
        )
        # 2. Feed recent realised outcomes (most weight on the newest).
        for i, y in enumerate(reversed(recent_outcomes)):
            decay = 0.9 ** i
            updater.alpha += decay * y
            updater.beta += decay * (1 - y)
        # 3. Feed the LLM vote with a reliability-based weight.
        w = self.calibration_table.reliability(llm_probability)
        updater.update(llm_probability, llm_weight=w)
        return updater.posterior_mean_and_ci(conf=0.90)

    def record(self, llm_probability: float, outcome: int):
        self.calibration_table.record(llm_probability, outcome)
```
The prior, the recency channel, and the LLM channel all update the same Beta posterior. Nothing prevents adding a fourth source (an ensemble of LLMs, or a simple statistical baseline) with its own weight. Because the Beta family is closed under this kind of update, the posterior is always a Beta; credible intervals remain a closed-form lookup. This is the same pattern Raiffa and Schlaifer5 documented for conjugate families in applied statistical decision theory.
A practitioner running this loop with conviction-scaled Kelly sizing passes the posterior mean (not the LLM output) into the sizer and uses the credible-interval width as a separate sanity gate on position size.
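A minimal sketch of that hand-off, assuming an even-odds payoff, a half-Kelly fraction, and a 0.25 interval-width gate (all three values are illustrative choices, not figures from the sizing article):

```python
def kelly_fraction(p: float, odds: float = 1.0) -> float:
    # Full Kelly for a binary bet paying `odds` per unit staked.
    return max(0.0, (p * (odds + 1.0) - 1.0) / odds)

def sized_position(mean: float, lo: float, hi: float,
                   kelly_scale: float = 0.5,
                   max_ci_width: float = 0.25) -> float:
    # Sanity gate: stand down entirely when the credible interval is too wide.
    if hi - lo > max_ci_width:
        return 0.0
    return kelly_scale * kelly_fraction(mean)

# Posterior from the Beta-Binomial walkthrough above, not the raw LLM output.
mean, lo, hi = updater.posterior_mean_and_ci(conf=0.90)
print(sized_position(mean, lo, hi))
```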
When Bayesian updating helps
- Binary and small-N categorical forecasts — earnings beats, FOMC decisions, binary policy outcomes, merger-arb completions. Conjugate update is closed-form and cheap.
- Continuous forecasts with well-defined reference classes — issuer growth rates, spread levels, volatility regimes. NIG + reference class + LLM as one noisy observation.
- Recurring forecasts — any task repeated dozens of times per month. The prior actually updates. Calibration-table bucket counts rise fast enough that reliability weights become informative within a few months.
- Ensembling with humans or other models — each source gets a weight; the Beta or NIG posterior is the coherent combination.
When it doesn't
- One-off novel events. If there is no defensible reference class, there is no prior to anchor the update. Falling back to a uniform Beta(1, 1) makes the update mathematically valid but practically useless — the posterior is whatever the LLM said. Better to log the forecast as uncalibrated and exclude it from anything downstream of Kelly.
- LLM output already posterior-shaped. Some prompts explicitly instruct the model to combine a base rate with evidence and return a calibrated answer. Feeding that into a Beta update applies the prior twice. The 8-step LLM research prompt template makes this state visible: if step 8 already produced a Bayesian synthesis, the outer update collapses to identity.
- Outcome data delayed years. Forecasts whose ground truth only lands in 2030 will never meaningfully update the prior during the holding period. Structural forecasts of very long-dated outcomes need a different discipline (scenario analysis, expected-value bounds) rather than Bayesian tracking.
- Heavy regime change. When the reference class ceases to be representative — for example, forecasting rate-hike paths on a twenty-year sample that mostly predates 2022 — the prior is actively misleading. The fix is to shorten the reference window or add a regime indicator, not to crank down the prior weight blindly.
Connects to
- Calibrating LLM Forecasts with Isotonic Regression — sister piece on the non-parametric calibration mapping that feeds this update.
- Eval Harness for Finance LLMs — how the reliability weights used above get measured in the first place.
- Brier Scores and Log Loss for Forecasters — proper scoring rules for grading the posterior against realised outcomes.
- The Auditable Research Diary Schema — the log format that stores (prediction, outcome) pairs without fabrication.
- Conviction-Scaled Kelly Sizing — consumer of the posterior mean and credible-interval width.
- The 8-Step LLM Research Prompt Template — produces the single LLM forecast the update step consumes.
- Calibration Dojo — browser sandbox for checking reliability-curve behaviour on a pasted log of predictions and outcomes.
- Kelly Sizer — posterior mean + credible interval drop straight into fractional-Kelly sizing.
References
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Reference text for posterior-predictive distributions and hierarchical updates.
- Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). "Probabilistic forecasts, calibration and sharpness." Journal of the Royal Statistical Society B 69(2), pp. 243–268. Formal definitions of calibration used to grade posteriors.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press, chapters 3 and 5. Compact derivations of Beta-Binomial and Normal-Inverse-Gamma conjugacy.
Footnotes
1. Kelly, J. L. (1956). "A New Interpretation of Information Rate." Bell System Technical Journal 35(4), pp. 917–926. Foundational link between probability calibration and sizing; downstream of any Bayesian posterior used for bet sizing.
2. Kahneman, D., & Tversky, A. (1979). "Intuitive Prediction: Biases and Corrective Procedures." TIMS Studies in Management Science 12, pp. 313–327. Original formulation of the outside-view / reference-class argument.
3. Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown. Empirical case for base-rate anchoring in human forecasting.
4. Winkler, R. L. (1967). "The Assessment of Prior Distributions in Bayesian Analysis." Journal of the American Statistical Association 62(319), pp. 776–800. Methods for eliciting α, β and their pseudo-count interpretation.
5. Raiffa, H., & Schlaifer, R. (1961). Applied Statistical Decision Theory. Harvard Business School. Canonical treatment of conjugate priors including Beta-Binomial and Normal-Inverse-Gamma.