TL;DR

Free-form "analyze this" prompts produce unreproducible, uncalibrated LLM output. A structured 8-step research prompt — (1) reference class, (2) decomposition, (3) evidence inventory, (4) pre-mortem, (5) base-rate framing, (6) extreme check, (7) invalidation conditions, (8) JSON output — produces output that calibrates better, reproduces across model versions, and catches obvious failure modes at authoring time. Below: the full template, why each step matters, and the one step people always skip (#6, the extreme check).

Why structure matters

Two prompts produce wildly different outputs:

A: "What's the outlook for SYNTHETIC_A?"

B: "For SYNTHETIC_A, complete the following 8 steps and return a JSON with the specified keys..."

Prompt A produces fluent but unverifiable prose. Prompt B produces a structured output in which every claim is attached to a step, every step is auditable, and the JSON schema forces a calibrated probability at the end. The "vibes" of the output converge toward whatever structure the prompt imposes.

The 8 steps

1. Reference class

Which historical class of events is this most similar to? List 3–5 reference events with dates and outcomes.

This is the Tetlock "superforecasters" move: start with the base rate of the reference class before looking at the specific case. Without this step, LLMs anchor immediately on the specific case and miss the outside view.

2. Decomposition

Break the question into 3–5 sub-questions that would each move the probability materially. For each, state what evidence would resolve it.

Forces the model to enumerate the dependency structure. If the model cannot decompose, it does not understand the question well enough to predict it.

3. Evidence inventory

List the 5 strongest pieces of evidence supporting a YES outcome and the 5 strongest pieces of evidence supporting a NO outcome. Rank each by weight (high/medium/low).

A balance sheet for the forecast. The act of listing NO-side evidence surfaces counter-arguments the LLM would otherwise suppress to keep the analysis flowing.

4. Pre-mortem

Imagine this forecast is wrong in 30 days. Write the obituary. What were the three most likely ways the reasoning failed?

Named by Gary Klein; adopted by Kahneman in Thinking, Fast and Slow. The LLM's pre-mortem surfaces base-rate violations ("I ignored that X usually recovers within 2 weeks"), hidden assumptions, and scenario-dependence that the forward analysis missed.

5. Base-rate framing

Frame the question as frequency rather than probability. "Of 100 similar situations, how many result in YES?"

Addresses the well-replicated finding, popularized by Kahneman, that humans (and LLMs trained on human text) reason better about frequencies than about probabilities. "65 of 100" and "65%" look equivalent but produce different LLM outputs.

6. Extreme check

State a plausible path to YES and a plausible path to NO. If your probability estimate is above 85% or below 15%, you must describe a plausible alternative scenario in at most 3 sentences.

The step most people skip. It prevents overconfidence. An LLM that cannot articulate a plausible path against its own claim does not understand the question well enough to be that confident.
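The extreme check is easy to enforce mechanically once the model's answer has been parsed. A minimal sketch, assuming the parsed output is a dict with a `probability_yes` key and a hypothetical `alternative_scenario` field (not part of the step-8 schema; added here purely for illustration):

```python
# Hypothetical gate for the extreme check. Field names are illustrative,
# not part of the step-8 JSON schema.

def passes_extreme_check(forecast: dict) -> bool:
    """Reject confident forecasts that lack a plausible alternative scenario."""
    p = forecast["probability_yes"]
    if 0.15 <= p <= 0.85:
        return True  # not extreme; no alternative scenario required
    alt = forecast.get("alternative_scenario", "").strip()
    # Require a non-empty alternative of at most 3 sentences.
    return bool(alt) and alt.count(".") <= 3
```

Forecasts that fail the gate can be sent back to the model with an explicit request to either justify the extreme estimate or moderate it.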

7. Invalidation conditions

What specific, observable event would make you materially change your probability? List 3, each with a threshold.

Converts the forecast from narrative to falsifiable. These become live monitoring signals. Without them, the forecast is not a prediction — it's a story.
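One way to make "live monitoring signals" concrete is to represent each condition as a small machine-checkable record. A sketch under assumed field names (event, threshold, direction), all illustrative:

```python
from dataclasses import dataclass

# Hypothetical representation of one invalidation condition as a
# monitoring signal. Field names are illustrative.

@dataclass
class InvalidationCondition:
    event: str        # observable quantity, e.g. "weekly signups"
    threshold: float  # level at which the forecast should be revisited
    direction: str    # "above" or "below"

    def triggered(self, observed: float) -> bool:
        """True when the observed value crosses the threshold."""
        if self.direction == "above":
            return observed > self.threshold
        return observed < self.threshold
```

A scheduler can then re-run the forecast whenever any condition's `triggered` flips to true.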

8. JSON output

Return a JSON object with the following keys exactly: probability_yes (0..1), reference_class (string[]), top_supporting_evidence (string[]), top_contradicting_evidence (string[]), invalidation_conditions (string[]), confidence_band ("low" | "medium" | "high"). No prose outside the JSON.

Enforces structured output. The probability must exist; the evidence must be listed; the invalidation conditions are machine-readable. This is the payload the downstream risk layer consumes.
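Because the downstream layer consumes this payload, it is worth validating it before use. A minimal validator sketch for the schema above (the key names and ranges come from step 8; the function itself is illustrative):

```python
import json

# Minimal validator for the step-8 payload. Key names and the 0..1 range
# follow the schema above; everything else is an illustrative sketch.

REQUIRED_KEYS = {
    "probability_yes": (int, float),
    "reference_class": list,
    "top_supporting_evidence": list,
    "top_contradicting_evidence": list,
    "invalidation_conditions": list,
    "confidence_band": str,
}

def validate_payload(raw: str) -> dict:
    """Parse the model's final JSON and reject malformed payloads."""
    payload = json.loads(raw)
    for key, typ in REQUIRED_KEYS.items():
        if not isinstance(payload.get(key), typ):
            raise ValueError(f"missing or mistyped key: {key}")
    if not 0.0 <= payload["probability_yes"] <= 1.0:
        raise ValueError("probability_yes out of [0, 1]")
    if payload["confidence_band"] not in ("low", "medium", "high"):
        raise ValueError("invalid confidence_band")
    return payload
```

Rejected payloads can be retried with the validation error appended to the prompt, which in practice is usually enough to get a conforming response.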

Example system prompt

You are an 8-step research assistant for event forecasting.

For the question below, complete each of the 8 steps in order:

  1. Reference class — list 3-5 historical analogs with dates + outcomes.
  2. Decomposition — break into 3-5 sub-questions; specify resolving evidence for each.
  3. Evidence inventory — 5 strongest YES + 5 strongest NO, each weighted high/medium/low.
  4. Pre-mortem — imagine the forecast is wrong in 30 days; write the obituary in <=3 paragraphs.
  5. Base-rate framing — "Of 100 similar situations, how many result in YES?"
  6. Extreme check — if your estimate is >85% or <15%, describe a plausible alternative scenario.
  7. Invalidation conditions — 3 observable events that would flip your mind, each with a threshold.
  8. JSON output — return exactly:
     {
       "probability_yes": <0..1>,
       "reference_class": [...],
       "top_supporting_evidence": [...],
       "top_contradicting_evidence": [...],
       "invalidation_conditions": [...],
       "confidence_band": "low" | "medium" | "high"
     }

Rules:
  - No market prices, no tickers, no positions in any step.
  - If evidence is thin, return low confidence. Do not speculate.
  - No prose outside the final JSON.

What this doesn't fix

The 8-step template improves structure and calibration but does not add information. If the research pack is thin, the output will honestly reflect that (confidence_band = "low"). The template also does not replace post-hoc calibration: raw LLM probabilities remain miscalibrated even with the template, so you still need isotonic regression fit on a dated log of (probability, outcome) pairs.
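In practice the isotonic step is one call to scikit-learn's IsotonicRegression; to keep this self-contained, here is a pure-Python Pool Adjacent Violators sketch of the same idea, with synthetic, purely illustrative (probability, outcome) data:

```python
# Pure-Python sketch of isotonic (PAV) calibration. In production you
# would use sklearn.isotonic.IsotonicRegression; the data is synthetic.

def pav_calibrate(raw_probs, outcomes):
    """Fit the monotone step function mapping raw probabilities to
    empirical outcome rates (Pool Adjacent Violators)."""
    pairs = sorted(zip(raw_probs, outcomes))
    merged = []  # blocks of [x, sum_of_outcomes, count]
    for x, y in pairs:
        merged.append([x, y, 1])
        # Merge backwards while block means violate monotonicity
        # (cross-multiplied to avoid division).
        while len(merged) > 1 and merged[-2][1] * merged[-1][2] > merged[-1][1] * merged[-2][2]:
            _, s2, n2 = merged.pop()
            merged[-1][1] += s2
            merged[-1][2] += n2
    return [(x, s / n) for x, s, n in merged]

def apply_calibration(curve, p):
    """Step-function lookup: value of the last block whose x <= p."""
    value = curve[0][1]
    for x, v in curve:
        if p >= x:
            value = v
    return value

# Synthetic dated log: raw probability_yes vs. realized 0/1 outcome.
curve = pav_calibrate([0.10, 0.30, 0.55, 0.70, 0.90, 0.95],
                      [0,    0,    1,    0,    1,    1])
```

The fitted curve is monotone by construction, so a higher raw probability never maps to a lower calibrated one.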

Verify your prompt across models

Run the same 8-step prompt through the Prompt Regression Tester on Haiku, Sonnet, Opus, GPT-5, Gemini 2.5. Look for:

  • Consistency of reference class: all models should independently name overlapping historical analogs. Divergence here means the question is too ambiguous.
  • Consistency of probability: within ±10 percentage points across models. Wider divergence means the question is underspecified.
  • Pre-mortem quality: Haiku often skips step 4; Sonnet usually handles it; Opus produces the most substantive ones.
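The probability-consistency check is simple to automate. A sketch, assuming each model's parsed `probability_yes` has been collected into a dict (model names and numbers are illustrative):

```python
# Hypothetical cross-model consistency check: flag the question as
# underspecified when per-model estimates spread beyond 10 percentage
# points. Model names and values are illustrative.

def probability_spread(estimates: dict) -> float:
    """Max minus min probability_yes across models, in percentage points."""
    values = list(estimates.values())
    return (max(values) - min(values)) * 100

estimates = {"haiku": 0.58, "sonnet": 0.63, "opus": 0.61}
spread = probability_spread(estimates)
underspecified = spread > 10  # wider than the 10pp band: rework the question
```

When the flag fires, the fix is usually to tighten the question's resolution criteria, not to average the models' numbers.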

References

  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction.
  • Klein, G. (2007). "Performing a Project Premortem." Harvard Business Review.
  • Kahneman, D. (2011). Thinking, Fast and Slow.