TL;DR

Most LLM trading research fails invisibly because the LLM sees the current price inside its own context and every "analysis" becomes a retroactive justification for that price. The fix is architectural: split context into a market-visible half (prices, positions, PnL) and a research half (filings, earnings, macro, news). The LLM only ever sees the research half. It emits a structured probability and thesis. The execution layer, which is allowed to see prices, decides what to do with it. Below: the pattern, why it works, three failure modes to avoid, and a 30-line Python scaffold.

The failure mode

Ask any LLM to analyze a stock after giving it recent price action. Count the number of "given the recent drop" and "given the recent strength" framings in its response. The LLM is not analyzing; it is confabulating a narrative that rationalizes the price. This is the single most common way retail LLM research fails: the model looks fluent because it always had the answer — the answer was the current price — and it just had to invent a justification.

The fix is not a better prompt. The fix is removing the price from the prompt.

The pattern

Research context is split into two pipelines that never merge before the LLM call:

        ┌─────────────────────┐
ticker ─┤ fetch_research_pack │──► research_pack
        └─────────────────────┘    (filings, earnings, news, fundamentals —
                  │                 no prices, no positions, no PnL)
                  ▼
        ┌─────────────────────┐
        │   research_prompt   │       (price is deliberately NOT an input)
        └─────────────────────┘
                  │
                  ▼
              LLM call
                  │
                  ▼
  {probability, thesis, invalidation_conditions}
                  │
                  ▼
        ┌─────────────────────┐
price ──┤  risk + execution   │
        └─────────────────────┘
                  │
                  ▼
             sized_order

The arrow from price to research_prompt is deliberately absent. The execution layer has prices. The research layer does not.

Why this works

Without price in context, the LLM has no answer to retroactively justify. Its output depends only on the evidence in the research pack, which is (by construction) the same evidence a thoughtful analyst would use — earnings cadence, filings language, competitive dynamics, industry headwinds, not "the chart is red today."

The surprising result: price-blind analyses are materially less confident-sounding than price-informed analyses, but their calibration — over hundreds of paired trials against ground truth — is dramatically better. This mirrors the classic result in forecasting research that suppressing confidence-inflating information improves Brier scores.
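The Brier score mentioned above is just the mean squared error between stated probabilities and binary outcomes — lower is better, and a constant 0.5 forecast scores 0.25. A few lines suffice to compute it on your own research log:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated probabilities (0..1) and
    realized binary outcomes (0 or 1). Lower is better."""
    assert len(forecasts) == len(outcomes) and forecasts
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
```

Running this on paired price-blind vs. price-informed outputs over the same events is the cheapest way to check the claim for your own setup.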

A minimal scaffold

import json
from dataclasses import asdict, dataclass
from typing import Literal

@dataclass
class ResearchPack:
    ticker_anonymous: str  # "SYNTHETIC_A" — do NOT pass real ticker
    filings_excerpts: list[str]
    earnings_transcripts: list[str]
    competitor_context: list[str]
    macro_regime_summary: str
    # NOTE: no prices, no charts, no position info, no PnL

@dataclass
class ResearchOutput:
    probability_up_30d: float  # 0..1
    thesis: str
    invalidation_conditions: list[str]
    confidence_band: Literal["low", "medium", "high"]

def research(pack: ResearchPack) -> ResearchOutput:
    system = (
        "You are a research analyst. You are reviewing an event, not a price. "
        "Return a structured JSON with keys: probability_up_30d, thesis, "
        "invalidation_conditions, confidence_band. No speculation about price. "
        "If evidence is thin, return low confidence."
    )
    prompt = json.dumps(asdict(pack))
    # Call Anthropic / OpenAI / Gemini here with BYO key.
    # Critical: NEVER include price, PnL, or position information in `prompt`.
    ...

The important move is in the ResearchPack dataclass: by defining the input shape, you can grep your own codebase for research( calls and prove they never carry prices. The failure mode is usually accidental — a debug print or a retry wrapper that adds context. The dataclass catches it at the boundary.
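One way to enforce that boundary mechanically — a heuristic sketch, not an exhaustive filter — is a guard that scans the serialized pack for price-shaped content before every LLM call. The pattern list here is an assumption you should extend for your own data:

```python
import json
import re
from dataclasses import asdict

# Heuristic patterns for content that should never reach the research layer.
FORBIDDEN = re.compile(r"price|pnl|position|\$\d", re.IGNORECASE)

def assert_price_blind(pack) -> None:
    """Raise if anything price-shaped leaked into the research pack."""
    serialized = json.dumps(asdict(pack))
    match = FORBIDDEN.search(serialized)
    if match:
        raise ValueError(f"price-blind boundary violated near: {match.group()!r}")
```

Calling this at the top of `research()` turns an accidental leak (the debug print, the retry wrapper) into a loud failure instead of a silently contaminated analysis.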

Three failure modes to avoid

1. Ticker leakage. If the LLM knows the ticker, it knows the price (from training data through its cutoff, and from its knowledge of the current market). Anonymize to SYNTHETIC_A / SYNTHETIC_B in the research step. The execution layer, which has the real ticker, does the mapping.
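A minimal sketch of that mapping, living entirely in the execution layer (the class name and label scheme are illustrative):

```python
class TickerAnonymizer:
    """Execution-layer mapping between real tickers and synthetic labels.
    The research layer only ever sees the synthetic side."""

    def __init__(self) -> None:
        self._labels = (f"SYNTHETIC_{c}" for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        self._real_to_syn: dict[str, str] = {}
        self._syn_to_real: dict[str, str] = {}

    def anonymize(self, ticker: str) -> str:
        if ticker not in self._real_to_syn:
            label = next(self._labels)
            self._real_to_syn[ticker] = label
            self._syn_to_real[label] = ticker
        return self._real_to_syn[ticker]

    def deanonymize(self, label: str) -> str:
        return self._syn_to_real[label]
```

The mapping must be stable within a session (so the LLM's "SYNTHETIC_A" thesis maps back to the right order) but should be reshuffled across sessions so labels never acquire meaning of their own.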

2. Date leakage for hot names. "Late 2024" + "chip export restrictions" + "foundry delay" implies a specific set of tickers to any LLM trained past that window. Either abstract dates to relative form ("fiscal year in review") or restrict the pack to dates and events that do not uniquely identify the name.

3. Retry-induced price contamination. Structured-output retry loops sometimes include the previous (failed) response in the new prompt. If that previous response mentioned price, your "price-blind" second call isn't. Audit your retry path and strip.
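A retry wrapper that stays price-blind simply resends the same clean prompt instead of appending the failed response. Here `call_llm` and `validate` are assumed interfaces, not a specific SDK:

```python
def call_with_retries(call_llm, prompt: str, validate, max_attempts: int = 3):
    """Retry structured-output failures WITHOUT feeding the failed
    response back into the next prompt."""
    last_error = None
    for _ in range(max_attempts):
        # Same clean prompt every attempt — never append the previous
        # reply, which may have mentioned price.
        raw = call_llm(prompt)
        try:
            return validate(raw)
        except ValueError as e:   # validate signals malformed output
            last_error = e
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")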

How to verify

The Prompt Regression Tester lets you run the same research prompt against multiple models and compare outputs. The Hallucination Detector catches fabricated numbers in the response. Neither is a substitute for the architectural separation — but both help you detect when the boundary is breached.

The true test: run your research prompt on a set of past events where you know the outcome. Compare stated probability to realized frequency via a reliability curve. The Calibration Dojo uses exactly this mechanic for generic questions; the pattern transfers directly to your own research log.
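The reliability curve itself is a short computation: bucket your stated probabilities, then compare each bucket's mean forecast to its realized frequency. A minimal sketch:

```python
def reliability_curve(forecasts: list[float], outcomes: list[int],
                      n_bins: int = 10) -> list[tuple[float, float, int]]:
    """Bucket forecasts by stated probability and compare each bucket's
    mean forecast to its realized frequency. Returns
    (mean_forecast, realized_frequency, count) per non-empty bucket."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    curve = []
    for bucket in bins:
        if bucket:
            mean_p = sum(p for p, _ in bucket) / len(bucket)
            freq = sum(o for _, o in bucket) / len(bucket)
            curve.append((mean_p, freq, len(bucket)))
    return curve
```

For a well-calibrated research log, `mean_p` and `freq` track each other in every bucket; systematic gaps (e.g. 0.8 forecasts realizing at 0.6) are the price-justification signature this whole pattern is designed to remove.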

References

  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction.
  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning (Chapter 7 on backtesting bias, particularly look-ahead and confirmation effects).
  • Kahneman, D. (2011). Thinking, Fast and Slow (on anchoring and confirmation bias).