TL;DR

An LLM-assisted research workflow that does not log its predictions, rationale, and outcomes cannot be evaluated later. A structured research diary, append-only, timestamped, and versioned, captures every idea at the moment of commitment, before the market confirms or refutes it. Twelve fields at open and three at close are enough. Implementation is a jsonlines file or a SQLite table; discipline is daily. The payoff is three post-hoc analyses that are otherwise impossible: Bayesian calibration of the forecaster, proper-scoring evaluation via Brier and log loss, and backtest-overfitting detection via PBO and Deflated Sharpe. Without the diary, every claim of edge rests on unfalsifiable memory. With it, the claim can be tested against a frozen record written before the outcome was known.

Why a diary is not a trade log

A trade log records executed actions. Fills, sizes, timestamps, realized PnL. It is an accounting artifact and it is useful for compliance and attribution. It is not a research artifact.

A research diary records every IDEA. The ones that were rejected after the rationale was written. The ones that went to paper and never to live. The ones that were cancelled before fill. The ones that became positions. Each idea is recorded at the moment of commitment, before the outcome is known, with prediction, probability, rationale, and invalidation conditions.

The distinction matters because calibration is a statement about the forecaster, not the trader. A forecaster who assigns 70 percent probability to an event must see that event happen roughly 70 percent of the time across all such calls. If only the ideas that became positions are logged, the calibration sample is censored by the trader's sizing and risk rules. The forecaster is measured on a subset chosen by a different process. The subset is biased toward ideas the trader liked enough to act on, which usually means ideas that already had price confirmation.

Selection bias in calibration data is silent and severe. A forecaster who is reliably wrong on half the rejected ideas and reliably right on the positions taken will look perfectly calibrated in a trade-log-only dataset. The trade log records only the cases where the trader's filter happened to align with the outcome. A research diary records the counterfactual: what the model said about the ideas that never became trades. Without those rows, there is no way to distinguish a good forecaster paired with a lucky filter from a genuinely good forecaster.

The same logic explains why paper trades belong in the diary. Paper ideas are the forecasts the trader found plausible enough to track but not large enough to act on. They are high-information rows because the filter was softer; the sample is less censored. Across a year of operation, the paper-idea Brier score is usually the most reliable estimate of the forecaster's true quality. The live-idea Brier is contaminated by sizing decisions that covary with the forecast.

The schema: twelve fields at open, three at close

The schema below is small enough to fill in under a minute per idea and large enough to support the three downstream analyses. Twelve fields are written at open; ts_closed, outcome, and realized_pnl_usd land at resolution. Fields are chosen for machine-readability first and human skimmability second.

CREATE TABLE diary (
    idea_id                TEXT PRIMARY KEY,           -- uuid4
    ts_opened              DATETIME NOT NULL,
    subject                TEXT NOT NULL,              -- synthetic descriptor, e.g. SYNTHETIC_A
    horizon_days           INTEGER NOT NULL,
    prediction_type        TEXT NOT NULL
        CHECK (prediction_type IN ('binary','continuous','ranking')),
    prediction_value       TEXT NOT NULL,              -- JSON: the structured forecast
    prediction_probability REAL NOT NULL,              -- calibrated probability, 0..1
    rationale              TEXT NOT NULL,              -- the model's rationale, verbatim
    model_used             TEXT NOT NULL,              -- name + version, e.g. sonnet-4.6-20260401
    prompt_hash            TEXT NOT NULL,              -- sha256 of the full prompt
    invalidation_conditions TEXT NOT NULL,             -- JSON: what would falsify the prediction
    decision               TEXT NOT NULL
        CHECK (decision IN ('pass','paper','live','cancelled')),
    ts_closed              DATETIME,                   -- filled at resolution
    outcome                TEXT,                       -- JSON: what actually happened
    realized_pnl_usd       REAL                        -- nullable; NULL for pass/paper
);

CREATE INDEX diary_ts_opened ON diary(ts_opened);
CREATE INDEX diary_model     ON diary(model_used);
CREATE INDEX diary_decision  ON diary(decision);

Field notes that matter:

  • prediction_value is structured JSON, not prose. A binary forecast is {"direction": "up"}. A continuous forecast is {"pct_change": 0.04, "ci_low": -0.01, "ci_high": 0.09}. A ranking forecast is {"order": ["SYNTHETIC_A", "SYNTHETIC_B"]}. Structured values are what the outcome row is compared against; prose rationale cannot be machine-scored.
  • prediction_probability is the single scalar that gets fed to Brier and log loss. For continuous forecasts it is the model's stated probability that the realized value falls inside the stated CI.
  • prompt_hash pins the exact prompt used. When the prompt template changes in month 7, the old ideas are still attributable to the old prompt, and cohort analysis by prompt version becomes possible.
  • invalidation_conditions is the pre-commitment that forces honesty. If the idea's thesis is "Fed dovish pivot within 30 days lifts duration," the invalidation is "a hawkish FOMC statement within the window." Writing this at open time prevents post-hoc goal-shifting when the outcome lands ambiguously.
  • realized_pnl_usd is NULL for pass and paper ideas. That is not a gap in the data; it is the data. Post-hoc analysis treats those rows as forecasts-only and scores them on probability versus outcome.

The schema is deliberately free of narrative-score, confidence-bucket, or tag fields. Those belong in a view, not a table. Derived fields are cheap to compute; denormalizing them into the diary is a trap that invites retroactive edits.
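
A concrete example of the view-not-table rule: confidence buckets as a derived view. The bucket names and boundaries below are illustrative, not part of the schema.

-- Derived confidence buckets; boundaries are illustrative
CREATE VIEW diary_confidence_buckets AS
SELECT
    idea_id,
    CASE
        WHEN prediction_probability < 0.55 THEN 'coin-flip'
        WHEN prediction_probability < 0.75 THEN 'lean'
        ELSE 'conviction'
    END AS confidence_bucket
FROM diary;

Changing the boundaries later is a one-line view edit; no diary row is touched.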

One more choice worth calling out: subject is a free-text synthetic descriptor, not a real ticker. Operators in regulated settings should keep the diary free of real ticker-to-direction pairings for reasons separate from schema hygiene; the research-auditability argument is simpler. A synthetic label lets the diary be shared, quoted, or published without redaction passes. When the diary is the evidence for a published methodology claim, that portability matters.

Discipline: the daily ritual

Two writes per idea: one at open, one at close. No third write.

Morning, at idea generation. Open the idea before checking for price confirmation. The LLM emits rationale and probability; the operator types them into the diary. The SQLite insert commits the row. ts_opened, prediction_value, prediction_probability, rationale, prompt_hash, and invalidation_conditions all land in that single write. The sequence matters: the rationale is frozen before the chart is open.
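
A minimal sketch of that morning insert; every literal below is illustrative, and decision lands here too because the column is NOT NULL.

-- Morning write: the opening row, committed before the chart is open
INSERT INTO diary (
    idea_id, ts_opened, subject, horizon_days, prediction_type,
    prediction_value, prediction_probability, rationale, model_used,
    prompt_hash, invalidation_conditions, decision
) VALUES (
    'a3f1c2d4-...', '2026-04-01T09:12:00Z', 'SYNTHETIC_A', 10, 'binary',
    '{"direction": "up"}', 0.64,
    'Verbatim model rationale, frozen at commit time.',
    'sonnet-4.6-20260401', 'sha256:...',
    '["close below the 2026-03-25 low within the window"]',
    'paper'
);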

Evening, or at the horizon close. Fill in ts_closed, outcome, and realized_pnl_usd (if applicable). The opening row is never touched.
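
The matching evening write, sketched against the same illustrative row; it touches only the resolution fields.

-- Evening write: resolution fields only; the opening fields stay frozen
UPDATE diary
SET ts_closed        = '2026-04-11T21:00:00Z',
    outcome          = '{"direction": "up", "pct_change": 0.031}',
    realized_pnl_usd = NULL              -- stays NULL for a paper idea
WHERE idea_id = 'a3f1c2d4-...';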

Append-only enforcement is a trigger. Closed ideas (those with ts_closed IS NOT NULL) reject all updates to the opening fields.

CREATE TRIGGER diary_no_rewrite
BEFORE UPDATE OF
    prediction_value, prediction_probability, rationale,
    prompt_hash, invalidation_conditions, subject,
    horizon_days, prediction_type, model_used, ts_opened
ON diary
WHEN OLD.ts_closed IS NOT NULL
BEGIN
    SELECT RAISE(ABORT, 'diary entries are append-only after close');
END;

The trigger is not about preventing malice. It is about preventing the quiet, honest impulse to "fix" a rationale that looks naive in hindsight. That impulse is what destroys calibration datasets. The trigger makes the impulse impossible.

For teams, commit the SQLite file to a git-LFS-tracked repo and require signed commits. The WAL-mode file is a reasonable git artifact as long as it is checkpointed before commit. For solo operators, a daily timestamped backup to an object store is enough. A jsonlines variant works identically: one line per idea, a second file for outcome rows, a join on idea_id at analysis time. The choice between SQLite and jsonlines is ergonomic, not structural. The schema is the same.
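
For concreteness, one illustrative idea line in the jsonlines variant; the outcome row in the second file carries only idea_id, ts_closed, outcome, and realized_pnl_usd.

{"idea_id": "a3f1c2d4-...", "ts_opened": "2026-04-01T09:12:00Z", "subject": "SYNTHETIC_A", "horizon_days": 10, "prediction_type": "binary", "prediction_value": {"direction": "up"}, "prediction_probability": 0.64, "rationale": "verbatim model rationale", "model_used": "sonnet-4.6-20260401", "prompt_hash": "sha256:...", "invalidation_conditions": ["close below the 2026-03-25 low within the window"], "decision": "paper"}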

Two process notes worth writing into the team playbook. First, no LLM output enters the diary with numeric fields left as free text: the probability is a float, the invalidation conditions are a list of structured clauses, and the prediction value passes a schema validator before insert. Second, the clock used for ts_opened is the operator's wall clock at commit time, not the model's timestamp. The diary is a log of human-in-the-loop commitments, not a log of model inference events. When fully automated agents write to the diary, the commit timestamp is the agent's decision time after any guardrails have cleared.

The three post-hoc analyses it makes possible

Calibration

Group closed ideas into probability bins (0.0 to 0.1, 0.1 to 0.2, and so on). For each bin, plot predicted probability on the x axis against observed frequency of the outcome on the y axis. A well-calibrated forecaster lands on the diagonal. A systematically overconfident forecaster sits below the diagonal; an underconfident forecaster sits above.
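
The binning is a single GROUP BY against the diary_outcome_binary view defined in the scoring section below; a sketch:

-- Calibration table: mean predicted probability vs observed frequency, decile bins
-- (a probability of exactly 1.0 lands in its own bin; acceptable for a sketch)
SELECT
    CAST(prediction_probability * 10 AS INTEGER) / 10.0 AS bin_low,
    COUNT(*)                                            AS n,
    AVG(prediction_probability)                         AS mean_predicted,
    AVG(outcome_binary)                                 AS observed_freq
FROM diary
JOIN diary_outcome_binary USING (idea_id)
WHERE ts_closed IS NOT NULL
  AND prediction_type = 'binary'
GROUP BY bin_low
ORDER BY bin_low;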

The diary is the only honest input to this plot. A Bayesian updater, of the kind described in the forthcoming D2 piece on Bayesian updating for LLM forecasts, takes the calibration curve as its likelihood function and produces posterior probability estimates that correct for the forecaster's known bias. That correction is applied to future predictions from the same model and prompt. It is a direct feedback loop from diary to inference quality.

The isotonic version of this correction is already covered in Isotonic Calibration for LLM Forecasts; the Calibration Dojo runs it on an uploaded diary export.

Proper scoring

Brier score and log loss are strictly proper in the technical sense: they are minimized in expectation only by reporting the true probability. A forecaster who shades probabilities toward 0.5 to look cautious scores worse than a forecaster who states honest probabilities. Both scores are computed across the full diary (pass, paper, live, cancelled) with the binary outcome coded 0 or 1.

-- Brier score and log loss, all closed binary ideas, by model.
-- LN is the natural log; SQLite's LOG is base-10. Both LN and POWER
-- need the built-in math functions (SQLite 3.35+).
SELECT
    model_used,
    COUNT(*)                                                          AS n,
    AVG(POWER(prediction_probability - outcome_binary, 2))            AS brier,
    AVG(
        -outcome_binary       * LN(MAX(prediction_probability, 1e-9))
        -(1 - outcome_binary) * LN(MAX(1 - prediction_probability, 1e-9))
    )                                                                 AS log_loss
FROM diary
JOIN diary_outcome_binary USING (idea_id)
WHERE ts_closed IS NOT NULL
  AND prediction_type = 'binary'
GROUP BY model_used
ORDER BY brier ASC;

The diary_outcome_binary view flattens the JSON outcome into a 0/1 column. Lower Brier and lower log loss are better. Across a year of daily forecasts, a 0.02 Brier improvement between models is both statistically meaningful and economically significant; it is the difference between a model worth keeping and a model worth retiring. The forthcoming D3 piece on Brier and log loss formalizes the comparison test; the Forecast Scoring Sandbox runs the calculation on an uploaded diary.
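
The view is not part of the core schema; a minimal sketch, assuming the outcome JSON records a direction key comparable against prediction_value:

-- Flatten the JSON outcome into a 0/1 column for binary ideas
CREATE VIEW diary_outcome_binary AS
SELECT
    idea_id,
    CASE WHEN json_extract(outcome, '$.direction') =
              json_extract(prediction_value, '$.direction')
         THEN 1 ELSE 0 END                              AS outcome_binary
FROM diary
WHERE prediction_type = 'binary'
  AND ts_closed IS NOT NULL;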

Overfitting detection

For the decision = 'live' subset, the realized-PnL stream is a returns series like any other. Deflated Sharpe Ratio (Bailey and Lopez de Prado 2014), PBO (probability of backtest overfitting; Bailey, Borwein, Lopez de Prado, and Zhu 2014), and walk-forward out-of-sample coverage are all directly computable from the PnL column.¹ The Backtest Overfitting Score and the Walk-Forward Validator accept the diary export as input.

The diary's contribution to overfitting analysis is the n_trials parameter for DSR. Every pass and paper idea counts as a trial. A trader who takes 12 positions from 300 diary entries has n_trials = 300, not 12. The Deflated Sharpe on the 12-position live stream must be computed against the 300-trial null. Did You Overfit? PBO and Deflated Sharpe walks through the calculation.
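
The count itself is one query; the only assumption is that every diary row, whatever its decision, is one trial.

-- n_trials for the Deflated Sharpe null: every diary entry counts
SELECT
    COUNT(*)                                            AS n_trials,
    SUM(CASE WHEN decision = 'live' THEN 1 ELSE 0 END)  AS n_live
FROM diary;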

Schema extensions for agent workflows

Agent pipelines generate trace logs and cost telemetry that belong alongside each idea. Three additional fields link the diary to the observability and cost layers:

ALTER TABLE diary ADD COLUMN research_trace_id TEXT;  -- fk to trace log
ALTER TABLE diary ADD COLUMN cost_usd          REAL;  -- total inference spend
ALTER TABLE diary ADD COLUMN tokens_used       INTEGER;

research_trace_id joins to the observability store (the forthcoming B1 piece on observability for LLM trading agents specifies the trace schema) so every idea can be replayed. cost_usd joins to the attribution layer from the forthcoming C4 piece on inference cost attribution per trade. With both present, a per-idea return-on-inference metric is a trivial query.

-- Cost-adjusted edge per model
SELECT
    model_used,
    SUM(realized_pnl_usd)                    AS total_pnl,
    SUM(cost_usd)                            AS total_inference_cost,
    SUM(realized_pnl_usd) - SUM(cost_usd)    AS net_pnl,
    1.0 * SUM(realized_pnl_usd) / SUM(cost_usd) AS pnl_per_inference_dollar
FROM diary
WHERE decision = 'live'
  AND ts_closed IS NOT NULL
  AND cost_usd > 0
GROUP BY model_used;

A model that wins on raw PnL but loses on pnl_per_inference_dollar is subsidized by free inference. Once the subsidy ends, the edge ends. The extended schema catches that condition before it becomes a billing surprise.

Two canonical queries

Two queries run weekly are enough to surface the most common failure modes. Both assume the extended schema with outcome flattening.

Monthly Brier by model. Tracks forecaster quality over time. A drift upward means the model is degrading on the current market regime; a drift downward means it is adapting well.

SELECT
    STRFTIME('%Y-%m', ts_closed) AS month,
    model_used,
    COUNT(*)                                                 AS n,
    AVG(POWER(prediction_probability - outcome_binary, 2))   AS brier
FROM diary
JOIN diary_outcome_binary USING (idea_id)
WHERE ts_closed IS NOT NULL
  AND prediction_type = 'binary'
GROUP BY month, model_used
ORDER BY month DESC, brier ASC;

Realized Sharpe by horizon. Tracks whether the trader's chosen horizons actually deliver risk-adjusted return. The annualization factor is SQRT(252/horizon_days): each idea occupies roughly horizon_days of trading days, so a year holds about 252/horizon_days independent bets.

-- SQLite has no STDEV aggregate; the population stdev is computed inline
SELECT
    horizon_days,
    COUNT(*)                                       AS n_trades,
    AVG(realized_pnl_usd)                          AS mean_pnl,
    AVG(realized_pnl_usd) /
        NULLIF(SQRT(AVG(realized_pnl_usd * realized_pnl_usd)
                    - AVG(realized_pnl_usd) * AVG(realized_pnl_usd)), 0) *
        SQRT(252.0 / horizon_days)                 AS sharpe_annualised
FROM diary
WHERE decision = 'live'
  AND ts_closed IS NOT NULL
  AND realized_pnl_usd IS NOT NULL
GROUP BY horizon_days
ORDER BY horizon_days;

A horizon that shows a high Sharpe with fewer than 30 trades is a candidate for the overfitting analysis, not a finding. The diary makes the distinction explicit. The Walk-Forward Validator accepts the per-horizon live stream as an uploaded CSV and reports out-of-sample coverage alongside the headline Sharpe. Sharpe without coverage is theatre; coverage without Sharpe is incomplete. The diary produces both from the same rows.

A third query is worth running monthly: the rejected-idea base rate. For every 100 diary entries, what fraction became live, paper, pass, and cancelled? If the live fraction drifts upward without a matching improvement in the live-idea Brier, the trader is relaxing the filter without evidence. That drift is usually what precedes a drawdown that the backtest never predicted, because the backtest was run on the historical filter, not the current one.
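
A sketch of that base-rate query, grouped by month so the drift is visible:

-- Decision mix by month; a rising live_frac without a falling Brier is a warning
SELECT
    STRFTIME('%Y-%m', ts_opened)                           AS month,
    COUNT(*)                                               AS n_ideas,
    AVG(CASE WHEN decision = 'live'  THEN 1.0 ELSE 0 END)  AS live_frac,
    AVG(CASE WHEN decision = 'paper' THEN 1.0 ELSE 0 END)  AS paper_frac,
    AVG(CASE WHEN decision = 'pass'  THEN 1.0 ELSE 0 END)  AS pass_frac
FROM diary
GROUP BY month
ORDER BY month DESC;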

Common mistakes

  • Mistake: writing the diary after the outcome is known.
    Consequence: hindsight bias contaminates the rationale; the calibration data is unusable.
    Fix: the rationale must commit before price is checked; ts_opened enforced by the trigger.
  • Mistake: not recording rejected ideas.
    Consequence: selection bias in calibration; the forecaster looks better than the model actually is.
    Fix: log every LLM run, including the ones that never became positions.
  • Mistake: rewriting entries after the fact.
    Consequence: the audit trail breaks; team trust erodes; post-hoc tuning becomes invisible.
    Fix: append-only trigger on closed rows; signed git commits on the SQLite file.
  • Mistake: tracking only the headline probability without a structured prediction_value.
    Consequence: the outcome cannot be machine-compared; proper scoring is impossible.
    Fix: a JSON schema for each prediction_type; reject rows that do not validate.
  • Mistake: logging in free-form Markdown notes.
    Consequence: no query surface; aggregates require re-parsing.
    Fix: SQLite from day one; prose goes in rationale, structure everywhere else.
  • Mistake: conflating the diary with the trade log.
    Consequence: the calibration sample is censored by risk rules.
    Fix: separate tables; the diary is the superset, the trade log is the decision = 'live' view.

That last mistake is the most expensive and the hardest to undo. A team that runs a trade log for a year and then decides to add calibration analysis cannot recover the rejected-idea rows. Starting with the diary schema from day one costs an extra thirty seconds per idea and pays back the first time a model needs to be replaced.

References

  • Tetlock, P. E., & Mellers, B. A. (2014). "Forecasting Tournaments: Tools for Increasing Transparency and Improving the Quality of Debate." Current Directions in Psychological Science 23(4), pp. 290-295. On the discipline of writing predictions down before outcomes are known.
  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Random House. Chapter 5 on the rituals of explicit probability commitment; Chapter 8 on the counterfactual sample.
  • Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail - But Some Don't. Penguin. Chapters on calibration as the primary virtue of a forecaster.
  • Bailey, D. H., Borwein, J., Lopez de Prado, M., & Zhu, Q. J. (2014). "Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance." Notices of the AMS 61(5), pp. 458-471. The PBO paper.
  • Gneiting, T., & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association 102(477), pp. 359-378. The formal basis for Brier and log loss as strictly proper scoring rules.

Footnotes

  1. Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management 40(5), pp. 94-107.