## TL;DR
A practitioner running an LLM-assisted research loop wants two numbers: what did this idea cost, and what did an executed trade cost once rejected ideas are accounted for. Monthly provider bill divided by trade count is not the answer, because ten research calls produce two trades and the other eight still cost real dollars. Without per-request cost attribution rolled up through the research, decision, and execution funnel, the agent looks profitable until the account renewal arrives and the retries, thinking tokens, and cache writes show up in one lump. The fix is an append-only cost-attribution schema keyed on a trace_id, two canonical SQL queries that normalize spend against validation rate, and a small amortization step for pooled costs. What follows is the schema, the queries, and a runnable attribution script.
## Cost-per-raw-request is not cost-per-trade
A research agent that examines ten candidate ideas and acts on two pays for ten research chains. If each chain costs $0.40 on average, the raw per-call figure reads $0.40, but the true cost-per-trade is $4.00 of total spend divided by two trades, or $2.00. That 5x multiplier is the inverse of the validation rate — the fraction of research work (here 2 of 10) that survives into an execution. Any dashboard that does not divide by the validation rate reports a number the P&L does not see.
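The arithmetic is short enough to sanity-check directly; a sketch with the hypothetical figures above:

```python
# Hypothetical figures from the example above: ten research chains,
# two of which survive into trades, at $0.40 average cost per chain.
n_ideas = 10
n_trades = 2
cost_per_chain = 0.40

validation_rate = n_trades / n_ideas              # 0.2
total_research_cost = n_ideas * cost_per_chain    # $4.00
cost_per_trade = total_research_cost / n_trades   # $2.00

# Equivalent formulation: per-chain cost divided by the validation rate.
assert abs(cost_per_trade - cost_per_chain / validation_rate) < 1e-9
print(f"per chain: ${cost_per_chain:.2f}, per trade: ${cost_per_trade:.2f}")
```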
Four failure modes compound the gap. First, retries: a provider returns HTTP 529 or an intermittent tool-use error, the agent re-runs the call, and both attempts bill. Second, thinking tokens: extended-thinking requests on Claude 4.x and reasoning-effort requests on GPT-5 class models charge for internal reasoning tokens the client never reads.[^1] Third, cache writes: the first call that hydrates a 120k-token system prompt into Anthropic's cache pays a 25% premium on input tokens and only amortizes across subsequent reads within the five-minute TTL.[^2] Fourth, tool-use round trips: a single idea that triggers three function calls bills four model turns, not one.
Monthly bill divided by trade count papers over all four. Per-request logging solves it, but only if the schema lets the analyst roll up from event to idea to trade. That requires a trace_id that survives across calls and a clean separation between what the request cost and whether the idea was acted on. See Token Cost Reality for LLM Trading Research for the list-price side of this; the present article covers the attribution side.
## The attribution schema
Three append-only tables carry the attribution. cost_events records every billable interaction with a model provider. trace_summary records the idea the events belong to and the outcome that idea reached. trade_execution records the order that closed the loop when the outcome was live. A trace_id joins the three.
```sql
CREATE TABLE cost_events (
    event_id            TEXT PRIMARY KEY,
    trace_id            TEXT NOT NULL,
    parent_event_id     TEXT,
    provider            TEXT NOT NULL,
    model               TEXT NOT NULL,
    model_version       TEXT NOT NULL,
    request_type        TEXT NOT NULL,  -- chat, tool_use, batch, embeddings
    tokens_input        INTEGER NOT NULL DEFAULT 0,
    tokens_output       INTEGER NOT NULL DEFAULT 0,
    tokens_cached_read  INTEGER NOT NULL DEFAULT 0,
    tokens_cached_write INTEGER NOT NULL DEFAULT 0,
    tokens_thinking     INTEGER NOT NULL DEFAULT 0,
    cost_usd            NUMERIC(12,6) NOT NULL,
    latency_ms          INTEGER,
    ts                  TIMESTAMP NOT NULL,
    idempotency_key     TEXT,
    http_status         INTEGER,
    retry_of_event_id   TEXT,
    raw_usage_json      TEXT
);

CREATE INDEX idx_cost_events_trace ON cost_events(trace_id);
CREATE INDEX idx_cost_events_ts    ON cost_events(ts);
CREATE INDEX idx_cost_events_idem  ON cost_events(idempotency_key);
CREATE INDEX idx_cost_events_model ON cost_events(provider, model);

CREATE TABLE trace_summary (
    trace_id         TEXT PRIMARY KEY,
    idea_description TEXT NOT NULL,
    universe         TEXT,
    outcome          TEXT NOT NULL
        CHECK (outcome IN ('no-action', 'paper', 'live')),
    rejection_reason TEXT,
    ts_opened        TIMESTAMP NOT NULL,
    ts_closed        TIMESTAMP,
    operator         TEXT NOT NULL  -- agent id or human id
);

CREATE INDEX idx_trace_outcome ON trace_summary(outcome);
CREATE INDEX idx_trace_opened  ON trace_summary(ts_opened);

CREATE TABLE trade_execution (
    trace_id         TEXT NOT NULL,
    order_id         TEXT PRIMARY KEY,
    venue            TEXT NOT NULL,
    symbol           TEXT NOT NULL,  -- use SYNTHETIC_A for examples
    side             TEXT NOT NULL CHECK (side IN ('buy', 'sell')),
    fills_json       TEXT NOT NULL,
    realized_pnl_usd NUMERIC(14,4),
    ts_submitted     TIMESTAMP NOT NULL,
    ts_closed        TIMESTAMP,
    FOREIGN KEY (trace_id) REFERENCES trace_summary(trace_id)
);

CREATE INDEX idx_trade_trace ON trade_execution(trace_id);
```
Three design choices deserve a note. parent_event_id and retry_of_event_id are separate on purpose: a parent event is the prior turn in a conversation, a retry is a replay of the same logical call. Both need to be visible for correct attribution. idempotency_key is populated from the client-side key sent on the request, not from the provider response — this lets the analyst detect double-billing when retries succeed late. raw_usage_json stores the full provider usage block verbatim so that a future re-pricing (when list prices change or a missed field is discovered) can be recomputed from source. The schema is append-only; corrections go in as compensating rows with a linking retry_of_event_id.
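A duplicate idempotency_key across billed events is the double-billing signal described above; a minimal sketch against an in-memory SQLite database with synthetic rows (schema trimmed to the relevant columns):

```python
import sqlite3

# Sketch: detect double-billing where two billed events share one
# client-side idempotency key (a retry that succeeded late on the
# provider side). Rows are synthetic; the table is a trimmed cost_events.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cost_events (
        event_id        TEXT PRIMARY KEY,
        trace_id        TEXT NOT NULL,
        idempotency_key TEXT,
        cost_usd        REAL NOT NULL
    )""")
rows = [
    ("e1", "t1", "idem-001", 0.42),  # original attempt, timed out client-side
    ("e2", "t1", "idem-001", 0.42),  # retry -- provider billed both
    ("e3", "t2", "idem-002", 0.10),
]
conn.executemany("INSERT INTO cost_events VALUES (?, ?, ?, ?)", rows)

dupes = conn.execute("""
    SELECT idempotency_key, COUNT(*) AS n, SUM(cost_usd) AS billed_usd
    FROM cost_events
    WHERE idempotency_key IS NOT NULL
    GROUP BY idempotency_key
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # -> [('idem-001', 2, 0.84)]
```

Each hit is a candidate for a compensating row linked via retry_of_event_id, rather than an in-place correction.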
## The two canonical queries
The two questions at the top of the article have one query each. Both assume SQLite syntax; PostgreSQL and DuckDB require only minor dialect shifts.
### Cost per idea
```sql
WITH idea_cost AS (
    SELECT
        t.trace_id,
        t.outcome,
        t.idea_description,
        SUM(e.cost_usd) AS total_cost_usd,
        SUM(e.tokens_input) AS tokens_in,
        SUM(e.tokens_output) AS tokens_out,
        SUM(e.tokens_thinking) AS tokens_think,
        COUNT(e.event_id) AS n_calls
    FROM trace_summary t
    LEFT JOIN cost_events e ON e.trace_id = t.trace_id
    WHERE t.ts_opened >= DATE('now', '-30 days')
    GROUP BY t.trace_id
)
SELECT
    outcome,
    COUNT(*) AS n_ideas,
    ROUND(AVG(total_cost_usd), 4) AS mean_usd,
    ROUND(MIN(total_cost_usd), 4) AS min_usd,
    -- approximate median via a correlated order/offset subquery;
    -- window functions are not allowed in LIMIT/OFFSET, so the group
    -- size comes from a second correlated subquery
    ROUND(
        (SELECT ic2.total_cost_usd FROM idea_cost ic2
         WHERE ic2.outcome = ic.outcome
         ORDER BY ic2.total_cost_usd
         LIMIT 1 OFFSET (SELECT COUNT(*) FROM idea_cost ic3
                         WHERE ic3.outcome = ic.outcome) / 2
        ), 4) AS p50_usd,
    ROUND(MAX(total_cost_usd), 4) AS max_usd
FROM idea_cost ic
GROUP BY outcome;
```
The query returns one row per outcome class. The no-action row is the research overhead: money spent on ideas that reached neither paper nor live. The paper and live rows show cost conditional on an idea surviving validation. For production dashboards the approximate-quantile subquery is usually replaced with DuckDB's quantile_cont or PostgreSQL's percentile_cont; on SQLite, window functions give an exact per-group median without the correlated subquery.[^3]
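For completeness, an exact per-group median in plain SQLite can use ROW_NUMBER over each outcome group; a sketch against a synthetic stand-in for the idea_cost CTE (table and data are illustrative):

```python
import sqlite3

# Sketch: exact per-group (lower) median in SQLite via ROW_NUMBER,
# as an alternative to the correlated LIMIT/OFFSET subquery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idea_cost (outcome TEXT, total_cost_usd REAL)")
conn.executemany(
    "INSERT INTO idea_cost VALUES (?, ?)",
    [("no-action", 0.1), ("no-action", 0.3), ("no-action", 0.9),
     ("live", 2.0), ("live", 4.0)],
)
median = conn.execute("""
    WITH ranked AS (
        SELECT outcome, total_cost_usd,
               ROW_NUMBER() OVER (PARTITION BY outcome
                                  ORDER BY total_cost_usd) AS rn,
               COUNT(*) OVER (PARTITION BY outcome) AS n
        FROM idea_cost
    )
    SELECT outcome, total_cost_usd AS p50_usd
    FROM ranked
    WHERE rn = (n + 1) / 2  -- integer division: lower median for even n
    ORDER BY outcome
""").fetchall()
print(median)  # -> [('live', 2.0), ('no-action', 0.3)]
```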
### Cost per validated trade
```sql
WITH totals AS (
    SELECT
        SUM(e.cost_usd) AS total_cost_usd,
        COUNT(DISTINCT t.trace_id) AS n_traces,
        COUNT(DISTINCT CASE WHEN t.outcome = 'live'
                            THEN t.trace_id END) AS n_live,
        COUNT(DISTINCT CASE WHEN t.outcome IN ('paper', 'live')
                            THEN t.trace_id END) AS n_validated
    FROM trace_summary t
    LEFT JOIN cost_events e ON e.trace_id = t.trace_id
    WHERE t.ts_opened >= DATE('now', '-30 days')
),
profit AS (
    SELECT COUNT(DISTINCT te.trace_id) AS n_profitable
    FROM trade_execution te
    WHERE te.realized_pnl_usd > 0
      AND te.ts_closed >= DATE('now', '-30 days')
)
SELECT
    total_cost_usd,
    n_traces,
    n_validated,
    n_live,
    n_profitable,
    ROUND(total_cost_usd * 1.0 / NULLIF(n_validated, 0), 4)  AS cost_per_validated_trade,
    ROUND(total_cost_usd * 1.0 / NULLIF(n_live, 0), 4)       AS cost_per_live_trade,
    ROUND(total_cost_usd * 1.0 / NULLIF(n_profitable, 0), 4) AS cost_per_profitable_trade
FROM totals, profit;
```
The denominator choice is the argument. cost_per_validated_trade uses paper + live because both represent an idea that survived the research gate; paper trades are a legitimate form of validation even when no capital was committed. cost_per_live_trade is stricter: it charges the full research budget to capital-at-risk ideas only. cost_per_profitable_trade is the most pessimistic: it assigns all spend to the subset of live trades that actually booked positive realized PnL. Reporting all three separates research overhead from decision overhead from realized economics.
A worked example. A month with $620 of model spend, 410 traces opened, 87 marked paper or live, 23 marked live, and 11 of those 23 profitable, yields $7.13 per validated trade, $26.96 per live trade, and $56.36 per profitable trade. A per-call average of $0.40 does not see any of those numbers.
| Denominator | n | Cost per unit (USD) | What it reports |
|---|---|---|---|
| Raw cost_events | 1,550 | 0.40 | Average model call price |
| All traces opened | 410 | 1.51 | Cost per idea examined |
| Paper + live (validated) | 87 | 7.13 | Cost per surviving idea |
| Live trades only | 23 | 26.96 | Cost per capital deployment |
| Profitable live trades | 11 | 56.36 | Cost per realized-PnL event |
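Each row of the table above is one division; reproducing them from the worked figures:

```python
# Reproduce the worked example: $620 of monthly model spend divided
# by each candidate denominator from the table above.
total_cost_usd = 620.0
denominators = {
    "raw cost_events": 1550,
    "all traces opened": 410,
    "validated (paper + live)": 87,
    "live trades only": 23,
    "profitable live trades": 11,
}
per_unit = {k: round(total_cost_usd / n, 2) for k, n in denominators.items()}
for k, v in per_unit.items():
    print(f"{k:>26}: ${v:.2f}")
```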
## Pooling and amortization
Three pooled cost types need explicit amortization to produce correct per-trace attribution: prompt-cache writes, thinking tokens, and batch submissions.
**Cache writes.** Anthropic's prompt caching bills the first call at a 25% premium on the cached prefix and subsequent reads at 10% of the uncached input rate, with a five-minute TTL sliding on each read.[^2] Charging the full write cost to the first trace overstates that trace's cost by everything the later reads saved. The correct treatment is to recognize the write event, track the reads that hit the same cache entry within its TTL window, and pro-rate the write cost across the write and those reads (the write itself consumed the context, so it takes a share too).
**Thinking tokens.** Extended-thinking requests charge for internal reasoning tokens at output-token rates on Claude 4.x and at reasoning-token rates on GPT-5 class models.[^1] These belong to the specific request that enabled thinking; the rule is simply that tokens_thinking multiplies by the correct unit rate and rolls into cost_usd for that event, not a pool.
**Batch API.** Anthropic and OpenAI both offer batch endpoints at 50% of synchronous pricing for 24-hour turnaround.[^1] A batch job produces a job-level invoice that must be split across the per-request line items before the events can be joined to traces. See Batch API Economics for Finance for the tradeoff analysis.
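A sketch of the invoice split, pro-rating a job-level charge across line items by token volume (field names here are hypothetical, not a provider schema; a production version would weight input and output tokens by their respective unit rates before pro-rating):

```python
# Sketch: split a job-level batch invoice across per-request line items,
# pro-rated by total token volume. Field names are illustrative only.
def split_batch_invoice(invoice_usd: float, line_items: list) -> list:
    total_tokens = sum(
        li["tokens_input"] + li["tokens_output"] for li in line_items
    )
    out = []
    for li in line_items:
        weight = (li["tokens_input"] + li["tokens_output"]) / total_tokens
        out.append({**li, "cost_usd": round(invoice_usd * weight, 6)})
    return out

items = [
    {"event_id": "b1", "trace_id": "t1", "tokens_input": 900,  "tokens_output": 100},
    {"event_id": "b2", "trace_id": "t2", "tokens_input": 2700, "tokens_output": 300},
]
split = split_batch_invoice(4.00, items)
print([(s["event_id"], s["cost_usd"]) for s in split])  # -> [('b1', 1.0), ('b2', 3.0)]
```

Once split, each line item carries its own cost_usd and joins to a trace like any synchronous event.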
A runnable amortizer for the cache-write case:
```python
from __future__ import annotations

from datetime import timedelta

import pandas as pd

CACHE_TTL = timedelta(minutes=5)


def attribute_cache_write(events: pd.DataFrame) -> pd.DataFrame:
    """Pro-rate cache-write cost over the write + all reads within the TTL.

    Input columns: event_id, trace_id, ts (datetime), provider, model,
    tokens_cached_write, tokens_cached_read, cost_usd.
    Output: events with cost_usd adjusted and a new column
    'cost_adjustment_usd' explaining the delta.
    """
    df = events.sort_values("ts").reset_index(drop=True).copy()
    df["cost_adjustment_usd"] = 0.0
    write_rows = df.index[df["tokens_cached_write"] > 0].tolist()
    for wi in write_rows:
        w = df.loc[wi].copy()  # snapshot the row before mutating the frame
        window_end = w["ts"] + CACHE_TTL
        # consumers: the write itself plus same-model reads inside the TTL
        mask = (
            (df["provider"] == w["provider"])
            & (df["model"] == w["model"])
            & (df["ts"] >= w["ts"])
            & (df["ts"] <= window_end)
            & ((df["tokens_cached_read"] > 0) | (df.index == wi))
        )
        consumers = df.index[mask].tolist()
        if len(consumers) < 2:
            continue  # nobody read the entry; the write keeps its full cost
        share = w["cost_usd"] / len(consumers)
        df.loc[wi, "cost_usd"] = share
        df.loc[wi, "cost_adjustment_usd"] = share - w["cost_usd"]
        for ci in consumers:
            if ci == wi:
                continue
            df.loc[ci, "cost_usd"] += share
            df.loc[ci, "cost_adjustment_usd"] += share
    return df
```
The function preserves total spend (sums across all rows are unchanged by construction) while moving cost from the single write event onto the reads that benefited. That matters when a single trace paid the write and many later traces benefited: without amortization, the first trace looks expensive and the later ones look free; with amortization, the load is distributed to the ideas that actually consumed the cached context.
## Dashboard output
Three tables communicate the state of the funnel. A minimal reader script:
```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("cost_attribution.db")

daily = pd.read_sql("""
    SELECT DATE(ts) AS day, provider, model,
           ROUND(SUM(cost_usd), 2) AS usd,
           SUM(tokens_input + tokens_output + tokens_thinking) AS tokens
    FROM cost_events
    WHERE ts >= DATE('now', '-30 days')
    GROUP BY day, provider, model
    ORDER BY day DESC, usd DESC
""", conn)

per_idea = pd.read_sql("""
    SELECT t.outcome,
           ROUND(AVG(x.c), 4) AS mean_usd,
           ROUND(MIN(x.c), 4) AS min_usd,
           ROUND(MAX(x.c), 4) AS max_usd,
           COUNT(*) AS n
    FROM trace_summary t
    JOIN (SELECT trace_id, SUM(cost_usd) AS c
          FROM cost_events GROUP BY trace_id) x
      ON x.trace_id = t.trace_id
    WHERE t.ts_opened >= DATE('now', '-30 days')
    GROUP BY t.outcome
""", conn)

rolling = pd.read_sql("""
    SELECT DATE(t.ts_opened) AS day,
           ROUND(SUM(e.cost_usd) /
                 NULLIF(COUNT(DISTINCT CASE WHEN t.outcome = 'live'
                                            THEN t.trace_id END), 0),
                 4) AS cost_per_live_trade
    FROM trace_summary t
    LEFT JOIN cost_events e ON e.trace_id = t.trace_id
    WHERE t.ts_opened >= DATE('now', '-30 days')
    GROUP BY day
    ORDER BY day
""", conn)

print("\n== Daily spend by provider and model ==")
print(daily.to_string(index=False))
print("\n== Cost-per-idea distribution (30d) ==")
print(per_idea.to_string(index=False))
print("\n== Cost-per-live-trade rolling ==")
print(rolling.tail(14).to_string(index=False))
```
The script runs under cron, writes static HTML via pandas' to_html, and drops the output into a dashboard that the operator checks before renewing the API plan. The schema is also compatible with the observability layer described in Observability for LLM Trading Agents: the two share the trace_id key and can be joined to answer questions about latency versus cost versus realized PnL in a single view.
## Anti-patterns
**Bill-level tracking only.** The provider dashboard shows $620 for the month. No attribution is possible because the unit is a calendar window, not an idea. Every analytical question collapses to a monthly average.

**Using the provider usage endpoint as the source of truth.** Anthropic's usage endpoint and OpenAI's usage API are useful reconciliation tools, but they double-count retries when the client retried a timed-out request that later succeeded on the provider side.[^1][^4] The client-side idempotency_key is the only signal that distinguishes a logical retry from two distinct requests. Treat provider endpoints as a reconciliation check against client-side logs, not as the primary record.

**Cost-per-request divided by idea count without outcome filtering.** This reports a number that looks like cost-per-idea but is really cost-per-call. An idea that takes twelve calls is counted twelve times in the denominator when outcome-level deduplication is missing. The trace_id join fixes this; a raw AVG over cost_events does not.

**Treating cache-write cost as free research.** Cache writes are real dollars billed at a 25% premium on input tokens. Attributing them entirely to the amortization pool with no charge to any specific trace hides real spend. The amortizer above distributes the cost; zeroing it out does not.

**Ignoring thinking tokens.** Extended-thinking and reasoning-effort requests can emit tens of thousands of tokens the client never parses. A cost model that only charges tokens_input + tokens_output under-reports Claude 4.x thinking spend by the full thinking-token column and under-reports GPT-5 class spend similarly. See Thinking Tokens for Finance Tasks for the tradeoff.

**Failing to version the model.** A model column without a model_version column collapses spend across price changes. When a provider refreshes pricing mid-month, queries grouped by model alone silently blend old and new unit economics.
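The model_version column is what makes re-pricing from raw_usage_json possible; a sketch with an illustrative versioned price table (rates and model names are placeholders, not list prices):

```python
import json

# Sketch: recompute cost_usd from the stored raw_usage_json against a
# price table keyed on (provider, model, model_version). Rates below are
# illustrative placeholders; versioning keeps spend from blending across
# a mid-month price change.
PRICES_PER_MTOK = {
    ("anthropic", "claude-x", "2026-01"): {"input": 3.00, "output": 15.00},
    ("anthropic", "claude-x", "2026-02"): {"input": 2.50, "output": 12.00},
}

def reprice(provider, model, model_version, raw_usage_json):
    usage = json.loads(raw_usage_json)
    rates = PRICES_PER_MTOK[(provider, model, model_version)]
    return round(
        usage.get("input_tokens", 0) / 1e6 * rates["input"]
        + usage.get("output_tokens", 0) / 1e6 * rates["output"],
        6,
    )

raw = json.dumps({"input_tokens": 120_000, "output_tokens": 4_000})
old = reprice("anthropic", "claude-x", "2026-01", raw)
new = reprice("anthropic", "claude-x", "2026-02", raw)
print(old, new)  # -> 0.42 0.348
```

Because raw_usage_json is stored verbatim, the same function can backfill a column that was missed at ingest time without re-calling any provider.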
## Connects to
- Observability for LLM Trading Agents — schema-compatible trace_id and event-level logging layer.
- Bounded-Cost Agentic Research — setting a per-trace spend cap that the attribution schema can enforce.
- Model Selection Framework for Finance — which model tier to use given the per-trade cost target.
- Batch API Economics for Finance — how to amortize batch-job invoices across per-request line items.
- Thinking Tokens for Finance Tasks — pricing the internal-reasoning column correctly.
- Token Cost Reality for LLM Trading Research — list-price side of the same question.
- Agent Cost Envelope Calculator — forward estimate of cost-per-trade under assumed validation rates.
- Token Cost Optimizer — per-call cost estimator that feeds the envelope.
- Batch vs Real-Time Cost Calculator — compares synchronous and batch-API spend for the same workload.
## References
- Anthropic. "Token Counting API." docs.anthropic.com/en/api/messages-count-tokens, 2026. Pre-request token estimation for budgeting.
- OpenAI. "Pricing." openai.com/api/pricing, 2026. Per-model input, output, and cached-input rates.
- Google DeepMind. "Gemini API Pricing and Context Caching." ai.google.dev/pricing, 2026.
- Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio." Journal of Portfolio Management 40(5). Cited for the adjacent question of whether the trades an attribution system measures are themselves real alpha; see also Sharpe Ratio Trap.
## Footnotes

[^1]: Anthropic. "Messages API — Usage and Billing." docs.anthropic.com, 2026. Documents input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens, and extended-thinking token accounting.

[^2]: Anthropic. "Prompt Caching." docs.anthropic.com, 2026. Five-minute TTL, 25% premium on writes, 10% rate on reads.

[^3]: SQLite. "JSON1 Extension and Window Functions." sqlite.org/json1.html and sqlite.org/windowfunctions.html, 2026.

[^4]: OpenAI. "Usage API Reference." platform.openai.com/docs/api-reference/usage, 2026.