TL;DR
An LLM-driven trade decision is a regulated act, not a software output. SEC Rule 17a-4 and FINRA Rule 4511 require broker-dealers to retain communications and decision records in immutable form for periods that range from three to six years; the EU's MiFID II Article 16(7) and the US Investment Advisers Act Rule 204-2 impose parallel retention obligations on advisers. The schema, append-only log design, and reproducibility patterns that satisfy these rules for an LLM agent are not what ships with most agent frameworks. This article walks through the minimum audit-trail schema, the WORM (write-once-read-many) storage primitive that makes the log defensible, the deterministic-replay pattern that lets a regulator re-run a flagged decision, and the four common gaps that fail an examination. Run every extracted numeric claim through the Hallucination Detector at write time, and lock down prompt drift across model upgrades with the Prompt Regression Tester.
Why the framework defaults are not enough
LangChain, LlamaIndex, and the OpenAI Agents SDK all ship a "logging" capability. The default behaviour is to emit structured logs to stdout or to an OTLP collector. That is not an audit trail. An audit trail is a record that no actor can mutate, that survives your infrastructure, that ties a specific trade to the exact prompt bundle that produced it, and that a third party can replay deterministically.
The gap is not academic. The 2024 SEC enforcement actions against three RIAs cited "AI washing" and inadequate decision records; the underlying findings were that the firms could not produce, on examination, the prompt bundle, retrieval context, and model output that led to specific recommended trades. The infrastructure existed; the records did not. The fines started at $400k and ran above $1.5M. As of 2026, FINRA's Examination Priorities Letter explicitly lists "AI-driven trade recommendations" as a focus area, and the New York DFS Cybersecurity Regulation amendments adopted in late 2024 require similar reproducibility for any automated decision system that touches a regulated activity.
The minimum bar is reproducibility. For any closed trade, you must be able to retrieve, in immutable form: the exact model version and parameters, the full prompt bundle (system prompt, user prompt, retrieved context, tool definitions), the model's complete output, the tool calls and their responses, and the chain that connects model output to order submission. If any element is missing, the chain is broken, and the trade is unverified.
The schema
A minimum-viable schema for an LLM trade audit log, expressed as fields per record. Every agent action (retrieval, inference, tool call, decision, order submission) produces one record.
| Field | Type | Notes |
|---|---|---|
| run_id | UUID | Per agent invocation. All records in a session share this. |
| step_id | sequence | Monotonic within a run. Establishes order. |
| step_type | enum | retrieval, inference, tool_call, decision, order_submission |
| timestamp_utc | ISO-8601 | Wall-clock UTC, microsecond precision. |
| actor | string | Service identifier and version, e.g. "[email protected]". |
| input_hash | SHA-256 | Hash of the canonical input (prompt bundle for inference, query for retrieval). |
| input_payload_uri | URI | Pointer to the full input in WORM storage. |
| output_hash | SHA-256 | Hash of the canonical output. |
| output_payload_uri | URI | Pointer to the full output in WORM storage. |
| model_version | string | Provider, model, version pin, e.g. "anthropic/claude-sonnet-4-7-20260315". |
| model_params | JSON | temperature, top_p, max_tokens, seed (when supported), tool definitions hash. |
| upstream_step_ids | array | Which earlier steps' outputs fed this step. Establishes the DAG. |
| trade_id | UUID | Set on order_submission and propagated back to all upstream steps via lineage. |
| signature | bytes | HMAC-SHA-256 of the record body, keyed against a hardware-secured key. |
Two design choices matter.
Hash + URI, not inline. The audit log stores hashes and pointers; the full payloads sit in object storage with WORM (write-once-read-many) semantics. This separates the integrity primitive (the immutable hash chain) from the storage primitive (cheap, queryable, retainable for years). S3 Object Lock with Compliance mode, Azure Blob Immutable Storage, or GCS Bucket Lock all give the WORM guarantee at the storage layer.
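As a sketch of what the WORM write looks like in practice, the helper below builds the arguments for an S3 `put_object` call with Compliance-mode Object Lock. The function name and bucket layout are illustrative, and the target bucket must have Object Lock enabled at creation; the `ObjectLockMode` and `ObjectLockRetainUntilDate` parameters are the real S3 API fields.

```python
from datetime import datetime, timedelta, timezone


def compliance_put_kwargs(bucket: str, key: str, body: bytes,
                          retention_years: int = 7) -> dict:
    """Build boto3 put_object kwargs for a Compliance-mode Object Lock
    write. Illustrative helper: the bucket must already have Object
    Lock enabled, and names here are hypothetical."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=365 * retention_years)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Compliance mode: no principal, including root, can shorten
        # or remove the retention before retain_until.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until,
    }
```

The call site is then `s3.put_object(**compliance_put_kwargs(...))`; the same structure maps onto the Azure and GCS immutability APIs.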
Lineage via upstream_step_ids. Every step records which earlier steps' outputs it consumed. The reconstructed graph is the deterministic-replay graph: given a trade record, you can walk backward through every retrieval, every model call, every tool response, and every decision. The graph is what proves the trade was the output of the agent, not a manual override that bypassed the agent.
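The backward walk is mechanical once `upstream_step_ids` is populated. A minimal sketch, assuming records are held in a dict keyed by `step_id`:

```python
def replay_graph(records: dict, trade_step_id: int) -> list[dict]:
    """Walk backward from an order_submission step through
    upstream_step_ids, returning every step that fed the trade,
    in execution order (step_id is monotonic within a run)."""
    seen: set[int] = set()
    stack = [trade_step_id]
    while stack:
        sid = stack.pop()
        if sid in seen:
            continue
        seen.add(sid)
        # Follow lineage edges to earlier steps.
        stack.extend(records[sid]["upstream_step_ids"])
    return [records[sid] for sid in sorted(seen)]
```

Steps outside the lineage (a parallel retrieval that fed a different decision, say) are excluded, which is exactly the property an examiner needs: the graph contains every input to the trade and nothing else.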
Append-only and WORM
Append-only is necessary but not sufficient. An append-only database table can still be dropped, restored from a tampered backup, or modified by a sufficiently privileged operator. The audit defence pattern requires storage that no in-band actor can mutate.
The standard primitives:
S3 Object Lock in Compliance mode. Once an object is written with a retention period, no IAM principal (including the root account) can delete or modify it until the period expires. SEC staff letters under 17a-4(f) have explicitly recognized S3 Object Lock as meeting the "non-rewriteable, non-erasable" requirement when configured in Compliance mode. The other major clouds offer parallel guarantees: Azure Immutable Blob Storage and GCS Bucket Lock.
Hash-chain. Each audit record's signature includes the hash of the previous record. The chain makes localized tampering detectable: you cannot edit one record without invalidating every downstream signature. This is independent of storage immutability and complements it. The chain root is published periodically (daily, hourly) to a third party: a public timestamping service like OpenTimestamps, an internal compliance-managed audit anchor, or both.
Out-of-band timestamping. The most defensible records are timestamped by a service the trading firm does not operate. RFC 3161 trusted timestamps from an accredited TSA, OpenTimestamps anchored to Bitcoin, or a third-party blockchain anchor service all work. The cost is in the noise (cents per million records); the value is that a regulator can verify the records existed at the claimed time without trusting any of your infrastructure.
The combination of Object Lock storage, hash-chain integrity, and out-of-band timestamping is the bar. Each element alone has a failure mode that one of the others closes.
Deterministic replay
A regulator's first request after flagging a trade: "show me how this decision was made." The defensible answer is to replay the agent against the audit-log inputs and produce the original output, byte for byte.
Determinism requires four conditions, in order of how often they are missed.
Model version pin. The audit log records the exact model version, including any provider-specific identifier. Anthropic publishes dated model identifiers (claude-sonnet-4-7-20260315); OpenAI and Google have parallel conventions. The replay must use the same pin. The model_version field in the schema is what makes this auditable; the operational discipline is to never auto-upgrade models in production. Document drift via the Prompt Regression Tester before any model upgrade and record the regression result in the audit log.
Sampling parameters. Temperature, top_p, top_k, max_tokens, frequency_penalty, presence_penalty. All have to match. Temperature 0 is helpful but not sufficient: provider sampling at temperature 0 is still deterministic only at the level of the same hardware, batch size, and KV-cache layout. For full byte-equality replay, providers that expose a seed parameter (OpenAI does, Anthropic does not as of April 2026) are the path of least resistance. For Anthropic, the pragmatic standard is "behaviorally equivalent within a documented tolerance" rather than byte-equal.
Tool definitions and order. A reordered tool registry produces different tokenization in the prompt and can produce different model output. The tool definitions block must be canonicalized (sorted, with stable serialization) before hashing into input_hash, and the canonical form must be the form sent to the model.
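A minimal canonicalization sketch (helper names are illustrative; the assumed convention is sort-by-name plus stable JSON serialization):

```python
import hashlib
import json


def canonical_tools(tool_defs: list[dict]) -> bytes:
    """Sort tools by name and serialize with stable key order and
    separators, so the same registry always produces the same bytes."""
    ordered = sorted(tool_defs, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode()


def tools_hash(tool_defs: list[dict]) -> str:
    return hashlib.sha256(canonical_tools(tool_defs)).hexdigest()
```

The crucial discipline: `canonical_tools(...)` is also the form sent to the model, so the hash in `input_hash` covers exactly the bytes the model saw.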
Retrieval determinism. RAG queries must return the same chunks for the same input. The hash of the index version and the embedding model must be recorded. If the corpus has been updated since the original call, replay against a snapshot of the corpus at the time of the original call, not against the current corpus. The corpus snapshot itself goes in WORM storage with the same retention period as the audit records.
When all four conditions hold, replay produces the same model output. When any one is missed, the replay differs, and the trade is unverified for compliance purposes.
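The verification step itself reduces to a hash comparison against the stored record, for example:

```python
import hashlib


def verify_replay(audit_record: dict, replayed_output: bytes) -> bool:
    """Byte-equality check: the replayed output must hash to the stored
    output_hash. A mismatch marks the trade unverified."""
    return hashlib.sha256(replayed_output).hexdigest() == audit_record["output_hash"]
```

For providers without a seed parameter, the same structure applies, but the comparison is a documented-tolerance equivalence check rather than a hash equality.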
The four common gaps
The patterns that have failed examinations, in rough order of frequency.
Gap 1: log-trade lineage broken. The agent emits a trade signal; a separate execution service submits the order. The audit log captures the agent step and the broker confirmation but not the link between them. The examiner asks whether the agent output recommended this exact ticker, side, and size, or whether the execution layer modified it. If the lineage record (upstream_step_ids on the order_submission record) does not point to the specific decision step, the answer is "we cannot tell."
Gap 2: numeric outputs not verified at write time. The model extracted a P/E ratio of 18.4 from a 10-K; the audit log records "18.4" as the output. Six months later the regulator asks for the location in the filing where that number appears. If the verification happened only as a post-hoc spot-check, the chain is broken. The fix is to run every numeric claim through a source-grounded verifier at the time of the model call, and record the verification outcome (pass / fail / fallback) in the audit log alongside the claim itself. The Hallucination Detector is the in-pipeline version of the same logic; the audit log captures whether each claim passed.
Gap 3: silent prompt drift. The system prompt was updated three months ago to fix a separate issue. The audit log records the new system prompt for new trades but the old system prompt for trades from before the update. The historical trades are reproducible, but the cross-time comparison (whether the strategy's behaviour changed in March) becomes a forensics exercise. The fix is to version the system prompt with a content hash, store every version in WORM, and reference the version in every audit record. Then a request to retrieve the system prompt as of June 12 is a one-line query.
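Content-addressed versioning needs no registry beyond the hash itself. A minimal sketch (the 12-hex-character truncation is an assumed convention, not from any standard):

```python
import hashlib


def prompt_version(system_prompt: str) -> str:
    """Content-addressed prompt version: a truncated SHA-256 of the
    UTF-8 text. Any edit, including whitespace, yields a new version."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]
```

Every audit record then carries `prompt_version(...)` alongside the WORM URI of the full prompt text, and "the system prompt as of June 12" is a lookup by version, not a forensics exercise.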
Gap 4: model auto-upgrade. The provider deprecated a model version and silently routed traffic to the successor. The audit log records the requested model name, but the actual served version is different. Anthropic and OpenAI both publish a deprecation policy with notification windows; OpenAI's auto-routing for "gpt-5" (without a version pin) was the source of multiple compliance issues in 2025. The fix is to pin the dated version (claude-sonnet-4-7-20260315 instead of claude-sonnet) and to fail the call if the requested pin is no longer available; better to error visibly than to log the wrong version.
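The fail-fast check is a one-line comparison against the served version reported in the provider's response metadata (the metadata field name varies by provider; the exception class here is illustrative):

```python
class ModelPinMismatch(RuntimeError):
    """Raised when the provider served a different model version than
    the dated pin that was requested."""


def enforce_pin(requested: str, served: str) -> None:
    """Fail the call visibly rather than log the wrong model version."""
    if served != requested:
        raise ModelPinMismatch(f"requested {requested!r}, provider served {served!r}")
```

The call errors out before any audit record is written, so the log never contains a record whose `model_version` field is a lie.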
A compliance-aware logging layer
A Python sketch of the minimum logging primitive every agent step calls. The structure transfers cleanly across LangChain, the OpenAI Agents SDK, or a from-scratch agent framework.
```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    run_id: str
    step_id: int
    step_type: str
    timestamp_utc: str
    actor: str
    input_hash: str
    input_payload_uri: str
    output_hash: str
    output_payload_uri: str
    model_version: str | None
    model_params: dict | None
    upstream_step_ids: list[int]
    trade_id: str | None
    prev_record_hash: str


def write_audit(
    run_id: str,
    step_id: int,
    step_type: str,
    actor: str,
    input_payload: bytes,
    output_payload: bytes,
    model_version: str | None,
    model_params: dict | None,
    upstream_step_ids: list[int],
    trade_id: str | None,
    prev_record_hash: str,
    worm_storage,
    audit_log,
    hmac_key: bytes,
) -> AuditRecord:
    input_hash = hashlib.sha256(input_payload).hexdigest()
    output_hash = hashlib.sha256(output_payload).hexdigest()
    # WORM uploads happen before the log append: a failed upload
    # fails the whole call, so no record points at a missing payload.
    input_uri = worm_storage.put_immutable(input_payload, retention_years=7)
    output_uri = worm_storage.put_immutable(output_payload, retention_years=7)
    record = AuditRecord(
        run_id=run_id,
        step_id=step_id,
        step_type=step_type,
        # datetime.strftime (not time.strftime) so %f microseconds work
        timestamp_utc=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        actor=actor,
        input_hash=input_hash,
        input_payload_uri=input_uri,
        output_hash=output_hash,
        output_payload_uri=output_uri,
        model_version=model_version,
        model_params=model_params,
        upstream_step_ids=upstream_step_ids,
        trade_id=trade_id,
        prev_record_hash=prev_record_hash,
    )
    # Canonical serialization: sort_keys makes the signature stable
    # across implementations.
    body = json.dumps(asdict(record), sort_keys=True).encode()
    signature = hmac.new(hmac_key, body, hashlib.sha256).hexdigest()
    audit_log.append({"body": body.decode(), "signature": signature})
    return record
```
The write_audit function is the only path by which any agent step records evidence. The design choices the code embeds: the WORM uploads happen before the log append (so a failed upload fails the call, and no record ever points at a missing payload), the previous-record hash is passed in (chain integrity is enforced by the caller), and the HMAC signature is over the canonical JSON serialization (sort_keys=True) so the signature is stable across implementations.
The retention parameter is set to seven years to cover the strictest of the relevant regulations: SEC 17a-4 requires three to six years depending on record type; MiFID II Article 16(7) requires five years; the seven-year ceiling is the practical safe default for a multi-jurisdiction firm.
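The verification counterpart to write_audit is symmetric: recompute the HMAC over the stored body and compare in constant time. A minimal sketch, assuming audit-log entries shaped as write_audit appends them:

```python
import hashlib
import hmac


def verify_record(entry: dict, hmac_key: bytes) -> bool:
    """Recompute the HMAC over the stored canonical body and compare
    in constant time. False means the entry was altered after signing,
    or was signed with a different key."""
    expected = hmac.new(hmac_key, entry["body"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry["signature"])
```

Running verify_record over the full log, together with the prev_record_hash chain walk, is the routine a compliance team should be able to execute on demand during an examination.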
Connects to
- Hallucination Detector is the verification primitive that runs at write time on every numeric claim; the audit log records its verdict.
- Prompt Regression Tester catches prompt drift across model upgrades before the upgrade lands in production; the regression report is itself an audit-log artifact.
- The 5 Failure Modes of LLM Trading Agents covers the audit-trail-amnesia failure mode at the architecture level.
- Inference Cost Attribution per Trade extends the schema with the cost-attribution fields you want in the audit log if you also care about per-decision economics.
References
- SEC. "Rule 17a-4: Records to Be Preserved by Certain Exchange Members, Brokers, and Dealers." 17 CFR § 240.17a-4. Defines the retention periods (three years primary, six years for some communications) and the non-rewriteable, non-erasable storage requirement that S3 Object Lock in Compliance mode is recognized as meeting.
- FINRA. "Rule 4511: General Requirements" and "Rule 4512: Customer Account Information." finra.org/rules-guidance/rulebooks/finra-rules. Establishes the books-and-records standard parallel to 17a-4 for broker-dealers.
- SEC Investment Advisers Act. "Rule 204-2: Books and Records to Be Maintained by Investment Advisers." 17 CFR § 275.204-2. The advisor-side parallel; explicit on retention of records that "form the basis for an investment recommendation."
- European Securities and Markets Authority. "MiFID II: Directive 2014/65/EU, Article 16(7)." Documents the five-year retention requirement for telephone and electronic communications related to investment services.
- SEC. "No-Action Letter to DTCC re: Object Lock." 2018. The interpretive letter that established S3 Object Lock with Compliance retention as meeting 17a-4(f) for broker-dealer record retention. The reasoning transfers to Azure Immutable Blob Storage and GCS Bucket Lock by the same logic.
- NIST. "SP 800-92: Guide to Computer Security Log Management." Section 4.4 covers the integrity primitives (hash chains, out-of-band timestamping) that complement WORM storage.