Audit defensibility for LLM-driven finance decisions rests on six explicit layers: source-of-truth attestation, prompt versioning, model-output recording, deterministic post-processing, structured-schema validation, and tamper-evident logging. MiCA Article 32 mandates record-keeping for crypto-asset service providers; SEC Books and Records Rule 17a-4 and FINRA Rule 4511 mandate equivalent records for US securities. Collapsing any one of the six layers leaves an unrecoverable gap in the audit trail, the kind of gap that BaFin's MaRisk module AT 9 and the SEC's recent enforcement actions specifically target. The methodology below maps each layer to its regulatory anchor and to the failure mode that motivates it.
TL;DR
Six load-bearing layers, each anchored to a specific regulation:
| Layer | What it captures | Regulatory anchor | Failure mode if collapsed |
|---|---|---|---|
| 1. Source attestation | Hash + URL + retrieval timestamp of every external input | MiCA Art. 32; SEC 17a-4 | Cannot prove which version of the source the decision used |
| 2. Prompt versioning | Git-committed prompts with semver tags | MiCA Art. 36; FINRA 4511 | Cannot reproduce the LLM's reasoning trace |
| 3. Model output recording | Raw response + tokens consumed + tool calls | Federal Reserve SR 11-7 | Cannot demonstrate the model was queried at all |
| 4. Deterministic post-processing | Pure-function reduction from output → decision | MiFID II Art. 17 algo trading | Cannot recreate the decision from the model output |
| 5. Schema validation | Strict-mode structural check on the final payload | SEC 17a-4 immediate-retrievability | Cannot prove the decision was well-formed |
| 6. Tamper-evident logging | Append-only logs with cryptographic hash chain | BaFin MaRisk AT 9 | Cannot prove the logs are intact |
Skipping any layer leaves a gap that cannot be bridged when a regulator asks for a decision to be reconstructed.
Why six layers, not three
The naïve audit pipeline has three layers: input, model, output. The regimes and regulators with audit obligations (MiFID II, MiCA, SEC, FINRA, BaFin) consistently surface failures that the naïve pipeline cannot recover from:
- A pre-trade research note used a 10-K filing fetched at 09:14 UTC; the same 10-K was amended later that day. Which version was used? The naïve pipeline records the URL but not the content hash; the amendment is undetectable.
- The LLM was prompted with a system message that included a position-size cap of 1%. A week later the cap was changed to 2%. The decision in question used the 1% cap. Which prompt version was active when the decision was made? The naïve pipeline records the model output but not the prompt.
- The model output included three candidate trades; the post-processing logic selected one. The selection logic was a probabilistic ranker that has since been re-fit. How was the decision reached? The naïve pipeline records the final selection but not the ranker.
The six-layer methodology specifically closes the three additional gaps: prompt versioning (layer 2), deterministic post-processing recording (layer 4), and a logging architecture that prevents retroactive editing (layer 6). Each is necessary because regulators do not accept "we have logs that show the answer" as evidence; they require "we have logs that show the answer was reachable from inputs we can reproduce."
Layer 1: source-of-truth attestation
Every external input the LLM consumes (market data, SEC filings, news articles, earnings transcripts) needs a hash, URL, and retrieval timestamp recorded in the audit log. The hash is the load-bearing element: it lets the auditor verify that the input the LLM saw is the same input the auditor can re-fetch. More commonly, the source now serves a different version, and the original hash is the only evidence of what was used.
MiCA Article 32 paragraph 3 requires CASPs to "maintain records of all transactions, services and activities undertaken for at least five years" [1]. The same record-keeping obligation reaches the inputs to the decision, not just the decision itself. The EU's broader MiFID II Article 16 paragraph 7 extends the record retention to algorithmic trading inputs explicitly [2].
Implementation: for every input the LLM consumes, store the tuple (SHA-256, URL, fetch_timestamp_UTC, retrieval_method). Compute the hash before the input is passed to the LLM. The fetch is a single side-effecting operation that cannot be repeated with a guaranteed identical result; the hash is the only proof of what it returned.
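A minimal sketch of the attestation tuple, assuming the input has already been fetched as raw bytes. The `SourceAttestation` class, the `attest_source` helper, and the example URL are hypothetical names used for illustration only:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceAttestation:
    sha256: str
    url: str
    fetch_timestamp_utc: str
    retrieval_method: str


def attest_source(content: bytes, url: str, retrieval_method: str) -> SourceAttestation:
    """Hash the fetched bytes before they are handed to the LLM."""
    return SourceAttestation(
        sha256=hashlib.sha256(content).hexdigest(),
        url=url,
        fetch_timestamp_utc=datetime.now(timezone.utc).isoformat(),
        retrieval_method=retrieval_method,
    )


# Hypothetical filing bytes and URL, for illustration only.
filing = b"example 10-K text as fetched"
record = attest_source(filing, "https://example.com/filings/10-k", "http_get")

# Same URL, amended content: only the hash reveals the difference.
amended = attest_source(filing + b" (amended)", record.url, "http_get")
```

Hashing the exact bytes, rather than a parsed or normalised form, keeps the record independent of whatever parsing happens later in the pipeline.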
Layer 2: prompt versioning
The prompt is the LLM's instruction set. Without prompt versioning, the auditor cannot reconstruct what the LLM was asked to do. Git-committed prompts with semver tags are the minimum; the prompt hash should be recorded alongside the model output.
A workflow that edits prompts in-place between calls without committing each version has destroyed the audit trail. The SEC's 17a-4(f) requirement that records be "preserved in a non-rewriteable, non-erasable format" applies here: a Word document of prompt drafts is not a defensible record [3].
FINRA Rule 4511 paragraph (a) requires members to "make and preserve books and records... in such manner and in such form as may be prescribed by FINRA rules" [4]. The form has to be reproducible; an evolving prompt with no version history is not.
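One way to record a prompt hash alongside the model output is sketched below. It assumes the prompt text is read from a git-tagged file; `PromptVersion`, `fingerprint_prompt`, and the prompt strings are hypothetical and only mirror the position-size-cap example above:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    semver_tag: str     # the git tag the prompt was committed under
    prompt_sha256: str  # hash of the exact prompt text sent to the model


def fingerprint_prompt(prompt_text: str, semver_tag: str) -> PromptVersion:
    """Bind the human-readable version tag to the exact prompt bytes."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return PromptVersion(semver_tag=semver_tag, prompt_sha256=digest)


# Hypothetical prompts: the cap changes from 1% to 2% between versions.
v1 = fingerprint_prompt("Position-size cap: 1% of portfolio.", "v1.0.0")
v2 = fingerprint_prompt("Position-size cap: 2% of portfolio.", "v1.1.0")
```

Storing the hash as well as the tag guards against a tag that was moved or a prompt that was edited after tagging: the tag says which version was intended, the hash says which text was actually sent.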
Layer 3: model output recording
Record together: the raw model response, the tool calls made during the response, the tokens consumed (input/output/cache), the model identifier (including version pin), and the timestamps at the start and end of the call.
Federal Reserve Supervisory Letter SR 11-7 ("Guidance on Model Risk Management") explicitly requires that "model output and inputs are well-documented and reviewable" [5]. The MRM framework predates LLMs but its principles apply: any model used for a regulated decision must be traceable input-to-output.
The token-count fields matter for cost audit (which the regulator may not care about) and for verifying that the call actually happened (which the regulator does care about; a missing token count is suspect).
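The fields listed in this section could be captured in a single immutable record, sketched below. The class name, field names, model identifier, and values are all hypothetical; a real pipeline would populate them from whatever its vendor SDK returns:

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelCallRecord:
    model_id: str          # exact identifier, including the version pin
    prompt_sha256: str     # ties the call back to layer 2
    raw_response: str      # stored verbatim, before any parsing
    tool_calls: tuple
    input_tokens: int
    output_tokens: int
    cache_tokens: int
    started_at_utc: str
    finished_at_utc: str


def to_log_line(record: ModelCallRecord) -> str:
    """Serialise with sorted keys so the same record always yields the same line."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))


# Hypothetical values, for illustration only.
call = ModelCallRecord(
    model_id="example-model-2026-01-01",
    prompt_sha256="0" * 64,
    raw_response='{"candidates": []}',
    tool_calls=(),
    input_tokens=1200,
    output_tokens=85,
    cache_tokens=0,
    started_at_utc="2026-01-05T09:14:00+00:00",
    finished_at_utc="2026-01-05T09:14:03+00:00",
)
line = to_log_line(call)
```

A stable serialisation matters because the log line is what later gets hashed; two serialisations of the same record must not produce different hashes.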
Layer 4: deterministic post-processing
The post-processing step transforms the model's free-form output into the structured decision. Most LLM-driven pipelines run several post-processing layers: parse JSON, validate schema, apply filters, select among candidates, compute position sizes, route to broker.
Each step must be a pure function with no hidden state. The post-processing script should be committed to git with the same semver tagging as the prompts. The audit trail records the script hash alongside the model output; the auditor can re-run the deterministic post-processing on the original output and verify the same decision emerges.
MiFID II Article 17 specifically requires that algorithmic trading systems be "tested and authorised" before deployment [2]. A non-deterministic post-processing layer cannot be tested-and-authorised in the regulatory sense; the deterministic constraint is the only way to satisfy the requirement.
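The replay property can be sketched as a pure selection function plus a check that re-runs it against the logged output. The candidate format, the `select_trade` and `replay_matches` names, and the ranking rule are assumptions made for this example, not a prescribed selection method:

```python
import json


def select_trade(raw_response: str, max_position_pct: float) -> dict:
    """Pure function: the same raw output and cap always yield the same decision."""
    candidates = json.loads(raw_response)["candidates"]
    eligible = [c for c in candidates if c["position_pct"] <= max_position_pct]
    if not eligible:
        return {"action": "no_trade"}
    # Deterministic ranking: highest score first, ticker as tie-breaker.
    best = sorted(eligible, key=lambda c: (-c["score"], c["ticker"]))[0]
    return {
        "action": "trade",
        "ticker": best["ticker"],
        "position_pct": best["position_pct"],
    }


def replay_matches(raw_response: str, max_position_pct: float, logged: dict) -> bool:
    """Auditor's check: re-run post-processing and compare with the logged decision."""
    return select_trade(raw_response, max_position_pct) == logged


# Hypothetical model output with three candidate trades.
raw = json.dumps({"candidates": [
    {"ticker": "AAA", "score": 0.71, "position_pct": 0.8},
    {"ticker": "BBB", "score": 0.90, "position_pct": 1.5},
    {"ticker": "CCC", "score": 0.71, "position_pct": 1.0},
]})
decision = select_trade(raw, max_position_pct=1.0)
```

The explicit tie-breaker is the detail that is easy to miss: without it, two candidates with equal scores would be ordered by their position in the model output, which is not something the pipeline controls.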
Layer 5: schema validation
The final decision payload (trade instruction, research note, risk assessment) must pass a strict-mode schema validator before it is acted on. The Structured Schema Validator (Finance) is one such implementation; the Hallucination Detector complements it on the content side.
The schema is the contract between the LLM-driven pipeline and the downstream execution layer. SEC 17a-4(b)(4) requires records be "immediately retrievable", and the schema is what makes them so: a schema-validated payload is a queryable JSON object, not a free-form text blob.
The strict-mode rule set is mandatory here. Lenient-mode validators that accept partial payloads with defaults defeat the audit purpose: the auditor cannot tell which fields were the LLM's intent vs the validator's default.
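The difference between strict and lenient validation can be illustrated with a small standard-library check. The field set below is a hypothetical trade-instruction schema, and `validate_strict` is an illustrative helper, not the API of the validator named above:

```python
# Hypothetical schema for a trade instruction: field name -> required type.
REQUIRED_FIELDS = {"ticker": str, "side": str, "quantity": int, "limit_price": float}


def validate_strict(payload: dict) -> list:
    """Return a list of violations; an empty list means the payload is well-formed."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            # Strict mode: a missing field is an error, never filled with a default.
            errors.append(f"missing field: {name}")
        elif type(payload[name]) is not expected:
            errors.append(f"wrong type for {name}: {type(payload[name]).__name__}")
    # Strict mode: fields outside the schema are rejected, not silently dropped.
    for name in sorted(set(payload) - set(REQUIRED_FIELDS)):
        errors.append(f"unexpected field: {name}")
    side = payload.get("side")
    if isinstance(side, str) and side not in ("buy", "sell"):
        errors.append(f"invalid side: {side}")
    return errors


ok = validate_strict(
    {"ticker": "AAA", "side": "buy", "quantity": 10, "limit_price": 101.5}
)
bad = validate_strict(
    {"ticker": "AAA", "side": "buy", "quantity": "10", "note": "x"}
)
```

Here `ok` is an empty list, while `bad` reports the string-typed quantity, the missing limit price, and the extra `note` field. A lenient validator would typically coerce or default all three, which is exactly the ambiguity the strict rule set removes.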
Layer 6: tamper-evident logging
Append-only logs with a cryptographic hash chain (each log entry's hash includes the previous entry's hash) prevent retroactive editing. The simplest implementation: SQLite with a hash-chain trigger; the more defensible: a write-once-read-many (WORM) storage layer with cryptographic attestation.
BaFin's MaRisk module AT 9 (outsourcing) and AT 7.2 (IT systems) require "complete, comprehensible and consistent" audit trails [6]. The German regulator has specifically signalled that LLM-driven trading decisions fall under these requirements; the relevant guidance is forthcoming as of 2026.
A logging architecture that allows retroactive edits (e.g., a regular database that lets administrators run UPDATE statements on the audit-log table) is not defensible. The hash chain is the cheap defence; the WORM storage is the institutional-grade defence.
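The hash-chain idea can be sketched in a few lines. This example keeps the chain in an in-memory list purely for illustration; the entry layout, the `append_entry` and `verify_chain` helpers, and the all-zero genesis value are assumptions, and a real deployment would persist entries to append-only storage:

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def _entry_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append an entry whose hash covers both its payload and the previous hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "entry_hash": _entry_hash(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> int:
    """Return the index of the first broken entry, or -1 if the chain is intact."""
    prev_hash = GENESIS_HASH
    for i, entry in enumerate(chain):
        expected = _entry_hash(prev_hash, entry["payload"])
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return i
        prev_hash = entry["entry_hash"]
    return -1


log = []
append_entry(log, {"event": "source_attested", "sha256": "ab" * 32})
append_entry(log, {"event": "decision", "ticker": "AAA"})
append_entry(log, {"event": "order_routed", "ticker": "AAA"})
```

Editing the payload of any entry after the fact makes `verify_chain` return that entry's index. The chain detects edits but does not prevent them: someone able to rewrite the whole log can recompute every hash, which is why the latest hash is usually also stored somewhere the log's administrators cannot modify.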
The Constitutional AI parallel
Anthropic's Constitutional AI methodology [7] documents the difficulty of getting LLMs to follow structured rules. The same difficulty applies to audit-trail design: an LLM that "knows" it should declare its sources often fails to do so under pressure. The six layers exist because the LLM's compliance with audit requirements is unreliable; the architecture has to be defensive.
The Anthropic responsible-scaling-policy publication [8] documents how safety-relevant features are tested before deployment. The audit-trail analog is to test the six layers before going live: simulate a regulatory request, attempt to reconstruct a historical decision, identify the layer that breaks first. The break-point is the audit gap.
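A reconstruction drill of this kind could start from something as simple as the check below, which walks a stored decision record in layer order and reports the first layer with no evidence. The record layout and field names are hypothetical; a fuller drill would also re-verify each hash and replay the post-processing rather than only checking that the fields exist:

```python
# Hypothetical mapping from each layer to the field that evidences it.
LAYER_FIELDS = [
    ("1. source attestation", "source_sha256"),
    ("2. prompt versioning", "prompt_sha256"),
    ("3. model output recording", "raw_response"),
    ("4. deterministic post-processing", "postprocess_script_sha256"),
    ("5. schema validation", "schema_validated"),
    ("6. tamper-evident logging", "entry_hash"),
]


def first_broken_layer(record: dict):
    """Return the first layer whose evidence is missing or empty, else None."""
    for layer, field_name in LAYER_FIELDS:
        if not record.get(field_name):
            return layer
    return None


# Hypothetical historical record that never stored the post-processing hash.
historical = {
    "source_sha256": "ab" * 32,
    "prompt_sha256": "cd" * 32,
    "raw_response": '{"candidates": []}',
    "schema_validated": True,
    "entry_hash": "ef" * 32,
}
gap = first_broken_layer(historical)
```

For this record, `gap` names layer 4: the decision can be read from the log, but the code that produced it cannot be identified.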
When the six layers are overkill
For a retail solo trader running paper trades on no client capital, audit defensibility is not the binding constraint. The six layers are designed for environments where the trade has external consequences: client capital, regulatory reporting, public disclosure.
For a retail solo on personal capital, layers 1, 2, 3, and 6 are still the minimum (source attestation, prompt versioning, output recording, append-only logs). Layers 4 and 5 (deterministic post-processing, schema validation) are optional but recommended — they prevent silent failure even in unregulated contexts.
For institutional contexts (RIA, hedge fund, broker-dealer), all six layers are mandatory. The cost of building them once is small relative to the cost of an SEC enforcement action that finds an audit-trail gap; the SEC's recent enforcement priorities specifically include AI-driven decision-making [9].
Connects to
- Compliance Audit Trails for LLM Trades — the implementation guide for the six layers.
- BaFin + EU Guide for Retail AI Traders — the EU-specific regulatory framework.
- Research Diary Schema Auditable — the schema-level pattern for the research-output portion.
- Hallucination Detector — content-level audit gate.
- Structured Schema Validator (Finance) — shape-level audit gate.
- FTC vs NLT Regulatory Cost — compliance-cost calculator.
References
1. European Parliament. (2023). "Regulation (EU) 2023/1114 on markets in crypto-assets (MiCA)." Official Journal of the European Union. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32023R1114
2. European Parliament. (2014). "Directive 2014/65/EU on markets in financial instruments (MiFID II)." Article 17. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32014L0065
3. SEC. "Books and Records Requirements." 17 CFR 240.17a-4. https://www.sec.gov/divisions/marketreg/mrnote.htm
4. FINRA. "Rule 4511. General Requirements." https://www.finra.org/rules-guidance/rulebooks/finra-rules/4511
5. Board of Governors of the Federal Reserve System. (2011). "Guidance on Model Risk Management." SR Letter 11-7. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107.htm
6. BaFin. "MaRisk — Minimum Requirements for Risk Management." Modules AT 7.2 and AT 9. (Available via BaFin's publications archive; consult the current consolidated text for the operative paragraph numbers.)
7. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic Research. https://arxiv.org/abs/2212.08073
8. Anthropic. (2023, updated 2024-2026). "Anthropic's Responsible Scaling Policy." https://www.anthropic.com/news/anthropics-responsible-scaling-policy
9. SEC. "Examinations of Investment Advisers." sec.gov, accessed 2026-05-21. Reference for current enforcement priorities on AI-driven advisory.
Frequently asked questions
- Do I need all six layers if I'm running on personal capital with no clients?
- Layers 1, 2, 3, and 6 are the minimum — they protect against future 'what was I thinking' questions. Layers 4 and 5 add silent-failure defence; recommended but not load-bearing for retail solo.
- Can I use a cloud logging service for layer 6, or do I need on-prem WORM storage?
- Cloud logging with object-lock retention is acceptable for retail and small-institutional contexts. WORM storage is the institutional default. The hash-chain pattern works on either.
- How does this interact with MiCA's specific rules for crypto-asset service providers?
- MiCA Article 32 has the same record-retention shape as MiFID II for traditional securities. The six layers apply identically; the difference is in which inputs are subject to recording.
- What about model versioning when the vendor updates the model under me?
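- Record the exact model identifier in layer 3. The hash chain in layer 6 cannot detect a silent vendor-side change by itself; what it does is keep the recorded outputs intact, so a shift in behaviour under an unchanged identifier can be evidenced by comparing the recorded outputs from before and after the change.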
- Are these layers sufficient for SEC enforcement defense?
- No. They are the technical foundation: necessary but not sufficient. Legal review by a securities attorney is mandatory for any pipeline touching client capital.