Federal Reserve Supervisory Letter SR 11-7 ("Guidance on Model Risk Management") was written in 2011 for the quantitative models that banks rely on. Its three-lines-of-defence framework (model development with documentation, independent model validation, and internal audit) maps cleanly onto a solo LLM research workflow. Ignoring the framework re-invents the failures the guidance was designed to prevent. The methodology below shows how a single-operator workflow implements each line of defence at the scale and rigour appropriate for retail LLM-driven research.
TL;DR
SR 11-7's three lines of defence, adapted for a single-operator LLM workflow:
| Line | Bank version | Solo LLM workflow version |
|---|---|---|
| 1st (development) | Quant team with documentation standards | Git-versioned prompts + reproducible eval harness |
| 2nd (independent validation) | Separate Model Risk team with veto power | Cross-model agreement check + offline eval against held-out set |
| 3rd (internal audit) | Internal Audit + Board reporting | Monthly self-audit + machine-checkable invariants |
The solo workflow has no separate person for the 2nd line. The cross-model agreement check (run a second LLM family on the same prompt and compare outputs) is the cheap surrogate. The 3rd line uses automated invariants that fail loudly on regression.
Why SR 11-7 still matters in 2026
The Federal Reserve issued SR 11-7 in response to 2008-era model failures: Value-at-Risk models that underestimated tail risk, credit-default-swap pricing models that didn't reflect counterparty correlation, securitization models that ignored regime change [1]. Each failure shared a structural pattern: the model was developed in isolation, validated weakly (or by the same team), and audited rarely.
LLM-driven research loops repeat the pattern at a smaller scale. A solo researcher writes a prompt, tests it on a few examples, deploys it. The prompt is "validated" by the same person who wrote it. No automated regression catches drift. When the LLM provider updates the model under the same identifier (or when the prompt's edge case starts firing), the failure goes undetected until the P&L surfaces it.
The SR 11-7 framework is the structural defence. It does not require a bank-scale organisation; it requires architectural separation between building the model and validating it. For a solo operator, the separation is between yesterday's code and today's review: the same person plays both roles at different times with different tools.
1st line of defence: development with documentation
SR 11-7 requires that model documentation be thorough enough for someone unfamiliar with the model to evaluate and use it (per the guidance's published text) [1]. For the LLM workflow:
- Git-versioned prompts with commit messages that explain why each change was made. Not "tweak prompt"; instead "raise temperature from 0 to 0.2 to break stuck-in-template-output regression observed on AAPL Q3 transcript."
- Reproducible eval harness that runs the prompt on a fixed test set and produces a pass/fail verdict. The test set is the canonical examples plus the regression cases that motivated previous prompt changes.
- Model identifier pinning in the eval harness. "claude-sonnet-4-6" without a date suffix is insufficient; pin to a dated snapshot where the vendor supports it.
The eval harness is the load-bearing artefact. A prompt change that breaks the eval harness should not deploy until either the change is reverted or the eval expectation is updated (with documentation of why). The harness is the development-side discipline that prevents silent regression.
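As a minimal sketch of the harness described above, assuming exact-match scoring and a model call wrapped in a plain function: `EvalCase`, `run_eval`, `stub_model`, and the identifier `example-model-2025-01-01` are all hypothetical names introduced for illustration, not part of any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical identifier: replace with the dated snapshot your vendor publishes.
PINNED_MODEL_ID = "example-model-2025-01-01"


@dataclass(frozen=True)
class EvalCase:
    name: str
    input_text: str
    expected: str
    is_regression: bool = False  # True if the case was added after a past failure


def run_eval(
    cases: Sequence[EvalCase],
    model_fn: Callable[[str, str], str],
    model_id: str = PINNED_MODEL_ID,
    min_pass_rate: float = 1.0,
) -> dict:
    """Run every case against one pinned model and return a pass/fail verdict."""
    if not cases:
        raise ValueError("eval set is empty; nothing to verify")
    failures = [
        case.name
        for case in cases
        if model_fn(model_id, case.input_text).strip() != case.expected.strip()
    ]
    pass_rate = 1 - len(failures) / len(cases)
    return {
        "model_id": model_id,  # recorded so the report names the exact snapshot
        "pass_rate": pass_rate,
        "failures": failures,
        "verdict": "pass" if pass_rate >= min_pass_rate else "fail",
    }


def stub_model(model_id: str, text: str) -> str:
    # Stand-in for a real API call so the sketch runs offline.
    return "HOLD" if "guidance unchanged" in text else "REVIEW"


cases = [
    EvalCase("canonical-hold", "Q3 call: guidance unchanged.", "HOLD"),
    EvalCase("regression-restated", "Q3 call: prior-year revenue restated.",
             "REVIEW", is_regression=True),
]
report = run_eval(cases, stub_model)
print(report["verdict"], report["failures"])
```

Exact-match scoring only suits outputs with a small closed vocabulary; free-text outputs would need a different comparison, which this sketch does not attempt. Wiring the verdict into a pre-deploy check is what turns the harness into a gate rather than a report.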
2nd line of defence: independent validation
SR 11-7 specifies that "validation should be conducted by staff with appropriate incentives, competence, and influence" [1], meaning staff explicitly different from the model developers. For a solo operator this is literally impossible, so the substitute is automated independence.
Three automated substitutes:
- Cross-model agreement. Run the same prompt through a second LLM family (e.g., Sonnet 4.6 and Gemini 2.5 Pro). If the outputs disagree substantively, flag for human review. Disagreement is a signal of fragility; agreement is a weak but non-zero signal of robustness.
- Held-out-set evaluation. Build a test set of (input, expected-output) pairs that the prompt-writer never sees while writing prompts. Run the eval after each prompt change; the held-out performance is the independent metric.
- Adversarial-input testing. Use the Prompt Injection Tester and edge-case inputs (malformed dates, unit-mismatched numbers, contradictory source documents) on every prompt change. Failure on adversarial inputs is the early-warning signal.
The Stanford HAI AI Index Report 2025 documents that cross-model agreement on finance-specific tasks ranges from 60% to 85% [2]. The 15–40% disagreement rate is where the 2nd line of defence catches errors the developer missed.
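The cross-model agreement check above can be sketched as a field-by-field comparison of two structured outputs. This is one possible reading of "disagree substantively", under the assumption that both models return a decision dictionary; the field names (`ticker`, `action`, `size`) and the 10% tolerance are hypothetical choices for illustration.

```python
import math


def disagreements(
    a: dict,
    b: dict,
    exact_keys=("ticker", "action"),
    numeric_keys=("size",),
    rel_tol: float = 0.10,
) -> list:
    """Return the fields on which two model outputs substantively disagree."""
    flagged = []
    for key in exact_keys:
        # Categorical fields must match exactly.
        if a.get(key) != b.get(key):
            flagged.append(key)
    for key in numeric_keys:
        va, vb = a.get(key), b.get(key)
        if va is None and vb is None:
            continue  # both models omitted the field, so nothing to compare
        # Numeric fields may differ by up to rel_tol before being flagged.
        if va is None or vb is None or not math.isclose(va, vb, rel_tol=rel_tol):
            flagged.append(key)
    return flagged


# Hypothetical outputs from two different model families on the same prompt.
family_a = {"ticker": "AAPL", "action": "buy", "size": 100}
family_b = {"ticker": "AAPL", "action": "buy", "size": 108}
family_c = {"ticker": "AAPL", "action": "sell", "size": 100}

print(disagreements(family_a, family_b))  # sizes within tolerance
print(disagreements(family_a, family_c))  # action differs
```

A non-empty result is the trigger for human review. An empty result is only the weak robustness signal described above, since two model families can share the same blind spot.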
3rd line of defence: internal audit
SR 11-7's internal-audit requirement specifies "periodic review of the overall framework" [1]. For the solo LLM workflow:
- Monthly self-audit. Once a month, freeze the workflow, sample 20 random output decisions from the past 30 days, manually verify each against source. Track the verification pass rate over time; a drop is a signal.
- Machine-checkable invariants. Properties that should always hold (e.g., "every trade-decision JSON has all six required fields"; "no rationale contains the substring 'I don't have access to'"). The invariant-checker runs continuously and fails loudly on violation.
- Drift detection. Compare the distribution of outputs (e.g., the histogram of position sizes, the rate of "no trade" decisions) across rolling 30-day windows. A material distributional change is a flag for review.
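The two example invariants in the list above can be sketched as a checker that raises on the first violation. The text says only that there are six required fields, so the six names in `REQUIRED_FIELDS` are hypothetical placeholders, as are `InvariantViolation` and `check_invariants`.

```python
import json

# Hypothetical field names: the workflow's real schema defines the actual six.
REQUIRED_FIELDS = {"ticker", "action", "size", "confidence", "rationale", "timestamp"}
FORBIDDEN_SUBSTRINGS = ("I don't have access to",)


class InvariantViolation(Exception):
    """Raised when an output breaks a property that should always hold."""


def check_invariants(raw_json: str) -> dict:
    """Parse one trade-decision JSON and fail loudly on any violation."""
    try:
        decision = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise InvariantViolation(f"output is not valid JSON: {exc}") from exc
    if not isinstance(decision, dict):
        raise InvariantViolation("output is not a JSON object")
    missing = REQUIRED_FIELDS - set(decision)
    if missing:
        raise InvariantViolation(f"missing required fields: {sorted(missing)}")
    rationale = str(decision["rationale"]).lower()
    for phrase in FORBIDDEN_SUBSTRINGS:
        if phrase.lower() in rationale:
            raise InvariantViolation(f"rationale contains forbidden text: {phrase!r}")
    return decision


good = json.dumps({
    "ticker": "AAPL", "action": "buy", "size": 100, "confidence": 0.8,
    "rationale": "Guidance raised on the Q3 call.",
    "timestamp": "2026-01-05T14:30:00Z",
})
print(check_invariants(good)["action"])
```

Raising an exception, rather than logging a warning, is what makes the check fail loudly: the calling pipeline has to handle the violation explicitly before a decision can proceed.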
The self-audit is the slowest layer but the most penetrating. If the 20-sample check takes about an hour, then at $30/hour of operator time it costs about $30/month, which is small relative to the API spend of an active workflow. Skipping it means the failure modes accumulate undetected.
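Of the three audit checks, drift detection is the easiest to automate. The sketch below compares the mix of decision labels in two 30-day windows using total variation distance, which is one reasonable choice among several and not a method the guidance prescribes. The labels, the window contents, and the 0.15 threshold are hypothetical.

```python
from collections import Counter


def distribution(labels) -> dict:
    """Convert a list of decision labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("window is empty; cannot compute a distribution")
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p: dict, q: dict) -> float:
    """Half the summed absolute difference between two categorical distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def drift_check(previous_window, current_window, threshold: float = 0.15):
    """Return (distance, flagged) for two windows of decision labels."""
    distance = total_variation_distance(
        distribution(previous_window), distribution(current_window)
    )
    return distance, distance > threshold


# Hypothetical windows: the "no_trade" share doubles from 40% to 80%.
previous_window = ["buy"] * 30 + ["sell"] * 30 + ["no_trade"] * 40
current_window = ["buy"] * 10 + ["sell"] * 10 + ["no_trade"] * 80

distance, flagged = drift_check(previous_window, current_window)
print(round(distance, 2), flagged)
```

The distance runs from 0 (identical mix) to 1 (no overlap). For numeric outputs such as position sizes, the values would first need to be binned into histogram buckets before the same comparison applies.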
The fallback-chain pattern
A complementary architecture for the 2nd line of defence is the fallback chain: Haiku-class first pass, escalate to Sonnet on low confidence, escalate to Opus on disagreement between Haiku and Sonnet. The Fallback Chain Simulator models the cost and quality of this pattern.
The fallback chain is not strictly an MRM concept, but it implements one of MRM's structural ideas: redundant computation with independence between layers. A retail solo workflow that runs a single LLM call per decision has no redundancy; the fallback chain re-introduces it cheaply.
The cost of the fallback chain is bounded: most calls resolve at the cheap tier, and only the ambiguous ones escalate. The expected cost is typically 1.3–1.8× the single-cheap-tier cost, far less than always running the expensive tier. The quality lift is meaningful because disagreement-triggered escalation catches the worst class of single-model failure.
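The escalation rule and its cost can be sketched as follows. The three tiers are passed in as plain functions returning an `(answer, confidence)` pair, which is an assumption about how the model calls are wrapped. The confidence floor, the escalation rates, and the cost ratios are hypothetical inputs chosen for illustration, not measured values or vendor prices.

```python
def fallback_chain(prompt, cheap, mid, expensive, confidence_floor: float = 0.7):
    """Escalate on low confidence, then again on cheap/mid disagreement."""
    cheap_answer, confidence = cheap(prompt)
    if confidence >= confidence_floor:
        return cheap_answer, "cheap"
    mid_answer, _ = mid(prompt)
    if mid_answer == cheap_answer:
        return mid_answer, "mid"  # two tiers agree, so stop here
    final_answer, _ = expensive(prompt)
    return final_answer, "expensive"


def expected_cost_multiple(
    p_mid: float, p_expensive: float, mid_ratio: float, expensive_ratio: float
) -> float:
    """Expected cost per decision, in units of one cheap-tier call.

    Every decision pays for the cheap call; a fraction p_mid also pays for the
    mid tier, and a fraction p_expensive also pays for the expensive tier.
    """
    return 1.0 + p_mid * mid_ratio + p_expensive * expensive_ratio


# Stub tiers so the sketch runs offline.
def confident_cheap(prompt):
    return "hold", 0.9

def unsure_cheap(prompt):
    return "buy", 0.4

def mid_agrees(prompt):
    return "buy", 0.8

def mid_disagrees(prompt):
    return "sell", 0.8

def expensive_tier(prompt):
    return "sell", 0.95

print(fallback_chain("p", confident_cheap, mid_agrees, expensive_tier))
print(fallback_chain("p", unsure_cheap, mid_agrees, expensive_tier))
print(fallback_chain("p", unsure_cheap, mid_disagrees, expensive_tier))

# Hypothetical inputs: 15% of calls reach the mid tier (3x the cheap cost),
# 2% reach the expensive tier (15x the cheap cost).
print(round(expected_cost_multiple(0.15, 0.02, 3.0, 15.0), 2))
```

With these assumed inputs the multiple lands inside the 1.3–1.8× range quoted above. The result is sensitive to the escalation rates, so they are worth measuring on real traffic rather than assuming.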
What the framework does not catch
SR 11-7's three lines offer little protection against bad input data. In this framing, a model trained on bad data is a data risk rather than a model risk. For LLM-driven workflows the analogue is the source the LLM is conditioned on: if the source document is hallucinated by the upstream pipeline (e.g., a fabricated press release), MRM cannot catch it. The Hallucination Detector is the data-side defence.
The framework is also silent on regime change. A model validated on 2020-2023 data has no defence against a 2024 regime that didn't exist in the validation set. Walk-forward validation (see /articles/walk-forward-window-sizing-decision/) is the regime-change defence; SR 11-7 assumes the model class is stationary.
The Anthropic responsible-scaling-policy publication [3] documents a complementary framework (capability-based safety thresholds) that addresses risks SR 11-7 does not, such as the risk that the model itself produces dangerous content. For LLM-driven finance workflows the two frameworks together cover most operational risks.
Where the framework adds friction without value
For a solo retail trader running paper trades on a $5k account, the full three-lines architecture is overkill. The right scaled-down version:
- 1st line: keep git versioning, skip the formal eval harness; rely on inspection.
- 2nd line: occasional cross-model agreement on a sample of decisions; not every call.
- 3rd line: quarterly self-audit instead of monthly.
For institutional contexts (RIA, hedge fund, broker-dealer) the full framework should be treated as mandatory, even where SR 11-7 itself does not formally apply. The cost of building it (1-2 weeks of engineering for the eval harness, monthly time for self-audit) is small relative to the cost of an MRM-related enforcement action.
Connects to
- Model Selection Finance Survives Bench Cycles — the model-selection layer above MRM.
- Vendor Lock-In and Cross-Provider Fallback — the fallback-chain pattern in depth.
- Calibration Drift in LLM Confidence Scores — the drift-detection layer of the 3rd line.
- Model Selector for Finance — engine for model-tier selection.
- Hallucination Detector — companion content-validation gate.
- Fallback Chain Simulator — companion for the redundancy architecture.
References
- Basel Committee on Banking Supervision. (2020). "Principles for the sound management of operational risk." BIS Working Paper. https://www.bis.org/bcbs/publ/d514.htm
- OECD. "OECD AI Principles." https://oecd.ai/en/ai-principles
- Office of the Comptroller of the Currency. (2011). "OCC Bulletin 2011-12: Sound Practices for Model Risk Management." Companion to SR 11-7.
Footnotes
1. Board of Governors of the Federal Reserve System. (2011). "Supervisory Guidance on Model Risk Management." SR Letter 11-7, OCC Bulletin 2011-12. https://www.federalreserve.gov/boarddocs/srletters/2011/sr1107.htm
2. Stanford Institute for Human-Centered AI (HAI). (2025). "AI Index Report 2025." Chapter on technical performance and reliability across model families. https://hai.stanford.edu/ai-index/2025-ai-index-report
3. Anthropic. "Anthropic's Responsible Scaling Policy." https://www.anthropic.com/news/anthropics-responsible-scaling-policy
Frequently asked questions
- Is SR 11-7 legally binding for a solo retail trader?
- No — it is supervisory guidance for banks and bank holding companies regulated by the Federal Reserve. The framework is recommended as best practice, not mandated outside that perimeter.
- Can I outsource the 2nd-line validation to a vendor?
- Yes for some components. Cross-model agreement is naturally vendor-friendly; held-out-set evaluation is harder to outsource because the test set has to be domain-specific.
- How does this interact with EU AI Act requirements?
- The EU AI Act imposes risk management and human oversight on 'high-risk AI systems' including some finance applications. The Act adds requirements (transparency, post-market monitoring) SR 11-7 does not.
- What's the minimum eval-harness size to be useful?
- 50 examples is the floor; 200 is the comfortable point. Below 50, the eval has too little statistical power to detect anything but large changes in pass rate.
- Should the eval harness include adversarial inputs by default?
- Yes. Maintain a separate adversarial-input set the prompt-writer never sees. Adversarial inputs are the cheapest way to find prompt fragility.
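The eval-size guidance in the FAQ above can be sanity-checked with a back-of-envelope binomial calculation. The sketch uses the normal approximation for a 95% interval around an observed pass rate; the 90% pass rate is a hypothetical example, and the approximation becomes rough when the pass rate is close to 0% or 100%.

```python
import math


def pass_rate_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a pass rate on n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    # Standard error of a binomial proportion, scaled by the z-score.
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


for n in (20, 50, 200):
    print(n, round(pass_rate_half_width(0.9, n), 3))
```

At a 90% pass rate the interval is roughly ±8 percentage points with 50 cases and roughly ±4 with 200. A prompt change that moves the true pass rate by a few points is therefore hard to distinguish from noise on a 50-case set.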