Federal Reserve SR 11-7 has governed model risk management at US banking institutions since April 2011. In the fifteen years since its publication, the guidance has matured from aspirational principles to deeply embedded compliance infrastructure — model inventories, validation teams, model risk committees, independent review requirements. Banks got good at managing the risk of statistical credit models, stress testing frameworks, and prepayment models.
Then large language models arrived in decision-relevant banking workflows, and most bank model risk programs are still catching up. The LLM compliance gap is not a question of whether SR 11-7 applies. It does. The questions now are how to apply its three-part validation framework to non-traditional models, how to satisfy its independent review requirements for systems that can produce arbitrarily variable outputs, and, critically, how to build the enforcement layer that closes the gap between validation findings and runtime behavior.
This article covers SR 11-7's history and scope, how the Federal Reserve and OCC have clarified its application to AI models, what the validation framework requires for LLMs in practice, and where pre-execution governance infrastructure fits into the SR 11-7 compliance picture.
SR 11-7: The Foundational Framework
SR 11-7 was published in response to systemic model failures during the 2008 financial crisis, where model risk — particularly in mortgage valuation and risk measurement — contributed to catastrophic losses and systemic instability. The guidance established model risk management as a distinct discipline requiring: model identification and inventory, model validation by personnel independent from model development, ongoing monitoring, and model risk governance including Board and senior management accountability.
The guidance defines a model with deliberate breadth: "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." The guidance adds that the definition covers approaches whose inputs are partly or wholly qualitative, provided the output is quantitative in nature. The inclusion of "system, or approach" and the phrase "quantitative estimates", which practitioners generally read to include classification outputs and categorical predictions, means the definition can encompass LLMs used in decision-relevant contexts.
The OCC's 2021 model risk management booklet in the Comptroller's Handbook (announced in OCC Bulletin 2021-39, "Model Risk Management") explicitly extended SR 11-7 principles to AI and machine learning models. The guidance stated that "the same principles of sound model risk management apply" regardless of model type, and noted specific considerations for AI/ML models including explainability, bias testing, and behavioral stability requirements. For national banks, federal savings associations, and state member banks, the consensus is clear: AI models in decision-relevant workflows are models under SR 11-7, with additional considerations for non-traditional model types.
LLMs in the SR 11-7 Scope
The question of LLM scope under SR 11-7 hinges on intended use. Not every LLM deployment in a bank is a model under SR 11-7. An internal chatbot that helps employees find HR policy documents is not a model. An LLM that generates credit risk narratives for underwriter review, classifies customer complaints for regulatory reporting, produces adverse action code recommendations, or assists relationship managers in evaluating commercial loan applications — these are models, and SR 11-7 applies in full.
The key factors that bring an LLM into SR 11-7 scope are:
- Decision relevance — Does the output inform a decision about a customer, counterparty, or risk position? If yes, the model designation applies regardless of whether the AI is the final decision-maker.
- Materiality — Does the LLM output materially affect outcomes that could expose the bank or its customers to financial, legal, or reputational risk? Materiality determines the rigor tier within the validation framework.
- Regulatory interaction — Is the LLM output used in any regulatory reporting, compliance determination, or supervisory examination context? ECOA adverse action analysis, BSA/AML classification, fair lending monitoring — all require the full SR 11-7 program.
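The three scoping factors above can be sketched as a simple triage function. This is a minimal illustration, not a regulatory test: the `LlmUseCase` fields, the `scope_assessment` function, and the label strings are all hypothetical names introduced here, and a real program would apply the bank's documented model definition policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmUseCase:
    """Hypothetical description of one LLM deployment, used for scoping only."""
    name: str
    informs_decision: bool    # output informs a customer, counterparty, or risk decision
    material_exposure: bool   # output could create financial, legal, or reputational risk
    regulatory_use: bool      # output feeds reporting, compliance, or examination work


def scope_assessment(use_case: LlmUseCase) -> str:
    """Return an illustrative scoping label for a deployment."""
    if use_case.regulatory_use:
        return "in scope: full program"
    if use_case.informs_decision:
        # Materiality sets the rigor, not whether the model designation applies.
        if use_case.material_exposure:
            return "in scope: high rigor"
        return "in scope: standard rigor"
    return "out of scope: document rationale"


hr_chatbot = LlmUseCase("HR policy search", False, False, False)
credit_narrative = LlmUseCase("Credit risk narrative", True, True, False)
complaint_classifier = LlmUseCase("Complaint classification", True, True, True)

for case in (hr_chatbot, credit_narrative, complaint_classifier):
    print(f"{case.name}: {scope_assessment(case)}")
```

Even for out-of-scope deployments, the sketch returns a label that asks for a documented rationale, since examiners may ask why a deployment was excluded from the inventory.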
The Three-Part Validation Framework Applied to LLMs
SR 11-7's validation framework has three core elements: evaluation of conceptual soundness, ongoing monitoring, and outcomes analysis. For traditional quantitative models, each element has established methodologies. For LLMs, banks are developing new approaches to satisfy the same requirements.
Conceptual Soundness for LLMs
Conceptual soundness validation asks whether the model's design is appropriate for its intended use. For traditional credit models, this involves reviewing the mathematical specification, testing theoretical predictions against historical data, and assessing whether the model captures the dynamics it purports to represent.
For LLMs, conceptual soundness validation requires a different approach. The "mathematical specification" of an LLM is its behavioral policy — the system prompt, any fine-tuning, and the deployment constraints that shape outputs. Validation must assess: Is the behavioral policy accurately specified? Does it correctly encode the constraints the bank intends to apply? Can violations of that specification be detected?
This is where most bank LLM programs have a gap. They can validate that an LLM produces good outputs on a curated test set. What they cannot validate with traditional methods is that the model will reliably stay within its behavioral specification on the full distribution of production inputs — including adversarial inputs, edge cases, and the long tail of unusual queries that accumulate at scale.
Conceptual soundness validation for LLMs must therefore include: red-team testing for behavioral boundary violations, consistency testing across semantically equivalent inputs, and evidence that the behavioral specification is technically enforced — not merely aspirationally stated.
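Consistency testing across semantically equivalent inputs can be sketched as follows. This is one possible approach under stated assumptions: `consistency_rate`, `stub_model`, and `rating_of` are hypothetical names, the stub stands in for a real model call so the example runs offline, and the paraphrase set would in practice be curated by the validation team.

```python
from collections import Counter
from typing import Callable, Iterable


def consistency_rate(
    generate: Callable[[str], str],
    extract_label: Callable[[str], str],
    paraphrases: Iterable[str],
) -> float:
    """Fraction of paraphrases whose extracted label matches the most common label."""
    labels = [extract_label(generate(prompt)) for prompt in paraphrases]
    if not labels:
        raise ValueError("at least one paraphrase is required")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM call (assumption for this sketch)."""
    return "Risk rating: HIGH" if "overdue" in prompt.lower() else "Risk rating: LOW"


def rating_of(output: str) -> str:
    """Pull the decision-relevant label out of the free-text output."""
    return output.split(":")[-1].strip()


paraphrases = [
    "Borrower has two overdue payments this year. Rate the risk.",
    "Rate the risk: the borrower was overdue twice in the past 12 months.",
    "The borrower missed two payments this year. Rate the risk.",
]
rate = consistency_rate(stub_model, rating_of, paraphrases)
print(f"consistency rate: {rate:.2f}")
```

The third paraphrase describes the same facts without the word "overdue", so the stub rates it differently. That kind of wording-sensitive divergence is what this test is meant to surface.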
Ongoing Monitoring Requirements
SR 11-7's ongoing monitoring requirement is particularly challenging for LLMs. For traditional models, monitoring involves tracking a defined output metric — a credit score distribution, a loss rate prediction, a prepayment speed — against historical benchmarks. Monitoring alerts trigger when the metric deviates beyond a defined threshold.
For LLMs producing natural language outputs, the monitoring challenge is more complex. The relevant metrics include:
- Policy violation rate — What fraction of outputs contain content that violates the model's behavioral specification? This requires a policy enforcement mechanism that can evaluate each output against defined rules, not just a periodic sampling review.
- Output distribution stability — Are the statistical properties of model outputs (length, sentiment distribution, topic distribution, confidence language patterns) stable over time? Significant shifts may indicate prompt injection, model drift after updates, or systematic changes in input patterns.
- Human override rate — What fraction of AI recommendations are overridden by the human decision-makers using them? Rising override rates are a monitoring signal indicating either deteriorating model quality or changing business context.
- Adverse action pattern analysis — For credit-relevant AI, are adverse action patterns consistent with the bank's fair lending policies? Disparate impact analysis of AI-influenced decisions requires the ability to link each decision to the AI output that influenced it.
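Several of the metrics above reduce to simple computations once each output is logged. The sketch below computes a violation or override rate and uses the population stability index (PSI), a drift statistic common in credit model monitoring, as one possible measure of output distribution stability. Choosing PSI is an assumption of this example, not something SR 11-7 prescribes, and the function names and bucket edges are hypothetical.

```python
import math
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Share of True values; usable for violation and override rates."""
    return sum(flags) / len(flags) if flags else 0.0


def population_stability_index(
    baseline: Sequence[float], current: Sequence[float], edges: Sequence[float]
) -> float:
    """PSI between two samples, bucketed by shared ascending edges."""
    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bucket index is the number of edges the value meets or exceeds.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor keeps empty buckets from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


violations = [False, False, True, False]        # one flag per governed output
overrides = [True, False, False, False, False]  # one flag per AI recommendation
baseline_lengths = [120, 150, 180, 210, 240, 260]
shifted_lengths = [300, 310, 320, 330, 340, 350]

print("violation rate:", rate(violations))
print("override rate:", rate(overrides))
print("PSI vs. self:", population_stability_index(baseline_lengths, baseline_lengths, [150, 200, 250]))
print("PSI vs. shifted:", population_stability_index(baseline_lengths, shifted_lengths, [150, 200, 250]))
```

Alert thresholds for any of these metrics are a calibration decision for the model risk function and are deliberately left out of the sketch.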
SR 11-7's ongoing monitoring requirement presupposes that you can retrospectively review what the model produced and under what conditions. Most LLM deployments in banking lack the audit trail infrastructure to support this. If a model produces a credit risk narrative that influences an underwriter's decision, and that decision is later challenged in a fair lending examination, the bank must be able to reconstruct exactly what the AI said, what inputs it processed, and which version of the model produced the output. Post-hoc log reconstruction from generic application logs is typically insufficient for the specificity examinations require.
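A minimal sketch of what a per-output record might capture is shown below. The field names are illustrative assumptions, not a regulatory schema. Chaining each record's digest to the previous one is a design choice of this example: it makes later alteration of a stored record detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OutputRecord:
    """Hypothetical per-output record; field names are illustrative."""
    record_id: str
    timestamp_utc: str
    model_version: str    # vendor model identifier or internal version tag
    prompt_version: str   # version of the system prompt or behavioral policy
    input_text: str
    output_text: str
    decision_id: str      # links the output to the decision it influenced


def record_digest(record: OutputRecord, previous_digest: str) -> str:
    """SHA-256 over the canonical JSON of the record plus the previous digest."""
    payload = json.dumps(asdict(record), sort_keys=True) + previous_digest
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = OutputRecord("r-001", "2025-01-15T14:02:11Z", "vendor-model-a",
                     "policy-v3", "Summarize credit file 123.",
                     "Sample narrative one.", "loan-123")
second = OutputRecord("r-002", "2025-01-15T14:05:40Z", "vendor-model-a",
                      "policy-v3", "Summarize credit file 456.",
                      "Sample narrative two.", "loan-456")

d1 = record_digest(first, previous_digest="")
d2 = record_digest(second, previous_digest=d1)
print(d1)
print(d2)
```

With records like these, the examination question "what did the AI say, on what input, under which model version" becomes a lookup by `decision_id` rather than a reconstruction exercise.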
SR 11-7 Section V: The Independent Review Requirement
SR 11-7 Section V (Model Validation) is the governance backbone of the framework: model validation must be conducted by individuals with appropriate expertise who are independent from the model development and model use functions. This independence requirement is not nominal. Examiners look for organizational separation, separate reporting lines, and evidence that validation findings are not pre-approved by the development team.
For LLM deployments, the independent review requirement creates a specific challenge: who is qualified to conduct the validation, and what does independence mean when the model is a commercial API product from a technology vendor?
Independence in Practice for LLM Models
For internally developed or fine-tuned LLMs, the independence requirement maps relatively directly to traditional model validation structure: the model risk management function, separate from the AI development team, conducts the validation. The challenge is building validation capability in-house for non-traditional model types — many bank model risk functions have limited expertise in LLM evaluation methodology.
For vendor LLMs, the provisions on vendor and other third-party products in SR 11-7 Section V apply. The guidance is explicit that bank reliance on vendor models does not transfer model risk responsibility. Banks must conduct their own validation activities on vendor models. The vendor's model card, safety evaluation documentation, or benchmark results are inputs to the bank's validation, not substitutes for it. The independent review must assess the vendor model in the bank's specific deployment context, on data representative of the bank's customers and use cases.
What Validation Must Produce
The independent validation must produce written documentation of findings, including:
- Assessment of conceptual soundness for the intended use context
- Results of outcome analysis testing with identified performance metrics
- Assessment of ongoing monitoring adequacy and recommended monitoring metrics
- Identified limitations of the model and conditions under which outputs may be unreliable
- Use restrictions or conditions on deployment (e.g., limited to specific use cases, requires human review of outputs above certain risk thresholds)
- Recommendations for risk mitigants to address identified limitations
Validation findings must be formally reported to model risk management and the model risk committee. Findings rated as material — particularly those identifying behavioral limitation or policy compliance risks — must be addressed before the model is approved for production use, or the deployment must operate under explicit use restrictions with enhanced monitoring.
Pre-Execution Governance vs. Post-Hoc Audit: The Critical Distinction
The most significant gap in most bank LLM compliance programs is the confusion between monitoring (detecting when a model behaves badly after the fact) and controls (preventing bad behavior before it occurs). SR 11-7 requires both — but the controls component is more demanding than the monitoring component, and it is the component most frequently absent.
SR 11-7 Section VI (Governance, Policies, and Controls) addresses model risk controls, which operate alongside validation. Controls are risk management measures that limit the impact of model error or model misuse. For LLMs, post-hoc audit (reviewing a sample of outputs periodically) is a monitoring activity. It tells you what went wrong. It does not prevent harm from occurring.
Pre-execution governance infrastructure — systems that evaluate LLM output against policy before delivery — satisfies SR 11-7's controls requirement in a way that post-hoc audit cannot. Every output is evaluated. Policy violations are blocked or modified before they influence a decision. The enforcement action is logged with a signed audit record, creating the documentation that supports independent review findings.
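The evaluate, block, and log sequence described above can be sketched in a few lines. This is an illustration under stated assumptions, not any product's implementation: the `Rule` class, the regular-expression rules, `POLICY_VERSION`, and the `enforce` and `verify` functions are hypothetical, and a production system would use far richer policy evaluation than pattern matching. Signing uses HMAC-SHA256 from the Python standard library, a symmetric scheme chosen here for brevity.

```python
import hashlib
import hmac
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str  # regular expression that marks a policy violation


POLICY_VERSION = "policy-v3"  # hypothetical version label
RULES = [
    Rule("no-approval-language", r"\b(approve|deny) (the|this) loan\b"),
    Rule("no-protected-class-reference", r"\b(race|religion|national origin)\b"),
]


def enforce(output_text: str, signing_key: bytes) -> tuple[str, dict]:
    """Evaluate output before delivery; return (text to deliver, signed record)."""
    triggered = [r.rule_id for r in RULES
                 if re.search(r.pattern, output_text, re.IGNORECASE)]
    record = {
        "policy_version": POLICY_VERSION,
        "rules_triggered": triggered,
        "disposition": "blocked" if triggered else "delivered",
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    record["signature"] = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    # Blocked outputs never reach the user; an empty string stands in for a refusal.
    return ("" if triggered else output_text), record


def verify(record: dict, signing_key: bytes) -> bool:
    """Recompute the signature over every field except the signature itself."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    body = json.dumps(unsigned, sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


key = b"demo-key-not-for-production"
blocked_text, blocked_record = enforce("I recommend you approve this loan.", key)
ok_text, ok_record = enforce("Debt service coverage is 1.4x.", key)
print(blocked_record["disposition"], blocked_record["rules_triggered"])
print(ok_record["disposition"], verify(ok_record, key))
```

Because the policy version, the triggered rules, and the disposition are all inside the signed body, a validator can later confirm which rule set was in force for any given output.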
How CoreGuard Aligns with SR 11-7 Section V
Section V's independent review requirement benefits directly from deterministic enforcement infrastructure. The independent validation team can make specific, testable assertions about model behavior when a pre-execution enforcement layer is in place:
| SR 11-7 Requirement | Without Pre-Execution Enforcement | With CoreGuard Pre-Execution Enforcement |
|---|---|---|
| Policy compliance verification | PARTIAL — Sample-based review, cannot confirm 100% policy adherence | SATISFIED — Every output evaluated; enforcement rate documented in audit log |
| Behavioral limitation documentation | PARTIAL — Limitations identified in testing but not operationally enforced | SATISFIED — Policy rules encode limitations as hard blocks; enforcement evidence available |
| Use restriction enforcement | ABSENT — Use restrictions exist in policy documents but not technically enforced | SATISFIED — Use restrictions configured as enforcement rules; violations blocked at runtime |
| Ongoing monitoring — policy violation rate | PARTIAL — Requires periodic manual review; sampling lag between violation and detection | SATISFIED — Real-time policy violation rate computed from enforcement log; alert thresholds configurable |
| Audit trail for examination | PARTIAL — Generic application logs may not capture model output with sufficient fidelity | SATISFIED — Signed, immutable decision certificates with policy version, rule triggered, and disposition for every governed output |
Vendor Model Requirements: Applying Section V
SR 11-7 Section V addresses validation of vendor and other third-party products, and it is increasingly relevant as banks deploy commercial LLMs from technology providers. The guidance requires that banks establish standards for vendor model use that ensure the models are subject to appropriate validation and ongoing monitoring. Key requirements include:
Validation of vendor models in the bank's deployment context. Vendor LLM evaluation reports prepared for general audiences are not sufficient. The bank must validate the model on data representative of its specific customers, loan types, and decision contexts. This requires the bank to have an evaluation methodology and evaluation data — not just the ability to read a vendor model card.
Ongoing monitoring of vendor model behavior. Model updates from vendors — which for commercial LLMs can occur without advance notice — require re-validation assessment. If a vendor updates the underlying model in a way that changes behavioral characteristics relevant to the bank's use case, the model risk program must be able to detect this change. Pre-execution enforcement infrastructure that logs output characteristics provides the baseline necessary to detect behavioral drift after vendor model updates.
Contracts with vendor model providers. Consistent with the vendor model provisions in SR 11-7 Section V, contracts with vendor model providers should include provisions for: access to model documentation needed for validation, notification of material model changes, and service level commitments relevant to model performance. Banks relying on third-party LLM APIs should review whether their vendor contracts satisfy these requirements.
Commercial LLM providers routinely update their models — sometimes with advance notice, sometimes without. For banks with SR 11-7-scoped LLM deployments, an unnoticed behavioral change in the underlying model that affects credit-relevant outputs creates a significant model risk event. Pre-execution enforcement infrastructure with logged output metrics provides an early detection mechanism: when the enforcement log shows a change in block rate, output length distribution, or topic distribution following a known or suspected model update, the model risk team has an objective signal to trigger re-validation assessment.
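One way to turn a change in block rate into an objective signal is a two-proportion z-test comparing the windows before and after a suspected model update. The test, the function names, and the threshold below are illustrative assumptions for this sketch. The source does not prescribe a particular statistic, and a real threshold would come from the bank's own calibration.

```python
import math


def two_proportion_z(blocked_before: int, total_before: int,
                     blocked_after: int, total_after: int) -> float:
    """z statistic for the change in block rate between two windows."""
    p1 = blocked_before / total_before
    p2 = blocked_after / total_after
    pooled = (blocked_before + blocked_after) / (total_before + total_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_before + 1 / total_after))
    return (p2 - p1) / se if se > 0 else 0.0


def needs_revalidation_review(z: float, threshold: float = 3.0) -> bool:
    """Flag for review when the shift exceeds an illustrative threshold."""
    return abs(z) >= threshold


# Hypothetical counts from the enforcement log, before and after an update.
z = two_proportion_z(blocked_before=40, total_before=10_000,
                     blocked_after=95, total_after=10_000)
print(f"z = {z:.2f}, review = {needs_revalidation_review(z)}")
```

The same comparison can be run on other logged characteristics, such as the share of outputs in each length bucket, so that a vendor update that does not change the block rate can still be noticed.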
Building the SR 11-7 Compliance Program for LLMs
A practical SR 11-7 compliance program for LLM deployments in banking requires five integrated components:
1. Model inventory with LLM classification. Every LLM deployment in decision-relevant workflows must be inventoried. The inventory must record: intended use, materiality assessment, assigned risk tier, validation status, use restrictions, and monitoring metrics. This should be a living document maintained by the model risk function.
2. Conceptual soundness validation methodology for LLMs. The independent validation team needs a documented methodology for evaluating LLM conceptual soundness — including behavioral specification review, red-team testing protocols, and criteria for assessing whether behavioral constraints are technically enforceable. This methodology should be approved by the model risk committee and reviewed annually.
3. Pre-execution enforcement infrastructure. For high-materiality LLM deployments, the validation program should require deployment of a pre-execution enforcement layer as a condition of production use. This addresses the controls expectations in SR 11-7 Section VI and provides the audit trail that independent review under Section V requires. The enforcement layer configuration should be documented as part of the model inventory.
4. Ongoing monitoring program with LLM-specific metrics. The ongoing monitoring program must be extended to cover LLM-specific metrics: policy violation rates, output distribution stability, human override rates, and (for credit AI) adverse action pattern analysis. Monitoring thresholds should be calibrated based on validation findings and reviewed by the model risk committee.
5. Vendor model management process. For third-party LLMs, a formal vendor model management process should document: the validation conducted, the evidence reviewed from the vendor, the use restrictions applied, and the monitoring approach for detecting vendor model changes. The process should include a protocol for triggered re-validation when monitoring signals indicate potential behavioral change.
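The five components above meet in the inventory record. The sketch below shows one hypothetical shape for an inventory entry, with a check that lists what is still missing before production use. The `InventoryEntry` fields, the status strings, and the `deployment_gaps` rules are assumptions made for illustration, and each bank's inventory schema and approval criteria will differ.

```python
from dataclasses import dataclass, field
from typing import Optional

APPROVED_STATUSES = {"approved", "approved-with-restrictions"}


@dataclass
class InventoryEntry:
    """Hypothetical inventory record for one LLM deployment."""
    model_id: str
    intended_use: str
    materiality: str                  # e.g. "high", "medium", "low"
    validation_status: str            # e.g. "pending", "approved"
    use_restrictions: list[str] = field(default_factory=list)
    monitoring_metrics: list[str] = field(default_factory=list)
    vendor: Optional[str] = None
    vendor_change_protocol: Optional[str] = None
    enforcement_policy_version: Optional[str] = None


def deployment_gaps(entry: InventoryEntry) -> list[str]:
    """List the items still missing before the entry is production-ready."""
    gaps = []
    if entry.validation_status not in APPROVED_STATUSES:
        gaps.append("independent validation not approved")
    if entry.validation_status == "approved-with-restrictions" and not entry.use_restrictions:
        gaps.append("use restrictions not recorded")
    if entry.materiality == "high" and entry.enforcement_policy_version is None:
        gaps.append("enforcement layer configuration not documented")
    if not entry.monitoring_metrics:
        gaps.append("no monitoring metrics defined")
    if entry.vendor is not None and entry.vendor_change_protocol is None:
        gaps.append("no vendor change re-validation protocol")
    return gaps


entry = InventoryEntry(
    model_id="llm-credit-narrative",
    intended_use="Draft credit risk narratives for underwriter review",
    materiality="high",
    validation_status="approved-with-restrictions",
    use_restrictions=["human review of every narrative"],
    monitoring_metrics=["policy_violation_rate", "human_override_rate"],
    vendor="ExampleVendor",
)
for gap in deployment_gaps(entry):
    print("gap:", gap)
```

Keeping the readiness check next to the inventory record lets the model risk committee see open items per deployment without consulting separate trackers.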