SR 11-7 AI Governance: How Banks Are Enforcing Model Risk Management in 2026

Federal Reserve SR 11-7 has governed model risk management at US banking institutions since April 2011. In the fifteen years since its publication, the guidance has matured from aspirational principles to deeply embedded compliance infrastructure — model inventories, validation teams, model risk committees, independent review requirements. Banks got good at managing the risk of statistical credit models, stress testing frameworks, and prepayment models.

Then large language models arrived in decision-relevant banking workflows, and most bank model risk programs are still catching up. The LLM compliance gap is not a question of whether SR 11-7 applies. It does. The questions now are how to apply its three-tier validation framework to non-traditional models, how to satisfy its independent review requirements for systems that can produce arbitrarily variable outputs, and, critically, how to build the enforcement layer that closes the gap between validation findings and runtime behavior.

This article covers SR 11-7's history and scope, how the Federal Reserve and OCC have clarified its application to AI models, what the validation framework requires for LLMs in practice, and where pre-execution governance infrastructure fits into the SR 11-7 compliance picture.

SR 11-7: The Foundational Framework

SR 11-7 was published in response to systemic model failures during the 2008 financial crisis, where model risk — particularly in mortgage valuation and risk measurement — contributed to catastrophic losses and systemic instability. The guidance established model risk management as a distinct discipline requiring: model identification and inventory, model validation by personnel independent from model development, ongoing monitoring, and model risk governance including Board and senior management accountability.

The guidance defines a model with deliberate breadth: "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." The inclusion of "system or approach" and the phrase "quantitative estimates" — which the guidance clarifies includes classification outputs and categorical predictions — means the definition encompasses LLMs used in decision-relevant contexts.

OCC Bulletin 2021-21: The Explicit Extension to AI

The OCC's 2021 update to model risk management guidance (OCC Bulletin 2021-21, "Model Risk Management") explicitly extended SR 11-7 principles to AI and machine learning models. The guidance stated that "the same principles of sound model risk management apply" regardless of model type, and noted specific considerations for AI/ML models including explainability, bias testing, and behavioral stability requirements. For national banks, federal savings associations, and state member banks, the consensus is clear: AI models in decision-relevant workflows are models under SR 11-7, with additional considerations for non-traditional model types.

LLMs in the SR 11-7 Scope

The question of LLM scope under SR 11-7 hinges on intended use. Not every LLM deployment in a bank is a model under SR 11-7. An internal chatbot that helps employees find HR policy documents is not a model. An LLM that generates credit risk narratives for underwriter review, classifies customer complaints for regulatory reporting, produces adverse action code recommendations, or assists relationship managers in evaluating commercial loan applications — these are models, and SR 11-7 applies in full.

The key factors that bring an LLM into SR 11-7 scope are:

Decision relevance — Does the output inform a decision about a customer, counterparty, or risk position? If yes, the model designation applies regardless of whether the AI is the final decision-maker.
Materiality — Does the LLM output materially affect outcomes that could expose the bank or its customers to financial, legal, or reputational risk? Materiality determines the rigor tier within the validation framework.
Regulatory interaction — Is the LLM output used in any regulatory reporting, compliance determination, or supervisory examination context? ECOA adverse action analysis, BSA/AML classification, fair lending monitoring — all require the full SR 11-7 program.
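The three scoping factors can be expressed as a simple decision rule. The sketch below is illustrative only: the `LlmDeployment` fields, the `scope_assessment` function, and the "full" / "reduced" rigor labels are assumptions made for this example, not terms defined by SR 11-7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmDeployment:
    """Hypothetical description of one LLM use case; field names are illustrative."""
    name: str
    informs_decision: bool  # decision relevance
    material_impact: bool   # materiality
    regulatory_use: bool    # regulatory interaction


def scope_assessment(d: LlmDeployment) -> dict:
    """Apply the three scoping factors from the text to one deployment."""
    # Decision relevance or regulatory use brings the deployment into scope.
    in_scope = d.informs_decision or d.regulatory_use
    if not in_scope:
        rigor = "none"
    elif d.material_impact or d.regulatory_use:
        # Material or regulatory uses get the full program.
        rigor = "full"
    else:
        # Assumed label for in-scope, lower-materiality uses.
        rigor = "reduced"
    return {"name": d.name, "in_scope": in_scope, "rigor": rigor}


hr_bot = LlmDeployment("hr_policy_chatbot", False, False, False)
credit = LlmDeployment("credit_risk_narratives", True, True, False)
print(scope_assessment(hr_bot))
print(scope_assessment(credit))
```

In practice the scoping decision is a documented judgment by the model risk function; a rule like this would only record and standardize that judgment.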

The Three-Tier Validation Framework Applied to LLMs

SR 11-7's validation framework distinguishes three types of model evaluation. For traditional quantitative models, each tier has established methodologies. For LLMs, banks are developing new approaches to satisfy the same requirements.

Tier 1 — Conceptual Soundness

Design Validation

For LLMs: Does the behavioral specification correctly represent the intended use case? Can the model's constraints be technically verified? Do the training data and fine-tuning approach support the claimed capabilities for the deployment context?

Tier 2 — Ongoing Monitoring

Performance Tracking

For LLMs: Tracking output distributions, policy violation rates, behavioral drift across model versions, and consistency metrics. Static test sets are insufficient — monitoring must be continuous and cover the full distribution of production inputs.

Tier 3 — Outcomes Analysis

Decision Quality Review

For LLMs: Evaluating the downstream quality of decisions informed by AI output. For credit AI, this requires linking AI recommendation patterns to loan performance outcomes, adverse action accuracy, and fair lending compliance metrics.

Conceptual Soundness for LLMs

Conceptual soundness validation asks whether the model's design is appropriate for its intended use. For traditional credit models, this involves reviewing the mathematical specification, testing theoretical predictions against historical data, and assessing whether the model captures the dynamics it purports to represent.

For LLMs, conceptual soundness validation requires a different approach. The "mathematical specification" of an LLM is its behavioral policy — the system prompt, any fine-tuning, and the deployment constraints that shape outputs. Validation must assess: Is the behavioral policy accurately specified? Does it correctly encode the constraints the bank intends to apply? Can violations of that specification be detected?

This is where most bank LLM programs have a gap. They can validate that an LLM produces good outputs on a curated test set. What they cannot validate with traditional methods is that the model will reliably stay within its behavioral specification on the full distribution of production inputs — including adversarial inputs, edge cases, and the long tail of unusual queries that accumulate at scale.

Conceptual soundness validation for LLMs must therefore include: red-team testing for behavioral boundary violations, consistency testing across semantically equivalent inputs, and evidence that the behavioral specification is technically enforced — not merely aspirationally stated.
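Consistency testing across semantically equivalent inputs can be sketched as a small harness. Everything here is hypothetical: `consistency_rate` is a name chosen for this example, and `stub_classifier` stands in for a real LLM call so the example runs on its own. Exact-match comparison suits classification-style outputs; free-text outputs would need a semantic comparison that this sketch does not attempt.

```python
from typing import Callable, Iterable


def consistency_rate(
    model: Callable[[str], str],
    paraphrase_groups: Iterable[list[str]],
    normalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> float:
    """Fraction of paraphrase groups where every prompt yields the same normalized output."""
    groups = list(paraphrase_groups)
    if not groups:
        raise ValueError("at least one paraphrase group is required")
    consistent = 0
    for prompts in groups:
        outputs = {normalize(model(p)) for p in prompts}
        if len(outputs) == 1:
            consistent += 1
    return consistent / len(groups)


def stub_classifier(prompt: str) -> str:
    """Stand-in for a real LLM call: labels a complaint by keyword."""
    return "FEES" if "fee" in prompt.lower() else "OTHER"


groups = [
    ["I was charged a fee twice", "Duplicate FEE on my account"],
    ["The overdraft fee is wrong", "The overdraft charge is wrong"],
]
print(consistency_rate(stub_classifier, groups))
```

The second group exposes the stub's brittleness: two phrasings of the same complaint receive different labels, which is the kind of finding a validation report would record as a limitation.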

Ongoing Monitoring Requirements

SR 11-7's ongoing monitoring requirement is particularly challenging for LLMs. For traditional models, monitoring involves tracking a defined output metric — a credit score distribution, a loss rate prediction, a prepayment speed — against historical benchmarks. Monitoring alerts trigger when the metric deviates beyond a defined threshold.

For LLMs producing natural language outputs, the monitoring challenge is more complex. The relevant metrics include:

Policy violation rate — What fraction of outputs contain content that violates the model's behavioral specification? This requires a policy enforcement mechanism that can evaluate each output against defined rules, not just a periodic sampling review.
Output distribution stability — Are the statistical properties of model outputs (length, sentiment distribution, topic distribution, confidence language patterns) stable over time? Significant shifts may indicate prompt injection, model drift after updates, or systematic changes in input patterns.
Human override rate — What fraction of AI recommendations are overridden by the human decision-makers using them? Rising override rates are a monitoring signal indicating either deteriorating model quality or changing business context.
Adverse action pattern analysis — For credit-relevant AI, are adverse action patterns consistent with the bank's fair lending policies? Disparate impact analysis of AI-influenced decisions requires the ability to link each decision to the AI output that influenced it.
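Two of the metrics above, policy violation rate and human override rate, reduce to simple ratios once every output is logged with its evaluation result. The sketch below assumes a hypothetical per-output record (`GovernedOutput`) and illustrative threshold values; neither comes from SR 11-7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernedOutput:
    """Hypothetical per-output log record; field names are illustrative."""
    output_id: str
    violated_policy: bool  # result of evaluating this output against the rules
    overridden: bool       # True if the human reviewer rejected the recommendation


def monitoring_rates(records: list[GovernedOutput]) -> dict:
    """Compute violation and override rates over one monitoring window."""
    if not records:
        raise ValueError("no records in the monitoring window")
    n = len(records)
    return {
        "policy_violation_rate": sum(r.violated_policy for r in records) / n,
        "human_override_rate": sum(r.overridden for r in records) / n,
    }


def breached(rates: dict, thresholds: dict) -> list[str]:
    """Names of metrics whose rate exceeds its alert threshold."""
    return sorted(k for k, limit in thresholds.items() if rates.get(k, 0.0) > limit)


window = [
    GovernedOutput("a", False, False),
    GovernedOutput("b", True, True),
    GovernedOutput("c", False, True),
    GovernedOutput("d", False, False),
]
rates = monitoring_rates(window)
# Threshold values are placeholders; real ones would come from validation findings.
print(rates, breached(rates, {"policy_violation_rate": 0.1, "human_override_rate": 0.6}))
```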

The Audit Trail Gap

SR 11-7's ongoing monitoring requirement presupposes that you can retrospectively review what the model produced and under what conditions. Most LLM deployments in banking lack the audit trail infrastructure to support this. If a model produces a credit risk narrative that influences an underwriter's decision, and that decision is later challenged in a fair lending examination, the bank must be able to reconstruct exactly what the AI said, what inputs it processed, and which version of the model produced the output. Post-hoc log reconstruction from generic application logs is typically insufficient for the specificity examinations require.
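To make the reconstruction requirement concrete, here is a minimal sketch of a purpose-built audit record that captures the input, the output, and the model and policy versions, and links each record to the previous one with a hash so later edits are detectable. The function and field names are assumptions for illustration. A production design would also need access controls, retention rules, and careful handling of customer data in the stored prompt.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def append_audit_record(log: list[dict], *, prompt: str, output: str,
                        model_version: str, policy_version: str) -> dict:
    """Append a record whose hash covers the previous record's hash (a simple hash chain)."""
    prev_hash = log[-1]["record_hash"] if log else "0" * 64
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "policy_version": policy_version,
        "prompt": prompt,
        "output": output,
        "prev_hash": prev_hash,
    }
    record = {**body, "record_hash": sha256_hex(json.dumps(body, sort_keys=True))}
    log.append(record)
    return record


def chain_is_intact(log: list[dict]) -> bool:
    """Recompute every hash and check each record points at its predecessor."""
    prev_hash = "0" * 64
    for record in log:
        body = {k: v for k, v in record.items() if k != "record_hash"}
        if body["prev_hash"] != prev_hash:
            return False
        if sha256_hex(json.dumps(body, sort_keys=True)) != record["record_hash"]:
            return False
        prev_hash = record["record_hash"]
    return True
```

Because each hash depends on the one before it, altering an earlier narrative after the fact invalidates every later record, which is the property an examiner-facing log needs.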

Section 4: The Independent Review Requirement

SR 11-7 Section 4 is the governance backbone of the framework: model validation must be conducted by individuals with appropriate expertise who are independent from the model development and model use functions. This independence requirement is not nominal — examiners look for organizational separation, separate reporting lines, and evidence that validation findings are not pre-approved by the development team.

For LLM deployments, the independent review requirement creates a specific challenge: who is qualified to conduct the validation, and what does independence mean when the model is a commercial API product from a technology vendor?

Independence in Practice for LLM Models

For internally developed or fine-tuned LLMs, the independence requirement maps relatively directly to traditional model validation structure: the model risk management function, separate from the AI development team, conducts the validation. The challenge is building validation capability in-house for non-traditional model types — many bank model risk functions have limited expertise in LLM evaluation methodology.

For vendor LLMs, the vendor model requirements in SR 11-7 Section 6 apply. This section explicitly states that bank reliance on vendor models does not transfer model risk responsibility. Banks must conduct their own validation activities on vendor models. The vendor's model card, safety evaluation documentation, or benchmark results are inputs to the bank's validation, not substitutes for it. The independent review must assess the vendor model in the bank's specific deployment context, on data representative of the bank's customers and use cases.

What Validation Must Produce

The independent validation must produce written documentation of findings, including:

Assessment of conceptual soundness for the intended use context
Results of outcomes analysis testing with identified performance metrics
Assessment of ongoing monitoring adequacy and recommended monitoring metrics
Identified limitations of the model and conditions under which outputs may be unreliable
Use restrictions or conditions on deployment (e.g., limited to specific use cases, requires human review of outputs above certain risk thresholds)
Recommendations for risk mitigants to address identified limitations

Validation findings must be formally reported to model risk management and the model risk committee. Findings rated as material, particularly those identifying behavioral limitations or policy compliance risks, must be addressed before the model is approved for production use, or the deployment must operate under explicit use restrictions with enhanced monitoring.

Pre-Execution Governance vs. Post-Hoc Audit: The Critical Distinction

The most significant gap in most bank LLM compliance programs is the confusion between monitoring (detecting when a model behaves badly after the fact) and controls (preventing bad behavior before it occurs). SR 11-7 requires both — but the controls component is more demanding than the monitoring component, and it is the component most frequently absent.

SR 11-7 Section 5 addresses model risk controls alongside validation. Controls are risk management measures that limit the impact of model error or model misuse. For LLMs, post-hoc audit — reviewing a sample of outputs periodically — is a monitoring activity. It tells you what went wrong. It does not prevent harm from occurring.

SR 11-7 Section 5 — Risk Controls Principle:

Model risk should be managed. Validation outcomes and monitoring findings should inform the controls framework. Where validation identifies limitations or conditions under which model outputs may be unreliable, controls must mitigate those conditions.

Applied to LLMs:

If validation identifies that an LLM may produce adverse action narratives that violate Regulation B requirements under certain input conditions, the control is not periodic review of a sampled output log. The control is a mechanism that prevents non-compliant narratives from reaching decision-makers. This is pre-execution enforcement.

Pre-execution governance infrastructure — systems that evaluate LLM output against policy before delivery — satisfies SR 11-7's controls requirement in a way that post-hoc audit cannot. Every output is evaluated. Policy violations are blocked or modified before they influence a decision. The enforcement action is logged with a signed audit record, creating the documentation that supports independent review findings.
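The evaluate, block, and sign pattern described above can be sketched in a few lines. This is a toy illustration, not the API of any specific product: the `Rule` type, the `enforce` and `verify` functions, and the keyword-based rule are all assumptions, and real policy evaluation would be far richer than a regular expression. The signature uses HMAC-SHA256 from the Python standard library.

```python
import hashlib
import hmac
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str  # regular expression that marks an output as non-compliant


def enforce(output: str, rules: list[Rule], policy_version: str, signing_key: bytes) -> dict:
    """Evaluate one output before delivery; return the disposition plus a signed record."""
    triggered = [r.rule_id for r in rules if re.search(r.pattern, output, re.IGNORECASE)]
    record = {
        "policy_version": policy_version,
        "rules_triggered": triggered,
        "disposition": "BLOCK" if triggered else "ALLOW",
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    signature = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    # A blocked output is never returned to the caller.
    return {
        "delivered_output": None if triggered else output,
        "record": record,
        "signature": signature,
    }


def verify(record: dict, signature: str, signing_key: bytes) -> bool:
    """Check that a stored record still matches its signature."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Hypothetical rule: flag narratives that mention a prohibited basis.
rules = [Rule("NO_PROHIBITED_BASIS", r"\b(race|religion|national origin)\b")]
key = b"demo-key-do-not-use-in-production"
blocked = enforce("Decline due to applicant religion.", rules, "pol-1", key)
allowed = enforce("Debt-to-income ratio exceeds policy limit.", rules, "pol-1", key)
print(blocked["record"]["disposition"], allowed["record"]["disposition"])
```

The point of the design is that the delivery path and the evidence path are the same code: an output cannot reach a decision-maker without a record of the policy version and disposition being produced.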

How EVE CoreGuard Aligns with SR 11-7 Section 4

Section 4's independent review requirement benefits directly from deterministic enforcement infrastructure. The independent validation team can make specific, testable assertions about model behavior when a pre-execution enforcement layer is in place:

SR 11-7 Requirement	Without Pre-Execution Enforcement	With EVE CoreGuard Pre-Execution Enforcement
Policy compliance verification	PARTIAL — Sample-based review, cannot confirm 100% policy adherence	SATISFIED — Every output evaluated; enforcement rate documented in audit log
Behavioral limitation documentation	PARTIAL — Limitations identified in testing but not operationally enforced	SATISFIED — Policy rules encode limitations as hard blocks; enforcement evidence available
Use restriction enforcement	ABSENT — Use restrictions exist in policy documents but not technically enforced	SATISFIED — Use restrictions configured as enforcement rules; violations blocked at runtime
Ongoing monitoring — policy violation rate	PARTIAL — Requires periodic manual review; sampling lag between violation and detection	SATISFIED — Real-time policy violation rate computed from enforcement log; alert thresholds configurable
Audit trail for examination	PARTIAL — Generic application logs may not capture model output with sufficient fidelity	SATISFIED — Signed, immutable decision certificates with policy version, rule triggered, and disposition for every governed output

Vendor Model Requirements: Section 6 Application

SR 11-7 Section 6 addresses vendor model use and is increasingly relevant as banks deploy commercial LLMs from technology providers. The section requires that banks establish standards for vendor model use that ensure the models are subject to appropriate validation and ongoing monitoring. Key requirements include:

Validation of vendor models in the bank's deployment context. Vendor LLM evaluation reports prepared for general audiences are not sufficient. The bank must validate the model on data representative of its specific customers, loan types, and decision contexts. This requires the bank to have an evaluation methodology and evaluation data — not just the ability to read a vendor model card.

Ongoing monitoring of vendor model behavior. Model updates from vendors — which for commercial LLMs can occur without advance notice — require re-validation assessment. If a vendor updates the underlying model in a way that changes behavioral characteristics relevant to the bank's use case, the model risk program must be able to detect this change. Pre-execution enforcement infrastructure that logs output characteristics provides the baseline necessary to detect behavioral drift after vendor model updates.

Contracts with vendor model providers. SR 11-7 Section 6 recommends that contracts with vendor model providers include provisions for: access to model documentation needed for validation, notification of material model changes, and service level commitments relevant to model performance. Banks relying on third-party LLM APIs should review whether their vendor contracts satisfy these requirements.

The Model Update Monitoring Problem

Commercial LLM providers routinely update their models — sometimes with advance notice, sometimes without. For banks with SR 11-7-scoped LLM deployments, an unnoticed behavioral change in the underlying model that affects credit-relevant outputs creates a significant model risk event. Pre-execution enforcement infrastructure with logged output metrics provides an early detection mechanism: when the enforcement log shows a change in block rate, output length distribution, or topic distribution following a known or suspected model update, the model risk team has an objective signal to trigger re-validation assessment.
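One way to turn "a change in output length distribution" into an objective signal is the population stability index (PSI), a measure already familiar from traditional model monitoring. The sketch below is an assumption about how a team might apply it to logged output lengths; the bin edges and sample values are made up, and the commonly quoted PSI cut-offs (around 0.1 and 0.25) are industry rules of thumb, not SR 11-7 requirements.

```python
import math


def population_stability_index(baseline: list[float], current: list[float],
                               edges: list[float]) -> float:
    """PSI between two samples over bins defined by ascending interior edges."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # number of edges at or below v
            counts[idx] += 1
        # Floor at a small value so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical output lengths (in tokens) before and after a suspected vendor update.
before = [50.0] * 50 + [150.0] * 40 + [250.0] * 10
after = [50.0] * 10 + [150.0] * 40 + [250.0] * 50
psi = population_stability_index(before, after, edges=[100.0, 200.0])
print(round(psi, 3))
```

The same function can be applied to block rate by category or to topic distribution, provided the enforcement log records those attributes for every output.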

Building the SR 11-7 Compliance Program for LLMs

A practical SR 11-7 compliance program for LLM deployments in banking requires five integrated components:

1. Model inventory with LLM classification. Every LLM deployment in decision-relevant workflows must be inventoried. The inventory must record: intended use, materiality assessment, applicable SR 11-7 tier, validation status, use restrictions, and monitoring metrics. This should be a living document maintained by the model risk function.

2. Conceptual soundness validation methodology for LLMs. The independent validation team needs a documented methodology for evaluating LLM conceptual soundness — including behavioral specification review, red-team testing protocols, and criteria for assessing whether behavioral constraints are technically enforceable. This methodology should be approved by the model risk committee and reviewed annually.

3. Pre-execution enforcement infrastructure. For high-materiality LLM deployments, the validation program should require deployment of a pre-execution enforcement layer as a condition of production use. This satisfies the controls requirement in SR 11-7 Section 5 and provides the audit trail that Section 4 independent review requires. The enforcement layer configuration should be documented as part of the model inventory.

4. Ongoing monitoring program with LLM-specific metrics. The ongoing monitoring program must be extended to cover LLM-specific metrics: policy violation rates, output distribution stability, human override rates, and (for credit AI) adverse action pattern analysis. Monitoring thresholds should be calibrated based on validation findings and reviewed by the model risk committee.

5. Vendor model management process. For third-party LLMs, a formal vendor model management process should document: the validation conducted, the evidence reviewed from the vendor, the use restrictions applied, and the monitoring approach for detecting vendor model changes. The process should include a protocol for triggered re-validation when monitoring signals indicate potential behavioral change.
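The inventory, enforcement, and monitoring components above can be tied together in a single record per deployment. The sketch below is one possible shape, with hypothetical field names, status values, and blocker messages; it shows how the inventory could be used to check production readiness rather than serving only as documentation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ValidationStatus(Enum):
    NOT_STARTED = "not_started"
    IN_PROGRESS = "in_progress"
    APPROVED = "approved"
    APPROVED_WITH_RESTRICTIONS = "approved_with_restrictions"
    REJECTED = "rejected"


@dataclass
class InventoryEntry:
    """Hypothetical inventory record for one LLM deployment."""
    model_id: str
    intended_use: str
    materiality: str  # assumed values: "high", "medium", "low"
    validation_status: ValidationStatus
    vendor: Optional[str] = None
    use_restrictions: list[str] = field(default_factory=list)
    monitoring_metrics: list[str] = field(default_factory=list)
    enforcement_config_ref: Optional[str] = None  # pointer to the enforcement layer config


def production_blockers(e: InventoryEntry) -> list[str]:
    """Reasons this deployment should not yet run in production; empty means none found."""
    blockers = []
    approved = (ValidationStatus.APPROVED, ValidationStatus.APPROVED_WITH_RESTRICTIONS)
    if e.validation_status not in approved:
        blockers.append("validation not approved")
    if e.validation_status is ValidationStatus.APPROVED_WITH_RESTRICTIONS and not e.use_restrictions:
        blockers.append("restrictions required but none recorded")
    if not e.monitoring_metrics:
        blockers.append("no monitoring metrics defined")
    if e.materiality == "high" and e.enforcement_config_ref is None:
        blockers.append("high materiality without enforcement layer")
    return blockers


entry = InventoryEntry(
    model_id="llm-001",
    intended_use="credit risk narratives for underwriter review",
    materiality="high",
    validation_status=ValidationStatus.APPROVED,
)
print(production_blockers(entry))
```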

SR 11-7 AI Governance FAQ

Are LLMs models under SR 11-7?

Yes, for deployments where LLM output informs or makes decisions about customers, credit, risk, or compliance. SR 11-7 defines models as quantitative methods or systems that apply techniques to process input data into estimates used for decision-making. LLMs used for credit analysis, risk narrative generation, complaint classification, or any other decision-relevant workflow meet this definition. The OCC confirmed this scope in Bulletin 2021-21, extending SR 11-7 principles explicitly to AI and machine learning models.

What does SR 11-7 Section 4 require for independent review of AI models?

SR 11-7 Section 4 requires that model validation be conducted by individuals who are independent from the model development and use functions. For LLM deployments, the validation team must evaluate the model's conceptual soundness, test it on out-of-distribution inputs, analyze its outputs for the intended use context, and assess ongoing monitoring adequacy. The independent review must be documented and must produce a validation report with findings, limitations, and any use restrictions or conditions.

How does ongoing monitoring under SR 11-7 apply to LLM deployments?

SR 11-7's ongoing monitoring requirement applies throughout the model lifecycle and requires tracking performance indicators that signal potential model deterioration or changed behavior. For LLMs, this includes monitoring output distributions for drift, tracking policy violation rates, analyzing the rate of human override of AI recommendations, and reviewing adverse action patterns for disparate impact signals. Monitoring reports must be escalated to model risk committees, and re-validation must be triggered when monitoring metrics indicate material performance changes.

What is the difference between model monitoring and model enforcement under SR 11-7?

Model monitoring — detecting when a model behaves badly after the fact — satisfies the ongoing monitoring component of SR 11-7, but does not satisfy the risk management controls requirement. SR 11-7 Section 5 requires that model risk be controlled, meaning mitigated before harm occurs. For high-risk LLM deployments, this requires pre-execution enforcement: a governance layer that evaluates AI outputs against policy before they reach decision-makers, with documented evidence that the enforcement occurred. Post-hoc monitoring alone creates a compliance gap under the controls requirement.

How should banks handle SR 11-7 vendor model requirements for third-party LLMs?

SR 11-7 Section 6 requires that banks conduct validation activities on vendor models — not merely accept vendor claims about model quality. For third-party LLMs, banks must obtain sufficient information to validate the model's conceptual soundness, test it in the deployment context, and assess ongoing monitoring adequacy. Vendor-provided model cards, safety evaluations, and benchmark results are inputs to the bank's validation, not substitutes for it. Banks retain full model risk responsibility for vendor LLM deployments.