Most enterprise AI deployments fail their first compliance review for the same reason: the team treated LLM governance as a variation on traditional software risk management. They added logging. They ran bias testing. They wrote an AI policy. They deployed.
Then the compliance team or an external auditor asked a simple question: "Can you show me the rules that governed this AI's behavior, and can you show me those rules were enforced on each individual decision?" And the answer was no.
LLMs introduce compliance risks that have no precedent in traditional software: they are non-deterministic (the same input can produce different outputs), they hallucinate (they generate plausible-sounding content that is factually wrong), they can exhibit emergent biases that weren't present in validation testing, and they produce no inherent audit record of the decision logic that generated their outputs. None of these characteristics are adequately addressed by software change management, code review, or standard application logging.
This checklist is structured in three parts. The first covers universal requirements that apply to every LLM deployment in a regulated industry. The second covers industry-specific requirements for lending, healthcare, and financial services. The third covers the audit trail gap — the single most common finding in AI compliance reviews.
Why LLMs Are Different from Prior AI Deployments
Before the checklist, it is worth being precise about what makes LLMs uniquely difficult to govern in regulated contexts.
Non-Determinism
A traditional credit scorecard takes a fixed input feature vector and produces a deterministic output. Given the same inputs, you get the same score every time. This property is fundamental to traditional model validation: you can reproduce any historical decision exactly by rerunning the model on the same inputs.
LLMs are not deterministic. With temperature greater than zero, the same prompt can produce different outputs on successive calls, so replaying inputs does not reliably reconstruct a historical AI decision. The only record of what the model said is what was captured at the time of inference. If that capture is incomplete (truncated, sampled, or absent), the record is gone. For regulated decisions, this is not an acceptable posture.
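Since replay cannot recover a past output, capture has to happen at the moment of inference. The following is a minimal sketch of that idea, not a reference implementation. `call_model`, `model_id`, and `store` are hypothetical stand-ins for whatever client, model identifier, and durable append-only storage a deployment actually uses.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Callable


def capture_inference(call_model: Callable[[str], str], prompt: str,
                      model_id: str, temperature: float, store: list) -> str:
    """Call the model and persist the full exchange before returning it.

    All parameter names are illustrative assumptions, not a real API.
    """
    output = call_model(prompt)
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "temperature": temperature,
        "prompt": prompt,
        "output": output,  # stored in full: no truncation, no sampling
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
    # Serialize with sorted keys so the stored form is stable.
    store.append(json.dumps(record, sort_keys=True))
    return output
```

The record is written before the output is returned to the caller, so a response that reaches a customer always has a captured counterpart.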
Hallucination
LLMs generate text that is statistically likely given the context. They do not "know" whether their outputs are accurate. When deployed in contexts where accuracy matters, such as medical information, regulatory requirements, and credit terms, the hallucination risk is not theoretical. Models can generate confident, plausible, and wrong information about drug interactions, regulatory requirements, and loan terms.
Hallucination is not a bug that will be patched. It is a fundamental property of how autoregressive language models work. The compliance implication is that LLM outputs in regulated contexts require validation against ground-truth policy sources — either by a deterministic enforcement layer that checks outputs against defined policy packs, or by human review of every output before it reaches a customer or a decision record.
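One way to picture a deterministic check against a ground-truth source is a rate validator for a lending assistant. This is a sketch under stated assumptions: the approved APR table and the function names are hypothetical, and a real policy pack would cover far more than quoted percentages.

```python
import re
from decimal import Decimal

# Hypothetical ground-truth source: the APRs the institution actually offers.
APPROVED_APRS = {Decimal("6.25"), Decimal("7.10"), Decimal("8.99")}

# Matches numbers followed by a percent sign, e.g. "6.25%" or "7 %".
PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def unapproved_rates(output_text: str) -> list[Decimal]:
    """Return every percentage in the text that is not an approved APR."""
    quoted = [Decimal(m) for m in PERCENT_PATTERN.findall(output_text)]
    return [rate for rate in quoted if rate not in APPROVED_APRS]


def validate_output(output_text: str) -> str:
    """BLOCK if the text quotes any rate absent from the ground-truth table."""
    return "BLOCK" if unapproved_rates(output_text) else "ALLOW"
```

This check is deliberately blunt: it treats every percentage as a rate, so a sentence about a 20% down payment would also be blocked. That trade-off favors false blocks over unverified rate quotes, which is the conservative choice when the alternative is human review of every output.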
Emergent Bias
Traditional bias testing in lending models involves analyzing whether protected class correlations appear in model outputs. For a logistic regression, you can test the model on synthetic test cases and characterize its behavior completely. For an LLM, bias can be context-dependent and emergent: the model may exhibit disparate behavior based on subtle cues in the conversation context that were not present in bias testing scenarios.
This means that LLM bias testing is necessarily incomplete. You cannot test every possible context. The implication is that bias monitoring must be ongoing — tracking outcomes in production against demographic distributions — and that the governance layer must actively constrain outputs that could constitute disparate treatment or disparate impact.
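Ongoing monitoring of production outcomes can be sketched as a per-group rate comparison. The code below is illustrative only. The 0.8 threshold is an assumption borrowed from the four-fifths rule of thumb used in employment-selection analysis; it is not a legal standard for lending, and the group labels and function names are hypothetical.

```python
from collections import defaultdict


def favorable_rates(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of favorable outcomes per group from (group, favorable) pairs."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, is_favorable in outcomes:
        totals[group] += 1
        favorable[group] += int(is_favorable)
    return {g: favorable[g] / totals[g] for g in totals}


def impact_ratios(rates: dict[str, float]) -> dict[str, float]:
    """Each group's rate divided by the highest group rate."""
    best = max(rates.values())
    if best == 0:
        # No group received a favorable outcome, so there is no disparity to rank.
        return {g: 1.0 for g in rates}
    return {g: r / best for g, r in rates.items()}


def flagged_groups(outcomes: list[tuple[str, bool]],
                   threshold: float = 0.8) -> list[str]:
    """Groups whose ratio to the best-treated group falls below the threshold."""
    ratios = impact_ratios(favorable_rates(outcomes))
    return sorted(g for g, ratio in ratios.items() if ratio < threshold)
```

A flag from a check like this is a prompt for investigation, not a finding. Sample sizes, confounding factors, and the applicable legal test all need review before drawing conclusions.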
The Absent Audit Record
A traditional model produces a score. That score is logged. The score, along with the policy rules that converted it into a decision, constitutes the audit record. The audit record is complete and reconstructible.
An LLM produces text. That text is generated probabilistically. The policy that should have governed what text was appropriate is typically not enforced by any technical component — it exists as instructions in the system prompt, which the model follows imperfectly. There is no separate audit record of the governance logic that applied to a given inference.
This is the audit trail gap. It is addressed in the third section of this checklist.
The checklist items below are grouped into: (1) universal requirements that apply to all regulated LLM deployments, (2) industry-specific requirements for lending, healthcare, and finance, and (3) audit trail requirements that address the most common compliance gap.
Section 1: Universal Pre-Deployment Requirements
Policy Definition
- Define the intended use case precisely: what decisions or communications the LLM is authorized to support, and what it is explicitly not authorized to do. Required by: EU AI Act Art. 13, SR 11-7 model purpose documentation
- Identify all regulatory regimes that apply to the deployment context. For a credit workflow: ECOA, FCRA, UDAAP, state lending laws, SR 11-7. For healthcare: HIPAA, state privacy laws, FDA guidance on AI/ML-based software. List them explicitly. Required by: organizational risk management; required for informed policy pack design
- Document prohibited output categories in writing: what the AI must never say, regardless of user input. For lending: specific rate quotes without disclosures, protected class references, advice that constitutes unauthorized practice of law. These must be enforced by technical controls, not only by instruction. Required by: UDAAP; EU AI Act Art. 9; SR 11-7 design documentation
- Conduct a bias and fairness pre-assessment using synthetic test cases covering all protected class dimensions relevant to your regulatory context (race, gender, age, national origin, religion, familial status, disability where applicable). Required by: ECOA (Reg. B), FHA, EU AI Act Art. 9(7) testing requirements
- Define human oversight protocol: which AI outputs require human review before affecting a customer decision, who is responsible for review, and what the escalation path is when the AI produces uncertain or anomalous outputs. Required by: EU AI Act Art. 14; SR 11-7 model use procedures
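The policy definition items above can also be captured in a machine-readable form, so that the enforcement layer and the audit trail reference the same artifact. The sketch below is one possible shape, assuming hypothetical field names. The content-derived version string is an illustrative design choice, not a regulatory requirement.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PolicyPack:
    """Hypothetical container for the written policy decisions."""
    use_case: str
    authorized_actions: tuple[str, ...]
    regulatory_regimes: tuple[str, ...]
    prohibited_categories: tuple[str, ...]
    human_review_categories: tuple[str, ...]

    def version(self) -> str:
        """Content hash, so any change to the pack yields a new version."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


pack = PolicyPack(
    use_case="credit pre-qualification chat",
    authorized_actions=("answer_product_questions",),
    regulatory_regimes=("ECOA", "FCRA", "UDAAP", "SR 11-7"),
    prohibited_categories=("rate_quote_without_disclosure",
                           "protected_class_reference"),
    human_review_categories=("adverse_action_text",),
)
```

Deriving the version from the content means a record that cites a version string can later be matched to the exact rules that were in force, without relying on someone remembering to bump a number.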
Technical Enforcement Layer
- Deploy a deterministic pre-execution policy enforcement layer that evaluates AI actions against defined rules before they are executed, not after. Post-hoc filtering does not satisfy enforcement requirements. Required by: EU AI Act Art. 9 (risk reduction measures); SR 11-7 design controls
- Verify that the enforcement layer is deterministic: given the same input and policy state, it must produce the same decision on every call. Stochastic guardrails based on secondary LLM calls are not deterministic and cannot produce reliable audit records. Required by: audit reliability standards; SR 11-7 model validation principles
- Measure enforcement layer latency. It must add less than 5ms to inference time to be operationally viable. If it adds more, it will face pressure to be disabled or bypassed in production. Operational requirement; regulatory compliance requires that controls are actually used
- Implement a fail-closed posture: if the enforcement layer is unavailable or errors, AI inference must be blocked, not permitted to proceed uncontrolled. A fail-open enforcement layer provides no compliance guarantee. Required by: SR 11-7 model use procedures; risk management best practice
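The enforcement properties above (deterministic evaluation, latency measurement, fail-closed behavior) can be sketched together in a few lines. This is an illustration under assumptions, not a production design. The rule identifiers and patterns are hypothetical, and a real layer would evaluate structured actions as well as text.

```python
import re
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    pattern: str  # regular expression that marks a prohibited output


# Hypothetical rules; a real policy pack would come from reviewed policy.
RULES = (
    Rule("LEND-001", r"\bguaranteed approval\b"),
    Rule("LEND-002", r"\b\d+(\.\d+)?\s*%\s*APR\b"),
)


def evaluate(action_text: str, rules=RULES) -> dict:
    """Pure function of (text, rules): same input, same decision."""
    triggered = [r.rule_id for r in rules
                 if re.search(r.pattern, action_text, re.IGNORECASE)]
    return {"disposition": "BLOCK" if triggered else "ALLOW",
            "triggered_rules": triggered}


def enforce(action_text: str, evaluator=evaluate) -> dict:
    """Run the evaluator before execution; fail closed on any error."""
    start = time.perf_counter()
    try:
        decision = evaluator(action_text)
    except Exception as exc:  # enforcement unavailable or erroring
        decision = {"disposition": "BLOCK",
                    "triggered_rules": ["ENFORCEMENT_ERROR"],
                    "error": repr(exc)}
    # Record the added latency so it can be tracked against a budget.
    decision["latency_ms"] = (time.perf_counter() - start) * 1000.0
    return decision
```

Because `evaluate` has no randomness and no external calls, repeated evaluation of the same text under the same rules yields the same disposition. The `except` branch is what makes the wrapper fail closed: an evaluator error produces a BLOCK, never an implicit ALLOW.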
Audit Trail Architecture
- For every AI inference in a decision-relevant workflow, produce a structured audit record that includes: timestamp, input data (or cryptographic hash if PII), which policies were evaluated, the disposition (ALLOW / BLOCK / MODIFY), and which specific rules triggered the disposition. Required by: EU AI Act Art. 12; SR 11-7 audit trail requirements; ECOA adverse action documentation
- Implement tamper-evidence for audit records. Use HMAC signing and hash chaining so that any retroactive modification of the audit trail breaks the chain and is detectable by auditors. Required by: EU AI Act Art. 12 (implied by "traceability"); standard audit integrity requirement
- Define and implement audit record retention periods matching the longer of: (a) your industry-specific record retention requirements, or (b) the statute of limitations for adverse action claims in your jurisdiction. Required by: ECOA (25-month retention minimum for applications); HIPAA (6 years); SR 11-7 (duration of model use)
- Test the audit trail retrieval process. An auditor should be able to pull the complete governance record for any specific AI decision within 24 hours of request. If you cannot do this in testing, you will not be able to do it during an examination. Operational requirement; regulatory examiners expect responsive production of records
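A structured audit record with the fields listed above, a "longer of" retention calculation, and a lookup by identifier might look like the sketch below. Field names and the retention inputs are assumptions for illustration; actual retention periods should come from counsel, and real retrieval would run against durable storage, not an in-memory list.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

VALID_DISPOSITIONS = {"ALLOW", "BLOCK", "MODIFY"}


def build_audit_record(inference_id: str, raw_input: str, contains_pii: bool,
                       policies_evaluated: list[str], disposition: str,
                       triggered_rules: list[str],
                       industry_retention_days: int, limitations_days: int,
                       now: Optional[datetime] = None) -> dict:
    """Assemble one audit record; all field names are illustrative."""
    if disposition not in VALID_DISPOSITIONS:
        raise ValueError(f"unknown disposition: {disposition}")
    now = now or datetime.now(timezone.utc)
    if contains_pii:
        # Store a hash in place of the raw input when it contains PII.
        digest = hashlib.sha256(raw_input.encode("utf-8")).hexdigest()
        input_field = {"sha256": digest}
    else:
        input_field = {"text": raw_input}
    # Retain for the longer of the two applicable periods.
    retention_days = max(industry_retention_days, limitations_days)
    return {
        "inference_id": inference_id,
        "timestamp": now.isoformat(),
        "input": input_field,
        "policies_evaluated": policies_evaluated,
        "disposition": disposition,
        "triggered_rules": triggered_rules,
        "retain_until": (now + timedelta(days=retention_days)).isoformat(),
    }


def retrieve(records: list[dict], inference_id: str) -> dict:
    """Pull the governance record for one decision, or raise if it is missing."""
    for record in records:
        if record["inference_id"] == inference_id:
            return record
    raise KeyError(inference_id)
```

A retrieval drill can be as simple as picking random historical identifiers and confirming that `retrieve` (or its production equivalent) returns a complete record for each.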
Section 2: Industry-Specific Requirements
ECOA / Regulation B Requirements
- Ensure the AI cannot reference, infer from, or be influenced by protected class information (race, color, religion, national origin, sex, marital status, age, receipt of public assistance) in generating any credit-related content. Reference: ECOA §701(a); Regulation B §1002.4(a)
- If the AI generates adverse action notices or denial rationales, implement a deterministic policy layer that validates the stated reasons against the actual factors used in the credit decision, not against the AI's generated text. Reference: Regulation B §1002.9(a)(2); CFPB 2022 adverse action circular
- Document how the AI's outputs fit within your institution's adverse action reason code structure. The AI should not generate novel adverse action reasons that are not in your approved code set. Reference: Regulation B §1002.9(b)(2); SR 11-7 model use documentation
- Run disparate impact analysis on AI-generated content in addition to model outputs. If the AI systematically uses different language or provides less helpful guidance to applicants based on factors correlated with protected class, this is a UDAAP risk independent of the credit decision. Reference: CFPB UDAAP authority; Dodd-Frank §1031
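The adverse action items above describe two deterministic checks: stated reasons must belong to the approved code set, and they must match the factors that actually drove the credit decision. A minimal sketch follows, assuming a hypothetical code set and assuming the decision engine exposes its factor codes.

```python
# Hypothetical approved reason codes; a real set comes from the institution.
APPROVED_REASON_CODES = {
    "R01": "Debt-to-income ratio too high",
    "R02": "Insufficient credit history",
    "R03": "Delinquent past or present credit obligations",
}


def validate_adverse_action(stated_codes: list[str],
                            decision_factor_codes: list[str]) -> dict:
    """Check stated reasons against the approved set and the real factors."""
    # Codes the AI produced that are not in the approved set at all.
    novel = [c for c in stated_codes if c not in APPROVED_REASON_CODES]
    # Approved codes that did not actually drive this credit decision.
    unsupported = [c for c in stated_codes
                   if c in APPROVED_REASON_CODES
                   and c not in decision_factor_codes]
    ok = bool(stated_codes) and not novel and not unsupported
    return {"disposition": "ALLOW" if ok else "BLOCK",
            "novel_codes": novel,
            "unsupported_codes": unsupported}
```

The comparison runs on codes supplied by the decision engine, so the model's generated prose is never treated as evidence of why the decision was made.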
HIPAA Technical Safeguard Requirements
- Conduct a HIPAA risk analysis for the LLM deployment before go-live, including assessment of threats to the confidentiality, integrity, and availability of PHI that the AI processes or generates. Reference: HIPAA Security Rule 45 CFR §164.308(a)(1)
- Implement technical access controls that prevent the LLM from accessing PHI beyond what is necessary for the specific clinical or administrative task (the minimum necessary standard). Reference: HIPAA Privacy Rule 45 CFR §164.514(d); Security Rule audit controls §164.312(b)
- Ensure any foundation model provider processing PHI has executed a Business Associate Agreement (BAA). Review the BAA for data retention and subprocessor provisions specific to AI training data use. Reference: HIPAA §164.308(b)(1). Note: many AI providers have BAA-specific terms that restrict PHI use for model training
- If the AI provides clinical decision support that is not "physician-facing" (i.e., outputs go directly to patients or are used without clinician review), assess whether the software qualifies as SaMD and whether FDA authorization is required. Reference: FDA SaMD guidance (2019); 21st Century Cures Act CDS provisions (21 USC §360j(o))
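The minimum necessary item above can be illustrated with a field-level filter applied before any record reaches the model. The task names and field lists here are hypothetical; a real mapping would come from the organization's privacy analysis.

```python
# Hypothetical mapping from task to the PHI fields that task needs.
FIELDS_BY_TASK = {
    "appointment_reminder": {"first_name", "appointment_time", "clinic_name"},
    "billing_inquiry": {"first_name", "account_balance", "last_payment_date"},
}


def minimum_necessary(task: str, patient_record: dict) -> dict:
    """Return only the fields allowed for the task; unknown tasks are refused."""
    allowed = FIELDS_BY_TASK.get(task)
    if allowed is None:
        # No defined scope means no access, mirroring a fail-closed posture.
        raise PermissionError(f"no PHI access defined for task: {task}")
    return {k: v for k, v in patient_record.items() if k in allowed}
```

Filtering before prompt construction means fields outside the task's scope never enter the model's context, which is a stronger control than instructing the model to ignore them.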
SR 11-7 / Model Risk Requirements
- Register the LLM deployment in your institution's model inventory with a materiality classification. The classification determines the required depth of validation and ongoing monitoring. Reference: SR 11-7 Section II (model inventory and tiering)
- Complete Tier 1 (conceptual soundness) validation for the governance layer, not the underlying LLM. Document the policy pack design logic, test coverage, and any known gaps in policy coverage. Reference: SR 11-7 Section III.C (conceptual soundness); OCC 2021-21
- Establish ongoing monitoring metrics for the governance layer: policy trigger rates by rule, block rates by category, and any systematic patterns in which input types receive different governance dispositions. Reference: SR 11-7 Section III.D (ongoing monitoring); benchmark frequency to model materiality
- Include the governance layer in your model risk committee reporting. Policy changes to the enforcement layer constitute model changes and require appropriate change management controls. Reference: SR 11-7 Section III.E (model validation governance)
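The monitoring metrics named above (trigger rates by rule, block rates by category) can be computed directly from governance records. The sketch below assumes each record carries hypothetical `category`, `disposition`, and `triggered_rules` fields.

```python
from collections import Counter


def governance_metrics(records: list[dict]) -> dict:
    """Trigger rate per rule and block rate per input category."""
    total = len(records)
    if total == 0:
        return {"total": 0, "trigger_rate_by_rule": {},
                "block_rate_by_category": {}}
    # How often each rule fired, across all records.
    rule_hits = Counter(rule for r in records for rule in r["triggered_rules"])
    per_category = Counter(r["category"] for r in records)
    blocked = Counter(r["category"] for r in records
                      if r["disposition"] == "BLOCK")
    return {
        "total": total,
        "trigger_rate_by_rule": {rule: n / total
                                 for rule, n in rule_hits.items()},
        "block_rate_by_category": {c: blocked[c] / n
                                   for c, n in per_category.items()},
    }
```

Comparing these rates period over period is one way to notice when a policy change, a model update, or a shift in user behavior has altered how the governance layer is behaving.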
Section 3: The Audit Trail Gap (Why Most AI Deployments Fail Compliance Review)
The most common finding in enterprise AI compliance reviews is the absence of an audit trail that demonstrates, at the individual decision level, what governance rules applied and what the outcome was. This gap is different from insufficient logging. Organizations usually have logs. What they don't have is a governance record.
The distinction: a log shows you what the AI said. A governance record shows you what rules governed what the AI was permitted to say, and demonstrates that those rules were applied consistently. These are different artifacts with different structures.
A complete governance record for a single AI inference in a regulated workflow includes:
- Inference identifier: A unique, non-reusable identifier for this specific AI call, linkable to the broader transaction record
- Policy version: The exact version of the policy pack that was active at the time of the inference — not the current version, the version that was running at that moment
- Evaluation result: ALLOW, BLOCK, or MODIFY — and if MODIFY, the specific modification applied
- Triggered rules: The specific policy rules that were evaluated, with their identifiers, and which rules, if any, produced a non-ALLOW disposition
- Cryptographic signature: An HMAC-SHA256 signature over the record content, enabling retroactive verification that the record has not been modified
- Chain linkage: A hash of the previous record in the chain, creating a tamper-evident sequence
This is the structure of a decision certificate — the artifact that answers the auditor's question: "Show me the rules that governed this AI decision, and show me they were applied at the time of inference."
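The signature and chain-linkage fields can be sketched with Python's standard `hmac` and `hashlib` modules. This is an illustration of the structure described above, under assumptions: the field names are hypothetical, and key management (where the HMAC key lives, who can read it, how it rotates) is out of scope here but decisive in practice.

```python
import hashlib
import hmac
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _canonical(body: dict) -> bytes:
    """Stable byte encoding so signatures do not depend on key order."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_certificate(chain: list[dict], body: dict, key: bytes) -> dict:
    """Sign the record, link it to the previous one, and append it."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    signed = dict(body, prev_hash=prev_hash)
    signature = hmac.new(key, _canonical(signed), hashlib.sha256).hexdigest()
    record = dict(signed, signature=signature)
    record["record_hash"] = hashlib.sha256(_canonical(record)).hexdigest()
    chain.append(record)
    return record


def verify_chain(chain: list[dict], key: bytes) -> bool:
    """Recompute every signature and link; any edit makes this return False."""
    prev_hash = GENESIS_HASH
    for record in chain:
        signed = {k: v for k, v in record.items()
                  if k not in ("signature", "record_hash")}
        expected_sig = hmac.new(key, _canonical(signed), hashlib.sha256).hexdigest()
        unhashed = {k: v for k, v in record.items() if k != "record_hash"}
        expected_hash = hashlib.sha256(_canonical(unhashed)).hexdigest()
        if (record["prev_hash"] != prev_hash
                or not hmac.compare_digest(record["signature"], expected_sig)
                or record["record_hash"] != expected_hash):
            return False
        prev_hash = record["record_hash"]
    return True
```

A record body would carry the fields from the list above, for example `{"inference_id": "inf-1", "policy_version": "a1b2c3", "disposition": "BLOCK", "triggered_rules": ["LEND-001"]}`. Changing any field of any stored record, or deleting a record from the middle of the chain, causes `verify_chain` to fail.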
Organizations that deploy LLMs without this infrastructure are not necessarily in technical violation of any rule, because most regulations have not yet specified this level of technical detail. But they are in a weak position when examiners ask the governance question. The organizations that are building this infrastructure now are positioning themselves to demonstrate compliance, not just assert it.
Deployment Checklist Summary
The items above can be prioritized into a deployment sequence:
- Phase 1 (pre-deployment, required): Policy definition, prohibited output categories, regulatory scope documentation, bias pre-assessment
- Phase 2 (at deployment, required): Deterministic enforcement layer, fail-closed posture, governance event emission, audit record generation
- Phase 3 (at deployment, required for regulated workflows): Tamper-evident audit trail, retention period configuration, retrieval testing
- Phase 4 (ongoing): Monitoring against policy trigger rate baselines, quarterly bias analysis, model risk committee reporting, policy version management
- Phase 5 (industry-specific): ECOA adverse action validation, HIPAA BAA execution and risk analysis, SR 11-7 inventory registration and Tier 1 validation
The organizations that treat Phases 1 and 2 as optional ("we'll add governance later once the model proves its value") consistently find that retrofitting governance infrastructure onto a live AI deployment is more disruptive and expensive than building it in from the start. The compliance conversation happens sooner or later. The question is whether you control the timing.