How to Use This Framework
This framework operationalizes the four obligations that matter most in production: Article 12 (record-keeping), Article 14 (human oversight), Article 15 (accuracy, robustness, cybersecurity), and Articles 26 and 27 (post-market monitoring and fundamental rights impact assessment). Published by EVE AI Core. This is a working tool, not a marketing document. Use it on a whiteboard with your CISO, General Counsel, DPO, and engineering lead.
For each obligation you will:
- Read the plain-English summary and the architectural properties that define "compliant".
- Score your current state on the 5-level maturity rubric (0–4).
- Walk the diagnostic checklist. Any unchecked box is a gap.
- Map gaps onto the priority matrix (effort vs. regulatory severity).
- Populate the composite scorecard with target maturity and person-week estimates.
Time needed: 2–3 hours for a first pass with a cross-functional team. Budget a full day if you are also gathering evidence artifacts.
Team required: CISO or AI governance lead (owner), General Counsel or DPO, ML/platform engineering lead, SRE/observability lead, and, if you serve regulated users, a compliance or internal audit representative. Do not run this solo: a single assessor's blind spots become the assessment's blind spots.
Honesty principle: score what you can prove to a regulator today, not what you intend to build. If a property depends on a Jira ticket, it is Level 0 for this assessment.
- Obligation 1 — Article 12: Record-Keeping
- Obligation 2 — Article 14: Human Oversight
- Obligation 3 — Article 15: Accuracy, Robustness, Cybersecurity
- Obligation 4 — Articles 26 & 27: Post-Market Monitoring + FRIA
- Composite Scorecard Template
- Priority Matrix — Effort vs. Regulatory Severity
- Critical Failure Modes
- What Good Looks Like: Reference Architecture
- Next Step
Obligation 1 — Article 12: Record-Keeping
a. The obligation in plain English
Article 12(1) requires that "high-risk AI systems shall technically allow for the automatic recording of events ('logs') over the lifetime of the system." Article 12(2) specifies these logs must enable "a degree of traceability of the AI system's functioning that is appropriate to the intended purpose." In practice this means every decision the system makes must be reconstructable after the fact, and the record must be trustworthy.
b. What "compliant" actually looks like
- Logs are generated synchronously with the decision, not reconstructed after. The logging call is on the critical path; if logging fails, the decision does not ship.
- Each log entry is cryptographically bound to its predecessor (hash-chain, Merkle tree, or equivalent) so any deletion, insertion, or modification is detectable by an independent party.
- Logs are canonicalized before signing using a deterministic serialization (RFC 8785 JCS or equivalent), so two verifiers produce identical hashes.
- Log write permissions are separated from read permissions, and no role — including platform administrator — can retroactively edit an entry without leaving a detectable signature break.
- Retention spans the system lifetime plus the regulatory tail (typically 6–10 years for high-risk domains), with integrity verifiable across the full window.
- Each entry contains the minimum reconstruction set: input hash, model/policy version, decision, confidence, actor identity, timestamp (monotonic + wall clock), and upstream correlation ID.
- An independent third party can verify the chain given only a public verification key and the log export — no access to your infrastructure required.
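The chaining and signing properties above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `canonical_bytes`, `append_entry`, `verify_chain`, and the field names are hypothetical, and the `json.dumps` call is only a stand-in for a real RFC 8785 library (it is adequate for the string and integer payloads used here, nothing more). Because it signs with HMAC, a verifier needs the secret key, so on the rubric in this section it illustrates tamper-evidence rather than independent verifiability.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical predecessor hash for the first entry


def canonical_bytes(payload: dict) -> bytes:
    # Stand-in for RFC 8785 (JCS). Safe only for simple string/int payloads;
    # a production system should use a real JCS implementation.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def append_entry(chain: list, payload: dict, key: bytes) -> dict:
    """Bind a new entry to its predecessor and sign the bound bytes."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = prev_hash.encode("ascii") + canonical_bytes(payload)
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "entry_hash": hashlib.sha256(body).hexdigest(),
        "signature": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list, key: bytes) -> bool:
    """Recompute every hash and signature; any edit, insert, or delete fails."""
    prev_hash = GENESIS
    for entry in chain:
        body = prev_hash.encode("ascii") + canonical_bytes(entry["payload"])
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != hashlib.sha256(body).hexdigest():
            return False
        expected_sig = hmac.new(key, body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["signature"], expected_sig):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Deleting a middle entry breaks the `prev_hash` link of its successor, and editing a payload breaks that entry's own hash, which is the detectability property the list describes.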
c. Scoring rubric
| Level | Description |
| 0 — Absent | No structured decision logs exist. Post-hoc reconstruction from scattered application logs only. |
| 1 — Manual | Engineers export logs on request. No automatic capture of decision context. Integrity relies on institutional trust. |
| 2 — Automated, not tamper-evident | Logs written to Datadog, Splunk, Elastic, or CloudWatch. Complete, but mutable by anyone with platform admin rights. Deletion leaves no signature. |
| 3 — Automated, tamper-evident, not independently verifiable | Hash-chained, canonicalized, signed. But verification requires vendor tooling or access to your signing keys — regulator cannot verify offline. |
| 4 — Fully compliant | Hash-chained + JCS-canonicalized + asymmetrically signed (an HMAC alone cannot be checked without the secret key) + Merkle-aggregated roots published. Third party verifies with public key and the export alone. |
d. Common gap patterns
- Application logs written to Datadog/Splunk/Elastic/CloudWatch are mutable by anyone with platform admin rights — not Article 12 compliant regardless of retention settings.
- Decision logs generated by a post-hoc scraper that reads transaction tables. The decision happened; the "log" is a downstream interpretation. A regulator will reject this.
- Signed logs using json.dumps(sort_keys=True) instead of a canonicalization scheme. Two runtimes produce different bytes; signatures diverge; audit fails.
- "Immutable" S3 buckets with object-lock configured for 90 days but no hash chain. Deletion is prevented, silent replacement is not.
- Log retention set at the logging platform's default (30–90 days). High-risk regulatory tails require years.
- Correlation IDs exist at the HTTP layer but not at the decision layer — you can find the request, not the decision path inside it.
- Signing keys rotated without publishing prior public keys; historical signatures become unverifiable.
- Merkle roots computed but never published externally. "Trust us, the root is correct" is not independent verifiability.
e. Diagnostic questions
Check each box only if you could produce the evidence to an auditor today.
f. How to close the gap
Implement a JCS canonicalization library, an append-only hash-chain store with HMAC-SHA256 or asymmetric signing, a Merkle aggregator that publishes roots to an external witness (another cloud provider, a partner, or a transparency log), and a verification endpoint that accepts a log export and a public key. Expect 8–16 engineer-weeks for a production-ready implementation plus hardening.
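As a rough illustration of the Merkle aggregation step, the sketch below folds a batch of entry hashes into a single root that could be published to an external witness. The function name `merkle_root` and the choice to promote an unpaired node unchanged are assumptions made for this example; the leaf and interior prefix bytes follow the domain-separation style used by RFC 6962.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(entry_hashes: list) -> str:
    """Aggregate hex-encoded entry hashes into one hex-encoded root."""
    if not entry_hashes:
        raise ValueError("cannot aggregate an empty batch")
    # Prefix leaves with 0x00 and interior nodes with 0x01 so that a leaf
    # can never be passed off as an interior node.
    level = [_sha256(b"\x00" + bytes.fromhex(h)) for h in entry_hashes]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(_sha256(b"\x01" + level[i] + level[i + 1]))
            else:
                next_level.append(level[i])  # unpaired node moves up as-is
        level = next_level
    return level[0].hex()
```

A verifier holding the same export can recompute the root and compare it with the published value; any changed, reordered, or missing entry produces a different root.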
Categories to evaluate: audit-grade logging platforms (not observability — different problem), transparency log services (e.g., certificate-transparency-style systems adapted for decision logs), and governance platforms that offer verifiable-attestation primitives. Be skeptical of any vendor whose "tamper-evident" claim cannot be verified without their tooling.
EVE AI Core reference: core/governance/unified_audit_bus.py (JCS-canonicalized, HMAC-signed, hash-chained audit bus across 16 source systems), core/governance/jcs_canonicalize.py (RFC 8785 implementation), core/governance/merkle_aggregator.py (batched Merkle tree aggregation with signed root publication), and the EVE Proof SDK (sdks/proof/) for third-party verification.
Obligation 2 — Article 14: Human Oversight
a. The obligation in plain English
Article 14(1) requires high-risk AI systems be "designed and developed in such a way … that they can be effectively overseen by natural persons during the period in which the AI system is in use." Article 14(4) specifies that overseers must be able to "fully understand the capacities and limitations of the high-risk AI system," "correctly interpret the high-risk AI system's output," decide not to use it, "intervene on the operation of the high-risk AI system or interrupt the system," and do so in "real time."
b. What "compliant" actually looks like
- A human in the chain of control has the authority and the technical means to halt a decision before it executes — not just observe it after the fact.
- Outputs are surfaced in human-interpretable form, with the confidence, the alternatives considered, and the factors that drove the decision.
- An override action is a first-class operation that is logged, attributed to the overseer, and joined to the decision record.
- There is an explicit approval workflow for high-consequence decisions, with the threshold documented and enforced in code (not in a runbook).
- Anomaly detection surfaces unusual behavior to the overseer proactively — overseers cannot be expected to review every decision.
- The overseer has training documentation specific to this system covering known failure modes, typical output ranges, and the specific signals that indicate the system is operating outside its intended scope.
- The kill switch is tested on a documented cadence and the test results are retained.
c. Scoring rubric
| Level | Description |
| 0 — Absent | System executes decisions autonomously with no human-accessible intercept. |
| 1 — Manual | Humans can review output after the fact and manually reverse effects. No pre-execution gate. |
| 2 — Automated but not tamper-evident | A review queue exists for flagged decisions, but approval/reject clicks are not auditable — reviewers can be impersonated or actions edited. |
| 3 — Automated + tamper-evident, not independently verifiable | Three-phase propose/approve/execute workflow with signed approvals, but no external verifiability of overseer identity or the approval chain. |
| 4 — Fully compliant | Pre-execution gate with cryptographically attributed approvals, a documented training program for overseers, tested kill switch, and anomaly detection feeding a queue that humans actually work. |
d. Common gap patterns
- The "human in the loop" is a reviewer who sees decisions on a dashboard after they've already been executed. This is human observation, not Article 14 oversight.
- Approval workflow exists but is routinely auto-approved because the queue is too large for the staffing. Compliance theater.
- The override mechanism exists in theory (a config flag, a feature toggle) but has never been exercised in production. It will not work when you need it.
- Anomaly detection fires into a Slack channel no one monitors. No SLA, no escalation, no on-call.
- The training document for overseers is the public product page. Overseers don't understand the system's failure modes well enough to identify them when they occur.
- Approval clicks are attributed by session cookie, not by cryptographic signature. Anyone with access to the operator machine is "the overseer."
- Kill switch requires filing a ticket and waiting for an engineer. By definition, not real-time.
- Override events are logged in the operator console but not joined to the decision chain — reconstruction after an incident requires correlating two separate systems.
e. Diagnostic questions
f. How to close the gap
Implement a three-phase action workflow (propose → approve → execute) with the execution step gated on a signed approval. Build an operator console with role-based access, cryptographic action attribution, and joined audit trails. Wire anomaly detection to a real queue with real on-call. Expect 12–20 engineer-weeks plus ongoing operational cost (staffing the queue).
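A minimal sketch of the propose → approve → execute gate might look like the following. Everything here is hypothetical and for illustration only: the class `ThreePhaseWorkflow`, the helper `sign_approval`, and the in-memory key table are not taken from any EVE module, and HMAC is used purely to keep the example within the standard library. A production system would more plausibly use per-overseer asymmetric keys so that approvals can be verified without sharing secrets.

```python
import hashlib
import hmac
import json
import uuid


class ApprovalError(Exception):
    """Raised when an action lacks a valid, attributable approval."""


def sign_approval(action_id: str, description: str, overseer: str, key: bytes) -> str:
    # The signature covers what was approved and who approved it.
    message = json.dumps([action_id, overseer, description]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class ThreePhaseWorkflow:
    def __init__(self, overseer_keys: dict):
        self._keys = overseer_keys   # overseer name -> secret key (assumption)
        self._actions = {}

    def propose(self, description: str) -> str:
        action_id = uuid.uuid4().hex
        self._actions[action_id] = {
            "description": description, "state": "PROPOSED",
            "approved_by": None, "signature": None,
        }
        return action_id

    def approve(self, action_id: str, overseer: str, signature: str) -> None:
        action = self._actions[action_id]
        key = self._keys.get(overseer)
        if key is None:
            raise ApprovalError(f"unknown overseer: {overseer}")
        expected = sign_approval(action_id, action["description"], overseer, key)
        if not hmac.compare_digest(signature, expected):
            raise ApprovalError("signature does not match this action")
        action.update(state="APPROVED", approved_by=overseer, signature=signature)

    def execute(self, action_id: str, effect):
        action = self._actions[action_id]
        if action["state"] != "APPROVED":
            raise ApprovalError(f"action is {action['state']}, not APPROVED")
        action["state"] = "EXECUTED"   # an approval authorizes one execution
        return effect()
```

The point of the design is that `execute` has no code path that runs the effect without a verified approval on record; the approval record is what would be joined to the decision log.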
Categories to evaluate: AI approval workflow platforms, regulated-industry MLOps platforms with built-in human-in-the-loop primitives, and decision support systems designed for the relevant vertical (finance, healthcare, hiring). Beware platforms that conflate observability with oversight.
EVE AI Core reference: core/coreguard/ provides the pre-execution evaluation gate (POST /v1/decisions/evaluate returns ALLOWED/BLOCKED/MODIFIED with risk assessment). core/governance/action_registry.py implements the three-phase propose/approve/execute workflow with role-based gating. The Operator Console provides the human-facing surface with RBAC via core/tasks/rbac.py, cryptographically attributed actions via core/tasks/operator_audit.py, and Red Team Mode for defensive hardening.
Obligation 3 — Article 15: Accuracy, Robustness, Cybersecurity
a. The obligation in plain English
Article 15(1) requires high-risk AI systems be "designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle." Article 15(4) requires resilience against errors, faults, and inconsistencies, including through backup or fail-safe plans: the system must either handle errors gracefully or refuse to operate. Article 15(5) specifies they must be "resilient against attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities."
b. What "compliant" actually looks like
- The system has a documented adversarial threat model with explicit attack classes, not a generic "we follow OWASP" statement.
- There is a deterministic enforcement layer between input and inference that cannot be bypassed by prompt content, escalation patterns, or manipulation of upstream systems.
- The enforcement layer is itself resilient — it fails closed (reject) rather than failing open (allow), and enforcement errors are treated as security events.
- Accuracy is measured on a published benchmark with a documented refresh cadence, and degradation triggers a release block.
- Adversarial testing is continuous, not a once-a-year penetration test. New attack vectors are added to the test corpus as they are discovered.
- The enforcement logic is pure / side-effect-free where possible, so its behavior can be unit-tested exhaustively and formally reasoned about.
- Dependency and supply-chain integrity is attested (SBOM, build provenance, pinned dependencies with hash verification).
- There is a documented rollback path to a last-known-good model and policy version, with RTO measured in minutes, not hours.
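The fail-closed property in the list above is easy to state and easy to get wrong, so a small sketch may help. The rule set, `POLICY_VERSION`, and the function names are hypothetical; the structure shows a pure evaluation function wrapped so that any internal error yields a BLOCKED verdict and is reported as a security event.

```python
import re
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    decision: str   # "ALLOWED" or "BLOCKED"
    reason: str


POLICY_VERSION = "example-policy-1"   # hypothetical version pin
BLOCK_PATTERNS = (
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
)


def evaluate(text: str) -> Verdict:
    """Pure function: no I/O, no global mutation, same input gives same output."""
    for pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return Verdict("BLOCKED", f"matched {pattern.pattern!r}")
    return Verdict("ALLOWED", "no rule matched")


def evaluate_fail_closed(
    text,
    evaluator: Callable[[str], Verdict] = evaluate,
    on_error: Optional[Callable[[Exception], None]] = None,
) -> Verdict:
    """Any enforcement error becomes a rejection, never a silent allow."""
    try:
        return evaluator(text)
    except Exception as exc:
        if on_error is not None:
            on_error(exc)   # hook for raising a security event
        return Verdict("BLOCKED", f"enforcement error: {type(exc).__name__}")
```

Keeping `evaluate` free of side effects is what makes it practical to unit-test exhaustively; the wrapper is the only place where error handling lives, so there is one spot to audit for fail-open behavior.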
c. Scoring rubric
| Level | Description |
| 0 — Absent | No adversarial testing. No enforcement layer beyond the model's own refusals. |
| 1 — Manual | Periodic pen-tests. Enforcement via prompt engineering. Incident response is ad-hoc. |
| 2 — Automated but not tamper-evident | Automated adversarial test suite. Enforcement via application-level checks. But enforcement logic can drift silently; no guarantee today's system behaves like last quarter's. |
| 3 — Automated + tamper-evident, not independently verifiable | Deterministic enforcement module with version-pinned rules, test coverage, Hard-Fail-Shut on enforcement errors. But third party cannot independently audit the rule set behavior. |
| 4 — Fully compliant | All of Level 3, plus: pure-function enforcement module auditable by inspection, published threat model with classified attack vectors, SBOM + build attestation, signed model and policy releases, documented rollback RTO. |
d. Common gap patterns
- "We use an LLM to check whether the LLM's output is safe." LLM-as-governance-of-LLM. The governance layer has the same failure modes as the thing it is governing. Auditors will correctly identify this as circular.
- Enforcement implemented as a system_prompt instruction to "refuse unsafe requests." Prompt-layer safety is a performance improvement, not a compliance control.
- Accuracy benchmarks run once, never refreshed. The benchmark is the training distribution. Production drift is invisible.
- Adversarial test suite is hardcoded with the attack vectors discovered in 2023. Today's attacks don't appear in the test set.
- Enforcement layer accepts input and "usually" blocks disallowed actions, but a try/except in the wrong place fails open. The rule "enforcement errors = reject" is not implemented.
- Model and policy versions are not pinned to releases. Production silently rolls to a new version when the vendor updates; you cannot reproduce a decision from last month.
- No SBOM. No build provenance. The binary you deployed is not verifiably the binary your CI built.
- Rollback procedure is "redeploy from the release tag" but has never been exercised under time pressure. It will take a day when you need 10 minutes.
- Input normalization (Unicode NFKC, homoglyph collapse, zero-width strip) happens after pattern matching, not before. Attack prompts using homoglyphs or zero-width characters bypass the regex layer.
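The ordering problem in the last gap pattern can be shown with a short sketch. The homoglyph table below is a tiny hypothetical sample (a real one would be far larger), and `RULE` is an illustrative pattern, not a rule from any real corpus. The essential point is that normalization runs before matching.

```python
import re
import unicodedata

# Zero-width characters to delete outright (translate maps them to None).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical sample of Cyrillic look-alikes mapped to Latin letters.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0456": "i",
})

RULE = re.compile(r"ignore previous instructions")


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # folds fullwidth forms, ligatures
    text = text.translate(ZERO_WIDTH)            # strip zero-width characters
    text = text.casefold()                       # case-insensitive comparison
    return text.translate(HOMOGLYPHS)            # collapse look-alike letters


def is_blocked(text: str) -> bool:
    # Normalize first, then match: the reverse order is the gap pattern.
    return RULE.search(normalize(text)) is not None
```

Matching the raw input with the same pattern misses a zero-width space inserted inside "ignore" or a Cyrillic "о" substituted for the Latin one; matching the normalized form catches both.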
e. Diagnostic questions
f. How to close the gap
Implement a pure-function enforcement module with zero I/O and zero global state (auditable by inspection). Build a continuously-refreshed adversarial corpus. Publish an SBOM per release (CycloneDX). Implement build attestation (SLSA Level 2 or higher). Expect 10–18 engineer-weeks plus ongoing corpus maintenance.
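Hash verification of pinned artifacts is one small, concrete piece of the supply-chain work described here. The sketch below checks files against pinned SHA-256 digests; the manifest shape (a mapping of relative path to hex digest) and the function names are assumptions made for this example, and it does not replace an SBOM or SLSA provenance.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Stream the file so large model artifacts do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(pins: dict, root: Path) -> list:
    """Return the names of artifacts that are missing or do not match."""
    failures = []
    for name, pinned_digest in pins.items():
        path = root / name
        if not path.is_file():
            failures.append(name)
            continue
        if not hmac.compare_digest(sha256_file(path), pinned_digest.lower()):
            failures.append(name)
    return failures
```

A release gate could refuse to deploy whenever `verify_pinned` returns a non-empty list, which treats a missing artifact the same as a mismatched one.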
Categories to evaluate: AI red-team platforms, model-integrity and provenance services (SLSA-aligned), prompt-injection and LLM firewall vendors (evaluate skeptically — many are themselves LLM-based and reproduce the governance-of-governance problem).
EVE AI Core reference: core/governance/failure_mode_invariant.py implements 127 enforcement pillars across 13 pattern groups (~175 compiled regex + NFKC/homoglyph normalization) with Hard-Fail-Shut on enforcement errors. core/governance/novel_attack_detector.py provides a 27-class TF-IDF semantic-similarity backup. core/governance/veto_core.py is the pure-function module (zero I/O, zero threading, zero global state) with 87 exhaustive tests. core/governance/build_attestation.py generates SLSA Level 2 provenance and scripts/generate_sbom.py produces CycloneDX 1.4 SBOMs.
Obligation 4 — Articles 26 & 27: Post-Market Monitoring + FRIA
a. The obligation in plain English
Article 26 requires deployers of high-risk AI systems to "monitor the operation of the high-risk AI system on the basis of the instructions for use" and to "inform the provider or distributor … where they have reason to consider that the use in accordance with the instructions for use may result in [the system] presenting a risk." Article 27 requires, for specified deployers, a Fundamental Rights Impact Assessment (FRIA) describing "the categories of natural persons and groups likely to be affected," "the specific risks of harm likely to have an impact" on them, and the "measures to be taken in case of the materialisation of those risks."
b. What "compliant" actually looks like
- A formal FRIA document exists with named owners, categorized affected populations, identified rights at risk, and mitigation measures — and it is versioned, not a one-time artifact.
- Post-market monitoring is a continuous telemetry pipeline, not a quarterly report. Drift, disparate impact, and incident patterns are surfaced in near-real-time.
- There is a documented incident reporting pathway to the provider (and to the competent authority where required) with response SLAs.
- Disparate-impact metrics are computed per protected-attribute cohort where lawful and relevant, and degradation triggers a review.
- The telemetry feeding the FRIA and post-market monitoring is itself tamper-evident — you cannot monitor compliance with forgeable data.
- Monitoring scope includes upstream dependency changes (model updates, policy updates, vendor changes) because they can invalidate the FRIA's assumptions.
- Incidents trigger a re-assessment of the FRIA, not just a patch. The FRIA is a living document whose version history is part of the compliance record.
c. Scoring rubric
| Level | Description |
| 0 — Absent | No FRIA. Post-market monitoring is whatever the vendor reports. |
| 1 — Manual | FRIA exists as a Word document. Monitoring is quarterly reports assembled by hand. |
| 2 — Automated but not tamper-evident | Dashboards exist showing drift, disparate impact, incident counts. But the underlying data is mutable; a bad quarter can be quietly edited. |
| 3 — Automated + tamper-evident, not independently verifiable | Telemetry pipeline with signed event records. FRIA versioned in code review. But third party cannot independently verify the monitoring data was not filtered. |
| 4 — Fully compliant | Tamper-evident telemetry chain, FRIA as a versioned artifact joined to release records, disparate-impact metrics computed continuously, incident reporting wired to provider and authority, and a documented re-assessment trigger. |
d. Common gap patterns
- FRIA produced once for the audit, never updated. When the model or policy changes, the FRIA's assumptions are silently invalidated.
- Post-market monitoring confused with product analytics. DAU and click-through are not Article 26 signals.
- Drift monitoring exists for the model's technical metrics (loss, perplexity) but not for the decision-level metrics (approval rate, denial rate by cohort, override rate).
- Incident reporting is "file a Jira ticket." No SLA, no external routing, no evidence the ticket reached the provider.
- Disparate-impact metrics are computed once during initial validation and never again. Populations drift; your fairness claims go stale.
- The monitoring dashboard pulls from the same mutable observability platform that fails Article 12. Auditor challenges one, challenges both.
- The FRIA is owned by Legal with no engineering counterpart. The mitigation measures described are not actually implemented in code.
- Incidents are tracked by severity (P1/P2) but not by rights impacted. The categorization a regulator wants is not in the data.
e. Diagnostic questions
f. How to close the gap
Stand up a telemetry pipeline sourced from the same tamper-evident audit substrate as Article 12. Implement cohort-aware disparate-impact metrics with baselines and alerting. Wire incident categorization to the rights categories the FRIA enumerates. Version the FRIA in the same repository as the model/policy releases so the two move together. Expect 10–20 engineer-weeks plus ongoing analyst staffing.
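One simple form of a cohort-aware metric is the ratio between the lowest and highest approval rate across cohorts. The sketch below is illustrative only: the function names are hypothetical, and the 0.8 default threshold is an assumption borrowed from the "four-fifths" heuristic used in some employment contexts, not a figure from the AI Act. Choose metrics and thresholds with legal and domain input.

```python
from collections import defaultdict


def approval_rates(decisions) -> dict:
    """decisions: iterable of (cohort, approved) pairs from decision telemetry."""
    counts = defaultdict(lambda: [0, 0])   # cohort -> [approved, total]
    for cohort, approved in decisions:
        counts[cohort][0] += int(bool(approved))
        counts[cohort][1] += 1
    return {cohort: ok / total for cohort, (ok, total) in counts.items()}


def impact_ratio(rates: dict) -> float:
    """Lowest cohort rate divided by the highest; 1.0 means parity."""
    if not rates:
        raise ValueError("no decisions to evaluate")
    highest = max(rates.values())
    if highest == 0:
        return 1.0   # every cohort has a zero rate, so there is no disparity
    return min(rates.values()) / highest


def needs_review(decisions, threshold: float = 0.8) -> bool:
    """True when the ratio falls below the (assumed) review threshold."""
    return impact_ratio(approval_rates(decisions)) < threshold
```

Run continuously over a sliding window, a check like this turns the one-off fairness validation into the ongoing signal the section calls for; small cohorts will need minimum-sample guards that this sketch omits.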
Categories to evaluate: ML observability platforms with fairness modules (ensure integrity claims meet the Article 12 bar), responsible-AI governance platforms that provide FRIA templates and workflow, regulated-industry reporting services with incident-routing to competent authorities in your jurisdiction.
EVE AI Core reference: core/governance/unified_audit_bus.py provides the unified tamper-evident substrate that post-market monitoring can draw from (same integrity guarantees as the Article 12 log). core/governance/merkle_aggregator.py publishes signed Merkle roots suitable for regulator-facing integrity attestation. core/accountability/telemetry.py captures decision-level telemetry joined to the audit chain.
Composite Scorecard Template
Copy this table onto the whiteboard. Score honestly.
| Obligation | Current (0–4) | Target | Gap | Effort (person-weeks) | Dependencies / Blockers |
| Art. 12 — Record-keeping | | | | | |
| Art. 14 — Human oversight | | | | | |
| Art. 15 — Accuracy, robustness, cybersecurity | | | | | |
| Arts. 26 & 27 — Post-market + FRIA | | | | | |
| Totals | | | | | |
Rule of thumb for effort sizing
- Each level of maturity closed costs roughly 8–12 engineer-weeks for the core implementation, plus 4–8 weeks for hardening and evidence collection.
- Crossing the Level 2 → Level 3 boundary (adding tamper-evidence) is the expensive jump. Budget generously.
- Crossing Level 3 → Level 4 (adding independent verifiability) is mostly about publication and documentation once the cryptographic primitives are in place — but requires coordination with legal and external auditors.
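The rule of thumb can be turned into a quick range estimate for the scorecard's effort column. This sketch assumes the hardening allowance applies per level closed, which is one reading of the first bullet, and it treats every level as equally costly even though the Level 2 → Level 3 jump is described as the expensive one. The function name is hypothetical.

```python
CORE_WEEKS = (8, 12)        # per level closed, core implementation
HARDENING_WEEKS = (4, 8)    # per level closed, hardening and evidence (assumed)


def estimate_effort(current: int, target: int) -> tuple:
    """Return (low, high) engineer-weeks to move from current to target."""
    for level in (current, target):
        if not 0 <= level <= 4:
            raise ValueError("maturity levels run from 0 to 4")
    levels_to_close = max(target - current, 0)
    low = levels_to_close * (CORE_WEEKS[0] + HARDENING_WEEKS[0])
    high = levels_to_close * (CORE_WEEKS[1] + HARDENING_WEEKS[1])
    return low, high
```

For example, moving an obligation from Level 1 to Level 3 closes two levels, giving 2 × (8 + 4) = 24 weeks at the low end and 2 × (12 + 8) = 40 weeks at the high end.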
Priority Matrix — Effort vs. Regulatory Severity
Plot each gap on this 2x2 based on the scorecard. Work the upper-right quadrant first.
[Figure: 2x2 priority matrix plotting effort against regulatory severity]
How to categorize regulatory severity
- HIGH: Article 15 gaps (a manipulatable system is the largest exposure), Article 12 Level 0–1 gaps (no reconstructable record = cannot defend any decision).
- MEDIUM: Article 14 gaps where a pre-execution gate is missing (humans cannot intervene), Articles 26/27 gaps where no FRIA exists at all.
- LOW: Upgrades between the top maturity levels (e.g., Level 3 → Level 4 independent verifiability) where the underlying control already works.
How to categorize effort
- HIGH: Anything that requires cross-system refactoring (e.g., replacing the logging substrate), anything that requires new staffing (the overseer queue), anything that requires key management infrastructure.
- LOW: Documentation, test-suite expansion, publishing already-computed artifacts, rotating already-implemented signing keys.
Critical Failure Modes
Watch for these in your own implementation and in vendor evaluations. Each of them produces something that looks compliant to a casual reviewer but fails under audit.
A 40-page document describing a process that is not implemented in code. An auditor asks to see the last 10 decisions the process produced; you cannot, because the process is aspirational. If the control is not enforced in code, it is not a control.
Using a non-deterministic model to evaluate the safety of another non-deterministic model. The governance layer inherits every failure mode of the thing it governs, plus new ones. A regulator will identify this as circular reasoning. Deterministic enforcement layers (pure functions, rule-based, version-pinned) must sit between input and inference.
Logs generated after the decision, by a process that reads downstream effects and reconstructs the decision. The "log" is a derivation, not a record. If the derivation logic changes, yesterday's logs change. Genuine record-keeping writes the log synchronously with the decision, on the critical path.
A log store that enforces chronological append but allows administrators to insert entries with backdated timestamps. This is not tamper-evidence — it is tamper-convenience. Real tamper-evidence is cryptographic (hash chain, Merkle tree) and produces a detectable signature break on modification.
The post-market monitoring pipeline is itself an unmonitored, un-versioned artifact. When it silently breaks, no one knows. A compliance control you cannot verify is operational is a liability, not an asset.
Scoring intent instead of evidence is the most common failure mode. Score what you have today, not what you have planned. A Jira ticket is Level 0.
If producing the audit artifact requires a two-week engineer sprint, it is not a control — it is a reconstruction. The artifact must be a byproduct of normal operation.
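To make the contrast with post-hoc logging concrete, here is a minimal sketch of a decision path where the audit record is written on the critical path. The names `decide_with_record` and `LogWriteError` are hypothetical, and `write_log` stands in for whatever tamper-evident sink is in use. If the write fails, the caller never receives the decision.

```python
class LogWriteError(Exception):
    """Raised when a decision is withheld because its record was not written."""


def decide_with_record(decide, write_log, request):
    """Compute a decision and release it only after its record is stored."""
    decision = decide(request)
    record = {"request": request, "decision": decision}
    try:
        write_log(record)   # synchronous: must succeed before returning
    except Exception as exc:
        raise LogWriteError("decision withheld: audit record not written") from exc
    return decision
```

Because the record is produced by the same call that produces the decision, the audit artifact is a byproduct of normal operation rather than something reconstructed later.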
What Good Looks Like: Reference Architecture
A fully-compliant stack for a high-risk AI system has six structural components. A stack missing any of these six pieces has a structural compliance gap that no amount of documentation will close.
1. Deterministic pre-execution gate
Receives every proposed decision, evaluates it against version-pinned policy, and returns ALLOWED / BLOCKED / MODIFIED with a risk assessment. This gate is a pure function where possible, is exhaustively unit-tested, and fails closed on any internal error. EVE AI Core reference: core/coreguard/ (POST /v1/decisions/evaluate) backed by core/governance/veto_core.py (pure, zero-I/O) and core/governance/failure_mode_invariant.py (127 pillars, Hard-Fail-Shut).
2. Tamper-evident audit substrate
Receives every decision, every approval, every override, and every configuration change, canonicalizes the payload, signs it, chains it, and aggregates into publishable Merkle roots. EVE AI Core reference: core/governance/unified_audit_bus.py (16 source systems, JCS-canonicalized, HMAC-signed), core/governance/jcs_canonicalize.py (RFC 8785), core/governance/merkle_aggregator.py (batched roots).
3. Human oversight surface
Cryptographic action attribution, a pre-execution approval workflow for high-consequence decisions, and an anomaly queue with SLA. EVE AI Core reference: Operator Console + core/governance/action_registry.py (propose → approve → execute) + core/tasks/rbac.py (5 roles, 9 permissions) + core/tasks/operator_audit.py (JSONL audit trail).
4. Continuous adversarial testing and drift monitoring
Refreshed corpus, cohort-aware metrics, and release-gating on degradation. Feeds the same audit substrate as the rest of the stack. EVE AI Core reference: core/governance/novel_attack_detector.py (27-class TF-IDF semantic similarity backup) + tests/test_failure_mode_invariant.py (56 attack-vector tests) + core/accountability/telemetry.py.
5. Versioned FRIA
Living in the same repository as the model and policy releases, updated on every release, with impact categories joined to the incident taxonomy. EVE AI Core reference: core/governance/unified_audit_bus.py for the telemetry pipeline, plus FRIA as a repo-resident markdown artifact gated by release review.
6. Independent verifiability
A third party, given only your public verification keys and an export, can verify every decision's integrity, every approval's authenticity, and every published Merkle root — without access to your infrastructure or tooling. EVE AI Core reference: EVE Proof SDK (sdks/proof/, eve-proof PyPI package) and POST /api/tve/verify-attestation.
Compliance is a property of your system, not of any single vendor's product. This framework is designed to be copied, adapted, and used — including by customers evaluating vendors other than EVE.
Next Step
Once you have completed this gap analysis and populated the composite scorecard, proceed to the technical implementation roadmap. The roadmap turns each identified gap into a sequenced engineering plan with module-level design, test plans, and evidence artifacts suitable for an external audit.