EU AI Act Gap Analysis: Articles 12, 14, 15, 26, 27 Scorecard

How to Use This Framework

This framework operationalizes the four obligations that matter most in production: Article 12 (record-keeping), Article 14 (human oversight), Article 15 (accuracy, robustness, cybersecurity), and Articles 26 and 27 (post-market monitoring and fundamental rights impact assessment). Published by EVE AI Core. This is a working tool, not a marketing document. Use it on a whiteboard with your CISO, General Counsel, DPO, and engineering lead.

For each obligation you will:

Read the plain-English summary and the architectural properties that define "compliant".
Score your current state on the 5-level maturity rubric (0–4).
Walk the diagnostic checklist. Any unchecked box is a gap.
Map gaps onto the priority matrix (effort vs. regulatory severity).
Populate the composite scorecard with target maturity and person-week estimates.

Practical Logistics

Time needed: 2–3 hours for a first pass with a cross-functional team. Budget a full day if you are also gathering evidence artifacts.

Team required: CISO or AI governance lead (owner), General Counsel or DPO, ML/platform engineering lead, SRE/observability lead, and, if you serve regulated users, a compliance or internal audit representative. Do not run this solo: a single assessor's blind spots become the assessment's blind spots.

Honesty principle: score what you can prove to a regulator today, not what you intend to build. If a property depends on a Jira ticket, it is Level 0 for this assessment.

Contents

Obligation 1 — Article 12: Record-Keeping
Obligation 2 — Article 14: Human Oversight
Obligation 3 — Article 15: Accuracy, Robustness, Cybersecurity
Obligation 4 — Articles 26 & 27: Post-Market Monitoring + FRIA
Composite Scorecard Template
Priority Matrix — Effort vs. Regulatory Severity
Critical Failure Modes
What Good Looks Like: Reference Architecture
Next Step

Obligation 1 — Article 12: Record-Keeping

a. The obligation in plain English

Article 12(1) requires that high-risk AI systems "shall technically allow for the automatic recording of events (logs) over the lifetime of the system." Article 12(2) requires those logging capabilities to ensure a level of traceability of the system's functioning "that is appropriate to the intended purpose of the system." In practice this means every decision the system makes must be reconstructable after the fact, and the record must be trustworthy.

b. What "compliant" actually looks like

Logs are generated synchronously with the decision, not reconstructed after. The logging call is on the critical path; if logging fails, the decision does not ship.
Each log entry is cryptographically bound to its predecessor (hash-chain, Merkle tree, or equivalent) so any deletion, insertion, or modification is detectable by an independent party.
Logs are canonicalized before signing using a deterministic serialization (RFC 8785 JCS or equivalent), so two verifiers produce identical hashes.
Log write permissions are separated from read permissions, and no role — including platform administrator — can retroactively edit an entry without leaving a detectable signature break.
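Retention spans the system lifetime plus the regulatory tail, with integrity verifiable across the full window. The Act itself sets a floor of at least six months for logs (Articles 19 and 26(6)); sector-specific law in high-risk domains typically extends this to 6–10 years.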
Each entry contains the minimum reconstruction set: input hash, model/policy version, decision, confidence, actor identity, timestamp (monotonic + wall clock), and upstream correlation ID.
An independent third party can verify the chain given only a public verification key and the log export — no access to your infrastructure required.
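The chaining property in the list above can be sketched with the standard library alone. This is a minimal illustration, not a production design: the function names and the all-zero genesis value are hypothetical, signing is omitted, and `canonical_bytes` is a simplified stand-in that does not conform to RFC 8785, so a real system would swap in a proper JCS implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel used as the first entry's predecessor hash


def canonical_bytes(payload: dict) -> bytes:
    # Simplified stand-in for RFC 8785 (JCS); NOT a conforming implementation.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def append_entry(chain: list, payload: dict) -> dict:
    """Append a payload, binding it to the hash of the previous entry."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = canonical_bytes({"prev_hash": prev_hash, "payload": payload})
    entry = {
        "prev_hash": prev_hash,
        "payload": dict(payload),  # shallow copy so later caller edits do not leak in
        "entry_hash": hashlib.sha256(body).hexdigest(),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edit, insertion, or mid-chain deletion fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        body = canonical_bytes({"prev_hash": prev_hash, "payload": entry["payload"]})
        if hashlib.sha256(body).hexdigest() != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A chain on its own does not detect truncation of the most recent entries, which is one reason the framework also asks for externally published roots.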

c. Scoring rubric

Level	Description
0 — Absent	No structured decision logs exist. Post-hoc reconstruction from scattered application logs only.
1 — Manual	Engineers export logs on request. No automatic capture of decision context. Integrity relies on institutional trust.
2 — Automated, not tamper-evident	Logs written to Datadog, Splunk, Elastic, or CloudWatch. Complete, but mutable by anyone with platform admin rights. Deletion leaves no signature.
3 — Automated, tamper-evident, not independently verifiable	Hash-chained, canonicalized, signed. But verification requires vendor tooling or access to your signing keys — regulator cannot verify offline.
4 — Fully compliant	Hash-chained + JCS-canonicalized + asymmetrically signed (an HMAC alone cannot be checked without sharing the secret key) + Merkle-aggregated roots published. Third party verifies with the public key and the export alone.

d. Common gap patterns

Application logs written to Datadog/Splunk/Elastic/CloudWatch are mutable by anyone with platform admin rights — not Article 12 compliant regardless of retention settings.
Decision logs generated by a post-hoc scraper that reads transaction tables. The decision happened; the "log" is a downstream interpretation. A regulator will reject this.
Signed logs using json.dumps(sort_keys=True) instead of a canonicalization scheme. Key order is fixed, but whitespace, number formatting, and string escaping are not, so two runtimes can produce different bytes; signatures diverge; audit fails.
"Immutable" S3 buckets with object-lock configured for 90 days but no hash chain. Deletion is prevented, silent replacement is not.
Log retention set at the logging platform's default (30–90 days). High-risk regulatory tails require years.
Correlation IDs exist at the HTTP layer but not at the decision layer — you can find the request, not the decision path inside it.
Signing keys rotated without publishing prior public keys; historical signatures become unverifiable.
Merkle roots computed but never published externally. "Trust us, the root is correct" is not independent verifiability.
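The serialization gap in the list above is easy to reproduce. The sketch below, using only Python's standard `json` and `hashlib` modules, serializes one logical record under two settings that both sort keys. The record contents are made up for illustration.

```python
import hashlib
import json

# One logical record, invented for this example.
record = {"score": 1.0, "name": "caf\u00e9", "decision": "deny"}

# Default settings: ", " and ": " separators, non-ASCII characters escaped.
default_bytes = json.dumps(record, sort_keys=True).encode("utf-8")

# Compact settings: no whitespace, non-ASCII characters emitted as UTF-8.
compact_bytes = json.dumps(record, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Same keys, same values, same key order, yet the bytes (and so the hashes) differ.
same_bytes = default_bytes == compact_bytes
same_digest = digest(default_bytes) == digest(compact_bytes)
```

Neither output is canonical in the RFC 8785 sense: both render the float as `1.0`, whereas JCS follows ECMAScript number formatting and would emit `1`. A verifier written in another language would therefore disagree with both.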

e. Diagnostic questions

Check each box only if you could produce the evidence to an auditor today.

Can we produce, for any decision made in the last 12 months, the complete input, model/policy version, output, and decision reasoning within 1 hour?

Is every decision log entry cryptographically linked to the entries before and after it?

If a platform administrator deletes or modifies an entry, will the chain break in a way a third party can detect?

Are log payloads canonicalized (JCS / RFC 8785 or equivalent) before signing?

Can a third party verify the integrity of the entire log chain using only a published public key?

Is the log-write path on the critical path of the decision (i.e., decision fails if logging fails)?

Does the log retention window cover the full regulatory obligation (typically 6–10 years for high-risk)?

Are signing keys rotated on a documented schedule, with historical public keys retained for verification?

f. How to close the gap

Build

Implement a JCS canonicalization library, an append-only hash-chain store with asymmetric signing (HMAC-SHA256 is adequate only for internal integrity checks, because verifying an HMAC requires the secret key), a Merkle aggregator that publishes roots to an external witness (another cloud provider, a partner, or a transparency log), and a verification endpoint that accepts a log export and a public key. Expect 8–16 engineer-weeks for a production-ready implementation plus hardening.
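As a rough sketch of the Merkle aggregation step, the function below reduces a batch of log-entry hashes to a single root that could be published to a witness. The pairing scheme (an unpaired node is promoted unchanged) and the one-byte leaf and node prefixes are assumptions chosen for brevity; the prefixes borrow the domain-separation idea from certificate-transparency logs, and no interoperability with any existing log format is claimed.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list) -> str:
    """Return the hex Merkle root over a non-empty batch of byte strings."""
    if not leaves:
        raise ValueError("cannot aggregate an empty batch")
    # Prefix leaves and interior nodes differently so a leaf can never be
    # mistaken for an interior node.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # unpaired node is promoted unchanged
        level = next_level
    return level[0].hex()
```

Publishing only the root commits you to every entry in the batch: changing, adding, or removing any leaf yields a different root, which the external witness would not match.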

Buy

Categories to evaluate: audit-grade logging platforms (not observability — different problem), transparency log services (e.g., certificate-transparency-style systems adapted for decision logs), and governance platforms that offer verifiable-attestation primitives. Be skeptical of any vendor whose "tamper-evident" claim cannot be verified without their tooling.

EVE AI Core

core/governance/unified_audit_bus.py (JCS-canonicalized, HMAC-signed, hash-chained audit bus across 16 source systems), core/governance/jcs_canonicalize.py (RFC 8785 implementation), core/governance/merkle_aggregator.py (batched Merkle tree aggregation with signed root publication), and the EVE Proof SDK (sdks/proof/) for third-party verification.

Obligation 2 — Article 14: Human Oversight

a. The obligation in plain English

Article 14(1) requires that high-risk AI systems be designed and developed so that they can be "effectively overseen by natural persons" while they are in use. Article 14(4) specifies that overseers must be enabled to "properly understand the relevant capacities and limitations of the high-risk AI system," to "correctly interpret the high-risk AI system's output," to decide not to use the system or to disregard, override, or reverse its output, and to intervene in its operation or interrupt it through a 'stop' button or a similar procedure that brings the system to a halt in a safe state. In practice, intervention has to be possible in real time, while the system is operating.

b. What "compliant" actually looks like

A human in the chain of control has the authority and the technical means to halt a decision before it executes — not just observe it after the fact.
Outputs are surfaced in human-interpretable form, with the confidence, the alternatives considered, and the factors that drove the decision.
An override action is a first-class operation that is logged, attributed to the overseer, and joined to the decision record.
There is an explicit approval workflow for high-consequence decisions, with the threshold documented and enforced in code (not in a runbook).
Anomaly detection surfaces unusual behavior to the overseer proactively — overseers cannot be expected to review every decision.
The overseer has training documentation specific to this system covering known failure modes, typical output ranges, and the specific signals that indicate the system is operating outside its intended scope.
The kill switch is tested on a documented cadence and the test results are retained.
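Two of the properties above, a threshold "enforced in code" and a halt that takes effect before execution, can be sketched as a single routing function. Everything here is hypothetical: the policy fields, the use of an amount and a confidence score as the threshold inputs, and the three route names are illustrative choices, not something the Act or this framework prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    HUMAN_APPROVAL = "human_approval"
    HALTED = "halted"


@dataclass(frozen=True)
class OversightPolicy:
    version: str                # pinned so the threshold in force is auditable
    max_auto_amount: float      # above this, a human must approve
    min_auto_confidence: float  # below this, a human must approve


def route_decision(policy: OversightPolicy, amount: float,
                   confidence: float, kill_switch_engaged: bool) -> Route:
    """Decide, before execution, whether a decision may proceed unattended."""
    if kill_switch_engaged:
        return Route.HALTED  # the halt wins over every other rule
    if amount > policy.max_auto_amount or confidence < policy.min_auto_confidence:
        return Route.HUMAN_APPROVAL
    return Route.AUTO_EXECUTE
```

Because the function is pure, a kill-switch drill reduces to calling it with `kill_switch_engaged=True` and retaining the result as test evidence.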

c. Scoring rubric

Level	Description
0 — Absent	System executes decisions autonomously with no human-accessible intercept.
1 — Manual	Humans can review output after the fact and manually reverse effects. No pre-execution gate.
2 — Automated but not tamper-evident	A review queue exists for flagged decisions, but approval/reject clicks are not auditable — reviewers can be impersonated or actions edited.
3 — Automated + tamper-evident, not independently verifiable	Three-phase propose/approve/execute workflow with signed approvals, but no external verifiability of overseer identity or the approval chain.
4 — Fully compliant	Pre-execution gate with cryptographically attributed approvals, a documented training program for overseers, tested kill switch, and anomaly detection feeding a queue that humans actually work.

d. Common gap patterns

The "human in the loop" is a reviewer who sees decisions on a dashboard after they've already been executed. This is human observation, not Article 14 oversight.
Approval workflow exists but is routinely auto-approved because the queue is too large for the staffing. Compliance theater.
The override mechanism exists in theory (a config flag, a feature toggle) but has never been exercised in production. It will not work when you need it.
Anomaly detection fires into a Slack channel no one monitors. No SLA, no escalation, no on-call.
The training document for overseers is the public product page. Overseers don't understand the system's failure modes well enough to identify them when they occur.
Approval clicks are attributed by session cookie, not by cryptographic signature. Anyone with access to the operator machine is "the overseer."
Kill switch requires filing a ticket and waiting for an engineer. By definition, not real-time.
Override events are logged in the operator console but not joined to the decision chain — reconstruction after an incident requires correlating two separate systems.

e. Diagnostic questions

Can a named human halt any specific decision before it takes external effect?

Is the pre-execution gate on the critical path, or is it advisory?

Is the overseer's approval cryptographically attributable to a specific individual (not just a role or a session)?

Does the overseer have training materials specific to this system's known failure modes?

Is there an anomaly-detection system that routes unusual decisions to the overseer with an SLA?

Do override events generate audit records joined to the decision record?

Has the kill switch been exercised in the last quarter, with the test result retained?

Is the staffing level sufficient to actually review the flagged queue, or has the team quietly started auto-approving?

Are the approval thresholds (what requires human sign-off) documented and enforced in code?

Can the overseer see the confidence score, the alternatives considered, and the policy that produced the decision?

f. How to close the gap

Build

Implement a three-phase action workflow (propose → approve → execute) with the execution step gated on a signed approval. Build an operator console with role-based access, cryptographic action attribution, and joined audit trails. Wire anomaly detection to a real queue with real on-call. Expect 12–20 engineer-weeks plus ongoing operational cost (staffing the queue).
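A minimal sketch of the propose → approve → execute gate might look like the following. The class and function names are hypothetical, and HMAC with a per-overseer secret is used only because it is in the standard library; a production system aiming at the higher maturity levels would use per-person asymmetric keys so that approvals can be verified without sharing secrets.

```python
import hashlib
import hmac


class ApprovalError(Exception):
    """Raised when an action is approved or executed without valid authority."""


def sign_approval(overseer_key: bytes, overseer_id: str, decision_id: str) -> str:
    # HMAC is a stand-in for an asymmetric signature (assumption for this sketch).
    message = f"{overseer_id}|{decision_id}|approve".encode("utf-8")
    return hmac.new(overseer_key, message, hashlib.sha256).hexdigest()


class ActionWorkflow:
    def __init__(self, overseer_keys: dict):
        self._keys = overseer_keys  # overseer_id -> secret key (hypothetical registry)
        self._actions = {}          # decision_id -> {"action": ..., "state": ...}
        self.audit = []             # workflow events, joined by decision_id

    def propose(self, decision_id: str, action: str) -> None:
        if decision_id in self._actions:
            raise ApprovalError("decision already proposed")
        self._actions[decision_id] = {"action": action, "state": "proposed"}
        self.audit.append({"event": "propose", "decision_id": decision_id})

    def approve(self, decision_id: str, overseer_id: str, signature: str) -> None:
        record = self._actions.get(decision_id)
        key = self._keys.get(overseer_id)
        if record is None or record["state"] != "proposed" or key is None:
            raise ApprovalError("approval not permitted")
        expected = sign_approval(key, overseer_id, decision_id)
        if not hmac.compare_digest(expected, signature):
            raise ApprovalError("signature does not match overseer")
        record["state"] = "approved"
        self.audit.append({"event": "approve", "decision_id": decision_id,
                           "overseer_id": overseer_id, "signature": signature})

    def execute(self, decision_id: str) -> str:
        record = self._actions.get(decision_id)
        if record is None or record["state"] != "approved":
            raise ApprovalError("no valid approval on record")
        record["state"] = "executed"
        self.audit.append({"event": "execute", "decision_id": decision_id})
        return record["action"]
```

The design point is that `execute` has no code path that bypasses `approve`, and the approval event carries the overseer identity alongside the same `decision_id` used in the decision record.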

Buy

Categories to evaluate: AI approval workflow platforms, regulated-industry MLOps platforms with built-in human-in-the-loop primitives, and decision support systems designed for the relevant vertical (finance, healthcare, hiring). Beware platforms that conflate observability with oversight.

EVE AI Core

core/coreguard/ provides the pre-execution evaluation gate (POST /v1/decisions/evaluate returns ALLOWED/BLOCKED/MODIFIED with risk assessment). core/governance/action_registry.py implements the three-phase propose/approve/execute workflow with role-based gating. The Operator Console provides the human-facing surface with RBAC via core/tasks/rbac.py, cryptographically attributed actions via core/tasks/operator_audit.py, and Red Team Mode for defensive hardening.

Obligation 3 — Article 15: Accuracy, Robustness, Cybersecurity

a. The obligation in plain English

Article 15(1) requires high-risk AI systems be "designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle." Article 15(4) addresses robustness: systems must be as resilient as possible to errors, faults, or inconsistencies, and robustness may be achieved through technical redundancy, including backup or fail-safe plans. Article 15(5) addresses cybersecurity: systems must be "resilient against attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities." This framework reads the two together as a fail-closed requirement: the system either handles errors gracefully or refuses to operate.

b. What "compliant" actually looks like

The system has a documented adversarial threat model with explicit attack classes, not a generic "we follow OWASP" statement.
There is a deterministic enforcement layer between input and inference that cannot be bypassed by prompt content, escalation patterns, or manipulation of upstream systems.
The enforcement layer is itself resilient — it fails closed (reject) rather than failing open (allow), and enforcement errors are treated as security events.
Accuracy is measured on a published benchmark with a documented refresh cadence, and degradation triggers a release block.
Adversarial testing is continuous, not a once-a-year penetration test. New attack vectors are added to the test corpus as they are discovered.
The enforcement logic is pure / side-effect-free where possible, so its behavior can be unit-tested exhaustively and formally reasoned about.
Dependency and supply-chain integrity is attested (SBOM, build provenance, pinned dependencies with hash verification).
There is a documented rollback path to a last-known-good model and policy version, with RTO measured in minutes, not hours.
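The fail-closed and pure-function properties above can be illustrated together. In this sketch the rule set is a single made-up regex and the names (`evaluate`, `evaluate_fail_closed`, `RULES_VERSION`) are hypothetical; the point is only the shape: a side-effect-free rule check, wrapped so that any internal error becomes a rejection and is reported as a security event.

```python
import re
from enum import Enum


class Verdict(Enum):
    ALLOWED = "ALLOWED"
    BLOCKED = "BLOCKED"


RULES_VERSION = "example-1"  # hypothetical pinned rule-set identifier
BLOCK_PATTERNS = (
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
)


def evaluate(text: str) -> Verdict:
    """Pure rule check: no I/O, no mutation of shared state."""
    if not isinstance(text, str):
        raise TypeError("enforcement input must be text")
    for pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return Verdict.BLOCKED
    return Verdict.ALLOWED


def evaluate_fail_closed(text, evaluator=evaluate, on_error=None) -> Verdict:
    """Treat any enforcement error as a rejection, never as an allow."""
    try:
        return evaluator(text)
    except Exception as exc:
        if on_error is not None:
            on_error(exc)  # e.g. raise a security event; the hook is an assumption
        return Verdict.BLOCKED
```

Keeping the wrapper separate from the rules means the fail-closed behavior can be tested once, independently of however many rules are later added.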

c. Scoring rubric

Level	Description
0 — Absent	No adversarial testing. No enforcement layer beyond the model's own refusals.
1 — Manual	Periodic pen-tests. Enforcement via prompt engineering. Incident response is ad-hoc.
2 — Automated but not tamper-evident	Automated adversarial test suite. Enforcement via application-level checks. But enforcement logic can drift silently; no guarantee today's system behaves like last quarter's.
3 — Automated + tamper-evident, not independently verifiable	Deterministic enforcement module with version-pinned rules, test coverage, Hard-Fail-Shut on enforcement errors. But third party cannot independently audit the rule set behavior.
4 — Fully compliant	All of Level 3, plus: pure-function enforcement module auditable by inspection, published threat model with classified attack vectors, SBOM + build attestation, signed model and policy releases, documented rollback RTO.

d. Common gap patterns

"We use an LLM to check whether the LLM's output is safe." LLM-as-governance-of-LLM. The governance layer has the same failure modes as the thing it is governing. Auditors will correctly identify this as circular.
Enforcement implemented as a system_prompt instruction to "refuse unsafe requests." Prompt-layer safety is a performance improvement, not a compliance control.
Accuracy benchmarks run once, never refreshed. The benchmark is the training distribution. Production drift is invisible.
Adversarial test suite is hardcoded with the attack vectors discovered in 2023. Today's attacks don't appear in the test set.
Enforcement layer accepts input and "usually" blocks disallowed actions, but a try/except in the wrong place fails open. The rule "enforcement errors = reject" is not implemented.
Model and policy versions are not pinned to releases. Production silently rolls to a new version when the vendor updates; you cannot reproduce a decision from last month.
No SBOM. No build provenance. The binary you deployed is not verifiably the binary your CI built.
Rollback procedure is "redeploy from the release tag" but has never been exercised under time pressure. It will take a day when you need 10 minutes.
Input normalization (Unicode NFKC, homoglyph collapse, zero-width strip) happens after pattern matching, not before. Attack prompts using homoglyphs or zero-width characters bypass the regex layer.
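The ordering problem in the last gap pattern can be shown in a few lines. The sketch below normalizes with NFKC, strips a handful of zero-width code points, and collapses a deliberately tiny homoglyph table before matching. The table, the zero-width list, and the pattern are illustrative assumptions and far from complete; a real deployment would need a much larger confusables mapping.

```python
import re
import unicodedata

# Zero-width and invisible code points to delete (illustrative, not exhaustive).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Tiny homoglyph table: Cyrillic look-alikes mapped to Latin letters (assumption).
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0456": "i"})

PATTERN = re.compile(r"ignore previous instructions")


def normalize(text: str) -> str:
    """NFKC fold, strip zero-width characters, collapse homoglyphs, casefold."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ZERO_WIDTH)
    return text.translate(HOMOGLYPHS).casefold()


# Zero-width space, a Cyrillic letter, and a fullwidth Latin letter in one string.
attack = "ign\u200bore prev\u0456ous \uff49nstructions"

matched_raw = PATTERN.search(attack) is not None
matched_normalized = PATTERN.search(normalize(attack)) is not None
```

Matching against the raw string misses the attack; matching after `normalize` catches it, which is why normalization has to run first.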

e. Diagnostic questions

Do we have a written threat model enumerating the adversarial input classes specific to our high-risk system?

Is there a deterministic enforcement layer between input and inference that cannot be bypassed by prompt content?

If the enforcement layer raises an exception, does the system refuse the decision (fail closed), or does it proceed?

Is input normalized (Unicode NFKC, homoglyph collapse, zero-width strip) before any pattern matching runs?

Is our accuracy benchmark refreshed on a documented cadence, and does degradation block releases?

Do we run adversarial tests continuously (per-PR or nightly), not just at audit time?

Can we produce an SBOM and build attestation for the currently-deployed binary?

Are model and policy versions pinned such that we can reproduce any decision from the last 12 months?

Is the rollback path to a last-known-good version documented, tested, and measured in minutes?

Is the enforcement logic simple enough that a human auditor can read it and understand what it does?

f. How to close the gap

Build

Implement a pure-function enforcement module with zero I/O and zero global state (auditable by inspection). Build a continuously-refreshed adversarial corpus. Publish an SBOM per release (CycloneDX). Implement build attestation (SLSA Level 2 or higher). Expect 10–18 engineer-weeks plus ongoing corpus maintenance.
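Hash-pinned artifacts are the simplest piece of the supply-chain work to illustrate. The sketch below does not generate an SBOM or SLSA provenance; it only shows the digest comparison that pinning relies on, with hypothetical function names and in-memory byte strings standing in for model, policy, or dependency files.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def mismatched_artifacts(artifacts: dict, pinned: dict) -> list:
    """Return the names of artifacts that are unpinned or do not match their pin.

    artifacts: name -> raw bytes as deployed
    pinned:    name -> expected SHA-256 hex digest recorded at release time
    """
    bad = []
    for name, data in artifacts.items():
        expected = pinned.get(name)
        if expected is None:
            bad.append(name)  # an unpinned artifact is treated as a failure
        elif not hmac.compare_digest(sha256_hex(data), expected.lower()):
            bad.append(name)
    return sorted(bad)
```

A release gate could refuse to deploy whenever this list is non-empty, which also gives the rollback path a concrete definition of "last-known-good".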

Buy

Categories to evaluate: AI red-team platforms, model-integrity and provenance services (SLSA-aligned), prompt-injection and LLM firewall vendors (evaluate skeptically — many are themselves LLM-based and reproduce the governance-of-governance problem).

EVE AI Core

core/governance/failure_mode_invariant.py implements 127 enforcement pillars across 13 pattern groups (~175 compiled regex + NFKC/homoglyph normalization) with Hard-Fail-Shut on enforcement errors. core/governance/novel_attack_detector.py provides a 27-class TF-IDF semantic-similarity backup. core/governance/veto_core.py is the pure-function module (zero I/O, zero threading, zero global state) with 87 exhaustive tests. core/governance/build_attestation.py generates SLSA Level 2 provenance and scripts/generate_sbom.py produces CycloneDX 1.4 SBOMs.

Obligation 4 — Articles 26 & 27: Post-Market Monitoring + FRIA

a. The obligation in plain English

Article 26(5) requires deployers of high-risk AI systems to "monitor the operation of the high-risk AI system on the basis of the instructions for use" and, where they have reason to consider that use in accordance with those instructions may result in the system presenting a risk, to inform the provider or distributor and the relevant market surveillance authority without undue delay. (Post-market monitoring in the Act's formal sense is a provider obligation under Article 72; this framework uses the term for the deployer-side monitoring that feeds it.) Article 27 requires, for specified deployers, a Fundamental Rights Impact Assessment (FRIA) describing "the categories of natural persons and groups likely to be affected," "the specific risks of harm likely to have an impact" on them, and the measures to be taken if those risks materialise.

b. What "compliant" actually looks like

A formal FRIA document exists with named owners, categorized affected populations, identified rights at risk, and mitigation measures — and it is versioned, not a one-time artifact.
Post-market monitoring is a continuous telemetry pipeline, not a quarterly report. Drift, disparate impact, and incident patterns are surfaced in near-real-time.
There is a documented incident reporting pathway to the provider (and to the competent authority where required) with response SLAs.
Disparate-impact metrics are computed per protected-attribute cohort where lawful and relevant, and degradation triggers a review.
The telemetry feeding the FRIA and post-market monitoring is itself tamper-evident — you cannot monitor compliance with forgeable data.
Monitoring scope includes upstream dependency changes (model updates, policy updates, vendor changes) because they can invalidate the FRIA's assumptions.
Incidents trigger a re-assessment of the FRIA, not just a patch. The FRIA is a living document whose version history is part of the compliance record.

c. Scoring rubric

Level	Description
0 — Absent	No FRIA. Post-market monitoring is whatever the vendor reports.
1 — Manual	FRIA exists as a Word document. Monitoring is quarterly reports assembled by hand.
2 — Automated but not tamper-evident	Dashboards exist showing drift, disparate impact, incident counts. But the underlying data is mutable; a bad quarter can be quietly edited.
3 — Automated + tamper-evident, not independently verifiable	Telemetry pipeline with signed event records. FRIA versioned in code review. But third party cannot independently verify the monitoring data was not filtered.
4 — Fully compliant	Tamper-evident telemetry chain, FRIA as a versioned artifact joined to release records, disparate-impact metrics computed continuously, incident reporting wired to provider and authority, and a documented re-assessment trigger.

d. Common gap patterns

FRIA produced once for the audit, never updated. When the model or policy changes, the FRIA's assumptions are silently invalidated.
Post-market monitoring confused with product analytics. DAU and click-through are not Article 26 signals.
Drift monitoring exists for the model's technical metrics (loss, perplexity) but not for the decision-level metrics (approval rate, denial rate by cohort, override rate).
Incident reporting is "file a Jira ticket." No SLA, no external routing, no evidence the ticket reached the provider.
Disparate-impact metrics are computed once during initial validation and never again. Populations drift; your fairness claims go stale.
The monitoring dashboard pulls from the same mutable observability platform that fails Article 12. Auditor challenges one, challenges both.
The FRIA is owned by Legal with no engineering counterpart. The mitigation measures described are not actually implemented in code.
Incidents are tracked by severity (P1/P2) but not by rights impacted. The categorization a regulator wants is not in the data.

e. Diagnostic questions

Does a current FRIA exist, with named owners, dated within the last 12 months?

Is the FRIA version history preserved and linked to specific model/policy releases?

Do we monitor decision-level metrics (approval rate, denial rate, override rate) continuously, not just technical ML metrics?

Are disparate-impact metrics computed per cohort (where lawful) on a cadence that detects drift?

Is the telemetry feeding post-market monitoring on the same tamper-evident substrate as Article 12 logs?

Is there a documented incident reporting pathway to the provider and the authority, with response SLAs?

Do incidents trigger an FRIA re-assessment, not just a technical patch?

When the model or policy version changes, does the FRIA review gate the release?

Is the monitoring pipeline's own uptime and data-completeness measured? (Monitoring the monitor.)

Could we produce the last 12 months of monitoring data with integrity proofs for a regulator?

f. How to close the gap

Build

Stand up a telemetry pipeline sourced from the same tamper-evident audit substrate as Article 12. Implement cohort-aware disparate-impact metrics with baselines and alerting. Wire incident categorization to the rights categories the FRIA enumerates. Version the FRIA in the same repository as the model/policy releases so the two move together. Expect 10–20 engineer-weeks plus ongoing analyst staffing.
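One common way to express a cohort-aware disparate-impact metric is the ratio of each cohort's approval rate to a reference cohort's rate. The sketch below assumes that formulation. The 0.8 review threshold is the "four-fifths" heuristic from US employment practice, used here purely as a placeholder; it does not come from the Act, and the right metric and threshold are for your legal and data-science teams to set.

```python
from collections import defaultdict


def approval_rates(decisions) -> dict:
    """decisions: iterable of (cohort, approved: bool) pairs."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [approved, total]
    for cohort, approved in decisions:
        counts[cohort][1] += 1
        if approved:
            counts[cohort][0] += 1
    return {cohort: a / n for cohort, (a, n) in counts.items()}


def impact_ratios(rates: dict, reference: str) -> dict:
    """Each cohort's approval rate divided by the reference cohort's rate."""
    ref_rate = rates[reference]
    if ref_rate == 0:
        raise ValueError("reference cohort has a zero approval rate")
    return {cohort: rate / ref_rate for cohort, rate in rates.items()}


def cohorts_needing_review(ratios: dict, threshold: float = 0.8) -> list:
    # threshold is a placeholder assumption, not a regulatory value
    return sorted(c for c, r in ratios.items() if r < threshold)
```

For example, a cohort approved 5 times out of 10 against a reference cohort approved 8 times out of 10 has a ratio of 0.625 and would be routed for review under the placeholder threshold.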

Buy

Categories to evaluate: ML observability platforms with fairness modules (ensure integrity claims meet the Article 12 bar), responsible-AI governance platforms that provide FRIA templates and workflow, regulated-industry reporting services with incident-routing to competent authorities in your jurisdiction.

EVE AI Core

core/governance/unified_audit_bus.py provides the unified tamper-evident substrate that post-market monitoring can draw from (same integrity guarantees as the Article 12 log). core/governance/merkle_aggregator.py publishes signed Merkle roots suitable for regulator-facing integrity attestation. core/accountability/telemetry.py captures decision-level telemetry joined to the audit chain.

Composite Scorecard Template

Copy this table onto the whiteboard. Score honestly.

Obligation	Current (0–4)	Target	Gap	Effort (person-weeks)	Dependencies / Blockers
Art. 12 — Record-keeping
Art. 14 — Human oversight
Art. 15 — Accuracy, robustness, cybersecurity
Arts. 26 & 27 — Post-market + FRIA
Totals

Rule of thumb for effort sizing

Each level of maturity closed costs roughly 8–12 engineer-weeks for the core implementation, plus 4–8 weeks for hardening and evidence collection.
Crossing the Level 2 → Level 3 boundary (adding tamper-evidence) is the expensive jump. Budget generously.
Crossing Level 3 → Level 4 (adding independent verifiability) is mostly about publication and documentation once the cryptographic primitives are in place — but requires coordination with legal and external auditors.
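The first rule of thumb above is simple arithmetic and can be applied to the whole scorecard at once. The sketch below takes it literally (8–12 weeks of core work plus 4–8 weeks of hardening per level closed) and ignores the caveat that the Level 2 → Level 3 jump tends to cost more; the function names and the sample scores are made up for illustration.

```python
CORE_WEEKS = (8, 12)      # per level closed, core implementation
HARDENING_WEEKS = (4, 8)  # per level closed, hardening and evidence collection


def effort_range(current: int, target: int) -> tuple:
    """Return (low, high) engineer-weeks to move from current to target maturity."""
    if not (0 <= current <= target <= 4):
        raise ValueError("levels must satisfy 0 <= current <= target <= 4")
    levels = target - current
    low = levels * (CORE_WEEKS[0] + HARDENING_WEEKS[0])
    high = levels * (CORE_WEEKS[1] + HARDENING_WEEKS[1])
    return low, high


def scorecard_totals(rows: dict) -> tuple:
    """rows: obligation -> (current, target). Returns summed (low, high)."""
    ranges = [effort_range(current, target) for current, target in rows.values()]
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


# Sample scores, invented for this example.
sample = {
    "Art. 12": (2, 4),
    "Art. 14": (1, 3),
    "Art. 15": (2, 3),
    "Arts. 26 & 27": (0, 2),
}
total_low, total_high = scorecard_totals(sample)
```

With the sample scores, seven levels are closed in total, so the literal rule of thumb gives 84 to 140 engineer-weeks. Treat that as a sizing conversation starter rather than an estimate.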

Priority Matrix — Effort vs. Regulatory Severity

Plot each gap on this 2x2 based on the scorecard. Work the lower-right quadrant (Do Today) first, then fund the upper-right quadrant (Hard Yards).

Effort	Low Severity	High Severity
High Effort	Defer: Monitor. Revisit when a lower-effort workaround becomes available or when severity shifts.	Hard Yards: Plan now. Fund fully. These are the gaps that fail an audit and cannot be patched in a sprint.
Low Effort	Quick Wins: Batch into the next sprint. No reason these are still open.	Do Today: Blocks everything. Execute before any other compliance work on the stack.

How to categorize regulatory severity

HIGH: Article 15 gaps (a manipulatable system is the largest exposure), Article 12 Level 0–1 gaps (no reconstructable record = cannot defend any decision).
MEDIUM: Article 14 gaps where a pre-execution gate is missing (humans cannot intervene), Articles 26/27 gaps where no FRIA exists at all.
LOW: Upgrades between the top maturity levels (e.g., Level 3 → Level 4 independent verifiability) where the underlying control already works.

How to categorize effort

HIGH: Anything that requires cross-system refactoring (e.g., replacing the logging substrate), anything that requires new staffing (the overseer queue), anything that requires key management infrastructure.
LOW: Documentation, test-suite expansion, publishing already-computed artifacts, rotating already-implemented signing keys.
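Once each gap has an effort and a severity label, placing it on the 2x2 is a lookup. The sketch below assumes both labels have already been reduced to "high" or "low"; how a MEDIUM severity is resolved to one side of the grid is left to the team and is not decided here. The function name is hypothetical.

```python
QUADRANTS = {
    ("low", "high"): "Do Today",
    ("low", "low"): "Quick Wins",
    ("high", "high"): "Hard Yards",
    ("high", "low"): "Defer",
}


def quadrant(effort: str, severity: str) -> str:
    """Map an (effort, severity) pair onto the priority-matrix quadrant name."""
    key = (effort.strip().lower(), severity.strip().lower())
    if key not in QUADRANTS:
        raise ValueError(f"effort and severity must each be 'high' or 'low', got {key}")
    return QUADRANTS[key]
```

For instance, replacing the logging substrate to fix an Article 12 Level 0 gap is high effort and high severity, so `quadrant("high", "high")` places it in Hard Yards.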

Critical Failure Modes

Watch for these in your own implementation and in vendor evaluations. Each of them produces something that looks compliant to a casual reviewer but fails under audit.

1. Compliance Theater

A 40-page document describing a process that is not implemented in code. An auditor asks to see the last 10 decisions the process produced; you cannot, because the process is aspirational. If the control is not enforced in code, it is not a control.

2. LLM-as-Governance-of-LLM

Using a non-deterministic model to evaluate the safety of another non-deterministic model. The governance layer inherits every failure mode of the thing it governs, plus new ones. A regulator will identify this as circular reasoning. Deterministic enforcement layers (pure functions, rule-based, version-pinned) must sit between input and inference.

3. Retrospective Logging

Logs generated after the decision, by a process that reads downstream effects and reconstructs the decision. The "log" is a derivation, not a record. If the derivation logic changes, yesterday's logs change. Genuine record-keeping writes the log synchronously with the decision, on the critical path.

4. Append-Permissive Ledgers

A log store that enforces chronological append but allows administrators to insert entries with backdated timestamps. This is not tamper-evidence — it is tamper-convenience. Real tamper-evidence is cryptographic (hash chain, Merkle tree) and produces a detectable signature break on modification.

5. Monitoring the Monitor

The post-market monitoring pipeline is itself an unmonitored, un-versioned artifact. When it silently breaks, no one knows. A compliance control you cannot verify is operational is a liability, not an asset.

6. "It's in the Backlog"

The most common failure mode. Score what you have today, not what you have planned. A Jira ticket is Level 0.

7. Evidence on Demand, Not Evidence by Design

If producing the audit artifact requires a two-week engineer sprint, it is not a control — it is a reconstruction. The artifact must be a byproduct of normal operation.

What Good Looks Like: Reference Architecture

A fully-compliant stack for a high-risk AI system has six structural components. A stack missing any of these six pieces has a structural compliance gap that no amount of documentation will close.

1. Deterministic pre-execution gate

Receives every proposed decision, evaluates it against version-pinned policy, and returns ALLOWED / BLOCKED / MODIFIED with a risk assessment. This gate is a pure function where possible, is exhaustively unit-tested, and fails closed on any internal error. EVE AI Core reference: core/coreguard/ (POST /v1/decisions/evaluate) backed by core/governance/veto_core.py (pure, zero-I/O) and core/governance/failure_mode_invariant.py (127 pillars, Hard-Fail-Shut).

2. Tamper-evident audit substrate

Receives every decision, every approval, every override, and every configuration change, canonicalizes the payload, signs it, chains it, and aggregates into publishable Merkle roots. EVE AI Core reference: core/governance/unified_audit_bus.py (16 source systems, JCS-canonicalized, HMAC-signed), core/governance/jcs_canonicalize.py (RFC 8785), core/governance/merkle_aggregator.py (batched roots).

3. Human oversight surface

Cryptographic action attribution, a pre-execution approval workflow for high-consequence decisions, and an anomaly queue with SLA. EVE AI Core reference: Operator Console + core/governance/action_registry.py (propose → approve → execute) + core/tasks/rbac.py (5 roles, 9 permissions) + core/tasks/operator_audit.py (JSONL audit trail).

4. Continuous adversarial testing and drift monitoring

Refreshed corpus, cohort-aware metrics, and release-gating on degradation. Feeds the same audit substrate as the rest of the stack. EVE AI Core reference: core/governance/novel_attack_detector.py (27-class TF-IDF semantic similarity backup) + tests/test_failure_mode_invariant.py (56 attack-vector tests) + core/accountability/telemetry.py.

5. Versioned FRIA

Living in the same repository as the model and policy releases, updated on every release, with impact categories joined to the incident taxonomy. EVE AI Core reference: core/governance/unified_audit_bus.py for the telemetry pipeline, plus FRIA as a repo-resident markdown artifact gated by release review.

6. Independent verifiability

A third party, given only your public verification keys and an export, can verify every decision's integrity, every approval's authenticity, and every published Merkle root — without access to your infrastructure or tooling. EVE AI Core reference: EVE Proof SDK (sdks/proof/, eve-proof PyPI package) and POST /api/tve/verify-attestation.
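The Merkle-root part of this property can be demonstrated with the standard library: a verifier holding one exported record, its inclusion proof, and the published root needs nothing else. The sketch below is not the EVE Proof SDK; the tree layout, prefix bytes, and function names are assumptions, and signature verification against public keys is omitted because it requires an asymmetric scheme outside the standard library.

```python
import hashlib
import hmac


def _leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_proof(leaves: list, index: int):
    """Publisher side: return (root, proof) for leaves[index].

    proof is a list of (sibling_hash, sibling_is_left) pairs, bottom-up.
    """
    if not 0 <= index < len(leaves):
        raise ValueError("index out of range")
    level = [_leaf(data) for data in leaves]
    proof = []
    while len(level) > 1:
        sibling = index ^ 1
        if sibling < len(level):  # an unpaired last node has no sibling
            proof.append((level[sibling], sibling < index))
        nxt = [_node(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # promote the unpaired node unchanged
        level = nxt
        index //= 2
    return level[0], proof


def verify_inclusion(data: bytes, proof: list, published_root: bytes) -> bool:
    """Verifier side: needs only the record, the proof, and the root."""
    current = _leaf(data)
    for sibling, sibling_is_left in proof:
        current = _node(sibling, current) if sibling_is_left else _node(current, sibling)
    return hmac.compare_digest(current, published_root)
```

The verifier never sees the other records or the publisher's infrastructure, only sibling hashes, which is what lets an auditor check a single decision against a root published elsewhere.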

Compliance is a property of your system, not of any single vendor's product. This framework is designed to be copied, adapted, and used — including by customers evaluating vendors other than EVE.

Next Step

Once you have completed this gap analysis and populated the composite scorecard, proceed to the Compliance Roadmap (Step 4), which is the technical implementation plan. The roadmap turns each identified gap into a sequenced engineering plan with module-level design, test plans, and evidence artifacts suitable for an external audit.

Step 4: Compliance Roadmap. Turn your scorecard gaps into a sequenced engineering plan with module design, tests, and audit-ready evidence artifacts.

Step 1: Risk Classification. Confirm your system is in scope. Prohibited, high-risk, limited-risk, minimal-risk, or GPAI.

Step 2: Obligation Mapping. Break each applicable article into concrete technical controls and map them to your existing stack.

Architecture: EVE CoreGuard. Deterministic pre-execution governance that addresses Articles 12, 14, and 15 by construction.

Verification: EVE Proof. Third-party-verifiable decision certificates for Article 12 record-keeping and Article 26 reporting.

Engagement: Schedule an Assessment. Work with our compliance-architecture team to score your current state and scope the remediation work.

Not legal advice. This framework is a compliance-engineering resource designed to support gap-analysis discussions between technical and legal teams. It is not a substitute for advice from qualified counsel admitted in a relevant Member State jurisdiction. Article references are to Regulation (EU) 2024/1689 as published in the Official Journal; implementing acts and AI Office guidance may refine individual obligations. Confirm your maturity scoring and remediation plan in writing before relying on it for a deployment, market-placement, or conformity-assessment decision. Published under the same license as the rest of the EVE AI Core documentation — designed to be copied, adapted, and used, including by customers evaluating vendors other than EVE. EVE AI Core makes no representation that use of this framework creates an attorney-client relationship or satisfies any regulatory obligation.
