
Manual red teaming is the gold standard for AI safety evaluation. A skilled adversarial tester probes the system with creative attacks, documents failures, and provides recommendations for hardening. The problem is that manual red teaming doesn't scale. A human tester might generate a few hundred attack vectors in a week. An AI system in production faces millions of interactions per day, and adversaries are not constrained to a schedule.
AEGIS — Adversarial Evolution of Governance through Iterative Self-Hardening — was built to solve this problem. It is a closed-loop adversarial testing system that continuously generates attacks against EVE AI Core's governance infrastructure, evaluates the results, identifies bypasses, auto-generates detection patterns, and re-tests. The loop never stops.
Why Manual Red Teaming Isn't Enough
Manual red teaming has three structural limitations:
- Coverage. No human team can anticipate every encoding, every multi-turn manipulation strategy, every combination of context injection and persona hijack. The attack surface of a production AI system is combinatorially vast.
- Speed. Adversaries evolve faster than quarterly assessments. A new jailbreak technique can spread across the internet in hours. By the time a manual report is written, reviewed, and acted upon, the window of vulnerability may already have been exploited.
- Consistency. Human testers have biases. They tend to test the attacks they know, the categories they specialize in, and the techniques that have worked before. Novel attack classes — especially cross-domain combinations — fall through the gaps.
The AEGIS Loop
AEGIS operates as a five-stage closed loop that runs continuously against the governance stack:
- Generate. The attack generator produces adversarial inputs across all 57 categories. Generation is parameterized by rigor level (1.0 to 1000.0), meaning early passes use simple attacks and later passes use increasingly sophisticated combinations, encodings, and multi-turn strategies.
- Evaluate. Each generated attack is executed against the full governance pipeline — charter rules, CRD scoring, cognitive locks, and hardware veto. The result is classified as BLOCKED, DETECTED, PARTIAL_BYPASS, or FULL_BYPASS.
- Identify. Partial and full bypasses are analyzed to extract the mechanism of evasion. What rule did it circumvent? What encoding did it use? What context manipulation was required?
- Harden. For each identified bypass mechanism, AEGIS generates detection patterns — compiled regex, token sequence matchers, and semantic classifiers — that are added to the governance stack. These patterns are tested against the existing corpus to ensure they don't create false positives.
- Re-test. The hardened system is re-tested against the full attack corpus, including the attacks that previously succeeded. The cycle repeats until no bypasses remain at the current rigor level, then rigor escalates.
The 57 Attack Categories
AEGIS organizes attacks into categories that cover the full spectrum of adversarial techniques observed in production AI systems:
- Prompt injection — Direct instruction override, indirect injection via retrieved content, system prompt extraction
- Governance bypass — Attempts to disable safety checks, modify charter rules, or convince the system to ignore its own constraints
- Credential leaks — Extraction of API keys, internal configurations, system prompts, or architecture details
- Persona hijack — Forcing the system to adopt a different identity, "jailbreak" personas like DAN, or role-play scenarios that erode boundaries
- Encoding evasion — Base64, ROT13, Unicode homoglyphs, zero-width characters, and other techniques to smuggle content past text-based detectors
- Multi-turn manipulation — Slowly escalating across multiple messages, establishing trust before requesting harmful actions
- Context poisoning — Injecting false information into the conversation history to influence future responses
- Authority impersonation — Claiming to be an administrator, developer, or authorized user to gain elevated privileges
Each category has sub-categories (totaling 210+ distinct attack patterns), and each sub-category has multiple generation templates parameterized by rigor level.
Escalating Rigor
The rigor parameter controls the sophistication of generated attacks. At rigor 1.0, AEGIS generates straightforward, well-known attacks: basic prompt injection, obvious persona hijack attempts, unencoded harmful requests. These are the attacks that any competent safety system should block.
At rigor 10.0, attacks become more nuanced: multi-turn context manipulation, payload splitting across messages, Unicode evasion, and combinations of techniques. At rigor 100.0, attacks chain together multiple evasion strategies with adversarial suffixes, context flooding, and attention manipulation. At rigor 1000.0, AEGIS generates novel combinations that have never been documented in any public red team report.
Key insight: The rigor escalation is monotonic. AEGIS never reduces rigor once a level is cleared. This ensures that the governance stack is always being tested against the most sophisticated attacks it has ever faced — and that hardening at one level doesn't create regressions at lower levels.
Manual vs. AEGIS
| Dimension | Manual Red Team | AEGIS |
| Attack generation speed | ~100 vectors/week | ~10,000 vectors/hour |
| Category coverage | Tester-dependent | All 57 categories per run |
| Pattern generation | Manual documentation | Auto-compiled regex |
| Regression testing | Re-run manually | Continuous re-testing |
| Novel attack discovery | Depends on expertise | Combinatorial generation |
| Cost scaling | Linear with team size | Fixed compute cost |
| Availability | Business hours | 24/7 continuous |
AEGIS does not replace human red teamers. It amplifies them. Human testers bring creativity, domain expertise, and the ability to reason about novel attack surfaces that AEGIS's generation templates haven't yet covered. AEGIS brings scale, consistency, and the ability to run continuously without fatigue or bias.
The Hardening Feedback Loop
The most valuable output of AEGIS is not the attacks it generates — it is the detection patterns it creates. Every time AEGIS identifies a bypass, it generates a detection pattern that is added to the governance stack. These patterns are compiled into efficient regex matchers, token sequence detectors, or semantic classifiers that operate at sub-millisecond latency.
Over time, this creates a self-hardening system. Each attack that succeeds once will never succeed again through the same mechanism. The governance stack grows more resilient with every AEGIS cycle, not because a human reviewed a report and wrote a patch, but because the hardening loop is automatic and continuous.
The only way to stay ahead of adversaries is to make the adversary part of the system.
AEGIS runs continuously in our staging environment and on a regular cadence in production. Every governance update, every charter rule modification, and every new CRD threshold is tested against the full attack corpus before deployment. The system that protects EVE AI Core is not a static set of rules — it is a living, evolving adversarial defense that has been tested against more attacks than any human team could generate in a lifetime.