---
title: Break Your Own AI Agent: A Practical Red-Team Framework for Builders (Part 2)
description: A practical six-phase red-team framework for enterprise AI agents: Scope, Threat-Model, Attack, Evidence, Remediate, Re-test. Harness-agnostic (Claude Code, OpenClaw, custom). Findings mapped to OWASP ASI, MITRE ATLAS, ISO 42001, NIST AI RMF, AIUC-1.
url: https://ziosec.com/blog/break-your-own-ai-agent-a-practical-red-team-framework-part-2
category: Blog
publishedAt: 2026-04-23
author: Andrius Useckas
authorRole: Co-Founder & CTO
tags: ai-red-teaming, ai-agent-security, offensive-security, enterprise-red-team, ai-pentesting
---

# Break Your Own AI Agent: A Practical Red-Team Framework for Builders (Part 2)

*This is Part 2 of the Break Your Own AI Agent series. [Read Part 1](/blog/break-your-own-ai-agent-why-proactive-security-testing-is-essential-for-builders-part-1) for the philosophy behind proactive AI agent security and the unique attack surface AI agents introduce.*

## TL;DR

* Part 1 made the case. Part 2 gives you the framework.
* Six phases: Scope, Threat-Model, Attack, Evidence, Remediate, Re-test. Every phase has a concrete deliverable.
* Harness-agnostic. The framework works whether your agent runs on Claude Code, OpenClaw, a custom-built stack, or a mix. Harnesses change; the attack surface does not.
* No magic attack counts. We don't tell you we have "225 attack patterns." We tell you we generate attack chains across model, protocol, and tool layers. The number of possible combinations is in the millions.
* Findings map to the frameworks your auditors ask about: OWASP ASI, MITRE ATLAS, ISO 42001, NIST AI RMF, AIUC-1, EU AI Act.

---

## Why "Break Your Own Agent" Needs a Framework, Not a Vibe

In Part 1 we argued that every builder should adopt a proactive, adversarial mindset and systematically try to break their own AI agents. That is the philosophy. Philosophy is necessary. It is not enough.

What we see in enterprise environments, after dozens of engagements, is that the teams who want to "break their own agent" and have no framework do one of three things:

1. They spend a Friday afternoon pasting known jailbreaks from GitHub into their agent, see nothing break, and declare victory.
2. They build a one-off harness, run a few hundred prompt variations, and generate a spreadsheet that nobody ever reads again.
3. They hire a traditional pentest firm that tests the web endpoints and the authentication layer, leaving every AI-specific attack surface (prompt injection, tool misuse, inter-agent trust, RAG poisoning, memory corruption) untouched.

None of these produce evidence. None of them give compliance teams something to work with. None of them catch the actual attacks.

The framework below is what produces evidence. It is what the ZioSec platform automates for customers, and what our engineers run manually when they lead a full engagement. You can run it yourself, with tooling you already have, provided you follow the phases in order.

## The Framework: Six Phases

### Phase 1: Scope

Before you run a single test, write down:

* **What is the agent?** A harness (Claude Code, OpenClaw, custom) plus a model plus a set of tools plus a set of data sources plus a set of memory and state.
* **What does it have access to?** Enumerate every tool, every API, every database, every secret, every file system path. If you can't enumerate this, you already have an inventory problem, and an attacker will find what you missed.
* **What is the blast radius of full compromise?** Assume an attacker fully controls the agent's output. What can they touch? What can they exfiltrate? What can they destroy?
* **What is the compliance posture you are testing against?** Pick a framework. OWASP ASI is the baseline. MITRE ATLAS gives you a shared taxonomy with the rest of the security community. ISO 42001 gives you the management-system scaffolding. NIST AI RMF gives you the governance scaffolding. AIUC-1 gives you AI-agent-specific controls. Enterprises usually test against a combination.

**Deliverable:** A one-page scope document that answers these four questions. If you cannot fill it out, you are not ready to test.

### Phase 2: Threat Model

For each tool, data source, and memory store the agent has access to, ask:

* What would an attacker want to do with this?
* How would they get the agent to do it for them?
* What would the attack look like from the agent's perspective (a prompt), from the tool's perspective (a call), and from the target's perspective (an event)?

This is the step most builders skip because it feels abstract. It is the step that makes the rest of the framework work. A threat model that enumerates ten realistic attack paths is worth a thousand random jailbreak attempts.

**Deliverable:** A threat model document, one page per tool, listing the abuse scenarios and the expected attack shape. We'll use this as the test plan.

### Phase 3: Attack

This is where most teams start, and it's why they fail. Without scope and a threat model, attacking is noise generation.

With a threat model, attacking is evidence gathering. For each abuse scenario:

* **Craft a goal-based attack.** Not a prompt from a list, a specific objective. "Make the agent exfiltrate the contents of the `secrets` table via a tool call." "Make the agent send an email to an external address containing internal-only content." "Make the agent execute a command on the container host."
* **Run adversarial chains.** Single-prompt attacks often fail. Chained attacks across multiple turns, across multiple tools, or across an indirect prompt-injection plus tool-call combo succeed. This is where ZioSec spends most of its time. A chain is a sequence of prompts and observations that takes the agent from a benign starting state to the attack goal.
* **Test across harnesses.** If your org has agents on Claude Code *and* OpenClaw *and* a custom stack, every attack chain should be run against each. The surface shape is similar. The specific exploits differ.
* **Log everything.** Every prompt, every response, every tool call, every tool return. The log is the evidence.

**Deliverable:** A running attack log with every chain attempted, what happened, and whether the goal was achieved. Chains that succeeded are findings.

### Phase 4: Evidence

For each finding, produce an evidence package a CISO, compliance lead, and auditor can each use:

* **What was achieved.** The attack goal, in one sentence.
* **How it was achieved.** The chain of prompts and tool calls, reproducible step-by-step.
* **Framework mapping.** Which OWASP ASI risk does this hit? Which MITRE ATLAS technique? Which ISO 42001 control is this relevant to? Which NIST AI RMF function is implicated? Which AIUC-1 domain?
* **Severity.** CVSS-ish (Critical, High, Medium, Low) with reasoning tied to your blast-radius analysis from Phase 1.
* **Remediation.** 30-day fix (input validation, tool scoping), 90-day fix (policy engine, runtime guardrails), 180-day fix (architectural control changes).

If your findings don't map to the framework your compliance team uses, the finding is worthless to them. Evidence is a first-class output.

**Deliverable:** One evidence document per finding. Severity-ordered. Mapped to frameworks.

### Phase 5: Remediate

This is the engineering team's phase, not the red team's. But the red team stays engaged because:

* Remediation proposals must be reviewed. A fix that introduces a new vulnerability is common.
* Some fixes change the threat model. When you lock down a tool call's parameters, a new attack chain becomes possible via a different tool. Re-threat-model.
* The remediation timeline must be tracked. ISO 42001 and NIST AI RMF both require evidence of remediation, not just evidence of findings.

**Deliverable:** A remediation tracker. Finding, owner, status, target date, verification criteria.

### Phase 6: Re-Test

A finding isn't closed until the exact attack chain that produced it has been re-run against the remediated system and has failed.

This is also where most teams stop being honest. "We patched input validation, it must be fixed." Prove it. Run the chain. If it fails, close the finding. If it succeeds again with a minor variation, the remediation was incomplete.

**Deliverable:** A verification log. Chain, pre-remediation result (succeeded), post-remediation result (failed, or succeeded with a variation that produces a new finding).

## Doing This Yourself vs. Getting Help

Some teams can run this framework in-house. What it takes:

* An engineer with offensive-security instincts willing to spend 25 to 50% of their time on it.
* Access to the agent's runtime, logs, and source code.
* Time. A real engagement is measured in weeks, not days.
* Tooling. At minimum, a prompt harness, a tool-call tracer, and a way to run chains programmatically.

Some teams can't, or shouldn't. The reasons are predictable:

* No one on the team has run a red-team engagement before, and AI agent attack paths do not look like web-app attack paths.
* Leadership wants evidence-mapped-to-framework deliverables that internal teams won't produce in the format auditors need.
* The timeline pressure (EU AI Act enforcement begins August 2, 2026) doesn't leave room for a six-month internal ramp.
* The agent is in production with enterprise data access, and the cost of a missed finding is too high for a first-time internal team.

In those cases, **schedule an engagement with us**. [Our AI Agent Pentest](/ai-agent-pentesting) starts at $10,000, findings map to OWASP ASI, MITRE ATLAS, ISO 42001, NIST AI RMF, and AIUC-1 out of the box, and 100% of the engagement fee applies as credit toward an annual platform subscription if you move to continuous adversarial testing.

## FAQ

**How long should a first internal red-team engagement take?**
Three to six weeks for a medium-complexity agent (one harness, five to ten tools, one or two data sources). Less than two weeks is shallow. More than eight weeks is scope creep.

**Do I need the agent's source code?**
No, but it helps. The framework works against black-box agents (you only have prompt access and observable behavior) and white-box agents (you have the full source). White-box catches more. Black-box is closer to the real threat model.

**Is this the same as running red-team prompts from GitHub?**
No. Jailbreak prompt lists are a useful primitive. They're a fraction of Phase 3, the "attack" phase. Running them without Phases 1, 2, 4, 5, and 6 produces output, not evidence.

**How do I know if my remediation is complete?**
The original attack chain fails *and* three variations of it (different phrasing, different tool sequencing, different indirect injection vector) also fail. If any variation still succeeds, remediation is partial.

**What if my agent runs on multiple harnesses (Claude Code, OpenClaw, custom)?**
Run the framework once per harness. The attack surface is similar but not identical. Specific exploit primitives differ, and an agent's response to the same attack chain will depend on the harness's prompting and tool conventions.

---

*This concludes the Break Your Own AI Agent series as a two-parter. If there's interest, we'll write a Part 3 on running this framework continuously, from one-time engagement to always-on adversarial testing in production.*