
Agentic AI Red Teaming: A Practitioner's Guide
Agentic AI red teaming is the adversarial testing of AI systems that plan, make decisions, and take actions over many steps using external tools, rather than answering a single prompt. The goal is to find sequences of inputs and conditions that drive an autonomous agent to misuse its tools, exceed its authority, leak data, or cause real-world impact, then prove those paths end to end. It treats the agent as an attack surface that holds state and credentials across a session, not as a stateless text box.
What is agentic AI red teaming?
An agent is a model wrapped in a loop. It reads a goal, decides on an action, calls a tool, reads the result, and decides again, often dozens of times before it stops. That loop is the thing under test. Red teaming it means treating the whole control flow as the target: the planning, the tool selection, the memory it carries between turns, and the permissions attached to each tool it can reach. You are not asking whether the model can be made to say a bad word. You are asking whether the system can be made to take a harmful action against a real resource.
This is the practical expression of two risks that the OWASP Top 10 for LLM Applications names directly. Prompt injection sits at LLM01 because the model reads instructions and data on the same channel and cannot reliably tell them apart. Excessive agency sits at LLM06 and breaks into three root causes: excessive functionality, where the agent can reach tools it does not need; excessive permissions, where those tools run with broader rights than the task requires; and excessive autonomy, where high-impact actions execute with no human in the loop. Agentic red teaming exists to find where an injection meets excessive agency, because that intersection is where a text trick becomes an incident.
The mental model that holds up under engagement is simple. The model is the decision engine, the tools are the hands, and the orchestration layer is the nervous system that connects them. An attacker who controls any input the model reads, a web page, a document, a tool result, a prior message, can attempt to steer the decision engine into using the hands in ways the operator never intended. Your job on a red team is to demonstrate that path concretely, not to speculate that it might exist.
How does it differ from single-prompt LLM testing?
Single-prompt testing asks one question and grades one answer. You send an input, you read an output, and the interaction ends. Agentic testing is stateful. The agent carries a goal, a scratchpad, a conversation history, and often a long-term memory across many turns. An attack that fails on turn one can succeed on turn nine because the agent has accumulated context, fetched a poisoned document, or built up enough trust in a corrupted instruction that it stops questioning it. You are testing a trajectory, not a reply.
The behavior is also non-deterministic in a way that breaks naive test design. The same input can produce different action sequences across runs because of sampling, changing tool results, and the agent's own prior choices. A finding that reproduces three times in ten runs is still a finding, and a defense that holds nine times in ten is still broken. This is why agentic red teaming leans on repeated trials and probabilistic reporting rather than a single pass or fail. You measure how often an attack path lands, not whether it landed once.
Tool access is the difference that matters most. A single-prompt model produces text and nothing happens. An agent with tool access reads email, queries databases, executes code, calls internal APIs, and increasingly orchestrates other agents. When a model can act, the blast radius of one successful injection expands from a bad sentence to a deleted record, an exfiltrated secret, or a fraudulent transaction. Many agents now reach those tools through the Model Context Protocol, the open standard that connects models to external systems, which means MCP server trust and tool descriptions become part of the attack surface you are paid to test.
Finally, the failure surface is broader and quieter. A jailbroken chatbot fails loudly: it says something it should not. A compromised agent can fail silently, taking a sequence of individually plausible actions that add up to a breach, with no single step that looks obviously wrong in a log. Detecting that requires you to trace the whole chain and reason about intent across steps, which single-prompt evaluation never has to do.
The agent kill chain
Attacks on autonomous agents follow a structure worth naming as a kill chain, because each stage gives the defender a place to break the sequence. The first stage is reconnaissance. The attacker probes what the agent can do: which tools it exposes, how it describes them, what it refuses, how it handles errors, and what data sources it reads without question. Much of this is observable from outside by feeding the agent crafted inputs and watching how its plans change. The agent's own verbosity, its willingness to explain its reasoning, often hands the attacker a map of the system.
The second stage is injection. The attacker plants instructions where the agent will read them. Direct prompt injection puts the payload in the user turn. Indirect prompt injection, the more dangerous variant for agents, hides instructions in content the agent fetches on its own: a web page it browses, a document it summarizes, a calendar invite it reads, a tool result it trusts, or a record returned from a poisoned retrieval store. The agent treats that planted text as a legitimate instruction because it arrived on the same channel as its data. The operator never typed it and may never see it.
The third stage is escalation through chained tool calls. A single injected instruction rarely causes the full impact on its own. The attacker steers the agent to use one tool to enable the next: read a credential from one system, use it to authenticate to another, query a directory to find a target, then act on that target. Each call is individually within the agent's granted permissions, which is exactly why excessive agency is dangerous. The agent composes a harmful outcome out of authorized parts, and no single step trips a permission check.
The final stage is impact. The agent exfiltrates data to an attacker-controlled destination, sends messages as the victim, modifies or deletes records, moves money, or pivots into connected systems. The red team's deliverable is to walk this entire chain on the target, from the first crafted input to the demonstrated impact, and to mark every point where a control should have stopped it and did not. A finding that ends at injection without showing impact is incomplete; a finding that walks to impact is the one a CISO acts on.
Autonomous attacker agents and hackbots
The defense is autonomous, so the offense has to be autonomous too. A human tester can try a few dozen multi-turn attack paths against an agent in a day. An autonomous attacker agent, what we build and run as a hackbot, can try thousands, adapt each attempt based on the last response, and explore branches a human would not have the patience to chase. The hackbot drives the conversation, reads the target agent's behavior, updates its strategy in real time, and pursues a goal such as extracting a secret or triggering a forbidden tool call across as many turns as it takes.
This adaptivity is the core of the method. A static list of jailbreak strings ages out the moment a model is updated or a guardrail is patched. An attacker agent does not depend on a fixed payload; it reasons about why an attempt was refused and constructs the next attempt from that reasoning. When the target agent blocks a direct request, the hackbot reframes it, splits it across turns, embeds it in a plausible task, or routes it through a tool the target trusts. It is the same loop the defending agent runs, pointed at the defending agent.
Autonomous attacker agents also compress the timeline. Recent work in the field describes agentic red teaming moving from weeks of manual effort to hours of automated, adaptive testing against a live target. That speed is not a vanity metric. It means coverage. You can run an attack campaign before every release, regenerate it when the model or the tool set changes, and treat agent security as continuous rather than a once-a-year exercise that is stale before the report is delivered. Human expertise sets the objectives, designs the scenarios, and judges the findings; the hackbot does the volume.
Multi-turn and trust-building attacks
The most effective attacks on agents are rarely one clever sentence. They are conversations. A multi-turn attack establishes a benign context first, gets the agent to commit to a framing, and then exploits that commitment. The agent agrees to help with a task, accepts a role, or acknowledges a premise, and several turns later the attacker cashes in that earlier agreement to push an action the agent would have refused if asked cold. The agent's own consistency, its tendency to stay coherent with what it already said, becomes the lever.
Trust-building works because agents are designed to be helpful and to maintain state. An attacker can spread a payload across many messages so that no single turn looks malicious, a technique that defeats filters tuned to inspect one message at a time. The attacker can also poison the agent's working memory early in a session, planting a false fact or a false instruction that the agent then carries forward and acts on long after the original message has scrolled out of the immediate context window. By the time the harmful action fires, the cause is buried turns back.
These attacks connect directly to jailbreaking, but extend it. Classic jailbreaking aims to get a model to produce forbidden content. Multi-turn agent attacks aim to get an autonomous system to take a forbidden action, which is a higher bar and a worse outcome. The red team has to test the full session, replay long conversations, and probe how the agent handles contradictions between its system instructions and instructions injected later. A guardrail that checks only the latest message will miss every attack built this way, so the test design has to look across the whole trajectory.
Scoping an agentic red team engagement
Scoping starts with a question most engagements skip: what is this agent actually allowed to do, and to what? Before any attack, enumerate every tool the agent can call, the permissions behind each tool, the data sources it reads, the systems it can reach, and the actions that are irreversible. That inventory defines the blast radius and tells you which findings will matter. An agent that can only read public documents is a different engagement from one that can send wire transfers, and the scope has to name that difference in writing.
Decide the rules of engagement around impact. Agents act on real systems, so a red team has to agree in advance on which actions may be executed for real, which must stop at proof of capability, and which are strictly off limits. A common and sound approach is to test against a staging environment that mirrors production tools and data, so that a demonstrated exfiltration or deletion is real enough to be credible but contained enough to be safe. Define the success conditions up front: the specific secrets, actions, or boundary violations that count as a win, so findings are unambiguous rather than a matter of interpretation.
Map the engagement to the frameworks your client already reports against, because that is what makes the work actionable inside their program. The relevant anchors are the OWASP Top 10 for LLM Applications, OWASP's dedicated agentic AI guidance, MITRE ATLAS for adversarial tactics and techniques against AI systems, and the NIST AI Risk Management Framework with its generative AI profile and its newer agentic extensions. Tagging each finding to these frameworks turns a list of clever attacks into a remediation plan that maps onto controls the organization is already obligated to manage.
What to measure in an agentic red team
Because agent behavior is non-deterministic, the unit of measurement is the rate, not the single event. For each attack objective, report how often it succeeds across a fixed number of trials. An attack success rate of three in ten is a real exposure, and it should be reported as one. A model update that drops a success rate from forty percent to five percent is progress worth quantifying, and a defense that claims to block an attack should be tested enough times to show whether it holds or merely got lucky. Single-shot pass or fail hides the truth about systems that behave probabilistically.
Measure depth, not just success. For every successful path, record how many turns it took, how many tool calls were chained, and which permissions were abused along the way. A breach that takes one turn and one tool is a different severity from one that takes fifteen turns of careful trust-building, and the depth tells the defender where in the chain their controls are thinnest. Track which kill chain stage each control failed at, so remediation can target the cheapest effective break point rather than trying to harden everything at once.
Tie every finding to concrete impact and to a framework control. Severity should reflect what the agent was actually driven to do: data read, data exfiltrated, action taken, system reached, money moved. A finding that says the agent can be jailbroken is weak. A finding that says a poisoned document caused the agent to read a customer record and send it to an external address, abusing two specific tools at two specific permission boundaries, mapped to OWASP excessive agency and a MITRE ATLAS technique, is the finding that gets fixed. Coverage is the last metric: which tools, data sources, and trust boundaries were tested, and which were not, so the client knows the true edge of the assessment.
Reporting and remediation
A good agentic red team report leads with reproducible attack paths, not abstract risk language. Each finding shows the full trajectory: the initial input, the injection point, the chain of tool calls, and the demonstrated impact, with enough detail for an engineer to reproduce it and confirm the fix later. The success rate sits next to each path so the reader knows whether they are looking at a reliable exploit or an intermittent one. The point of the report is to make the invisible chain visible, because the defending team usually cannot see it in their own logs.
Remediation for agents lives mostly at the architecture level, not in the prompt. The strongest fixes constrain agency rather than trying to talk the model out of being attacked. Scope each tool to the minimum permission it needs. Put human approval in front of irreversible or high-impact actions so excessive autonomy cannot fire on its own. Separate trusted instructions from untrusted content so injected text from a fetched page does not carry the authority of a system instruction. Validate and sandbox tool outputs before the agent acts on them. These are structural controls; a system prompt that says please do not get hijacked is not a control.
Treat the engagement as a baseline, not a one-time event. Agents change constantly: new tools, new model versions, new data sources, new connected systems. Each change can reopen a path you closed or open one that did not exist before. The value of building attacks as autonomous, adaptive campaigns is that they can be re-run on every release, so the security posture is verified continuously instead of assumed between annual tests. That continuous, autonomous assurance is the model Krypteia is built around: hackbots that test agents the way agents now operate, at the speed and scale that manual testing cannot match.
- How is agentic red teaming different from a normal LLM penetration test?
- A standard LLM penetration test mostly probes a model's responses to inputs, often one prompt at a time. Agentic red teaming tests a system that holds state, calls tools, and acts over many turns, so the target is the whole decision-and-action loop rather than a single answer. The findings are about actions taken against real systems, not just text the model produced.
- Why are autonomous attacker agents better than a static jailbreak list?
- Static payload lists age out the moment a model or guardrail is updated, and they cannot react to a refusal. An autonomous attacker agent reasons about why an attempt failed and builds the next attempt from that reasoning, adapting in real time across thousands of multi-turn paths. That adaptivity is what mirrors how a real attacker would work against an autonomous target.
- What is the agent kill chain?
- It is the staged structure most attacks on agents follow: reconnaissance to learn the agent's tools and behavior, injection to plant instructions where the agent will read them, escalation through chained tool calls that compose a harmful outcome from authorized actions, and impact such as data exfiltration or unauthorized actions. Naming the stages gives defenders specific points to break the sequence.
- What is indirect prompt injection and why does it matter for agents?
- Indirect prompt injection hides attacker instructions inside content the agent fetches on its own, such as a web page, a document, a calendar invite, or a tool result, rather than in the user's message. It matters for agents because they autonomously read external content and treat it as trusted on the same channel as their data, so a poisoned source can hijack the agent without the operator ever typing the payload.
- How do you scope an agentic red team safely when the agent acts on real systems?
- Start by inventorying every tool, permission, data source, and irreversible action so you know the blast radius. Then agree on rules of engagement: which actions may execute for real, which stop at proof of capability, and which are off limits, usually by testing against a staging environment that mirrors production. Define explicit success conditions up front so findings are unambiguous.
- What should an agentic red team measure?
- Measure attack success as a rate across repeated trials, since agent behavior is non-deterministic, and report the depth of each successful path in turns and chained tool calls. Tie every finding to concrete impact and to a framework control such as OWASP excessive agency or a MITRE ATLAS technique. Record coverage so the client knows which tools and trust boundaries were and were not tested.
- Which frameworks should findings map to?
- The practical anchors are the OWASP Top 10 for LLM Applications, OWASP's dedicated agentic AI guidance, MITRE ATLAS for adversarial tactics against AI systems, and the NIST AI Risk Management Framework with its generative AI profile and agentic extensions. Mapping findings to these turns a list of attacks into a remediation plan that fits controls the organization already reports against.