
AI Agent Security: Threats and Defenses Guide
AI agent security is the practice of protecting LLM-driven agents that can take actions in the world: agents that call tools, run code, query databases, send messages, and persist memory across sessions. It matters because the failure mode changes when you give a model autonomy. A jailbroken chatbot says something wrong; a jailbroken agent with tool access executes code, exfiltrates data, or modifies records. AI agent security treats the model as an untrusted component inside a system and applies access control, sandboxing, output validation, monitoring, and human oversight around it.
Why agents change the risk model
A standalone language model produces text. The blast radius of a bad output is whatever a human does with that text next. An agent removes the human from that loop on purpose. It reads a goal, decides on a sequence of tool calls, and executes them. The model is now wired to a shell, an API client, a database connection, a browser, or a payment system. The same prompt that used to produce an embarrassing sentence can now produce a sequence of side effects that are real, immediate, and sometimes irreversible.
This is the core shift every security team needs to internalize. With a chatbot, the worst case is reputational: it generates offensive content, leaks a snippet of training data, or gives bad advice. With an agent, the worst case is operational: it deletes a production table, wires money to an attacker-controlled account, opens a reverse shell, or emails your customer list to an external address. The vulnerability classes look similar on paper. The consequences do not. Prompt injection against a support bot is annoying; prompt injection against an agent that can issue refunds is fraud.
Autonomy also compounds errors. A single model call has one chance to go wrong. An agent loop has many. Each step feeds its output back as the next step's input, so a small misread early in a task can cascade into a confident, fully-wrong action chain by the end. Agents that plan over multiple steps, retry on failure, and call sub-agents multiply the number of decision points where an attacker can inject a deviation or where the model can simply hallucinate a destructive command. The attack surface is not the prompt. It is the entire loop of perception, planning, and action.
The agent threat landscape
Prompt injection is the foundational threat. The model cannot reliably tell the difference between instructions from its operator and instructions that arrive inside data it processes. Direct prompt injection is when an attacker types a malicious instruction into a field the model reads. Indirect prompt injection is more dangerous for agents: the payload is planted in a web page, a document, an email, a code comment, or a calendar invite that the agent retrieves while doing its job. The agent ingests the poisoned content, treats the embedded text as a command, and acts on it. Because agents are built to fetch and act on external data, indirect injection is not an edge case. It is the main event.
Excessive agency is the second pillar, and it is the gap most teams underestimate. It covers excessive permissions, excessive functionality, and excessive autonomy. An agent given a database credential with full write access has excessive permission even if it only needs read. An agent wired to a tool that can delete files when the task only requires listing them has excessive functionality. An agent allowed to complete a high-impact action without confirmation has excessive autonomy. Most real agent incidents are not exotic. They are excessive agency meeting a routine model mistake or a successful injection.
Insecure output handling is the bug that turns a model error into a system compromise. Agent output is frequently passed to a downstream interpreter: a SQL query, a shell command, an eval, a templating engine, a markdown renderer that loads remote images. If the system trusts model output the way it would trust a vetted internal function, then any injected or hallucinated payload flows straight into that interpreter. This is how prompt injection becomes SQL injection, command injection, or server-side request forgery. The model is just the delivery mechanism; the real vulnerability is the unvalidated trust boundary on the way out.
The remaining classes round out the picture. Memory poisoning targets agents that persist state: an attacker plants false facts or hidden instructions in the agent's long-term memory or in a shared knowledge base so the agent acts on them in future sessions, a longer-lived cousin of RAG poisoning. Tool misuse covers an agent calling legitimate tools in harmful combinations or with attacker-chosen arguments. Identity and authorization failures happen when an agent acts with a single over-privileged service identity instead of carrying the requesting user's scoped permissions, which collapses your access control and enables confused-deputy attacks. Supply chain risk covers poisoned models, compromised tool servers, and malicious MCP servers the agent connects to. Each of these is a documented, reproducible failure mode, not a hypothetical.
Mapping agent threats to the OWASP LLM Top 10
The OWASP Top 10 for LLM Applications, updated for 2025, gives you a shared vocabulary for these risks and maps cleanly onto the agent threat landscape. LLM01 Prompt Injection is the entry point for most agent attacks, direct and indirect. LLM06 Excessive Agency is the entry that exists specifically because agents have tools, memory, and autonomy; it is the single most agent-specific item on the list. LLM05 Improper Output Handling is the insecure output trust boundary described above, the link that turns a model response into command or SQL injection.
The data-integrity items map to the memory and knowledge threats. LLM04 Data and Model Poisoning covers training-time and fine-tuning-time corruption as well as poisoned data the agent learns from. LLM08 Vector and Embedding Weaknesses covers the retrieval layer that most agents depend on, where poisoned or maliciously crafted documents skew what the agent retrieves and trusts. Together these describe how an attacker corrupts what the agent believes before it ever takes an action.
The disclosure and availability items complete the mapping. LLM02 Sensitive Information Disclosure covers an agent leaking secrets, credentials, or private data through its outputs or tool calls. LLM07 System Prompt Leakage covers an agent revealing its own instructions, which often expose tool schemas and guardrails an attacker can then route around. LLM03 Supply Chain covers compromised models, plugins, and tool servers. LLM09 Misinformation and LLM10 Unbounded Consumption cover the agent confidently asserting false results and the cost or denial-of-service risk of an agent stuck in an expensive loop. Use the Top 10 as a checklist, not a ceiling: it is a starting taxonomy, and a real engagement against your agent will find combinations the list names individually.
Defense in depth for agents
There is no single control that secures an agent, because the model itself cannot be trusted to enforce its own boundaries. You cannot prompt your way to safety. Instructions like 'never reveal secrets' or 'refuse harmful requests' are guidance, not enforcement, and a determined injection will talk past them. Defense in depth means building the controls in the surrounding system, where they are deterministic, and assuming the model will eventually be tricked. Design for the day the model does exactly the wrong thing, and make sure that day is survivable.
Least privilege is the highest-value control and it directly counters excessive agency. Give each agent the narrowest set of tools the task requires, and give each tool the narrowest scope it needs. Read-only beats read-write. A single record beats a whole table. Scoped, short-lived credentials beat standing admin keys. Critically, the agent should act with the requesting user's permissions, not a shared super-identity, so that a compromised agent can only reach what that specific user could already reach. If an agent never holds the capability to wire money or drop a table, no injection can make it do so.
Sandboxing and output validation contain the two ends of the loop. Run agent-generated code and tool calls inside isolated, ephemeral environments with no ambient network access and no host credentials, so that even successful code execution is trapped. On the output side, never pass model output to a downstream interpreter without validation: parameterize queries, allowlist commands, schema-validate tool arguments, and strip or sanitize anything destined for a shell, an eval, or a renderer. Treat every byte the model emits as untrusted input to the next system, because that is exactly what it is.
Human-in-the-loop, monitoring, and kill switches handle the actions you cannot fully prevent. Require explicit human approval for high-impact or irreversible operations: financial transactions, production deletes, external communications, permission changes. Log every prompt, tool call, argument, and result so you have an audit trail and can detect anomalous behavior, such as an agent suddenly reaching for tools or data it has never touched before. Build a kill switch that halts an agent mid-task and a circuit breaker that trips on cost spikes, loop detection, or repeated failures. The goal is that when an agent goes wrong, you find out fast and you can stop it.
Identity, authorization, and the confused deputy
Authorization is where agent security most often quietly fails, because the easy implementation is also the insecure one. The path of least resistance is to give the agent one service account with broad permissions and let it do whatever any user might need. This is the confused-deputy setup: the agent is a trusted, highly-privileged intermediary that can be manipulated into acting on behalf of an attacker. A low-privilege user sends a request, possibly carrying an injection, and the agent executes it with its own high privileges. The user could never have done the action directly, but the agent could, so now they can.
The fix is to propagate identity, not pool it. The agent should carry the requesting user's identity and authorization context into every tool call, and the tool, not the agent, should enforce access control against that identity. This collapses the blast radius: an injection that hijacks the agent during a given user's session can still only reach what that user is allowed to reach. Authorization decisions belong in the resource server behind a verified token, never in the agent's prompt or in the model's judgment about whether a request seems legitimate.
This becomes acute the moment an agent connects to external tool servers, including over the Model Context Protocol. An MCP server exposes tools, resources, and prompts to the agent, and a malicious or compromised server can inject instructions through tool descriptions, return poisoned data, or define tools that quietly exceed their stated purpose. Token handling, scope enforcement, and server trust are first-class concerns here. MCP security is its own deep topic and the natural next layer once you have the agent's own identity model right.
Testing AI agents
You cannot secure an agent you have not attacked. Static review of the system prompt and the tool list tells you what is supposed to happen; adversarial testing tells you what actually happens when someone tries to break it. Agent testing is its own discipline because the target is a non-deterministic action-taker, not a static binary. The same input can produce different behavior across runs, so you test for exploitable behavior across many trials and many phrasings rather than expecting a single deterministic crash. Coverage matters more than any one clever payload.
Agentic AI red teaming is the structured version of this. You define the agent's capabilities, its trust boundaries, and the worst-case outcomes, then attack across the full kill chain: initial access through direct or indirect prompt injection, privilege and capability discovery as the agent reveals its tools, lateral movement through chained tool calls and sub-agents, and impact through data exfiltration or destructive actions. Thinking in terms of an AI agent kill chain keeps testing systematic instead of a grab-bag of jailbreak attempts, and it maps findings to the stages where a control would actually have stopped the attack.
Ground the program in the public frameworks so findings are legible to the rest of the organization. The OWASP LLM Top 10 gives you the vulnerability taxonomy. MITRE ATLAS catalogs real-world tactics and techniques against AI systems and gives you an attacker-behavior reference. The NIST AI Risk Management Framework gives you the governance language to translate technical findings into risk decisions a CISO and a board can act on. Run testing continuously, not once: agents change when their prompts, tools, models, or connected servers change, and every one of those changes can reopen a hole you previously closed.
Building an agent security program
Start with an inventory and a threat model, because you cannot protect agents you have not enumerated. List every agent in your environment, what tools and data each one can reach, what identity it acts under, what memory it persists, and what external servers it connects to. For each one, write down the worst thing it could do if fully compromised. That single exercise usually surfaces the excessive-agency problems before any attacker does, and it tells you where to spend defensive effort first: the agents that can move money, change permissions, or delete data.
Then close the loop between building and testing. Bake least privilege, output validation, sandboxing, and logging into the agent platform so every new agent inherits the controls instead of reinventing them. Gate high-impact actions behind human approval by default and require an explicit, reviewed exception to remove that gate. Wire monitoring and kill switches into the runtime so detection and response are not afterthoughts. Treat agent security as a property of the platform, not a checklist each team reimplements badly.
Finally, accept that this is a moving target and staff it accordingly. Models improve and so do the attacks against them. New tool integrations, new MCP servers, and new autonomy features each expand the attack surface. The teams that stay ahead treat agent security as continuous adversarial testing plus deterministic guardrails, not a one-time hardening pass. The model will always be the untrusted component in the loop. Your job is to make sure that when it fails, the system around it does not.
- How is AI agent security different from chatbot security?
- A chatbot produces text, so the worst outcome of a compromise is a bad or leaked output that a human still has to act on. An agent produces actions: it calls tools, runs code, and modifies data with no human in the loop. The same prompt injection that makes a chatbot say something wrong can make an agent delete records, exfiltrate data, or move money. Agent security therefore focuses on access control, sandboxing, and oversight around the model, not just on the model's responses.
- What is the biggest AI agent security risk?
- The combination of prompt injection and excessive agency. Prompt injection, especially indirect injection through retrieved web pages, documents, or emails, is how an attacker gets the agent to do something it should not. Excessive agency, meaning over-broad tool permissions and unchecked autonomy, is what determines how much damage that hijacked agent can cause. Neither is fully solvable in the model alone, so the fix is least privilege and human approval for high-impact actions in the surrounding system.
- Can you prevent prompt injection in an agent?
- You cannot eliminate it at the model layer, because the model cannot reliably separate trusted instructions from untrusted data it processes. You contain it in the system. Assume injection will eventually succeed and design so that a hijacked agent cannot reach anything dangerous: scope its tools and credentials tightly, validate all output before it hits a downstream interpreter, sandbox code execution, and require human approval for irreversible actions. The goal is to make a successful injection survivable, not to make injection impossible.
- Which frameworks apply to AI agent security?
- Three are widely used together. The OWASP Top 10 for LLM Applications (2025) is the vulnerability taxonomy and maps directly to agent risks like prompt injection, excessive agency, and improper output handling. MITRE ATLAS catalogs real attacker tactics and techniques against AI systems. The NIST AI Risk Management Framework provides governance language to turn technical findings into risk decisions. For agents that connect to external tools, the Model Context Protocol specification defines the trust boundaries you need to secure.
- What does least privilege mean for an AI agent?
- Give the agent only the tools the task needs, and give each tool the narrowest scope it can work with: read-only over read-write, a single record over a whole table, short-lived scoped credentials over standing admin keys. Just as important, the agent should act with the requesting user's permissions rather than a shared high-privilege service identity. That way a compromised agent can only reach what that specific user could already reach, which counters confused-deputy attacks and contains the blast radius of any injection.
- How do you test an AI agent for security?
- Through agentic red teaming, not static review alone. Define the agent's tools, trust boundaries, and worst-case outcomes, then attack across the full kill chain: prompt injection for initial access, capability discovery, lateral movement through chained tool calls, and impact like data exfiltration or destructive actions. Because agents are non-deterministic, test many phrasings across many runs and measure exploitable behavior rather than expecting one deterministic failure. Re-test whenever the prompt, tools, model, or connected servers change.
- What is a kill switch for an AI agent?
- A kill switch is a runtime control that immediately halts an agent mid-task, paired with a circuit breaker that trips automatically on warning signs like cost spikes, detected loops, or repeated failures. It exists because some bad actions cannot be fully prevented in advance, so you need a fast, reliable way to stop an agent that is going wrong before it finishes a destructive sequence. Combined with detailed logging of every tool call, it turns an open-ended autonomous process into one you can observe and interrupt.