Hijack the Agent: Excessive Agency and Tool Abuse

The Play

An agent is a language model wired to tools: read a database, send an email, run a query, approve a change. The trust boundary you are testing is not the model's politeness filter. It is the gap between what the agent is allowed to do and what it actually needs to do. Excessive Agency lives in that gap. When an agent ingests untrusted content (an email, a document, a retrieved web page, a transaction memo) and that content carries instructions, the agent's reasoning loop can treat those instructions as its own and reach for a tool to carry them out. The play works because the damage is done by a legitimate, authenticated tool call. No exploit lands on the model. The model is just the confused deputy holding the keys. Three things make it possible, and they are independent: the agent has tools it never needed (excessive functionality), those tools run with more access than the job requires (excessive permissions), and high-impact actions fire without anyone signing off (excessive autonomy). Fix one and the other two still bite. Credit where it is due: this framing stands on OWASP's LLM06 taxonomy and on the work of Reversec Labs and Lakera, who built ranges where you can watch the loop bend in real time.

Before the Snap

Stand up the range, not a live system. For the self-hosted track, clone the Damn Vulnerable LLM Agent and run it locally or in Docker with your own model backend (OpenAI, a local Ollama model, your choice), so the only account, data, and tools at risk are ones you created. For the hosted track, Gandalf Agent Breaker runs in the browser with nothing to install and is sanctioned practice by design. Either way the scope is owned or explicitly authorized: a signed engagement against a customer agent, or a lab you control. Read OWASP LLM06:2025 first so the three agency factors are in your head as named, separate things before you start. Have one tab open for the agent and one for its logs or trace output, because the whole lesson is in the reasoning trace, not the final answer.

Run It

Map the agent's tools before touching it. Read the system documentation or, in the lab, the tool definitions: list every action the agent can take, what each tool's parameters are, and what identity or permission level each runs under. You are looking for tools that exceed the agent's stated job and tools whose backing credentials are broader than read-only.
Classify each tool against the three agency factors. For every tool ask: is this functionality even needed (excessive functionality), does its credential allow writes or deletes when reads would do (excessive permissions), and does a high-impact call fire without a human approving it (excessive autonomy). Note which tools fail which test. These are your candidate targets.
Find the untrusted-input surface. Identify every place the agent reads content it did not author: user messages, retrieved documents, database fields like transaction memos, emails, web pages. Indirect injection through one of these is the realistic path, because it does not require a malicious operator, only malicious data the agent later reads.
Establish baseline behavior. Drive the agent through its intended happy path and watch the reasoning trace, the Thought/Action/Observation loop in a ReAct agent. Confirm which tool it picks for a normal request and what a clean Observation looks like. You need this baseline to recognize when the loop bends.
Steer the loop through the data channel, not the chat box. Place benign-looking instructional content into one of the untrusted surfaces you found, then have the agent process that content normally. Concept only here: the goal is to get the agent to treat embedded text as a directive and select a tool it should not have used for that data. No payloads: the technique is the placement, not a magic string.
Watch for the decision point in the trace. Success is not the final message. It is the moment in the Thought/Action/Observation sequence where the agent decides to invoke a privileged tool, or feeds attacker-influenced parameters into a legitimate one. Capture that step. That single Action line is the finding.
Confirm real-world impact within the lab. Verify the tool call actually did something that crosses a boundary: returned another user's data, wrote where it should only have read, or completed a high-impact action with no human in the loop. Tie the impact back to the specific agency factor you abused so the report names the root cause, not just the symptom.
Write it up as a chain. Untrusted source, the tool that was reachable, the permission it ran under, the missing approval gate, and the trace line that proves the agent chose it. That chain is what the defender fixes.
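
The first two steps (map the tools, then classify each against the three agency factors) can be sketched as a small inventory check. This is a minimal Python sketch, not part of any lab's tooling: the Tool fields, the scope strings, and the three example tools are hypothetical stand-ins for whatever you read out of the real tool definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """One row of the tool map. All field names are illustrative assumptions."""
    name: str
    needed_for_job: bool   # does the agent's stated job require this tool at all?
    granted: frozenset     # what the backing credential allows
    required: frozenset    # what the job actually needs
    high_impact: bool      # moves money, sends mail, deletes, approves
    human_approval: bool   # is there a sign-off gate before it fires?


def agency_findings(tool: Tool) -> list:
    """Return which of the three LLM06 agency factors this tool fails."""
    findings = []
    if not tool.needed_for_job:
        findings.append("excessive functionality")
    if tool.granted - tool.required:  # any scope beyond what the job needs
        findings.append("excessive permissions")
    if tool.high_impact and not tool.human_approval:
        findings.append("excessive autonomy")
    return findings


# Hypothetical inventory for an order-support agent.
inventory = [
    Tool("lookup_order", True, frozenset({"read"}), frozenset({"read"}), False, False),
    Tool("update_order", True, frozenset({"read", "write", "delete"}),
         frozenset({"read", "write"}), True, False),
    Tool("send_email", False, frozenset({"send"}), frozenset(), True, False),
]

for tool in inventory:
    print(tool.name, agency_findings(tool))
```

Any tool with a non-empty findings list is a candidate target, and the factor names are the root causes the report should carry forward.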

What You Learn

The failure class is the confused deputy with a tool belt. The model was not jailbroken in any interesting sense: it was handed authority it should never have held, then fed data that redirected that authority. The transferable lesson is that agent security is an access-control and architecture problem wearing an AI costume. Three separate defects (too many tools, too much permission, too little oversight) compound, and patching the prompt addresses none of them. Once you have seen the Thought/Action/Observation loop pick the wrong tool on injected data, you stop trusting model-side guardrails as a boundary and start auditing the plumbing behind every agent you meet.

Drive It with Claude Code

On our authorized agent range (a local clone of the Damn Vulnerable LLM Agent we own and run), enumerate every tool the ReAct agent can call, then for each tool reason about whether the model can reach an action outside the authenticated user's context. Map each over-broad capability to OWASP LLM06 Excessive Agency, and write up the tool, the trust boundary it crosses, and the least-privilege scope that would close it. Stay inside the range and produce findings only, no exploitation payloads.

{
  "tools": {
    "get_current_user": {
      "scope": "session.authenticated_user_id",
      "agent_callable": false,
      "note": "auth/identity resolves from the session outside the agent, never from a tool the model can invoke"
    },
    "query_orders": {
      "scope": "read:own_orders",
      "constraints": { "user_id": "${session.authenticated_user_id}" },
      "agent_callable": true,
      "deny": ["cross_user_lookup", "raw_sql"]
    }
  },
  "permission_model": "least_privilege",
  "default": "deny",
  "human_in_loop_required": ["write", "delete", "transfer", "external_send"]
}
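
One way to read the policy above is as input to a gate that sits between the agent and its tools. The following is a minimal Python sketch under stated assumptions, not a reference implementation: the decide function is hypothetical, it assumes the action class is the part of "scope" before the colon, and it does not interpret the named "deny" entries beyond what the user_id constraint already enforces.

```python
import json

# The policy from the text, reproduced so the sketch is self-contained.
POLICY = json.loads("""
{
  "tools": {
    "get_current_user": {
      "scope": "session.authenticated_user_id",
      "agent_callable": false
    },
    "query_orders": {
      "scope": "read:own_orders",
      "constraints": { "user_id": "${session.authenticated_user_id}" },
      "agent_callable": true,
      "deny": ["cross_user_lookup", "raw_sql"]
    }
  },
  "permission_model": "least_privilege",
  "default": "deny",
  "human_in_loop_required": ["write", "delete", "transfer", "external_send"]
}
""")

SESSION_USER = "${session.authenticated_user_id}"


def decide(policy: dict, tool_name: str, params: dict, session_user_id: str) -> str:
    """Return 'allow', 'deny', or 'needs_human' for one proposed tool call."""
    tool = policy["tools"].get(tool_name)
    if tool is None:
        return policy.get("default", "deny")  # unknown tools fall to the default
    if not tool.get("agent_callable", False):
        return "deny"  # identity comes from the session, never from the model
    # Constrained parameters must match the session-derived value exactly.
    for key, template in tool.get("constraints", {}).items():
        expected = template.replace(SESSION_USER, session_user_id)
        if params.get(key, expected) != expected:
            return "deny"
    # Assumption: the action class is the scope prefix, e.g. "read:own_orders".
    action = tool["scope"].split(":", 1)[0]
    if action in policy["human_in_loop_required"]:
        return "needs_human"
    return "allow"


print(decide(POLICY, "query_orders", {"user_id": "u1"}, "u1"))
print(decide(POLICY, "query_orders", {"user_id": "u2"}, "u1"))
```

The design choice worth noticing is that the model never supplies the identity being checked: the session value is resolved outside the agent and compared against whatever parameters the agent proposed.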

Defend It

Fix the plumbing, in this order. Least privilege per tool: every tool gets the narrowest credential that does its job, read-only stays read-only, and you remove tools the agent does not actually need so the functionality is simply absent. Human-in-the-loop gating: any high-impact or irreversible action (moving money, sending mail, deleting, approving) requires explicit human confirmation before it executes, so autonomy is bounded where the blast radius is largest. Output and parameter validation: treat the agent's tool calls as untrusted input to the downstream system, enforce authorization checks at the tool and data layer (complete mediation, not just at the prompt), and validate parameters so a tool cannot be coerced into actions outside its contract. Add monitoring and rate limits as damage control, not as the primary defense. The test that the fix holds: replay your own attack chain and confirm the privileged tool is now either gone, scoped to read-only, or blocked pending human approval, and that the trace shows the agent could not complete the damaging Action on its own.
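
The replay test can be made mechanical by scanning the captured trace for privileged Actions that actually completed. This is a minimal Python sketch that assumes a plain-text ReAct trace with "Action:" and "Observation:" line prefixes; the tool name transfer_funds, the sample traces, and the "BLOCKED" marker a gate might write into the Observation are all hypothetical, so adapt them to whatever your range really emits.

```python
import re

ACTION = re.compile(r"^\s*Action:\s*(\S+)")          # does not match "Action Input:"
OBSERVATION = re.compile(r"^\s*Observation:\s*(.*)")


def completed_privileged_calls(trace: str, privileged: set,
                               blocked_marker: str = "BLOCKED") -> list:
    """Return (line_number, tool) for each privileged Action whose
    Observation does not start with the blocked marker."""
    hits = []
    pending = None  # privileged Action still waiting for its Observation
    for line_no, line in enumerate(trace.splitlines(), start=1):
        action = ACTION.match(line)
        if action:
            tool = action.group(1)
            pending = (line_no, tool) if tool in privileged else None
            continue
        observation = OBSERVATION.match(line)
        if observation and pending:
            if not observation.group(1).startswith(blocked_marker):
                hits.append(pending)
            pending = None
    return hits


# Hypothetical traces: the same attack chain before and after the fix.
BEFORE = "\n".join([
    "Thought: the memo tells me to move the balance",
    "Action: transfer_funds",
    'Action Input: {"to": "acct-2", "amount": 100}',
    "Observation: transfer complete",
])
AFTER = "\n".join([
    "Thought: the memo tells me to move the balance",
    "Action: transfer_funds",
    'Action Input: {"to": "acct-2", "amount": 100}',
    "Observation: BLOCKED pending human approval",
])

print(completed_privileged_calls(BEFORE, {"transfer_funds"}))
print(completed_privileged_calls(AFTER, {"transfer_funds"}))
```

An empty result on the replayed chain is the evidence that the agent could not complete the damaging Action on its own. A privileged Action with no Observation after it is not counted by this sketch, so check truncated traces by hand.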