Reference

Guides

Definitive, evergreen guides to the attack surfaces of agentic AI. The deep references behind the daily research.

A faceless crimson-suited agent in charcoal and ink, red matrix code rain dissolving a lock, in the Krypteia matrix-agent style.

LLM Jailbreaking: The Complete Red Team Guide

Jailbreaking is the practice of getting a language model to produce output its own safety training was meant to refuse. It is distinct from prompt injection, which subverts an application's instruction hierarchy using untrusted data; jailbreaking targets the model's alignment directly. It works because refusal is a learned behavior trained on top of a next-token predictor, not a hard constraint wired into it, so an input that shifts the model far enough from where refusals were reinforced can suppress the refusal. This guide breaks the whole problem down to fundamentals with worked examples and diagrams: the taxonomy of techniques, how each one is built, how a model gets walked over gradually, how the Gandalf game teaches it, and how a frontier model like GPT-5.6, Claude Opus, or Sonnet is actually approached. It is written for authorized red teams and defenders, as an educational map of how the attack works and how to defend against it.

A faceless agent at the center of a vortex, its head and chest cross-sectioned to reveal a glowing crimson neural core of nodes and circuitry, the model anatomy laid bare

Pillar Guide

LLM Security: The Full Spectrum

LLM security is the practice of defending a language model system across its entire lifecycle and stack, not just the prompt box where users type. The attack surface runs in layers: the training data the model learned from, the weights themselves, the fine-tuning that added safety, the retrieval system that feeds it knowledge at run time, the live prompt interaction, and the tools and autonomy wired around it. Most people only see the top layer. Real attacks chain across the layers below it, and a defender who watches only the prompt misses most of what can go wrong.

One agent injecting crimson code into another agent's chest, overwriting it from within

Guide

Prompt Injection Testing: A Practical Guide

Prompt injection testing is the practice of attacking an LLM application with adversarial inputs to find where untrusted text can override the system's intended instructions. It treats the model as a non-deterministic target, so you run many variations of each attack and measure how often the application follows the attacker instead of its own rules. Done well, it produces a reproducible corpus of attacks and a success rate per attack class, not a single pass or fail.

Two agents in adversary confrontation, one crackling with energy as it breaks the other into red shards

Guide

Agentic AI Red Teaming: A Practitioner's Guide

Agentic AI red teaming is the adversarial testing of AI systems that plan, make decisions, and take actions over many steps using external tools, rather than answering a single prompt. The goal is to find sequences of inputs and conditions that drive an autonomous agent to misuse its tools, exceed its authority, leak data, or cause real-world impact, then prove those paths end to end. It treats the agent as an attack surface that holds state and credentials across a session, not as a stateless text box.

Two faceless agents at a threshold, one offering a glowing corrupted tool across an untrusted boundary

Guide

MCP Security: The Complete Guide

MCP security is the practice of securing the Model Context Protocol, the open standard that lets AI agents discover and call external tools, data sources, and APIs through MCP servers. The core risk is that an MCP server is untrusted input to the agent: its tool definitions, schemas, and runtime outputs all enter the model context and can carry instructions, so a malicious or compromised server can redirect the agent's behavior, exfiltrate data, or abuse the agent's existing privileges. Securing MCP means treating every connected server as a hostile boundary and constraining what the agent is allowed to do on the other side of it.

A faceless agent with its hand plunged into a glowing crimson command-line panel, code streaming from its fingers, a blade of light extending from the console

Guide

CLI Tool Security for AI Agents

CLI tool security for AI agents is the practice of securing the command-line tools an AI agent is allowed to run, such as bash, gh, kubectl, and curl, when those tools are exposed to the model directly rather than wrapped behind an MCP server. The core risk is that an agent with shell access can do anything the shell user can do, and the agent's decisions are driven by inputs an attacker can influence through prompt injection, so a single piece of poisoned content can turn into a destructive or data-exfiltrating command. Securing it means treating the agent's command construction as untrusted, constraining which commands it can run, and running those commands in a sandbox with least privilege and human review for anything irreversible.

A triad of agents from defended to compromised, the spectrum from secured to breached

Guide

AI Agent Security: Threats and Defenses Guide

AI agent security is the practice of protecting LLM-driven agents that can take actions in the world: agents that call tools, run code, query databases, send messages, and persist memory across sessions. It matters because the failure mode changes when you give a model autonomy. A jailbroken chatbot says something wrong; a jailbroken agent with tool access executes code, exfiltrates data, or modifies records. AI agent security treats the model as an untrusted component inside a system and applies access control, sandboxing, output validation, monitoring, and human oversight around it.

Reference

Living Taxonomy

The Jailbreak Taxonomy

Every jailbreak technique organized by mechanism, defense-mapped, and cross-referenced to Arcanum, OWASP, and MITRE ATLAS. A filterable, living reference companion to the guides.