Flagship Guide

LLM Security: The Full Spectrum

LLM security is the practice of defending a language model system across its entire lifecycle and stack, not just the prompt box where users type. The attack surface runs in layers: the training data the model learned from, the weights themselves, the fine-tuning that added safety, the retrieval system that feeds it knowledge at run time, the live prompt interaction, and the tools and autonomy wired around it. Most people only see the top layer. Real attacks chain across the layers below it, and a defender who watches only the prompt misses most of what can go wrong.

By The Operator·Updated June 13, 2026

Why most people only see the top of the stack

Ask a security team about LLM security and you will almost always get one answer: prompt injection. They have read about a chatbot tricked into ignoring its instructions, they have seen a jailbreak go viral, and they conclude that LLM security is a problem of clever inputs and clever filters. That view is not wrong. It is just the top floor of a tall building, and the people standing on it rarely look down.

The model you talk to is the visible surface of a long pipeline. Before a single prompt is ever typed, the model has been built from a corpus of training data, frozen into a set of weights, shaped by a fine-tuning and alignment pass, and then wired into a retrieval system and a set of tools. Every one of those stages is a place where an attacker can act, and several of them are far more dangerous than the prompt layer because the damage is baked in before the user arrives. A poisoned training corpus, a tampered set of weights, a stripped safety layer: none of these show up as a suspicious prompt, and none of them are caught by a prompt firewall.

This guide gives you the whole map. It walks the stack from the bottom up, names the attacks that live at each layer, and shows where they chain together. The goal is simple. By the end you should be able to look at any LLM deployment and see all six layers of its attack surface, not just the one that everyone argues about online. The security professional who can do that is operating at a different level from the one who only knows prompt injection.

The six layers of the LLM attack surface

Think of an LLM system as a stack of six layers, each built on the one below it. From the bottom: the training data layer, where the model learns. The base model and weights layer, which is the model itself. The fine-tuning and alignment layer, where safety is added. The retrieval and RAG layer, which is the knowledge the model is handed at run time. The inference and prompt layer, the live interaction everyone knows. And the agent and tool layer, where the model is wired to act in the world. Attacks exist at every layer, and the lower you go, the more permanent and invisible the compromise.

Layer one is the training data. This is the corpus the model learned from, often scraped from the open internet at a scale no human can review. An attacker who can get content into that corpus can poison it, planting backdoors and sleeper triggers: patterns that sit dormant until a specific phrase activates a hidden behavior the model was quietly taught. The poison is in the model from birth, and no prompt-level filter will ever find it.

Layer two is the base model and the weights. This is the trained artifact itself, the multi-gigabyte file of parameters. Attacks here include model theft and extraction, where an attacker reconstructs a copy of a proprietary model; distillation, where they train a competitor on its outputs; weight tampering, where the file is modified; and abliteration, the surgical removal of the model's refusal behavior directly from its weights. Abliteration is the deepest jailbreak there is, because it is permanent and lives at the parameter level rather than in any prompt. We give it its own section below.

Layer three is fine-tuning and alignment. This is where safety is added on top of the base model through alignment training. It is also where safety can be removed: a small amount of malicious fine-tuning, including cheap low-rank adapter attacks, can strip the safety behavior an organization spent enormous effort to install. The uncomfortable truth of this layer is that alignment is fragile. It is a thin coat of paint over a far larger and amoral base model, and a few hundred adversarial examples can peel it off.

Layer four is retrieval and RAG. Most production LLM systems do not rely on the model's frozen knowledge alone. They retrieve documents from a vector database and feed them to the model at run time. That retrieval channel is an attack surface: vector database poisoning plants malicious documents that the system will dutifully retrieve and trust, embedding inversion attacks recover sensitive text from the stored vectors, and indirect prompt injection rides in on a retrieved document. The model treats retrieved content as fact, which is exactly the assumption an attacker exploits.

Layer five is inference and the prompt. This is the live interaction: the system prompt, the user input, the model output. Here live the attacks everyone knows, prompt injection both direct and indirect, jailbreaking, system-prompt extraction, insecure output handling, and adversarial inputs. It is a real and busy layer. It is also only one of six.

Layer six is the agent and tool layer. This is the model wired to tools, memory, and autonomy so it can act, not just answer. Attacks here include excessive agency, tool misuse, and the full agent kill chain, where an injected instruction is escalated through chained tool calls into real-world impact. This is the layer where Krypteia's other guides concentrate: MCP security, CLI tool security for AI agents, agentic AI red teaming, and AI agent security all live here. It is the layer with the largest blast radius, because this is where a text trick becomes a deleted record or a moved dollar.

The model core: the part most people do not understand

The honest hook of this guide is that the model core, layers one through three, is the part almost no security professional actually understands. They understand the prompt because they can see it. They do not understand the weights because the weights are an opaque file of billions of floating-point numbers, and the idea that those numbers can be attacked, stolen, poisoned, or surgically edited feels closer to research than to operations. That gap is exactly where the most serious and least defended risks sit.

Consider what the core actually is. The training data layer determines everything the model knows and everything it was secretly taught. The weights layer is the entire model condensed into one artifact that can be copied, modified, or reconstructed. The alignment layer is the only thing standing between a helpful assistant and a model that will explain anything to anyone, and it is the most fragile layer of the three. An attacker who operates at the core does not need to win an argument with a guardrail at run time. They change what the model is before it ever runs.

This matters more every year because open-weight models are now everywhere. When a model's weights are published and downloadable, every attack that requires access to the weights becomes available to anyone. Data poisoning, malicious fine-tuning, weight tampering, and abliteration are not theoretical lab exercises against a hypothetical model. They are documented, published techniques being run on real open-weight models today. The security professional who only knows prompt injection is defending the front door of a house whose foundation is already accessible to the public.

Abliteration: cutting safety out of the weights

Abliteration is the technique that makes the model core real for people who have only ever thought about prompts. The name is a blend of ablation and obliteration, coined by a researcher known as FailSpy in 2024, and it describes the targeted removal of a model's refusal behavior directly from its weight matrices. It is not a jailbreak prompt. It does not trick the model into saying yes. It edits the model so that the part of it that ever said no is gone.

The technique rests on a finding from interpretability research, established in work by Arditi and colleagues, that refusal in instruction-tuned models is mediated by a single direction in the model's activation space. By running the model on a set of harmful prompts and a set of harmless prompts and taking the mean difference between the internal activations, researchers can compute the specific direction that represents refusal. Once that direction is identified, it can be projected out of the model's weights, so the model loses its ability to represent refusal at all. The result is a permanently unaligned version of the model that answers anything, with no prompt trickery required at run time.
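The two geometric steps described above, a difference of means followed by a projection, can be illustrated on made-up numbers. The sketch below is a toy under stated assumptions: the three-dimensional vectors are synthetic stand-ins for activations, every function name is hypothetical, and it projects the direction out of a single vector rather than out of any weight matrix. It touches no model and is meant only to show what "projecting a direction out" means.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def project_out(v, direction):
    """Remove the component of v that lies along the unit vector `direction`."""
    scale = dot(v, direction)
    return [x - scale * d for x, d in zip(v, direction)]

# Synthetic stand-ins (assumption): two groups of "activations" that differ
# mainly along the first coordinate.
group_a = [[2.0, 1.0, 0.0], [2.2, 0.8, 0.1], [1.8, 1.2, -0.1]]
group_b = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.1], [-0.2, 1.2, -0.1]]

# Step 1: the mean difference between the two groups gives a direction.
difference = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
direction = unit(difference)

# Step 2: projecting that direction out leaves a vector with no component along it.
cleaned = project_out(group_a[0], direction)
residual = dot(cleaned, direction)  # numerically close to zero
```

After the projection, `residual` is zero up to floating-point error, which is the sense in which the direction can no longer be represented.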

This is why abliteration is the deepest jailbreak. A prompt-level jailbreak is a temporary win against a model that is still trying to refuse; patch the guardrail or update the model and the jailbreak dies. Abliteration is permanent and at the parameter level. The safety is not bypassed, it is removed, and the modified weights can be saved, shared, and run by anyone. Recent open-source toolkits have packaged the method into multi-stage pipelines with more than a dozen abliteration variants, cutting the effort needed to strip safety from an open-weight model to minutes of work.

It is worth being precise about what this section is and is not. This is the educational map, not an operational recipe. The point is that any organization deploying or relying on open-weight models has to understand that the safety behavior of those models is not a fixed property; it is removable, and removed versions circulate. Defenders need weight integrity controls, provenance tracking, and a clear-eyed understanding that an open-weight model's alignment can be stripped by anyone who downloads it. There is also active research on defenses that make abliteration harder, which tells you the technique is taken seriously enough to be worth defending against.

Training data and model supply chain attacks

Below the weights sits the training data, and it is the quietest attack surface of all. Frontier models are trained on enormous corpora scraped from the open web, and no team reviews every document. That creates an opening: if an attacker can get crafted content into the training set, they can teach the model a hidden behavior. Data poisoning at this layer can plant a backdoor, a behavior that stays dormant until a specific trigger phrase appears in the input, at which point the model does something it was never supposed to do. These sleeper triggers are nearly impossible to find by inspecting the model, because the model behaves normally until the exact key is presented.

Research has shown that the number of poisoned documents needed to plant a backdoor can be surprisingly small relative to the size of the corpus, which inverts the intuition that a few bad pages in a sea of good ones cannot matter. They can. The poison does not need volume; it needs to be present and consistent enough for the model to learn the association. This is why training data provenance is a security control and not just a data-quality nicety.

The same supply chain logic extends past the raw data to every component a model is built from: pretrained checkpoints downloaded from public hubs, fine-tuning datasets from third parties, adapters and tokenizers, and the libraries that load them. OWASP names this directly in its 2025 Top 10 for LLM Applications, where Supply Chain sits at LLM03 and Data and Model Poisoning sits at LLM04. A model pulled from a public repository carries the trust assumptions of everyone who touched it upstream. Treating a downloaded checkpoint as trusted by default is the same mistake as running an unsigned binary from an unknown source.
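One way to stop treating a downloaded checkpoint as trusted by default is to pin a digest for every file and refuse to load anything that does not match. The sketch below is a minimal illustration using only the Python standard library. The function names and the shape of the `pinned` mapping (relative file name to SHA-256 hex digest) are assumptions for this example, not part of any model hub's API.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifacts(model_dir, pinned):
    """Compare every file under model_dir against a pinned name -> digest mapping.

    Returns a list of problems; an empty list means every pinned file matched
    and no unpinned file was present.
    """
    root = Path(model_dir)
    problems = []
    for name, expected in pinned.items():
        target = root / name
        if not target.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(target) != expected:
            problems.append(f"hash mismatch: {name}")
    # Files nobody pinned (extra adapters, loader scripts) are reported too,
    # because they are part of what gets loaded.
    for found in sorted(p for p in root.rglob("*") if p.is_file()):
        relative = found.relative_to(root).as_posix()
        if relative not in pinned:
            problems.append(f"unpinned file: {relative}")
    return problems
```

A loader would call `verify_artifacts` before deserializing anything and stop on a non-empty result. A hash only proves the file is the one that was pinned; it says nothing about whether the pinned file was trustworthy in the first place, which is why provenance is a separate control.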

Fine-tuning attacks: stripping safety on purpose

Alignment is added in the fine-tuning layer, and it can be removed in the same layer. The published research is stark: a small number of adversarial fine-tuning examples can undo the safety behavior of an aligned model, and the cost to do it is low. This is not exotic. Fine-tuning is a normal, supported operation for open-weight models and for many hosted models through their tuning APIs, which means the mechanism to strip safety is the same mechanism vendors offer as a feature.

Low-rank adaptation makes this cheaper still. A LoRA adapter is a small set of additional weights trained on top of a frozen base model, and a malicious adapter can shift the model's behavior toward compliance without retraining the whole thing. The adapter is small, easy to distribute, and can be merged into the base weights or applied at load time. An attacker who cannot abliterate a model may simply fine-tune the safety out of it with a few hundred examples, and the end state looks similar: a model that no longer refuses.
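The size difference is easy to put in numbers. In the standard LoRA formulation, a weight matrix of shape d_out by d_in is left frozen and the adapter learns two small matrices of shapes d_out by r and r by d_in, whose product is added to it. The sketch below counts parameters for one such matrix; the dimensions (4096 by 4096, rank 8) are illustrative assumptions, not figures from any particular model.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Return (full, adapter) parameter counts for one weight matrix.

    full:    parameters in the frozen d_out x d_in matrix
    adapter: parameters in the two low-rank factors (d_out x rank, rank x d_in)
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter

# Illustrative dimensions (assumption): a 4096 x 4096 projection at rank 8.
full, adapter = lora_parameter_counts(4096, 4096, 8)
print(full, adapter, adapter / full)  # prints: 16777216 65536 0.00390625
```

At these assumed sizes the adapter is under half a percent of the matrix it modifies, which is what makes it cheap to train and easy to distribute.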

The lesson of this layer is that alignment is not a durable property of the weights. It is a behavior that was trained in and can be trained out, and the effort to remove it is a tiny fraction of the effort that went into adding it. For a defender, this means you cannot assume that a model which was aligned when you obtained it will stay aligned through a fine-tuning step, and you cannot assume that a third-party fine-tuned model carries the safety of its base. Alignment robustness is its own control, separate from whatever the base model shipped with.
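Treating alignment robustness as its own control implies measuring it before and after every tuning step. The sketch below shows the shape of such a regression check. Everything in it is an assumption for illustration: the `generate` callables stand in for whatever interface your models expose, the keyword-based refusal detector is deliberately crude (a real evaluation would use a classifier or human review), and the threshold is arbitrary.

```python
# Hypothetical refusal markers; real evaluations need something far more robust.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def looks_like_refusal(text):
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(generate, prompts):
    """Fraction of prompts for which generate(prompt) looks like a refusal."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one evaluation prompt")
    refused = sum(1 for prompt in prompts if looks_like_refusal(generate(prompt)))
    return refused / len(prompts)

def safety_regressed(before, after, prompts, max_drop=0.05):
    """True if the refusal rate fell by more than max_drop between two models."""
    prompts = list(prompts)
    return refusal_rate(before, prompts) - refusal_rate(after, prompts) > max_drop

# Stub models standing in for a base model and a tuned model (assumption).
def base_model(prompt):
    return "I cannot help with that request."

def tuned_model(prompt):
    return "Sure, here is how."

eval_prompts = ["placeholder disallowed request 1", "placeholder disallowed request 2"]
regressed = safety_regressed(base_model, tuned_model, eval_prompts)
```

With these stubs the check flags a regression, because the tuned stub never refuses. The same harness applies to a third-party fine-tune: run it against the published base and the tuned variant before trusting the latter.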

RAG and retrieval: poisoning the knowledge channel

Most real LLM products do not run on the model's frozen knowledge. They use retrieval-augmented generation: a query is turned into an embedding, similar documents are pulled from a vector database, and those documents are placed in the model's context so it can answer from current, specific information. This is a powerful pattern and a serious attack surface, because the retrieval channel is a path by which attacker-controlled content reaches the model wearing the costume of trusted knowledge.

Vector database poisoning is the headline attack. Research presented at USENIX Security in 2025 demonstrated that a handful of carefully crafted documents, on the order of five, can reliably steer the system's answer to a targeted query with a success rate above ninety percent, even when the database holds millions of documents. The attacker does not need to flood the store; they need to place a small number of documents that the retriever will pull for the queries they care about. Once retrieved, the poisoned document is treated as fact by the model, and that is the whole game.
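Why a few documents can be enough is easier to see with a toy retriever. The sketch below uses bag-of-words vectors and cosine similarity as a stand-in for a real embedding model, which is an assumption made to keep the example self-contained; it is not the method from the research cited above. It shows only the underlying mechanic: a document written to resemble the target query outranks the legitimate document for that query.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    numerator = sum(a[token] * b[token] for token in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]

corpus = [
    "the refund window for annual plans is thirty days",
    "support hours are nine to five on weekdays",
    "annual plans renew automatically each year",
]
query = "what is the refund window for annual plans"

# The attacker's document echoes the query wording, then asserts a false answer.
poisoned = "what is the refund window for annual plans there is no refund window"

top_before = retrieve(query, corpus)[0]
top_after = retrieve(query, corpus + [poisoned])[0]
```

Before the poisoned entry is added, the legitimate refund document ranks first; afterwards the attacker's document does, and adding more unrelated documents to the corpus would not change that. The size of the store matters less than how closely a document matches the queries the attacker cares about.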

The retrieval layer also carries indirect prompt injection. A retrieved document can contain hidden instructions, and because the model reads retrieved content on the same channel as its instructions, it can mistake those hidden instructions for legitimate ones. There is a second, quieter risk at this layer too: embedding inversion. The vectors stored in the database are not opaque; research shows that a meaningful fraction of the original text, in some studies a majority of the words, can be reconstructed from the embeddings themselves. A vector store of customer documents is a store of recoverable customer text, which makes it a confidentiality target, not just an availability one. OWASP recognized this in 2025 by adding Vector and Embedding Weaknesses to its Top 10 as LLM08. RAG hygiene, source trust, and treating the vector store as sensitive data are the controls this layer demands.

The prompt layer: where everyone is already looking

The inference and prompt layer is the one the whole industry obsesses over, and it does deserve attention. Prompt injection sits at the top of the OWASP 2025 Top 10 as LLM01 for the second edition running, because the root cause is structural and unsolved: the model reads instructions and data on the same channel and cannot reliably tell which is which. Direct prompt injection puts the malicious instruction in the user's own input. Indirect prompt injection hides it in content the model is given, a web page, a document, a tool result, so the operator never typed the payload and may never see it.

Around prompt injection sit its relatives. Jailbreaking aims to get the model to produce content its alignment was meant to refuse, through role-play framings, obfuscation, or multi-turn pressure. System-prompt extraction tricks the model into revealing its hidden instructions, which OWASP now tracks separately as System Prompt Leakage at LLM07, because those leaked instructions often contain logic and sometimes secrets an attacker can use. Improper output handling, LLM05 in the 2025 list (named Insecure Output Handling in the earlier edition), is the failure of trusting model output downstream without validation, so a model coaxed into emitting a script or a SQL fragment can cause a classic injection in whatever system consumes its text.
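The output-handling failure has the same fix it has always had: treat model text as untrusted data, never as code. The sketch below uses Python's built-in `sqlite3` module to show a value produced by a model being bound as a query parameter rather than concatenated into SQL. The table, the `find_orders` helper, and the sample "model output" are hypothetical.

```python
import sqlite3

def find_orders(conn, customer_name):
    """Look up orders for a customer name that came from model output.

    The value is bound as a parameter, so it is compared as a literal string
    and never parsed as SQL.
    """
    cursor = conn.execute(
        "SELECT id, item FROM orders WHERE customer = ?",
        (customer_name,),
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO orders (customer, item) VALUES (?, ?)",
    [("alice", "laptop"), ("bob", "phone")],
)

# A model coaxed into emitting a SQL fragment instead of a plain name.
model_output = "alice' OR '1'='1"

injected_rows = find_orders(conn, model_output)  # no customer has this literal name
normal_rows = find_orders(conn, "alice")
```

Had the same string been spliced into the query text, the `OR '1'='1'` fragment would have matched every row. Bound as a parameter it matches none, because no customer carries that literal name.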

This layer is real, it is busy, and it is worth defending. But it is the layer where compromise is most visible and most temporary. A jailbroken prompt is caught in a log; a patched guardrail kills it. Compare that to a poisoned corpus or abliterated weights, where the compromise is invisible at the prompt and permanent in the model. The mistake is not paying attention to the prompt layer. The mistake is believing it is the whole map.

The agent layer: where text becomes action

At the top of the stack, the model is wired to tools, memory, and autonomy, and that changes the stakes entirely. An agent is a model in a loop: it reads a goal, picks an action, calls a tool, reads the result, and decides again, often many times before it stops. When the model can act, a successful attack at any layer below stops being a bad sentence and becomes a real-world consequence: a record deleted, a secret exfiltrated, a message sent as the victim, a payment moved.

The dominant risk here is excessive agency, which OWASP places at LLM06. It breaks into three root causes: excessive functionality, where the agent can reach tools it does not need; excessive permissions, where those tools run with broader rights than the task requires; and excessive autonomy, where high-impact actions fire with no human in the loop. An attacker who lands an injection anywhere below, in a poisoned document, a retrieved record, or a direct prompt, can escalate it through chained tool calls into impact. Each call is individually within the agent's granted permissions, so no single step trips an alarm. This staged structure is the agent kill chain, and it is the subject of Krypteia's dedicated guides on agentic AI red teaming, MCP security, CLI tool security for AI agents, and AI agent security.
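The three root causes map onto three checks that can sit between the model and its tools. The sketch below is one possible shape for such a gate, not a reference implementation: the `ToolPolicy` class, the tool names, and the three decision strings are all assumptions for illustration. Per-tool permissions would still be enforced inside each tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolPolicy:
    """Which tools a task may reach, and which of those need a human sign-off."""
    allowed_tools: frozenset
    needs_approval: frozenset = frozenset()

def authorize(policy, tool_name, approved_by_human=False):
    """Decide whether a tool call requested by the model may proceed.

    Returns "deny", "hold_for_approval", or "allow".
    """
    # Excessive functionality: tools outside the task's allowlist are unreachable.
    if tool_name not in policy.allowed_tools:
        return "deny"
    # Excessive autonomy: high-impact tools wait for a person.
    if tool_name in policy.needs_approval and not approved_by_human:
        return "hold_for_approval"
    return "allow"

# Hypothetical policy for a support agent that may read tickets and issue refunds.
support_policy = ToolPolicy(
    allowed_tools=frozenset({"read_ticket", "issue_refund"}),
    needs_approval=frozenset({"issue_refund"}),
)

decisions = [
    authorize(support_policy, "read_ticket"),
    authorize(support_policy, "issue_refund"),
    authorize(support_policy, "issue_refund", approved_by_human=True),
    authorize(support_policy, "delete_customer"),
]
```

The design point is that the decision lives outside the model. An injected instruction can ask for `delete_customer`, but the gate answers from the policy, not from anything in the model's context.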

The reason this layer is the natural endpoint of the whole map is that it is where every lower-layer attack cashes out. A backdoor planted in training, a safety layer stripped by fine-tuning, a poisoned RAG document, an injected prompt: none of them matter as much when the model can only produce text. All of them matter enormously when the model can take an action. The agent layer is the amplifier that turns a quiet compromise anywhere in the stack into a loud incident in the real world.

Attacks chain across layers

The single most important idea in this guide is that real attacks do not stay on one floor. They chain across the stack, and the defender who watches only the prompt layer misses the path entirely. The interesting question is never just what can go wrong at one layer; it is how a weakness at a low layer becomes an impact at the top.

Trace one example end to end. An attacker poisons a public document so it ranks well for a query the target's RAG system handles. The document carries an indirect prompt injection. When a user asks a related question, the retrieval layer pulls the poisoned document and places it in the model's context. The injected instruction tells the agent to call a tool, the agent has excessive permissions on that tool, and it exfiltrates data to an attacker-controlled address. Four stages participated: the public content supply, retrieval, prompt, and agent. A prompt firewall watching the user's input sees nothing wrong, because the user typed an innocent question. The payload arrived through the document.

This is why a layered defense is not optional. A control at one layer does not cover the layers below it, and a monitoring strategy fixated on the prompt is blind to compromises that were installed in the weights or the corpus long before any prompt was sent. The threat model has to span the whole stack, because the attacker already does.

Testing the full spectrum, not just the prompt

If the attack surface is six layers deep, then so is the testing. LLM penetration testing done properly is full-stack work, not prompt fuzzing. Prompt fuzzing, throwing thousands of adversarial strings at the input and grading the responses, is a legitimate technique for the prompt layer and only the prompt layer. It tells you nothing about a poisoned corpus, a tampered weight file, a fine-tuned-away safety layer, or a poisoned vector store. A test that only fuzzes prompts and reports clean is reporting on one sixth of the surface.

A full-spectrum assessment asks layer-specific questions. For the data and model supply chain: where did this model and its training data come from, what is the provenance, and what is the weight integrity story. For the alignment layer: how well does the safety behavior survive a small fine-tune, and was any third-party tuning involved. For the retrieval layer: can the vector store be poisoned, what content is treated as trusted, and is the embedded data recoverable. For the prompt layer: the full battery of injection, jailbreak, extraction, and output-handling tests. For the agent layer: the kill-chain work, tool by tool and permission by permission, demonstrating real impact against a staging environment.
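A test plan can be checked against the six layers mechanically. The sketch below is a small bookkeeping aid under stated assumptions: the layer labels are shortened from this guide, and the test-plan format (a list of dictionaries with a `layer` key) is hypothetical. It reports which layers a plan never touches.

```python
# Layer labels shortened from this guide, ordered bottom to top.
LAYERS = (
    "training data",
    "weights",
    "fine-tuning and alignment",
    "retrieval",
    "prompt",
    "agent",
)

def untested_layers(test_plan):
    """Return the layers, bottom to top, that no test in the plan covers."""
    covered = {test["layer"] for test in test_plan}
    unknown = covered - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layer labels: {sorted(unknown)}")
    return [layer for layer in LAYERS if layer not in covered]

def coverage(test_plan):
    """Fraction of the six layers with at least one test."""
    return (len(LAYERS) - len(untested_layers(test_plan))) / len(LAYERS)

# Hypothetical plan that only fuzzes prompts.
prompt_only_plan = [
    {"name": "direct injection battery", "layer": "prompt"},
    {"name": "jailbreak battery", "layer": "prompt"},
]

gaps = untested_layers(prompt_only_plan)
share = coverage(prompt_only_plan)
```

For the prompt-only plan, `gaps` lists the other five layers and `share` is one sixth, however many prompt tests the plan contains.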

The work maps cleanly onto the frameworks an organization already reports against. The OWASP Top 10 for LLM Applications covers the operational layers. MITRE ATLAS catalogs adversarial tactics and techniques against AI systems, including poisoning and model extraction, so findings have a shared vocabulary. The NIST AI Risk Management Framework and its generative AI profile give the governance anchor. Tagging each finding to these turns a list of clever attacks into a remediation plan that fits controls the organization is already obligated to manage, across the whole stack rather than one layer of it.

Defenses mapped to each layer

Because the attack surface is layered, the defense has to be layered too, and the controls at one layer do not substitute for the controls at another. Start at the bottom. The training data layer is defended with data provenance and curation: knowing where training data came from, vetting third-party datasets, and treating an unverified corpus the way you would treat unsigned code. You cannot inspect a poisoned model into safety, so the control has to sit on the inputs.

The weights layer is defended with integrity and provenance controls: cryptographic verification of model files, tracking the chain of custody for every checkpoint, and refusing to run weights whose origin you cannot establish. For open-weight models specifically, the defender has to accept that alignment is strippable and plan accordingly, rather than assuming the safety that shipped will persist. The fine-tuning layer is defended with alignment robustness: re-evaluating safety after any tuning step, controlling who can fine-tune a production model, and verifying that third-party tuned models still refuse what they should.

The retrieval layer is defended with RAG hygiene: validating and trust-ranking ingested documents, isolating untrusted content so it cannot carry instructions, treating the vector store as sensitive data because embeddings are recoverable, and monitoring for poisoned entries. The prompt layer is defended with input and output controls: separating trusted instructions from untrusted content, validating and sandboxing model output before any system acts on it, and never trusting model text downstream by default. The agent layer is defended with least privilege: scoping every tool to the minimum permission it needs, putting human approval in front of irreversible actions, and constraining autonomy so no single injection can compose its way to impact. The through line is that a system prompt asking the model to behave is not a control at any layer. The controls are structural, and they have to exist at all six.

Deep dives: the next level of the map

This guide is the orienting overview, the page you land on when you want the whole territory. The lower layers of the stack each deserve their own treatment, because the model core is where the least understood and most permanent attacks live. Four companion guides go deeper into the layers this overview has only mapped.

Fine-Tuning Attacks and Model Abliteration goes to the deepest layer: how safety is removed from a model at the parameter level, how the refusal direction is computed and projected out of the weights, how cheap malicious fine-tuning and low-rank adapter attacks strip alignment, and what weight integrity and alignment robustness look like as real controls. Data Poisoning and Backdoor Attacks covers the training data layer: how poisoned documents and sleeper triggers are planted in a corpus, why a small number of poisoned samples can plant a durable backdoor, and how data provenance defends the layer no prompt filter can reach.

RAG and Vector Database Security covers the retrieval layer: vector database poisoning, embedding inversion and the confidentiality of stored vectors, indirect prompt injection through retrieved content, and the ingestion-to-generation defenses that keep the knowledge channel trustworthy. Model Extraction and Supply Chain Security covers the weights and supply chain: how proprietary models are stolen through high-volume API querying and distillation, how downloaded checkpoints and third-party components carry upstream trust assumptions, and how rate limiting, monitoring, and provenance defend the model as an asset. Together with Krypteia's agent-layer guides on agentic red teaming, MCP security, CLI tool security, and AI agent security, these are the deeper layers of the same map this overview lays out.

Frequently Asked
What is the full LLM attack surface?
The LLM attack surface spans six layers, not just the prompt. From the bottom: the training data the model learned from, the base model weights, the fine-tuning and alignment layer, the retrieval and RAG layer, the inference and prompt layer, and the agent and tool layer. Each layer has its own attacks, and real attacks chain across them, so a defense that only covers the prompt misses most of the surface.
What is model abliteration?
Abliteration is the targeted removal of a model's refusal behavior directly from its weights. Research established that refusal in instruction-tuned models is mediated by a single direction in the model's activation space. Abliteration computes that direction from the difference in the model's internal activations on harmful and harmless prompts, then projects it out of the weight matrices, leaving a permanently unaligned model. It is the deepest jailbreak because it is permanent and at the parameter level, not a prompt trick.
Is prompt injection the main LLM security risk?
Prompt injection is the most visible risk and sits at the top of the OWASP 2025 Top 10, but it lives on only one of the six layers of the attack surface. Attacks at lower layers, poisoned training data, stolen or abliterated weights, malicious fine-tuning, and poisoned RAG stores, are often more dangerous because they are invisible at the prompt and permanent in the model. Treating prompt injection as the whole of LLM security is the most common mistake security teams make.
How do you secure an LLM end to end?
Defend every layer, because controls at one layer do not cover another. Use data provenance for the training corpus, integrity and provenance checks for the weights, alignment robustness testing after any fine-tune, RAG hygiene and vector-store protection for retrieval, input and output controls at the prompt layer, and least-privilege plus human approval for the agent layer. A system prompt asking the model to behave is not a control. The controls are structural and have to exist at all six layers.
What is the difference between LLM security and AI agent security?
AI agent security is the top layer of LLM security. LLM security covers the whole stack, from training data and weights up through retrieval and prompts. AI agent security focuses on the layer where the model is wired to tools and autonomy, where attacks like excessive agency and the agent kill chain turn a text-level compromise into real-world action. Agent security is where lower-layer attacks cash out, which is why it has the largest blast radius.
Are these attacks on the model core real or just research?
They are real and documented. Abliteration was introduced in 2024 and has been packaged into open-source toolkits that strip safety from open-weight models. Data poisoning with small numbers of samples, malicious fine-tuning, RAG poisoning with a handful of documents, and model extraction through API querying are all demonstrated techniques, several of them named in the OWASP 2025 Top 10 and cataloged in MITRE ATLAS. The risks at the model core are operational, not hypothetical, especially for open-weight models.
Why does full-spectrum LLM penetration testing matter?
Because prompt fuzzing tests one sixth of the attack surface. A test that throws adversarial strings at the input and reports clean says nothing about a poisoned corpus, tampered weights, stripped safety, or a poisoned vector store. Full-spectrum testing asks layer-specific questions about provenance, weight integrity, alignment robustness, retrieval trust, prompt handling, and agent permissions, then maps each finding to OWASP, MITRE ATLAS, and NIST so the remediation covers the whole stack.