AI Security Fundamentals ; The Threat Model

If you walk into AI security with an AppSec mental model, you will be wrong in specific, predictable ways. Traditional AppSec assumes a deterministic system: identifiable trust boundaries, parseable inputs, enforceable grammar. AI systems break every one of those assumptions. The model does not have a grammar. The trust boundary is fuzzy because the model conflates instruction and data. The same input produces different outputs across runs. You are not defending an application in the old sense. You are defending a stochastic, non-deterministic, instruction-following machine that has been wired into your systems and asked to act on user input.

The first hour of any AI security engagement is unlearning the false certainties of AppSec and re-deriving the threat model from how the system actually works.

Why traditional threat models break

Three core assumptions of AppSec do not hold for AI systems.

Determinism. A SQL injection either bypasses the filter or it does not. The same input produces the same result every time, so a single test proves or disproves the vulnerability. AI systems are probabilistic. A prompt injection that succeeds 30% of the time is a real vulnerability even though it does not always succeed, and it requires statistical evaluation, not pass/fail tests.

Parseable input. Traditional inputs have grammars. SQL has a parser. HTTP has a spec. You can escape, you can canonicalise, you can validate against a schema. Natural language has none of that. There is no escape character for prompt injection because there is no grammar to escape to. Encoding, normalising, or filtering inputs gets you a defence-in-depth layer, not a fix.

Static trust boundaries. In a typical web app, you can draw a line between trusted code (the server) and untrusted input (the user). That line is solid. In an AI system, every piece of text that ends up in the context, whether it came from your system prompt, the user, a retrieved document, or a tool result, gets the same treatment from the model. The model cannot tell them apart natively. The trust boundary you thought you had is a convention enforced by prompt engineering, not a security control.

A fourth assumption is more subtle: emergent behaviour. AI systems exhibit capabilities at certain scales that were not present at smaller scales. A jailbreak that does not work on a small model may work on a large one, and a defense that worked yesterday may stop working when the provider upgrades. Your threat model has to assume capabilities you have not seen yet.

The OWASP LLM Top 10 in practical terms

The OWASP LLM Top 10 is the closest thing the industry has to a shared vocabulary for AI threats. Memorise the numbering. You will see it in scoping documents, vendor questionnaires, and SOC reports.

ID	Name	Practical meaning
LLM01	Prompt Injection	Attacker-controlled text in the context overrides instructions or extracts secrets.
LLM02	Insecure Output Handling	Downstream systems trust the model output as if it were safe input.
LLM03	Training Data Poisoning	Adversarial data inserted into training or fine-tuning corpora to alter behaviour.
LLM04	Model Denial of Service	Resource-exhausting inputs (long context, recursive prompts) that drive up cost or kill availability.
LLM05	Supply Chain Vulnerabilities	Compromised models, weights, datasets, or third-party plugins.
LLM06	Sensitive Information Disclosure	The model returns secrets it should not, from training data or context.
LLM07	Insecure Plugin Design	Tools and plugins with poor input validation or excessive privilege.
LLM08	Excessive Agency	Agents given more permission than they need, leading to unintended actions.
LLM09	Overreliance	Humans or systems trust model output without verification, even when it is wrong.
LLM10	Model Theft	Extraction of proprietary model weights or capabilities through query attacks.

Two observations. First, LLM01 and LLM02 are by far the most common in real engagements, and they are the same root cause from different angles: the model conflates data with instruction, and downstream systems conflate model output with trusted data. Get those two right and you have eliminated most of the easy findings. Second, LLM08 (excessive agency) is the one that scales with capability. The more your agent can do, the worse a successful injection becomes. We will revisit it constantly in the offensive and defensive modules.

The trust boundary problem

Draw your AI system as a box. Then draw arrows for every piece of text that enters the model's context window:

The system prompt (you control this)
The current user message (the user controls this)
The conversation history (the user has been influencing this for the whole session)
Retrieved documents from your vector DB (could be poisoned upstream)
Tool results (could contain content from external systems the attacker influences)
File uploads, image contents, transcribed audio (all attacker-controlled in many setups)

Every arrow is text the model will weight by attention and use to produce its next output. From the model's perspective, they are all the same. The CIA triad still applies, but with new failure modes for each leg.

Confidentiality. The model can leak the system prompt, retrieved documents, user data from other sessions if state is mis-scoped, and even fragments of training data. Information disclosure happens through cleverly crafted queries that route the model toward content it should not surface. There is no good way to put a secret into the context and trust that it will not come out.

Integrity. Model output is what your downstream systems act on. If the attacker can influence the output, they can influence those actions. Indirect prompt injection (poisoning a document the model will read) is the textbook integrity attack. The "output" is not just text shown to the user. It includes structured tool calls the agent makes on the user's behalf.

Availability. Beyond the obvious cost-denial through expensive queries, availability also fails through subtle paths: a malicious tool result that puts the agent into a loop, an injection that causes the agent to refuse legitimate work, a context-poisoning attack that destroys the user's session memory.

A practical threat modelling framework for AI apps

Skip STRIDE for a first pass. Use a four-question framework instead:

1. What can the model see? Enumerate every text source that ends up in the context. For each, ask who controls it and what their incentive is. Anything that an external actor can influence (web content, email bodies, documents uploaded by other users, search results) is hostile by default.

2. What can the model do? List every tool, function, or downstream action the agent can trigger. For each, ask what the worst outcome is if an attacker fully controls the parameters. "Send email" sounds harmless until you realise it can be used to exfiltrate the entire conversation history to an attacker-controlled inbox.

3. Who can reach the model? Authenticated users only? Anonymous web traffic? Other internal systems? Each population needs a different risk profile and different rate-limit budget. An attack that requires authenticated access has a much smaller likelihood than one that works against anonymous traffic.

4. What does success look like for an attacker? Be specific. "Bypass safety" is not a goal. "Extract the customer record for user X" or "send a phishing email from the support agent to the customer list" is a goal. Goal-oriented threat modelling produces test cases. Capability-oriented threat modelling produces handwaving.

A worked example: a documentation assistant

Imagine an agent that answers customer questions by reading internal documentation and your public knowledge base. It has two tools: search_docs and escalate_to_human. Let us apply the framework.

What can it see? System prompt (you), user message (customer), conversation history (customer), search results from your docs (you), search results from the public knowledge base (whoever wrote those docs, possibly a community contributor).

What can it do? Return text to the user, call escalate_to_human (which creates a support ticket with conversation history).

Who can reach it? Anyone with a customer account. Low barrier.

What does success look like? An attacker who is also a customer wants to either: extract internal docs that should not be shown, extract another customer's conversation history, or get the agent to send something embarrassing to a high-value account.

Now the obvious findings fall out. The search results from the community-contributed knowledge base are a prompt injection vector (LLM01) because anyone can submit content. The escalate_to_human tool needs to scope the conversation history to the current user only, otherwise an injection could include another user's history in the ticket (LLM06). The agent should refuse to discuss anything outside its scope, but the refusal logic has to live downstream of the model output, not inside the prompt, because the prompt can be overridden (LLM02 plus LLM08).

You did not need STRIDE. You needed to know what the model sees, what it can do, and what an attacker would want. The rest is mechanical.

Carry-forward for the rest of the course

Three things to internalise before moving on.

First, treat every text source in the context as having the privilege level of the least-trusted writer. If your retrieval pipeline pulls from a wiki anyone can edit, then your retrieved documents are user input as far as security goes.

Second, the model is not your last line of defense. It is your prompt-shaped reasoner. Real safety lives in the layers around it: input validation, output validation, tool authorisation, and audit logging.

Third, threat models for AI systems must be re-evaluated whenever the model is upgraded, the tools change, or the data sources change. AI systems are alive in a way traditional applications are not. Their behaviour shifts under you. Your threat model has to be a living document, not a one-time artifact.

The next four modules build agents that exercise every assumption here. The two modules after that break them on purpose, then teach you how to put the pieces back together.