The Play
A chat model receives the developer's system prompt and your input as one continuous token stream. There is no privileged channel that marks the system instructions as inviolable, so the model is left ranking competing instructions by salience, recency, and framing rather than by authority. Direct prompt injection abuses exactly that missing trust boundary: you type instructions that outrank the developer's, and the model, having no way to know whose orders are whose, follows yours. It works because the defense the developer was relying on, "I told the model not to," is a preference expressed in the same medium you control, not an enforced control.
Before the Snap
Pick a sanctioned range, never a production app. Gandalf is hosted and built for this, so the lab is the URL: no install, no infrastructure, and the scope is the game itself. For the HackAPrompt playground stage, the same applies. Confirm you are on the practice target and not a live system that happens to share a UI. Have a notes file open to record what you sent and what came back, because the whole value here is the pattern you extract, not the win. The research this stands on is the OWASP GenAI work on LLM01 and Lakera's framing of the Gandalf challenge; credit both as you learn from them.
Run It
- Open the practice range and read the model's stated job for this level. The defender's instruction is your map; the constraint they wrote is the thing you are testing.
- Send a plain, honest request first and record the baseline refusal. You are learning the shape of the guardrail before you probe it, not skipping to a win.
- Form a hypothesis about how the level filters: is it watching the input, the output, or both? Note which, because the rest of the play branches on that answer.
- Test one variable at a time. Change framing, ordering, or who the model thinks is speaking, and send again. Single-variable changes are what turn a lucky hit into a repeatable technique.
- Read every response as signal, including partial leaks and changed refusals. A refusal that shifts wording tells you the boundary moved; follow that gradient.
- When a level falls, write down the minimal element that made it work, then immediately try to make the model refuse again. Knowing why it broke is the deliverable, not the cleared level.
- Advance through Gandalf to a guarded level, then carry the same method to the HackAPrompt playground, where the targets vary and the instruction-ranking idea has to hold under different filtering.
What You Learn
A system prompt is a request the model is asked to honor, not a wall it cannot cross. When attacker text and trusted instructions share one channel, instruction priority is decided by the model's judgment, and judgment is steerable. The transferable failure class: any design that depends on "the model was told not to" has no enforced boundary, and that assumption is the bug, not the specific words that triggered it.
Drive It with Claude Code
Run a direct prompt injection assessment against the authorized staging chatbot endpoint defined in our signed scope, using garak's promptinject and dan probe families, then summarize which instruction-override categories the model failed and map each finding to OWASP LLM01.
python -m garak \
--model_type rest \
--generator_option_file authorized_endpoint.json \
--probes promptinject,dan \
--report_prefix llm01_direct_injectionDefend It
Stop treating the prompt as a security boundary, because it is not one. Separate trusted instructions from untrusted input with explicit structure and clear delimiters so the model can tell developer content from user content, and validate and sanitize input on the way in. Watch the output too: screen responses for system-prompt leakage and other injection signatures before they reach the user. Layer defense in depth per the OWASP Prompt Injection Prevention Cheat Sheet: least privilege on anything the model can call, a secondary classifier or guardrail model on inputs and outputs, human review on high-risk actions, and logging plus rate limiting so probing is visible. No single control holds; the stack is the control.