Pull the Curtain: System Prompt Leakage

The Play

The system prompt is the natural-language config for the whole application: its persona, its rules, its tool list, and far too often a password or API key the builder thought the user would never see. It sits one layer above the conversation, so the model treats it as trusted context rather than a secret. That is the trust boundary you abuse. You are not breaking encryption; you are asking the model to summarize, translate, or continue text it is already holding, and a model that was told to be helpful tends to comply. Once the instructions are in view, every guardrail is a known quantity, and a known guardrail is a routed-around guardrail.

Before the Snap

Pick a sanctioned target first. Lakera Gandalf and Doublespeak.chat are both hosted ranges built for exactly this, so no scope signature is needed beyond their terms of use. For anything you own, stand up the chatbot locally with a deliberately stuffed system prompt (a persona, two or three rules, a fake secret) so you can compare what leaks against ground truth. Open a notebook to log, per attempt: the angle you used, what came back, and which rule that reveals. Read the OWASP LLM07 entry before you start so you are hunting the right thing: not just the persona text, but embedded credentials, role definitions, and tool schemas.

Run It

Fingerprint the surface: send benign messages and note refusals, tone, and any phrases that hint at a governing instruction. The shape of the refusals tells you where the rules live.
Ask the model to report its own operating instructions directly, in the plain framing a confused user would use. Builders frequently forget to forbid the obvious question.
If refused, shift the frame: request a summary, a translation, or a restatement of the model's context rather than the verbatim text. Same content, different ask, different guardrail in the path.
Probe the boundary markers: ask what came immediately before the conversation started, or for the model to continue text from its initial setup. You are testing whether the prompt boundary is enforced or just implied.
When fragments leak, reassemble them across turns. Partial reveals from several angles often reconstruct the full instruction set without any single attempt tripping a filter.
Inventory the loot: separate the haul into persona, rules, embedded secrets, and tool or function definitions. Flag every secret as a finding on its own, independent of the leak.
Map the route-around: for each named rule, write the path that would now bypass it. This is the deliverable that turns recon into impact and proves the leak mattered.
On Gandalf, climb level by level and on Doublespeak work through the scenarios, recording the defense each level added so you can name the escalation of controls, not just the wins.

What You Learn

The failure class is treating the system prompt as a secret and as a security control. It is neither. It is readable context the model wants to share, so any guardrail, credential, or authorization decision placed inside it is one helpful answer away from disclosure. The transferable lesson: a control a user can talk the model into revealing was never a control.

Drive It with Claude Code

Run a system prompt leakage assessment against the authorized chatbot endpoint in our signed scope. Use garak's leakage and promptinject probes to test whether the model discloses its hidden instructions, then map any disclosures to OWASP LLM07 and ATLAS, and write findings to the engagement report.

python -m garak \
  --model_type rest \
  --generator_option_file authorized_endpoint.json \
  --probes leakreplay,promptinject \
  --report_prefix llm07_system_prompt_leakage

Defend It

Assume the system prompt will be read, and build so that reading it costs you nothing. Pull every secret out of the prompt: API keys, passwords, database names, and connection strings live in a secrets manager and are referenced by the application layer, never pasted into model context. Stop using the prompt as the enforcement point. Authorization, privilege separation, and content rules run deterministically outside the model, in code that checks identity and inspects output regardless of what the model was told. Add an independent guardrail layer that scans both input and output, so a leaked rule does not equal a bypassed rule. Per OWASP LLM07, the real fix is not hiding the prompt harder; it is making disclosure boring.