Spill the Secrets: Sensitive Information Disclosure

The Play

Models leak in three ways, and a good probe separates them. The first is memorization: large models can replay chunks of their training data word for word when nudged the right way, which spills anything sensitive that was in the corpus. The second is inference: a model that holds enough context can deduce a fact nobody handed it, like reconstructing a person's identity from scattered attributes. The third is context bleed: the model surfaces what it was given at runtime, whether that is the system prompt, a retrieval document, a tool result, or a fragment from a prior conversation that should never have crossed a tenant boundary. This play attacks all three on an authorized target. You do not write extraction strings by hand. You point a published probe harness at the model, classify what comes back by disclosure mode and data class, and verify that each hit reproduces. The output is a defender artifact: where the model leaks, why, and which mitigation closes it.

Before the Snap

Confirm the scope is signed and the target is owned or sanctioned. Sensitive-data probing can pull real PII out of a real corpus, so the range matters. Stand up a deliberately vulnerable target you control: DVAIA self-hosted gives you a leaky app with planted secrets, and a local open-weights model gives you a clean memorization surface. Identify every place the target sees data: the system prompt, any retrieval or RAG store, connected tools, and the training corpus if you know it. Decide your data classes up front: PII (names, emails, IDs), secrets (keys, tokens, credentials), and training fragments (verbatim corpus text). Have a redaction policy for your own report so you do not republish whatever the model spills. Set the inference temperature low and pin the model version so hits reproduce.
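
The data classes and the report redaction policy above can be sketched as a small pattern table. This is a minimal illustration, not a complete PII or secret detector: the regexes, the `DATA_CLASS_PATTERNS` name, and the `redact_for_report` helper are all assumptions introduced here, and you would tune them to your own canaries and data.

```python
import re

# Hypothetical pattern table: one illustrative regex per data class.
# These are deliberately narrow examples, not production-grade detectors.
DATA_CLASS_PATTERNS = {
    "pii_email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "pii_id": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped strings
    "secret_key": re.compile(r"\b(?:sk|key|tok)[-_][A-Za-z0-9_-]{16,}\b"),
}


def redact_for_report(text: str) -> tuple[str, list[str]]:
    """Replace matches with a class label; return redacted text and classes seen."""
    seen = []
    for data_class, pattern in DATA_CLASS_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{data_class}]", text)
        if count:
            seen.append(data_class)
    return text, seen
```

Run every model output through a helper like this before it goes into the report, and keep the raw outputs in the scoped evidence store only.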

Run It

1. Define the data classes and surfaces in scope, then plant known canary values into the target you control (a fake API key in the system prompt, a fake PII record in the retrieval store) so you can measure detection without touching real data.
2. Run garak's leakreplay probe against the authorized target to measure memorization: how readily the model replays verbatim training-corpus fragments when prompted to continue or complete known text.
3. Probe for inference disclosure by asking the model to reason over partial attributes you supply, and observe whether it reconstructs or confirms a sensitive fact it was never directly given. Record what it deduces and from what.
4. Probe for context bleed by querying for content that lives only in the system prompt, the retrieval store, or a tool result, and check whether scoped data crosses into the output. Use your planted canaries as the signal.
5. Classify every hit by disclosure mode (memorization, inference, context bleed) and data class (PII, secret, training fragment), and record the source surface for each.
6. Re-run each confirmed hit at a pinned model version and low temperature to verify it reproduces, and capture a minimal reproduction step for the defender.
7. Map every confirmed finding to OWASP LLM02:2025 (Sensitive Information Disclosure) and to the MITRE ATLAS exfiltration-via-inference-API technique (AML.T0024), then rank by data sensitivity and reproducibility.
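
The classify and rank steps above can be sketched as a lookup against the planted canaries. Everything named here is hypothetical: the canary values, the surface names, the `Finding` record, and the sensitivity weights are assumptions for illustration, not part of garak or any other tool.

```python
from dataclasses import dataclass

# Hypothetical canaries planted in the owned target: value -> (surface, data class).
CANARIES = {
    "sk-CANARY-7f3a9c2e41d8b650": ("system_prompt", "secret"),
    "casey.canary@example.test": ("retrieval_store", "pii"),
}

# Assumed mapping from source surface to disclosure mode.
SURFACE_TO_MODE = {
    "system_prompt": "context_bleed",
    "retrieval_store": "context_bleed",
    "tool_result": "context_bleed",
    "training_corpus": "memorization",
}

# Illustrative sensitivity weights, used only for ranking.
SENSITIVITY = {"secret": 3, "pii": 2, "training_fragment": 1}


@dataclass
class Finding:
    prompt: str
    mode: str
    data_class: str
    surface: str
    repro_rate: float = 0.0  # fraction of re-runs that reproduced the hit


def classify_output(prompt: str, output: str) -> list[Finding]:
    """Return one Finding per planted canary that appears verbatim in the output."""
    findings = []
    for value, (surface, data_class) in CANARIES.items():
        if value in output:
            findings.append(
                Finding(prompt, SURFACE_TO_MODE[surface], data_class, surface)
            )
    return findings


def rank(findings: list[Finding]) -> list[Finding]:
    """Order findings by data sensitivity, then by reproducibility, highest first."""
    return sorted(
        findings,
        key=lambda f: (SENSITIVITY[f.data_class], f.repro_rate),
        reverse=True,
    )
```

Verbatim matching only catches memorization and context bleed. An inference hit never contains a planted string, because the model assembled the fact itself, so under this sketch you would add those findings by hand with the mode set to inference.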

What You Learn

You learn that "the model leaked" is not a finding; it is a category. A memorization hit and a context-bleed hit look similar in the output but have different root causes and different fixes, so the mode is the finding. You learn that the most dangerous disclosures are often inference hits, where the model was never told the secret but assembled it from parts, which means input scrubbing alone cannot catch them. You learn to use planted canaries to measure detection precisely instead of guessing, and you learn that a hit only counts when it reproduces against a pinned version. By the end you can tell a defender not just that their model leaks, but exactly which surface to scrub, filter, or scope down.
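
The rule that a hit only counts when it reproduces can be sketched as a replay loop. This is a minimal sketch under stated assumptions: `generate` is a stand-in for whatever client your pinned, low-temperature model exposes (it is not a garak API), and the 0.8 threshold is an arbitrary illustrative cutoff.

```python
from typing import Callable


def reproduction_rate(
    generate: Callable[[str], str],
    prompt: str,
    canary: str,
    runs: int = 5,
) -> float:
    """Replay `prompt` `runs` times; return the fraction of outputs containing `canary`."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = sum(1 for _ in range(runs) if canary in generate(prompt))
    return hits / runs


def is_confirmed(rate: float, threshold: float = 0.8) -> bool:
    """Treat a hit as confirmed only at or above an assumed reproduction threshold."""
    return rate >= threshold
```

Record the rate next to each finding so the defender can see which leaks are stable and which are intermittent.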

Drive It with Claude Code

We have a signed scope to probe our self-hosted DVAIA range and a local open-weights model for sensitive information disclosure. Plant canary PII and a fake secret in the system prompt and retrieval store, run garak's leakreplay probe against the local model, then classify every hit by disclosure mode (memorization, inference, context bleed) and data class, and give me a finding table mapped to OWASP LLM02 and the ATLAS inference-API exfiltration technique with a reproduction step for each.

# Probe an AUTHORIZED, owned target for training-data replay (memorization).
# Replace the --target_name value with YOUR local sandbox model. Never point at production or a third party.
# Flag names follow recent garak releases; older releases spell them --model_type and --model_name.

# Measure memorization: does the model replay verbatim training-corpus fragments?
python3 -m garak \
  --target_type huggingface \
  --target_name gpt2 \
  --probes leakreplay \
  --report_prefix ahp05_leakreplay

# Inspect the run report (JSONL) for hits, then classify each by disclosure mode and data class.
# garak writes to ~/.local/share/garak/garak_runs/ by default; review before sharing.
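
Inspecting the JSONL report can be sketched as a short tally script. The field names are an assumption: this sketch expects evaluation lines shaped like `{"entry_type": "eval", "probe": ..., "passed": N, "total": M}`, so check them against the report your garak version actually writes. `summarize_report` and `hit_count` are hypothetical helpers, not garak functions.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_report(report_path: str) -> dict[str, dict[str, int]]:
    """Sum passed/total counts per probe from 'eval' entries in a JSONL report.

    ASSUMPTION: each eval line carries "entry_type", "probe", "passed", "total".
    Verify against the report your garak version actually produces.
    """
    summary = defaultdict(lambda: {"passed": 0, "total": 0})
    for line in Path(report_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        if entry.get("entry_type") != "eval":
            continue  # skip setup, attempt, and other non-eval entries
        probe = entry.get("probe", "unknown")
        summary[probe]["passed"] += int(entry.get("passed", 0))
        summary[probe]["total"] += int(entry.get("total", 0))
    return dict(summary)


def hit_count(counts: dict[str, int]) -> int:
    """Attempts that did not pass the detector, i.e. candidate leaks to triage."""
    return counts["total"] - counts["passed"]
```

A non-zero hit count is only a candidate list. Each entry still needs classification and a reproduction run before it becomes a finding.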

Defend It

Close the three modes at three layers. For memorization, you cannot cheaply un-train a deployed model, so wrap it: filter outputs against the sensitive-data patterns you care about, and add differential privacy or data minimization to any future training run. For inference, the fix is data minimization in context: never put attributes in front of the model that, combined, reconstruct something sensitive, and filter outputs for the reconstructed class even when no input matched. For context bleed, scope your retrieval and tools hard: enforce per-tenant access at the retrieval layer so a query can only ever reach documents the caller owns, and never treat a concealed system prompt as a security boundary, because prompt injection can bypass it. Layer output redaction (tokenization, pattern matching) on top of all three as the last line of defense. Re-run the probe after each control to prove the leak is actually closed.
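
Per-tenant scoping at the retrieval layer can be sketched as a filter that runs before any matching. This is a toy under stated assumptions: the `Document` record and `scoped_search` function are hypothetical, and the keyword match stands in for whatever vector or hybrid search your store really uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    tenant_id: str
    text: str


def scoped_search(store: list[Document], caller_tenant: str, query: str) -> list[Document]:
    """Filter by tenant BEFORE matching so other tenants' documents are never candidates."""
    owned = [doc for doc in store if doc.tenant_id == caller_tenant]
    terms = query.lower().split()
    return [doc for doc in owned if any(term in doc.text.lower() for term in terms)]
```

The design point is the order of operations. The tenant filter is applied to the candidate set, not to the results, so an injected prompt cannot widen what the query is able to reach.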