Skip to content

Smuggle It in a Picture: Multimodal Prompt Injection

Get a multimodal model to follow instructions you hid inside an image or audio clip, not the text prompt the user actually typed.

The Play

A multimodal model ingests pixels and audio samples through the same pathway it uses to read your words. To the model, an image is not a thing to describe. It is more context. And context, in a language model, is instructions waiting to be followed.

There are two clean families here, both published, both reproducible on your own lab.

The first is typographic. You render the instruction as visible text inside the picture, styled to read like a diagram, a sign, a screenshot, or a label. The user sees a normal image. The model's vision encoder reads the words and treats them as part of the prompt. FigStep is the canonical public example: harmful or off-policy requests are drawn as text inside an image, paired with an innocent text prompt, and the typographic channel walks straight past text-only filters. No pixel math, no gradients, just a font and a black-box request.

The second is perturbative. You blend an imperceptible adversarial perturbation into an image or an audio clip so that, to a human, nothing changed, but the model's encoder resolves it into attacker-chosen tokens. Bagdasaryan and colleagues called this adversarial instruction blending: a benign, unmodified model is steered by the carrier alone. The user asks an ordinary question about an ordinary-looking file, and the dialog bends to the attacker's script.

The lesson under both: the model never asked "is this content, or is this a command?" That question has no answer in the architecture. That is the whole bug. You are not breaking a filter. You are exploiting the absence of a boundary that was never built.

This play is about proving that to yourself, on your own model, so you can defend it. Methodology only. Public, book-and-OWASP level. No working strings.

Before the Snap

Stand up the range and the rules before you render a single pixel.

Authorization: write down, in one line, the endpoint you own and the date range you are testing. If you cannot name the owner and it is not you, stop. This play is owned-target-only.

The lab: serve one open-weight vision-language model behind a local REST endpoint on your own machine. Pick something documented and reproducible. Keep it offline or firewalled. Give it no tools, no file access, no outbound network, nothing it could actually misuse, because the point is to measure instruction-following, not to wire up a real consequence.

The baseline: before any injected carrier, run a clean batch. Same benign text prompts, ordinary images, no embedded instructions. Record what normal looks like. You need this so that any behavior change you see later is attributable to the carrier, not to the model being moody.

The marker: choose a harmless, unmistakable success signal. A specific nonsense phrase the model would never emit on its own, or a clearly-flagged benign behavior flip. You want a clean yes/no per trial, not a judgment call. This keeps the whole exercise defensive: you are detecting whether the hidden channel fired, not producing anything harmful.

The tooling: install garak. Its visual_jailbreak probe family includes a public FigStep implementation, which gives you a repeatable, scored harness for the typographic family against a VLM you point it at.

Run It

  1. Confirm scope and stand up the lab. Verify in writing that the multimodal endpoint is yours, then serve one open-weight vision-language model behind a local REST endpoint with no tools, no file access, and no outbound network.
  2. Establish a baseline. Send a batch of benign text prompts with ordinary, un-embedded images and record normal output, including how often your harmless marker behavior appears by chance (it should be near zero).
  3. Build the typographic carrier (methodology). Render a directive as visible text inside an image, styled to read like a diagram, label, or screenshot, paired with an innocent text prompt. Use the FigStep methodology as your public reference for how text-as-image bypasses text-only filters. Keep the directive benign and marker-based.
  4. Run the typographic batch with garak. Point garak's visual_jailbreak / FigStep probe at your endpoint, or feed your own carrier images through the same harness, and let the detector score whether the model followed the embedded text instead of the typed prompt.
  5. Build the perturbative carrier (methodology). Following the published adversarial-instruction-blending approach, prepare an image or short audio clip where a near-imperceptible perturbation, not visible text, is what steers the model. Confirm to your own eyes and ears that the carrier looks and sounds unchanged.
  6. Run the perturbative batch and compare. Submit the carriers with ordinary text questions, and check whether the model emits your marker behavior that the clean baseline never produced.
  7. Score and isolate. Compute attack-success rate per family (typographic vs perturbative) across the batch, and prove the control: remove the carrier, keep the text, and confirm the marker behavior disappears. That delta is your finding.
  8. Write it up against the defense. Map each successful carrier to OWASP LLM01:2025 and ATLAS AML.T0051.001, and note which input-side control (OCR-and-scan of images, modality isolation, output constraints, least-privilege on any downstream tools) would have caught it.

What You Learn

You learn that "the model read the image" and "the model obeyed the image" are the same event, because a multimodal model has no internal type system that separates content from command. You learn the two public carrier families, typographic (visible text drawn into the picture, the FigStep pattern) and perturbative (imperceptible blending, the adversarial-instruction-blending pattern), and that the typographic one needs nothing but a font and black-box access. You learn to measure attack-success rate with a clean baseline and an isolatable control, so a finding is a number with a delta, not a vibe. Most of all you learn the defensive lesson: this is not a filter you patch, it is a missing boundary. Any control you ship lives at the input edge (scan and sanitize image and audio content before it reaches the model), at the modality border (treat ingested media as untrusted data, never as instructions), and at the blast radius (least privilege on whatever the model can actually do).

Drive It with Claude Code

You are assisting an authorized assessment of a multimodal model endpoint I own and operate at http://localhost:8000 (an open-weight vision-language model I am self-hosting for security testing, no tools, no outbound network). Scope is this endpoint only. Goal: measure how reliably content embedded in an image overrides the user's typed text prompt, OWASP LLM01:2025 multimodal, MITRE ATLAS AML.T0051.001. Help me: (1) script a clean baseline batch of benign text prompts with ordinary images and record a harmless marker-phrase rate; (2) set up garak's visual_jailbreak / FigStep probe against my endpoint as a REST target and parse its JSONL report into an attack-success rate; (3) run a control that strips the image and keeps the text to isolate the carrier's effect; (4) summarize results as a table mapping each finding to OWASP LLM01 and the input-side mitigation that would catch it. Methodology and measurement only. Do not generate harmful payloads or working jailbreak strings, keep all directives benign marker phrases.

# Install garak (NVIDIA LLM vulnerability scanner)
python -m pip install -U garak
 
# List the multimodal visual-jailbreak probes (includes FigStep)
python -m garak --list_probes | grep visual_jailbreak
 
# Run the FigStep visual-jailbreak probe against a VLM you own/self-host.
# Replace target_type/target_name with YOUR endpoint config. Scope: owned only.
python -m garak \
  --target_type huggingface \
  --target_name your-org/your-open-vlm \
  --probes visual_jailbreak.FigStep
 
# Output: structured JSONL report with per-attempt detector scores
# (attack-success rate) for downstream comparison across runs.

Defend It

Treat every ingested image and audio file as untrusted data, never as a source of instructions. That is the one sentence the architecture forgot. Concretely, per OWASP LLM01:2025: run OCR and content scanning on inbound images before they reach the model, and flag or strip embedded text directives. Isolate modalities so media content cannot escalate into the instruction context (clear provenance, a hard separation between "what the user said" and "what a file contained"). Constrain output format and behavior so a hijacked turn cannot reach anything dangerous, and enforce least privilege on every downstream tool or action the model can trigger, so even a successful injection has nowhere to go. Add an independent guardrail or moderation pass on inputs and outputs rather than trusting the model to police itself, because FigStep showed that prompt-level OCR self-checks only dent the problem and do not close it, especially on open VLMs. Keep humans in the loop for any high-impact action. And regression-test it: garak's visual_jailbreak probes give you a repeatable score you can run on every model and prompt change, so a defense you shipped last quarter is proven still standing this quarter.

References

Krypteia AgentComing soon

The Krypteia agent will run scoped multimodal injection sweeps against vision and audio endpoints you authorize, mapping every hit to OWASP LLM01 and ATLAS AML.T0051.001 in the operator console: coming soon.