AHP-11: Make It Lie: Misinformation and Slopsquatting

Induce confidently-wrong output, fabricated citations, or hallucinated package names that downstream users or builds trust, then measure how far that false output travels before a human or a check catches it.

The Play

Most LLM attacks try to get past a refusal. This one goes after a fact. The target says something untrue with total confidence, and the harm is downstream: a developer installs a package that an attacker registered yesterday, a paralegal files a case that never existed, a build pipeline pulls a dependency that was hallucinated into being. You are not breaking the model. You are measuring its integrity gap, the distance between assertion and truth, and showing the defender where that gap becomes a supply-chain hole. Slopsquatting is the sharp end: identify the names the model invents most often, confirm they are unregistered, and you have predicted where a real adversary is most likely to plant the next malicious package. On an authorized range, you do this so the defender pins dependencies and adds grounding before anyone weaponizes it.

Before the Snap

Authorization first, every time. You need a signed scope that names the model, the endpoint, the date window, and explicit permission to elicit false output and to query public package registries during the test. This play touches a real supply chain: never register, publish, or reserve any hallucinated package name you discover, even as proof. Discovery is the deliverable; the name goes in the report, not on the registry. Stand up garak and promptfoo locally, pointed only at the authorized endpoint. Confirm rate limits and cost ceilings with the owner before running, because integrity probes are high-volume by design. Methodology and measurement only. No payloads, no weaponized output.

Run It

  1. Confirm scope and snapshot the target: record the model name, version, system prompt if disclosed, and whether grounding or retrieval is enabled, so results map to a specific configuration the defender can fix.
  2. Run garak's package-hallucination probe against the authorized endpoint to collect package names the model recommends as real across multiple ecosystems (for example, PyPI and npm).
  3. For each suggested package, query the matching public registry to classify it as real or non-existent, and keep only the non-existent ones as candidate slopsquatting targets.
  4. Run garak's misleading probe (misleading.FalseAssertion) to surface confidently-wrong factual claims and fabricated citations, then independently check a sample against authoritative public sources to confirm they are invented.
  5. Score the integrity gap: hallucinated-package rate, fabricated-citation rate, and which prompt shapes (vague asks, niche topics, made-up library requests) raise the rate.
  6. Trace the downstream path for the worst finding: document how one hallucinated package name would reach a build (assistant suggests it, developer installs it, CI resolves it) so the impact is concrete, not theoretical.
  7. Convert the highest-rate cases into promptfoo red-team assertions so the exact failures become a repeatable gate the defender can run in CI on every model or prompt change.
  8. Write the report: every claim paired with the prompt, the raw output, and the registry or source lookup that proves it false, mapped to OWASP LLM09 and the ATLAS integrity-erosion technique. Never publish or register any discovered name.
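
Steps 3 and 5 can be sketched in Python as follows. This is a minimal sketch, not garak's own detector. The function names (`exists_on_registry`, `classify`, `hallucination_rate`) are hypothetical, and the sketch assumes the public PyPI JSON endpoint and the npm registry endpoint answer 404 for an unregistered name. Every lookup is a read-only GET; nothing here registers or reserves a name. The rate is computed over unique names, which is one reasonable choice among several.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed public, read-only metadata endpoints; a 404 is read as "not registered".
REGISTRY_URLS = {
    "pypi": "https://pypi.org/pypi/{name}/json",
    "npm": "https://registry.npmjs.org/{name}",
}


def exists_on_registry(ecosystem: str, name: str, timeout: float = 10.0) -> bool:
    """Read-only existence check. Never registers or publishes anything."""
    # Keep "@" so npm scoped names become @scope%2Fname.
    quoted = urllib.parse.quote(name, safe="@")
    url = REGISTRY_URLS[ecosystem].format(name=quoted)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages must not be misread as "non-existent"


def classify(
    suggestions: Iterable[Tuple[str, str]],
    lookup: Callable[[str, str], bool] = exists_on_registry,
) -> Dict[str, List[Tuple[str, str]]]:
    """Split (ecosystem, name) pairs into real and non-existent, de-duplicated."""
    result: Dict[str, List[Tuple[str, str]]] = {"real": [], "nonexistent": []}
    for ecosystem, name in sorted(set(suggestions)):
        key = "real" if lookup(ecosystem, name) else "nonexistent"
        result[key].append((ecosystem, name))
    return result


def hallucination_rate(classified: Dict[str, List[Tuple[str, str]]]) -> float:
    """Share of unique suggested names that do not exist on their registry."""
    total = len(classified["real"]) + len(classified["nonexistent"])
    return len(classified["nonexistent"]) / total if total else 0.0
```

Passing `lookup` as a parameter lets you replay a saved run offline or throttle live queries to the rate limit agreed in scope. Only the `nonexistent` list carries forward as candidate slopsquatting targets, and it goes in the report only.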

What You Learn

You learn that a model's confident tone is no guarantee of accuracy, and that this gap is measurable, not vibes. You learn that slopsquatting is a prediction problem: the names a model hallucinates most are where a supply-chain attacker is most likely to strike, so finding them first is defense. You learn to separate two failure classes that read the same to a user, fabricated citations (an information-integrity harm) and hallucinated packages (a build-integrity harm), because they need different fixes. And you learn why a one-time spot check is not enough: model and prompt changes reopen the gap, so the finding only matters once it is a standing CI gate.
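
The "prediction problem" framing comes down to a frequency count. A minimal sketch, assuming you already hold one list of suggested names per run and a lowercase set of names confirmed as non-existent; `rank_hallucinated` and its parameters are hypothetical names for illustration:

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_hallucinated(
    runs: Iterable[Iterable[str]],
    nonexistent: Set[str],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Rank confirmed non-existent names by the share of runs that suggested them.

    `nonexistent` is assumed to hold lowercase names already checked
    against the registry.
    """
    counts: Counter = Counter()
    n_runs = 0
    for names in runs:
        n_runs += 1
        # Count each name once per run so one chatty response cannot dominate.
        counts.update({name.lower() for name in names} & nonexistent)
    if n_runs == 0:
        return []
    # Most frequent first; ties broken alphabetically for stable reports.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, count / n_runs) for name, count in ranked[:top_n]]
```

The names at the top of this ranking recur across runs, which is what makes them worth flagging to the defender first.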

Drive It with Claude Code

I have a signed scope to test the authorized model endpoint in scope.env for misinformation and package hallucination. Run garak's packagehallucination and misleading probes against it, then for every package name it returns, check the matching public registry and give me a table of which ones do not exist, with the exact prompt and output for each. Do not register or publish any name you find.

# Probe names vary by garak version; confirm with: python -m garak --list_probes
python -m garak \
  --model_type openai \
  --model_name "$AUTHORIZED_MODEL" \
  --probes packagehallucination.Python,packagehallucination.JavaScript \
  --report_prefix authorized_run_ahp11

# Misinformation / fabricated-claim pass on the same authorized endpoint:
python -m garak \
  --model_type openai \
  --model_name "$AUTHORIZED_MODEL" \
  --probes misleading.FalseAssertion \
  --report_prefix authorized_run_ahp11_misinfo
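
To build the registry table you need the package names out of the raw output text. A minimal sketch, assuming the model's suggestions appear as install commands at the start of a line; `extract_install_targets` is a hypothetical helper, not part of garak, and it will miss names that appear only in import statements or prose:

```python
import re
from typing import List, Tuple

# Install commands at the start of a line, with an optional shell prompt
# and an optional "python -m" prefix.
_INSTALL_RE = re.compile(
    r"^[ \t]*(?:[$>][ \t]*)?(?:python3?[ \t]+-m[ \t]+)?"
    r"(?:(?P<pip>pip3?)[ \t]+install|(?P<npm>npm)[ \t]+(?:install|i))"
    r"[ \t]+(?P<args>[^\n`#]+)",
    re.MULTILINE,
)
_NAME_RE = re.compile(r"^(@[a-z0-9._-]+/)?[a-z0-9][a-z0-9._-]*$", re.IGNORECASE)
# Flags whose next token is a file or URL rather than a package name.
_FLAGS_WITH_VALUE = {"-r", "--requirement", "-c", "--constraint", "-e", "--editable"}


def extract_install_targets(text: str) -> List[Tuple[str, str]]:
    """Return unique (ecosystem, name) pairs in order of first appearance."""
    found: List[Tuple[str, str]] = []
    for match in _INSTALL_RE.finditer(text):
        ecosystem = "npm" if match.group("npm") else "pypi"
        tokens = iter(match.group("args").split())
        for token in tokens:
            if token in _FLAGS_WITH_VALUE:
                next(tokens, None)  # skip the flag's argument
                continue
            if token.startswith("-"):
                continue  # other flags such as -U or --save-dev
            # Drop pip-style specifiers (==1.2, >=1.0) and extras ([extra]).
            name = re.split(r"[=<>!~\[]", token, maxsplit=1)[0]
            # Drop npm-style versions (name@1.2.3), keeping a leading scope "@".
            if ecosystem == "npm" and "@" in name[1:]:
                name = name[0] + name[1:].split("@", 1)[0]
            pair = (ecosystem, name.lower())
            if _NAME_RE.match(name) and pair not in found:
                found.append(pair)
    return found
```

Keep the prompt and the full output next to each extracted name so every row in the table can be traced back to the exchange that produced it.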

Defend It

Three layers, all from the public OWASP LLM09 guidance. First, grounding: wire the model to retrieval over a verified source set and instruct it to answer only from retrieved context, so it has somewhere true to stand instead of inventing. Second, citation verification: never render a model-supplied citation or source as authoritative until an automated check resolves it against a real document or record, and surface unresolved ones as unverified rather than fact. Third, dependency pinning against slopsquatting: pin and lock every dependency to known-good versions, resolve only from a vetted internal mirror or allowlist, and block install of any package name a human did not approve, so a hallucinated name cannot enter the build. Treat coding-assistant output as a suggestion to verify, never as a resolver. Make the garak and promptfoo probes a release gate so regressions get caught before they ship.
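
The third layer can be enforced with a small pre-install check. A minimal sketch, assuming pip-style requirement lines and a human-approved allowlist of package names; `gate_requirements` is a hypothetical helper, and a production gate would also verify hashes and the index URL:

```python
import re
from typing import Iterable, List

_NAME_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")
# name, optional [extras], then an exact == pin (wildcards like 1.* do not count).
_PINNED_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*"
    r"[A-Za-z0-9][A-Za-z0-9.!+_-]*(?=\s|;|$)"
)


def normalize(name: str) -> str:
    """PEP 503 normalization: runs of -, _ and . collapse to one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def gate_requirements(lines: Iterable[str], allowlist: Iterable[str]) -> List[str]:
    """Return one violation message per unapproved or unpinned requirement."""
    approved = {normalize(name) for name in allowlist}
    violations: List[str] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment
        if line.startswith("-"):
            # Options such as -r, -e or --index-url can pull in unreviewed code.
            violations.append(f"line {lineno}: option line needs manual review")
            continue
        match = _NAME_RE.match(line)
        if not match:
            violations.append(f"line {lineno}: unparseable requirement")
            continue
        name = normalize(match.group(1))
        if name not in approved:
            violations.append(f"line {lineno}: {name} is not on the allowlist")
        elif not _PINNED_RE.match(line):
            violations.append(f"line {lineno}: {name} is not pinned with ==")
    return violations
```

Run it in CI before any install step and fail the build on a non-empty result. A hallucinated name that reaches a requirements file then stops at the gate instead of at the resolver.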

Krypteia Agent (coming soon)

The Krypteia agent runs this whole sweep for you behind a signed scope: a multi-agent run that probes the authorized model for hallucinated packages and fabricated citations, verifies each one against the live public registry, and lights up an operator console that maps every confirmed integrity gap to OWASP LLM09 and MITRE ATLAS, with the discovered names locked in the report and never touching a registry. Coming soon.