Skip to content

Corrupt the Knowledge: RAG and Vector Poisoning

Inject a few crafted documents into a retrieval corpus to steer the model toward an attacker-chosen answer, then measure how few documents it takes and which integrity controls would have stopped them.

The Play

Retrieval-augmented generation closes the gap between a frozen model and live knowledge. The model answers a question by pulling the top matching documents from a vector store and grounding its response in them. That grounding is the attack surface. If you can get a few crafted documents into the corpus, and you can make them rank high for a chosen question, the model treats your text as truth and hands it back with a citation. You never touch the prompt, the weights, or the system instructions. You touch the library, and the librarian reads what is on the shelf. PoisonedRAG showed the asymmetry plainly: roughly five crafted documents per target question reach high attack success even in a corpus of millions, because retrieval is a similarity contest and you get to write documents engineered to win it. This play has you build the contest on your own range, run it, and then measure what defense changes the outcome.

Before the Snap

Authorization first. Build and poison a corpus you own. Do not add documents to, ingest into, or query any production or third-party RAG system, internal wiki, customer knowledge base, or shared vector store. Your range is a local embedding index seeded with public, openly licensed documents (public docs, open datasets, your own notes) plus the crafted entries you write for the test. Write a one-page rules-of-engagement note before you start: the corpus location, the target questions, the attacker-chosen answers, who owns the box, and a teardown step that deletes the index when you are done. Record the clean baseline (the correct answer for each target question, retrieved from the un-poisoned corpus) before you inject anything, so you have a before and after. Keep the crafted documents benign in content. The point is steering the answer to a harmless chosen target (a wrong date, a fake product name, a decoy recommendation), not generating harmful output.

Run It

  1. Define the range and the target. Stand up a local vector store you own and seed it with public, openly licensed documents. Pick three target questions and, for each, write down the correct baseline answer and the harmless attacker-chosen answer you want the model to give instead.
  2. Capture the clean baseline. Query the un-poisoned corpus with each target question. Record the answer and the top-k retrieved documents. This is your control: no change you observe later counts unless you can compare it to this.
  3. Craft the injection documents, methodology only. Following the public PoisonedRAG framing, write a small set of documents (a handful per target question) whose text is built to (a) rank high in retrieval for the target question and (b) assert the chosen answer. No exploit payloads. The craft is in topical similarity and assertion, not in any malicious string.
  4. Inject and re-index. Add the crafted documents to the corpus and rebuild or update the embedding index exactly as the owned system would on normal ingestion. Confirm the documents are present and embedded.
  5. Re-run the target questions and measure two things. First, retrieval: do the crafted documents appear in the top-k for the target question? Second, generation: does the model now return the attacker-chosen answer? Note how few documents it took to flip each answer.
  6. Assess with Giskard. Point Giskard at the poisoned RAG pipeline and run a scan to surface misinformation and grounding-quality failures. Use the report as an independent, repeatable signal of corpus health, not just your own eyeballing.
  7. Map and walk back the defense. Map each successful flip to OWASP LLM08 (vector and embedding weaknesses) and LLM04 (data poisoning), and to ATLAS AML.T0020. Then test the mitigations on the range: add document provenance and signing, filter retrieval by trusted source, and re-run. Record which control stopped the flip and which did not.
  8. Tear down. Delete the poisoned index and the crafted documents per your rules-of-engagement note. Write up the minimum document count that flipped each answer and the one control that would have prevented it.

What You Learn

You learn that RAG does not check whether a retrieved document is true, only whether it is similar to the question, and that similarity is something an attacker writes on purpose. You learn the asymmetry that makes this dangerous: a handful of crafted documents can flip an answer inside a corpus of millions, because retrieval ranks by relevance and the poison is engineered to be the most relevant thing in the store. You learn to measure poisoning as two separate gates, retrieval rank and generated answer, because a document that does not get retrieved cannot steer anything. Most of all you learn where the real defense lives. It is not at the prompt and not at the model. It is at ingestion: provenance, signing, source trust, and retrieval filtering. Corpus integrity is the control plane.

Drive It with Claude Code

On my authorized local range, build a small vector store I own, seed it with public documents, and record the baseline answers for three target questions. Then add a handful of crafted documents per question using the PoisonedRAG methodology, re-index, and report for each question whether the crafted documents reached top-k retrieval and whether the generated answer flipped to my chosen target. Finish by running a Giskard scan against the poisoned pipeline and mapping every flip to OWASP LLM08, LLM04, and ATLAS AML.T0020.

import asyncio
from giskard.scan import vulnerability_scan
 
# Assess an OWNED, locally built RAG pipeline after seeding crafted documents.
# my_agent wraps your local retriever + model. Nothing third-party is touched.
 
async def main():
    # 1) Independent grounding / misinformation scan over the poisoned pipeline
    await vulnerability_scan(
        target=my_agent,
        description="Owned local RAG range for corpus-poisoning assessment.",
        languages=["en"],
    )
 
    # 2) Lightweight corpus-integrity check: every document must carry provenance
    #    and a trusted-source signature. Unsigned / unknown-source = quarantine.
    missing = [
        doc["id"]
        for doc in corpus.all_documents()
        if not doc.get("provenance") or not doc.get("signature")
    ]
    if missing:
        print(f"INTEGRITY FAIL: {len(missing)} documents lack provenance/signature")
        print("Quarantine these before they reach retrieval:", missing[:10])
    else:
        print("INTEGRITY OK: all documents carry provenance and a trusted signature")
 
asyncio.run(main())

Defend It

Defense moves to the ingestion boundary. Treat every document entering the corpus as untrusted until proven otherwise: attach provenance metadata (source, author, timestamp, ingestion path) and sign trusted documents so unsigned or unknown-source entries can be filtered out at retrieval time. Constrain retrieval to trusted sources for sensitive queries instead of searching the whole pool. Watch for ingestion anomalies: bursts of near-duplicate documents, new documents that suddenly dominate top-k for a specific question, or entries whose embeddings cluster suspiciously tight around a single query. Run a periodic grounding scan (Giskard or equivalent) as a regression gate, so a poisoned answer shows up as a failed test before a user sees it. And keep a clean reference set: if an answer to a known question changes, that is a signal to audit what entered the corpus, not to trust the new answer.

References

Krypteia AgentComing soon

The Krypteia agent will run this play autonomously behind a signed scope: it builds the owned range, seeds and poisons the corpus, measures both the retrieval rank and the answer flip for every target question across a multi-agent orchestration, then maps each hit to OWASP and ATLAS in an operator console that hands you the minimum poison count and the one control that stops it. Coming soon.