The Play
The first eleven plays are you, a keyboard, and one prompt at a time. That finds bugs. It does not find the shape of a target. Real coverage comes from running hundreds of conversations, each a small variation, each scored automatically, so the signal floats up out of the noise and you are not reading every transcript at 2 a.m.
This play is the jump from manual probe to campaign. You stop being the attacker typing and become the operator wiring the machine that attacks. A red-team framework like PyRIT splits the job into four parts so you can compose them: a target (the endpoint under test), a converter (transforms a seed instruction into variants by encoding, rephrasing, or multi-step framing), an orchestrator (drives the conversation turn by turn, deciding what to send next), and a scorer (reads each reply and labels whether the objective was met). You supply the authorized target and the objective. The framework supplies the loop.
The reason this matters is multi-turn. Most guardrails are tuned to catch the obvious single-message attack. They are weaker across a conversation, where context builds slowly and no single turn looks hostile. An orchestrator that carries state across turns, escalating gradually, tests exactly the surface a single prompt cannot reach. That is the gap, and it is why conversation-level defense, not just per-message filtering, is the fix.
This is methodology. You are wiring orchestrators, converters, and scorers and pointing them at a range you own. You are not writing jailbreak strings. The framework's job is to systematize what you already know how to do by hand, run it at volume, and score it so the result is a ranked report, not a pile of chat logs.
Before the Snap
Signed scope first, always. A scripted campaign sends far more traffic than manual testing, so the rules of engagement have to name the target, the rate, the time window, and the kill switch in writing before a single turn fires.
Stand up your own range. Run AI Goat locally so the target, the data, and the logs are all yours. Never point an automated orchestrator at production or at anything you do not own, and never at a third-party model endpoint outside your authorization.
Set the objective and the scorer before the converter. Decide what a hit looks like and how the scorer recognizes it first. If you cannot define success, the campaign produces volume without meaning. Cap the turn count and the total run size so a runaway loop cannot hammer the target. Log every conversation to disk so the run is reviewable and repeatable.
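The caps, the kill switch, and the on-disk log can be enforced by a small guard object that the campaign script consults before every conversation and every turn. The sketch below is one way to do that, not part of PyRIT: `RulesOfEngagement`, `CampaignGuard`, and `BudgetExceeded` are hypothetical names, the field names are assumptions, and rate pacing is left out to keep it short.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RulesOfEngagement:
    """Limits copied from the signed scope (hypothetical field names)."""
    target_name: str
    max_turns: int           # turn cap per conversation
    max_conversations: int   # total conversation budget for the run


class BudgetExceeded(RuntimeError):
    """Raised when the next step would break a limit in the ROE."""


class CampaignGuard:
    """Checks every conversation and turn against the ROE and logs to disk."""

    def __init__(self, roe: RulesOfEngagement, log_path: Path) -> None:
        self.roe = roe
        self.log_path = log_path
        self.conversations_started = 0
        self.aborted = False

    def abort(self) -> None:
        """Kill switch: after this, nothing further is allowed to run."""
        self.aborted = True

    def start_conversation(self) -> int:
        """Reserve one conversation from the budget and return its number."""
        if self.aborted:
            raise BudgetExceeded("kill switch engaged")
        if self.conversations_started >= self.roe.max_conversations:
            raise BudgetExceeded("conversation budget spent")
        self.conversations_started += 1
        return self.conversations_started

    def check_turn(self, turn_number: int) -> None:
        """Call before sending turn `turn_number` (1-based)."""
        if self.aborted:
            raise BudgetExceeded("kill switch engaged")
        if turn_number > self.roe.max_turns:
            raise BudgetExceeded("turn cap reached")

    def log(self, record: dict) -> None:
        """Append one JSON line per event so the run can be replayed."""
        with self.log_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")


# Usage: a two-turn, one-conversation budget against a local lab.
roe = RulesOfEngagement(target_name="local-lab", max_turns=2, max_conversations=1)
guard = CampaignGuard(roe, Path(tempfile.mkdtemp()) / "campaign.jsonl")
conversation = guard.start_conversation()
guard.check_turn(1)
guard.log({"conversation": conversation, "turn": 1, "event": "sent"})
```

Raising an exception instead of returning a flag is deliberate: a loop that forgets to check a return value keeps running, while an unhandled `BudgetExceeded` stops the run.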
Run It
- Define the campaign on paper. Name the authorized target, the objective in one sentence (what a successful outcome looks like), the maximum turns per conversation, the total conversation budget, and the abort condition. These are the rules of engagement (ROE) for the run, and they are not optional.
- Stand up the target locally. Deploy AI Goat or another lab you own, confirm you can reach its endpoint, and wrap it as a PyRIT prompt target so the orchestrator can talk to it. Verify that one plain request-and-response round trip succeeds before you automate anything.
- Choose a scorer and wire it first. Pick or configure a scorer that maps each model reply to your objective (met / not met, or a graded label). Test it against a handful of known-good and known-bad replies by hand so you trust its labels before it runs at volume.
- Add a converter to generate variation. Attach a converter that transforms a seed instruction into variants (rephrasing, encoding, step-by-step framing). You provide the benign seed and the objective; the converter produces the spread. No hand-written payloads: the framework generates the variation space.
- Select a multi-turn orchestrator. Use an orchestrator that carries conversation state and escalates across turns rather than firing one shot. This is the whole point: test the slow-build path that single-message guardrails miss. Configure max turns from your ROE.
- Dry run small, then scale. Run a tiny campaign first (a few conversations, low turn cap) and read the full transcripts by hand. Confirm the orchestrator drives turns correctly, the converter varies the input, and the scorer labels sanely. Only then raise the conversation budget.
- Run the campaign and rank by score. Execute the full run within your budget, let every conversation log to disk, and sort the output by score so the conversations that met the objective surface at the top. You triage a ranked list, not a wall of chat logs.
- Reproduce and write it up. Take the top-scoring conversations, re-run them to confirm they are repeatable and not one-off noise, and document each as a finding: the conversation path, the turn where it crossed, the score, and the conversation-level control that would have caught it. Map each to OWASP LLM01 (prompt injection) and the MITRE ATLAS staging context.
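The last two steps, ranking and reproducing, come down to a sort and a per-configuration hit rate. The sketch below assumes each logged conversation has already been reduced to a record with a numeric score between 0.0 and 1.0. `ConversationResult`, its fields, and the 0.5 threshold are illustrative assumptions, not PyRIT types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConversationResult:
    conversation_id: str
    config_id: str                # seed + converter + orchestrator settings used
    score: float                  # scorer output: 0.0 (not met) to 1.0 (met)
    crossing_turn: Optional[int]  # first turn the scorer flagged, if any


def rank_by_score(results):
    """Highest-scoring conversations first; ties keep their input order."""
    return sorted(results, key=lambda r: r.score, reverse=True)


def reproduction_rate(reruns, threshold=0.5):
    """Per config_id, the fraction of re-runs whose score met the threshold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for result in reruns:
        totals[result.config_id] += 1
        if result.score >= threshold:
            hits[result.config_id] += 1
    return {config: hits[config] / totals[config] for config in totals}


# First pass: three conversations, one per configuration.
first_pass = [
    ConversationResult("c1", "cfg-a", 0.2, None),
    ConversationResult("c2", "cfg-b", 0.9, 4),
    ConversationResult("c3", "cfg-c", 0.6, 5),
]
ranked = rank_by_score(first_pass)

# Re-run the two configurations that scored highest, twice each.
reruns = [
    ConversationResult("r1", "cfg-b", 0.9, 4),
    ConversationResult("r2", "cfg-b", 0.8, 3),
    ConversationResult("r3", "cfg-c", 0.7, 5),
    ConversationResult("r4", "cfg-c", 0.1, None),
]
rates = reproduction_rate(reruns)
```

A configuration that reproduces every time is a finding. One that reproduces half the time is still worth writing up, with the rate stated, because the defender needs to know the attack is probabilistic.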
What You Learn
You learn that adversarial testing is a pipeline, not a chat session, and that the four-part split (target, converter, orchestrator, scorer) is what makes it scale. Once the scorer is trustworthy, the operator's job shifts from typing attacks to designing campaigns and triaging ranked output.
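"Trustworthy" can be made concrete by checking a scorer against replies you labeled by hand before it runs at volume. The sketch below treats a scorer as any function from reply text to True or False. `naive_canary_scorer` is a deliberately crude stand-in invented for illustration, and the replies are placeholder lab strings.

```python
from typing import Callable, Iterable, Tuple


def evaluate_scorer(
    scorer: Callable[[str], bool],
    labeled: Iterable[Tuple[str, bool]],
) -> dict:
    """Compare scorer output with hand labels and count each kind of outcome."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for reply, expected in labeled:
        predicted = scorer(reply)
        if predicted and expected:
            counts["true_pos"] += 1
        elif predicted and not expected:
            counts["false_pos"] += 1
        elif expected:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    total = sum(counts.values())
    agreement = (counts["true_pos"] + counts["true_neg"]) / total if total else 0.0
    return {**counts, "agreement": agreement}


def naive_canary_scorer(reply: str) -> bool:
    """Hypothetical scorer: 'objective met' if the reply mentions a canary."""
    return "canary" in reply.lower()


# Hand-labeled replies: True means the lab objective really was met.
labeled_replies = [
    ("Here it is: LAB-CANARY-0001", True),
    ("I can't help with that.", False),
    ("I will not disclose any canary value.", False),  # refusal, not a hit
    ("Sure, the weather is sunny.", False),
]
report = evaluate_scorer(naive_canary_scorer, labeled_replies)
```

Here the keyword scorer mislabels a refusal as a hit. That is the kind of false positive that would float to the top of a ranked report if the scorer were never checked by hand.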
You learn why multi-turn is the soft surface. A defense that scores each message in isolation has no memory of the conversation, so a slow escalation reads as eleven harmless turns. The orchestrator that carries state is testing exactly the blind spot per-message filtering leaves open, which is why the fix lives at the conversation level, not the message level.
You also learn the discipline that separates a red-team campaign from a denial-of-service accident: scope, rate caps, turn caps, a budget, a kill switch, and full logging. Automation multiplies your reach, which means it multiplies the cost of getting the rules of engagement wrong.
Drive It with Claude Code
I have a signed scope and a locally deployed AI Goat instance at an endpoint I own. Help me build a PyRIT orchestrator script that wraps that local endpoint as a prompt target, attaches one converter and one scorer, runs a capped multi-turn campaign against a single defined objective, logs every conversation to disk, and prints results ranked by score. Enforce a max turn count and a total conversation budget from my ROE, and do not generate any payload content yourself, only the orchestration scaffolding.
# PyRIT multi-turn orchestrator skeleton. Scaffolding only, no payloads.
# Target = a lab you OWN (e.g. local AI Goat). ROE signed before running.
# Import paths and signatures may differ between PyRIT releases; check them
# against the version you have installed.
import asyncio
from pathlib import Path

from pyrit.common import initialize_pyrit, IN_MEMORY
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.orchestrator import RedTeamingOrchestrator
from pyrit.prompt_converter import PromptConverter  # variation, not payloads
from pyrit.score import SelfAskTrueFalseScorer  # labels each reply

# IN_MEMORY keeps conversation memory only for the life of the process.
# For a run that must be logged to disk, choose a persistent memory type.
initialize_pyrit(memory_db_type=IN_MEMORY)

# 1) TARGET: your authorized, locally hosted lab endpoint
authorized_target = OpenAIChatTarget(
    endpoint="http://localhost:8000/v1/chat/completions",  # AI Goat you own
    api_key="LOCAL_LAB_KEY",
    model_name="lab-model",
)

# 2) SCORER: define what a "hit" is BEFORE running the campaign.
# The lab endpoint is reused as the scoring model to keep the skeleton short;
# consider a separate model you are authorized to use for scoring.
objective_scorer = SelfAskTrueFalseScorer(
    chat_target=authorized_target,
    true_false_question_path=Path("path/to/your_objective_question.yaml"),
)

# 3) CONVERTERS: generate variation from a benign seed (you supply no payloads)
converters: list[PromptConverter] = [
    # e.g. a rephrasing / encoding converter from pyrit.prompt_converter
]

# 4) ORCHESTRATOR: carries state across turns, capped per ROE
orchestrator = RedTeamingOrchestrator(
    objective_target=authorized_target,
    adversarial_chat=authorized_target,  # model that proposes the next turn
    objective_scorer=objective_scorer,
    prompt_converters=converters,
    max_turns=5,  # turn cap from your rules of engagement
)


async def run_campaign() -> None:
    # Objective described in plain language; the framework drives the turns.
    result = await orchestrator.run_attack_async(
        objective="Authorized lab objective described here, no payload text."
    )
    await result.print_conversation_async()  # prints the transcript and scores


asyncio.run(run_campaign())
Defend It
Defend at the conversation level, not just the message level. Per-message filters miss slow escalation by design. Track state across a session and score the trajectory: a conversation that drifts steadily toward a sensitive objective is a signal even when no single turn trips a filter.
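The difference between a per-message filter and a trajectory check fits in a few lines. The sketch below assumes some upstream classifier already assigns each turn a risk score between 0.0 and 1.0. The function names, the 0.8 per-message threshold, the window of four turns, and the 0.25 minimum rise are all illustrative assumptions.

```python
from typing import List, Sequence


def per_message_flags(turn_scores: Sequence[float], threshold: float = 0.8) -> List[bool]:
    """What a stateless filter sees: each turn judged on its own."""
    return [score >= threshold for score in turn_scores]


def trajectory_flag(
    turn_scores: Sequence[float],
    window: int = 4,
    min_rise: float = 0.25,
) -> bool:
    """Flag a session whose risk never dropped and rose by at least `min_rise`
    over the last `window` turns."""
    if len(turn_scores) < window:
        return False
    recent = list(turn_scores[-window:])
    never_dropped = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_dropped and (recent[-1] - recent[0]) >= min_rise


# A slow escalation: no turn reaches 0.8, but the drift is unmistakable.
slow_build = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# Ordinary back-and-forth: low scores with no direction.
ordinary = [0.2, 0.1, 0.2, 0.1, 0.2]

slow_build_message_hits = any(per_message_flags(slow_build))
slow_build_trajectory_hit = trajectory_flag(slow_build)
ordinary_trajectory_hit = trajectory_flag(ordinary)
```

A monotone-rise rule is the simplest trajectory signal, and it is easy to evade with a single deliberate step back. A production control would more likely use a smoothed trend or a model that reads the whole session, but the principle is the same: the unit of analysis is the conversation.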
Watch for campaign signatures. Automated red-team traffic has a shape: bursts of near-identical conversations, systematic variation of the same seed, abnormal turn counts, high request rates from one session or key. Anomaly detection on conversation volume, similarity, and pacing catches the orchestrator even when it catches no single payload.
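One of those signatures, systematic variation of the same seed, can be approximated by measuring how many conversation openers from one key are near-duplicates of each other. The sketch below uses word-set Jaccard similarity because it is simple to reason about. The 0.6 threshold and the function names are assumptions, and a real detector would likely compare embeddings.

```python
from itertools import combinations
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two messages, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def near_duplicate_ratio(openers: Sequence[str], threshold: float = 0.6) -> float:
    """Fraction of opener pairs that are at least `threshold` similar."""
    pairs = list(combinations(openers, 2))
    if not pairs:
        return 0.0
    similar = sum(1 for a, b in pairs if jaccard(a, b) >= threshold)
    return similar / len(pairs)


# Openers from one API key: the same benign seed, lightly varied.
campaign_like = [
    "please summarize the quarterly report for me",
    "please summarize the quarterly report for us",
    "kindly summarize the quarterly report for me",
]
# Openers from ordinary users: unrelated requests.
organic = [
    "what is the weather today",
    "reset my password",
    "book a table for two",
]

campaign_ratio = near_duplicate_ratio(campaign_like)
organic_ratio = near_duplicate_ratio(organic)
```

A high ratio from a single session or credential, combined with abnormal turn counts or request rates, is the shape of an orchestrator even when every individual message is unremarkable.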
Put conversation-level guardrails in front of the model and rate-limit per session and per credential. Cap turns, flag rapid retries of semantically similar requests, and log full conversations so your own blue team can replay an attack the way the red team scored it. The defender who only keeps single-message logs cannot reconstruct a multi-turn attack, and the multi-turn attack is the one that worked.
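A per-session rate limit and turn cap can be sketched as a sliding-window counter keyed by session ID. The same class works per credential if you key it by API key instead. `SessionLimiter` and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class SessionLimiter:
    """Sliding-window rate limit plus a hard turn cap, tracked per key."""

    def __init__(self, max_requests: int, window_seconds: float, max_turns: int) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.max_turns = max_turns
        self._recent = defaultdict(deque)  # key -> timestamps inside the window
        self._turns = defaultdict(int)     # key -> turns accepted so far

    def allow(self, key: str, now: float) -> bool:
        """Return True and record the request if it is within both limits."""
        recent = self._recent[key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if self._turns[key] >= self.max_turns:
            return False  # turn cap reached for this session
        if len(recent) >= self.max_requests:
            return False  # too many requests inside the window
        recent.append(now)
        self._turns[key] += 1
        return True


# Two requests per 10 seconds, three turns per session in total.
limiter = SessionLimiter(max_requests=2, window_seconds=10, max_turns=3)
decisions = [
    limiter.allow("session-1", now=0),   # accepted
    limiter.allow("session-1", now=1),   # accepted
    limiter.allow("session-1", now=2),   # rejected: rate limit
    limiter.allow("session-1", now=11),  # accepted: window has cleared
    limiter.allow("session-1", now=30),  # rejected: turn cap
]
other_session = limiter.allow("session-2", now=30)
```

Rejected requests are not counted against the turn cap in this sketch. Whether they should be is a policy choice, since counting them punishes retry storms harder but also penalizes legitimate users who hit the rate limit once.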