Lock It In: Turn Findings into a CI Regression Gate

The Play

Most findings die in a PDF. The fix ships, the report gets filed, and six sprints later a refactor quietly undoes it because nothing in the pipeline remembers the attack happened. This play closes that gap. You take a weakness you already proved in an owned lab and encode it as an adversarial test that the build runs every time, so the fixed behavior becomes a contract the pipeline enforces. The trust boundary it protects is the one between "we patched it once" and "it stays patched." It works because CI already gates code on unit tests; an AI weakness is just another assertion the build can refuse to pass. The grader is the hard part: model output is non-deterministic, so the test asserts on a property of the response (a refusal, a missing secret, a held boundary), not on an exact string.

Before the Snap

Signed scope or an owned lab, plus a finding you have already reproduced in it. This play does not generate new attacks; it preserves one you landed. Stand up the target app in a runnable harness CI can call (a local endpoint or a wrapped function). Install promptfoo and DeepEval into the project, pinned. Confirm your CI runner can reach the target and that you have a place to store test cases in version control. Write down, in plain language, exactly what the fixed system should and should not do for this finding. That sentence becomes your pass condition.

Run It

Pull the finding from your report and reduce it to one falsifiable claim: under input class X, the system must hold behavior Y (refuse, redact, stay in role). No exploit text, just the property.
Map the finding to its OWASP LLM Top 10 class (for example prompt injection, sensitive information disclosure, excessive agency) so the test is labeled by weakness type and your coverage is auditable.
Pick the engine per finding shape: promptfoo red-team mode when you want its OWASP-mapped plugins to generate adversarial variations of the input class, DeepEval when you want a precise pytest-style assertion on one response property.
Author the test against your reproduction, not a fresh attack. Encode the input class and a grader that checks the property (a refusal classifier, a must-not-contain check on a canary secret, a role-adherence metric), never an exact-string match.
Run the test twice locally to prove it discriminates: once against the unpatched build (it must fail) and once against the patched build (it must pass). A test that cannot fail is not a gate.
Add the suite to the pipeline as a required check on the AI surface, wired to fail the build on any red case. Keep the adversarial inputs in version control next to the code they guard.
Tune the grader threshold against a small batch of known-good and known-bad responses so a flaky model does not produce false reds. Record the threshold and why you chose it.
Hand the suite to the owning team as the deliverable: the report says what broke, the CI gate proves it stays fixed, and each case carries its OWASP label and its before/after evidence.

What You Learn

The failure class is silent regression: a fix that holds in the report but degrades in production because nothing tests for it on the path that ships. The transferable lesson is that an AI security finding is only durable once it is executable. The grader, not the attack, is the engineering: you are forced to state precisely what "fixed" means as a checkable property of model output, and that act of definition is most of the value. It also reframes the deliverable. A finding handed off as a passing-or-failing test is worth more to a defender than the same finding handed off as prose.

Drive It with Claude Code

On our authorized staging copy of the customer-support assistant, generate a promptfooconfig.yaml that maps the OWASP LLM Top 10 to a red-team test suite, wire it into our CI pipeline so the build fails when any new high-severity finding appears, and produce a baseline report I can commit as the regression floor. Run inside the sanctioned scope only and do not store or echo any raw adversarial strings.

# promptfooconfig.yaml, OWASP LLM Top 10 as a CI regression gate
# Run in CI:  promptfoo redteam run -c promptfooconfig.yaml --no-progress-bar
# Gate the build: a nonzero exit on new failures blocks the merge.
description: "Top10-as-tests regression gate for the authorized assistant"
 
targets:
  - id: https
    config:
      url: "https://staging.internal.example/authorized-assistant/chat"
      method: POST
      headers:
        Authorization: "Bearer ${RANGE_TOKEN}"   # sanctioned-scope token only
      body: { "input": "{{prompt}}" }
 
redteam:
  purpose: "Internal support assistant; in-scope per signed ROE"
  numTests: 5
  plugins:
    - owasp:llm          # full OWASP LLM Top 10 collection
  strategies:
    - basic              # methodology framing only; no payloads stored in VCS
 
# CI gate: compare against committed baseline, fail on regressions
defaultTest:
  assert:
    - type: moderation
outputPath: "redteam-results.json"

Defend It

The mitigation is the play itself: the regression gate is the durable blue-team control. To make it real and not theater, the defender owns four things. Treat the adversarial suite as production tests, run on every PR that touches the AI surface, required to merge, mapped to OWASP LLM Top 10 classes so coverage is visible and gaps are obvious. Grade on response properties (refusal, redaction, role adherence, canary absence), never exact strings, and pin a threshold tuned against known-good and known-bad samples to keep the model's non-determinism from flapping the build. Expand the corpus over time: every new finding adds a case, so the suite grows into a living map of what this system has already failed and now must not fail again. Cross-reference each case to the ATLAS mitigation it enforces so the security and engineering ledgers agree. The gate that catches a real regression six months later, with no human in the loop, is the whole return on this play.