
Output Filtering and Guardrails

Defense layer two: catching dangerous outputs before they reach users or downstream systems. Content classifiers, output schemas, and the tradeoff between safety and utility.

Tags: guardrails, output-filtering, content-classification, llm-guard

Why Output Filtering Matters

Input validation stops attacks before the model sees them. Output filtering catches what got through.

Either the model processed something it should not have, or it is functioning normally but its output contains sensitive data, harmful content, or instruction fragments that will cause problems downstream. In both cases, output filtering is the last line of defense before the response reaches a user, a database, another model, or an action executor.

This layer matters most in two scenarios:

Indirect injection succeeded: Attacker-controlled content in a retrieved document manipulated the model's output. The output now contains exfiltrated data, harmful instructions, or unexpected content.

Legitimate output with unintended consequences: The model is working correctly but its output will be embedded in a SQL query, rendered as HTML, or passed to a shell. Output that is safe to display to a human is dangerous in those contexts.

Rule-Based Filtering

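Fast, predictable, and easy to audit. Rules catch only the patterns you have written down, and broad patterns do produce false positives (the crude API-key pattern below matches any long alphanumeric string, including hashes), so tune them against real output. Start here.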

import re
from typing import Optional

# PII detection patterns
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b",
    # Note: "|" inside a character class is a literal pipe, so the TLD class is [A-Za-z]
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"\b(?:\+1\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b",
    "api_key": r"\b[A-Za-z0-9]{32,}\b",  # crude: catches many key formats, but also hashes and long IDs
}

# Injection attempt in output (model was hijacked and is now outputting instructions)
INJECTION_IN_OUTPUT_PATTERNS = [
    r"\[SYSTEM\s*(?:OVERRIDE|UPDATE|INSTRUCTION)\]",
    r"IGNORE\s+(?:ALL\s+)?PREVIOUS\s+INSTRUCTIONS",
    r"<\s*script\b",  # XSS in markdown-rendered output; also matches <script src=...>
]

def filter_output(text: str, redact_pii: bool = True) -> dict:
    result = {
        "filtered_text": text,
        "blocked": False,
        "redacted": [],
        "flags": [],
    }

    # Check for injection artifacts in output
    for pattern in INJECTION_IN_OUTPUT_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            result["blocked"] = True
            result["flags"].append("injection_artifact_in_output")
            # Do not hand the offending text back to the caller
            result["filtered_text"] = ""
            return result

    # Redact PII
    if redact_pii:
        filtered = text
        for pii_type, pattern in PII_PATTERNS.items():
            matches = re.findall(pattern, filtered)
            if matches:
                filtered = re.sub(pattern, f"[REDACTED_{pii_type.upper()}]", filtered)
                result["redacted"].append(f"{pii_type}:{len(matches)}")
        result["filtered_text"] = filtered

    return result
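
One way to reduce false positives from the credit card pattern is to confirm each candidate with the Luhn checksum before redacting it. The sketch below is an illustration under that assumption, not part of the filter above; `luhn_valid` and `redact_cards` are hypothetical helper names.

```python
import re

# Same card-number shape as the credit_card pattern: Visa, Mastercard, Amex prefixes.
CARD_RE = re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Redact only card-shaped numbers that also pass the Luhn check."""
    def _replace(match: re.Match) -> str:
        number = match.group()
        return "[REDACTED_CREDIT_CARD]" if luhn_valid(number) else number

    return CARD_RE.sub(_replace, text)
```

A 16-digit order ID that merely looks like a card number is left alone, while a number that passes the checksum is redacted. The same idea (pattern match first, cheap validation second) applies to other PII types that carry a check digit.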

Structured Output Enforcement

The most reliable output control is constraining the model to produce only structured output that matches a defined schema. If the output must be valid JSON matching a specific schema, anything else is rejected at parse time, which eliminates most free-text injection artifacts.

from pydantic import BaseModel
from anthropic import Anthropic

client = Anthropic()

class ProductSearchResult(BaseModel):
    products: list[str]
    total_count: int
    search_successful: bool

def safe_product_search(query: str) -> ProductSearchResult:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="""You are a product search assistant. 
        Respond ONLY with valid JSON matching this schema:
        {
          "products": ["product name 1", "product name 2"],
          "total_count": <integer>,
          "search_successful": <boolean>
        }
        Do not include any other text, explanation, or commentary.""",
        messages=[{"role": "user", "content": f"Search for: {query}"}]
    )

    raw = response.content[0].text.strip()

    # Parse and validate against schema
    import json
    data = json.loads(raw)
    return ProductSearchResult(**data)

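If the model has been injected and tries to output something other than valid JSON, the parse fails, and JSON that does not match the schema fails validation. The application gets an exception rather than unexpected behavior, which is safer than trying to detect all possible injection outputs. Schema validation constrains shape, not content: a string field such as a product name can still carry injected text, so string values still need the filters that apply to their downstream context.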
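
The caller still has to decide what happens when parsing fails. A minimal fail-closed sketch, written with only the standard library so it runs without pydantic, is shown here; `parse_search_result` and `EXPECTED_KEYS` are hypothetical names, and the checks mirror the three fields of the schema above.

```python
import json
from typing import Optional

EXPECTED_KEYS = {"products", "total_count", "search_successful"}


def parse_search_result(raw: str) -> Optional[dict]:
    """Return the validated payload, or None if the output must be rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all: fail closed

    # Reject wrong top-level types and any missing or extra keys.
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None

    products = data["products"]
    if not isinstance(products, list) or not all(isinstance(p, str) for p in products):
        return None

    # bool is a subclass of int in Python, so exclude it explicitly.
    count = data["total_count"]
    if not isinstance(count, int) or isinstance(count, bool):
        return None

    if not isinstance(data["search_successful"], bool):
        return None

    return data
```

Returning `None` (or raising) forces the caller to handle the rejection explicitly, for example by showing a generic error, rather than passing partially trusted text downstream.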

Semantic Output Classification

For unstructured outputs, use classifiers to evaluate what the model produced.

LLM-as-judge: A second model evaluates the first model's output. Slower and more expensive, but catches semantic-level problems that rules miss.

async def judge_output(
    original_request: str,
    model_response: str,
    policy: str,
) -> dict:
    judge_prompt = f"""You are a security policy evaluator.

Policy: {policy}

Original user request:
{original_request}

AI response to evaluate:
{model_response}

Evaluate whether the AI response:
1. Violates the stated policy
2. Contains sensitive information that should not be disclosed
3. Contains instructions that could cause harm if followed
4. Appears to have been manipulated (e.g., contains unexpected system-level text)

Respond with JSON:
{{
  "policy_violation": <boolean>,
  "sensitive_data_present": <boolean>,
  "appears_manipulated": <boolean>,
  "reasoning": "<one sentence>",
  "severity": "none|low|medium|high|critical"
}}"""

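    import asyncio
    import json

    # client.messages.create is a blocking call; run it in a worker thread so it
    # does not stall the event loop. (anthropic.AsyncAnthropic is an alternative.)
    response = await asyncio.to_thread(
        client.messages.create,
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": judge_prompt}],
    )

    return json.loads(response.content[0].text)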

Limitations: the judge model can also be manipulated. An adversarial output that is persuasive enough may fool the judge. Use this as one layer, not the only layer.
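
Because the judge's reply is itself untrusted model output, it helps to interpret it defensively. The sketch below assumes the JSON shape requested in the judge prompt above and treats anything unparseable or out of range as a blocking verdict; `interpret_verdict` and the `block_at` threshold are hypothetical.

```python
import json

SEVERITIES = ("none", "low", "medium", "high", "critical")
BOOL_FIELDS = ("policy_violation", "sensitive_data_present", "appears_manipulated")

# Verdict used whenever the judge's reply cannot be trusted.
FAIL_CLOSED = {"block": True, "severity": "critical", "reason": "unparseable judge output"}


def interpret_verdict(raw: str, block_at: str = "medium") -> dict:
    """Turn the judge's raw reply into a block/allow decision, failing closed."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FAIL_CLOSED)

    if not isinstance(verdict, dict):
        return dict(FAIL_CLOSED)

    severity = verdict.get("severity")
    if severity not in SEVERITIES:
        return dict(FAIL_CLOSED)

    # Every boolean field must be present and actually a boolean.
    if not all(isinstance(verdict.get(field), bool) for field in BOOL_FIELDS):
        return dict(FAIL_CLOSED)

    # Block when the reported severity reaches the configured threshold.
    block = SEVERITIES.index(severity) >= SEVERITIES.index(block_at)
    return {"block": block, "severity": severity, "reason": str(verdict.get("reasoning", ""))}
```

Failing closed means a hijacked or confused judge that replies with prose instead of JSON results in a block, not a silent pass.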

Specialized classifiers: Pre-trained classifiers for specific content categories.

  • Llama Guard: Meta's safety classifier, available locally. Fast, no API cost, reasonable accuracy on standard harm categories.
  • Presidio: Microsoft's PII detection library. Better PII detection than regex for production use.
  • Perspective API: Google's toxicity classifier. Useful for community-facing applications.

# Example using Llama Guard via Ollama
import httpx

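async def llama_guard_check(prompt: str, response: str) -> dict:
    payload = {
        "model": "llama-guard3",
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
        # Ollama streams newline-delimited JSON by default; ask for one JSON object
        "stream": False,
    }
    # Use a context manager so the connection pool is closed after the call
    async with httpx.AsyncClient(timeout=30.0) as http:
        result = await http.post(
            "http://localhost:11434/api/chat",
            json=payload,
        )
    result.raise_for_status()
    output = result.json()["message"]["content"]
    # The verdict is expected on the first line of the reply
    safe = output.strip().lower().startswith("safe")
    return {"safe": safe, "raw": output}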
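
The classifier's reply can be parsed more strictly than a prefix check. The sketch below assumes the reply format commonly documented for Llama Guard 3: a first line of `safe` or `unsafe`, optionally followed by a line of comma-separated category codes such as `S1,S10`. That format is an assumption to verify against the model version in use, and `parse_llama_guard` is a hypothetical helper.

```python
def parse_llama_guard(output: str) -> dict:
    """Parse a Llama Guard style reply into a verdict and category codes.

    Anything other than an exact "safe" first line is treated as unsafe.
    """
    lines = [line.strip() for line in output.strip().splitlines() if line.strip()]
    if not lines:
        # Empty reply: fail closed.
        return {"safe": False, "categories": [], "raw": output}

    verdict = lines[0].lower()
    if verdict == "safe":
        return {"safe": True, "categories": [], "raw": output}

    categories = []
    if verdict == "unsafe" and len(lines) > 1:
        # Assumed format: second line holds comma-separated category codes.
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return {"safe": False, "categories": categories, "raw": output}
```

An exact match avoids accepting replies that merely begin with the letters "safe", and the category codes make it possible to apply different handling per harm category.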

Context-Specific Output Validation

The downstream context determines what "dangerous output" means.

HTML rendering context: Output rendered as HTML must not contain script tags, event handlers, or external resource references. Even markdown rendering can be an XSS vector if the renderer evaluates HTML.

import bleach

def sanitize_for_html_render(text: str) -> str:
    # Allow basic formatting, block everything else
    allowed_tags = ["p", "br", "strong", "em", "code", "pre", "ul", "ol", "li", "h1", "h2", "h3"]
    allowed_attrs = {}
    return bleach.clean(text, tags=allowed_tags, attributes=allowed_attrs, strip=True)
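
When the output is displayed as plain text and needs no formatting at all, escaping everything is simpler than maintaining an allowlist. A minimal sketch using the standard library's `html.escape`; the function name `escape_for_text_node` is hypothetical.

```python
import html


def escape_for_text_node(text: str) -> str:
    """Escape &, <, >, and both quote characters so the text renders literally."""
    return html.escape(text, quote=True)
```

With this approach any markup the model emits is shown to the user as literal characters instead of being interpreted by the browser.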

SQL context: Output that will be interpolated into SQL must be treated as untrusted user input, not trusted model output.

# WRONG: trust model output in SQL
query = f"SELECT * FROM products WHERE name = '{model_output}'"

# RIGHT: parameterize regardless of source (placeholder syntax depends on the
# driver: %s for psycopg2 and MySQL drivers, ? for sqlite3)
cursor.execute("SELECT * FROM products WHERE name = %s", (model_output,))
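
The difference is easy to demonstrate with an in-memory SQLite database. This is a self-contained sketch: the table contents, the hostile string, and the function names are illustrative assumptions, and `sqlite3` uses `?` placeholders rather than `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)", [("widget",), ("gadget",)])


def find_products_unsafe(name: str) -> list:
    # Interpolation lets the value rewrite the query itself.
    query = f"SELECT name FROM products WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query).fetchall()]


def find_products(name: str) -> list:
    # The driver binds the value as data, so it can never change the query.
    rows = conn.execute("SELECT name FROM products WHERE name = ?", (name,)).fetchall()
    return [row[0] for row in rows]


# A model output that happens to contain SQL syntax.
hostile = "x' OR '1'='1"
```

With the hostile value, the interpolated query returns every row in the table, while the parameterized query returns nothing because no product has that literal name.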

Action executor context: Before executing any action based on model output, validate that the action is within expected parameters.

def validate_action_before_execution(action: dict, allowed_actions: set, allowed_params: dict) -> bool:
    action_type = action.get("type")
    if action_type not in allowed_actions:
        return False
    params = action.get("params", {})
    for param, value in params.items():
        constraints = allowed_params.get(action_type, {}).get(param)
        # Fail closed: a parameter with no registered constraint is rejected
        if constraints is None or not constraints(value):
            return False
    return True
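
The shape of the allowlists is not spelled out above. One plausible shape, sketched here as an assumption, is a set of action names plus a per-action map from parameter name to a predicate. The action names, the `example.com` domain, and `is_action_allowed` are all hypothetical; the compact validator rejects any parameter that has no registered predicate.

```python
from typing import Any, Callable, Dict, Set

ALLOWED_ACTIONS: Set[str] = {"send_email", "create_ticket"}

# Per-action parameter predicates: each returns True only for acceptable values.
ALLOWED_PARAMS: Dict[str, Dict[str, Callable[[Any], bool]]] = {
    "send_email": {
        "to": lambda v: isinstance(v, str) and v.endswith("@example.com"),
        "subject": lambda v: isinstance(v, str) and len(v) <= 200,
    },
    "create_ticket": {
        "priority": lambda v: isinstance(v, str) and v in {"low", "medium", "high"},
    },
}


def is_action_allowed(action: dict) -> bool:
    """Allow an action only if its type and every parameter pass the allowlists."""
    action_type = action.get("type")
    if not isinstance(action_type, str) or action_type not in ALLOWED_ACTIONS:
        return False

    params = action.get("params", {})
    if not isinstance(params, dict):
        return False

    rules = ALLOWED_PARAMS.get(action_type, {})
    for name, value in params.items():
        check = rules.get(name)
        # Unknown parameters are rejected, not ignored.
        if check is None or not check(value):
            return False
    return True
```

Predicates keep each constraint small and testable, and rejecting unknown parameters prevents an injected output from smuggling in a field such as `bcc` that nobody thought to constrain.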

The Latency-Safety Tradeoff

Every output filter adds latency. LLM-as-judge adds the most (another model call). Rule-based filtering adds almost none.

For latency-sensitive applications:

  1. Inline fast checks: Rule-based PII redaction and injection pattern detection run synchronously before returning to the user.

  2. Async audit trail: LLM-as-judge and more expensive classifiers run asynchronously, logging findings without blocking the response. Use these for production monitoring rather than hard blocking.

  3. Structured output enforcement: Negligible latency cost because it is just schema validation on output you already have.

  4. Stream-and-check: For streaming responses, run incremental checks on output chunks. Block stream continuation if a check fails mid-generation.
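
Stream-and-check has one subtlety: a blocked phrase can be split across two chunks. A minimal sketch of one way to handle this is to hold back a short tail of the text until the next chunk arrives. The patterns, the `holdback` size, and the names `checked_stream` and `StreamBlocked` are assumptions for illustration.

```python
import re
from typing import Iterable, Iterator

BLOCK_PATTERNS = [
    re.compile(r"IGNORE\s+(?:ALL\s+)?PREVIOUS\s+INSTRUCTIONS", re.IGNORECASE),
    re.compile(r"<\s*script\b", re.IGNORECASE),
]


class StreamBlocked(Exception):
    """Raised when a check fails mid-stream."""


def checked_stream(chunks: Iterable[str], holdback: int = 64) -> Iterator[str]:
    """Yield text from `chunks`, withholding the last `holdback` characters.

    Matches no longer than `holdback` characters that straddle a chunk
    boundary are still seen in full before any part of them is released.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if any(pattern.search(buffer) for pattern in BLOCK_PATTERNS):
            raise StreamBlocked("blocked pattern detected in stream")
        # Release everything except the tail that might begin a match.
        cut = len(buffer) - holdback
        if cut > 0:
            yield buffer[:cut]
            buffer = buffer[cut:]
    if buffer:
        # The tail was already checked on the final loop iteration.
        yield buffer
```

Text that has already been yielded cannot be recalled, so the holdback window trades a small display delay for the guarantee that short patterns are never partially shown. Matches longer than the window (for example, ones padded with whitespace) can still slip through, which is one reason to keep the asynchronous audit in place.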

Layer your controls by latency cost. Fast rules first, expensive classifiers asynchronously. Block only when a fast check triggers; audit everything else with the slower checks.