Empty the Wallet: Unbounded Consumption

The Play

Unbounded Consumption is the impact play. The earlier plays got you in or got you data; this one asks a different question: what does it cost the owner when the model answers too freely? An LLM pipeline turns input into compute, and compute into a bill. If nobody caps the input size, the output length, the request rate, or the per-user spend, then the cost of operating the system is set by whoever talks to it most aggressively. OWASP calls the financial version Denial of Wallet: you do not need to take the service down, you only need to make running it hurt. The methodology here is measurement, not flooding. You establish a baseline cost per request, then walk three dimensions the spec names: expensive queries (inputs that demand heavy generation), context flooding (repeatedly filling the context window), and high-volume extraction (many cheap requests that add up). At each step you record what the meter does and, just as important, which control stops you. The deliverable is not a downed service. It is a cost curve and a list of the caps that should have bent it back toward flat.

Before the Snap

Authorization is the whole game on this play, because impact testing touches money and availability. Get a signed scope that names the exact endpoint, a hard ceiling on requests and total spend for the test, a billing alert the owner controls, and a named contact who can kill the run. Test against an owned or sanctioned rate-limited endpoint or a local Steve Chat Playground instance, never a shared production tenant you do not own. Agree on the abort conditions in writing: a spend threshold, a latency threshold for other users, and a single word that stops everything. Stage your metering so every request is logged with timestamp, token counts, and latency. You are characterizing a cost surface, not trying to win a war of attrition. Keep volumes low, keep the window short, and stop the moment you have the curve.
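The metering and abort conditions described above can be sketched as a small guard that every request passes through. This is a minimal illustration, not a prescribed tool: the class names, field names, and ceiling values are all assumptions, and the latency check uses the test's own request latency as a stand-in for the "latency for other users" threshold, which in practice the owner would measure separately.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Ceilings:
    """Abort conditions copied from the signed scope (values are placeholders)."""
    max_requests: int = 200
    max_spend_usd: float = 5.00
    max_latency_s: float = 10.0


@dataclass
class MeteredRun:
    """Logs every request and reports when an agreed ceiling is reached."""
    ceilings: Ceilings
    records: list = field(default_factory=list)
    spend_usd: float = 0.0

    def record(self, tokens_in, tokens_out, latency_s, cost_usd, now=None):
        """Log one request; return the reason to abort, or None to continue."""
        self.spend_usd += cost_usd
        self.records.append({
            "ts": now if now is not None else time.time(),
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            "latency_s": latency_s,
            "cost_usd": cost_usd,
        })
        return self.abort_reason(latency_s)

    def abort_reason(self, last_latency_s=0.0):
        """Check the ceilings in a fixed order: requests, spend, latency."""
        c = self.ceilings
        if len(self.records) >= c.max_requests:
            return "request ceiling reached"
        if self.spend_usd >= c.max_spend_usd:
            return "spend ceiling reached"
        if last_latency_s >= c.max_latency_s:
            return "latency threshold reached"
        return None

    def dump(self):
        """Serialize the log as JSON lines for reconciliation with the provider console."""
        return "\n".join(json.dumps(r) for r in self.records)
```

The caller is expected to stop sending as soon as `record` returns anything other than `None`; the guard only reports, it does not send or cancel requests itself.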

Run It

Map the cost surface. Document the pipeline for the authorized endpoint: model, who pays per token or per call, what limits the provider enforces by default, and where the owner's own quotas, rate limits, and output caps sit. Write down the per-request cost model before you send anything.
Establish a clean baseline. Send a small, fixed set of ordinary requests at low rate and record tokens in, tokens out, latency, and dollar-equivalent cost per request. This is the flat line every later measurement is compared against.
Probe expensive queries. Using benign inputs that simply ask for long, detailed output, measure how output length and cost scale. Confirm whether an output cap or a max-tokens limit bends the curve back, or whether generation runs until the model decides to stop.
Probe context flooding. Send progressively larger benign inputs toward and past the context window, within your scope ceiling, and record how input size drives cost and latency. Confirm whether an input-size validation rejects oversized requests before they bill.
Probe high-volume extraction. Increase request rate in small, logged steps toward your agreed ceiling and watch for the rate limit and per-user quota to engage. Record the exact request number where throttling starts, or note that it never does.
Watch the autoscaler and the budget alert. Note whether scaling absorbs load silently (cost climbs, nothing breaks) and whether the owner's billing alert fired. Silent absorption with no alert is the finding, not a pass.
Stop at the ceiling and reconcile. The instant you hit any agreed spend, latency, or volume threshold, stop and confirm the meter against the provider's own usage console. Never push past the scope to see how bad it gets.
Build the cost curve. Plot cost per request class against the baseline, mark where each control engaged or failed to, and translate the loosest gap into a concrete dollar-per-hour exposure for the owner.
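The per-request cost model, the baseline comparison, and the dollar-per-hour translation in the steps above come down to a few lines of arithmetic. The sketch below assumes a simple per-token pricing model; the prices, class names, and function names are hypothetical and should be replaced with the provider's actual rates and your own request classes.

```python
from statistics import mean

# Hypothetical prices in USD per 1,000 tokens; substitute the provider's real rates.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002


def request_cost(tokens_in, tokens_out):
    """Dollar-equivalent cost of one request under the assumed per-token prices."""
    return tokens_in / 1000 * PRICE_IN_PER_1K + tokens_out / 1000 * PRICE_OUT_PER_1K


def cost_curve(samples, baseline_class="baseline"):
    """samples maps a request class to a list of (tokens_in, tokens_out) pairs.

    Returns {class_name: (mean_cost_usd, multiple_of_baseline)}.
    """
    means = {
        name: mean(request_cost(t_in, t_out) for t_in, t_out in rows)
        for name, rows in samples.items()
    }
    base = means[baseline_class]
    return {name: (cost, cost / base) for name, cost in means.items()}


def exposure_per_hour(cost_per_request, sustained_requests_per_minute):
    """Dollar-per-hour exposure if the given rate is never throttled."""
    return cost_per_request * sustained_requests_per_minute * 60
```

For example, calling `cost_curve({"baseline": [(100, 50)], "expensive_output": [(100, 2000)]})` gives each class's mean cost and its multiple of the baseline, and `exposure_per_hour` turns the loosest class into the figure the owner sees on the bill.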

What You Learn

You learn to see an LLM pipeline as a metered cost surface rather than a feature, and to separate three failure modes that look similar but have different fixes: cost driven by output length (fix with output caps), cost driven by input size (fix with input validation and context limits), and cost driven by raw volume (fix with rate limits and per-user budgets). You learn that an autoscaler is an attacker's best friend when it has no budget telling it when to stop saying yes, and that a system can stay perfectly available while quietly costing ten times its baseline. Most of all you learn to express impact in the language the owner actually cares about: dollars per hour of exposure, tied to the specific cap that was missing.
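One rough way to tell the three failure modes apart is to compare each dimension against its baseline and see which one grew the most. The helper below is a heuristic sketch under that assumption; the dictionary keys and the function name are hypothetical, and the fix labels are taken from the paragraph above.

```python
# Fix for each failure mode, as named in the text.
FIXES = {
    "output_length": "output caps",
    "input_size": "input validation and context limits",
    "volume": "rate limits and per-user budgets",
}


def dominant_driver(baseline, observed):
    """Return (driver, growth_multiple, fix) for the dimension that grew most.

    Both arguments are dicts with mean 'tokens_in', mean 'tokens_out',
    and 'requests_per_min' for the measurement window.
    """
    ratios = {
        "output_length": observed["tokens_out"] / baseline["tokens_out"],
        "input_size": observed["tokens_in"] / baseline["tokens_in"],
        "volume": observed["requests_per_min"] / baseline["requests_per_min"],
    }
    driver = max(ratios, key=ratios.get)
    return driver, ratios[driver], FIXES[driver]
```

A largest-ratio rule ignores price differences between input and output tokens, so treat the result as a pointer to where to look first rather than a full attribution.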

Drive It with Claude Code

I have a signed scope to test cost-exhaustion resilience on my own rate-limited LLM endpoint at the URL in scope.txt, with a hard ceiling of 200 total requests and a 5 dollar spend cap. Run a metered baseline, then a controlled walk across output-length, input-size, and request-rate dimensions, logging tokens and latency per request, and stop the moment any ceiling is hit. Produce a cost curve per request class and tell me which control (output cap, input validation, rate limit, or per-user budget) is missing or set too loose.

# promptfooconfig.yaml, metered load against an OWNED, scoped endpoint.
# Caps total runs and concurrency so the test characterizes the cost
# curve without becoming a flood. No payloads: prompts are benign.
description: "AHP-10 cost-surface metering (authorized endpoint only)"
 
providers:
  - id: https
    config:
      url: "https://your-owned-endpoint.example/v1/chat"
      method: POST
      headers:
        Authorization: "Bearer {{env.SCOPED_TEST_KEY}}"
      body:
        model: "your-model"
        max_tokens: 512        # client-requested limit; the control under test is the server-side cap
        messages:
          - role: user
            content: "{{prompt}}"
 
prompts:
  - "Summarize this short paragraph in one sentence."   # baseline class
  - "Explain this topic in thorough detail."            # expensive-output class
 
# Hard ceilings keep the run bounded and abortable.
evaluateOptions:
  maxConcurrency: 2
 
# Run with an explicit repeat ceiling, then read tokens/latency from output:
#   promptfoo eval -c promptfooconfig.yaml --repeat 10 --max-concurrency 2 -o results.json
# Stop immediately if spend, latency, or request ceilings in scope.txt are reached.

Defend It

Put a meter on the model, not just the gateway. Enforce per-user and per-key quotas and a per-user budget that hard-stops when a spend ceiling is crossed, so one identity cannot run up the bill alone. Cap output with a server-side max-tokens limit that the client cannot raise, and validate input size to reject oversized requests before they reach the model. Rate-limit at the identity level, not just the IP, and make the limit visible in responses so legitimate clients back off. Bound the autoscaler with a cost ceiling, not only a CPU ceiling, and wire billing alerts to a human who can act. Log tokens in, tokens out, and cost per request, then alert on anomalies in that stream. The Steve Chat Playground demonstrates the smallest version of two of these (a local rate-limit filter and a 256-character input cap), which is a clean place to feel how a single cap changes the cost curve before you tune the real thing.
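Three of these controls (input-size validation, a server-side output cap the client cannot raise, and a per-user budget that hard-stops) can be sketched as a single admission gate in front of the model. Everything named here is an assumption for illustration: the `Gate` class, the exception types, and the limit values, with the 256-character input cap borrowed from the Playground figure mentioned above.

```python
from collections import defaultdict

SERVER_MAX_TOKENS = 512   # assumed server-side output cap
MAX_INPUT_CHARS = 256     # input cap, borrowed from the Playground example
USER_BUDGET_USD = 1.00    # assumed per-user spend ceiling


class InputTooLarge(Exception):
    """Raised before the request reaches the model, so nothing is billed."""


class BudgetExceeded(Exception):
    """Raised once an identity has spent its ceiling."""


class Gate:
    """Admission check applied per identity before each model call."""

    def __init__(self):
        self.spent = defaultdict(float)  # user_id -> dollars spent so far

    def admit(self, user_id, prompt, requested_max_tokens):
        """Validate the request and return the max_tokens value to enforce."""
        if len(prompt) > MAX_INPUT_CHARS:
            raise InputTooLarge(f"prompt exceeds {MAX_INPUT_CHARS} characters")
        if self.spent[user_id] >= USER_BUDGET_USD:
            raise BudgetExceeded(f"{user_id} reached the spend ceiling")
        # The client may ask for less than the cap, never more.
        return min(requested_max_tokens, SERVER_MAX_TOKENS)

    def charge(self, user_id, cost_usd):
        """Record the metered cost of a completed request."""
        self.spent[user_id] += cost_usd
```

Keying the budget on `user_id` rather than on IP address reflects the identity-level limiting recommended above; a rate limit would sit alongside this gate and is left out to keep the sketch short.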