Skip to main content
Engineering/ai-guardrails-design

AI Guardrails Design

Design layered defenses for an AI feature: input validation, output filtering, jailbreak and abuse detection, calibrated against false-block cost.

Use this when an AI feature can be misused, attacked, or made to produce harmful or unauthorized output. Covers the threat surface (prompt injection, jailbreaks, output harms, abuse), and the three layers that contain them: input validation, output filtering, and abuse detection. If you are speccing the feature, use /ai-product-spec and treat this as the guardrails section.

Related skills: Specs the feature with /ai-product-spec. The unauthorized-output dimensions become eval hard-fails in /ai-eval-design. Adversarial testing lives in /ai-testing-strategy. Production abuse signals are monitored via /llm-observability-plan.

The hard part most teams miss

Guardrails are a layered system, not a content filter you bolt on at the end. A single output filter is theater.

  1. The model is not a security boundary. A model can be talked out of its instructions (prompt injection, jailbreaks), and anything that depends on "the system prompt says not to" will eventually fail. Enforce the things that matter in code around the model, not in the prompt inside it.
  2. "The model will refuse" is not a defense. Refusal is a trained tendency, not a guarantee, and it is exactly what attackers probe. Real protection is layered: validate what goes in, filter what comes out, and detect abuse over time. Any one layer alone has a known bypass.
  3. Over-blocking has a cost too. Guardrails that refuse legitimate requests drive users away and train them to route around you. The job is calibration, blocking the genuinely harmful while letting real work through, not maximum refusal.

Process

Step 1: Gather inputs

Ask the user:

  1. What is the worst output this feature could produce? (Harmful content, leaked PII, an unauthorized commitment, a policy contradiction.)
  2. Who is adversarial, and why? (Bored users, competitors, fraudsters, automated abuse. Different attackers need different layers.)
  3. What is the stakes and reversibility of a bad output? (Annoyance, money, legal exposure, safety.)
  4. What can the feature touch? (Tools, data, external actions. The blast radius defines what must be gated.)
  5. What compliance or policy constraints apply? (PII handling, regulated domain, disclosure requirements.)
  6. What is the cost of a false block? (How much does refusing a legitimate request hurt?)

Step 2: Map the threat surface

Sort threats into the three places they occur:

SurfaceThreatsExample
InputPrompt injection, jailbreaks, instructions hidden in retrieved content or user dataA document the model summarizes contains "ignore your rules and..."
OutputPII leakage, unauthorized commitments, policy contradictions, harmful contentThe model promises a refund it has no authority to grant
Abuse over timeCost or volume attacks, scraping, automated probing for bypassesOne account driving thousands of calls to extract the system prompt

Step 3: Input validation layer

  • Separate trusted from untrusted input. Mark which inputs are authoritative and which are user or retrieved content that may carry injected instructions. Never let untrusted content be treated as instructions.
  • Validate structure before the model. Reject or sanitize inputs that fail basic checks (size, format, required fields) rather than passing them through.
  • Screen known-dangerous patterns. Flag inputs matching injection or jailbreak signatures and high-risk intents (the refund, the policy override) for stricter handling.

Step 4: Output filtering layer

  • Screen every output before it reaches the user. Block PII leakage, unauthorized commitments, and policy contradictions. These are the dimensions that become eval hard-fails.
  • Constrain the shape. Where possible, bind output to a schema so the model cannot freely emit prose that smuggles a harmful action.
  • Decide the fallback. When the filter blocks an output, what does the user see? A safe canned response and a path to a human, not a silent failure or a raw error.

Step 5: Abuse detection and limits

  • Rate and cost limits per user and per key, so one actor cannot run up the bill or brute-force a bypass.
  • Anomaly signals: spikes in volume, repeated near-miss inputs, or a single account probing variations all warrant a flag (wire these into /llm-observability-plan).
  • Escalation path: what gets logged, what auto-blocks, who reviews.

Step 6: Output the guardrails design

# Guardrails Design: (feature)

**Worst output:** (what we are preventing)
**Adversaries:** (who, and their goal)
**Blast radius:** (what the feature can touch)

## Threat surface
(Input / Output / Abuse table from Step 2)

## Input validation
- Trusted vs untrusted handling: (rule)
- Pre-model checks: (list)
- High-risk pattern screening: (list)

## Output filtering
- Blocked categories: (PII, commitments, policy, harm)
- Output shape constraint: (schema / none)
- Block fallback UX: (what the user sees)

## Abuse detection
- Limits: (rate, cost, per-user)
- Anomaly signals + escalation: (list)

## Calibration
- Cost of a false block, and how it is tuned

## Open questions
- (unresolved decisions)

Step 7: Review

Ask the user:

  • Does any defense rely only on the model choosing to refuse?
  • Can untrusted input reach the model as if it were instructions?
  • What does a blocked user see, and can they still get legitimate help?
  • Are false blocks measured, or only true blocks?

Anti-patterns

Anti-patternWhy it failsDo instead
Single output filterOne layer with a known bypass; nothing screens input or abuseLayer input, output, and abuse defenses
The model as the boundaryPrompt instructions get overridden by injectionEnforce hard rules in code around the model
Untrusted content as instructionsInjected text in docs or user input hijacks the modelSeparate and never trust untrusted content
Refusal as the planTrained refusal is probed and bypassedAdd deterministic input and output checks
Max blockingRefuses legitimate work; users route around youCalibrate against the cost of a false block
Silent blockA blocked output confuses the user, no path forwardSafe fallback plus a route to a human

Output location

Present the guardrails design as formatted text in the conversation for the user to copy into their design doc.