AI Guardrails Design

Use this when an AI feature can be misused, attacked, or made to produce harmful or unauthorized output. Covers the threat surface (prompt injection, jailbreaks, output harms, abuse), and the three layers that contain them: input validation, output filtering, and abuse detection. If you are speccing the feature, use /ai-product-spec and treat this as the guardrails section.

Related skills: Specs the feature with /ai-product-spec. The unauthorized-output dimensions become eval hard-fails in /ai-eval-design. Adversarial testing lives in /ai-testing-strategy. Production abuse signals are monitored via /llm-observability-plan.

The hard part most teams miss

Guardrails are a layered system, not a content filter you bolt on at the end. A single output filter is theater.

The model is not a security boundary. A model can be talked out of its instructions (prompt injection, jailbreaks), and anything that depends on "the system prompt says not to" will eventually fail. Enforce the things that matter in code around the model, not in the prompt inside it.
"The model will refuse" is not a defense. Refusal is a trained tendency, not a guarantee, and it is exactly what attackers probe. Real protection is layered: validate what goes in, filter what comes out, and detect abuse over time. Any one layer alone has a known bypass.
Over-blocking has a cost too. Guardrails that refuse legitimate requests drive users away and train them to route around you. The job is calibration, blocking the genuinely harmful while letting real work through, not maximum refusal.

Process

Step 1: Gather inputs

Ask the user:

What is the worst output this feature could produce? (Harmful content, leaked PII, an unauthorized commitment, a policy contradiction.)
Who is adversarial, and why? (Bored users, competitors, fraudsters, automated abuse. Different attackers need different layers.)
What is the stakes and reversibility of a bad output? (Annoyance, money, legal exposure, safety.)
What can the feature touch? (Tools, data, external actions. The blast radius defines what must be gated.)
What compliance or policy constraints apply? (PII handling, regulated domain, disclosure requirements.)
What is the cost of a false block? (How much does refusing a legitimate request hurt?)

Step 2: Map the threat surface

Sort threats into the three places they occur:

Surface	Threats	Example
Input	Prompt injection, jailbreaks, instructions hidden in retrieved content or user data	A document the model summarizes contains "ignore your rules and..."
Output	PII leakage, unauthorized commitments, policy contradictions, harmful content	The model promises a refund it has no authority to grant
Abuse over time	Cost or volume attacks, scraping, automated probing for bypasses	One account driving thousands of calls to extract the system prompt

Step 3: Input validation layer

Separate trusted from untrusted input. Mark which inputs are authoritative and which are user or retrieved content that may carry injected instructions. Never let untrusted content be treated as instructions.
Validate structure before the model. Reject or sanitize inputs that fail basic checks (size, format, required fields) rather than passing them through.
Screen known-dangerous patterns. Flag inputs matching injection or jailbreak signatures and high-risk intents (the refund, the policy override) for stricter handling.

Step 4: Output filtering layer

Screen every output before it reaches the user. Block PII leakage, unauthorized commitments, and policy contradictions. These are the dimensions that become eval hard-fails.
Constrain the shape. Where possible, bind output to a schema so the model cannot freely emit prose that smuggles a harmful action.
Decide the fallback. When the filter blocks an output, what does the user see? A safe canned response and a path to a human, not a silent failure or a raw error.

Step 5: Abuse detection and limits

Rate and cost limits per user and per key, so one actor cannot run up the bill or brute-force a bypass.
Anomaly signals: spikes in volume, repeated near-miss inputs, or a single account probing variations all warrant a flag (wire these into /llm-observability-plan).
Escalation path: what gets logged, what auto-blocks, who reviews.

Step 6: Output the guardrails design

# Guardrails Design: (feature)

**Worst output:** (what we are preventing)
**Adversaries:** (who, and their goal)
**Blast radius:** (what the feature can touch)

## Threat surface
(Input / Output / Abuse table from Step 2)

## Input validation
- Trusted vs untrusted handling: (rule)
- Pre-model checks: (list)
- High-risk pattern screening: (list)

## Output filtering
- Blocked categories: (PII, commitments, policy, harm)
- Output shape constraint: (schema / none)
- Block fallback UX: (what the user sees)

## Abuse detection
- Limits: (rate, cost, per-user)
- Anomaly signals + escalation: (list)

## Calibration
- Cost of a false block, and how it is tuned

## Open questions
- (unresolved decisions)

Step 7: Review

Ask the user:

Does any defense rely only on the model choosing to refuse?
Can untrusted input reach the model as if it were instructions?
What does a blocked user see, and can they still get legitimate help?
Are false blocks measured, or only true blocks?

Anti-patterns

Anti-pattern	Why it fails	Do instead
Single output filter	One layer with a known bypass; nothing screens input or abuse	Layer input, output, and abuse defenses
The model as the boundary	Prompt instructions get overridden by injection	Enforce hard rules in code around the model
Untrusted content as instructions	Injected text in docs or user input hijacks the model	Separate and never trust untrusted content
Refusal as the plan	Trained refusal is probed and bypassed	Add deterministic input and output checks
Max blocking	Refuses legitimate work; users route around you	Calibrate against the cost of a false block
Silent block	A blocked output confuses the user, no path forward	Safe fallback plus a route to a human

Output location

Present the guardrails design as formatted text in the conversation for the user to copy into their design doc.