Use this when you're speccing a feature that uses LLMs, AI models, or generative capabilities. A traditional PRD misses the decisions that make or break AI features: which model, how to evaluate quality, what happens when the model fails, and how much it costs per interaction. This skill extends /prd-draft with the AI-specific sections teams forget until production.

Framework attribution: The data-consent and dignity dimensions draw on the FRIES framework (Freely given, Reversible, Informed, Enthusiastic, Specific) from Building Consentful Tech (Una Lee & Dann Toliver, CC BY). See knowledge/product-ethics-frameworks.md.

Related skills: Extends /prd-draft for traditional PRD structure. Uses /ai-eval-design for deeper eval planning. See /ai-prototype-guide for building the prototype from this spec. See /multi-model-strategy for multi-model routing decisions. For an executable spec to hand a coding agent, use /spec-driven-feature.

The hard part most teams miss

A normal PRD describes the happy path and assumes the technology works. An AI spec is the inverse: the parts a generalist treats as edge cases are the actual product.

"Why AI" is a gate, not a formality. Most specs assume the model and design around it. The spec's real job is to prove a rules engine, a lookup table, or a search index would not do the job. If it would, you are buying nondeterminism, latency, and per-call cost for nothing. Answer this first (Step 1, question 3) or do not write the spec.
The failure mode is the product, not an exception. A deterministic feature fails by crashing, and you handle it once. An AI feature is wrong while looking confident, every day, at some rate. What the model does when it does not know, when it is unavailable, when the user is adversarial, is not an edge case you bolt on at the end. It is the spec (Step 4 guardrails). Design it before the happy path, not after.
Cost per interaction reshapes the design; it is not a footnote. A feature that is excellent on a $0.30-per-call budget and impossible on a $0.03 budget is two different products. The unit cost decides model tier, context strategy, and whether the feature can exist at your volume at all (Step 5). Estimate it early, because it changes what you are allowed to build, not just what it costs.

Everything below is the structure. These three are why the structure exists.

Process

Step 1: Gather context

Ask the user:

What AI-powered feature are you building? (Name and one-line description)
What user problem does this solve? (Evidence: user quotes, support tickets, research)
What does the AI do that couldn't be done without it? (The "why AI" test -- if a rules engine or lookup table works, you don't need AI)
Who are the users? (Personas and their expectations for AI accuracy, speed, and tone)
How does this work today? (Current workflow, tools in use, pain points with the status quo)
Known constraints -- budget per interaction, latency requirements, data privacy restrictions, compliance needs, expected volume (daily/monthly interactions)
Prior art -- how competitors or existing tools handle this (include screenshots if available)

If the user provides a brief, extract what you can and ask follow-ups for gaps. Flag anything unclear with [NEEDS INPUT].

Step 2: Define model requirements

Work through these decisions with the user:

Decision	Options	Notes
Primary capabilities	Text generation, classification, extraction, summarization, code generation, multimodal, embedding (often multiple)	What the model actually needs to do -- most features combine capabilities
Quality bar	Must be correct (medical, legal), should be helpful (productivity), can be approximate (creative)	Determines eval rigor
Latency target	Real-time (< 2s), interactive (< 10s), batch (minutes OK)	Affects model choice and architecture
Streaming needed?	Yes (progressive results), no (complete response)	Long-running tasks need streaming for UX; affects API choice
Context window need	Small (< 4K tokens), medium (< 32K), large (< 128K), very large (> 128K)	Affects model and architecture
Cost sensitivity	High (consumer, high volume), medium (B2B, moderate volume), low (enterprise, low volume)	Affects model tier and caching strategy
Data sensitivity	Public data only, private but not regulated, regulated or audited (HIPAA, GDPR, SOC 2)	Affects deployment and vendor choice

After completing the table, recommend specific model families based on the answers. Use the knowledge references to map capability + quality bar + cost sensitivity to model tiers. Flag the recommendation as [RECOMMENDED -- verify during development].

If the feature involves voice input or output, also work through:

Decision	Options	Notes
Voice direction	Input only (transcription), output only (TTS), bidirectional (conversation)	Determines API choice: Whisper/TTS for async, Realtime API for conversation
Latency target	Real-time conversation (<500ms), interactive (<2s), batch (minutes OK)	Real-time voice has tighter budgets than text -- >500ms feels laggy
Voice identity	Default synthetic, custom/cloned, brand voice	Custom voices require golden reference samples (30s-5min) and consent
Language coverage	English only, top 5 languages, 20+ languages	Accuracy drops for low-resource languages; test each target language
Accent handling	Standard accents only, broad dialect coverage	Test with representative accent samples -- this is an equity issue
Audio environment	Clean/quiet, noisy/mobile, call center	Background noise tolerance affects model and preprocessing choice

Voice eval criteria (add to Step 4 quality dimensions):

Dimension	Definition	Threshold	How to measure
Word error rate (WER)	Transcription accuracy	(e.g., < 8% on test set)	Automated comparison to human transcripts
Naturalness (MOS)	How human the voice sounds (mean opinion score)	(e.g., > 4.0 on 1-5 scale)	Human listener panel rating
Response latency	Time from end of user speech to start of AI speech	(e.g., < 500ms p95)	Instrumented measurement
Interruption handling	Can user interrupt mid-response gracefully	(e.g., successful interrupt in < 200ms)	Scenario testing
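
Word error rate is the one voice metric above that is fully mechanical, so it is worth pinning down in the spec. A minimal sketch, assuming the usual definition (word-level edit distance divided by reference length) and a naive normalization of lowercasing and whitespace splitting; a real pipeline would likely also strip punctuation and normalize numbers. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("please cancel my order", "please cancel the order")
```

Because insertions count as errors, WER can exceed 100% on very noisy audio, which is worth knowing before committing to a threshold.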

Voice guardrails (add to Step 4 guardrails):

What happens when transcription confidence is low? (Ask to repeat vs. best-guess)
What happens when background noise makes speech unintelligible?
How does the system handle overlapping speech?
What disclosure is required? (Many jurisdictions require "you are speaking with an AI")

Voice cost projections (add to Step 5 cost table):

Per-minute audio processing cost (transcription + synthesis): typically $0.02-0.07/min
Telephony costs if phone-based (SIP/PSTN interconnect): $0.01-0.02/min
Total AI voice cost: typically $0.04-0.14/min (compare to human agent at $0.50-1.50/min)

Step 3: Design the prompt architecture

Outline the prompting approach -- not the exact prompts, but the strategy:

System prompt approach -- what persona, constraints, and output format does the system prompt establish?
User input handling -- how is user input preprocessed, validated, or augmented before reaching the model?
Context management -- what context is injected? (RAG, conversation history, user profile, external data)
Output parsing -- structured output (JSON, specific format) or free-form? How is output validated?
Few-shot examples -- are examples needed? How many? Static or dynamic?
Multi-step processing -- does the task need chain-of-thought reasoning, multiple passes, or staged processing? (e.g., classify first, then generate; or identify issues, then prioritize, then suggest fixes)
Context compaction -- keep system instructions under 50 lines to avoid context bloat. Long system prompts degrade output quality as the model struggles to weight all instructions equally. If the feature needs more than 50 lines of instructions, split into: core constraints (always loaded), domain context (loaded per request type), and reference material (retrieved via RAG, not stuffed into the prompt). Don't dump five files into the system prompt and expect good results.
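
The three-way split described under context compaction can be sketched as a small prompt builder. This is an illustration under stated assumptions, not a prescribed implementation: `PromptLayers`, `build_system_prompt`, and the "REFERENCE MATERIAL" label are hypothetical names, and the retrieved passages are assumed to come from whatever RAG step the feature uses.

```python
from dataclasses import dataclass, field

MAX_CORE_LINES = 50  # line budget taken from the guidance above


@dataclass(frozen=True)
class PromptLayers:
    core: str  # core constraints, always loaded
    domain: dict[str, str] = field(default_factory=dict)  # request type -> domain context


def build_system_prompt(layers: PromptLayers, request_type: str, retrieved: list[str]) -> str:
    """Assemble core constraints, per-request domain context, and retrieved references."""
    core_lines = [line for line in layers.core.splitlines() if line.strip()]
    if len(core_lines) > MAX_CORE_LINES:
        raise ValueError(f"core prompt has {len(core_lines)} lines; budget is {MAX_CORE_LINES}")

    parts = [layers.core.strip()]
    domain_context = layers.domain.get(request_type)
    if domain_context:  # only the context for this request type is loaded
        parts.append(domain_context.strip())
    if retrieved:  # reference material arrives via retrieval, not the static prompt
        parts.append("REFERENCE MATERIAL:\n" + "\n---\n".join(retrieved))
    return "\n\n".join(parts)


layers = PromptLayers(
    core="ROLE: You are a billing support assistant.",
    domain={"refund": "Refund rules summary.", "invoice": "Invoice rules summary."},
)
prompt = build_system_prompt(layers, "refund", ["Policy excerpt about refunds."])
```

Enforcing the line budget in code turns the 50-line guideline into something a CI check can fail on, rather than a convention people forget.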

System prompt skeleton (a starting template, not a finished prompt):

ROLE: You are (specific role), not a general assistant. (One sentence on the
      single job this prompt exists to do.)

INPUTS: You will receive (named inputs and their shape). Treat (X) as
        authoritative and (Y) as the user's claim, which may be wrong.

OUTPUT CONTRACT: Return (exact shape: JSON schema, or named sections, or a
        single labeled value). Never return prose outside this shape.

CONSTRAINTS:
- (The 3 to 5 hard rules. Each one maps to a guardrail or an eval dimension.)
- When a required input is missing or contradictory, return (defined
  fallback), do not guess.

REFUSE / ESCALATE WHEN: (the named conditions that must reach a human or a
        canned safe response, not a generated one).

The order matters: role and output contract before constraints, because the model anchors on what it is and what it must produce. A constraint the model reads after it has already decided the shape is a constraint it will weight less.

Output handling: contract, do not parse prose. Decide the output shape at spec time and bind the model to it, rather than generating free text and reverse-engineering it downstream. Modern model APIs enforce a response schema directly (for example structured outputs that constrain the response to a JSON schema, or strict tool-call parameters). Specify which the feature uses. The anti-pattern, "the model writes a paragraph and a regex pulls the number out," fails silently the first time the model phrases the paragraph differently.
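
To make the contract concrete, here is a minimal sketch of validating a model response against a fixed shape before anything downstream touches it. The field names (`status`, `value`, `reason`) and the two status values are hypothetical, chosen only for illustration; in practice the same shape would also be supplied to the provider's schema-enforcement feature, which is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_STATUS = {"ok", "insufficient_information"}  # refusal is a valid output


@dataclass(frozen=True)
class Answer:
    status: str
    value: Optional[float]
    reason: str


def parse_answer(raw: str) -> Answer:
    """Accept only the agreed JSON shape; raise ValueError on anything else."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"status", "value", "reason"}:
        raise ValueError("response has missing or unexpected fields")
    if data["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {data['status']!r}")
    if not isinstance(data["reason"], str):
        raise ValueError("reason must be a string")

    value = data["value"]
    if data["status"] == "ok":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("status 'ok' requires a numeric value")
        value = float(value)
    elif value is not None:
        raise ValueError("a refusal must not carry a value")
    return Answer(data["status"], value, data["reason"])


answer = parse_answer('{"status": "ok", "value": 42, "reason": "taken from the invoice total"}')
```

A prose response such as "The total is 42 dollars." raises immediately instead of being silently mis-parsed, which is the failure the regex approach hides.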

Reasoning and cost levers to specify (current-generation models): if the task needs multi-step reasoning, say so and pick the lever rather than leaving it to chance: adaptive or extended thinking for genuinely hard steps, an effort or reasoning level for the depth-versus-cost tradeoff, and prompt caching for any large, stable prefix (system prompt, retrieved context, few-shot block) reused across calls. Caching a stable 2K-token prefix is one of the most common savings teams leave on the table: providers that discount cached reads can cut the cost of those prefix tokens by up to roughly 10x, so the total saving depends on how much of each call is prefix. Name the levers in the spec so the build team does not rediscover them in production.
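
How much prompt caching saves depends on how much of each call is stable prefix. A back-of-envelope sketch, assuming cached prefix reads are billed at 10% of the normal input price; that multiplier and the $0.003/1K input price are placeholders, not any provider's actual rates, and cache-write surcharges and cache expiry are ignored.

```python
def input_cost(
    prefix_tokens: int,
    dynamic_tokens: int,
    price_per_1k: float,
    cached_read_multiplier: float = 1.0,
) -> float:
    """Input-token cost of one call; the stable prefix may be billed at a discount."""
    if not 0.0 <= cached_read_multiplier <= 1.0:
        raise ValueError("cached_read_multiplier must be between 0 and 1")
    prefix_cost = prefix_tokens / 1000 * price_per_1k * cached_read_multiplier
    dynamic_cost = dynamic_tokens / 1000 * price_per_1k
    return prefix_cost + dynamic_cost


# 2K-token stable prefix plus 500 tokens that change on every call.
uncached = input_cost(2000, 500, price_per_1k=0.003)
cached = input_cost(2000, 500, price_per_1k=0.003, cached_read_multiplier=0.1)
saving_factor = uncached / cached  # roughly 3.6x on input cost for this mix
```

Under these assumptions the prefix itself gets ten times cheaper, but the call as a whole improves by less because the dynamic tokens and all output tokens are still billed at full price.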

Prompt architecture anti-patterns:

Anti-pattern	Why it fails	Do instead
The kitchen-sink system prompt	200 lines of rules; the model weights none of them well and contradicts itself	Core constraints always loaded; domain context per request type; reference material via RAG
Parse the prose	Generate free text, extract structure with regex; breaks the first time phrasing shifts	Bind output to a schema (structured outputs / strict tool params) at spec time
Trusting the user's facts	Prompt does not separate authoritative input from user claims; model agrees with wrong premises (sycophancy)	Mark which inputs are authoritative; instruct the model to decline false premises
No defined "I don't know"	Model has no sanctioned fallback, so it fabricates one	Specify the exact low-confidence and missing-input behavior; make refusal a valid output
Few-shot drift	Static examples that no longer match the live distribution silently bias outputs	Version examples with the prompt; review them when the input distribution shifts
Stuffing context that should be retrieved	Every call pays for context that one call in ten needs	RAG for large or conditional reference material; keep the always-loaded prompt lean

Step 3b: Design the interaction layer

The prompt architecture defines what the AI does technically. The interaction layer defines what the user experiences. Both need to be designed.

Work through these decisions:

Decision	Options	Notes
Loading experience	Streaming text, progress indicator, skeleton screen, "thinking..." message	What does the user see while the AI works? Streaming builds trust through visibility.
Result presentation	Inline text, structured card, expandable sections, side panel, modal	How do results appear? Match the complexity of the output to the display format.
Confidence display	Hidden, subtle indicators, explicit confidence labels, source citations	Does the user need to know how sure the AI is? Clinical and financial domains usually need transparency.
Regeneration/retry	Thumbs up/down, "try again" button, "refine" with instructions, no retry	Can the user ask for a different answer? How?
Correction mechanism	Edit output directly, provide feedback, report errors, none	Can the user fix what the AI got wrong?
Human fallback	Escalate button, automatic routing, "talk to a person" link, none	When the AI can't help, what's the path to a human?

For deeper interaction design -- persona, conversation flow, trust patterns -- see /ai-persona-design, /ai-conversation-design, and /ai-trust-pattern.

Step 4: Define eval criteria and guardrails

Outline what "good" and "bad" look like:

Quality dimensions:

Dimension	Definition	Threshold	How to measure
(Accuracy)	(Does it get the facts right?)	(e.g., > 95% on golden dataset)	(Automated check, human review)
(Relevance)	(Does it answer what was asked?)	(e.g., > 90% rated relevant)	(LLM-as-judge, user feedback)
(Helpfulness)	(Is the output actionable and worth the user's attention?)	(e.g., > 80% acted on by users)	(User action tracking, feedback)
(Tone)	(Does it match brand voice?)	(e.g., passes tone rubric)	(Rubric scoring)
(Safety)	(Does it avoid harmful output?)	(e.g., 0% harmful in adversarial set)	(Red team testing)

Setting thresholds for v1: Start with a golden dataset of 50-100 representative examples. Set initial thresholds based on human baseline performance (how accurate is a human doing this task?). It's better to set an honest threshold and hit it than an aspirational one you can't measure.
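
Checking a threshold against a golden dataset reduces to a pass rate. A minimal sketch, with a stub standing in for the real model call; `GoldenExample`, `pass_rate`, and the exact-match grader are hypothetical names, and most real features would need a task-specific grader rather than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader; swap in a rubric or judge for free-form output."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(
    examples: Sequence[GoldenExample],
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of golden examples the system under test gets right."""
    if not examples:
        raise ValueError("golden dataset is empty")
    passed = sum(1 for ex in examples if is_correct(generate(ex.input_text), ex.expected))
    return passed / len(examples)


golden = [
    GoldenExample("Where is my parcel?", "shipping"),
    GoldenExample("I was charged twice", "billing"),
    GoldenExample("Cancel my plan", "cancellation"),
    GoldenExample("The app crashes on login", "bug"),
]
# Stub classifier that gets three of the four examples right.
canned = {"Where is my parcel?": "shipping", "I was charged twice": "billing",
          "Cancel my plan": "cancellation", "The app crashes on login": "billing"}
rate = pass_rate(golden, lambda text: canned[text])
meets_bar = rate >= 0.95
```

With only 50 to 100 examples, a single failure moves the rate by one to two percentage points, so thresholds finer than that cannot be distinguished on a dataset this size.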

Guardrails and fallback behavior:

What happens when the model returns low-confidence output?
What happens when the model is unavailable (timeout, rate limit, outage)?
What content should be blocked or filtered?
What does the fallback UX look like?
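
The first two questions translate directly into a wrapper around the model call. A sketch only: `ModelResult`, the confidence score, the 0.7 cutoff, and the canned messages are all hypothetical, and how a confidence value is obtained in the first place is a separate design decision.

```python
from dataclasses import dataclass
from typing import Callable

UNAVAILABLE_MESSAGE = "We can't generate an answer right now. You can retry or contact support."
LOW_CONFIDENCE_MESSAGE = "I'm not sure about this one. I've routed it to a person."


@dataclass(frozen=True)
class ModelResult:
    text: str
    confidence: float  # hypothetical score between 0 and 1


@dataclass(frozen=True)
class Response:
    text: str
    source: str  # "model", "low_confidence_fallback", or "unavailable_fallback"


def answer_with_guardrails(
    call_model: Callable[[str], ModelResult],
    user_input: str,
    min_confidence: float = 0.7,  # placeholder; calibrate against eval data
) -> Response:
    """Return the model's answer only when it is available and confident enough."""
    try:
        result = call_model(user_input)
    except (TimeoutError, ConnectionError):
        # Outage, timeout, or rate limit: canned response, never a generated one.
        return Response(UNAVAILABLE_MESSAGE, "unavailable_fallback")
    if result.confidence < min_confidence:
        return Response(LOW_CONFIDENCE_MESSAGE, "low_confidence_fallback")
    return Response(result.text, "model")


reply = answer_with_guardrails(lambda _: ModelResult("Your refund was issued on Monday.", 0.93), "Where is my refund?")
```

Recording `source` on every response lets the team measure how often each fallback fires, which is itself a quality signal.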

For deeper eval planning, use /ai-eval-design to build golden datasets and eval pipelines.

Step 5: Estimate costs

Component	Estimate	Basis
Avg. input tokens per request	(estimate)	(sample prompts)
Avg. output tokens per request	(estimate)	(sample outputs)
Context retrieval cost	(estimate)	(embedding/search calls per request)
Model pricing	($ per 1K tokens; input and output are usually priced separately)	(current pricing)
Cost per interaction	(calculated)	(input tokens x input price) + (output tokens x output price) + retrieval cost
Projected daily volume	(estimate)	(user base x usage frequency)
Monthly cost projection	(calculated)	cost per interaction x volume
Cost per user per month	(calculated)	monthly cost / active users
Cost ceiling	(budget limit)	(what's acceptable)

Worked example: A typical interaction with a 500-word user input (roughly 670 tokens), a 2K-token system prompt, and an 800-token response totals about 3.5K tokens. On a mid-tier model at a blended $0.015/1K tokens, that is roughly $0.05 per interaction. At 100 interactions/day, that's ~$150/month in model costs alone.
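
The arithmetic behind the worked example can be checked in a few lines. A minimal sketch, assuming roughly 4/3 tokens per English word and a single blended price for input and output tokens (real price lists usually quote the two separately); the function names are illustrative.

```python
TOKENS_PER_WORD = 4 / 3  # rough rule of thumb for English text; an assumption


def cost_per_interaction(
    input_tokens: int,
    output_tokens: int,
    price_per_1k: float,
    retrieval_cost: float = 0.0,
) -> float:
    """Blended-price cost of one interaction, plus any per-request retrieval cost."""
    return (input_tokens + output_tokens) / 1000 * price_per_1k + retrieval_cost


def monthly_cost(per_interaction: float, daily_volume: int, days: int = 30) -> float:
    return per_interaction * daily_volume * days


user_tokens = round(500 * TOKENS_PER_WORD)  # 500-word input, about 667 tokens
per_call = cost_per_interaction(
    input_tokens=2000 + user_tokens,  # system prompt + user input
    output_tokens=800,
    price_per_1k=0.015,
)
month = monthly_cost(per_call, daily_volume=100)        # about $156
month_at_10x = monthly_cost(per_call, daily_volume=1000)  # same design at 10x volume
```

Running the same function at 10x volume is a quick way to see whether the design still fits under the cost ceiling before the stress-test step.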

Include notes on cost optimization: caching, shorter prompts, smaller models for simple tasks, batching.

Step 6: Draft the AI product spec

Compile into the spec document:

# AI Product Spec: (Feature name)

## Overview
(What we're building, why AI, and the core user value -- 3-4 sentences.)

## Problem Statement
(The user problem, with evidence. Why AI is the right solution approach.)

## Model Requirements
(Table from Step 2 -- capability, quality bar, latency, context, cost, data sensitivity)

## Prompt Architecture
(Strategy from Step 3 -- system prompt approach, context management, output parsing)

## Quality & Eval Criteria
(Table from Step 4 -- dimensions, thresholds, measurement methods)

## Guardrails & Fallbacks
(What can go wrong and what happens when it does)

## User Experience & Interaction Design
(How users interact with the AI feature -- loading experience, result presentation, confidence display, regeneration/correction mechanisms, human fallback paths. From Step 3b.)

## Integration Architecture
(How this connects to existing systems -- APIs, webhooks, data flow, authentication)

## Cost Projections
(Table from Step 5 -- per-interaction cost, volume, monthly projection, ceiling)

## Goals & Success Metrics
| Goal | Metric | How we measure it |
|------|--------|-------------------|
| (Goal 1) | (Specific metric) | (Tool or method) |
| (Goal 2) | (Specific metric) | (Tool or method) |

## Scope
**In scope:**
- (Capability 1)
- (Capability 2)

**Out of scope:**
- (Excluded item -- and why)

## Open Questions
- (Unresolved decisions)
- (Things that need testing to determine)

## Handoff to Build

When ready to prototype this spec with an AI coding tool:
1. **Gather visual references** -- take screenshots of similar products or UI patterns you want to match
2. **Start with the plan, not the code** -- paste this spec (or a summary) into your AI tool and ask it to create a build plan before writing code
3. **Build in phases** -- break the scope into 3-4 phases; build and review one at a time
4. **Don't over-specify prompts yet** -- get the UX working first, then tune the AI behavior

See `/ai-prototype-guide` for the full prototyping workflow.

Step 7: Stress-test and finalize

Challenge the spec:

What's the worst thing the AI could output? Is the guardrail sufficient?
Is the cost projection realistic at 10x the expected volume?
Can you actually measure the eval criteria with current tooling?
Is the quality bar honest or aspirational?
What would a skeptical engineer ask about the prompt architecture?
What would a skeptical ML engineer ask about model selection?

Revise based on user responses.

Uncertainty Policy

Topic	Tolerance	Action
User problem statement	Low	STOP and ask -- spec is useless without a real problem
Quality dimensions and thresholds	Low	STOP and ask -- vague quality bars lead to vague features
Model choice	Medium	Recommend + flag [RECOMMENDED] -- can be changed during dev
Cost estimates	Medium	Estimate + flag [ESTIMATED] -- refine with actual usage
Prompt architecture details	Medium	Propose approach + flag [PROPOSED] -- will iterate
Guardrail specifics	Medium	Propose + flag [PROPOSED] -- needs adversarial testing
Competitor details	High	Best guess from context

Default: STOP and ask when a topic is not listed above.
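
If the policy is applied programmatically, the table above is a lookup with a conservative default. A small sketch; the action labels (`stop_and_ask` and so on) are hypothetical shorthand for the Action column, not names defined elsewhere in this skill.

```python
UNCERTAINTY_POLICY = {
    "user problem statement": "stop_and_ask",
    "quality dimensions and thresholds": "stop_and_ask",
    "model choice": "recommend_and_flag",      # [RECOMMENDED]
    "cost estimates": "estimate_and_flag",     # [ESTIMATED]
    "prompt architecture details": "propose_and_flag",  # [PROPOSED]
    "guardrail specifics": "propose_and_flag",          # [PROPOSED]
    "competitor details": "best_guess",
}


def action_for(topic: str) -> str:
    """Look up the action for a topic; unlisted topics default to stopping and asking."""
    return UNCERTAINTY_POLICY.get(topic.strip().lower(), "stop_and_ask")


action = action_for("Model choice")
```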

Output location

Present the spec as formatted text in the conversation for the user to copy into their docs tool.

Example Output

Input

Feature name and description: "Contract Redline Assistant" — an AI-powered feature inside Ironclad's contract management platform that automatically identifies risky clauses in uploaded contracts and suggests pre-approved fallback language from the company's playbook
User problem: Legal teams at mid-market companies (50–500 employees) spend 3–6 hours per contract manually comparing vendor paper against their standard positions; quotes from users: "I redline the same indemnification clause 20 times a month" and "We miss things when we're rushed"
Constraints: < 8s latency acceptable (lawyers aren't waiting in real-time), regulated data environment (SOC 2 Type II required, contracts contain PII and confidential terms), target cost ≤ $0.30 per contract review, expected volume 400 contracts/day across customer base
Current workflow: Lawyers download PDFs, mark up in Word, email redlines back — no tooling, fully manual, average 22-day contract cycle
Personas: In-house counsel (accuracy-critical, low tolerance for hallucinated legal language), legal ops managers (want audit trails and throughput metrics)

Output

AI Product Spec: Contract Redline Assistant

Overview

The Contract Redline Assistant analyzes uploaded vendor contracts against a company's pre-configured legal playbook, flags clauses that deviate from acceptable positions, and suggests pre-approved fallback language for each identified issue. AI is the right approach because the task requires semantic understanding of contractual intent — not just keyword matching — across unlimited clause variations and contract structures. The feature targets in-house legal teams at mid-market companies where manual redlining is the primary bottleneck in the contract cycle.

Problem Statement

In-house counsel at companies without large legal departments spend 3–6 hours per contract on redlining work that is 70–80% repetitive. Fatigue and time pressure cause clauses to be missed. A rules engine cannot handle this task because vendor paper uses unlimited structural and linguistic variation to express the same legal concepts — semantic understanding is required.

Evidence: User interviews, support tickets referencing "missed indemnification caps," average 22-day contract cycle in customer base.

Model Requirements

Decision	Recommendation	Notes
Primary capabilities	Extraction + classification + text generation	Extract clauses → classify risk → generate redline suggestions
Quality bar	Must be correct	Legal output — hallucinated fallback language is a liability risk
Latency target	Interactive (< 8s)	Lawyers submit and wait; not real-time
Streaming needed?	Yes	Multi-clause contracts may take 5–8s; stream results clause-by-clause
Context window need	Large (< 128K)	Full contracts can run 15,000–40,000 tokens
Cost sensitivity	Medium	B2B SaaS; $0.30/contract ceiling is workable with mid-tier model
Data sensitivity	Regulated (SOC 2 Type II)	PII and confidential deal terms; no training on customer data; requires a data processing agreement (DPA) or equivalent

Model recommendation: GPT-4o or Claude 3.5 Sonnet via enterprise API with data processing agreement. Both support at least 128K tokens of context, structured output, and enterprise data commitments. [RECOMMENDED -- verify during development]

For clause extraction on well-structured contracts, a smaller model (GPT-4o mini) can pre-segment the document before the full model processes flagged sections — reducing token costs by ~40%. [PROPOSED]

Prompt Architecture

System prompt approach: Establishes the model as a "contract review assistant operating under [Company]'s legal playbook." Constraints include: never fabricate fallback language not present in the playbook, output structured JSON per clause, flag confidence level per finding, and never provide legal advice beyond playbook positions. Core instructions capped at ~40 lines; playbook content loaded via RAG per contract type (NDA, MSA, SaaS agreement). [PROPOSED]
User input handling: Uploaded PDFs are parsed and segmented into clauses using a pre-processing step (rule-based section detection + small model classification). Only flagged or ambiguous clauses are passed to the primary model — not the full document text. This reduces per-request token consumption significantly.
Context management:
- Playbook positions retrieved via embedding similarity to each extracted clause (top 3 matches per clause)
- Contract metadata injected: contract type, counterparty tier, jurisdiction if detected
- No conversation history needed — stateless per review session
Output parsing: Structured JSON required for each clause finding:
```
{ clause_type, risk_level, extracted_text, playbook_position, suggested_redline, confidence, explanation }
```
Output validated against schema before rendering. Missing fields trigger a re-call with stricter formatting instructions (max 1 retry). [PROPOSED]
Few-shot examples: 3–5 static examples per contract type (NDA, MSA) embedded in the domain context layer. Examples cover high-risk clause types: indemnification, limitation of liability, IP ownership, data processing.
Multi-step processing:
- Pass 1 (small model): Segment and classify contract sections
- Pass 2 (primary model): Analyze flagged clauses against playbook, generate redlines
- Pass 3 (validation layer): Schema check + confidence thresholding before rendering
Context compaction: Core system prompt (~40 lines) + dynamic playbook retrieval via RAG. Full playbook never stuffed into context. [PROPOSED]

Quality & Eval Criteria

Dimension	Definition	Threshold	How to measure
Clause detection recall	% of risky clauses correctly flagged (every miss counts against this)	> 92% on golden set	Human-reviewed contract set (100 contracts, 800+ clauses)
Fallback accuracy	Suggested language matches or is a valid variant of playbook position	> 97% on playbook-covered clauses	Automated diff against playbook + senior counsel review
False positive rate	% of acceptable clauses incorrectly flagged	< 15%	Human review of flagged output
Hallucination rate	Suggested language not grounded in playbook	0% (zero tolerance)	Automated grounding check + human audit
Explanation clarity	Lawyer can understand why clause was flagged without additional context	> 85% rated clear	User feedback, in-app thumbs rating
Safety	No output that constitutes legal advice beyond playbook scope	0% in adversarial set	Red team testing with edge-case contracts

Golden dataset: Build from 100 historical contracts already reviewed by counsel, with human-labeled ground truth for flagged clauses and accepted redlines. [NEEDS INPUT — customer data sharing agreements required]
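
For reference, the recall and false positive rate rows can be computed from the labeled golden set as follows. An illustrative sketch: each clause is represented as a pair (counsel marked it risky, model flagged it), the counts are made up for the example, and the false positive rate follows the definition in the table (share of acceptable clauses that were flagged).

```python
def detection_metrics(labels: list[tuple[bool, bool]]) -> dict[str, float]:
    """Each item is (is_risky_per_counsel, flagged_by_model) for one clause."""
    true_pos = sum(1 for risky, flagged in labels if risky and flagged)
    false_neg = sum(1 for risky, flagged in labels if risky and not flagged)
    false_pos = sum(1 for risky, flagged in labels if not risky and flagged)
    true_neg = sum(1 for risky, flagged in labels if not risky and not flagged)
    if true_pos + false_neg == 0 or false_pos + true_neg == 0:
        raise ValueError("need at least one risky and one acceptable clause")
    return {
        "recall": true_pos / (true_pos + false_neg),
        "false_positive_rate": false_pos / (false_pos + true_neg),
    }


# Made-up counts: 10 risky clauses (9 flagged), 20 acceptable clauses (2 flagged).
sample = (
    [(True, True)] * 9 + [(True, False)] * 1
    + [(False, True)] * 2 + [(False, False)] * 18
)
metrics = detection_metrics(sample)
passes = metrics["recall"] > 0.92 and metrics["false_positive_rate"] < 0.15
```

In this made-up sample the false positive rate is within its limit but recall is not, so the release gate fails on the miss, which matches the priority a legal team would set.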

Guardrails & Fallbacks

Low confidence output (< 0.70): Clause flagged as "Needs Manual Review" with explanation surfaced to user; suggested redline shown with explicit warning label. Never silently omitted.
Clause type not in playbook: Model instructed to return "No playbook position — escalate to counsel" rather than generate novel fallback language. This is a hard constraint enforced via output schema validation.
Model timeout / rate limit: Graceful degradation — partial results displayed for completed clauses; banner shown: "Review incomplete — [N] clauses could not be analyzed. Download for manual review." Full manual download always available.
Hallucination prevention: Suggested redlines cross-referenced against playbook embeddings post-generation. Cosine similarity below 0.75 triggers a discard + "Manual review required" flag rather than displaying the output. [PROPOSED — threshold to be calibrated]
Adversarial / junk input: Non-contract uploads (e.g., images, blank PDFs) caught at preprocessing; user shown inline error before model is called.
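
The post-generation grounding check could look like the sketch below. It assumes the suggested redline and each playbook position have already been embedded by some embedding model (not shown); the toy two-dimensional vectors stand in for real embeddings, and 0.75 is the proposed threshold still to be calibrated.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def is_grounded(
    redline_vec: Sequence[float],
    playbook_vecs: Sequence[Sequence[float]],
    threshold: float = 0.75,  # proposed value from the guardrail above
) -> bool:
    """True if the redline is close enough to at least one playbook position."""
    return any(cosine_similarity(redline_vec, vec) >= threshold for vec in playbook_vecs)


playbook = [[0.9, 0.1], [0.2, 0.8]]          # toy embeddings of playbook positions
show_redline = is_grounded([1.0, 0.0], playbook)    # close to the first position
discard_redline = not is_grounded([-1.0, 0.0], playbook)  # close to neither
```

Embedding similarity only shows that the redline resembles playbook language, so the human audit in the eval table remains the backstop for wording that is similar but legally different.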

User Experience & Interaction Design

Decision	Design choice	Rationale
Loading experience	Streaming clause-by-clause results with progress bar ("Analyzing clause 4 of 17…")	5–8s wait is long; progressive display builds trust and lets lawyers start reading
Result presentation	Side-by-side panel: original clause left, flagged issues + suggested redline right; color-coded risk level (red/yellow/green)	Mirrors lawyers' existing Word redline mental model
Confidence display	Explicit confidence label per finding ("
