Use this when you're speccing a feature that uses LLMs, AI models, or generative capabilities. A traditional PRD misses the decisions that make or break AI features: which model, how to evaluate quality, what happens when the model fails, and how much it costs per interaction. This skill extends /prd-draft with the AI-specific sections teams forget until production.
Related skills: Extends `/prd-draft` for traditional PRD structure. Uses `/ai-eval-design` for deeper eval planning. See `/ai-prototype-guide` for building the prototype from this spec. See `/multi-model-strategy` for multi-model routing decisions.
Process
Step 1: Gather context
Ask the user:
- What AI-powered feature are you building? (Name and one-line description)
- What user problem does this solve? (Evidence: user quotes, support tickets, research)
- What does the AI do that couldn't be done without it? (The "why AI" test -- if a rules engine or lookup table works, you don't need AI)
- Who are the users? (Personas and their expectations for AI accuracy, speed, and tone)
- How does this work today? (Current workflow, tools in use, pain points with the status quo)
- Known constraints -- budget per interaction, latency requirements, data privacy restrictions, compliance needs, expected volume (daily/monthly interactions)
- Prior art -- how competitors or existing tools handle this (include screenshots if available)
If the user provides a brief, extract what you can and ask follow-ups for gaps. Flag anything unclear with [NEEDS INPUT].
Step 2: Define model requirements
Work through these decisions with the user:
| Decision | Options | Notes |
|---|---|---|
| Primary capabilities | Text generation, classification, extraction, summarization, code generation, multimodal, embedding (often multiple) | What the model actually needs to do -- most features combine capabilities |
| Quality bar | Must be correct (medical, legal), should be helpful (productivity), can be approximate (creative) | Determines eval rigor |
| Latency target | Real-time (< 2s), interactive (< 10s), batch (minutes OK) | Affects model choice and architecture |
| Streaming needed? | Yes (progressive results), no (complete response) | Long-running tasks need streaming for UX; affects API choice |
| Context window need | Small (< 4K tokens), medium (< 32K), large (< 128K), very large (> 128K) | Affects model and architecture |
| Cost sensitivity | High (consumer, high volume), medium (B2B, moderate volume), low (enterprise, low volume) | Affects model tier and caching strategy |
| Data sensitivity | Public data only, private but not regulated, regulated (HIPAA, SOC2, GDPR) | Affects deployment and vendor choice |
After completing the table, recommend specific model families based on the answers. Use the knowledge references to map capability + quality bar + cost sensitivity to model tiers. Flag the recommendation as [RECOMMENDED -- verify during development].
If the feature involves voice input or output, also work through:
| Decision | Options | Notes |
|---|---|---|
| Voice direction | Input only (transcription), output only (TTS), bidirectional (conversation) | Determines API choice: Whisper/TTS for async, Realtime API for conversation |
| Latency target | Real-time conversation (<500ms), interactive (<2s), batch (minutes OK) | Real-time voice has tighter budgets than text -- >500ms feels laggy |
| Voice identity | Default synthetic, custom/cloned, brand voice | Custom voices require golden reference samples (30s-5min) and consent |
| Language coverage | English only, top 5 languages, 20+ languages | Accuracy drops for low-resource languages; test each target language |
| Accent handling | Standard accents only, broad dialect coverage | Test with representative accent samples -- this is an equity issue |
| Audio environment | Clean/quiet, noisy/mobile, call center | Background noise tolerance affects model and preprocessing choice |
Voice eval criteria (add to Step 4 quality dimensions):
| Dimension | Definition | Threshold | How to measure |
|---|---|---|---|
| Word error rate (WER) | Transcription accuracy | (e.g., < 8% on test set) | Automated comparison to human transcripts |
| Naturalness (MOS) | How human does the voice sound | (e.g., > 4.0 on 1-5 scale) | Human listener panel rating |
| Response latency | Time from end of user speech to start of AI speech | (e.g., < 500ms p95) | Instrumented measurement |
| Interruption handling | Can user interrupt mid-response gracefully | (e.g., successful interrupt in < 200ms) | Scenario testing |
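Word error rate is the one dimension in this table with a standard formula: substitutions plus deletions plus insertions, divided by the number of words in the human reference transcript. The sketch below is one minimal way to compute it, assuming lowercasing and whitespace tokenization are enough normalization (real eval pipelines usually also strip punctuation and normalize numbers). The function name `word_error_rate` is illustrative, not part of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.1%}")
```

Averaging this over the test set, and breaking it down per language and accent group, gives the number to compare against the threshold.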
Voice guardrails (add to Step 4 guardrails):
- What happens when transcription confidence is low? (Ask to repeat vs. best-guess)
- What happens when background noise makes speech unintelligible?
- How does the system handle overlapping speech?
- What disclosure is required? (Many jurisdictions require "you are speaking with an AI")
Voice cost projections (add to Step 5 cost table):
- Per-minute audio processing cost (transcription + synthesis): typically $0.02-0.07/min
- Telephony costs if phone-based (SIP/PSTN interconnect): $0.01-0.02/min
- Total AI voice cost: typically $0.04-0.14/min (compare to human agent at $0.50-1.50/min)
Step 3: Design the prompt architecture
Outline the prompting approach -- not the exact prompts, but the strategy:
- System prompt approach -- what persona, constraints, and output format does the system prompt establish?
- User input handling -- how is user input preprocessed, validated, or augmented before reaching the model?
- Context management -- what context is injected? (RAG, conversation history, user profile, external data)
- Output parsing -- structured output (JSON, specific format) or free-form? How is output validated?
- Few-shot examples -- are examples needed? How many? Static or dynamic?
- Multi-step processing -- does the task need chain-of-thought reasoning, multiple passes, or staged processing? (e.g., classify first, then generate; or identify issues, then prioritize, then suggest fixes)
- Context compaction -- keep system instructions under 50 lines to avoid context bloat. Long system prompts degrade output quality as the model struggles to weight all instructions equally. If the feature needs more than 50 lines of instructions, split into: core constraints (always loaded), domain context (loaded per request type), and reference material (retrieved via RAG, not stuffed into the prompt). Don't dump five files into the system prompt and expect good results.
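The three-layer split described under context compaction can be sketched as a small prompt assembler. This is an illustration under stated assumptions, not a prescribed implementation: `PromptLayers`, `build_system_prompt`, and the idea of enforcing the 50-line budget with an exception are all hypothetical, and the retrieved chunks are assumed to come from whatever RAG step the feature already has.

```python
from dataclasses import dataclass, field

MAX_CORE_LINES = 50  # line budget suggested in the text above


@dataclass
class PromptLayers:
    core_constraints: str  # always loaded
    domain_context: dict = field(default_factory=dict)  # request type -> extra instructions


def build_system_prompt(layers, request_type, retrieved_chunks=()):
    """Assemble core constraints, per-request domain context, and retrieved references."""
    core_lines = [line for line in layers.core_constraints.splitlines() if line.strip()]
    if len(core_lines) > MAX_CORE_LINES:
        raise ValueError(
            f"core constraints use {len(core_lines)} lines; budget is {MAX_CORE_LINES}"
        )

    parts = [layers.core_constraints.strip()]
    domain = layers.domain_context.get(request_type)
    if domain:  # only the context for this request type is loaded
        parts.append(domain.strip())
    if retrieved_chunks:  # reference material arrives via retrieval, not a static dump
        bullets = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
        parts.append("Reference material:\n" + bullets)
    return "\n\n".join(parts)


layers = PromptLayers(
    core_constraints="Be concise.\nCite sources.",
    domain_context={"nda": "NDA rules apply."},
)
print(build_system_prompt(layers, "nda", ["clause A"]))
```

Failing loudly when the core layer outgrows its budget is a design choice; a softer version could log a warning instead.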
Step 3b: Design the interaction layer
The prompt architecture defines what the AI does technically. The interaction layer defines what the user experiences. Both need to be designed.
Work through these decisions:
| Decision | Options | Notes |
|---|---|---|
| Loading experience | Streaming text, progress indicator, skeleton screen, "thinking..." message | What does the user see while the AI works? Streaming builds trust through visibility. |
| Result presentation | Inline text, structured card, expandable sections, side panel, modal | How do results appear? Match the complexity of the output to the display format. |
| Confidence display | Hidden, subtle indicators, explicit confidence labels, source citations | Does the user need to know how sure the AI is? Clinical and financial domains usually need transparency. |
| Regeneration/retry | Thumbs up/down, "try again" button, "refine" with instructions, no retry | Can the user ask for a different answer? How? |
| Correction mechanism | Edit output directly, provide feedback, report errors, none | Can the user fix what the AI got wrong? |
| Human fallback | Escalate button, automatic routing, "talk to a person" link, none | When the AI can't help, what's the path to a human? |
For deeper interaction design -- persona, conversation flow, trust patterns -- see `/ai-persona-design`, `/ai-conversation-design`, and `/ai-trust-pattern`.
Step 4: Define eval criteria and guardrails
Outline what "good" and "bad" look like:
Quality dimensions:
| Dimension | Definition | Threshold | How to measure |
|---|---|---|---|
| (Accuracy) | (Does it get the facts right?) | (e.g., > 95% on golden dataset) | (Automated check, human review) |
| (Relevance) | (Does it answer what was asked?) | (e.g., > 90% rated relevant) | (LLM-as-judge, user feedback) |
| (Helpfulness) | (Is the output actionable and worth the user's attention?) | (e.g., > 80% acted on by users) | (User action tracking, feedback) |
| (Tone) | (Does it match brand voice?) | (e.g., passes tone rubric) | (Rubric scoring) |
| (Safety) | (Does it avoid harmful output?) | (e.g., 0% harmful in adversarial set) | (Red team testing) |
Setting thresholds for v1: Start with a golden dataset of 50-100 representative examples. Set initial thresholds based on human baseline performance (how accurate is a human doing this task?). It's better to set an honest threshold and hit it than an aspirational one you can't measure.
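To make "set a threshold and hit it" concrete, here is a minimal sketch of scoring a graded golden dataset against per-dimension thresholds. It assumes each golden example has already been graded pass/fail per dimension (by a human, an automated check, or an LLM judge) and that thresholds are minimum pass rates; the function names and the sample numbers are illustrative only.

```python
from collections import defaultdict


def pass_rates(graded):
    """graded: iterable of (dimension, passed) pairs, one per graded golden example."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in graded:
        total[dimension] += 1
        passed[dimension] += int(ok)
    return {dimension: passed[dimension] / total[dimension] for dimension in total}


def check_thresholds(rates, thresholds):
    """Return {dimension: met?}; a dimension with no graded examples counts as not met."""
    return {d: rates.get(d, 0.0) >= minimum for d, minimum in thresholds.items()}


# Hypothetical grading results for a 100-example golden set.
graded = (
    [("accuracy", True)] * 96 + [("accuracy", False)] * 4
    + [("relevance", True)] * 85 + [("relevance", False)] * 15
)
rates = pass_rates(graded)
print(check_thresholds(rates, {"accuracy": 0.95, "relevance": 0.90}))
```

With only 50-100 examples, a single failure moves a pass rate by 1-2 percentage points, which is worth keeping in mind when reading small differences between runs.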
Guardrails and fallback behavior:
- What happens when the model returns low-confidence output?
- What happens when the model is unavailable (timeout, rate limit, outage)?
- What content should be blocked or filtered?
- What does the fallback UX look like?
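The first two guardrail questions above (low confidence, model unavailable) often end up as a thin wrapper around the model call. The sketch below shows one possible shape. Everything in it is an assumption for illustration: `call_model` is a hypothetical callable returning `(text, confidence)`, `ModelUnavailable` stands in for whatever timeout or rate-limit errors the real client raises, and the 0.7 confidence floor is a placeholder. Most model APIs do not return a confidence score directly, so that value would have to come from a separate scoring step.

```python
import time


class ModelUnavailable(Exception):
    """Stand-in for timeout, rate-limit, or outage errors from the model client."""


def call_with_fallback(call_model, prompt, *, min_confidence=0.7,
                       max_attempts=2, backoff_seconds=0.5, sleep=time.sleep):
    """Return a status dict the UI can branch on: ok, needs_review, or unavailable."""
    for attempt in range(max_attempts):
        try:
            text, confidence = call_model(prompt)
        except ModelUnavailable:
            if attempt + 1 < max_attempts:
                sleep(backoff_seconds * 2 ** attempt)  # exponential backoff before retrying
            continue
        if confidence < min_confidence:
            # Low confidence: surface the output, but route it to the review UX.
            return {"status": "needs_review", "text": text}
        return {"status": "ok", "text": text}
    # Every attempt failed: the caller shows the fallback UX.
    return {"status": "unavailable", "text": None}


def flaky_model(prompt, _calls=[]):
    """Hypothetical model that fails once, then answers with high confidence."""
    _calls.append(prompt)
    if len(_calls) == 1:
        raise ModelUnavailable("simulated timeout")
    return "draft answer", 0.92


print(call_with_fallback(flaky_model, "Summarize this ticket", sleep=lambda s: None))
```

Returning an explicit status instead of raising keeps the decision about what the user sees in the interaction layer, where the fallback UX is defined.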
For deeper eval planning, use `/ai-eval-design` to build golden datasets and eval pipelines.
Step 5: Estimate costs
| Component | Estimate | Basis |
|---|---|---|
| Avg. input tokens per request | (estimate) | (sample prompts) |
| Avg. output tokens per request | (estimate) | (sample outputs) |
| Context retrieval cost | (estimate) | (embedding/search calls per request) |
| Model pricing | ($ per 1K tokens) | (current pricing) |
| Cost per interaction | (calculated) | input + output + retrieval cost |
| Projected daily volume | (estimate) | (user base x usage frequency) |
| Monthly cost projection | (calculated) | cost per interaction x volume |
| Cost per user per month | (calculated) | monthly cost / active users |
| Cost ceiling | (budget limit) | (what's acceptable) |
Worked example: A typical interaction with a 500-word user input (roughly 650-700 tokens), a 2K-token system prompt, and an 800-token response totals about 3.5K tokens. On a mid-tier model at a blended $0.015/1K tokens, that costs roughly $0.05 per interaction. At 100 interactions/day, that's ~$150/month in model costs alone.
Include notes on cost optimization: caching, shorter prompts, smaller models for simple tasks, batching.
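The cost table reduces to a few lines of arithmetic, sketched here with the worked example's numbers. Two assumptions are baked in and labeled in the code: a rough ratio of 4/3 tokens per English word, and a single blended price per 1K tokens as in the worked example (real pricing usually charges input and output tokens at different rates, so a production estimate should split them). The function names are illustrative.

```python
TOKENS_PER_WORD = 4 / 3  # assumption: rough rule of thumb for English text


def cost_per_interaction(input_words, system_prompt_tokens, output_tokens,
                         price_per_1k_tokens, retrieval_cost=0.0):
    """Blended-price estimate: (input + output tokens) / 1000 * price + retrieval."""
    input_tokens = input_words * TOKENS_PER_WORD + system_prompt_tokens
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1000 * price_per_1k_tokens + retrieval_cost


def monthly_cost(per_interaction, daily_volume, days=30):
    return per_interaction * daily_volume * days


per_call = cost_per_interaction(
    input_words=500, system_prompt_tokens=2000, output_tokens=800,
    price_per_1k_tokens=0.015,
)
print(f"per interaction: ${per_call:.3f}")
print(f"monthly at 100/day: ${monthly_cost(per_call, 100):.0f}")
print(f"monthly at 10x volume: ${monthly_cost(per_call, 1000):.0f}")
```

The last line reruns the projection at 10x volume, which is the same check Step 7 asks for when stress-testing the spec.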
Step 6: Draft the AI product spec
Compile into the spec document:
# AI Product Spec: (Feature name)
## Overview
(What we're building, why AI, and the core user value -- 3-4 sentences.)
## Problem Statement
(The user problem, with evidence. Why AI is the right solution approach.)
## Model Requirements
(Table from Step 2 -- capability, quality bar, latency, context, cost, data sensitivity)
## Prompt Architecture
(Strategy from Step 3 -- system prompt approach, context management, output parsing)
## Quality & Eval Criteria
(Table from Step 4 -- dimensions, thresholds, measurement methods)
## Guardrails & Fallbacks
(What can go wrong and what happens when it does)
## User Experience & Interaction Design
(How users interact with the AI feature -- loading experience, result presentation, confidence display, regeneration/correction mechanisms, human fallback paths. From Step 3b.)
## Integration Architecture
(How this connects to existing systems -- APIs, webhooks, data flow, authentication)
## Cost Projections
(Table from Step 5 -- per-interaction cost, volume, monthly projection, ceiling)
## Goals & Success Metrics
| Goal | Metric | How we measure it |
|------|--------|-------------------|
| (Goal 1) | (Specific metric) | (Tool or method) |
| (Goal 2) | (Specific metric) | (Tool or method) |
## Scope
**In scope:**
- (Capability 1)
- (Capability 2)
**Out of scope:**
- (Excluded item -- and why)
## Open Questions
- (Unresolved decisions)
- (Things that need testing to determine)
## Handoff to Build
When ready to prototype this spec with an AI coding tool:
1. **Gather visual references** -- take screenshots of similar products or UI patterns you want to match
2. **Start with the plan, not the code** -- paste this spec (or a summary) into your AI tool and ask it to create a build plan before writing code
3. **Build in phases** -- break the scope into 3-4 phases; build and review one at a time
4. **Don't over-specify prompts yet** -- get the UX working first, then tune the AI behavior
See `/ai-prototype-guide` for the full prototyping workflow.
Step 7: Stress-test and finalize
Challenge the spec:
- What's the worst thing the AI could output? Is the guardrail sufficient?
- Is the cost projection realistic at 10x the expected volume?
- Can you actually measure the eval criteria with current tooling?
- Is the quality bar honest or aspirational?
- What would a skeptical engineer ask about the prompt architecture?
- What would a skeptical ML engineer ask about model selection?
Revise based on user responses.
Uncertainty Policy
| Topic | Tolerance | Action |
|---|---|---|
| User problem statement | Low | STOP and ask -- spec is useless without a real problem |
| Quality dimensions and thresholds | Low | STOP and ask -- vague quality bars lead to vague features |
| Model choice | Medium | Recommend + flag [RECOMMENDED] -- can be changed during dev |
| Cost estimates | Medium | Estimate + flag [ESTIMATED] -- refine with actual usage |
| Prompt architecture details | Medium | Propose approach + flag [PROPOSED] -- will iterate |
| Guardrail specifics | Medium | Propose + flag [PROPOSED] -- needs adversarial testing |
| Competitor details | High | Best guess from context |
Default: STOP and ask when a topic is not listed above.
Output location
Present the spec as formatted text in the conversation for the user to copy into their docs tool.
Example Output
Input
- Feature name and description: "Contract Redline Assistant" — an AI-powered feature inside Ironclad's contract management platform that automatically identifies risky clauses in uploaded contracts and suggests pre-approved fallback language from the company's playbook
- User problem: Legal teams at mid-market companies (50–500 employees) spend 3–6 hours per contract manually comparing vendor paper against their standard positions; quotes from users: "I redline the same indemnification clause 20 times a month" and "We miss things when we're rushed"
- Constraints: < 8s latency acceptable (lawyers aren't waiting in real-time), regulated data environment (SOC 2 Type II required, contracts contain PII and confidential terms), target cost ≤ $0.30 per contract review, expected volume 400 contracts/day across customer base
- Current workflow: Lawyers download PDFs, mark up in Word, email redlines back — no tooling, fully manual, average 22-day contract cycle
- Personas: In-house counsel (accuracy-critical, low tolerance for hallucinated legal language), legal ops managers (want audit trails and throughput metrics)
Output
AI Product Spec: Contract Redline Assistant
Overview
The Contract Redline Assistant analyzes uploaded vendor contracts against a company's pre-configured legal playbook, flags clauses that deviate from acceptable positions, and suggests pre-approved fallback language for each identified issue. AI is the right approach because the task requires semantic understanding of contractual intent — not just keyword matching — across unlimited clause variations and contract structures. The feature targets in-house legal teams at mid-market companies where manual redlining is the primary bottleneck in the contract cycle.
Problem Statement
In-house counsel at companies without large legal departments spend 3–6 hours per contract on redlining work that is 70–80% repetitive. Fatigue and time pressure cause clauses to be missed. A rules engine cannot handle this task because vendor paper uses unlimited structural and linguistic variation to express the same legal concepts — semantic understanding is required.
Evidence: User interviews, support tickets referencing "missed indemnification caps," average 22-day contract cycle in customer base.
Model Requirements
| Decision | Recommendation | Notes |
|---|---|---|
| Primary capabilities | Extraction + classification + text generation | Extract clauses → classify risk → generate redline suggestions |
| Quality bar | Must be correct | Legal output — hallucinated fallback language is a liability risk |
| Latency target | Interactive (< 8s) | Lawyers submit and wait; not real-time |
| Streaming needed? | Yes | Multi-clause contracts may take 5–8s; stream results clause-by-clause |
| Context window need | Large (< 128K) | Full contracts can run 15,000–40,000 tokens |
| Cost sensitivity | Medium | B2B SaaS; $0.30/contract ceiling is workable with mid-tier model |
| Data sensitivity | Regulated (SOC 2 Type II) | PII and confidential deal terms; no training on customer data; requires a data processing agreement (DPA) or equivalent |
Model recommendation: GPT-4o or Claude 3.5 Sonnet via enterprise API with data processing agreement. Both support 128K context, structured output, and enterprise data commitments. [RECOMMENDED — verify during development]
For clause extraction on well-structured contracts, a smaller model (GPT-4o mini) can pre-segment the document before the full model processes flagged sections — reducing token costs by ~40%. [PROPOSED]
Prompt Architecture
- System prompt approach: Establishes the model as a "contract review assistant operating under [Company]'s legal playbook." Constraints include: never fabricate fallback language not present in the playbook, output structured JSON per clause, flag confidence level per finding, and never provide legal advice beyond playbook positions. Core instructions capped at ~40 lines; playbook content loaded via RAG per contract type (NDA, MSA, SaaS agreement). [PROPOSED]
- User input handling: Uploaded PDFs are parsed and segmented into clauses using a pre-processing step (rule-based section detection + small model classification). Only flagged or ambiguous clauses are passed to the primary model, not the full document text. This reduces per-request token consumption significantly.
- Context management:
  - Playbook positions retrieved via embedding similarity to each extracted clause (top 3 matches per clause)
  - Contract metadata injected: contract type, counterparty tier, jurisdiction if detected
  - No conversation history needed; each review session is stateless
- Output parsing: Structured JSON required for each clause finding: `{ clause_type, risk_level, extracted_text, playbook_position, suggested_redline, confidence, explanation }`. Output validated against schema before rendering. Missing fields trigger a re-call with stricter formatting instructions (max 1 retry). [PROPOSED]
- Few-shot examples: 3–5 static examples per contract type (NDA, MSA) embedded in the domain context layer. Examples cover high-risk clause types: indemnification, limitation of liability, IP ownership, data processing.
- Multi-step processing:
  - Pass 1 (small model): Segment and classify contract sections
  - Pass 2 (primary model): Analyze flagged clauses against playbook, generate redlines
  - Pass 3 (validation layer): Schema check + confidence thresholding before rendering
- Context compaction: Core system prompt (~40 lines) + dynamic playbook retrieval via RAG. Full playbook never stuffed into context. [PROPOSED]
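The output-parsing rule above (validate against the schema, re-call once with stricter formatting instructions) might look like the following sketch. The field names come from the spec; everything else is an assumption for illustration: `call_model` is a hypothetical callable that takes a prompt and returns raw text, the check only verifies that required keys are present (a fuller version would also validate types and allowed values), and the wording of the stricter instruction is a placeholder.

```python
import json

REQUIRED_FIELDS = {
    "clause_type", "risk_level", "extracted_text", "playbook_position",
    "suggested_redline", "confidence", "explanation",
}


def parse_finding(raw):
    """Return the finding as a dict, or None if it is not valid JSON with all fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data


def get_finding(call_model, prompt, max_retries=1):
    """Call the model, validate, and re-call with stricter instructions on failure."""
    finding = parse_finding(call_model(prompt))
    retries = 0
    while finding is None and retries < max_retries:
        strict_prompt = (
            prompt
            + "\n\nReturn only a JSON object with exactly these keys: "
            + ", ".join(sorted(REQUIRED_FIELDS))
        )
        finding = parse_finding(call_model(strict_prompt))
        retries += 1
    return finding  # None means: fall back to manual review for this clause


# Hypothetical model that omits fields on the first call and complies on the second.
responses = iter([
    '{"clause_type": "indemnification"}',
    json.dumps({name: "..." for name in REQUIRED_FIELDS}),
])
result = get_finding(lambda prompt: next(responses), "Review clause 4")
print(sorted(result))
```

A `None` result after the retry is the hook for the guardrails section: the clause is routed to manual review rather than rendered.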
Quality & Eval Criteria
| Dimension | Definition | Threshold | How to measure |
|---|---|---|---|
| Clause detection recall | % of risky clauses correctly flagged (no misses) | > 92% on golden set | Human-reviewed contract set (100 contracts, 800+ clauses) |
| Fallback accuracy | Suggested language matches or is a valid variant of playbook position | > 97% on playbook-covered clauses | Automated diff against playbook + senior counsel review |
| False positive rate | % of acceptable clauses incorrectly flagged | < 15% | Human review of flagged output |
| Hallucination rate | Suggested language not grounded in playbook | 0% acceptable | Automated grounding check + human audit |
| Explanation clarity | Lawyer can understand why clause was flagged without additional context | > 85% rated clear | User feedback, in-app thumbs rating |
| Safety | No output that constitutes legal advice beyond playbook scope | 0% in adversarial set | Red team testing with edge-case contracts |
Golden dataset: Build from 100 historical contracts already reviewed by counsel, with human-labeled ground truth for flagged clauses and accepted redlines. [NEEDS INPUT — customer data sharing agreements required]
Guardrails & Fallbacks
- Low confidence output (< 0.70): Clause flagged as `"Needs Manual Review"` with explanation surfaced to user; suggested redline shown with explicit warning label. Never silently omitted.
- Clause type not in playbook: Model instructed to return `"No playbook position — escalate to counsel"` rather than generate novel fallback language. This is a hard constraint enforced via output schema validation.
- Model timeout / rate limit: Graceful degradation: partial results displayed for completed clauses; banner shown: "Review incomplete — [N] clauses could not be analyzed. Download for manual review." Full manual download always available.
- Hallucination prevention: Suggested redlines cross-referenced against playbook embeddings post-generation. Cosine similarity below 0.75 triggers a discard + `"Manual review required"` flag rather than displaying the output. [PROPOSED -- threshold to be calibrated]
- Adversarial / junk input: Non-contract uploads (e.g., images, blank PDFs) caught at preprocessing; user shown inline error before model is called.
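The hallucination-prevention guardrail above amounts to a post-generation similarity check. The sketch below shows the comparison only, assuming embeddings for the suggested redline and for the playbook positions are already available as plain lists of floats. How those embeddings are produced is outside this sketch, the function names are hypothetical, and the 0.75 cutoff is the spec's proposed value that still needs calibration.

```python
import math

GROUNDING_THRESHOLD = 0.75  # proposed in the spec; to be calibrated


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as dissimilar


def is_grounded(redline_vec, playbook_vecs, threshold=GROUNDING_THRESHOLD):
    """True if the redline is close enough to at least one playbook position."""
    return any(cosine_similarity(redline_vec, p) >= threshold for p in playbook_vecs)


def render_decision(redline_vec, playbook_vecs):
    if is_grounded(redline_vec, playbook_vecs):
        return "display"
    return "Manual review required"  # discard the suggestion instead of showing it


# Toy 3-dimensional vectors standing in for real embeddings.
playbook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(render_decision([0.9, 0.1, 0.0], playbook))
print(render_decision([0.5, 0.5, 0.7], playbook))
```

Embedding similarity measures topical closeness rather than exact wording, so the automated diff against playbook text listed under fallback accuracy remains a useful second check.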
User Experience & Interaction Design
| Decision | Design choice | Rationale |
|---|---|---|
| Loading experience | Streaming clause-by-clause results with progress bar ("Analyzing clause 4 of 17…") | 5–8s wait is long; progressive display builds trust and lets lawyers start reading |
| Result presentation | Side-by-side panel: original clause left, flagged issues + suggested redline right; color-coded risk level (red/yellow/green) | Mirrors lawyers' existing Word redline mental model |
| Confidence display | Explicit confidence label per finding (" |