AI Product Spec

You need to spec an AI-powered feature covering model requirements, prompt architecture, quality bar, cost projections, and guardrails.

Use this when you're speccing a feature that uses LLMs, AI models, or generative capabilities. A traditional PRD misses the decisions that make or break AI features: which model, how to evaluate quality, what happens when the model fails, and how much it costs per interaction. This skill extends /prd-draft with the AI-specific sections teams forget until production.

Related skills: Extends /prd-draft for traditional PRD structure. Uses /ai-eval-design for deeper eval planning. See /ai-prototype-guide for building the prototype from this spec. See /multi-model-strategy for multi-model routing decisions.

Process

Step 1: Gather context

Ask the user:

  1. What AI-powered feature are you building? (Name and one-line description)
  2. What user problem does this solve? (Evidence: user quotes, support tickets, research)
  3. What does the AI do that couldn't be done without it? (The "why AI" test -- if a rules engine or lookup table works, you don't need AI)
  4. Who are the users? (Personas and their expectations for AI accuracy, speed, and tone)
  5. How does this work today? (Current workflow, tools in use, pain points with the status quo)
  6. Known constraints -- budget per interaction, latency requirements, data privacy restrictions, compliance needs, expected volume (daily/monthly interactions)
  7. Prior art -- how competitors or existing tools handle this (include screenshots if available)

If the user provides a brief, extract what you can and ask follow-ups for gaps. Flag anything unclear with [NEEDS INPUT].

Step 2: Define model requirements

Work through these decisions with the user:

| Decision | Options | Notes |
|----------|---------|-------|
| Primary capabilities | Text generation, classification, extraction, summarization, code generation, multimodal, embedding (often multiple) | What the model actually needs to do -- most features combine capabilities |
| Quality bar | Must be correct (medical, legal), should be helpful (productivity), can be approximate (creative) | Determines eval rigor |
| Latency target | Real-time (< 2s), interactive (< 10s), batch (minutes OK) | Affects model choice and architecture |
| Streaming needed? | Yes (progressive results), no (complete response) | Long-running tasks need streaming for UX; affects API choice |
| Context window need | Small (< 4K tokens), medium (< 32K), large (< 128K), very large (> 128K) | Affects model and architecture |
| Cost sensitivity | High (consumer, high volume), medium (B2B, moderate volume), low (enterprise, low volume) | Affects model tier and caching strategy |
| Data sensitivity | Public data only, private but not regulated, regulated (HIPAA, SOC2, GDPR) | Affects deployment and vendor choice |

After completing the table, recommend specific model families based on the answers. Use the knowledge references to map capability + quality bar + cost sensitivity to model tiers. Flag the recommendation as [RECOMMENDED -- verify during development].

If the feature involves voice input or output, also work through:

| Decision | Options | Notes |
|----------|---------|-------|
| Voice direction | Input only (transcription), output only (TTS), bidirectional (conversation) | Determines API choice: Whisper/TTS for async, Realtime API for conversation |
| Latency target | Real-time conversation (<500ms), interactive (<2s), batch (minutes OK) | Real-time voice has tighter budgets than text -- >500ms feels laggy |
| Voice identity | Default synthetic, custom/cloned, brand voice | Custom voices require golden reference samples (30s-5min) and consent |
| Language coverage | English only, top 5 languages, 20+ languages | Accuracy drops for low-resource languages; test each target language |
| Accent handling | Standard accents only, broad dialect coverage | Test with representative accent samples -- this is an equity issue |
| Audio environment | Clean/quiet, noisy/mobile, call center | Background noise tolerance affects model and preprocessing choice |

Voice eval criteria (add to Step 4 quality dimensions):

| Dimension | Definition | Threshold | How to measure |
|-----------|------------|-----------|----------------|
| Word error rate (WER) | Transcription accuracy | (e.g., < 8% on test set) | Automated comparison to human transcripts |
| Naturalness (MOS) | How human does the voice sound | (e.g., > 4.0 on 1-5 scale) | Human listener panel rating |
| Response latency | Time from end of user speech to start of AI speech | (e.g., < 500ms p95) | Instrumented measurement |
| Interruption handling | Can user interrupt mid-response gracefully | (e.g., successful interrupt in < 200ms) | Scenario testing |
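
The "automated comparison to human transcripts" for WER can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, not a production scorer: the function name and the sample transcripts are hypothetical, and normalization is limited to lowercasing and whitespace splitting (real pipelines usually also strip punctuation and normalize numbers).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript pair: one dropped word and one misheard word.
wer = word_error_rate(
    "please review the indemnification clause",
    "please review indemnification claws",
)
print(f"WER: {wer:.0%}")  # prints "WER: 40%"
```

Averaging this over the whole test set (weighted by reference length) gives the number to compare against the threshold in the table.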

Voice guardrails (add to Step 4 guardrails):

  • What happens when transcription confidence is low? (Ask to repeat vs. best-guess)
  • What happens when background noise makes speech unintelligible?
  • How does the system handle overlapping speech?
  • What disclosure is required? (Many jurisdictions require "you are speaking with an AI")

Voice cost projections (add to Step 5 cost table):

  • Per-minute audio processing cost (transcription + synthesis): typically $0.02-0.07/min
  • Telephony costs if phone-based (SIP/PSTN interconnect): $0.01-0.02/min
  • Total AI voice cost: typically $0.04-0.14/min (compare to human agent at $0.50-1.50/min)

Step 3: Design the prompt architecture

Outline the prompting approach -- not the exact prompts, but the strategy:

  1. System prompt approach -- what persona, constraints, and output format does the system prompt establish?
  2. User input handling -- how is user input preprocessed, validated, or augmented before reaching the model?
  3. Context management -- what context is injected? (RAG, conversation history, user profile, external data)
  4. Output parsing -- structured output (JSON, specific format) or free-form? How is output validated?
  5. Few-shot examples -- are examples needed? How many? Static or dynamic?
  6. Multi-step processing -- does the task need chain-of-thought reasoning, multiple passes, or staged processing? (e.g., classify first, then generate; or identify issues, then prioritize, then suggest fixes)
  7. Context compaction -- keep system instructions under 50 lines to avoid context bloat. Long system prompts can degrade output quality because the model struggles to weight all instructions equally. If the feature needs more than 50 lines of instructions, split them into three layers: core constraints (always loaded), domain context (loaded per request type), and reference material (retrieved via RAG, not stuffed into the prompt). Don't dump five files into the system prompt and expect good results.
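
The three-layer split in item 7 can be sketched as a small prompt assembler. Everything here is illustrative: `PromptLayers`, `build_system_prompt`, and the sample instructions are hypothetical names, and the 50-line budget is read as "at most 50 non-empty lines", which is one possible interpretation of the guidance.

```python
from dataclasses import dataclass, field

MAX_CORE_LINES = 50  # line budget taken from the guidance above


@dataclass
class PromptLayers:
    core_constraints: str  # always loaded
    domain_context: dict = field(default_factory=dict)  # request type -> extra instructions


def build_system_prompt(layers, request_type, retrieved_snippets):
    """Assemble core constraints, per-request domain context, and RAG snippets."""
    core_lines = [line for line in layers.core_constraints.splitlines() if line.strip()]
    if len(core_lines) > MAX_CORE_LINES:
        raise ValueError(
            f"core constraints use {len(core_lines)} lines; budget is {MAX_CORE_LINES}"
        )

    parts = [layers.core_constraints.strip()]
    domain = layers.domain_context.get(request_type)
    if domain:  # only the context for this request type is loaded
        parts.append(domain.strip())
    if retrieved_snippets:  # reference material arrives from retrieval, not from files
        bullets = "\n".join(f"- {snippet}" for snippet in retrieved_snippets)
        parts.append("Reference material:\n" + bullets)
    return "\n\n".join(parts)


layers = PromptLayers(
    core_constraints="You are a support assistant.\nAnswer only from the provided material.",
    domain_context={
        "billing": "Quote prices in USD.",
        "shipping": "Give delivery windows in days.",
    },
)
prompt = build_system_prompt(layers, "billing", ["Refunds are issued within 14 days."])
print(prompt)
```

Failing loudly when the core layer exceeds its budget is a design choice for the sketch; a team might prefer a lint check in CI instead.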

Step 3b: Design the interaction layer

The prompt architecture defines what the AI does technically. The interaction layer defines what the user experiences. Both need to be designed.

Work through these decisions:

| Decision | Options | Notes |
|----------|---------|-------|
| Loading experience | Streaming text, progress indicator, skeleton screen, "thinking..." message | What does the user see while the AI works? Streaming builds trust through visibility. |
| Result presentation | Inline text, structured card, expandable sections, side panel, modal | How do results appear? Match the complexity of the output to the display format. |
| Confidence display | Hidden, subtle indicators, explicit confidence labels, source citations | Does the user need to know how sure the AI is? Clinical and financial domains usually need transparency. |
| Regeneration/retry | Thumbs up/down, "try again" button, "refine" with instructions, no retry | Can the user ask for a different answer? How? |
| Correction mechanism | Edit output directly, provide feedback, report errors, none | Can the user fix what the AI got wrong? |
| Human fallback | Escalate button, automatic routing, "talk to a person" link, none | When the AI can't help, what's the path to a human? |

For deeper interaction design -- persona, conversation flow, trust patterns -- see /ai-persona-design, /ai-conversation-design, and /ai-trust-pattern.

Step 4: Define eval criteria and guardrails

Outline what "good" and "bad" look like:

Quality dimensions:

| Dimension | Definition | Threshold | How to measure |
|-----------|------------|-----------|----------------|
| (Accuracy) | (Does it get the facts right?) | (e.g., > 95% on golden dataset) | (Automated check, human review) |
| (Relevance) | (Does it answer what was asked?) | (e.g., > 90% rated relevant) | (LLM-as-judge, user feedback) |
| (Helpfulness) | (Is the output actionable and worth the user's attention?) | (e.g., > 80% acted on by users) | (User action tracking, feedback) |
| (Tone) | (Does it match brand voice?) | (e.g., passes tone rubric) | (Rubric scoring) |
| (Safety) | (Does it avoid harmful output?) | (e.g., 0% harmful in adversarial set) | (Red team testing) |

Setting thresholds for v1: Start with a golden dataset of 50-100 representative examples. Set initial thresholds based on human baseline performance (how accurate is a human doing this task?). It's better to set an honest threshold and hit it than an aspirational one you can't measure.
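
Checking a golden dataset against per-dimension thresholds can be as simple as the sketch below. The function names and the 50-example pass/fail results are made up for illustration; the thresholds reuse the example values from the table, interpreted as strict "greater than".

```python
def pass_rate(results):
    """Fraction of golden examples that passed a single quality dimension."""
    if not results:
        raise ValueError("need at least one graded example")
    return sum(results) / len(results)


def check_thresholds(results_by_dimension, thresholds):
    """Map each dimension to (observed pass rate, whether it clears its threshold)."""
    report = {}
    for dimension, minimum in thresholds.items():
        rate = pass_rate(results_by_dimension[dimension])
        report[dimension] = (rate, rate > minimum)
    return report


# Hypothetical grading results for a 50-example golden dataset.
golden_results = {
    "accuracy": [True] * 48 + [False] * 2,
    "relevance": [True] * 44 + [False] * 6,
}
report = check_thresholds(golden_results, {"accuracy": 0.95, "relevance": 0.90})
for dimension, (rate, ok) in report.items():
    print(f"{dimension}: {rate:.0%} {'PASS' if ok else 'FAIL'}")
```

With only 50 examples each miss moves the rate by two points, which is one reason to state thresholds you can actually measure at this sample size.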

Guardrails and fallback behavior:

  • What happens when the model returns low-confidence output?
  • What happens when the model is unavailable (timeout, rate limit, outage)?
  • What content should be blocked or filtered?
  • What does the fallback UX look like?
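
One way to make these questions concrete is a thin wrapper that turns every failure mode into an explicit status the UI can render. This is a sketch under assumptions: `call_model` is a hypothetical callable returning `(text, confidence)`, `ModelUnavailable` is a made-up exception standing in for rate limits and outages, and the 0.7 confidence floor and two attempts are placeholder values to be set in the spec.

```python
import time


class ModelUnavailable(Exception):
    """Hypothetical error standing in for rate limits and outages."""


def generate_with_fallback(call_model, prompt, min_confidence=0.7,
                           max_attempts=2, backoff_seconds=0.5):
    """Return a dict whose status is 'ok', 'low_confidence', or 'unavailable'."""
    for attempt in range(max_attempts):
        try:
            text, confidence = call_model(prompt)
        except (TimeoutError, ModelUnavailable):
            if attempt < max_attempts - 1:
                time.sleep(backoff_seconds * 2 ** attempt)  # exponential backoff
            continue
        if confidence < min_confidence:
            return {"status": "low_confidence", "text": text}
        return {"status": "ok", "text": text}
    # Every attempt failed: the caller shows the fallback UX.
    return {"status": "unavailable", "text": None}


def flaky_model(prompt):
    """Stub that always times out, to exercise the fallback path."""
    raise TimeoutError("simulated timeout")


result = generate_with_fallback(flaky_model, "Summarize this ticket", backoff_seconds=0)
print(result["status"])  # prints "unavailable"
```

Content filtering is deliberately left out; it would typically run on `text` before the "ok" status is returned.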

For deeper eval planning, use /ai-eval-design to build golden datasets and eval pipelines.

Step 5: Estimate costs

| Component | Estimate | Basis |
|-----------|----------|-------|
| Avg. input tokens per request | (estimate) | (sample prompts) |
| Avg. output tokens per request | (estimate) | (sample outputs) |
| Context retrieval cost | (estimate) | (embedding/search calls per request) |
| Model pricing | ($ per 1K tokens) | (current pricing) |
| Cost per interaction | (calculated) | input + output + retrieval cost |
| Projected daily volume | (estimate) | (user base x usage frequency) |
| Monthly cost projection | (calculated) | cost per interaction x daily volume x 30 |
| Cost per user per month | (calculated) | monthly cost / active users |
| Cost ceiling | (budget limit) | (what's acceptable) |

Worked example: A typical interaction with a 500-word user input (roughly 670 tokens), a 2K-token system prompt, and an 800-token response totals about 3.5K tokens. On a mid-tier model at $0.015/1K tokens, that costs roughly $0.05 per interaction. At 100 interactions/day, that's ~$150/month in model costs alone.
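
The worked example can be reproduced with a few lines of arithmetic. Two simplifications are assumptions of this sketch: a rule-of-thumb ratio of 4/3 tokens per English word, and a single flat price for input and output tokens (most providers price them separately). Retrieval cost is left out, as in the worked example.

```python
TOKENS_PER_WORD = 4 / 3  # rough rule of thumb for English text (assumption)


def cost_per_interaction(user_tokens, system_tokens, output_tokens, price_per_1k):
    """Flat-rate cost: total tokens / 1000 * price per 1K tokens."""
    total_tokens = user_tokens + system_tokens + output_tokens
    return total_tokens / 1000 * price_per_1k


user_tokens = round(500 * TOKENS_PER_WORD)  # 500-word input, about 667 tokens
per_call = cost_per_interaction(user_tokens, 2000, 800, price_per_1k=0.015)
monthly = per_call * 100 * 30  # 100 interactions/day over a 30-day month

print(f"per interaction: ${per_call:.3f}")  # prints "per interaction: $0.052"
print(f"per month: ${monthly:.0f}")         # prints "per month: $156"
```

The unrounded figure lands slightly above the ~$150 quoted in the text because the text rounds the per-interaction cost down to $0.05 first.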

Include notes on cost optimization: caching, shorter prompts, smaller models for simple tasks, batching.

Step 6: Draft the AI product spec

Compile into the spec document:

# AI Product Spec: (Feature name)

## Overview
(What we're building, why AI, and the core user value -- 3-4 sentences.)

## Problem Statement
(The user problem, with evidence. Why AI is the right solution approach.)

## Model Requirements
(Table from Step 2 -- capability, quality bar, latency, context, cost, data sensitivity)

## Prompt Architecture
(Strategy from Step 3 -- system prompt approach, context management, output parsing)

## Quality & Eval Criteria
(Table from Step 4 -- dimensions, thresholds, measurement methods)

## Guardrails & Fallbacks
(What can go wrong and what happens when it does)

## User Experience & Interaction Design
(How users interact with the AI feature -- loading experience, result presentation, confidence display, regeneration/correction mechanisms, human fallback paths. From Step 3b.)

## Integration Architecture
(How this connects to existing systems -- APIs, webhooks, data flow, authentication)

## Cost Projections
(Table from Step 5 -- per-interaction cost, volume, monthly projection, ceiling)

## Goals & Success Metrics
| Goal | Metric | How we measure it |
|------|--------|-------------------|
| (Goal 1) | (Specific metric) | (Tool or method) |
| (Goal 2) | (Specific metric) | (Tool or method) |

## Scope
**In scope:**
- (Capability 1)
- (Capability 2)

**Out of scope:**
- (Excluded item -- and why)

## Open Questions
- (Unresolved decisions)
- (Things that need testing to determine)

## Handoff to Build

When ready to prototype this spec with an AI coding tool:
1. **Gather visual references** -- take screenshots of similar products or UI patterns you want to match
2. **Start with the plan, not the code** -- paste this spec (or a summary) into your AI tool and ask it to create a build plan before writing code
3. **Build in phases** -- break the scope into 3-4 phases; build and review one at a time
4. **Don't over-specify prompts yet** -- get the UX working first, then tune the AI behavior

See `/ai-prototype-guide` for the full prototyping workflow.

Step 7: Stress-test and finalize

Challenge the spec:

  1. What's the worst thing the AI could output? Is the guardrail sufficient?
  2. Is the cost projection realistic at 10x the expected volume?
  3. Can you actually measure the eval criteria with current tooling?
  4. Is the quality bar honest or aspirational?
  5. What would a skeptical engineer ask about the prompt architecture?
  6. What would a skeptical ML engineer ask about model selection?

Revise based on user responses.

Uncertainty Policy

| Topic | Tolerance | Action |
|-------|-----------|--------|
| User problem statement | Low | STOP and ask -- spec is useless without a real problem |
| Quality dimensions and thresholds | Low | STOP and ask -- vague quality bars lead to vague features |
| Model choice | Medium | Recommend + flag [RECOMMENDED] -- can be changed during dev |
| Cost estimates | Medium | Estimate + flag [ESTIMATED] -- refine with actual usage |
| Prompt architecture details | Medium | Propose approach + flag [PROPOSED] -- will iterate |
| Guardrail specifics | Medium | Propose + flag [PROPOSED] -- needs adversarial testing |
| Competitor details | High | Best guess from context |

Default: STOP and ask when a topic is not listed above.

Output location

Present the spec as formatted text in the conversation for the user to copy into their docs tool.

Example Output

Input

  • Feature name and description: "Contract Redline Assistant" — an AI-powered feature inside Ironclad's contract management platform that automatically identifies risky clauses in uploaded contracts and suggests pre-approved fallback language from the company's playbook
  • User problem: Legal teams at mid-market companies (50–500 employees) spend 3–6 hours per contract manually comparing vendor paper against their standard positions; quotes from users: "I redline the same indemnification clause 20 times a month" and "We miss things when we're rushed"
  • Constraints: < 8s latency acceptable (lawyers aren't waiting in real-time), regulated data environment (SOC 2 Type II required, contracts contain PII and confidential terms), target cost ≤ $0.30 per contract review, expected volume 400 contracts/day across customer base
  • Current workflow: Lawyers download PDFs, mark up in Word, email redlines back — no tooling, fully manual, average 22-day contract cycle
  • Personas: In-house counsel (accuracy-critical, low tolerance for hallucinated legal language), legal ops managers (want audit trails and throughput metrics)

Output

AI Product Spec: Contract Redline Assistant

Overview

The Contract Redline Assistant analyzes uploaded vendor contracts against a company's pre-configured legal playbook, flags clauses that deviate from acceptable positions, and suggests pre-approved fallback language for each identified issue. AI is the right approach because the task requires semantic understanding of contractual intent — not just keyword matching — across unlimited clause variations and contract structures. The feature targets in-house legal teams at mid-market companies where manual redlining is the primary bottleneck in the contract cycle.

Problem Statement

In-house counsel at companies without large legal departments spend 3–6 hours per contract on redlining work that is 70–80% repetitive. Fatigue and time pressure cause clauses to be missed. A rules engine cannot handle this task because vendor paper uses unlimited structural and linguistic variation to express the same legal concepts — semantic understanding is required.

Evidence: User interviews, support tickets referencing "missed indemnification caps," average 22-day contract cycle in customer base.

Model Requirements

| Decision | Recommendation | Notes |
|----------|----------------|-------|
| Primary capabilities | Extraction + classification + text generation | Extract clauses → classify risk → generate redline suggestions |
| Quality bar | Must be correct | Legal output: hallucinated fallback language is a liability risk |
| Latency target | Interactive (< 8s) | Lawyers submit and wait; not real-time |
| Streaming needed? | Yes | Multi-clause contracts may take 5–8s; stream results clause-by-clause |
| Context window need | Large (< 128K) | Full contracts can run 15,000–40,000 tokens |
| Cost sensitivity | Medium | B2B SaaS; $0.30/contract ceiling is workable with mid-tier model |
| Data sensitivity | Regulated (SOC 2 Type II) | PII, confidential deal terms; no training on customer data; requires a data processing agreement (DPA) or equivalent |

Model recommendation: GPT-4o or Claude 3.5 Sonnet via enterprise API with data processing agreement. Both support 128K context, structured output, and enterprise data commitments. [RECOMMENDED — verify during development]

For clause extraction on well-structured contracts, a smaller model (GPT-4o mini) can pre-segment the document before the full model processes flagged sections — reducing token costs by ~40%. [PROPOSED]

Prompt Architecture

  1. System prompt approach: Establishes the model as a "contract review assistant operating under [Company]'s legal playbook." Constraints include: never fabricate fallback language not present in the playbook, output structured JSON per clause, flag confidence level per finding, and never provide legal advice beyond playbook positions. Core instructions capped at ~40 lines; playbook content loaded via RAG per contract type (NDA, MSA, SaaS agreement). [PROPOSED]

  2. User input handling: Uploaded PDFs are parsed and segmented into clauses using a pre-processing step (rule-based section detection + small model classification). Only flagged or ambiguous clauses are passed to the primary model — not the full document text. This reduces per-request token consumption significantly.

  3. Context management:

    • Playbook positions retrieved via embedding similarity to each extracted clause (top 3 matches per clause)
    • Contract metadata injected: contract type, counterparty tier, jurisdiction if detected
    • No conversation history needed — stateless per review session
  4. Output parsing: Structured JSON required for each clause finding:

    { clause_type, risk_level, extracted_text, playbook_position, suggested_redline, confidence, explanation }
    

    Output validated against schema before rendering. Missing fields trigger a re-call with stricter formatting instructions (max 1 retry). [PROPOSED]

  5. Few-shot examples: 3–5 static examples per contract type (NDA, MSA) embedded in the domain context layer. Examples cover high-risk clause types: indemnification, limitation of liability, IP ownership, data processing.

  6. Multi-step processing:

    • Pass 1 (small model): Segment and classify contract sections
    • Pass 2 (primary model): Analyze flagged clauses against playbook, generate redlines
    • Pass 3 (validation layer): Schema check + confidence thresholding before rendering
  7. Context compaction: Core system prompt (~40 lines) + dynamic playbook retrieval via RAG. Full playbook never stuffed into context. [PROPOSED]
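
The schema check and single retry described in item 4 might look like the sketch below. It is an illustration only: `call_model` is a hypothetical callable that takes a prompt and returns the raw model text, the stricter-formatting prefix is placeholder wording, and returning `None` after the retry is an assumed convention for routing the clause to manual review.

```python
import json
from typing import Callable, Optional

REQUIRED_FIELDS = {
    "clause_type", "risk_level", "extracted_text", "playbook_position",
    "suggested_redline", "confidence", "explanation",
}
# Placeholder wording for the stricter re-call (assumption).
STRICT_PREFIX = "Return ONLY a JSON object containing every required field.\n\n"


def parse_finding(raw: str) -> Optional[dict]:
    """Return the finding if it is a JSON object with all required fields, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data


def review_clause(call_model: Callable[[str], str], clause_text: str,
                  max_retries: int = 1) -> Optional[dict]:
    """Call the model, validate the output, and re-call at most max_retries times."""
    prompt = clause_text
    for _ in range(max_retries + 1):
        finding = parse_finding(call_model(prompt))
        if finding is not None:
            return finding
        prompt = STRICT_PREFIX + clause_text  # stricter formatting on the retry
    return None  # caller treats this as "manual review required"
```

This only checks field presence; type and enum checks (for example, that `risk_level` is one of the allowed values) would need a fuller schema validator.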

Quality & Eval Criteria

| Dimension | Definition | Threshold | How to measure |
|-----------|------------|-----------|----------------|
| Clause detection recall | % of risky clauses correctly flagged (no misses) | > 92% on golden set | Human-reviewed contract set (100 contracts, 800+ clauses) |
| Fallback accuracy | Suggested language matches or is a valid variant of playbook position | > 97% on playbook-covered clauses | Automated diff against playbook + senior counsel review |
| False positive rate | % of acceptable clauses incorrectly flagged | < 15% | Human review of flagged output |
| Hallucination rate | Suggested language not grounded in playbook | 0% acceptable | Automated grounding check + human audit |
| Explanation clarity | Lawyer can understand why clause was flagged without additional context | > 85% rated clear | User feedback, in-app thumbs rating |
| Safety | No output that constitutes legal advice beyond playbook scope | 0% in adversarial set | Red team testing with edge-case contracts |

Golden dataset: Build from 100 historical contracts already reviewed by counsel, with human-labeled ground truth for flagged clauses and accepted redlines. [NEEDS INPUT — customer data sharing agreements required]
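
Once the golden set carries human labels, recall and false positive rate follow directly from the definitions in the table. The sketch below uses hypothetical clause IDs for a single ten-clause contract; in practice the sets would be pooled across the whole golden dataset.

```python
def detection_metrics(all_clauses, risky, flagged):
    """Recall over risky clauses and false positive rate over acceptable clauses."""
    acceptable = all_clauses - risky
    if not risky or not acceptable:
        raise ValueError("need at least one risky and one acceptable clause")
    recall = len(risky & flagged) / len(risky)
    false_positive_rate = len(flagged & acceptable) / len(acceptable)
    return {"recall": recall, "false_positive_rate": false_positive_rate}


# Hypothetical labels: counsel marked c1-c4 as risky; the model flagged c1-c3 and c7.
all_clauses = {f"c{i}" for i in range(1, 11)}
risky = {"c1", "c2", "c3", "c4"}
flagged = {"c1", "c2", "c3", "c7"}

metrics = detection_metrics(all_clauses, risky, flagged)
meets_bar = metrics["recall"] > 0.92 and metrics["false_positive_rate"] < 0.15
print(metrics, meets_bar)
```

In this toy case one missed clause out of four drops recall to 75%, so the run fails the bar.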

Guardrails & Fallbacks

  • Low confidence output (< 0.70): Clause flagged as "Needs Manual Review" with explanation surfaced to user; suggested redline shown with explicit warning label. Never silently omitted.
  • Clause type not in playbook: Model instructed to return "No playbook position — escalate to counsel" rather than generate novel fallback language. This is a hard constraint enforced via output schema validation.
  • Model timeout / rate limit: Graceful degradation — partial results displayed for completed clauses; banner shown: "Review incomplete — [N] clauses could not be analyzed. Download for manual review." Full manual download always available.
  • Hallucination prevention: Suggested redlines cross-referenced against playbook embeddings post-generation. Cosine similarity below 0.75 triggers a discard + "Manual review required" flag rather than displaying the output. [PROPOSED — threshold to be calibrated]
  • Adversarial / junk input: Non-contract uploads (e.g., images, blank PDFs) caught at preprocessing; user shown inline error before model is called.
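
The confidence and grounding guardrails above can be combined into one triage step. This is a sketch, not the spec's implementation: the function names and status strings are hypothetical, embeddings are plain lists of floats from whatever embedding model is chosen, and the 0.70 and 0.75 cut-offs are the proposed values from the bullets, still to be calibrated.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triage_finding(confidence, redline_embedding, playbook_embeddings,
                   min_confidence=0.70, min_similarity=0.75):
    """Decide whether a suggested redline is displayed, warned about, or discarded."""
    best_match = max(
        (cosine_similarity(redline_embedding, p) for p in playbook_embeddings),
        default=0.0,  # no playbook positions means nothing to ground against
    )
    if best_match < min_similarity:
        return "discard_manual_review"  # not grounded in the playbook
    if confidence < min_confidence:
        return "needs_manual_review"    # shown with an explicit warning label
    return "display"


# Toy 2-D vectors standing in for real embeddings.
print(triage_finding(0.9, [1.0, 0.0], [[1.0, 0.0]]))  # prints "display"
print(triage_finding(0.9, [1.0, 0.0], [[0.0, 1.0]]))  # prints "discard_manual_review"
```

Running the grounding check before the confidence check means an ungrounded suggestion is never shown, even when the model reports high confidence.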

User Experience & Interaction Design

DecisionDesign choiceRationale
Loading experienceStreaming clause-by-clause results with progress bar ("Analyzing clause 4 of 17…")5–8s wait is long; progressive display builds trust and lets lawyers start reading
Result presentationSide-by-side panel: original clause left, flagged issues + suggested redline right; color-coded risk level (red/yellow/green)Mirrors lawyers' existing Word redline mental model
Confidence displayExplicit confidence label per finding ("