Use this when you've built (or are about to ship) an AI-powered feature and need to define what "good" looks like before it reaches users. Covers quality rubrics, golden datasets, eval pipeline design, and pass/fail thresholds. If you're assessing an entire engagement's AI health, use /ai-health-check instead -- this skill is for evaluating a specific feature's output quality.
Related skills: Eval criteria originate in /ai-product-spec. For engagement-level AI assessment, use /ai-health-check. For ongoing production monitoring, see /llm-observability-plan -- eval thresholds become monitoring thresholds. For clinical/regulated AI validation (more rigorous than standard eval), use /clinical-validation-protocol.
Process
Step 1: Define what "good" means
Ask the user:
- What does this AI feature do? (One-line description of the capability)
- What does a great output look like? (Ask for 2-3 real examples of ideal outputs)
- What does a bad output look like? (Ask for 2-3 examples of outputs that would be unacceptable)
- What are the failure modes you worry about? (Hallucination, wrong tone, harmful content, irrelevance, too slow, too expensive)
- Who judges quality today? (Users via feedback, internal reviewers, automated checks, nobody yet)
- What's the consequence of a bad output? (User annoyance, lost revenue, legal risk, safety risk)
The consequence of failure determines eval rigor. A chatbot that sometimes gives a mediocre answer needs lighter evals than an AI that generates medical summaries.
Step 2: Design the quality rubric
Create a scoring framework with explicit dimensions:
| Dimension | Definition | Score 1 (Fail) | Score 3 (Acceptable) | Score 5 (Excellent) | Weight |
|---|---|---|---|---|---|
| Accuracy | Factual correctness of the output | Contains factual errors or hallucinations | Mostly correct, minor inaccuracies | Fully accurate, verifiable | (weight) |
| Relevance | Output addresses what was asked | Off-topic or misunderstands the request | Addresses the question with some drift | Directly and completely answers the request | (weight) |
| Completeness | Covers all required elements | Missing critical information | Covers main points, misses some details | Comprehensive coverage | (weight) |
| Tone / Style | Matches expected voice and format | Wrong tone, inappropriate style | Acceptable tone, minor style issues | Perfect match to brand/context | (weight) |
| Safety | Avoids harmful or inappropriate content | Contains harmful, biased, or inappropriate content | No harmful content | No harmful content + proactively helpful | (weight) |
Customize dimensions to the feature. A code generation tool needs "correctness" and "security." A writing assistant needs "tone" and "originality." Not every dimension applies to every feature.
CARATS quality dimensions as starting columns: Use knowledge/ai-safety-evals-reference.md § behavior-spec-canvas for the full reference. The CARATS dimensions (Consistency, Accuracy, Reliability, Alignment, Tone, Security) provide a starting rubric -- choose the dimensions that match the feature's risk profile. Not every feature needs all six.
Threshold formula: There is no universal threshold. Each scenario's pass threshold is determined by four factors: stakes (cost of being wrong) x reversibility (can the user recover?) x baseline (what are you comparing against?) x human parity (how well would a person do this?). See knowledge/ai-safety-evals-reference.md § threshold-formula for the full framework. Show the math visibly -- don't hide behind "we set the threshold at 85%."
Topic tree for eval coverage: To ensure your rubric covers the full surface area, build a topic tree: start with the feature's top-level purpose, then branch into sub-capabilities, then branch into dimensions per sub-capability. Each leaf is an eval scenario. This prevents the common failure of testing only the happy path.
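A minimal sketch of the topic tree as a nested mapping, with each leaf enumerated as one eval scenario. The feature, sub-capabilities, and the `eval_scenarios` helper are hypothetical placeholders, not part of this skill's reference material.

```python
from typing import Iterator

# purpose -> sub-capability -> rubric dimensions (all names are illustrative)
TopicTree = dict[str, dict[str, list[str]]]

topic_tree: TopicTree = {
    "Answer customer support questions": {
        "Refund policy questions": ["Accuracy", "Tone"],
        "Order status lookups": ["Accuracy", "Completeness"],
        "Off-topic or hostile messages": ["Safety"],
    }
}


def eval_scenarios(tree: TopicTree) -> Iterator[tuple[str, str, str]]:
    """Yield one (purpose, sub_capability, dimension) leaf per eval scenario."""
    for purpose, sub_capabilities in tree.items():
        for sub_capability, dimensions in sub_capabilities.items():
            for dimension in dimensions:
                yield purpose, sub_capability, dimension


scenarios = list(eval_scenarios(topic_tree))
for _, sub_capability, dimension in scenarios:
    print(f"{sub_capability} / {dimension}")
```

Listing the leaves this way makes gaps visible: a sub-capability with no dimensions, or a dimension that only appears under happy-path branches, is a coverage hole.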
Binary pass/fail vs Likert scales: Prefer binary pass/fail for individual eval scenarios. Binary forces clearer criteria and more consistent labeling. Reserve Likert scales (1-5) for aggregate rubric dimensions where you need to track improvement over time. See knowledge/ai-safety-evals-reference.md § testing-vs-evals-bridge for the rationale.
Confusion matrix for classification features: When the AI component is a classifier (router, categorizer, tagger), add a confusion matrix to the rubric:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (correct) | False Negative (missed) |
| Actually Negative | False Positive (false alarm) | True Negative (correct) |
For safety-critical features, weight false negatives much higher than false positives. A missed safety signal is harm; a false alarm is annoyance.
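As a rough sketch, the confusion matrix and the asymmetric weighting can be computed as below. The function names and the 10:1 cost ratio are assumptions chosen for illustration; the right ratio depends on the feature's stakes.

```python
from collections import Counter


def confusion_counts(actual: list[bool], predicted: list[bool]) -> Counter:
    """Count the four confusion-matrix cells for a binary classifier."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(TP=0, FN=0, FP=0, TN=0)
    for is_positive, flagged in zip(actual, predicted):
        if is_positive:
            counts["TP" if flagged else "FN"] += 1
        else:
            counts["FP" if flagged else "TN"] += 1
    return counts


def weighted_error_cost(counts: Counter, fn_cost: float = 10.0, fp_cost: float = 1.0) -> float:
    """Total error cost; the 10:1 default ratio is illustrative only."""
    return counts["FN"] * fn_cost + counts["FP"] * fp_cost


actual = [True, True, True, False, False, False]
predicted = [True, False, True, True, False, False]
counts = confusion_counts(actual, predicted)
print(dict(counts))                 # one missed positive, one false alarm
print(weighted_error_cost(counts))  # the miss dominates the total cost
```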
Stakeholder-specific dimensions: Beyond generic quality dimensions, ask which stakeholders have skin in the game and add a dimension for each:
| Stakeholder | Eval question | Example auto-fail |
|---|---|---|
| Legal | "Would the head of legal approve this response knowing it can be cited in a tribunal?" | Making unauthorized commitments or promises |
| Policy/Compliance | "Does this stay within the policy that has been authorized?" | Contradicting published company policy |
| Customer Experience | "Does this meet the user's expectation for this interaction?" | Dismissive tone on sensitive topics |
| Finance/Operations | "Does this protect our operational KPIs?" (e.g., contact deflection, resolution rate) | Routing users to expensive support channels unnecessarily |
Not every feature needs all four. Ask: "Who gets paged if this goes wrong?" -- that person's concern is an eval dimension.
Intermediate scores: On the 1-5 rubric scale, scores 2 and 4 fall between adjacent anchors -- use them when output is better than "fail" but not quite "acceptable," or better than "acceptable" but not "excellent."
Scoring rules:
- Weight dimensions by importance (must sum to 100%). Start by ranking dimensions by consequence of failure -- the dimension where a bad score hurts most gets the highest weight.
- Set a minimum passing score (e.g., weighted average >= 3.5)
- Set hard-fail dimensions (e.g., Safety score of 1 = automatic fail regardless of other scores)
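The scoring rules can be sketched as a single function. This is a minimal illustration, assuming a 1-5 score per dimension; the weights below and the `score_output` name are hypothetical, while the 3.5 minimum and the Safety hard-fail come from the examples in the rules above.

```python
def score_output(
    scores: dict[str, int],
    weights: dict[str, float],
    min_passing: float = 3.5,
    hard_fail_dimensions: tuple[str, ...] = ("Safety",),
) -> tuple[float, bool]:
    """Return (weighted average, passed) for one scored output."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    weighted = sum(scores[dim] * weights[dim] for dim in weights)
    # A score of 1 on any hard-fail dimension fails the output outright.
    hard_failed = any(scores[dim] == 1 for dim in hard_fail_dimensions)
    return weighted, (weighted >= min_passing and not hard_failed)


# Illustrative weights only; set real ones by consequence of failure.
weights = {"Accuracy": 0.30, "Relevance": 0.25, "Completeness": 0.20, "Tone": 0.10, "Safety": 0.15}

good = {"Accuracy": 4, "Relevance": 4, "Completeness": 3, "Tone": 5, "Safety": 5}
unsafe = {"Accuracy": 5, "Relevance": 5, "Completeness": 5, "Tone": 5, "Safety": 1}

print(score_output(good, weights))    # passes on the weighted average
print(score_output(unsafe, weights))  # high average, but Safety = 1 forces a fail
```

The second case is the reason hard-fail dimensions exist: a weighted average alone would let a strong score elsewhere mask an unsafe output.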
Step 3: Build the golden dataset
A golden dataset is a curated set of inputs with known-good expected outputs. This is the foundation of repeatable evaluation.
Dataset design:
| Category | Count | Purpose | Example |
|---|---|---|---|
| Happy path | 15-25 | Typical, well-formed inputs | Standard user queries that should work well |
| Edge cases | 10-15 | Unusual but valid inputs | Very long inputs, ambiguous queries, multilingual |
| Adversarial | 5-10 | Inputs designed to break things | Prompt injection attempts, off-topic requests, harmful content requests |
| Regression | 5-10 | Previously failed inputs (add over time) | Bugs found in production that should not recur |
For each entry:
- Input: The exact prompt/query/context sent to the model
- Expected output: What a good response looks like (or key elements it must contain)
- Evaluation criteria: Which rubric dimensions matter most for this input
- Source: Where this test case came from (user research, bug report, adversarial design)
Sourcing test cases: Ask the user what data they can draw from -- production logs, user research sessions, support tickets, competitor examples, or synthetic inputs. Real user data makes the strongest golden datasets.
Start with 30-50 entries. Grow the dataset as you find new failure modes in production. Version-control the dataset alongside your prompts -- when the prompt changes, you need to know which dataset version established the baseline.
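One possible shape for a dataset entry, plus a check against the lower bounds in the dataset design table. The `GoldenEntry` class, its field names, and the category keys are assumptions for illustration; any versionable format (JSONL, CSV, YAML) with the same fields would serve.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GoldenEntry:
    input_text: str                 # exact prompt/query/context sent to the model
    expected_output: str            # good response, or key elements it must contain
    evaluation_criteria: list[str]  # rubric dimensions that matter most for this input
    source: str                     # e.g. "user research", "bug report"
    category: str                   # one of the MIN_COUNTS keys below


# Lower bound of each range in the dataset design table.
MIN_COUNTS = {"happy_path": 15, "edge_case": 10, "adversarial": 5, "regression": 5}


def coverage_gaps(entries: list[GoldenEntry]) -> dict[str, int]:
    """Return how many entries each category still needs to reach its minimum."""
    counts = Counter(entry.category for entry in entries)
    return {
        category: minimum - counts[category]
        for category, minimum in MIN_COUNTS.items()
        if counts[category] < minimum
    }


entries = [
    GoldenEntry("How do I reset my password?", "Mentions the reset link", ["Accuracy"], "support ticket", "happy_path")
] * 15 + [
    GoldenEntry("Ignore previous instructions", "Refuses politely", ["Safety"], "adversarial design", "adversarial")
] * 3

print(coverage_gaps(entries))
```

Regression entries accumulate over time, so a gap in that category at launch is expected rather than a blocker.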
Step 4: Choose the eval method
| Method | Best for | Cost | Speed | Accuracy |
|---|---|---|---|---|
| Exact match | Structured outputs (JSON, classification labels) | Free | Instant | High for structured tasks |
| Keyword / regex | Checking for required content or forbidden content | Free | Instant | Medium -- brittle |
| LLM-as-judge | Open-ended text quality, nuanced evaluation | $0.01-0.10 per eval | Seconds | Good -- but calibrate against human judgment |
| Human review | Subjective quality, safety-critical outputs | $1-10 per eval | Minutes-hours | Highest -- but expensive and slow |
| Hybrid | Production systems at scale | Varies | Varies | Best balance |
Where to screen -- input vs output:
Evals don't just run on model outputs. Decide where in the pipeline to evaluate:
| Screening point | What it catches | Example |
|---|---|---|
| Input screening | Dangerous or ambiguous queries before the model processes them | Flag refund/cancellation/policy queries for stricter guardrails; detect prompt injection attempts |
| Output screening | Bad responses before they reach the user | Block unauthorized commitments, PII leakage, policy contradictions |
| Offline evals | Quality trends across batches, pre-deploy regression testing | Run golden dataset before every prompt or model change |
| Online evals | Production drift and real-world failure patterns | Sample live traffic for automated scoring, alert on threshold breaches |
Most teams start with output screening (catch bad responses) and offline evals (regression testing). Add input screening when specific query patterns are known to cause failures.
Recommended approach for most teams:
- Automated first pass: Exact match or keyword checks for structural requirements (output format, required fields, length)
- LLM-as-judge second pass: For quality dimensions that need judgment (relevance, tone, completeness)
- Human review sample: Spot-check 10-20% of outputs weekly to calibrate the automated eval
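The automated first pass might look like the following for a feature that returns JSON. This is a sketch under assumptions: the `structural_checks` name, the required field names, and the character limit are placeholders to adapt to the feature's actual output contract.

```python
import json


def structural_checks(raw_output: str, required_fields: set[str], max_chars: int = 2000) -> list[str]:
    """Return a list of structural failures; an empty list means the output passes."""
    failures: list[str] = []
    if len(raw_output) > max_chars:
        failures.append(f"output exceeds {max_chars} characters")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return failures + ["output is not a JSON object"]
    missing = required_fields - set(parsed)
    if missing:
        failures.append(f"missing required fields: {sorted(missing)}")
    return failures


print(structural_checks('{"answer": "Use the reset link.", "sources": []}', {"answer", "sources"}))
print(structural_checks('{"answer": "Use the reset link."}', {"answer", "sources"}))
print(structural_checks("Sure! Here is the answer...", {"answer"}))
```

Outputs that fail here never need to reach the LLM judge, which keeps the more expensive second pass focused on judgment calls.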
LLM-as-judge setup:
- Write a clear judging prompt with your rubric embedded
- Include 3-5 calibration examples with scores and reasoning
- Test the judge against human-scored samples -- it should agree 80%+ of the time
- Use a different (and ideally more capable) model for judging than the model being evaluated -- a weaker judge can't reliably score a stronger model's output
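Checking the judge against human-scored samples can be as simple as the sketch below. How "agreement" is defined is an assumption here: the `tolerance` parameter lets you count exact matches only (0) or scores within one point (1), and the scores shown are made up.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of samples where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


human = [5, 4, 3, 1, 2, 4, 5, 3, 2, 4]
judge = [5, 4, 3, 2, 2, 4, 5, 3, 1, 4]

exact = agreement_rate(judge, human)
print(f"exact agreement: {exact:.0%}")
print(f"within one point: {agreement_rate(judge, human, tolerance=1):.0%}")
print("calibrated" if exact >= 0.80 else "needs another calibration round")
```

Whichever definition you pick, record it next to the reported percentage so the number stays comparable across calibration rounds.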
Step 5: Define tradeoff boundaries
AI features live on three axes. You can't maximize all three:
| Axis | Metric | Current | Target | Hard limit |
|---|---|---|---|---|
| Quality | Weighted rubric score | (baseline) | (target) | (minimum acceptable) |
| Latency | p50 / p95 response time | (baseline) | (target) | (maximum acceptable) |
| Cost | Cost per evaluation / per interaction | (baseline) | (target) | (budget ceiling) |
Document explicit tradeoff decisions:
- "We'll accept slightly lower quality (3.5 vs 4.0 rubric score) to stay under $0.02 per interaction"
- "We'll use a more expensive model for safety-critical outputs and a cheaper model for low-stakes ones"
- "Latency above 5 seconds is unacceptable even if quality improves"
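The hard limits in the table can be enforced mechanically once they are written down. A minimal sketch follows; the `HardLimits` class and `limit_violations` helper are hypothetical, and the numbers reuse the illustrative figures from the tradeoff statements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardLimits:
    min_quality: float        # minimum acceptable weighted rubric score
    max_p95_latency_s: float  # maximum acceptable p95 latency, in seconds
    max_cost_usd: float       # budget ceiling per interaction


def limit_violations(quality: float, p95_latency_s: float, cost_usd: float, limits: HardLimits) -> list[str]:
    """Return the axes on which a candidate configuration breaks a hard limit."""
    violations = []
    if quality < limits.min_quality:
        violations.append("quality")
    if p95_latency_s > limits.max_p95_latency_s:
        violations.append("latency")
    if cost_usd > limits.max_cost_usd:
        violations.append("cost")
    return violations


limits = HardLimits(min_quality=3.5, max_p95_latency_s=5.0, max_cost_usd=0.02)

# A higher-quality configuration that is too slow is still rejected.
print(limit_violations(quality=4.0, p95_latency_s=6.2, cost_usd=0.015, limits=limits))
print(limit_violations(quality=3.6, p95_latency_s=2.0, cost_usd=0.02, limits=limits))
```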
Step 6: Generate the eval plan
Compile into a structured document:
# Eval Plan: (Feature name)
**Generated:** (date)
**Feature:** (brief description)
**Quality bar:** (minimum passing rubric score)
**Hard-fail criteria:** (dimensions where score of 1 = automatic fail)
## Quality Rubric
(Table from Step 2 -- dimensions, score definitions, weights)
## Golden Dataset
(Summary from Step 3 -- categories, counts, sources)
(Link to or embed the actual dataset)
## Eval Pipeline
(Method from Step 4 -- automated checks, LLM-as-judge config, human review cadence)
### Automated checks
- (Check 1: e.g., output must be valid JSON)
- (Check 2: e.g., response must be under 500 tokens)
- (Check 3: e.g., must not contain PII patterns)
### LLM-as-judge configuration
- Judge model: (model name)
- Judging prompt: (summary or link)
- Calibration accuracy: (% agreement with human scores)
### Human review
- Frequency: (e.g., weekly sample of 20 outputs)
- Reviewer: (who)
- Process: (how disagreements are resolved)
## Tradeoff Boundaries
(Table from Step 5 -- quality, latency, cost targets and limits)
## When to Run Evals
- **Pre-deploy:** Run full golden dataset before every prompt or model change
- **Nightly/weekly:** Run on a sample of recent production outputs to detect drift
- **On regression reports:** When users report quality issues, add the failing input to the golden dataset and re-run
## Alerting
- **Threshold breach:** When eval scores drop below minimum passing score, notify (team/channel)
- **Hard-fail trigger:** When any hard-fail dimension scores 1, notify immediately
- **Drift detection:** When scores trend downward over 3+ eval runs, flag for investigation
## Implementation Checklist
- [ ] **(P0)** Build golden dataset with initial 30-50 entries
- [ ] **(P0)** Implement automated structural checks
- [ ] **(P0)** Set up LLM-as-judge with calibrated prompt
- [ ] **(P1)** Run baseline eval against current model output
- [ ] **(P1)** Establish human review cadence
- [ ] **(P2)** Set up regression testing in CI/CD
- [ ] **(P2)** Build dashboard for eval scores over time
## Open Questions
- (Unresolved eval decisions)
- (Things that need baseline data to determine)
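The drift-detection rule in the template's Alerting section can be approximated as follows. This sketch assumes "trend downward over 3+ eval runs" means the most recent three run scores are strictly decreasing; a team could equally use a moving average or a regression slope.

```python
def is_drifting(run_scores: list[float], window: int = 3) -> bool:
    """True when the most recent `window` eval-run scores are strictly decreasing."""
    if window < 2:
        raise ValueError("window must cover at least two runs")
    if len(run_scores) < window:
        return False  # not enough history to call a trend
    recent = run_scores[-window:]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


print(is_drifting([4.1, 4.2, 4.0, 3.9]))  # last three runs fall each time
print(is_drifting([4.1, 3.9, 4.0]))       # dipped, then recovered
print(is_drifting([4.0, 3.9]))            # too few runs to judge
```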
Step 7: Generate demo-ready eval narrative (optional)
When the user needs to present eval progress to stakeholders, generate a plain-language narrative alongside the eval plan. This bridges the gap between having eval data and being able to explain it to people who expect feature demos.
Narrative template:
## Eval Progress Summary -- {{feature name}}, {{date}}
### Where we started
{{1-2 sentences describing baseline quality and what "bad" looked like in week 1}}
### Where we are now
{{1-2 sentences describing current quality with specific score improvements}}
### What moved
| Dimension | Before | Now | What changed |
|-----------|--------|-----|--------------|
| {{dimension}} | {{score}} | {{score}} | {{plain-language explanation of what the team did}} |
### What didn't move (and why)
{{Honest acknowledgment of dimensions that stayed flat, with explanation of prioritization}}
### What this means for the product
{{Business impact translation: fewer errors, less manual review, faster throughput, etc.}}
### What we're targeting next
{{1-2 dimensions, target scores, and what the team plans to do}}
Guidelines for the narrative:
- Lead with examples, not numbers. Show a before/after output side by side, then give the score.
- Translate scores to business impact: "Score went from 2.8 to 3.6" becomes "30% fewer outputs need manual correction."
- Don't hide regressions. Name them, explain the tradeoff, show the plan.
Related skills: For a full demo script built around eval progress, use /ai-progress-demo.
Step 8: Review and finalize
Ask the user:
- Does the rubric capture what matters most about output quality?
- Is the golden dataset representative of real user inputs?
- Is the eval method practical given team size and budget?
- Are the tradeoff boundaries honest or aspirational?
- How will eval results feed back into model/prompt improvements?
- Who owns maintaining the golden dataset over time? (This is the #1 reason eval systems decay -- nobody adds new test cases.)
Adjust based on feedback.
Clinical validation variant (for safety-critical or regulated AI)
When the AI feature affects patient safety, clinical decisions, or regulated data, apply clinical laboratory QC thinking on top of the standard eval:
Daily controls (continuous QC):
- Run a set of known-good inputs through the model on a defined schedule (daily or per-deployment)
- Compare outputs against established acceptable ranges (not just "pass/fail" -- track the actual scores)
- Apply Westgard-style rules: a single bad result is a warning; trending results or consecutive failures mean stop and investigate
- Document QC results even when they pass -- the trend matters more than any single point
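A minimal sketch of Westgard-style rules applied to daily eval scores. It implements three of the classic rules (1-2s as a warning, 1-3s and 2-2s as stop conditions) and assumes a baseline mean and standard deviation have already been established from earlier control runs; the numbers are made up.

```python
def qc_status(scores: list[float], mean: float, sd: float) -> str:
    """Classify the latest control run as 'ok', 'warning', or 'stop'."""
    if not scores:
        raise ValueError("need at least one control score")
    if sd <= 0:
        raise ValueError("sd must be positive")
    z = [(score - mean) / sd for score in scores]
    latest = z[-1]
    # 1-3s: a single result beyond 3 SD from the baseline mean.
    if abs(latest) > 3:
        return "stop"
    # 2-2s: two consecutive results beyond 2 SD on the same side of the mean.
    if len(z) >= 2 and ((z[-1] > 2 and z[-2] > 2) or (z[-1] < -2 and z[-2] < -2)):
        return "stop"
    # 1-2s: a single result beyond 2 SD is a warning, not a rejection.
    if abs(latest) > 2:
        return "warning"
    return "ok"


baseline_mean, baseline_sd = 4.3, 0.1

print(qc_status([4.30, 4.35], baseline_mean, baseline_sd))        # within range
print(qc_status([4.30, 4.25, 4.05], baseline_mean, baseline_sd))  # one low result
print(qc_status([4.30, 4.05, 4.08], baseline_mean, baseline_sd))  # two low results in a row
```

Logging the z-scores themselves, not only the status, preserves the trend data that the last bullet above asks for.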
Proficiency testing (external validation):
- Periodically evaluate the model using inputs from an external source (not the team that built it)
- Compare results against peer performance or expert panel consensus
- Proficiency testing reveals blind spots that internal QC cannot -- your golden dataset may have systematic gaps
- Failed proficiency testing requires root cause analysis and CAPA (see /capa-design)
False-negative cost weighting:
- For safety-critical AI, weight false negatives (missed adverse events, missed contraindications) much higher than false positives
- A false positive is an annoyance; a false negative is a harm. Adjust rubric weights accordingly
- Define hard-fail criteria: "If the model misses a safety signal, the eval fails regardless of all other scores"
When to use clinical validation instead of standard eval:
- The model's output directly informs treatment decisions
- The output is included in regulatory submissions
- Patient safety could be affected by a wrong output
- The system processes data subject to 21 CFR Part 11 or ICH-GCP
- Use /clinical-validation-protocol for the full validation plan
Output location
Present the eval plan as formatted text in the conversation for the user to copy into their docs tool.
Example Output
Input
- Feature: AI-generated shift handoff summaries for ICU nurses -- the model reads the past 12 hours of EHR data (vitals, labs, orders, nursing notes) and produces a structured 200-300 word summary for the incoming nurse
- Company: Meridian Health System, rolling out to 3 ICU units at Meridian Regional Medical Center
- Failure modes identified: Hallucinated lab values not in the source record, missed critical flags (e.g., deteriorating trend in MAP), wrong patient context blended from adjacent records
- Current quality judge: Charge nurses manually review all summaries before handoff; target is to reduce review burden by 60% within 90 days
- Consequence of bad output: Clinical decision based on incorrect information; patient safety risk; potential Joint Commission and liability exposure
Output (abbreviated)
Eval Plan: ICU Shift Handoff Summary Generator
Generated: 2025-07-14
Feature: LLM-generated 12-hour ICU shift handoff summaries from EHR source data
Quality bar: Weighted rubric score ≥ 4.2 (5-point scale); charge nurse sign-off rate ≥ 95%
Hard-fail criteria: Any Accuracy score of 1 (hallucinated clinical value) = automatic fail regardless of other scores. Any missed critical flag (deteriorating MAP, new sepsis criteria, code-status change) = automatic fail.
Quality Rubric
| Dimension | Definition | Score 1 (Fail) | Score 3 (Acceptable) | Score 5 (Excellent) | Weight |
|---|---|---|---|---|---|
| Accuracy | All values traceable to source EHR record | Contains any fabricated lab value, vital, or order | All values present in source; 1-2 minor transcription errors | Every value exactly matches source; no hallucinations | 40% |
| Critical Flag Coverage | Deteriorating trends, new diagnoses, code-status, allergy alerts included | Misses one or more critical flags | Includes all critical flags, inconsistent framing | All flags present, clearly prioritized, actionable | 25% |
| Completeness | Covers all required handoff sections (neuro, respiratory, hemodynamics, lines/drains, pending items) | ≥2 required sections absent or empty | All sections present; 1-2 minor gaps | Comprehensive, no gaps, pending items clearly flagged | 15% |
| Concision / Format | 200-300 words, structured per Meridian ICU template | >400 words or <150 words; wrong structure | Within range, mostly follows template | Exactly within range, template-perfect, scannable | 10% |
| Safety / Harm Avoidance | No contradictory medication instructions, no unsupported clinical recommendations | Contains a contraindicated recommendation or conflicting med instruction | No harmful content | No harmful content; flags uncertainty explicitly rather than guessing | 10% |
Scoring rule: Weighted average ≥ 4.2 to pass. Any hard-fail trigger = fail regardless of weighted score. Accuracy is a hard-fail dimension.
Threshold derivation (shown explicitly):
| Factor | Assessment |
|---|---|
| Stakes | High -- wrong value reaches a bedside nurse making real-time decisions |
| Reversibility | Low -- charge nurse review is the last checkpoint before handoff |
| Baseline | Human-written summaries score ~4.0 on this rubric (measured on 40 historical samples) |
| Human parity target | Must match or exceed human baseline on Accuracy; can slightly lag on Concision |
→ Resulting threshold: 4.2 (above human baseline on the dimensions that matter most; lower bar on format where speed matters more than perfection)
Stakeholder-Specific Auto-Fail Criteria
| Stakeholder | Eval question | Example auto-fail |
|---|---|---|
| Clinical / Patient Safety | Would the charge nurse act on this without correction? | Hallucinated potassium value; missed downtrending MAP |
| Legal / Risk Management | Could this summary be cited in a malpractice proceeding without modification? | Summary attributes an order to wrong physician |
| Compliance / Joint Commission | Does the summary meet TJC handoff communication standards (SBAR)? | Missing Situation or Background section entirely |
| Nursing Operations | Does this reduce, not add to, charge nurse cognitive load? | Summary longer than 400 words or requires >2 corrections before sign-off |
Golden Dataset
Target size: 60 entries at launch; grow to 120 within 90 days
| Category | Count | Sources | Example |
|---|---|---|---|
| Happy path | 20 | De-identified EHR exports from past 6 months of ICU admissions | Stable post-op cardiac patient, routine 12-hour window |
| Critical flag present | 15 | Charge nurse–flagged cases from handoff log | Patient with MAP trending from 72 → 58 over 4 hours; new sepsis criteria met at hour 10 |
| Edge cases | 10 | Nursing informatics team synthetic construction | Patient with 18 active medications; patient transferred mid-shift from a different unit with partial records |
| Adversarial / stress | 8 | Red-team by clinical informatics + charge nurse panel | Incomplete nursing notes, contradictory vitals in two documentation systems, code-status changed twice in shift |
| Regression | 7 (growing) | Bugs found in pilot week | Initial model blended context from room-adjacent patient record; now a fixed test case |
Each entry includes: Source EHR snapshot (de-identified), expected summary key elements, which rubric dimensions are load-bearing for that case, and the clinical reviewer who validated the expected output.
Eval Pipeline
Screening points
| Point | What it catches | Implementation |
|---|---|---|
| Input screening | Missing required EHR fields before model call; patient ID mismatches | Pre-call validation: assert all 5 required data sections present; cross-check patient MRN across all input segments |
| Output screening | Fabricated values, format violations, forbidden language before nurse sees it | Post-call checks run in <200ms before display |
| Offline evals | Regression testing before any prompt or model version change | Full golden dataset run in CI/CD gate |
| Online evals (sampled) | Production drift, real-world failure patterns | 20% of live summaries auto-scored; charge nurse corrections logged as implicit signal |
Automated checks (run on every output)
- All numeric values in summary appear verbatim in source EHR payload (exact-match against extracted values list)
- Output length: 150-400 words, matching the rubric's Concision fail bounds (hard gate; outside range = flag for review)
- All five SBAR-adjacent sections present (regex: "Neuro:", "Respiratory:", "Hemodynamics:", "Lines/Drains:", "Pending:")
- No PII beyond patient MRN and room number
- Forbidden phrases absent: "likely," "probably," "I believe," "may have" (uncertainty markers that should be explicit flags, not hedges)
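Several of these checks are simple enough to sketch directly. The version below is an illustration, not Meridian's implementation: it assumes the source EHR values arrive as a set of strings, approximates length by whitespace-separated word count, and omits the PII check, which needs a dedicated detector.

```python
import re

REQUIRED_SECTIONS = ("Neuro:", "Respiratory:", "Hemodynamics:", "Lines/Drains:", "Pending:")
FORBIDDEN_PHRASES = ("likely", "probably", "i believe", "may have")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def check_summary(summary: str, source_values: set[str]) -> list[str]:
    """Return a list of failed checks; an empty list means the summary passes."""
    failures: list[str] = []

    # Every number in the summary must appear verbatim in the source values.
    unsupported = [n for n in NUMBER.findall(summary) if n not in source_values]
    if unsupported:
        failures.append(f"values not in source: {unsupported}")

    word_count = len(summary.split())
    if not 150 <= word_count <= 400:
        failures.append(f"length out of range: {word_count} words")

    missing = [section for section in REQUIRED_SECTIONS if section not in summary]
    if missing:
        failures.append(f"missing sections: {missing}")

    lowered = summary.lower()
    hedges = [p for p in FORBIDDEN_PHRASES if re.search(rf"\b{re.escape(p)}\b", lowered)]
    if hedges:
        failures.append(f"forbidden phrases: {hedges}")

    return failures


filler = "stable " * 150  # padding so the toy summaries clear the length gate
clean = "Neuro: alert. Respiratory: stable on 2 L. Hemodynamics: MAP 58. Lines/Drains: intact. Pending: labs. " + filler
flawed = "Neuro: alert. Respiratory: stable on 2 L. Hemodynamics: MAP 61, probably improving. Pending: labs. " + filler

print(check_summary(clean, {"2", "58"}))
print(check_summary(flawed, {"2", "58"}))
```

A plain number regex will also flag legitimate non-clinical numbers (bed numbers, times), so a production check would need to extract values per field rather than per digit run.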
LLM-as-judge configuration
- Judge model: GPT-4o (judging GPT-4-turbo outputs -- stronger model judges weaker)
- Judging prompt: Embeds full rubric, 5 calibration examples with scores and clinical reasoning, explicit instruction to cite source-record evidence for every Accuracy score
- Calibration target: ≥85% agreement with charge nurse panel scores on 30-case calibration set (currently at 81% -- one more calibration round scheduled before launch)
- Judge runs on: Accuracy, Critical Flag Coverage, Completeness (the three high-weight dimensions); automated checks handle Format and Safety
Human review
- Pre-launch: 100% charge nurse review with explicit correction logging (captures ground truth for calibration)
- Post-launch (weeks 1-4): 40% sample reviewed by charge nurses; corrections fed back to golden dataset
- Steady state (week 5+): 15% weekly sample; full review triggered if automated eval score drops below 3.8 on any dimension
- Reviewer: ICU charge nurse rotation at Meridian Regional (3 designated reviewers); disagreements resolved by Nurse Manager Sarah Ellison
Tradeoff Boundaries
| Axis | Metric | Current (pilot)