AI Eval Design

You need evaluation criteria and test harnesses for an AI-powered feature – quality rubrics, golden datasets, eval pipelines, and pass/fail thresholds.

Use this when you've built (or are about to ship) an AI-powered feature and need to define what "good" looks like before it reaches users. This skill evaluates a specific feature's output quality; to assess an entire engagement's AI health, use /ai-health-check instead.

Related skills: Eval criteria originate in /ai-product-spec. For engagement-level AI assessment, use /ai-health-check. For ongoing production monitoring, see /llm-observability-plan -- eval thresholds become monitoring thresholds. For clinical/regulated AI validation (more rigorous than standard eval), use /clinical-validation-protocol.

Process

Step 1: Define what "good" means

Ask the user:

  1. What does this AI feature do? (One-line description of the capability)
  2. What does a great output look like? (Ask for 2-3 real examples of ideal outputs)
  3. What does a bad output look like? (Ask for 2-3 examples of outputs that would be unacceptable)
  4. What are the failure modes you worry about? (Hallucination, wrong tone, harmful content, irrelevance, too slow, too expensive)
  5. Who judges quality today? (Users via feedback, internal reviewers, automated checks, nobody yet)
  6. What's the consequence of a bad output? (User annoyance, lost revenue, legal risk, safety risk)

The consequence of failure determines eval rigor. A chatbot that sometimes gives a mediocre answer needs lighter evals than an AI that generates medical summaries.

Step 2: Design the quality rubric

Create a scoring framework with explicit dimensions:

| Dimension | Definition | Score 1 (Fail) | Score 3 (Acceptable) | Score 5 (Excellent) | Weight |
|-----------|------------|----------------|----------------------|---------------------|--------|
| Accuracy | Factual correctness of the output | Contains factual errors or hallucinations | Mostly correct, minor inaccuracies | Fully accurate, verifiable | (weight) |
| Relevance | Output addresses what was asked | Off-topic or misunderstands the request | Addresses the question with some drift | Directly and completely answers the request | (weight) |
| Completeness | Covers all required elements | Missing critical information | Covers main points, misses some details | Comprehensive coverage | (weight) |
| Tone / Style | Matches expected voice and format | Wrong tone, inappropriate style | Acceptable tone, minor style issues | Perfect match to brand/context | (weight) |
| Safety | Avoids harmful or inappropriate content | Contains harmful, biased, or inappropriate content | No harmful content | No harmful content + proactively helpful | (weight) |

Customize dimensions to the feature. A code generation tool needs "correctness" and "security." A writing assistant needs "tone" and "originality." Not every dimension applies to every feature.

CARATS quality dimensions as starting columns: Use knowledge/ai-safety-evals-reference.md § behavior-spec-canvas for the full reference. The CARATS dimensions (Consistency, Accuracy, Reliability, Alignment, Tone, Security) provide a starting rubric -- choose the dimensions that match the feature's risk profile. Not every feature needs all six.

Threshold formula: There is no universal threshold. Each scenario's pass threshold is determined by four factors: stakes (cost of being wrong) x reversibility (can the user recover?) x baseline (what are you comparing against?) x human parity (how well would a person do this?). See knowledge/ai-safety-evals-reference.md § threshold-formula for the full framework. Show the math visibly -- don't hide behind "we set the threshold at 85%."

Topic tree for eval coverage: To ensure your rubric covers the full surface area, build a topic tree: start with the feature's top-level purpose, then branch into sub-capabilities, then branch into dimensions per sub-capability. Each leaf is an eval scenario. This prevents the common failure of testing only the happy path.
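As a rough illustration of the topic tree, the sketch below stores the tree as a nested dictionary and flattens it into one eval scenario per leaf. The feature, sub-capabilities, and dimensions are hypothetical placeholders, and `leaf_scenarios` is an illustrative helper name, not part of any referenced framework.

```python
from typing import Iterator

# Hypothetical tree: top-level purpose -> sub-capabilities -> dimensions.
TOPIC_TREE = {
    "Answer billing questions": {
        "Explain a charge": ["Accuracy", "Tone"],
        "Process a refund request": ["Accuracy", "Safety", "Completeness"],
    }
}


def leaf_scenarios(tree: dict) -> Iterator[tuple[str, str, str]]:
    """Yield one (purpose, sub_capability, dimension) tuple per leaf."""
    for purpose, sub_capabilities in tree.items():
        for sub_capability, dimensions in sub_capabilities.items():
            for dimension in dimensions:
                yield (purpose, sub_capability, dimension)


scenarios = list(leaf_scenarios(TOPIC_TREE))
for scenario in scenarios:
    print(" > ".join(scenario))
```

Counting leaves per sub-capability is a quick way to spot branches that have only happy-path coverage.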

Binary pass/fail vs Likert scales: Prefer binary pass/fail for individual eval scenarios. Binary forces clearer criteria and more consistent labeling. Reserve Likert scales (1-5) for aggregate rubric dimensions where you need to track improvement over time. See knowledge/ai-safety-evals-reference.md § testing-vs-evals-bridge for the rationale.

Confusion matrix for classification features: When the AI component is a classifier (router, categorizer, tagger), add a confusion matrix to the rubric:

| | Predicted Positive | Predicted Negative |
|---|--------------------|--------------------|
| Actually Positive | True Positive (correct) | False Negative (missed) |
| Actually Negative | False Positive (false alarm) | True Negative (correct) |

For safety-critical features, weight false negatives much higher than false positives. A missed safety signal is harm; a false alarm is annoyance.
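A minimal sketch of the confusion matrix with asymmetric error costs, assuming a binary classifier and boolean labels. The 10:1 cost ratio and the function names are illustrative assumptions; the right ratio depends on the feature's stakes.

```python
from collections import Counter


def confusion_counts(actual: list[bool], predicted: list[bool]) -> dict[str, int]:
    """Count true/false positives and negatives for a binary classifier."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter()
    for is_positive, flagged in zip(actual, predicted):
        if is_positive and flagged:
            counts["tp"] += 1
        elif is_positive and not flagged:
            counts["fn"] += 1  # missed signal
        elif not is_positive and flagged:
            counts["fp"] += 1  # false alarm
        else:
            counts["tn"] += 1
    return {key: counts[key] for key in ("tp", "fn", "fp", "tn")}


def weighted_error_cost(counts: dict[str, int],
                        fn_cost: float = 10.0, fp_cost: float = 1.0) -> float:
    """Total error cost; the default 10:1 ratio is an assumed example."""
    return counts["fn"] * fn_cost + counts["fp"] * fp_cost


actual = [True, True, True, False, False, False]
predicted = [True, False, True, True, False, False]
counts = confusion_counts(actual, predicted)
cost = weighted_error_cost(counts)
```

With one missed signal and one false alarm, the missed signal dominates the total cost, which is the behavior a safety-critical rubric wants.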

Stakeholder-specific dimensions: Beyond generic quality dimensions, ask which stakeholders have skin in the game and add a dimension for each:

| Stakeholder | Eval question | Example auto-fail |
|-------------|---------------|-------------------|
| Legal | "Would the head of legal approve this response knowing it can be cited in a tribunal?" | Making unauthorized commitments or promises |
| Policy/Compliance | "Does this stay within the policy that has been authorized?" | Contradicting published company policy |
| Customer Experience | "Does this meet the user's expectation for this interaction?" | Dismissive tone on sensitive topics |
| Finance/Operations | "Does this protect our operational KPIs?" (e.g., contact deflection, resolution rate) | Routing users to expensive support channels unnecessarily |

Not every feature needs all four. Ask: "Who gets paged if this goes wrong?" -- that person's concern is an eval dimension.

In the rubric, scores 2 and 4 fall between adjacent anchors: use 2 when output is better than "fail" but not quite "acceptable," and 4 when it is better than "acceptable" but not "excellent."

Scoring rules:

  • Weight dimensions by importance (must sum to 100%). Start by ranking dimensions by consequence of failure -- the dimension where a bad score hurts most gets the highest weight.
  • Set a minimum passing score (e.g., weighted average >= 3.5)
  • Set hard-fail dimensions (e.g., Safety score of 1 = automatic fail regardless of other scores)
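The scoring rules above can be sketched as a single function. This is an illustrative implementation, not a prescribed one: the dimension names, weights, and the `score_output` helper are assumptions, and weights are expressed as fractions that sum to 1.0 rather than percentages.

```python
def score_output(scores: dict[str, int], weights: dict[str, float],
                 hard_fail: set[str], min_pass: float = 3.5) -> tuple[float, bool]:
    """Return (weighted average, passed) for one output scored 1-5 per dimension."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0 (100%)")
    weighted = sum(scores[name] * weights[name] for name in scores)
    # A score of 1 on any hard-fail dimension fails the output outright.
    hard_failed = any(scores[name] == 1 for name in hard_fail)
    return weighted, (not hard_failed) and weighted >= min_pass


# Hypothetical weights for a four-dimension rubric.
weights = {"Accuracy": 0.4, "Relevance": 0.3, "Tone": 0.1, "Safety": 0.2}
ok = score_output({"Accuracy": 4, "Relevance": 4, "Tone": 3, "Safety": 5},
                  weights, hard_fail={"Safety"})
blocked = score_output({"Accuracy": 5, "Relevance": 5, "Tone": 5, "Safety": 1},
                       weights, hard_fail={"Safety"})
```

The second call shows why the hard-fail rule matters: its weighted average is above the passing score, yet the output still fails because Safety scored 1.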

Step 3: Build the golden dataset

A golden dataset is a curated set of inputs with known-good expected outputs. This is the foundation of repeatable evaluation.

Dataset design:

| Category | Count | Purpose | Example |
|----------|-------|---------|---------|
| Happy path | 15-25 | Typical, well-formed inputs | Standard user queries that should work well |
| Edge cases | 10-15 | Unusual but valid inputs | Very long inputs, ambiguous queries, multilingual |
| Adversarial | 5-10 | Inputs designed to break things | Prompt injection attempts, off-topic requests, harmful content requests |
| Regression | 5-10 | Previously failed inputs (add over time) | Bugs found in production that should not recur |

For each entry:

  • Input: The exact prompt/query/context sent to the model
  • Expected output: What a good response looks like (or key elements it must contain)
  • Evaluation criteria: Which rubric dimensions matter most for this input
  • Source: Where this test case came from (user research, bug report, adversarial design)

Sourcing test cases: Ask the user what data they can draw from -- production logs, user research sessions, support tickets, competitor examples, or synthetic inputs. Real user data makes the strongest golden datasets.

Start with 30-50 entries. Grow the dataset as you find new failure modes in production. Version-control the dataset alongside your prompts -- when the prompt changes, you need to know which dataset version established the baseline.
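One way to keep the dataset version-controllable is one JSON object per line (JSONL), which diffs cleanly. The sketch below is an assumption about storage, not a required format; the `GoldenEntry` fields mirror the per-entry list above, and `entry_id` plus the sample entry are hypothetical additions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenEntry:
    entry_id: str                 # hypothetical identifier for tracking
    category: str                 # e.g. "happy_path", "edge_case", "adversarial", "regression"
    input: str                    # exact prompt/query/context sent to the model
    expected_elements: list[str]  # key elements a good response must contain
    dimensions: list[str]         # rubric dimensions that matter most for this input
    source: str                   # where the test case came from


def to_jsonl(entries: list[GoldenEntry]) -> str:
    """Serialize entries as one JSON object per line."""
    return "\n".join(json.dumps(asdict(entry), sort_keys=True) for entry in entries)


def from_jsonl(text: str) -> list[GoldenEntry]:
    """Parse JSONL text back into entries, skipping blank lines."""
    return [GoldenEntry(**json.loads(line)) for line in text.splitlines() if line.strip()]


entries = [
    GoldenEntry(
        entry_id="reg-001",
        category="regression",
        input="Can I get a refund after 45 days?",
        expected_elements=["30-day refund window"],
        dimensions=["Accuracy", "Safety"],
        source="bug report",
    )
]
round_tripped = from_jsonl(to_jsonl(entries))
```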

Step 4: Choose the eval method

| Method | Best for | Cost | Speed | Accuracy |
|--------|----------|------|-------|----------|
| Exact match | Structured outputs (JSON, classification labels) | Free | Instant | High for structured tasks |
| Keyword / regex | Checking for required content or forbidden content | Free | Instant | Medium -- brittle |
| LLM-as-judge | Open-ended text quality, nuanced evaluation | $0.01-0.10 per eval | Seconds | Good -- but calibrate against human judgment |
| Human review | Subjective quality, safety-critical outputs | $1-10 per eval | Minutes-hours | Highest -- but expensive and slow |
| Hybrid | Production systems at scale | Varies | Varies | Best balance |

Where to screen -- input vs output:

Evals don't just run on model outputs. Decide where in the pipeline to evaluate:

| Screening point | What it catches | Example |
|-----------------|-----------------|---------|
| Input screening | Dangerous or ambiguous queries before the model processes them | Flag refund/cancellation/policy queries for stricter guardrails; detect prompt injection attempts |
| Output screening | Bad responses before they reach the user | Block unauthorized commitments, PII leakage, policy contradictions |
| Offline evals | Quality trends across batches, pre-deploy regression testing | Run golden dataset before every prompt or model change |
| Online evals | Production drift and real-world failure patterns | Sample live traffic for automated scoring, alert on threshold breaches |

Most teams start with output screening (catch bad responses) and offline evals (regression testing). Add input screening when specific query patterns are known to cause failures.

Recommended approach for most teams:

  1. Automated first pass: Exact match or keyword checks for structural requirements (output format, required fields, length)
  2. LLM-as-judge second pass: For quality dimensions that need judgment (relevance, tone, completeness)
  3. Human review sample: Spot-check 10-20% of outputs weekly to calibrate the automated eval
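The automated first pass can be as small as the sketch below, which assumes the feature returns a JSON object. The required field names and the character limit are hypothetical; substitute the structural requirements of the actual feature.

```python
import json


def structural_checks(raw_output: str, required_fields: list[str],
                      max_chars: int = 2000) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passed."""
    failures = []
    if len(raw_output) > max_chars:
        failures.append(f"too long: {len(raw_output)} > {max_chars} chars")
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["not valid JSON"]
    if not isinstance(parsed, dict):
        return failures + ["top-level JSON value is not an object"]
    for name in required_fields:
        if name not in parsed:
            failures.append(f"missing field: {name}")
    return failures


good = structural_checks('{"answer": "hi", "sources": []}', ["answer", "sources"])
partial = structural_checks('{"answer": "hi"}', ["answer", "sources"])
broken = structural_checks("not json", ["answer"])
```

Outputs that fail here never need to reach the LLM judge, which keeps the second pass cheaper.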

LLM-as-judge setup:

  • Write a clear judging prompt with your rubric embedded
  • Include 3-5 calibration examples with scores and reasoning
  • Test the judge against human-scored samples -- it should agree 80%+ of the time
  • Use a different (and ideally more capable) model for judging than the model being evaluated -- a weaker judge can't reliably score a stronger model's output
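Checking the judge against human-scored samples reduces to an agreement rate. The sketch below assumes "agree" means an exact score match by default; a team may reasonably count scores within one point as agreement, which the hypothetical `tolerance` parameter allows. The sample scores are made up.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int],
                   tolerance: int = 0) -> float:
    """Fraction of samples where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(abs(judge - human) <= tolerance
                  for judge, human in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


judge = [5, 4, 3, 1, 4, 2, 5, 3, 4, 4]
human = [5, 4, 3, 1, 3, 2, 5, 3, 4, 5]
rate = agreement_rate(judge, human)
calibrated = rate >= 0.80  # the 80% bar from the setup list above
```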

Step 5: Define tradeoff boundaries

AI features live on three axes. You can't maximize all three:

| Axis | Metric | Current | Target | Hard limit |
|------|--------|---------|--------|------------|
| Quality | Weighted rubric score | (baseline) | (target) | (minimum acceptable) |
| Latency | p50 / p95 response time | (baseline) | (target) | (maximum acceptable) |
| Cost | Cost per evaluation / per interaction | (baseline) | (target) | (budget ceiling) |

Document explicit tradeoff decisions:

  • "We'll accept slightly lower quality (3.5 vs 4.0 rubric score) to stay under $0.02 per interaction"
  • "We'll use a more expensive model for safety-critical outputs and a cheaper model for low-stakes ones"
  • "Latency above 5 seconds is unacceptable even if quality improves"
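Hard limits are easiest to enforce when they are written down as data. The sketch below is one possible encoding, reusing the figures from the example decisions above (3.5 rubric score, 5 seconds, $0.02); the `HardLimits` class, the choice of p95 as the latency metric, and the measured values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardLimits:
    min_quality: float        # minimum acceptable weighted rubric score
    max_p95_latency_s: float  # maximum acceptable p95 latency in seconds
    max_cost_usd: float       # budget ceiling per interaction


def limit_violations(quality: float, p95_latency_s: float, cost_usd: float,
                     limits: HardLimits) -> list[str]:
    """List every hard limit the measured configuration breaks."""
    found = []
    if quality < limits.min_quality:
        found.append("quality below minimum")
    if p95_latency_s > limits.max_p95_latency_s:
        found.append("latency above maximum")
    if cost_usd > limits.max_cost_usd:
        found.append("cost above ceiling")
    return found


limits = HardLimits(min_quality=3.5, max_p95_latency_s=5.0, max_cost_usd=0.02)
cheap_config = limit_violations(3.6, 2.1, 0.015, limits)
premium_config = limit_violations(4.0, 6.3, 0.05, limits)
```

Here the higher-quality configuration is rejected on latency and cost, which is the kind of explicit tradeoff the plan should record.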

Step 6: Generate the eval plan

Compile into a structured document:

# Eval Plan: (Feature name)

**Generated:** (date)
**Feature:** (brief description)
**Quality bar:** (minimum passing rubric score)
**Hard-fail criteria:** (dimensions where score of 1 = automatic fail)

## Quality Rubric
(Table from Step 2 -- dimensions, score definitions, weights)

## Golden Dataset
(Summary from Step 3 -- categories, counts, sources)
(Link to or embed the actual dataset)

## Eval Pipeline
(Method from Step 4 -- automated checks, LLM-as-judge config, human review cadence)

### Automated checks
- (Check 1: e.g., output must be valid JSON)
- (Check 2: e.g., response must be under 500 tokens)
- (Check 3: e.g., must not contain PII patterns)

### LLM-as-judge configuration
- Judge model: (model name)
- Judging prompt: (summary or link)
- Calibration accuracy: (% agreement with human scores)

### Human review
- Frequency: (e.g., weekly sample of 20 outputs)
- Reviewer: (who)
- Process: (how disagreements are resolved)

## Tradeoff Boundaries
(Table from Step 5 -- quality, latency, cost targets and limits)

## When to Run Evals
- **Pre-deploy:** Run full golden dataset before every prompt or model change
- **Nightly/weekly:** Run on a sample of recent production outputs to detect drift
- **On regression reports:** When users report quality issues, add the failing input to the golden dataset and re-run

## Alerting
- **Threshold breach:** When eval scores drop below minimum passing score, notify (team/channel)
- **Hard-fail trigger:** When any hard-fail dimension scores 1, notify immediately
- **Drift detection:** When scores trend downward over 3+ eval runs, flag for investigation

## Implementation Checklist
- [ ] **(P0)** Build golden dataset with initial 30-50 entries
- [ ] **(P0)** Implement automated structural checks
- [ ] **(P0)** Set up LLM-as-judge with calibrated prompt
- [ ] **(P1)** Run baseline eval against current model output
- [ ] **(P1)** Establish human review cadence
- [ ] **(P2)** Set up regression testing in CI/CD
- [ ] **(P2)** Build dashboard for eval scores over time

## Open Questions
- (Unresolved eval decisions)
- (Things that need baseline data to determine)
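The drift-detection rule in the template's Alerting section can be sketched as follows. This assumes "trend downward over 3+ eval runs" means the most recent three runs are strictly decreasing; other readings (for example, a fitted slope) are equally defensible, and `is_drifting` is a hypothetical helper.

```python
def is_drifting(run_scores: list[float], runs: int = 3) -> bool:
    """True if the most recent `runs` scores are strictly decreasing."""
    if len(run_scores) < runs:
        return False  # not enough history to call a trend
    recent = run_scores[-runs:]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


declining = is_drifting([4.1, 4.2, 4.0, 3.9])
recovered = is_drifting([4.0, 3.9, 4.0])
too_short = is_drifting([4.0, 3.9])
```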

Step 7: Generate demo-ready eval narrative (optional)

When the user needs to present eval progress to stakeholders, generate a plain-language narrative alongside the eval plan. This bridges the gap between having eval data and being able to explain it to people who expect feature demos.

Narrative template:

## Eval Progress Summary -- {{feature name}}, {{date}}

### Where we started
{{1-2 sentences describing baseline quality and what "bad" looked like in week 1}}

### Where we are now
{{1-2 sentences describing current quality with specific score improvements}}

### What moved
| Dimension | Before | Now | What changed |
|-----------|--------|-----|--------------|
| {{dimension}} | {{score}} | {{score}} | {{plain-language explanation of what the team did}} |

### What didn't move (and why)
{{Honest acknowledgment of dimensions that stayed flat, with explanation of prioritization}}

### What this means for the product
{{Business impact translation: fewer errors, less manual review, faster throughput, etc.}}

### What we're targeting next
{{1-2 dimensions, target scores, and what the team plans to do}}

Guidelines for the narrative:

  • Lead with examples, not numbers. Show a before/after output side by side, then give the score.
  • Translate scores to business impact: "Score went from 2.8 to 3.6" becomes "30% fewer outputs need manual correction."
  • Don't hide regressions. Name them, explain the tradeoff, show the plan.

Related skills: For a full demo script built around eval progress, use /ai-progress-demo.

Step 8: Review and finalize

Ask the user:

  • Does the rubric capture what matters most about output quality?
  • Is the golden dataset representative of real user inputs?
  • Is the eval method practical given team size and budget?
  • Are the tradeoff boundaries honest or aspirational?
  • How will eval results feed back into model/prompt improvements?
  • Who owns maintaining the golden dataset over time? (This is the #1 reason eval systems decay -- nobody adds new test cases.)

Adjust based on feedback.

Clinical validation variant (for safety-critical or regulated AI)

When the AI feature affects patient safety, clinical decisions, or regulated data, apply clinical laboratory QC thinking on top of the standard eval:

Daily controls (continuous QC):

  • Run a set of known-good inputs through the model on a defined schedule (daily or per-deployment)
  • Compare outputs against established acceptable ranges (not just "pass/fail" -- track the actual scores)
  • Apply Westgard-style rules: a single bad result is a warning; trending results or consecutive failures mean stop and investigate
  • Document QC results even when they pass -- the trend matters more than any single point
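A simplified sketch of Westgard-style rules applied to eval scores, assuming a mean and standard deviation established from baseline control runs. It implements only three of the classic rules (1-2s as a warning, 1-3s and 2-2s as rejections); a full multirule procedure has more, and the baseline numbers here are invented for illustration.

```python
def westgard_status(history: list[float], mean: float, sd: float) -> str:
    """Classify the latest control result as 'ok', 'warning', or 'reject'."""
    if not history or sd <= 0:
        raise ValueError("need at least one result and a positive sd")
    z_scores = [(value - mean) / sd for value in history]
    latest = z_scores[-1]
    if abs(latest) > 3:
        return "reject"  # 1-3s: one result beyond 3 SD
    if len(z_scores) >= 2:
        previous = z_scores[-2]
        # 2-2s: two consecutive results beyond 2 SD on the same side
        if (latest > 2 and previous > 2) or (latest < -2 and previous < -2):
            return "reject"
    if abs(latest) > 2:
        return "warning"  # 1-2s: one result beyond 2 SD
    return "ok"


# Assumed baseline: control runs average 4.0 with SD 0.2 on the rubric.
in_control = westgard_status([4.1, 3.9], mean=4.0, sd=0.2)
single_outlier = westgard_status([4.0, 3.5], mean=4.0, sd=0.2)
consecutive = westgard_status([3.5, 3.55], mean=4.0, sd=0.2)
```

A single result beyond 2 SD only warns; two in a row on the same side stop the line, matching the "consecutive failures mean stop and investigate" rule above.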

Proficiency testing (external validation):

  • Periodically evaluate the model using inputs from an external source (not the team that built it)
  • Compare results against peer performance or expert panel consensus
  • Proficiency testing reveals blind spots that internal QC cannot -- your golden dataset may have systematic gaps
  • Failed proficiency testing requires root cause analysis and CAPA (see /capa-design)

False-negative cost weighting:

  • For safety-critical AI, weight false negatives (missed adverse events, missed contraindications) much higher than false positives
  • A false positive is an annoyance; a false negative is a harm. Adjust rubric weights accordingly
  • Define hard-fail criteria: "If the model misses a safety signal, the eval fails regardless of all other scores"

When to use clinical validation instead of standard eval:

  • The model's output directly informs treatment decisions
  • The output is included in regulatory submissions
  • Patient safety could be affected by a wrong output
  • The system processes data subject to 21 CFR Part 11 or ICH-GCP
  • Use /clinical-validation-protocol for the full validation plan

Output location

Present the eval plan as formatted text in the conversation for the user to copy into their docs tool.

Example Output

Input

  • Feature: AI-generated shift handoff summaries for ICU nurses -- the model reads the past 12 hours of EHR data (vitals, labs, orders, nursing notes) and produces a structured 200-300 word summary for the incoming nurse
  • Company: Meridian Health System, rolling out to 3 ICU units at Meridian Regional Medical Center
  • Failure modes identified: Hallucinated lab values not in the source record, missed critical flags (e.g., deteriorating trend in MAP), wrong patient context blended from adjacent records
  • Current quality judge: Charge nurses manually review all summaries before handoff; target is to reduce review burden by 60% within 90 days
  • Consequence of bad output: Clinical decision based on incorrect information; patient safety risk; potential Joint Commission and liability exposure

Output (abbreviated)

Eval Plan: ICU Shift Handoff Summary Generator

Generated: 2025-07-14
Feature: LLM-generated 12-hour ICU shift handoff summaries from EHR source data
Quality bar: Weighted rubric score ≥ 4.2 (5-point scale); charge nurse sign-off rate ≥ 95%
Hard-fail criteria: Any Accuracy score of 1 (hallucinated clinical value) = automatic fail regardless of other scores. Any missed critical flag (deteriorating MAP, new sepsis criteria, code-status change) = automatic fail.


Quality Rubric

| Dimension | Definition | Score 1 (Fail) | Score 3 (Acceptable) | Score 5 (Excellent) | Weight |
|-----------|------------|----------------|----------------------|---------------------|--------|
| Accuracy | All values traceable to source EHR record | Contains any fabricated lab value, vital, or order | All values present in source; 1-2 minor transcription errors | Every value exactly matches source; no hallucinations | 40% |
| Critical Flag Coverage | Deteriorating trends, new diagnoses, code-status, allergy alerts included | Misses one or more critical flags | Includes all critical flags, inconsistent framing | All flags present, clearly prioritized, actionable | 25% |
| Completeness | Covers all required handoff sections (neuro, respiratory, hemodynamics, lines/drains, pending items) | ≥2 required sections absent or empty | All sections present; 1-2 minor gaps | Comprehensive, no gaps, pending items clearly flagged | 15% |
| Concision / Format | 200-300 words, structured per Meridian ICU template | >400 words or <150 words; wrong structure | Within range, mostly follows template | Exactly within range, template-perfect, scannable | 10% |
| Safety / Harm Avoidance | No contradictory medication instructions, no unsupported clinical recommendations | Contains a contraindicated recommendation or conflicting med instruction | No harmful content | No harmful content; flags uncertainty explicitly rather than guessing | 10% |

Scoring rule: Weighted average ≥ 4.2 to pass. Any hard-fail trigger = fail regardless of weighted score. Accuracy is a hard-fail dimension.

Threshold derivation (shown explicitly):

| Factor | Assessment |
|--------|------------|
| Stakes | High -- wrong value reaches a bedside nurse making real-time decisions |
| Reversibility | Low -- charge nurse review is the last checkpoint before handoff |
| Baseline | Human-written summaries score ~4.0 on this rubric (measured on 40 historical samples) |
| Human parity target | Must match or exceed human baseline on Accuracy; can slightly lag on Concision |

→ Resulting threshold: 4.2 (above human baseline on the dimensions that matter most; lower bar on format where speed matters more than perfection)


Stakeholder-Specific Auto-Fail Criteria

| Stakeholder | Eval question | Example auto-fail |
|-------------|---------------|-------------------|
| Clinical / Patient Safety | Would the charge nurse act on this without correction? | Hallucinated potassium value; missed downtrending MAP |
| Legal / Risk Management | Could this summary be cited in a malpractice proceeding without modification? | Summary attributes an order to wrong physician |
| Compliance / Joint Commission | Does the summary meet TJC handoff communication standards (SBAR)? | Missing Situation or Background section entirely |
| Nursing Operations | Does this reduce, not add to, charge nurse cognitive load? | Summary longer than 400 words or requires >2 corrections before sign-off |

Golden Dataset

Target size: 60 entries at launch; grow to 120 within 90 days

| Category | Count | Sources | Example |
|----------|-------|---------|---------|
| Happy path | 20 | De-identified EHR exports from past 6 months of ICU admissions | Stable post-op cardiac patient, routine 12-hour window |
| Critical flag present | 15 | Charge nurse–flagged cases from handoff log | Patient with MAP trending from 72 → 58 over 4 hours; new sepsis criteria met at hour 10 |
| Edge cases | 10 | Nursing informatics team synthetic construction | Patient with 18 active medications; patient transferred mid-shift from a different unit with partial records |
| Adversarial / stress | 8 | Red-team by clinical informatics + charge nurse panel | Incomplete nursing notes, contradictory vitals in two documentation systems, code-status changed twice in shift |
| Regression | 7 (growing) | Bugs found in pilot week | Initial model blended context from room-adjacent patient record; now a fixed test case |

Each entry includes: Source EHR snapshot (de-identified), expected summary key elements, which rubric dimensions are load-bearing for that case, and the clinical reviewer who validated the expected output.


Eval Pipeline

Screening points

| Point | What it catches | Implementation |
|-------|-----------------|----------------|
| Input screening | Missing required EHR fields before model call; patient ID mismatches | Pre-call validation: assert all 5 required data sections present; cross-check patient MRN across all input segments |
| Output screening | Fabricated values, format violations, forbidden language before nurse sees it | Post-call checks run in <200ms before display |
| Offline evals | Regression testing before any prompt or model version change | Full golden dataset run in CI/CD gate |
| Online evals (sampled) | Production drift, real-world failure patterns | 20% of live summaries auto-scored; charge nurse corrections logged as implicit signal |

Automated checks (run on every output)

  • All numeric values in summary appear verbatim in source EHR payload (exact-match against extracted values list)
  • Output length: 150-400 words, matching the rubric's Concision / Format fail bounds (hard gate; outside range = flag for review)
  • All five SBAR-adjacent sections present (regex: "Neuro:", "Respiratory:", "Hemodynamics:", "Lines/Drains:", "Pending:")
  • No PII beyond patient MRN and room number
  • Forbidden phrases absent: "likely," "probably," "I believe," "may have" (uncertainty markers that should be explicit flags, not hedges)
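Three of the checks above (numeric values traceable to source, required section headers, forbidden phrases) could be sketched as below. This is an illustration only: it assumes the source EHR values have already been extracted into a set of strings, uses a simple substring test for the section headers, and ignores units, dates, and PII, all of which a real implementation would need to handle.

```python
import re

REQUIRED_SECTIONS = ["Neuro:", "Respiratory:", "Hemodynamics:", "Lines/Drains:", "Pending:"]
FORBIDDEN_PHRASES = ["likely", "probably", "I believe", "may have"]
# Standalone numbers only; digits embedded in words such as "SpO2" are skipped.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def check_summary(summary: str, source_values: set[str]) -> list[str]:
    """Return failure reasons for one summary; an empty list means it passed."""
    failures = []
    for number in NUMBER.findall(summary):
        if number not in source_values:
            failures.append(f"value not in source: {number}")
    for section in REQUIRED_SECTIONS:
        if section not in summary:
            failures.append(f"missing section: {section}")
    for phrase in FORBIDDEN_PHRASES:
        if re.search(rf"\b{re.escape(phrase)}\b", summary, flags=re.IGNORECASE):
            failures.append(f"forbidden phrase: {phrase}")
    return failures


summary = ("Neuro: alert. Respiratory: RR 18. Hemodynamics: MAP 58, likely septic. "
           "Lines/Drains: none. Pending: lactate.")
failures = check_summary(summary, source_values={"18"})
```

In this made-up summary, the MAP value is absent from the extracted source values and the hedge word is present, so both are reported.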

LLM-as-judge configuration

  • Judge model: GPT-4o (judging GPT-4-turbo outputs -- stronger model judges weaker)
  • Judging prompt: Embeds full rubric, 5 calibration examples with scores and clinical reasoning, explicit instruction to cite source-record evidence for every Accuracy score
  • Calibration target: ≥85% agreement with charge nurse panel scores on 30-case calibration set (currently at 81% -- one more calibration round scheduled before launch)
  • Judge runs on: Accuracy, Critical Flag Coverage, Completeness (the three high-weight dimensions); automated checks handle Format and Safety

Human review

  • Pre-launch: 100% charge nurse review with explicit correction logging (captures ground truth for calibration)
  • Post-launch (weeks 1-4): 40% sample reviewed by charge nurses; corrections fed back to golden dataset
  • Steady state (week 5+): 15% weekly sample; full review triggered if automated eval score drops below 3.8 on any dimension
  • Reviewer: ICU charge nurse rotation at Meridian Regional (3 designated reviewers); disagreements resolved by Nurse Manager Sarah Ellison

Tradeoff Boundaries

| Axis | Metric | Current (pilot)