Skip to main content
Engineering/ai-testing-strategy

AI Testing Strategy

Build a testing strategy for an AI system: the testing pyramid, golden datasets, regression suites, red-teaming, and deploy gates.

Use this when you need a testing strategy for an AI system, broader than evaluating one feature's output. Covers what is deterministic and can be unit-tested, what is probabilistic and needs evals, the AI testing pyramid that ties them together, regression suites, red-teaming, and what gates a deploy. For deep eval design of a single feature, use /ai-eval-design; this skill is the system-level strategy that decides where evals fit.

Related skills: Single-feature eval design is /ai-eval-design. Adversarial cases come from /ai-guardrails-design. Production monitoring is the online layer, see /llm-observability-plan. Agent loops are tested as designed in /ai-agent-design.

The hard part most teams miss

You cannot unit-test a probabilistic system the way you test deterministic code. The testing pyramid inverts, and teams that ignore that either over-test with brittle exact-match assertions or under-test with vibes.

  1. Testing and evals are different things, and you need both. Tests are deterministic: did the code do what it should, the same way every time. Evals are probabilistic: is the output good, on average, against a bar. A pass/fail unit test on a model output that should vary is brittle; a vibes-check on the deterministic plumbing is negligent. Sort each component into the right bucket (Step 2).
  2. The pyramid inverts: the expensive layer is the middle. In normal software the base is cheap unit tests and the top is a few expensive end-to-end tests. In an AI system the deterministic base is still cheap, but the load-bearing layer is the eval middle (golden datasets), which costs real money and judgment to maintain. Budget for it.
  3. Adversarial cases are the test, not an afterthought. The happy path passes easily. What a competent attacker or a confused user does is where AI systems fail, and red-teaming those cases is a first-class part of the strategy, not a security team's separate problem.

Process

Step 1: Gather inputs

Ask the user:

  1. What is the system, end to end? (Components: prompts, retrieval, tools, an agent loop, the surrounding app.)
  2. Which parts are deterministic vs probabilistic? (The plumbing around the model vs the model output itself.)
  3. What is the risk profile? (Consumer-facing and high-stakes, internal and moderate, batch and low.)
  4. What changes most often? (Prompts, models, the corpus. The thing that changes most needs the tightest regression net.)
  5. What is the deploy process? (CI in place, manual releases. This decides where tests gate.)
  6. What has broken before? (Past failures are your first regression cases.)

Step 2: Sort each component into the pyramid

LayerWhat goes hereMethodRuns
Deterministic baseParsing, routing, schema validation, tool wrappers, retrieval plumbingOrdinary unit and integration tests, exact assertionsEvery commit, fast
Eval middle (load-bearing)Model output quality, retrieval recall, end-to-end answer qualityGolden datasets scored by rubric / LLM-as-judge (/ai-eval-design)Pre-deploy and on a schedule
Adversarial topJailbreaks, injection, abuse, harmful-output attemptsRed-team suite (/ai-guardrails-design)Pre-deploy and after any guardrail change
Online (production)Real-world drift, sampled live qualitySampled scoring and behavioral proxies (/llm-observability-plan)Continuous

The base is cheap, keep it broad. The middle is expensive, make it count. The top is small but non-negotiable.

Step 3: Golden datasets and regression suites

  • Golden dataset: the curated inputs with known-good outputs that anchor the eval middle. Design it with /ai-eval-design; the strategy here is making it a gate, not a one-time exercise.
  • Regression suite: every production failure becomes a permanent test case. This is the suite that catches the bug that came back. It only works if someone owns adding to it (the single most common reason eval systems decay).
  • Version with the prompt: the dataset travels with the prompt and model version, so a change runs against the baseline that established it.

Step 4: Red-teaming

  • Source the cases: pull adversarial inputs from the threat surface in /ai-guardrails-design (injection, jailbreak, abuse, false-premise) plus anything users have actually tried.
  • Run it as a suite, not a one-off: red-teaming is a regression layer too. Re-run it after any prompt or guardrail change, because a warmth or capability tweak can reopen a closed hole.
  • Score for harm, not quality: the question is "did it do the thing it must never do," a hard-fail, not a graded score.

Step 5: Decide what gates a deploy

  • Block on: deterministic test failures, any red-team hard-fail, and an eval score below the bar.
  • Warn on: an eval score that dropped but is still above the bar (investigate, do not necessarily block).
  • Cannot gate on: online metrics (they are post-deploy by definition); those drive rollback, not the gate.

Step 6: Output the testing strategy

# AI Testing Strategy: (system)

**Risk profile:** (level)
**Changes most often:** (component)

## Pyramid map
(Table from Step 2: each component placed in a layer)

## Golden + regression
- Golden dataset: (link / size)
- Regression owner: (name)
- Versioning: (with prompt/model)

## Red-team suite
- Case sources: (list)
- Re-run trigger: (after which changes)

## Deploy gates
- Block on: (list)
- Warn on: (list)
- Rollback triggers (online): (list)

## Open questions
- (unresolved decisions)

Step 7: Review

Ask the user:

  • Is each component sorted into the right bucket (test vs eval)?
  • Who owns adding production failures to the regression suite?
  • Does the red-team suite re-run after prompt changes, or only at launch?
  • What actually blocks a deploy today, and is that the right line?

Anti-patterns

Anti-patternWhy it failsDo instead
Exact-match tests on model outputBrittle; breaks on valid variationEval the probabilistic output, unit-test the plumbing
Vibes-only testingNo baseline; regressions ship unseenGolden dataset as a gate
Red-team once at launchA later prompt change reopens a holeRe-run the adversarial suite on every relevant change
No regression ownerThe suite stops resembling realityName who adds production failures
Gating on online metricsThey are post-deploy; cannot block a releaseGate on offline evals; online drives rollback
Skipping the eval middleThe load-bearing layer is untestedBudget for golden-dataset evals

Output location

Present the testing strategy as formatted text in the conversation for the user to copy into their engineering docs.