Experiment Design - AI Agent Skill

Use this when you have a product assumption you want to validate before investing significant development time. The skill defines the hypothesis, test method, success criteria, minimum viable test, timeline, and a decision framework (ship/iterate/kill).

Framework attribution: Experimentation as the way product teams reduce guesswork (The Product Economy), practiced inside the Pivotal balanced-team culture of sharing experiments and celebrating failure. See playbook/experiment-driven.

Related resources: discovery-assumptions-workshop.md -- 4-step team activity for surfacing and testing riskiest assumptions (standalone, printable).

Process

Step 1: Gather the hypothesis

Ask the user:

What do you believe? -- the assumption to test (e.g., "users will pay for a premium tier," "adding search will reduce support tickets")
Who is the target user? -- who does this affect?
What outcome do you expect? -- what should change if you're right?
What KPI does this connect to? -- what business metric improves?
What's at stake? -- how much are you planning to invest if the hypothesis is true? (This calibrates how rigorous the test needs to be.)

Step 2: Structure the hypothesis

Format using the standard template:

We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).
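As a minimal sketch, the template can be treated as a record with one field per blank, which makes it easy to spot a hypothesis that is missing a piece. The class name, field names, and example values below are illustrative assumptions, not part of the skill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One field per blank in the Step 2 template (names are illustrative)."""
    action: str    # (doing this)
    audience: str  # (these people)
    outcome: str   # (this outcome)
    signal: str    # (measurable signal)
    kpi: str       # (this KPI)

    def statement(self) -> str:
        # Refuse to render a hypothesis that leaves a blank unfilled.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"hypothesis field '{name}' is empty")
        return (
            f"We believe that {self.action} for {self.audience} "
            f"will achieve {self.outcome}. We'll know this is true when we see "
            f"{self.signal} that improves {self.kpi}."
        )


# Hypothetical example values, loosely based on the "adding search" assumption.
example = Hypothesis(
    action="adding search",
    audience="self-serve customers",
    outcome="fewer support tickets",
    signal="a drop in tickets per active account",
    kpi="support cost per account",
)
print(example.statement())
```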

Step 3: Select the test method

Recommend the cheapest/fastest way to validate the hypothesis:

Method	When to use	Effort	Confidence
Customer interviews (5-8)	Exploring a new problem space or need	Low	Medium
Paper/Figma prototype test	Validating a UX flow or interaction pattern	Low	Medium
Fake door / painted door	Testing demand for a feature before building it	Low	High
Landing page + signup	Testing demand with a commitment signal (email, waitlist)	Low	High
Wizard of Oz	Testing an experience where the backend is manual	Medium	High
Concierge MVP	Delivering the service manually to a small group	Medium	High
Letter of intent	Getting written commitment (LOI, pre-order, deposit) before building	Low	High
Pop-up store	Temporary physical or digital presence to test demand in a new market	Medium	High
Partner interview	Testing feasibility/viability assumptions through potential partners, not customers	Low	Medium
A/B test	Comparing two approaches with real usage data	Medium-High	Very High
Feature flag rollout	Gradually releasing to a % of users	High	Very High

Choose the method that provides enough confidence for the decision at stake. Don't A/B test what you can validate with 5 interviews.
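The "enough confidence at the lowest effort" rule can be sketched as a lookup over the table above. The effort and confidence labels are copied from the table; the numeric ranking and the helper name `cheapest_methods` are assumptions made for illustration.

```python
# Ordinal ranks for the labels used in the Step 3 table (ranking is assumed).
RANK = {"Low": 1, "Medium": 2, "Medium-High": 3, "High": 4, "Very High": 5}

# (method, effort, confidence) rows copied from the Step 3 table.
METHODS = [
    ("Customer interviews (5-8)", "Low", "Medium"),
    ("Paper/Figma prototype test", "Low", "Medium"),
    ("Fake door / painted door", "Low", "High"),
    ("Landing page + signup", "Low", "High"),
    ("Wizard of Oz", "Medium", "High"),
    ("Concierge MVP", "Medium", "High"),
    ("Letter of intent", "Low", "High"),
    ("Pop-up store", "Medium", "High"),
    ("Partner interview", "Low", "Medium"),
    ("A/B test", "Medium-High", "Very High"),
    ("Feature flag rollout", "High", "Very High"),
]


def cheapest_methods(required_confidence: str) -> list[str]:
    """Return the lowest-effort methods whose confidence meets the bar."""
    eligible = [m for m in METHODS if RANK[m[2]] >= RANK[required_confidence]]
    if not eligible:
        return []
    lowest_effort = min(RANK[m[1]] for m in eligible)
    return [m[0] for m in eligible if RANK[m[1]] == lowest_effort]


print(cheapest_methods("High"))
print(cheapest_methods("Very High"))
```

This only filters on effort and confidence. The "When to use" column still has to be applied by a person, since a low-effort method that tests the wrong kind of assumption is not a bargain.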

Start with LEARN, not BUILD. The lean startup loop is Build-Measure-Learn, but most teams default to Build-Build-Build. Flip it: start with what you need to learn, then design the cheapest test that would teach you that. If you cannot name the assumption this experiment tests, you may be in the "illusion of productivity" -- building to feel productive rather than building to learn.

If A/B test is selected, continue to Step 3b. Otherwise, skip to Step 4.

Step 3b: A/B test design (only when A/B test is selected)

When A/B test is the chosen method, gather additional details and produce the extended A/B test plan below.

Null and alternative hypotheses

Restate the hypothesis in formal terms:

H₀ (null): There is no difference in (primary metric) between the control and the variant.
H₁ (alternative): The variant produces a (direction: higher/lower) (primary metric) than the control by at least (minimum detectable effect).

Ask the user:

What is the current baseline value for the primary metric? (e.g., "checkout conversion is 3.2%")
What is the minimum improvement that would justify shipping the variant? (This is the minimum detectable effect / MDE.)

Sample size calculation

Estimate the required sample size per variant using these inputs:

Parameter	Value
Baseline conversion rate	(current value, e.g., 3.2%)
Minimum detectable effect (MDE)	(absolute or relative, e.g., +0.5pp or +15% relative)
Statistical significance level (α)	0.05 (default -- 95% confidence)
Statistical power (1 − β)	0.80 (default -- 80% power)
Number of variants	(2 for simple A/B, more for multivariate)
One-tailed or two-tailed test	(two-tailed unless there is a strong directional prior)

Estimated sample size per variant: Use the standard two-proportion sample size formula or a calculator (e.g., Evan Miller, Optimizely, or VWO sample size calculator). State the resulting number clearly, along with the inputs used to produce it.
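For conversion-rate metrics, one common form of the calculation is the normal-approximation formula for comparing two proportions. The sketch below assumes that formula, an absolute MDE, and equal traffic per variant; the function name is hypothetical, and online calculators may return slightly different numbers because they use different approximations.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80,
                            two_tailed: bool = True) -> int:
    """Approximate users needed per variant to detect an absolute lift."""
    p1 = baseline
    p2 = baseline + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must be in (0, 1) and the MDE must be non-zero")
    p_bar = (p1 + p2) / 2
    # Critical values for the significance level and the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 if two_tailed else 1))
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Inputs from the parameter table: 3.2% baseline, +0.5pp absolute MDE.
n = sample_size_per_variant(baseline=0.032, mde_abs=0.005)
print(f"Required users per variant: {n:,}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE question in this step deserves a careful answer.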

Estimated test duration: Given current daily traffic to the test surface, how many days to reach the required sample size? Include a recommendation to not stop the test early based on interim results (peeking problem).
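The duration estimate is simple arithmetic: total users needed across all variants divided by daily eligible traffic, rounded up. A minimal sketch, with hypothetical traffic numbers and a hypothetical function name:

```python
from math import ceil


def estimate_duration_days(sample_per_variant: int, num_variants: int,
                           daily_eligible_users: int) -> int:
    """Days needed to fill every variant, assuming an even traffic split."""
    if daily_eligible_users <= 0:
        raise ValueError("daily traffic must be positive")
    total_needed = sample_per_variant * num_variants
    return ceil(total_needed / daily_eligible_users)


# Hypothetical inputs: 20,000 users per variant, 2 variants, 3,000 users/day.
days = estimate_duration_days(20_000, 2, 3_000)
print(f"Run for at least {days} days before analyzing results")
```

Only traffic that actually reaches the test surface counts. Using site-wide traffic here would understate the duration.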

Significance thresholds and decision rules

Decision	Condition
Ship variant	p-value < α AND effect size ≥ MDE AND no guardrail violations
Iterate	p-value < α but effect size < MDE, or guardrail metric shows minor degradation
Kill variant	p-value ≥ α (no significant difference) OR guardrail violation
Extend test	Observed effect is promising but sample size is below target -- extend, do not peek-and-decide
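The decision table can be written down as a small function so the rules are fixed before results arrive. The table does not state which rule wins when several apply, so the precedence below (guardrail violation, then sample size, then significance) is an assumption, as are the class and field names. The sketch also extends any test that is below its sample target, whether or not the interim effect looks promising.

```python
from dataclasses import dataclass


@dataclass
class ABResult:
    p_value: float
    effect_size: float        # observed lift in the hypothesized direction
    mde: float                # minimum detectable effect, same units
    sample_per_variant: int
    target_sample: int
    guardrail_violation: bool = False
    minor_guardrail_degradation: bool = False


def decide(result: ABResult, alpha: float = 0.05) -> str:
    """Apply the decision rules in an assumed order of precedence."""
    if result.guardrail_violation:
        return "Kill variant"
    if result.sample_per_variant < result.target_sample:
        return "Extend test"  # never decide on an incomplete sample
    if result.p_value >= alpha:
        return "Kill variant"
    if result.effect_size < result.mde or result.minor_guardrail_degradation:
        return "Iterate"
    return "Ship variant"


print(decide(ABResult(p_value=0.01, effect_size=0.6, mde=0.5,
                      sample_per_variant=21_000, target_sample=21_000)))
```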

Correction for multiple comparisons: If testing more than 2 variants, apply Bonferroni correction (α / number of comparisons) or use a sequential testing framework.
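A short sketch of the Bonferroni adjustment, assuming each non-control variant is compared against the control once. The helper name and the p-values are hypothetical.

```python
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("need at least one comparison")
    return alpha / num_comparisons


# Three variants (B, C, D), each compared against control A: 3 comparisons.
adjusted = bonferroni_alpha(0.05, num_comparisons=3)

# Hypothetical p-values; only those below the adjusted level count as wins.
p_values = {"B": 0.030, "C": 0.012, "D": 0.400}
significant = [name for name, p in p_values.items() if p < adjusted]
print(f"adjusted alpha = {adjusted:.4f}, significant variants: {significant}")
```

Variant B would have passed at the unadjusted 0.05 level but does not pass here, which is the kind of false positive the correction is meant to prevent.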

Variant design

Document each variant clearly:

Variant	Name	Description	What changes from control
Control (A)	(name)	(current experience -- no changes)	--
Variant (B)	(name)	(describe the change)	(specific UI/flow/copy/logic differences)
(C, D...)	(if multivariate)	(describe)	(differences)

Isolation rule: Each variant should change one thing unless you are deliberately testing a bundle. If you're testing multiple changes, you have a multivariate test and need more traffic.

Rollout and decision criteria

Phase	Traffic allocation	Duration	Decision point
Burn-in	5-10% per variant	1-2 days	Verify instrumentation is working, no crashes or errors
Ramp	50/50 (or equal split)	Until sample size target is reached	Monitor guardrail metrics daily
Decision	--	After full sample collected	Apply decision rules above
Rollout	100% to winner	--	Full deploy with monitoring for 1 week post-rollout

Post-rollout monitoring: After shipping the winning variant to 100%, monitor the primary metric for at least 7 days to confirm the effect persists outside the test context (novelty effect, selection bias).

Risks and ethical considerations

Assess and document:

Risk	Mitigation
Peeking / early stopping	Pre-commit to the sample size. Do not make decisions on interim results unless using a sequential testing framework.
Novelty effect	The variant may win because it's new, not because it's better. Monitor post-rollout for regression.
Simpson's paradox	Segment results by key dimensions (device, geography, user tenure) to check for contradictory sub-group effects.
User harm	If the variant could degrade the experience for a segment (e.g., accessibility, performance), define a kill switch and monitoring threshold.
Revenue / compliance impact	For tests affecting pricing, payments, or regulated features, get legal/compliance review before launch.
Contamination	Ensure users are consistently bucketed (same user always sees the same variant). Use sticky assignment by user ID, not session.
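Sticky assignment by user ID is often implemented by hashing the ID together with an experiment name, so the same user always lands in the same bucket without storing any state. A minimal sketch under that assumption follows. The function name, experiment name, and ID format are hypothetical, and a real experimentation platform may do this differently.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "variant")) -> str:
    """Deterministically map a user to a variant for one experiment."""
    # Including the experiment name keeps buckets independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same user gets the same answer on every call, in any session.
first = assign_variant("user-123", "checkout-copy-test")
second = assign_variant("user-123", "checkout-copy-test")
print(first, first == second)
```

Hashing a session ID instead would let one person see both experiences across visits, which is the contamination the table warns about.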

Step 4: Define success criteria

Specify:

Primary metric -- the one number that determines success (e.g., "15% of visitors click 'Learn More'")
Secondary metrics -- supporting signals (e.g., "average time on page > 30 seconds")
Guardrail metrics -- things that should NOT get worse (e.g., "support ticket volume doesn't increase")
Sample size / duration -- how many participants or how long the test runs before deciding

Step 5: Generate the experiment plan

Output in this format:

Experiment Plan: (experiment name)

Hypothesis

We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).

Test method: (method name)

Why this method: (rationale for choosing this over alternatives)

What to build / prepare

(Specific deliverable -- e.g., "Figma prototype with 3 screens covering the signup flow")
(Setup step -- e.g., "recruit 6 users matching our target persona")

Success criteria

Metric	Target	Measurement
Primary: (metric)	(threshold)	(how to measure)
Secondary: (metric)	(threshold)	(how to measure)
Guardrail: (metric)	(should not exceed)	(how to measure)

Timeline

Preparation: (X days) -- build the test artifact
Execution: (X days/weeks) -- run the experiment
Analysis: (X days) -- review results and decide
Total: (X days/weeks)

Decision framework

Result	Action
Primary metric exceeds target	Ship -- move to full implementation
Primary metric meets target	Iterate -- refine and retest, or proceed with caution
Primary metric below target	Kill -- archive the learning, move on
Mixed signals (primary up, guardrail down)	Investigate -- dig deeper before deciding

Risks and mitigations

(What could invalidate the test -- e.g., "sample too small to be conclusive")
(Mitigation -- e.g., "extend test duration to 2 weeks if sample is below target")

Step 5b: Effect size and confidence intervals (for quantitative tests)

When the test produces quantitative results, report effect sizes and confidence intervals alongside p-values:

### Effect Size and Precision

| Metric | Control | Variant | Difference | 95% CI | Effect size |
|--------|---------|---------|-----------|--------|-------------|
| Primary: (metric) | (value) | (value) | (absolute diff) | [lower, upper] | (Cohen's d or relative %) |

**Interpretation:**
- The 95% confidence interval means: if we repeated this experiment many times, 95% of the intervals would contain the true effect.
- CI width indicates precision. A CI of [+0.5%, +8.0%] is much less precise than [+3.0%, +5.5%], even if both are "significant."
- If the CI includes zero, the true effect could be no change -- even if the point estimate is positive.
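For a conversion-rate metric, the Difference and 95% CI columns can be filled in with a normal-approximation (Wald) interval for the difference of two proportions. This is one common choice rather than the only valid one; the function name and counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(conv_c: int, n_c: int, conv_v: int, n_v: int,
                           confidence: float = 0.95):
    """Return (difference, lower, upper) for variant rate minus control rate."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    diff = p_v - p_c
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 320/10,000 conversions in control, 370/10,000 in variant.
diff, lower, upper = diff_in_proportions_ci(320, 10_000, 370, 10_000)
print(f"diff = {diff:+.4f}, 95% CI = [{lower:+.4f}, {upper:+.4f}]")
```

With these counts the interval straddles zero even though the point estimate is positive, which is the situation the last interpretation bullet describes.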

When to consider Bayesian analysis:

When you need to make a decision with limited data (small sample, can't wait for full power)
When you want to express results as "probability that B is better than A" rather than "reject/fail to reject the null"
When stakeholders find "95% probability B is better" more intuitive than "p < 0.05"
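When you want to monitor results as they arrive -- Bayesian methods handle sequential testing more naturally, and interim looks are less of a problem when the prior and the decision threshold are fixed in advance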
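The "probability that B is better than A" figure can be estimated with a Beta-Binomial model and Monte Carlo sampling. The sketch below assumes a uniform Beta(1, 1) prior on each conversion rate; the function name, counts, and seed are hypothetical, and a different prior would change the result.

```python
import random


def prob_variant_beats_control(conv_c: int, n_c: int, conv_v: int, n_v: int,
                               draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(variant rate > control rate) from posterior samples."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_v = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        wins += rate_v > rate_c
    return wins / draws


# Hypothetical counts: 320/10,000 conversions in control, 370/10,000 in variant.
p_better = prob_variant_beats_control(320, 10_000, 370, 10_000)
print(f"P(variant beats control) = {p_better:.3f}")
```

The output is a statement about which variant is better, not about how much better. Pair it with an estimate of the lift before making a ship decision.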

Related skills: For causal questions when randomization isn't possible, use /causal-inference-guide. For choosing the right statistical test, use /statistical-test-selector.

Step 5c: Method validation variant (when proving a method works, not comparing alternatives)

When the experiment isn't comparing alternatives (A vs. B) but proving that a single method is fit for purpose, use method validation instead of A/B testing. This is common for: clinical scoring algorithms, biomarker calculations, AI clinical decision support, diagnostic tools, or any system where the question is "does this work?" not "which is better?"

When to use method validation instead of A/B testing:

There is no "control" -- you're validating a new capability, not comparing two versions
The method must work across a range of inputs, not just on average
Regulatory or safety requirements demand structured proof of fitness for purpose
The question is "is this method accurate and reliable?" not "is version B better than version A?"

Method validation studies:

Study	Question	How to run	Acceptance criteria
Accuracy	Does it give the right answer?	Compare outputs against a reference standard (expert panel, validated method, ground truth)	Agreement >= {{threshold}}% across all input categories
Precision (repeatability)	Same input, same results?	Run identical inputs {{n}} times; measure consistency	CV < {{threshold}}% or agreement >= {{threshold}}%
Precision (reproducibility)	Consistent across conditions?	Run across days, operators, system versions	CV < {{threshold}}% (looser than repeatability)
Linearity / range	Works across the full input range?	Test at 5+ difficulty/complexity levels	Performance degrades < {{threshold}} between levels
Specificity	Only measures what it should?	Test with known confounders, adversarial inputs	No clinically significant interference

Critical rule: Define acceptance criteria BEFORE running the studies. Deciding what "good enough" means after seeing results is not validation -- it's rationalization.
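A minimal sketch of a repeatability check, with the acceptance threshold written down as a constant before any measurement is taken. The 5% threshold and the measurement values are hypothetical placeholders for the {{threshold}} and {{n}} blanks in the table.

```python
from statistics import mean, stdev

# Acceptance criterion, fixed BEFORE the study runs (hypothetical value).
CV_THRESHOLD_PCT = 5.0


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation as a percentage of the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / abs(m) * 100


# Hypothetical outputs from running the same input five times.
repeat_runs = [10.0, 10.2, 9.8, 10.1, 9.9]
cv = coefficient_of_variation(repeat_runs)
passed = cv < CV_THRESHOLD_PCT
print(f"CV = {cv:.2f}% -> {'PASS' if passed else 'FAIL'}")
```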

Related skill: For a complete clinical validation protocol, use /clinical-validation-protocol.

Step 6: Review

Ask the user:

Is the hypothesis crisp enough? Does the team agree on what we're testing?
Is the test method proportional to what's at stake?
Are the success criteria specific enough to make a clear decision?
Who owns running this experiment?

Output location

Present the experiment plan as formatted text in the conversation.

Example Output

Input

Assumption to test: "If we add an AI-generated trip summary email sent 48 hours before departure, travelers will feel more prepared and contact our support team less frequently"
Target user: Leisure travelers who have booked multi-leg international trips through Elsewhere Travel Co.'s platform (average booking value: $4,200)
KPI: Inbound support contacts per booking in the 48-hour pre-departure window
What's at stake: ~$140K engineering investment to build the AI summarization pipeline, email rendering system, and personalization layer

Output

Experiment Plan: AI Pre-Departure Summary Email

Hypothesis

We believe that sending an AI-generated trip summary email 48 hours before departure to leisure travelers with multi-leg international bookings on Elsewhere Travel Co. will reduce pre-departure anxiety and answer the most common questions before travelers need to ask them. We'll know this is true when we see a ≥20% reduction in inbound support contacts per booking during the 48-hour pre-departure window that improves our support cost per booking KPI (currently $18.40/booking, driven largely by pre-departure volume).

Test Method: Wizard of Oz

Why this method: The full AI pipeline would take 6–8 weeks to build. Before committing, we can manually generate trip summaries using a template + a support agent spending ~15 minutes per booking. This tests whether travelers respond to the email (open, click, reduce contact behavior) without building the AI layer. If the manual version doesn't move the needle, the automated version won't either. A/B testing the real feature would require building it first — that's the $140K decision we're trying to de-risk.

What to Build / Prepare

Email template — Design a plain-text + HTML email in Mailchimp covering: flight sequence, check-in times, baggage rules, hotel confirmation numbers, local emergency contacts, and a "top 3 things to do the night before" checklist
Runbook for ops team — 1-page guide for support agents to populate the template from existing booking data in Salesforce; target 15 min/booking
Cohort selection — Pull 60 upcoming multi-leg international departures (departing 10–20 days out) and randomly assign 30 to receive the manual summary email; 30 as holdout control
Tagging in Zendesk — Add pre_departure_email_test tag to treated bookings so support contacts can be filtered by cohort
Send schedule — Trigger manually via Mailchimp 48 hours ± 2 hours before each departure
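The random 30/30 split in the cohort selection step can be made reproducible with a seeded shuffle, so the assignment can be audited later. The booking IDs, seed, and function name here are hypothetical placeholders.

```python
import random


def assign_cohorts(booking_ids: list[str], treated_count: int, seed: int):
    """Randomly split bookings into (treated, control) with a fixed seed."""
    ids = sorted(booking_ids)  # stable starting order, independent of input order
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[:treated_count], ids[treated_count:]


# Placeholder IDs standing in for the 60 upcoming departures.
bookings = [f"BK-{i:03d}" for i in range(1, 61)]
treated, control = assign_cohorts(bookings, treated_count=30, seed=2024)
print(len(treated), len(control))
```

Recording the seed alongside the cohort list lets anyone regenerate the same split and confirm that no booking was hand-picked into the treated group.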

Success Criteria

Metric	Target	Measurement
Primary: Inbound support contacts per booking (48-hr pre-departure window)	≥20% reduction vs. control	Zendesk ticket count by booking ID, cohort tag filter
Secondary: Email open rate	≥55%	Mailchimp campaign report
Secondary: Email click-through rate (at least one link)	≥25%	Mailchimp click map
Secondary: CSAT score (post-trip survey)	No degradation vs. control baseline of 4.3/5	Delighted post-trip CSAT survey
Guardrail: Escalated / urgent support contacts	Should not increase	Zendesk priority tag count by cohort
Guardrail: Agent time to prepare email	≤20 min/booking median	Ops team time log in Notion

Timeline

Preparation: 5 days — finalize email template, get brand/legal sign-off, write agent runbook, set up Zendesk tagging, pull cohort list
Execution: 21 days — rolling sends as departures hit the 48-hour mark across the 60-booking cohort
Analysis: 4 days — pull Zendesk data, Mailchimp stats, CSAT responses; write readout
Total: 30 days (~4.3 weeks)

Decision Framework

Result	Action
Primary metric meets or exceeds target (≥20% reduction)	Ship -- green-light the $140K AI pipeline build; email program is validated
Primary metric falls short of target but shows a clear signal (10–19% reduction)	Iterate -- test a revised template or send timing before committing to full build; consider a lighter automation approach first
Primary metric well below target (<10% reduction)	Kill -- archive learning; explore alternative support deflection strategies (e.g., in-app checklist, chatbot)
Mixed signals (contacts down, but CSAT or urgent escalations up)	Investigate -- email may be creating confusion or anxiety; run qualitative follow-up interviews with 5 travelers from each cohort before deciding

Risks and Mitigations

Sample size is modest (n=30 per cohort): With ~4 support contacts per booking on average, the treatment group needs to drop to ~3.2 to hit the 20% target. A quick power check puts this at roughly 70% power to detect a 20% effect at n=30, but that figure depends on how much contacts per booking vary, so confirm it against historical Zendesk data. This is acceptable for a go/no-go investment decision at the discovery stage; it is not a regulatory decision. If results are borderline, extend to 50 bookings per cohort before deciding.
Operator fatigue inflating prep time: If agents take >20 min/booking, the manual test becomes unrepresentative of the AI version's speed advantage. Monitor the Notion time log weekly and flag if median exceeds threshold.
Cohort imbalance: Bookings are pulled by departure-date proximity, and random assignment at n=30 per cohort does not guarantee that the two groups match on traveler profile. Check that both cohorts are balanced on trip complexity (number of legs) and booking value before analyzing.
Novelty effect in open rates: Travelers have never received this email before — open rates may be inflated by novelty. Weight the support contact reduction metric more heavily than email engagement metrics when making the decision.
Timing variance: "48 hours ± 2 hours" is manual; some sends may slip. Log actual send-to-departure gap in the Mailchimp notes field and exclude any booking where the email was sent <24 hours before departure.
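The power check mentioned under the sample-size risk can be sketched with a normal approximation for comparing two means. The plan does not state the standard deviation of contacts per booking, so `sd` below is an explicit assumption, and the result is very sensitive to it. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_means(mean_control: float, mean_treated: float,
                           sd: float, n_per_group: int,
                           alpha: float = 0.05) -> float:
    """Approximate two-sided power for a difference in means (z-test)."""
    se = sd * sqrt(2 / n_per_group)  # assumes equal SD and equal group sizes
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(mean_control - mean_treated) / se
    normal = NormalDist()
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)


# 4.0 -> 3.2 contacts per booking is the 20% target from the plan.
# The SD values are assumptions; replace them with the historical figure.
for assumed_sd in (1.25, 2.0):
    power = power_two_sample_means(4.0, 3.2, assumed_sd, n_per_group=30)
    print(f"assumed SD = {assumed_sd}: power = {power:.2f}")
```

Running this with the real standard deviation from past Zendesk data shows whether 30 bookings per cohort is enough, or whether the larger cohort should be planned from the start.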

Step 6 Review Prompts for the Team

Do we agree that a 20% reduction in pre-departure contacts is the right bar — or does legal/finance have a different ROI threshold for the $140K build?
Who owns the ops runbook and agent training? (Recommended: Customer Experience Lead)
Is 30 bookings per cohort enough for the confidence level this decision requires, or should we wait for a larger natural cohort?
What's our plan if the Wizard of Oz test succeeds but the AI-generated copy quality is meaningfully lower than human-written summaries?
