Experiment Design

You need to define a test for a product hypothesis.

Use this when you have a product assumption you want to validate before investing significant development time. The output defines the hypothesis, test method, success criteria, minimum viable test, timeline, and a decision framework (ship/iterate/kill).

Related resources: discovery-assumptions-workshop.md -- 4-step team activity for surfacing and testing riskiest assumptions (standalone, printable).

Process

Step 1: Gather the hypothesis

Ask the user:

  1. What do you believe? — the assumption to test (e.g., "users will pay for a premium tier," "adding search will reduce support tickets")
  2. Who is the target user? — who does this affect?
  3. What outcome do you expect? — what should change if you're right?
  4. What KPI does this connect to? — what business metric improves?
  5. What's at stake? — how much are you planning to invest if the hypothesis is true? (This calibrates how rigorous the test needs to be.)

Step 2: Structure the hypothesis

Format using the standard template:

We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).

Step 3: Select the test method

Recommend the cheapest/fastest way to validate the hypothesis:

| Method | When to use | Effort | Confidence |
|--------|-------------|--------|------------|
| Customer interviews (5-8) | Exploring a new problem space or need | Low | Medium |
| Paper/Figma prototype test | Validating a UX flow or interaction pattern | Low | Medium |
| Fake door / painted door | Testing demand for a feature before building it | Low | High |
| Landing page + signup | Testing demand with a commitment signal (email, waitlist) | Low | High |
| Wizard of Oz | Testing an experience where the backend is manual | Medium | High |
| Concierge MVP | Delivering the service manually to a small group | Medium | High |
| Letter of intent | Getting written commitment (LOI, pre-order, deposit) before building | Low | High |
| Pop-up store | Temporary physical or digital presence to test demand in a new market | Medium | High |
| Partner interview | Testing feasibility/viability assumptions through potential partners, not customers | Low | Medium |
| A/B test | Comparing two approaches with real usage data | Medium-High | Very High |
| Feature flag rollout | Gradually releasing to a % of users | High | Very High |

Choose the method that provides enough confidence for the decision at stake. Don't A/B test what you can validate with 5 interviews.

Start with LEARN, not BUILD. The lean startup loop is Build-Measure-Learn, but most teams default to Build-Build-Build. Flip it: start with what you need to learn, then design the cheapest test that would teach you that. If you cannot name the assumption this experiment tests, you may be in the "illusion of productivity" -- building to feel productive rather than building to learn.

If A/B test is selected, continue to Step 3b. Otherwise, skip to Step 4.

Step 3b: A/B test design (only when A/B test is selected)

When A/B test is the chosen method, gather additional details and produce the extended A/B test plan below.

Null and alternative hypotheses

Restate the hypothesis in formal terms:

H₀ (null): There is no difference in (primary metric) between the control and the variant.

H₁ (alternative): The variant produces a (direction: higher/lower) (primary metric) than the control by at least (minimum detectable effect).

Ask the user:

  • What is the current baseline value for the primary metric? (e.g., "checkout conversion is 3.2%")
  • What is the minimum improvement that would justify shipping the variant? (This is the minimum detectable effect / MDE.)

Sample size calculation

Estimate the required sample size per variant using these inputs:

| Parameter | Value |
|-----------|-------|
| Baseline conversion rate | (current value, e.g., 3.2%) |
| Minimum detectable effect (MDE) | (absolute or relative, e.g., +0.5pp or +15% relative) |
| Statistical significance level (α) | 0.05 (default; 95% confidence) |
| Statistical power (1 − β) | 0.80 (default; 80% power) |
| Number of variants | (2 for simple A/B, more for multivariate) |
| One-tailed or two-tailed test | (two-tailed unless there is a strong directional prior) |

Estimated sample size per variant: Use the standard formula or reference a calculator (e.g., Evan Miller, Optimizely, or VWO sample size calculator). State the number clearly.
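
As an illustration of the standard formula, here is a minimal Python sketch for a two-sided test comparing two conversion rates, using the normal approximation. The function name and the example inputs (3.2% baseline, +0.5pp absolute MDE) are illustrative assumptions; online calculators may use slightly different approximations and return somewhat different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided, two-proportion z-test.

    baseline: control conversion rate (e.g. 0.032 for 3.2%)
    mde_abs:  minimum detectable effect in absolute terms (e.g. 0.005 for +0.5pp)
    """
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


n = sample_size_per_variant(baseline=0.032, mde_abs=0.005)
print(f"Required sample size per variant: {n:,}")
```

With these inputs the result is on the order of 21,000 users per variant. Halving the MDE roughly quadruples the requirement, which is why the MDE should reflect the smallest improvement that would actually justify shipping.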

Estimated test duration: Given current daily traffic to the test surface, how many days to reach the required sample size? Include a recommendation to not stop the test early based on interim results (peeking problem).
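
The duration estimate is simple arithmetic, sketched below in Python. The function name, the traffic figure, and the sample size are hypothetical. Rounding up to whole weeks is an optional convention assumed here so that every weekday is represented equally; it is not a requirement stated in this plan.

```python
from math import ceil


def estimated_duration_days(n_per_variant, num_variants, daily_traffic, round_to_weeks=True):
    """Days needed for every variant to reach its sample size target.

    daily_traffic: eligible users per day entering the test, across all variants.
    round_to_weeks: round up to whole weeks (an assumption of this sketch).
    """
    days = ceil(n_per_variant * num_variants / daily_traffic)
    if round_to_weeks:
        days = ceil(days / 7) * 7
    return days


days = estimated_duration_days(n_per_variant=21_000, num_variants=2, daily_traffic=2_500)
print(f"Run the test for {days} days before reading results")
```

Fixing this number before launch, and writing it into the plan, is what makes the "no early stopping" commitment enforceable.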

Significance thresholds and decision rules

| Decision | Condition |
|----------|-----------|
| Ship variant | p-value < α AND effect size ≥ MDE AND no guardrail violations |
| Iterate | p-value < α but effect size < MDE, or guardrail metric shows minor degradation |
| Kill variant | p-value ≥ α (no significant difference) OR guardrail violation |
| Extend test | Observed effect is promising but sample size is below target; extend, do not peek-and-decide |
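
The decision rules can be written down as a small function so that the outcome is fixed before the data arrives. This Python sketch is one possible encoding. The function name, the `guardrail` labels, and the order of precedence (guardrail violation first, then sample size, then significance) are assumptions, since the table does not say which rule wins when several apply.

```python
def ab_decision(p_value, effect, mde, alpha=0.05,
                sample_reached=True, guardrail="ok"):
    """Map A/B test results to ship / iterate / kill / extend.

    effect and mde use the same units, signed so that positive means improvement.
    guardrail: "ok", "minor" (minor degradation) or "violation".
    """
    if guardrail == "violation":
        return "kill"      # a guardrail violation ends the test regardless of lift
    if not sample_reached:
        return "extend"    # keep running to the pre-committed sample size
    if p_value >= alpha:
        return "kill"      # no significant difference at full sample
    if effect >= mde and guardrail == "ok":
        return "ship"
    return "iterate"       # significant but below MDE, or minor guardrail degradation


print(ab_decision(p_value=0.01, effect=0.006, mde=0.005))
print(ab_decision(p_value=0.01, effect=0.003, mde=0.005))
```

Agreeing on the precedence order with stakeholders up front avoids re-litigating the rules once results are in.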

Correction for multiple comparisons: If testing more than 2 variants, apply Bonferroni correction (α / number of comparisons) or use a sequential testing framework.
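
A short Python sketch of the Bonferroni adjustment follows. The variant names and p-values are hypothetical, and each variant is assumed to be compared against the control only.

```python
def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / num_comparisons


# Hypothetical p-values for three variants, each compared against control.
p_values = {"B": 0.030, "C": 0.012, "D": 0.200}
threshold = bonferroni_alpha(0.05, len(p_values))
significant = [name for name, p in p_values.items() if p < threshold]
print(f"Per-comparison threshold: {threshold:.4f}; significant: {significant}")
```

Variant B would pass at the uncorrected 0.05 level but fails the corrected threshold, which is the kind of false positive the correction is meant to guard against.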

Variant design

Document each variant clearly:

| Variant | Name | Description | What changes from control |
|---------|------|-------------|---------------------------|
| Control (A) | (name) | (current experience) | (no changes) |
| Variant (B) | (name) | (describe the change) | (specific UI/flow/copy/logic differences) |
| (C, D...) | (if multivariate) | (describe) | (differences) |

Isolation rule: Each variant should change one thing unless you are deliberately testing a bundle. If you're testing multiple changes, you have a multivariate test and need more traffic.

Rollout and decision criteria

| Phase | Traffic allocation | Duration | Decision point |
|-------|--------------------|----------|----------------|
| Burn-in | 5-10% per variant | 1-2 days | Verify instrumentation is working, no crashes or errors |
| Ramp | 50/50 (or equal split) | Until sample size target is reached | Monitor guardrail metrics daily |
| Decision | | After full sample collected | Apply decision rules above |
| Rollout | 100% to winner | | Full deploy with monitoring for 1 week post-rollout |

Post-rollout monitoring: After shipping the winning variant to 100%, monitor the primary metric for at least 7 days to confirm the effect persists outside the test context and was not an artifact of a novelty effect or selection bias.

Risks and ethical considerations

Assess and document:

| Risk | Mitigation |
|------|------------|
| Peeking / early stopping | Pre-commit to the sample size. Do not make decisions on interim results unless using a sequential testing framework. |
| Novelty effect | The variant may win because it's new, not because it's better. Monitor post-rollout for regression. |
| Simpson's paradox | Segment results by key dimensions (device, geography, user tenure) to check for contradictory sub-group effects. |
| User harm | If the variant could degrade the experience for a segment (e.g., accessibility, performance), define a kill switch and monitoring threshold. |
| Revenue / compliance impact | For tests affecting pricing, payments, or regulated features, get legal/compliance review before launch. |
| Contamination | Ensure users are consistently bucketed (same user always sees the same variant). Use sticky assignment by user ID, not session. |
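
Sticky assignment by user ID is commonly done by hashing the ID rather than storing a lookup table. The Python sketch below shows one way to do it. The function name, the experiment key, and the choice of SHA-256 are assumptions for illustration, not a description of any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "variant")):
    """Deterministically map a user to a variant.

    Hashing the experiment name together with the user ID means the same user
    always lands in the same bucket, and separate experiments bucket independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The same user gets the same answer on every call, across sessions and devices.
print(assign_variant("user-1042", "checkout-copy-test"))
print(assign_variant("user-1042", "checkout-copy-test"))
```

Because the assignment depends only on the user ID and the experiment name, it survives logouts and new sessions, which a session-based split would not.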

Step 4: Define success criteria

Specify:

  • Primary metric — the one number that determines success (e.g., "15% of visitors click 'Learn More'")
  • Secondary metrics — supporting signals (e.g., "average time on page > 30 seconds")
  • Guardrail metrics — things that should NOT get worse (e.g., "support ticket volume doesn't increase")
  • Sample size / duration — how many participants or how long the test runs before deciding

Step 5: Generate the experiment plan

Output in this format:


Experiment Plan: (experiment name)

Hypothesis

We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).

Test method: (method name)

Why this method: (rationale for choosing this over alternatives)

What to build / prepare

  • (Specific deliverable — e.g., "Figma prototype with 3 screens covering the signup flow")
  • (Setup step — e.g., "recruit 6 users matching our target persona")

Success criteria

| Metric | Target | Measurement |
|--------|--------|-------------|
| Primary: (metric) | (threshold) | (how to measure) |
| Secondary: (metric) | (threshold) | (how to measure) |
| Guardrail: (metric) | (should not exceed) | (how to measure) |

Timeline

  • Preparation: (X days) — build the test artifact
  • Execution: (X days/weeks) — run the experiment
  • Analysis: (X days) — review results and decide
  • Total: (X days/weeks)

Decision framework

| Result | Action |
|--------|--------|
| Primary metric meets or exceeds target | Ship: move to full implementation |
| Primary metric falls short of target but shows a meaningful effect | Iterate: refine and retest, or proceed with caution |
| Primary metric well below target (little or no effect) | Kill: archive the learning, move on |
| Mixed signals (primary up, guardrail down) | Investigate: dig deeper before deciding |

Risks and mitigations

  • (What could invalidate the test — e.g., "sample too small to be conclusive")
  • (Mitigation — e.g., "extend test duration to 2 weeks if sample is below target")

Step 5b: Effect size and confidence intervals (for quantitative tests)

When the test produces quantitative results, report effect sizes and confidence intervals alongside p-values:

### Effect Size and Precision

| Metric | Control | Variant | Difference | 95% CI | Effect size |
|--------|---------|---------|-----------|--------|-------------|
| Primary: (metric) | (value) | (value) | (absolute diff) | [lower, upper] | (Cohen's d or relative %) |

**Interpretation:**
- The 95% confidence interval means: if we repeated this experiment many times, 95% of the intervals would contain the true effect.
- CI width indicates precision. A CI of [+0.5%, +8.0%] is much less precise than [+3.0%, +5.5%], even if both are "significant."
- If the CI includes zero, the true effect could be no change -- even if the point estimate is positive.
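
To make the interval concrete, here is a minimal Python sketch that computes the absolute difference and a Wald confidence interval for two conversion rates. The function name and the conversion counts are hypothetical. The Wald interval is only one of several options and can be inaccurate for very small samples or rates near 0% or 100%.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_proportions_ci(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    diff = p_v - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 640 of 20,000 control users converted vs. 740 of 20,000 variant users.
diff, lower, upper = diff_in_proportions_ci(640, 20_000, 740, 20_000)
print(f"Difference: {diff:+.2%}, 95% CI: [{lower:+.2%}, {upper:+.2%}]")
```

In this example the interval excludes zero, but its lower bound sits well under the point estimate, so the true lift could be much smaller than the observed one.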

When to consider Bayesian analysis:

  • When you need to make a decision with limited data (small sample, can't wait for full power)
  • When you want to express results as "probability that B is better than A" rather than "reject/fail to reject the null"
  • When stakeholders find "95% probability B is better" more intuitive than "p < 0.05"
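  • When results will be monitored as data arrives: Bayesian updating handles sequential looks more naturally, but a pre-specified stopping rule is still needed, since stopping the moment the posterior looks favorable can still overstate the evidence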
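
For conversion metrics, "probability that B is better than A" can be estimated with a Beta-Binomial model and Monte Carlo sampling. The Python sketch below assumes uniform Beta(1, 1) priors and hypothetical conversion counts; the function name is illustrative, and a real analysis should choose priors deliberately.

```python
import random


def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=20_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate).

    Uses independent Beta(1, 1) priors (an assumption), so each posterior is
    Beta(conversions + 1, non-conversions + 1).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        rate_c = rng.betavariate(conv_c + 1, n_c - conv_c + 1)
        rate_v = rng.betavariate(conv_v + 1, n_v - conv_v + 1)
        wins += rate_v > rate_c
    return wins / draws


p_better = prob_variant_beats_control(640, 20_000, 740, 20_000)
print(f"P(variant beats control) is about {p_better:.1%}")
```

This answers "how likely is B to be better" but not "by how much", so it works best alongside an interval on the size of the effect.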

Related skills: For causal questions when randomization isn't possible, use /causal-inference-guide. For choosing the right statistical test, use /statistical-test-selector.

Step 5c: Method validation variant (when proving a method works, not comparing alternatives)

When the experiment isn't comparing alternatives (A vs. B) but proving that a single method is fit for purpose, use method validation instead of A/B testing. This is common for: clinical scoring algorithms, biomarker calculations, AI clinical decision support, diagnostic tools, or any system where the question is "does this work?" not "which is better?"

When to use method validation instead of A/B testing:

  • There is no "control" -- you're validating a new capability, not comparing two versions
  • The method must work across a range of inputs, not just on average
  • Regulatory or safety requirements demand structured proof of fitness for purpose
  • The question is "is this method accurate and reliable?" not "is version B better than version A?"

Method validation studies:

| Study | Question | How to run | Acceptance criteria |
|-------|----------|------------|---------------------|
| Accuracy | Does it give the right answer? | Compare outputs against a reference standard (expert panel, validated method, ground truth) | Agreement >= {{threshold}}% across all input categories |
| Precision (repeatability) | Same input, same results? | Run identical inputs {{n}} times; measure consistency | CV < {{threshold}}% or agreement >= {{threshold}}% |
| Precision (reproducibility) | Consistent across conditions? | Run across days, operators, system versions | CV < {{threshold}}% (looser than repeatability) |
| Linearity / range | Works across the full input range? | Test at 5+ difficulty/complexity levels | Performance degrades < {{threshold}} between levels |
| Specificity | Only measures what it should? | Test with known confounders, adversarial inputs | No clinically significant interference |

Critical rule: Define acceptance criteria BEFORE running the studies. Deciding what "good enough" means after seeing results is not validation -- it's rationalization.
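
As an example of a pre-registered precision check, the Python sketch below computes the coefficient of variation (CV) for repeated runs and compares it with a threshold fixed in advance. The 5% threshold and the scores are hypothetical placeholders, not recommended values.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV as a percentage: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values) * 100


CV_THRESHOLD_PCT = 5.0  # hypothetical acceptance criterion, fixed before the study

# Hypothetical scores from running the same input five times.
repeat_scores = [10.0, 10.2, 9.8, 10.1, 9.9]
cv = coefficient_of_variation(repeat_scores)
print(f"CV = {cv:.2f}% -> {'pass' if cv < CV_THRESHOLD_PCT else 'fail'}")
```

Keeping the threshold as a named constant that is committed before the study runs makes it harder to adjust after seeing the results.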

Related skill: For a complete clinical validation protocol, use /clinical-validation-protocol.

Step 6: Review

Ask the user:

  • Is the hypothesis crisp enough? Does the team agree on what we're testing?
  • Is the test method proportional to what's at stake?
  • Are the success criteria specific enough to make a clear decision?
  • Who owns running this experiment?

Output location

Present the experiment plan as formatted text in the conversation.

Example Output

Input

  • Assumption to test: "If we add an AI-generated trip summary email sent 48 hours before departure, travelers will feel more prepared and contact our support team less frequently"
  • Target user: Leisure travelers who have booked multi-leg international trips through Elsewhere Travel Co.'s platform (average booking value: $4,200)
  • KPI: Inbound support contacts per booking in the 48-hour pre-departure window
  • What's at stake: ~$140K engineering investment to build the AI summarization pipeline, email rendering system, and personalization layer

Output

Experiment Plan: AI Pre-Departure Summary Email

Hypothesis

We believe that sending an AI-generated trip summary email 48 hours before departure for leisure travelers with multi-leg international bookings on Elsewhere Travel Co. will reduce pre-departure anxiety and answer the most common questions before they arise. We'll know this is true when we see a ≥20% reduction in inbound support contacts per booking during the 48-hour pre-departure window that improves our support cost per booking KPI (currently $18.40/booking, driven largely by pre-departure volume).


Test Method: Wizard of Oz

Why this method: The full AI pipeline would take 6–8 weeks to build. Before committing, we can manually generate trip summaries using a template + a support agent spending ~15 minutes per booking. This tests whether travelers respond to the email (open, click, reduce contact behavior) without building the AI layer. If the manual version doesn't move the needle, the automated version won't either. A/B testing the real feature would require building it first — that's the $140K decision we're trying to de-risk.


What to Build / Prepare

  • Email template — Design a plain-text + HTML email in Mailchimp covering: flight sequence, check-in times, baggage rules, hotel confirmation numbers, local emergency contacts, and a "top 3 things to do the night before" checklist
  • Runbook for ops team — 1-page guide for support agents to populate the template from existing booking data in Salesforce; target 15 min/booking
  • Cohort selection — Pull 60 upcoming multi-leg international departures (departing 10–20 days out) and randomly assign 30 to receive the manual summary email; 30 as holdout control
  • Tagging in Zendesk — Add pre_departure_email_test tag to treated bookings so support contacts can be filtered by cohort
  • Send schedule — Trigger manually via Mailchimp 48 hours ± 2 hours before each departure

Success Criteria

| Metric | Target | Measurement |
|--------|--------|-------------|
| Primary: Inbound support contacts per booking (48-hr pre-departure window) | ≥20% reduction vs. control | Zendesk ticket count by booking ID, cohort tag filter |
| Secondary: Email open rate | ≥55% | Mailchimp campaign report |
| Secondary: Email click-through rate (at least one link) | ≥25% | Mailchimp click map |
| Secondary: CSAT score (post-trip survey) | No degradation vs. control baseline of 4.3/5 | Delighted post-trip CSAT survey |
| Guardrail: Escalated / urgent support contacts | Should not increase | Zendesk priority tag count by cohort |
| Guardrail: Agent time to prepare email | ≤20 min/booking median | Ops team time log in Notion |

Timeline

  • Preparation: 5 days — finalize email template, get brand/legal sign-off, write agent runbook, set up Zendesk tagging, pull cohort list
  • Execution: 21 days — rolling sends as departures hit the 48-hour mark across the 60-booking cohort
  • Analysis: 4 days — pull Zendesk data, Mailchimp stats, CSAT responses; write readout
  • Total: ~4.5 weeks

Decision Framework

| Result | Action |
|--------|--------|
| Primary metric meets or exceeds target (≥20% reduction) | Ship: green-light the $140K AI pipeline build; email program is validated |
| Primary metric falls short of target but shows a meaningful effect (10–19% reduction) | Iterate: test a revised template or send timing before committing to full build; consider a lighter automation approach first |
| Primary metric well below target (<10% reduction) | Kill: archive learning; explore alternative support deflection strategies (e.g., in-app checklist, chatbot) |
| Mixed signals (contacts down, but CSAT or urgent escalations up) | Investigate: email may be creating confusion or anxiety; run qualitative follow-up interviews with 5 travelers from each cohort before deciding |

Risks and Mitigations

  • Sample size is modest (n=30 per cohort): With ~4 support contacts per booking on average, we need the treatment group to drop to ~3.2 to hit the 20% target. The plan assumes ~70% power to detect a 20% effect at n=30; confirm this with a quick power check against the historical variance in contacts per booking before launch, because noisier contact counts lower the power. That level is acceptable for a go/no-go on investing in discovery, though it would not be for a regulatory decision. If results are borderline, extend to 50 bookings per cohort before deciding.
  • Operator fatigue inflating prep time: If agents take >20 min/booking, the manual test becomes unrepresentative of the AI version's speed advantage. Monitor the Notion time log weekly and flag if median exceeds threshold.
  • Cohort imbalance: Bookings are randomly assigned, but with only 30 per cohort the groups can still differ by chance. Check that both cohorts are balanced on trip complexity (number of legs) and booking value before analyzing.
  • Novelty effect in open rates: Travelers have never received this email before — open rates may be inflated by novelty. Weight the support contact reduction metric more heavily than email engagement metrics when making the decision.
  • Timing variance: "48 hours ± 2 hours" is manual; some sends may slip. Log actual send-to-departure gap in the Mailchimp notes field and exclude any booking where the email was sent <24 hours before departure.
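
The power figure in the first risk depends heavily on how variable contacts per booking are, which the plan does not state. The Python sketch below runs a rough normal-approximation check under the assumption that contacts per booking are Poisson-distributed, so the variance equals the mean. That assumption and the function name are illustrative only. Under it, the estimate comes out noticeably lower than the ~70% quoted, so the figure is worth verifying against historical Zendesk data.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(mean_c, mean_t, var_c, var_t, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test on group means."""
    se = sqrt(var_c / n_per_group + var_t / n_per_group)
    z_effect = abs(mean_c - mean_t) / se
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


# Poisson assumption: variance equals the mean (4.0 for control, 3.2 for treated).
power = power_two_sample(mean_c=4.0, mean_t=3.2, var_c=4.0, var_t=3.2, n_per_group=30)
print(f"Approximate power at n=30 per cohort: {power:.0%}")
```

Real support contact counts are often more dispersed than Poisson, which would lower the power further. A one-sided test or a larger cohort would raise it.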

Step 6 Review Prompts for the Team

  • Do we agree that a 20% reduction in pre-departure contacts is the right bar — or does legal/finance have a different ROI threshold for the $140K build?
  • Who owns the ops runbook and agent training? (Recommended: Customer Experience Lead)
  • Is 30 bookings per cohort enough for the confidence level this decision requires, or should we wait for a larger natural cohort?
  • What's our plan if the Wizard of Oz test succeeds but the AI-generated copy quality is meaningfully lower than human-written summaries?