Use this when you have a product assumption you want to validate before investing significant development time. Defines the hypothesis, test method, success criteria, minimum viable test, timeline, and a decision framework (ship/iterate/kill).
Related resources:
discovery-assumptions-workshop.md -- 4-step team activity for surfacing and testing riskiest assumptions (standalone, printable).
Process
Step 1: Gather the hypothesis
Ask the user:
- What do you believe? — the assumption to test (e.g., "users will pay for a premium tier," "adding search will reduce support tickets")
- Who is the target user? — who does this affect?
- What outcome do you expect? — what should change if you're right?
- What KPI does this connect to? — what business metric improves?
- What's at stake? — how much are you planning to invest if the hypothesis is true? (This calibrates how rigorous the test needs to be.)
Step 2: Structure the hypothesis
Format using the standard template:
We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).
Step 3: Select the test method
Recommend the cheapest/fastest way to validate the hypothesis:
| Method | When to use | Effort | Confidence |
|---|---|---|---|
| Customer interviews (5-8) | Exploring a new problem space or need | Low | Medium |
| Paper/Figma prototype test | Validating a UX flow or interaction pattern | Low | Medium |
| Fake door / painted door | Testing demand for a feature before building it | Low | High |
| Landing page + signup | Testing demand with a commitment signal (email, waitlist) | Low | High |
| Wizard of Oz | Testing an experience where the backend is manual | Medium | High |
| Concierge MVP | Delivering the service manually to a small group | Medium | High |
| Letter of intent | Getting written commitment (LOI, pre-order, deposit) before building | Low | High |
| Pop-up store | Temporary physical or digital presence to test demand in a new market | Medium | High |
| Partner interview | Testing feasibility/viability assumptions through potential partners, not customers | Low | Medium |
| A/B test | Comparing two approaches with real usage data | Medium-High | Very High |
| Feature flag rollout | Gradually releasing to a % of users | High | Very High |
Choose the method that provides enough confidence for the decision at stake. Don't A/B test what you can validate with 5 interviews.
Start with LEARN, not BUILD. The lean startup loop is Build-Measure-Learn, but most teams default to Build-Build-Build. Flip it: start with what you need to learn, then design the cheapest test that would teach you that. If you cannot name the assumption this experiment tests, you may be in the "illusion of productivity" -- building to feel productive rather than building to learn.
If A/B test is selected, continue to Step 3b. Otherwise, skip to Step 4.
Step 3b: A/B test design (only when A/B test is selected)
When A/B test is the chosen method, gather additional details and produce the extended A/B test plan below.
Null and alternative hypotheses
Restate the hypothesis in formal terms:
H₀ (null): There is no difference in (primary metric) between the control and the variant.

H₁ (alternative): The variant produces a (direction: higher/lower) (primary metric) than the control by at least (minimum detectable effect).
Ask the user:
- What is the current baseline value for the primary metric? (e.g., "checkout conversion is 3.2%")
- What is the minimum improvement that would justify shipping the variant? (This is the minimum detectable effect / MDE.)
Sample size calculation
Estimate the required sample size per variant using these inputs:
| Parameter | Value |
|---|---|
| Baseline conversion rate | (current value, e.g., 3.2%) |
| Minimum detectable effect (MDE) | (absolute or relative, e.g., +0.5pp or +15% relative) |
| Statistical significance level (α) | 0.05 (default — 95% confidence) |
| Statistical power (1 − β) | 0.80 (default — 80% power) |
| Number of variants | (2 for simple A/B, more for multivariate) |
| One-tailed or two-tailed test | (two-tailed unless there is a strong directional prior) |
Estimated sample size per variant: Use the standard formula or reference a calculator (e.g., Evan Miller, Optimizely, or VWO sample size calculator). State the number clearly.
Estimated test duration: Given current daily traffic to the test surface, how many days to reach the required sample size? Include a recommendation to not stop the test early based on interim results (peeking problem).
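A minimal sketch of the sample size and duration estimate, assuming a two-proportion comparison under the normal approximation with unpooled variance. The function names and the daily traffic figure are hypothetical; the baseline and MDE reuse the example values from the table above. Online calculators use slightly different variance formulas, so expect their numbers to differ by a few percent.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate n per variant for comparing two proportions (normal approximation)."""
    p1 = baseline
    p2 = baseline + mde_abs  # MDE expressed in absolute terms (0.005 = +0.5pp)
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Days to reach the target if all eligible daily traffic is split evenly."""
    return math.ceil(n_per_variant * variants / daily_visitors)


# Baseline 3.2%, MDE +0.5pp, defaults alpha=0.05 and power=0.80.
n = sample_size_per_variant(baseline=0.032, mde_abs=0.005)
# 3000 visitors/day is a made-up traffic figure for illustration.
days = duration_days(n, daily_visitors=3000)
print(f"{n} per variant, about {days} days")
```

With these inputs the estimate comes out to roughly 20,900 users per variant, which is about two weeks at the assumed traffic. Fix that duration before launch and do not shorten it based on interim results.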
Significance thresholds and decision rules
| Decision | Condition |
|---|---|
| Ship variant | p-value < α AND effect size ≥ MDE AND no guardrail violations |
| Iterate | p-value < α but effect size < MDE, or guardrail metric shows minor degradation |
| Kill variant | p-value ≥ α at the full target sample size (no significant difference detected) OR guardrail violation |
| Extend test | Observed effect is promising but sample size is below target — extend, do not peek-and-decide |
Correction for multiple comparisons: If testing more than 2 variants, apply Bonferroni correction (α / number of comparisons) or use a sequential testing framework.
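As a rough sketch, the decision table and the Bonferroni adjustment can be written down as code so the rules are fixed before results arrive. The function and argument names are hypothetical, and the precedence (guardrail violation first, then incomplete sample) is an assumption, since the table does not specify an order. `effect` is assumed to be signed so that a positive value means movement in the desired direction.

```python
def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / comparisons


def decide(p_value, effect, mde, alpha=0.05, sample_complete=True,
           guardrail_violated=False, guardrail_minor=False):
    """Map test results onto ship / iterate / kill / extend."""
    if guardrail_violated:
        return "kill"      # a guardrail violation overrides everything else
    if not sample_complete:
        return "extend"    # never decide on a partial sample
    if p_value >= alpha:
        return "kill"      # no significant difference at full sample size
    if effect >= mde and not guardrail_minor:
        return "ship"
    return "iterate"       # significant but below MDE, or minor guardrail degradation


print(decide(p_value=0.01, effect=0.006, mde=0.005))                 # ship
print(decide(p_value=0.01, effect=0.003, mde=0.005))                 # iterate
print(decide(p_value=0.03, effect=0.006, mde=0.005,
             alpha=bonferroni_alpha(0.05, comparisons=3)))           # kill
```

The last call shows why the correction matters: p = 0.03 would pass at α = 0.05 but fails once α is divided across three comparisons.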
Variant design
Document each variant clearly:
| Variant | Name | Description | What changes from control |
|---|---|---|---|
| Control (A) | (name) | (current experience — no changes) | — |
| Variant (B) | (name) | (describe the change) | (specific UI/flow/copy/logic differences) |
| (C, D...) | (if multivariate) | (describe) | (differences) |
Isolation rule: Each variant should change one thing unless you are deliberately testing a bundle. If you're testing multiple changes, you have a multivariate test and need more traffic.
Rollout and decision criteria
| Phase | Traffic allocation | Duration | Decision point |
|---|---|---|---|
| Burn-in | 5-10% per variant | 1-2 days | Verify instrumentation is working, no crashes or errors |
| Ramp | 50/50 (or equal split) | Until sample size target is reached | Monitor guardrail metrics daily |
| Decision | — | After full sample collected | Apply decision rules above |
| Rollout | 100% to winner | — | Full deploy with monitoring for 1 week post-rollout |
Post-rollout monitoring: After shipping the winning variant to 100%, monitor the primary metric for at least 7 days to confirm the effect persists outside the test context (novelty effect, selection bias).
Risks and ethical considerations
Assess and document:
| Risk | Mitigation |
|---|---|
| Peeking / early stopping | Pre-commit to the sample size. Do not make decisions on interim results unless using a sequential testing framework. |
| Novelty effect | The variant may win because it's new, not because it's better. Monitor post-rollout for regression. |
| Simpson's paradox | Segment results by key dimensions (device, geography, user tenure) to check for contradictory sub-group effects. |
| User harm | If the variant could degrade the experience for a segment (e.g., accessibility, performance), define a kill switch and monitoring threshold. |
| Revenue / compliance impact | For tests affecting pricing, payments, or regulated features, get legal/compliance review before launch. |
| Contamination | Ensure users are consistently bucketed (same user always sees the same variant). Use sticky assignment by user ID, not session. |
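One common way to get the sticky assignment described in the contamination row is deterministic hashing of the user ID. The sketch below is an illustration under stated assumptions, not a prescribed implementation: the function names, the key format, and the choice of SHA-256 are all hypothetical. It uses separate hashes for exposure and for variant choice so that ramping traffic from the burn-in share to the full split does not move an already-exposed user to a different variant.

```python
import hashlib


def _unit_interval(key):
    """Hash a string to a stable number in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32


def assign_variant(user_id, experiment, variants=("control", "variant"), exposure=1.0):
    """Return the user's variant, or None if the user is outside the exposed traffic."""
    # Exposure gate: decides whether the user is in the experiment at all.
    if _unit_interval(f"{experiment}:exposure:{user_id}") >= exposure:
        return None
    # Variant choice: independent of exposure, so it stays fixed during ramp-up.
    slot = _unit_interval(f"{experiment}:variant:{user_id}")
    return variants[int(slot * len(variants))]


# Same user and experiment always yield the same answer, across sessions and servers.
print(assign_variant("user-123", "checkout-copy-test"))
print(assign_variant("user-123", "checkout-copy-test"))
```

Keying on the user ID rather than a session ID is what keeps a returning user in the same bucket. Including the experiment name in the key keeps bucketing independent across experiments.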
Step 4: Define success criteria
Specify:
- Primary metric — the one number that determines success (e.g., "15% of visitors click 'Learn More'")
- Secondary metrics — supporting signals (e.g., "average time on page > 30 seconds")
- Guardrail metrics — things that should NOT get worse (e.g., "support ticket volume doesn't increase")
- Sample size / duration — how many participants or how long the test runs before deciding
Step 5: Generate the experiment plan
Output in this format:
Experiment Plan: (experiment name)
Hypothesis
We believe that (doing this) for (these people) will achieve (this outcome). We'll know this is true when we see (measurable signal) that improves (this KPI).
Test method: (method name)
Why this method: (rationale for choosing this over alternatives)
What to build / prepare
- (Specific deliverable — e.g., "Figma prototype with 3 screens covering the signup flow")
- (Setup step — e.g., "recruit 6 users matching our target persona")
Success criteria
| Metric | Target | Measurement |
|---|---|---|
| Primary: (metric) | (threshold) | (how to measure) |
| Secondary: (metric) | (threshold) | (how to measure) |
| Guardrail: (metric) | (should not exceed) | (how to measure) |
Timeline
- Preparation: (X days) — build the test artifact
- Execution: (X days/weeks) — run the experiment
- Analysis: (X days) — review results and decide
- Total: (X days/weeks)
Decision framework
| Result | Action |
|---|---|
| Primary metric exceeds target | Ship — move to full implementation |
| Primary metric meets target | Iterate — refine and retest, or proceed with caution |
| Primary metric below target | Kill — archive the learning, move on |
| Mixed signals (primary up, guardrail down) | Investigate — dig deeper before deciding |
Risks and mitigations
- (What could invalidate the test — e.g., "sample too small to be conclusive")
- (Mitigation — e.g., "extend test duration to 2 weeks if sample is below target")
Step 5b: Effect size and confidence intervals (for quantitative tests)
When the test produces quantitative results, report effect sizes and confidence intervals alongside p-values:
### Effect Size and Precision
| Metric | Control | Variant | Difference | 95% CI | Effect size |
|--------|---------|---------|-----------|--------|-------------|
| Primary: (metric) | (value) | (value) | (absolute diff) | [lower, upper] | (Cohen's d or relative %) |
**Interpretation:**
- The 95% confidence interval means: if we repeated this experiment many times, 95% of the intervals would contain the true effect.
- CI width indicates precision. A CI of [+0.5%, +8.0%] is much less precise than [+3.0%, +5.5%], even if both are "significant."
- If the CI includes zero, the true effect could be no change -- even if the point estimate is positive.
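To make the table concrete, here is a minimal sketch that computes the absolute difference, a 95% Wald confidence interval, and the relative lift for a conversion metric. The function name and the counts are made up for illustration, and the Wald interval is one of several reasonable choices; it assumes large samples.

```python
import math
from statistics import NormalDist


def diff_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Difference in conversion rate (B minus A), its CI, and the relative lift."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z * se, diff + z * se), diff / p_a


# Hypothetical counts: control 640/20000 (3.2%), variant 760/20000 (3.8%).
diff, (lower, upper), lift = diff_with_ci(640, 20000, 760, 20000)
print(f"diff={diff:+.4f}, 95% CI=[{lower:+.4f}, {upper:+.4f}], relative lift={lift:+.1%}")
```

For these counts the interval runs from about +0.24pp to +0.96pp. It excludes zero, but it is wide relative to the +0.6pp point estimate, which is worth stating in the readout.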
When to consider Bayesian analysis:
- When you need to make a decision with limited data (small sample, can't wait for full power)
- When you want to express results as "probability that B is better than A" rather than "reject/fail to reject the null"
- When stakeholders find "95% probability B is better" more intuitive than "p < 0.05"
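- When you need to check results as data arrives: Bayesian updating handles sequential looks more naturally, provided the prior and the stopping rule are fixed in advance (frequent peeking can still inflate false positives otherwise)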
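A small sketch of the "probability that B is better than A" framing, assuming a conversion metric, independent uniform Beta(1, 1) priors, and Monte Carlo sampling from the two posteriors. The function name and counts are hypothetical, and a different prior would shift the result when samples are small.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors with uniform priors."""
    rng = random.Random(seed)  # fixed seed so the readout is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: control 640/20000, variant 760/20000.
print(f"P(B > A) = {prob_b_beats_a(640, 20000, 760, 20000):.3f}")
```

This number answers a different question than a p-value does, so agree on the decision threshold (for example, ship only above some pre-set probability) before looking at the data.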
Related skills: For causal questions when randomization isn't possible, use `/causal-inference-guide`. For choosing the right statistical test, use `/statistical-test-selector`.
Step 5c: Method validation variant (when proving a method works, not comparing alternatives)
When the experiment isn't comparing alternatives (A vs. B) but proving that a single method is fit for purpose, use method validation instead of A/B testing. This is common for: clinical scoring algorithms, biomarker calculations, AI clinical decision support, diagnostic tools, or any system where the question is "does this work?" not "which is better?"
When to use method validation instead of A/B testing:
- There is no "control" -- you're validating a new capability, not comparing two versions
- The method must work across a range of inputs, not just on average
- Regulatory or safety requirements demand structured proof of fitness for purpose
- The question is "is this method accurate and reliable?" not "is version B better than version A?"
Method validation studies:
| Study | Question | How to run | Acceptance criteria |
|---|---|---|---|
| Accuracy | Does it give the right answer? | Compare outputs against a reference standard (expert panel, validated method, ground truth) | Agreement >= {{threshold}}% across all input categories |
| Precision (repeatability) | Same input, same results? | Run identical inputs {{n}} times; measure consistency | CV < {{threshold}}% or agreement >= {{threshold}}% |
| Precision (reproducibility) | Consistent across conditions? | Run across days, operators, system versions | CV < {{threshold}}% (looser than repeatability) |
| Linearity / range | Works across the full input range? | Test at 5+ difficulty/complexity levels | Performance degrades < {{threshold}} between levels |
| Specificity | Only measures what it should? | Test with known confounders, adversarial inputs | No clinically significant interference |
Critical rule: Define acceptance criteria BEFORE running the studies. Deciding what "good enough" means after seeing results is not validation -- it's rationalization.
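The rule above can be enforced mechanically by writing the acceptance criteria down as data before any study runs and then checking results against them. The sketch below covers only the accuracy and repeatability rows; the function names, the sample data, and the threshold values are all placeholders for illustration, not recommended limits.

```python
from statistics import mean, stdev


def agreement_pct(outputs, reference):
    """Percent of outputs that match the reference standard exactly."""
    if len(outputs) != len(reference):
        raise ValueError("outputs and reference must have the same length")
    matches = sum(o == r for o, r in zip(outputs, reference))
    return 100.0 * matches / len(reference)


def cv_pct(repeated_values):
    """Coefficient of variation (sample SD / mean) for repeated runs of one input."""
    return 100.0 * stdev(repeated_values) / mean(repeated_values)


def evaluate(outputs, reference, repeated_values, criteria):
    """Compare observed results with criteria that were fixed in advance."""
    agreement = agreement_pct(outputs, reference)
    cv = cv_pct(repeated_values)
    return {
        "agreement_pct": agreement,
        "accuracy_pass": agreement >= criteria["min_agreement_pct"],
        "cv_pct": cv,
        "repeatability_pass": cv < criteria["max_cv_pct"],
    }


# Placeholder thresholds, committed before the study; pick real ones with domain experts.
criteria = {"min_agreement_pct": 95.0, "max_cv_pct": 5.0}
report = evaluate(
    outputs=["high", "low", "low", "high", "medium"],
    reference=["high", "low", "medium", "high", "medium"],
    repeated_values=[10.1, 9.9, 10.0, 10.2, 9.8],
    criteria=criteria,
)
print(report)
```

In this made-up run, repeatability passes but accuracy does not (4 of 5 outputs agree). Because the thresholds were set first, that result is a finding to act on, not a number to renegotiate.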
Related skill: For a complete clinical validation protocol, use `/clinical-validation-protocol`.
Step 6: Review
Ask the user:
- Is the hypothesis crisp enough? Does the team agree on what we're testing?
- Is the test method proportional to what's at stake?
- Are the success criteria specific enough to make a clear decision?
- Who owns running this experiment?
Output location
Present the experiment plan as formatted text in the conversation.
Example Output
Input
- Assumption to test: "If we add an AI-generated trip summary email sent 48 hours before departure, travelers will feel more prepared and contact our support team less frequently"
- Target user: Leisure travelers who have booked multi-leg international trips through Elsewhere Travel Co.'s platform (average booking value: $4,200)
- KPI: Inbound support contacts per booking in the 48-hour pre-departure window
- What's at stake: ~$140K engineering investment to build the AI summarization pipeline, email rendering system, and personalization layer
Output
Experiment Plan: AI Pre-Departure Summary Email
Hypothesis
We believe that sending an AI-generated trip summary email 48 hours before departure for leisure travelers with multi-leg international bookings on Elsewhere Travel Co. will reduce pre-departure anxiety and answer the most common questions before they arise. We'll know this is true when we see a ≥20% reduction in inbound support contacts per booking during the 48-hour pre-departure window that improves our support cost per booking KPI (currently $18.40/booking, driven largely by pre-departure volume).
Test Method: Wizard of Oz
Why this method: The full AI pipeline would take 6–8 weeks to build. Before committing, we can manually generate trip summaries using a template + a support agent spending ~15 minutes per booking. This tests whether travelers respond to the email (open, click, reduce contact behavior) without building the AI layer. If the manual version doesn't move the needle, the automated version won't either. A/B testing the real feature would require building it first — that's the $140K decision we're trying to de-risk.
What to Build / Prepare
- Email template — Design a plain-text + HTML email in Mailchimp covering: flight sequence, check-in times, baggage rules, hotel confirmation numbers, local emergency contacts, and a "top 3 things to do the night before" checklist
- Runbook for ops team — 1-page guide for support agents to populate the template from existing booking data in Salesforce; target 15 min/booking
- Cohort selection — Pull 60 upcoming multi-leg international departures (departing 10–20 days out) and randomly assign 30 to receive the manual summary email; 30 as holdout control
- Tagging in Zendesk — Add `pre_departure_email_test` tag to treated bookings so support contacts can be filtered by cohort
- Send schedule — Trigger manually via Mailchimp 48 hours ± 2 hours before each departure
Success Criteria
| Metric | Target | Measurement |
|---|---|---|
| Primary: Inbound support contacts per booking (48-hr pre-departure window) | ≥20% reduction vs. control | Zendesk ticket count by booking ID, cohort tag filter |
| Secondary: Email open rate | ≥55% | Mailchimp campaign report |
| Secondary: Email click-through rate (at least one link) | ≥25% | Mailchimp click map |
| Secondary: CSAT score (post-trip survey) | No degradation vs. control baseline of 4.3/5 | Delighted post-trip CSAT survey |
| Guardrail: Escalated / urgent support contacts | Should not increase | Zendesk priority tag count by cohort |
| Guardrail: Agent time to prepare email | ≤20 min/booking median | Ops team time log in Notion |
Timeline
- Preparation: 5 days — finalize email template, get brand/legal sign-off, write agent runbook, set up Zendesk tagging, pull cohort list
- Execution: 21 days — rolling sends as departures hit the 48-hour mark across the 60-booking cohort
- Analysis: 4 days — pull Zendesk data, Mailchimp stats, CSAT responses; write readout
- Total: 30 days (~4.3 weeks)
Decision Framework
| Result | Action |
|---|---|
| Primary metric exceeds target (≥20% reduction) | Ship — green-light the $140K AI pipeline build; email program is validated |
| Primary metric approaches but misses target (10–19% reduction) | Iterate — test a revised template or send timing before committing to full build; consider a lighter automation approach first |
| Primary metric below target (<10% reduction) | Kill — archive learning; explore alternative support deflection strategies (e.g., in-app checklist, chatbot) |
| Mixed signals (contacts down, but CSAT or urgent escalations up) | Investigate — email may be creating confusion or anxiety; qualitative follow-up interviews with 5 travelers from each cohort before deciding |
Risks and Mitigations
- Sample size is modest (n=30 per cohort): With ~4 support contacts per booking on average, the treatment group needs to drop to ~3.2 to hit the 20% target. Run a quick power check before launch: the working estimate is ~70% power to detect a 20% effect at n=30, but that figure depends on how much contacts per booking vary and should be recomputed from historical Zendesk data. This is acceptable for a go/no-go on a discovery investment; it is not a regulatory decision. If results are borderline, extend to 50 bookings per cohort before deciding.
- Operator fatigue inflating prep time: If agents take >20 min/booking, the manual test becomes unrepresentative of the AI version's speed advantage. Monitor the Notion time log weekly and flag if median exceeds threshold.
- Cohort imbalance: The 60 bookings are pulled by departure-date proximity, and random assignment at n=30 per cohort does not guarantee similar traveler profiles. Check that both cohorts are balanced on trip complexity (number of legs) and booking value before analyzing.
- Novelty effect in open rates: Travelers have never received this email before — open rates may be inflated by novelty. Weight the support contact reduction metric more heavily than email engagement metrics when making the decision.
- Timing variance: "48 hours ± 2 hours" is manual; some sends may slip. Log actual send-to-departure gap in the Mailchimp notes field and exclude any booking where the email was sent <24 hours before departure.
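The quick power check mentioned in the sample-size risk can be sketched as follows. This is an illustration under loud assumptions: contacts per booking are treated as Poisson (variance equal to the mean), the test is a two-sided comparison of means under the normal approximation, and the function name is hypothetical. Real contact data may be more or less dispersed, so the variance should come from historical records.

```python
import math
from statistics import NormalDist


def power_two_means(mean_control, mean_treated, var_control, var_treated,
                    n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    nd = NormalDist()
    se = math.sqrt(var_control / n_per_group + var_treated / n_per_group)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = abs(mean_control - mean_treated) / se
    # Probability the test statistic lands in either rejection region.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


# Assumption: Poisson counts, so variance equals the mean (4.0 control, 3.2 treated).
for n in (30, 50):
    power = power_two_means(4.0, 3.2, 4.0, 3.2, n_per_group=n)
    print(f"n={n} per cohort: power is about {power:.0%}")
```

Under these assumptions the estimate at n=30 comes out noticeably lower than the power figure quoted in the risk list, which suggests that figure relies on lower variance or a one-sided test. Either way, it is a reason to confirm the check with real data before treating a null result as a kill signal.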
Step 6 Review Prompts for the Team
- Do we agree that a 20% reduction in pre-departure contacts is the right bar — or does legal/finance have a different ROI threshold for the $140K build?
- Who owns the ops runbook and agent training? (Recommended: Customer Experience Lead)
- Is 30 bookings per cohort enough for the confidence level this decision requires, or should we wait for a larger natural cohort?
- What's our plan if the Wizard of Oz test succeeds but the AI-generated copy quality is meaningfully lower than human-written summaries?