A/B Test Planner

You need to plan, size, and structure an A/B test.

You need to plan, prioritize, or document A/B tests with proper hypothesis formation, statistical rigor, and decision frameworks. Covers hypothesis structure, test prioritization (ICE/PIE), variant design, traffic allocation, duration calculation, results documentation, and ship/iterate/kill decisions.


How it works

  1. You provide test ideas, traffic volume, current conversion rate, and minimum detectable effect
  2. The skill structures each test idea into a formal hypothesis, scores priorities using ICE and PIE frameworks, calculates required sample sizes and durations, and provides variant design guidelines
  3. It returns a test plan with prioritized tests, statistical requirements, a results documentation template, and a decision framework for acting on results

Prompt

You are building an A/B test plan for a Kate Makrigiannis consulting engagement. Kate helps clients run disciplined experiments instead of "let's just try it and see" testing that wastes traffic and produces inconclusive results. Before writing, read knowledge/voice-tone-guide.md -- use the client-facing voice.

Inputs I will provide:

  • Test ideas: {{TEST_IDEAS}} (list of potential tests -- can be rough ideas like "try a different headline" or detailed hypotheses)
  • Traffic volume: {{TRAFFIC}} (monthly unique visitors or sessions to the pages being tested)
  • Current conversion rate: {{CURRENT_RATE}} (baseline conversion rate for the primary metric -- e.g., "3.2% of visitors sign up")
  • Minimum detectable effect: {{MDE}} (smallest meaningful improvement -- e.g., "10% relative lift" or "0.5 percentage points absolute")
  • Context (optional): {{CONTEXT}} (testing tool in use, prior test history, organizational constraints, statistical preferences)

Step 1: Hypothesis formation

Transform each test idea into a structured hypothesis using the if/then/because format.

Hypothesis Template

For each test idea:

| Element | Content |
| --- | --- |
| Test name | [descriptive name -- e.g., "Homepage CTA text: action-oriented vs. benefit-oriented"] |
| Hypothesis | If we [change], then [metric] will [improve/increase/decrease] because [reasoning based on user behavior, data, or best practice] |
| Primary metric | [the one metric that determines success -- e.g., "form submission rate"] |
| Secondary metrics | [supporting metrics to monitor -- e.g., "bounce rate, time on page, downstream conversion"] |
| Guardrail metrics | [metrics that must not degrade -- e.g., "revenue per visitor, customer support tickets"] |

Hypothesis Quality Check

For each hypothesis, verify:

  • The "if" is specific and testable (a concrete change, not "improve the experience")
  • The "then" is measurable and tied to a single primary metric
  • The "because" is grounded in evidence, user research, or a defensible assumption
  • The change is isolated (one variable per test, not a full redesign)

Bad hypothesis: "If we redesign the page, it will convert better."

Good hypothesis: "If we replace the generic stock photo hero image with a product screenshot showing the dashboard, then sign-up rate will increase by 15% because visitors in our last survey said they wanted to see the product before committing."

Step 2: Test prioritization

Score each test using both ICE and PIE frameworks. Use both to triangulate priority, not just one.

ICE Scoring (Impact x Confidence x Ease)

| # | Test Name | Impact (1-10) | Confidence (1-10) | Ease (1-10) | ICE Score | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | [test] | [score] | [score] | [score] | [I x C x E] | [rank] |

ICE scoring guide:

| Factor | 1-3 | 4-6 | 7-10 |
| --- | --- | --- | --- |
| Impact | Marginal improvement; affects small segment | Moderate improvement; affects meaningful traffic | Large improvement; affects high-traffic or high-value page |
| Confidence | Gut feeling; no supporting data | Some evidence (best practices, competitor examples) | Strong evidence (user research, analytics, prior tests) |
| Ease | Requires engineering, design, and stakeholder alignment (weeks) | Moderate effort (days, one team) | Copy/config change, can launch today |
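
As a rough illustration of the ICE arithmetic, the sketch below multiplies the three factors and ranks tests by the product, highest first. The class name, helper name, test ideas, and scores are all hypothetical and only demonstrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class IceScore:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def score(self) -> int:
        # ICE is the plain product of the three factors.
        return self.impact * self.confidence * self.ease


def rank_by_score(tests):
    """Return (rank, test) pairs, highest score first (rank 1 = top priority)."""
    ordered = sorted(tests, key=lambda t: t.score, reverse=True)
    return [(i + 1, t) for i, t in enumerate(ordered)]


# Hypothetical test ideas and scores, for illustration only.
ideas = [
    IceScore("Shorter signup form", impact=6, confidence=5, ease=7),
    IceScore("New hero image", impact=4, confidence=3, ease=9),
    IceScore("Checkout redesign", impact=9, confidence=6, ease=3),
]

for rank, t in rank_by_score(ideas):
    # Print the full multiplication so the math is visible, not just the result.
    print(f"{rank}. {t.name}: {t.impact} x {t.confidence} x {t.ease} = {t.score}")
```

Printing the full multiplication keeps the scoring auditable, in line with the instruction to show the math for every score.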

PIE Scoring (Potential x Importance x Ease)

| # | Test Name | Potential (1-10) | Importance (1-10) | Ease (1-10) | PIE Score | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | [test] | [score] | [score] | [score] | [P x I x E] | [rank] |

PIE scoring guide:

| Factor | 1-3 | 4-6 | 7-10 |
| --- | --- | --- | --- |
| Potential | Page is already well-optimized; small room for improvement | Some clear issues but not broken | Significant friction, poor heuristic scores, obvious problems |
| Importance | Low-traffic page, minor step in funnel | Moderate traffic or mid-funnel page | High-traffic page, critical funnel step, or high revenue impact |
| Ease | Same as ICE Ease definition | Same | Same |

Combined Priority

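| # | Test Name | ICE Score | ICE Rank | PIE Score | PIE Rank | Avg Rank | Final Priority |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | [test] | [score] | [rank] | [score] | [rank] | [avg] | [P1/P2/P3] |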

Show the math for all scores. Do not round or hide calculations.
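
One possible way to combine the two frameworks is to rank each test under ICE and under PIE, then average the two ranks. The sketch below assumes that reading of "Avg Rank"; the function names and scores are hypothetical, and the mapping from average rank to P1/P2/P3 is left out because the cut-offs are not defined here.

```python
def ranks(scores):
    """Map test name -> rank (1 = highest score). Tied scores share the better rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(s) + 1 for name, s in scores.items()}


def combined_priority(ice, pie):
    """Average the ICE and PIE ranks; a lower average means a higher priority."""
    ice_rank, pie_rank = ranks(ice), ranks(pie)
    rows = [
        (name, ice_rank[name], pie_rank[name], (ice_rank[name] + pie_rank[name]) / 2)
        for name in ice
    ]
    return sorted(rows, key=lambda row: row[3])


# Hypothetical scores, for illustration only.
ice = {"Test A": 210, "Test B": 162, "Test C": 108}
pie = {"Test A": 240, "Test B": 90, "Test C": 336}

for name, ice_r, pie_r, avg in combined_priority(ice, pie):
    print(f"{name}: ICE rank {ice_r}, PIE rank {pie_r}, avg rank ({ice_r} + {pie_r}) / 2 = {avg}")
```

When two tests tie on average rank, the sort keeps their input order, so a tie-break rule (for example, prefer the easier test) would need to be added explicitly.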

Step 3: Variant design guidelines

For each prioritized test, define the control and variant(s).

Test Design: [Test Name]

| Element | Control (A) | Variant (B) | Variant (C, if applicable) |
| --- | --- | --- | --- |
| What changes | [current state, described specifically] | [the specific change] | [alternative change, if testing multiple variants] |
| What stays the same | [everything else -- explicitly list key elements that remain constant] | Same | Same |
| Screenshot / mockup | [describe or attach] | [describe or attach] | [describe or attach] |

Variant design rules:

  • Change one variable per test. If you change the headline and the CTA and the image, you will not know what worked.
  • The control must be the current live experience, not a theoretical "ideal."
  • Variants must be meaningfully different. Changing "Submit" to "Send" is unlikely to be detectable. Changing "Submit" to "Get my free report" is testable.
  • If testing more than 2 variants (A/B/C or A/B/C/D), traffic requirements increase proportionally. Flag if traffic is insufficient.

Step 4: Traffic allocation and duration calculation

Traffic Requirements

For each test, calculate the required sample size and duration.

Sample size formula (per variant, for a two-tailed test at 95% confidence and 80% power):

Sample per variant = 16 x p x (1 - p) / delta^2

Where:

  • p = baseline conversion rate (as a decimal)
  • delta = minimum detectable effect (as an absolute decimal)
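
For context, the constant 16 appears to be a rounded form of the standard two-sample normal approximation. The sketch below assumes equal allocation and uses the baseline variance p(1 - p) for both arms; the z symbols are introduced here for illustration and are not part of the formula above.

```latex
n_{\text{per variant}} \approx \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\, p\,(1-p)}{\delta^{2}},
\qquad z_{0.975} \approx 1.96,\quad z_{0.80} \approx 0.84

2\,(1.96 + 0.84)^{2} = 2 \times 7.84 = 15.68 \approx 16
```

Rounding 15.68 up to 16 makes the estimate slightly conservative, which is usually acceptable for planning purposes.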

Show the math for each test:

| Test | Baseline Rate (p) | MDE (relative) | MDE (absolute, delta) | Sample per Variant | Total Sample (all variants) | Weekly Traffic to Page | Estimated Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [test name] | [e.g., 0.032] | [e.g., 10%] | [e.g., 0.0032] | [calculated] | [sample x number of variants] | [traffic] | [weeks] |

Example calculation:

  • Baseline rate: 3.2% (p = 0.032)
  • MDE: 10% relative = 0.32 percentage points absolute (delta = 0.0032)
  • Sample per variant: 16 x 0.032 x 0.968 / 0.0032^2 = 16 x 0.030976 / 0.00001024 = 48,400 per variant
  • Total sample (2 variants): 96,800
  • Weekly traffic to page: 12,000
  • Duration: 96,800 / 12,000 = 8.1 weeks, round up to 9 weeks
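
The example calculation above can be reproduced with a short script. This is a minimal sketch of the rule-of-thumb formula, assuming equal traffic per variant; the function names are hypothetical.

```python
import math


def sample_per_variant(p: float, delta: float) -> int:
    """Rule-of-thumb sample size per variant: 16 * p * (1 - p) / delta^2."""
    raw = 16 * p * (1 - p) / delta ** 2
    # Round away floating-point noise before taking the ceiling,
    # so a value like 48399.99999999 is not bumped to 48401 or cut to 48399.
    return math.ceil(round(raw, 6))


def plan_duration(p: float, relative_mde: float, weekly_traffic: int, variants: int = 2):
    """Return (sample per variant, total sample, whole weeks needed)."""
    delta = p * relative_mde              # convert relative MDE to absolute
    per_variant = sample_per_variant(p, delta)
    total = per_variant * variants
    weeks = math.ceil(total / weekly_traffic)  # always round up to full weeks
    return per_variant, total, weeks


print(plan_duration(0.032, 0.10, 12_000))  # (48400, 96800, 9)
```

Passing `variants=3` shows how an A/B/C test raises the total sample and duration for the same page traffic.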

Traffic allocation recommendations:

  • 50/50 split (default): Equal traffic to control and variant. Fastest to reach significance.
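  • 70/30 or 80/20 split: Use when the variant is risky and you want to limit exposure. Under the usual equal-variance approximation, this increases duration by roughly 20% (70/30) to 55% (80/20) compared with a 50/50 split.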
  • Multi-variant (A/B/C): Split traffic equally among variants. Requires proportionally more total traffic. Flag if total duration exceeds 8 weeks.
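
To estimate how much an unequal split costs, a common approximation treats the variance of the difference as proportional to 1/n_control + 1/n_variant. The sketch below assumes a two-arm test with similar variance in both arms; the function name is hypothetical.

```python
def split_inflation(variant_share: float) -> float:
    """Total-sample multiplier relative to a 50/50 split for a two-arm test.

    Assumes both arms have roughly the same variance, so the variance of the
    difference scales with 1/n_control + 1/n_variant.
    """
    if not 0 < variant_share < 1:
        raise ValueError("variant_share must be strictly between 0 and 1")
    return 1 / (4 * variant_share * (1 - variant_share))


for share in (0.5, 0.3, 0.2):
    control_pct = round((1 - share) * 100)
    variant_pct = round(share * 100)
    print(f"{control_pct}/{variant_pct} split: x{split_inflation(share):.2f} total sample")
```

Multiplying the 50/50 duration by this factor gives a first estimate of the duration under the uneven split.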

Duration guardrails:

  • Minimum test duration: 2 full business weeks (to capture weekly patterns)
  • Maximum recommended duration: 8 weeks (after that, external factors confound results)
  • If a test requires > 8 weeks at current traffic, either: increase the MDE, focus on higher-traffic pages, or flag as not testable with current traffic
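
The duration guardrails can be expressed as a simple check. This is an illustrative sketch only; the constant names, function name, and return messages are assumptions.

```python
MIN_WEEKS = 2  # minimum test duration from the guardrails above
MAX_WEEKS = 8  # maximum recommended duration from the guardrails above


def check_duration(required_weeks: float) -> str:
    """Classify a calculated duration against the minimum and maximum guardrails."""
    if required_weeks > MAX_WEEKS:
        return "too long: raise the MDE, move to a higher-traffic page, or flag as not testable"
    if required_weeks < MIN_WEEKS:
        return f"run for {MIN_WEEKS} full weeks anyway to capture weekly patterns"
    return "within guardrails"


for weeks in (1, 5, 9):
    print(f"{weeks} weeks -> {check_duration(weeks)}")
```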

Step 5: Results documentation template

For each completed test, document results using this structure:

Test Results: [Test Name]

| Field | Value |
| --- | --- |
| Test name | [name] |
| Hypothesis | [if/then/because] |
| Date range | [start date] to [end date] |
| Duration | [weeks] |
| Traffic split | [allocation per variant] |
| Total sample | [visitors per variant] |
| Total conversions | [conversions per variant] |

Results

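| Variant | Visitors | Conversions | Conversion Rate | Relative Lift vs. Control | Statistical Significance | Confidence |
| --- | --- | --- | --- | --- | --- | --- |
| Control (A) | [n] | [n] | [%] | -- | -- | -- |
| Variant (B) | [n] | [n] | [%] | [+/- %] | [Yes/No] | [% confidence] |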
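
If the testing tool does not report these figures directly, the lift and significance columns can be approximated with a pooled two-proportion z-test. The sketch below uses only the standard library; the function name and visitor counts are hypothetical, and a given tool may use a different statistical method.

```python
import math


def two_proportion_test(control_n, control_conv, variant_n, variant_conv, alpha=0.05):
    """Pooled two-proportion z-test with a two-sided p-value."""
    rate_c = control_conv / control_n
    rate_v = variant_conv / variant_n
    relative_lift = (rate_v - rate_c) / rate_c
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    z = (rate_v - rate_c) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "control_rate": rate_c,
        "variant_rate": rate_v,
        "relative_lift": relative_lift,
        "z": z,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Hypothetical counts, for illustration only.
result = two_proportion_test(10_000, 400, 10_000, 460)
print(f"lift: {result['relative_lift']:+.1%}, p-value: {result['p_value']:.4f}, "
      f"significant: {result['significant']}")
```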

Secondary Metrics

| Metric | Control | Variant | Change | Significant? |
| --- | --- | --- | --- | --- |
| [metric 1] | [value] | [value] | [+/- %] | [Yes/No] |
| [metric 2] | [value] | [value] | [+/- %] | [Yes/No] |

Guardrail Check

| Guardrail Metric | Control | Variant | Change | Within Tolerance? |
| --- | --- | --- | --- | --- |
| [metric] | [value] | [value] | [+/- %] | [Yes/No] |

Learnings

  • What did we learn about user behavior?
  • Did the hypothesis hold? Why or why not?
  • What does this suggest for the next test?

Step 6: Decision framework

After a test concludes, apply this decision framework:

Ship / Iterate / Kill

| Decision | Criteria | Action |
| --- | --- | --- |
| Ship | Variant won with statistical significance (p-value < 0.05), lift exceeds MDE, no guardrail violations | Deploy the variant to 100% of traffic. Document the win. |
| Iterate | Results are directionally positive but not statistically significant, or the lift is below MDE | Design a follow-up test with a bolder variant or higher MDE. Do not ship an inconclusive result. |
| Kill | Variant lost or showed no difference with sufficient sample size | Reject the hypothesis. Document the learning. Move to the next priority test. |
| Investigate | Unexpected results (variant won on primary but lost on guardrail, or secondary metrics tell a different story) | Dig into segments before deciding. Look at the result by device, traffic source, new vs. returning, etc. |
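
The decision table can be approximated as a small function. This is a simplified sketch, not a full implementation: it assumes the planned sample size was reached, treats any guardrail violation as "Investigate", counts a lift equal to the MDE as meeting it, and ignores secondary metrics. The function and parameter names are hypothetical.

```python
def decide(p_value: float, relative_lift: float, mde_relative: float,
           guardrails_ok: bool, alpha: float = 0.05) -> str:
    """Map test results to Ship / Iterate / Kill / Investigate."""
    if not guardrails_ok:
        # Simplification: any guardrail violation triggers a segment-level review.
        return "Investigate"
    significant = p_value < alpha
    if significant and relative_lift >= mde_relative:
        return "Ship"
    if relative_lift > 0:
        # Positive but inconclusive, or significant but below the MDE.
        return "Iterate"
    # Flat or negative result at the planned sample size.
    return "Kill"


print(decide(p_value=0.01, relative_lift=0.18, mde_relative=0.15, guardrails_ok=True))
print(decide(p_value=0.20, relative_lift=0.05, mde_relative=0.15, guardrails_ok=True))
print(decide(p_value=0.40, relative_lift=-0.03, mde_relative=0.15, guardrails_ok=True))
print(decide(p_value=0.01, relative_lift=0.18, mde_relative=0.15, guardrails_ok=False))
```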

Common decision traps to avoid:

  • Peeking at results early and stopping the test when it "looks good" (inflates false positive rate)
  • Running past the planned duration hoping a losing test will turn around (sunk cost fallacy)
  • Shipping a "flat" result because the variant is preferred by the team (opinion, not data)
  • Ignoring guardrail violations because the primary metric improved ("we'll fix revenue later")
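
The first trap can be demonstrated with an A/A simulation, where both arms share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below compares stopping at the first significant interim look with testing once at the planned end. All parameters (run count, sample size, look interval, rate, seed) are arbitrary assumptions chosen for illustration.

```python
import math
import random


def z_significant(conv_a: int, conv_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Pooled two-proportion z-test with equal arm sizes n; True if |z| > z_crit."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variance yet, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a - conv_b) / n / se > z_crit


def simulate(runs=300, n=2000, look_every=200, rate=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(runs):
        a = b = 0
        stopped_early = False
        for i in range(1, n + 1):
            a += rng.random() < rate
            b += rng.random() < rate
            # Peeking: declare a winner at the first look that appears significant.
            if i % look_every == 0 and z_significant(a, b, i):
                stopped_early = True
        peek_hits += stopped_early
        final_hits += z_significant(a, b, n)
    return peek_hits / runs, final_hits / runs


peek_rate, final_rate = simulate()
print(f"false positives with peeking: {peek_rate:.1%}, at planned end only: {final_rate:.1%}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the single-look rate, and with ten looks it is typically well above the nominal 5%.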

Related skills: Pairs with /experiment-design for broader experimentation strategy beyond A/B tests. Receives test ideas from /cro-audit findings. Uses /funnel-analysis to identify which funnel steps need testing most.


Example Output

Input

  • Test ideas: (1) Change the pricing page hero headline from "Simple, transparent pricing" to "Start free. Upgrade when you're ready." (2) Add a live chat widget to the pricing page (3) Replace the three-column plan comparison table with a single recommended plan highlighted with "Most Popular" badge and a toggle to see other plans (4) Change the primary CTA button color from grey to orange
  • Traffic volume: 28,000 monthly unique visitors to the pricing page; roughly 6,500/week
  • Current conversion rate: 4.1% of pricing page visitors start a free trial
  • Minimum detectable effect: 15% relative lift (approximately 0.6 percentage points absolute)
  • Context: Testing tool is VWO. Prior test history: one inconclusive CTA copy test run last quarter (stopped early at 12 days). Team has front-end dev capacity but no dedicated experimentation analyst. Preference for 95% confidence, 80% power.

Output

A/B Test Plan: Pricing Page Optimization

Client: [Redacted] | Prepared by: Kate Makrigiannis Consulting | Testing tool: VWO


Step 1: Hypothesis Formation

Test 1 — Pricing Page Hero Headline

| Element | Content |
| --- | --- |
| Test name | Pricing headline: passive description vs. action-oriented free trial framing |
| Hypothesis | If we replace "Simple, transparent pricing" with "Start free. Upgrade when you're ready." then free trial start rate will increase because the current headline describes the pricing page's format rather than reducing commitment anxiety, the most common reason visitors exit without converting (per exit survey data) |
| Primary metric | Free trial start rate (pricing page → trial signup) |
| Secondary metrics | Time on page, scroll depth past plan comparison table, plan tier selected |
| Guardrail metrics | Trial-to-paid conversion rate at 30 days (ensure we're not attracting lower-intent signups) |

Hypothesis quality check:

  • "If" is specific: exact copy change, not "improve the headline"
  • "Then" tied to one primary metric
  • "Because" grounded in exit survey evidence
  • One variable changed: headline text only

Test 2 — Live Chat Widget on Pricing Page

| Element | Content |
| --- | --- |
| Test name | Pricing page: no live chat vs. live chat widget (Intercom) |
| Hypothesis | If we add a live chat widget (Intercom, business hours only) to the pricing page, then free trial start rate will increase because visitors who reach the pricing page but don't convert often have unanswered plan comparison questions; removing that friction point should reduce exit rate |
| Primary metric | Free trial start rate |
| Secondary metrics | Chat engagement rate, support ticket volume, session duration |
| Guardrail metrics | Trial-to-paid 30-day conversion rate; customer support cost per pricing page session |

Hypothesis quality check:

  • Specific change: Intercom widget, business hours, pricing page only
  • Single primary metric
  • Reasoning tied to observed behavior (pricing page exit without trial start)
  • ⚠️ Flag: Live chat introduces a confound — results will be partially driven by chat response quality and agent availability, not just presence of the widget. Document agent coverage during test window.

Test 3 — Plan Comparison Layout: Full Table vs. Highlighted Recommendation

| Element | Content |
| --- | --- |
| Test name | Plan comparison: three-column table vs. single recommended plan with toggle |
| Hypothesis | If we replace the three-column plan comparison table with a single "Most Popular" highlighted plan and a "See all plans" toggle, then free trial start rate will increase because choice overload in three-way comparisons increases decision paralysis; reducing the default choice set to one recommended option lowers cognitive load at the moment of commitment |
| Primary metric | Free trial start rate |
| Secondary metrics | Plan tier distribution among new trials, "See all plans" toggle engagement rate, time-to-click-CTA |
| Guardrail metrics | Average trial plan tier (ensure we're not pushing lower-tier signups); revenue per new trial at 30 days |

Hypothesis quality check:

  • Specific layout change described
  • Tied to single primary metric
  • "Because" grounded in choice overload research and funnel behavior
  • One variable: layout structure (copy, pricing, and features remain identical)

Test 4 — CTA Button Color: Grey vs. Orange

| Element | Content |
| --- | --- |
| Test name | CTA button color: grey (#6B7280) vs. orange (#F97316) |
| Hypothesis | If we change the primary CTA button from grey to orange, then free trial start rate will increase because the grey button has low contrast against the white pricing card background; orange creates stronger visual salience and is consistent with the conversion best practice of high-contrast CTAs |
| Primary metric | Free trial start rate |
| Secondary metrics | CTA click-through rate (button click ÷ page visitors), bounce rate |
| Guardrail metrics | None flagged (low-risk visual change) |

Hypothesis quality check:

  • Exact hex values specified — not "a brighter color"
  • One variable: button color only (copy, size, placement unchanged)
  • Evidence: contrast ratio calculation + best practice
  • ⚠️ Note: This is a low-effort test, but color tests on pages with this conversion rate often produce small absolute lifts. The 15% relative MDE may be ambitious for a single-element change. Flag for ICE/PIE scoring.

Step 2: Test Prioritization

ICE Scoring

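| # | Test Name | Impact (1–10) | Confidence (1–10) | Ease (1–10) | ICE Score | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Headline: action-oriented framing | 7 | 7 | 9 | 441 | 1 |
| 2 | Live chat widget | 6 | 5 | 4 | 120 | 4 |
| 3 | Plan comparison layout | 8 | 7 | 5 | 280 | 3 |
| 4 | CTA button color | 5 | 6 | 10 | 300 | 2 |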

ICE scoring rationale:

  • Test 1 — Headline: Impact 7 (headline is above the fold, seen by all visitors; moderate-to-large expected effect). Confidence 7 (exit survey data directly supports the anxiety hypothesis). Ease 9 (copy change in VWO, no dev required).
  • Test 2 — Live chat: Impact 6 (chat can meaningfully reduce friction, but effect size is uncertain). Confidence 5 (plausible but no prior test or direct data). Ease 4 (requires Intercom integration, agent scheduling, and QA — multi-team effort).
  • Test 3 — Plan layout: Impact 8 (affects the core decision moment on the page; choice overload is a high-confidence CRO lever). Confidence 7 (choice overload research + the current three-column table has no clear visual hierarchy). Ease 5 (requires front-end dev for toggle component; 2–3 days estimated).
  • Test 4 — Button color: Impact 5 (isolated element; unlikely to move the needle 15% on its own). Confidence 6 (contrast principle is well-established). Ease 10 (VWO CSS change, launchable same day).

PIE Scoring

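| # | Test Name | Potential (1–10) | Importance (1–10) | Ease (1–10) | PIE Score | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Headline: action-oriented framing | 7 | 9 | 9 | 567 | 1 |
| 2 | Live chat widget | 6 | 9 | 4 | 216 | 4 |
| 3 | Plan comparison layout | 8 | 9 | 5 | 360 | 3 |
| 4 | CTA button color | 5 | 9 | 10 | 450 | 2 |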