A/B Test Planner

You need to plan, prioritize, or document A/B tests with proper hypothesis formation, statistical rigor, and decision frameworks. Covers hypothesis structure, test prioritization (ICE/PIE), variant design, traffic allocation, duration calculation, results documentation, and ship/iterate/kill decisions.

How it works

You provide test ideas, traffic volume, current conversion rate, and minimum detectable effect
The skill structures each test idea into a formal hypothesis, scores priorities using ICE and PIE frameworks, calculates required sample sizes and durations, and provides variant design guidelines
It returns a test plan with prioritized tests, statistical requirements, a results documentation template, and a decision framework for acting on results

Prompt

You are building an A/B test plan for a consulting engagement. This skill helps clients run disciplined experiments instead of "let's just try it and see" testing that wastes traffic and produces inconclusive results. Before writing, read knowledge/voice-tone-guide.md -- use the client-facing voice.

Inputs I will provide:

Test ideas: {{TEST_IDEAS}} (list of potential tests -- can be rough ideas like "try a different headline" or detailed hypotheses)
Traffic volume: {{TRAFFIC}} (monthly unique visitors or sessions to the pages being tested)
Current conversion rate: {{CURRENT_RATE}} (baseline conversion rate for the primary metric -- e.g., "3.2% of visitors sign up")
Minimum detectable effect: {{MDE}} (smallest meaningful improvement -- e.g., "10% relative lift" or "0.5 percentage points absolute")
Context (optional): {{CONTEXT}} (testing tool in use, prior test history, organizational constraints, statistical preferences)

Step 1: Hypothesis formation

Transform each test idea into a structured hypothesis using the if/then/because format.

Hypothesis Template

For each test idea:

Element	Content
Test name	[descriptive name -- e.g., "Homepage CTA text: action-oriented vs. benefit-oriented"]
Hypothesis	If we [change], then [metric] will [improve/increase/decrease] because [reasoning based on user behavior, data, or best practice]
Primary metric	[the one metric that determines success -- e.g., "form submission rate"]
Secondary metrics	[supporting metrics to monitor -- e.g., "bounce rate, time on page, downstream conversion"]
Guardrail metrics	[metrics that must not degrade -- e.g., "revenue per visitor, customer support tickets"]

Hypothesis Quality Check

For each hypothesis, verify:

The "if" is specific and testable (a concrete change, not "improve the experience")
The "then" is measurable and tied to a single primary metric
The "because" is grounded in evidence, user research, or a defensible assumption
The change is isolated (one variable per test, not a full redesign)

Bad hypothesis: "If we redesign the page, it will convert better."
Good hypothesis: "If we replace the generic stock photo hero image with a product screenshot showing the dashboard, then sign-up rate will increase by 15% because visitors in our last survey said they wanted to see the product before committing."

Step 2: Test prioritization

Score each test using both ICE and PIE frameworks. Use both to triangulate priority, not just one.

ICE Scoring (Impact x Confidence x Ease)

#	Test Name	Impact (1-10)	Confidence (1-10)	Ease (1-10)	ICE Score	Rank
1	[test]	[score]	[score]	[score]	[I x C x E]	[rank]

ICE scoring guide:

Factor	1-3	4-6	7-10
Impact	Marginal improvement; affects small segment	Moderate improvement; affects meaningful traffic	Large improvement; affects high-traffic or high-value page
Confidence	Gut feeling; no supporting data	Some evidence (best practices, competitor examples)	Strong evidence (user research, analytics, prior tests)
Ease	Requires engineering, design, and stakeholder alignment (weeks)	Moderate effort (days, one team)	Copy/config change, can launch today

PIE Scoring (Potential x Importance x Ease)

#	Test Name	Potential (1-10)	Importance (1-10)	Ease (1-10)	PIE Score	Rank
1	[test]	[score]	[score]	[score]	[P x I x E]	[rank]

PIE scoring guide:

Factor	1-3	4-6	7-10
Potential	Page is already well-optimized; small room for improvement	Some clear issues but not broken	Significant friction, poor heuristic scores, obvious problems
Importance	Low-traffic page, minor step in funnel	Moderate traffic or mid-funnel page	High-traffic page, critical funnel step, or high revenue impact
Ease	Same as ICE Ease definition	Same	Same

Combined Priority

#	Test Name	ICE Score	ICE Rank	PIE Score	PIE Rank	Avg Rank	Final Priority
1	[test]	[score]	[rank]	[score]	[rank]	[avg]	[P1/P2/P3]

Show the math for all scores. Do not round or hide calculations.
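The scoring and ranking arithmetic can be sketched as below. This is a minimal illustration, not a prescribed implementation: the `Idea` class, the function names, and the tie rule (tied scores share the better rank) are assumptions, and the scores are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int
    confidence: int
    potential: int
    importance: int
    ease: int  # the same Ease score feeds both ICE and PIE

    @property
    def ice(self) -> int:
        return self.impact * self.confidence * self.ease

    @property
    def pie(self) -> int:
        return self.potential * self.importance * self.ease


def rank_by(ideas, key):
    """Return {name: rank}; rank 1 is the highest score, ties share a rank."""
    ordered = sorted(ideas, key=key, reverse=True)
    ranks = {}
    for position, idea in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and key(idea) == key(previous):
            ranks[idea.name] = ranks[previous.name]
        else:
            ranks[idea.name] = position
    return ranks


def combined_priority(ideas):
    """Rows of (name, ICE, ICE rank, PIE, PIE rank, avg rank), best first."""
    ice_rank = rank_by(ideas, lambda i: i.ice)
    pie_rank = rank_by(ideas, lambda i: i.pie)
    rows = [
        (i.name, i.ice, ice_rank[i.name], i.pie, pie_rank[i.name],
         (ice_rank[i.name] + pie_rank[i.name]) / 2)
        for i in ideas
    ]
    return sorted(rows, key=lambda row: row[-1])


ideas = [
    Idea("Headline", impact=7, confidence=7, potential=7, importance=9, ease=9),
    Idea("Live chat", impact=6, confidence=5, potential=6, importance=9, ease=4),
    Idea("Plan layout", impact=8, confidence=7, potential=8, importance=9, ease=5),
    Idea("CTA color", impact=5, confidence=6, potential=5, importance=9, ease=10),
]
for row in combined_priority(ideas):
    print(row)
```

Mapping the average rank to P1/P2/P3 is left out on purpose, since the cutoffs depend on how many tests the engagement can run.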

Step 3: Variant design guidelines

For each prioritized test, define the control and variant(s).

Test Design: [Test Name]

Element	Control (A)	Variant (B)	Variant (C, if applicable)
What changes	[current state, described specifically]	[the specific change]	[alternative change, if testing multiple variants]
What stays the same	[everything else -- explicitly list key elements that remain constant]	Same	Same
Screenshot / mockup	[describe or attach]	[describe or attach]	[describe or attach]

Variant design rules:

Change one variable per test. If you change the headline and the CTA and the image, you will not know what worked.
The control must be the current live experience, not a theoretical "ideal."
Variants must be meaningfully different. Changing "Submit" to "Send" is unlikely to be detectable. Changing "Submit" to "Get my free report" is testable.
If testing more than 2 variants (A/B/C or A/B/C/D), traffic requirements increase proportionally. Flag if traffic is insufficient.

Step 4: Traffic allocation and duration calculation

Traffic Requirements

For each test, calculate the required sample size and duration.

Sample size formula (per variant, for a two-tailed test at 95% confidence and 80% power):

Sample per variant = 16 x p x (1 - p) / delta^2

Where:

p = baseline conversion rate (as a decimal)
delta = minimum detectable effect (as an absolute decimal)

Show the math for each test:

Test	Baseline Rate (p)	MDE (relative)	MDE (absolute, delta)	Sample per Variant	Total Sample (all variants)	Weekly Traffic to Page	Estimated Duration
[test name]	[e.g., 0.032]	[e.g., 10%]	[e.g., 0.0032]	[calculated]	[sample x number of variants]	[traffic]	[weeks]

Example calculation:

Baseline rate: 3.2% (p = 0.032)
MDE: 10% relative = 0.32 percentage points absolute (delta = 0.0032)
Sample per variant: 16 x 0.032 x 0.968 / 0.0032^2 = 16 x 0.030976 / 0.00001024 = 48,400 per variant
Total sample (2 variants): 96,800
Weekly traffic to page: 12,000
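Duration: 96,800 / 12,000 = 8.1 weeks, round up to 9 weeks (over the 8-week maximum in the duration guardrails below, so this test would need a larger MDE or a higher-traffic page)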
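The example calculation above can be reproduced with a short script. This is a sketch under the same assumptions as the formula (two-tailed, 95% confidence, 80% power); the function names are hypothetical. It also shows where the 16 comes from: it is 2 x (z for 95% confidence + z for 80% power)^2, about 15.7, rounded up.

```python
import math
from statistics import NormalDist


def exact_constant(alpha: float = 0.05, power: float = 0.80) -> float:
    """2 * (z_{1 - alpha/2} + z_{power})^2, which the rule of thumb rounds to 16."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2


def sample_per_variant(p: float, delta: float, constant: float = 16.0) -> int:
    """Visitors needed per variant for baseline p and absolute MDE delta."""
    n = constant * p * (1 - p) / delta ** 2
    # Round before taking the ceiling so floating-point noise cannot add a visitor.
    return math.ceil(round(n, 6))


def plan_duration(p: float, delta: float, weekly_traffic: int, variants: int = 2):
    """Return (total sample across variants, duration in whole weeks)."""
    total = sample_per_variant(p, delta) * variants
    return total, math.ceil(total / weekly_traffic)


per_variant = sample_per_variant(p=0.032, delta=0.0032)
total, weeks = plan_duration(p=0.032, delta=0.0032, weekly_traffic=12_000)
print(per_variant, total, weeks)
print(round(exact_constant(), 2))
```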

Traffic allocation recommendations:

50/50 split (default): Equal traffic to control and variant. Fastest to reach significance.
70/30 or 80/20 split: Use when the variant is risky and you want to limit exposure. For the same power, total sample and duration rise by roughly 20% (70/30) to 55% (80/20).
Multi-variant (A/B/C): Split traffic equally among variants. Requires proportionally more total traffic. Flag if total duration exceeds 8 weeks.
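The cost of an uneven two-arm split can be estimated with a one-line factor. This sketch assumes both arms have roughly the same conversion variance, in which case the total sample needed scales with 1 / (4 x f x (1 - f)), where f is the control share. The function name is hypothetical.

```python
def split_inflation(control_share: float) -> float:
    """Total-sample multiplier relative to a 50/50 split (equal-variance assumption)."""
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    return 1 / (4 * control_share * (1 - control_share))


for share in (0.5, 0.7, 0.8):
    print(f"{share:.0%} control: x{split_inflation(share):.2f} total sample")
```

At constant weekly traffic, the duration grows by the same multiplier.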

Duration guardrails:

Minimum test duration: 2 full business weeks (to capture weekly patterns)
Maximum recommended duration: 8 weeks (after that, external factors confound results)
If a test requires > 8 weeks at current traffic, either: increase the MDE, focus on higher-traffic pages, or flag as not testable with current traffic
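When a test does not fit in 8 weeks, it helps to invert the sample-size formula and ask what MDE the available traffic can support. A minimal sketch, reusing the 16 x p x (1 - p) / delta^2 rule of thumb; the function name and defaults are assumptions.

```python
import math


def smallest_detectable_delta(p: float, weekly_traffic: int,
                              max_weeks: int = 8, variants: int = 2) -> float:
    """Absolute MDE reachable within max_weeks, with traffic split equally."""
    n_per_variant = weekly_traffic * max_weeks / variants
    return math.sqrt(16 * p * (1 - p) / n_per_variant)


delta = smallest_detectable_delta(p=0.032, weekly_traffic=12_000)
relative_mde = delta / 0.032
print(f"absolute MDE: {delta:.5f}, relative MDE: {relative_mde:.1%}")
```

With a 3.2% baseline and 12,000 weekly visitors, the reachable MDE comes out slightly above 10% relative, which is why the 10% example overshoots the 8-week limit.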

Step 4b: Modern statistical defaults (Bayesian, sequential, CUPED)

The fixed-sample formula in Step 4 is the floor, not the ceiling. 2026 experimentation platforms (GrowthBook, Statsig, Spotify Confidence) default to methods that get to a valid decision faster and read more cleanly to stakeholders. Apply these unless the engagement's tooling forces classic fixed-horizon frequentist.

Bayesian readout. Report "probability the variant beats control" and "expected loss if you ship the wrong arm" instead of a p-value and a binary significant/not-significant. A stakeholder understands "87% chance B wins, and if we're wrong the downside is 0.2% conversion" far better than "p = 0.04." Set a decision threshold up front (for example, ship at >= 95% probability to beat control, kill at <= 5%). This replaces the Step 6 p < 0.05 gate when the engagement runs Bayesian.
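A Bayesian readout of this kind can be approximated with a Beta-Binomial model and Monte Carlo draws. This is a sketch, not how any particular platform computes it: the uniform Beta(1, 1) prior, the function names, the draw count, and the example counts are all assumptions.

```python
import random


def bayesian_readout(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Return (P(B beats A), expected loss in conversion rate if B is shipped)."""
    rng = random.Random(seed)
    wins = 0
    loss_sum = 0.0
    for _ in range(draws):
        # Posterior under a uniform Beta(1, 1) prior: Beta(1 + successes, 1 + failures)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
        # Loss is what we give up in the draws where A was actually better.
        loss_sum += max(rate_a - rate_b, 0.0)
    return wins / draws, loss_sum / draws


def bayesian_call(prob_b_wins, ship_at=0.95, kill_at=0.05):
    """Apply the pre-registered thresholds from the text."""
    if prob_b_wins >= ship_at:
        return "ship"
    if prob_b_wins <= kill_at:
        return "kill"
    return "keep running or iterate"


prob, loss = bayesian_readout(conv_a=320, n_a=10_000, conv_b=400, n_b=10_000)
print(f"P(B beats A) = {prob:.3f}, expected loss = {loss:.6f}, call = {bayesian_call(prob)}")
```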

Sequential testing. Classic frequentist tests forbid peeking: looking early and stopping inflates the false-positive rate. Sequential methods (always-valid p-values, group sequential boundaries) make continuous peeking statistically valid, so you can stop early when the result is clear. If the engagement's platform supports sequential testing, peeking is allowed and the "do not peek" trap in Step 6 is relaxed. If it does not, the fixed-duration discipline still holds: pick the duration, then look once.
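The problem that sequential methods solve can be seen in a small A/A simulation. The sketch below does not implement a sequential test; it only compares the false-positive rate of a fixed-horizon z-test read once at the end against the same test read at five checkpoints with a stop at the first significant look. All parameters are illustrative assumptions.

```python
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(rate=0.05, per_arm=2000, looks=5, sims=400, seed=1):
    """A/A test (no true difference): return (peeking FPR, single-look FPR)."""
    rng = random.Random(seed)
    checkpoints = [per_arm * k // looks for k in range(1, looks + 1)]
    peek_hits = 0
    final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = seen = 0
        ever_significant = False
        significant = False
        for n in checkpoints:
            for _user in range(n - seen):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            seen = n
            significant = p_value(conv_a, n, conv_b, n) < 0.05
            ever_significant = ever_significant or significant
        peek_hits += ever_significant  # a peeker would have stopped and "won"
        final_hits += significant      # disciplined reader: last look only
    return peek_hits / sims, final_hits / sims


peeking, single_look = false_positive_rates()
print(f"peeking FPR: {peeking:.3f}, single-look FPR: {single_look:.3f}")
```

Because the final look is one of the checkpoints, the peeking rate can never be lower than the single-look rate; in practice it is usually well above the nominal 5%.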

CUPED variance reduction. CUPED (Controlled-experiment Using Pre-Experiment Data) uses each user's pre-experiment behavior as a covariate to strip out baseline noise, which can reach significance up to 2x faster at the same traffic. The current standard is the Negi-Wooldridge full-regression estimator, not the original single-covariate version. Recommend CUPED when pre-period data exists for the metric (returning users, logged-in traffic). Note the catch: CUPED needs a stable pre-period per user, so it does little for brand-new visitors with no history. When CUPED applies, the Step 4 sample-size estimate is conservative: real duration will likely be shorter.
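To make the mechanism concrete, here is a sketch of the original single-covariate CUPED adjustment, not the full-regression estimator named above. The data is synthetic and the function name is hypothetical; in a real test, theta would typically be estimated on control and variant users pooled together.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, pre_metric):
    """Return metric values with the part explained by pre_metric removed."""
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov_xy / variance(pre_metric)
    # Subtracting theta * (x - mean_x) leaves the mean unchanged but cuts variance.
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


rng = random.Random(7)
pre = [rng.gauss(10, 3) for _ in range(500)]        # synthetic pre-period metric
post = [0.8 * x + rng.gauss(0, 1) for x in pre]     # in-test metric, correlated with pre
adjusted = cuped_adjust(post, pre)

print(f"variance before: {variance(post):.2f}, after: {variance(adjusted):.2f}")
```

The stronger the correlation between pre-period and in-test behavior, the larger the variance reduction, which is why brand-new visitors gain little.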

Lever	Use when	Effect on the plan
Bayesian readout	Stakeholders need an intuitive ship/kill call	Replaces p-value with probability-to-beat and expected loss
Sequential testing	Platform supports always-valid stats	Peeking allowed, stop early when clear
CUPED	Pre-period data exists for the metric	Same traffic reaches significance faster (often up to 2x)

Step 4c: Validity guardrails (SRM, trigger analysis, suspicious uplift)

Before trusting any result, run the checks that 2026 platforms ship as defaults. A "winning" test with a broken split or a 40% lift is usually a data bug, not a win.

Sample Ratio Mismatch (SRM): the actual traffic split must match the intended split. A 50/50 test that lands at 52/48 with high volume signals a bug (redirect bias, bot filtering, broken randomization). Run a chi-square check. If SRM fires, the test is invalid, full stop. Do not read the result.
Trigger / exposure analysis: measure only users who actually saw the change, not everyone bucketed. If the variant only renders on a sub-page, including users who never reached it dilutes the effect and hides a real win.
Suspicious uplift check: a lift far larger than the MDE you powered for (a "20% relative lift" on a copy tweak) is more often instrumentation error or a novelty spike than a true effect. Treat outsized wins as flags to investigate, not victories to ship.
Guardrail metrics are mandatory, not optional. Every test ships with guardrails (revenue per visitor, latency, error rate, support tickets) that must not degrade. A primary-metric win that tanks a guardrail is not a win. This is already in the Step 1 hypothesis template; enforce it here as a release gate.
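The SRM chi-square check is small enough to run by hand. The sketch below covers the two-arm case only, where the chi-square statistic has one degree of freedom and its p-value reduces to a complementary error function. The function name and the 0.001 alert threshold are assumptions (a common convention, not something the platforms above are confirmed to use).

```python
import math


def srm_p_value(observed, intended_shares):
    """Chi-square goodness-of-fit p-value for a two-arm traffic split."""
    if len(observed) != 2 or len(intended_shares) != 2:
        raise ValueError("this sketch handles exactly two arms")
    total = sum(observed)
    chi2 = sum((obs - total * share) ** 2 / (total * share)
               for obs, share in zip(observed, intended_shares))
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert level

high_volume = srm_p_value([52_000, 48_000], [0.5, 0.5])
low_volume = srm_p_value([52, 48], [0.5, 0.5])
print(f"52/48 at 100,000 users: p = {high_volume:.2e}, SRM = {high_volume < SRM_THRESHOLD}")
print(f"52/48 at 100 users:     p = {low_volume:.3f}, SRM = {low_volume < SRM_THRESHOLD}")
```

The same 52/48 split is an alarm at high volume and ordinary noise at low volume, which is why the check uses a statistical test rather than a fixed percentage tolerance.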

Step 5: Results documentation template

For each completed test, document results using this structure:

Test Results: [Test Name]

Field	Value
Test name	[name]
Hypothesis	[if/then/because]
Date range	[start date] to [end date]
Duration	[weeks]
Traffic split	[allocation per variant]
Total sample	[visitors per variant]
Total conversions	[conversions per variant]

Results

Variant	Visitors	Conversions	Conversion Rate	Relative Lift vs. Control	Statistical Significance	Confidence
Control (A)	[n]	[n]	[%]	--	--	--
Variant (B)	[n]	[n]	[%]	[+/- %]	[Yes/No]	[% confidence]
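The Results columns can be filled from raw counts with a pooled two-proportion z-test. This is a sketch for the classic frequentist case only; the function name and the visitor and conversion counts are hypothetical.

```python
from statistics import NormalDist


def summarize(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Conversion rates, relative lift, and two-sided p-value for variant vs. control."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "control_rate": rate_a,
        "variant_rate": rate_b,
        "relative_lift": (rate_b - rate_a) / rate_a,
        "p_value": p,
        "significant": p < alpha,
    }


result = summarize(conv_a=1549, n_a=48_400, conv_b=1742, n_b=48_400)
print(f"lift: {result['relative_lift']:+.1%}, p = {result['p_value']:.4f}, "
      f"significant: {result['significant']}")
```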

Secondary Metrics

Metric	Control	Variant	Change	Significant?
[metric 1]	[value]	[value]	[+/- %]	[Yes/No]
[metric 2]	[value]	[value]	[+/- %]	[Yes/No]

Guardrail Check

Guardrail Metric	Control	Variant	Change	Within Tolerance?
[metric]	[value]	[value]	[+/- %]	[Yes/No]

Learnings

What did we learn about user behavior?
Did the hypothesis hold? Why or why not?
What does this suggest for the next test?

Step 6: Decision framework

After a test concludes, apply this decision framework:

Ship / Iterate / Kill

Decision	Criteria	Action
Ship	Variant won (p < 0.05 frequentist, or probability-to-beat past your Bayesian threshold), lift exceeds MDE, SRM clean, no guardrail violations	Deploy the variant to 100% of traffic. Document the win.
Iterate	Results are directionally positive but not statistically significant, or the lift is below MDE	Design a follow-up test with a bolder variant or higher MDE. Do not ship an inconclusive result.
Kill	Variant lost or showed no difference with sufficient sample size	Reject the hypothesis. Document the learning. Move to the next priority test.
Investigate	Unexpected results (variant won on primary but lost on guardrail, or secondary metrics tell a different story)	Dig into segments before deciding. Look at the result by device, traffic source, new vs. returning, etc.
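The table can be read as a short decision function. This is one possible encoding for the frequentist case, not the only one: the function name, the "Invalid" outcome for a failed SRM check (taken from Step 4c), and the "Keep running" outcome for a test that has not reached its planned sample are assumptions layered on top of the four rows.

```python
def decide(p_value, relative_lift, mde, srm_clean, guardrails_ok,
           sample_reached=True, alpha=0.05):
    """Map a finished test to Ship / Iterate / Kill / Investigate."""
    if not srm_clean:
        return "Invalid"        # broken split: do not read the result
    if not sample_reached:
        return "Keep running"   # fixed-horizon discipline: no early call
    won = p_value < alpha and relative_lift > 0
    if won and not guardrails_ok:
        return "Investigate"    # primary metric up, guardrail down
    if won and relative_lift >= mde:
        return "Ship"
    if relative_lift > 0:
        return "Iterate"        # positive but inconclusive, or below MDE
    return "Kill"               # flat or negative at full sample


print(decide(p_value=0.01, relative_lift=0.12, mde=0.10,
             srm_clean=True, guardrails_ok=True))
```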

Common decision traps to avoid:

Peeking at results early and stopping the test when it "looks good" (inflates false positive rate, unless the platform runs sequential testing per Step 4b, which makes peeking valid)
Running past the planned duration hoping a losing test will turn around (sunk cost fallacy)
Shipping a "flat" result because the variant is preferred by the team (opinion, not data)
Ignoring guardrail violations because the primary metric improved ("we'll fix revenue later")

Related skills: Pairs with /experiment-design for broader experimentation strategy beyond A/B tests. Receives test ideas from /cro-audit findings. Uses /funnel-analysis to identify which funnel steps need testing most.

Example Output

A/B Test Plan: Pricing Page Optimization

Step 1: Hypothesis Formation

Test 1 -- Pricing Page Hero Headline

Element	Content
Test name	Pricing headline: passive description vs. action-oriented free trial framing
Hypothesis	If we replace "Simple, transparent pricing" with "Start free. Upgrade when you're ready." then free trial start rate will increase because the current headline describes the pricing page's format rather than reducing commitment anxiety -- the most common reason visitors exit without converting (per exit survey data)
Primary metric	Free trial start rate (pricing page → trial signup)
Secondary metrics	Time on page, scroll depth past plan comparison table, plan tier selected
Guardrail metrics	Trial-to-paid conversion rate at 30 days (ensure we're not attracting lower-intent signups)

Test 2 -- Live Chat Widget on Pricing Page

Element	Content
Test name	Pricing page: no live chat vs. live chat widget (Intercom)
Hypothesis	If we add a live chat widget (Intercom, business hours only) to the pricing page, then free trial start rate will increase because visitors who reach the pricing page but don't convert often have unanswered plan comparison questions -- removing that friction point should reduce exit rate
Primary metric	Free trial start rate
Secondary metrics	Chat engagement rate, support ticket volume, session duration
Guardrail metrics	Trial-to-paid 30-day conversion rate; customer support cost per pricing page session

Test 3 -- Plan Comparison Layout: Full Table vs. Highlighted Recommendation

Element	Content
Test name	Plan comparison: three-column table vs. single recommended plan with toggle
Hypothesis	If we replace the three-column plan comparison table with a single "Most Popular" highlighted plan and a "See all plans" toggle, then free trial start rate will increase because choice overload in three-way comparisons increases decision paralysis -- reducing the default choice set to one recommended option lowers cognitive load at the moment of commitment
Primary metric	Free trial start rate
Secondary metrics	Plan tier distribution among new trials, "See all plans" toggle engagement rate, time-to-click-CTA
Guardrail metrics	Average trial plan tier (ensure we're not pushing lower-tier signups); revenue per new trial at 30 days

Test 4 -- CTA Button Color: Grey vs. Orange

Element	Content
Test name	CTA button color: grey (#6B7280) vs. orange (#F97316)
Hypothesis	If we change the primary CTA button from grey to orange, then free trial start rate will increase because the grey button has low contrast against the white pricing card background -- orange creates stronger visual salience and is consistent with the conversion best practice of high-contrast CTAs
Primary metric	Free trial start rate
Secondary metrics	CTA click-through rate (button click ÷ page visitors), bounce rate
Guardrail metrics	None flagged -- low-risk visual change

Step 2: Test Prioritization

ICE Scoring

#	Test Name	Impact (1-10)	Confidence (1-10)	Ease (1-10)	ICE Score	Rank
1	Headline: action-oriented framing	7	7	9	441	1
2	Live chat widget	6	5	4	120	4
3	Plan comparison layout	8	7	5	280	3
4	CTA button color	5	6	10	300	2

PIE Scoring

#	Test Name	Potential (1-10)	Importance (1-10)	Ease (1-10)	PIE Score	Rank
1	Headline: action-oriented framing	7	9	9	567	1
2	Live chat widget	6	9	4	216	4
3	Plan comparison layout	8	9	5	360	3
4	CTA button color	5	9	10	450	2