You need to plan, prioritize, or document A/B tests with proper hypothesis formation, statistical rigor, and decision frameworks. Covers hypothesis structure, test prioritization (ICE/PIE), variant design, traffic allocation, duration calculation, results documentation, and ship/iterate/kill decisions.
How it works
- You provide test ideas, traffic volume, current conversion rate, and minimum detectable effect
- The skill structures each test idea into a formal hypothesis, scores priorities using ICE and PIE frameworks, calculates required sample sizes and durations, and provides variant design guidelines
- It returns a test plan with prioritized tests, statistical requirements, a results documentation template, and a decision framework for acting on results
Prompt
You are building an A/B test plan for a Kate Makrigiannis consulting engagement. Kate helps clients run disciplined experiments instead of "let's just try it and see" testing that wastes traffic and produces inconclusive results. Before writing, read knowledge/voice-tone-guide.md -- use the client-facing voice.
Inputs I will provide:
- Test ideas: {{TEST_IDEAS}} (list of potential tests -- can be rough ideas like "try a different headline" or detailed hypotheses)
- Traffic volume: {{TRAFFIC}} (monthly unique visitors or sessions to the pages being tested)
- Current conversion rate: {{CURRENT_RATE}} (baseline conversion rate for the primary metric -- e.g., "3.2% of visitors sign up")
- Minimum detectable effect: {{MDE}} (smallest meaningful improvement -- e.g., "10% relative lift" or "0.5 percentage points absolute")
- Context (optional): {{CONTEXT}} (testing tool in use, prior test history, organizational constraints, statistical preferences)
Step 1: Hypothesis formation
Transform each test idea into a structured hypothesis using the if/then/because format.
Hypothesis Template
For each test idea:
| Element | Content |
|---|---|
| Test name | [descriptive name -- e.g., "Homepage CTA text: action-oriented vs. benefit-oriented"] |
| Hypothesis | If we [change], then [metric] will [improve/increase/decrease] because [reasoning based on user behavior, data, or best practice] |
| Primary metric | [the one metric that determines success -- e.g., "form submission rate"] |
| Secondary metrics | [supporting metrics to monitor -- e.g., "bounce rate, time on page, downstream conversion"] |
| Guardrail metrics | [metrics that must not degrade -- e.g., "revenue per visitor, customer support tickets"] |
Hypothesis Quality Check
For each hypothesis, verify:
- The "if" is specific and testable (a concrete change, not "improve the experience")
- The "then" is measurable and tied to a single primary metric
- The "because" is grounded in evidence, user research, or a defensible assumption
- The change is isolated (one variable per test, not a full redesign)
Bad hypothesis: "If we redesign the page, it will convert better." Good hypothesis: "If we replace the generic stock photo hero image with a product screenshot showing the dashboard, then sign-up rate will increase by 15% because visitors in our last survey said they wanted to see the product before committing."
Step 2: Test prioritization
Score each test using both ICE and PIE frameworks. Use both to triangulate priority, not just one.
ICE Scoring (Impact x Confidence x Ease)
| # | Test Name | Impact (1-10) | Confidence (1-10) | Ease (1-10) | ICE Score | Rank |
|---|---|---|---|---|---|---|
| 1 | [test] | [score] | [score] | [score] | [I x C x E] | [rank] |
ICE scoring guide:
| Factor | 1-3 | 4-6 | 7-10 |
|---|---|---|---|
| Impact | Marginal improvement; affects small segment | Moderate improvement; affects meaningful traffic | Large improvement; affects high-traffic or high-value page |
| Confidence | Gut feeling; no supporting data | Some evidence (best practices, competitor examples) | Strong evidence (user research, analytics, prior tests) |
| Ease | Requires engineering, design, and stakeholder alignment (weeks) | Moderate effort (days, one team) | Copy/config change, can launch today |
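The ICE arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `ice_score` and the three test names and scores are hypothetical.

```python
def ice_score(impact: int, confidence: int, ease: int) -> int:
    """Multiply the three 1-10 factor scores, as in the ICE Score column."""
    for value in (impact, confidence, ease):
        if not 1 <= value <= 10:
            raise ValueError(f"factor score {value} is outside the 1-10 scale")
    return impact * confidence * ease


# Hypothetical scores, for illustration only: (impact, confidence, ease).
tests = {
    "Headline copy": (7, 6, 9),
    "Checkout redesign": (9, 5, 2),
    "Form field removal": (6, 8, 7),
}

scores = {name: ice_score(*factors) for name, factors in tests.items()}

# Highest product gets rank 1.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score}")
```

Because the three factors are multiplied, one very low factor (here, Ease 2 on the redesign) pulls the whole score down even when Impact is high.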
PIE Scoring (Potential x Importance x Ease)
| # | Test Name | Potential (1-10) | Importance (1-10) | Ease (1-10) | PIE Score | Rank |
|---|---|---|---|---|---|---|
| 1 | [test] | [score] | [score] | [score] | [P x I x E] | [rank] |
PIE scoring guide:
| Factor | 1-3 | 4-6 | 7-10 |
|---|---|---|---|
| Potential | Page is already well-optimized; small room for improvement | Some clear issues but not broken | Significant friction, poor heuristic scores, obvious problems |
| Importance | Low-traffic page, minor step in funnel | Moderate traffic or mid-funnel page | High-traffic page, critical funnel step, or high revenue impact |
| Ease | Same as ICE Ease definition | Same | Same |
Combined Priority
| # | Test Name | ICE Score | ICE Rank | PIE Score | PIE Rank | Avg Rank | Final Priority |
|---|---|---|---|---|---|---|---|
| 1 | [test] | [score] | [rank] | [score] | [rank] | [avg] | [P1/P2/P3] |
Show the math for all scores. Do not round or hide calculations.
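One way to compute the Combined Priority columns is sketched below. The helper names (`product`, `rank_desc`, `combined_priority`) and all scores are hypothetical, and the sketch does not resolve ties or assign P1/P2/P3 tiers, since the template leaves those rules open.

```python
def product(factors: tuple[int, int, int]) -> int:
    """Multiply three 1-10 factor scores (I x C x E or P x I x E)."""
    a, b, c = factors
    return a * b * c


def rank_desc(scores: dict[str, int]) -> dict[str, int]:
    """Rank names by score, highest score = rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def combined_priority(ice: dict, pie: dict) -> list[tuple[str, float]]:
    """Average the ICE rank and PIE rank per test, best (lowest) first."""
    ice_rank = rank_desc({name: product(f) for name, f in ice.items()})
    pie_rank = rank_desc({name: product(f) for name, f in pie.items()})
    averages = {name: (ice_rank[name] + pie_rank[name]) / 2 for name in ice}
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical factor scores, for illustration only.
ice = {"Test A": (8, 6, 5), "Test B": (5, 7, 9), "Test C": (3, 4, 10)}
pie = {"Test A": (9, 8, 5), "Test B": (6, 9, 9), "Test C": (3, 3, 10)}

for name, avg_rank in combined_priority(ice, pie):
    print(f"{name}: average rank {avg_rank}")
```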
Step 3: Variant design guidelines
For each prioritized test, define the control and variant(s).
Test Design: [Test Name]
| Element | Control (A) | Variant (B) | Variant (C, if applicable) |
|---|---|---|---|
| What changes | [current state, described specifically] | [the specific change] | [alternative change, if testing multiple variants] |
| What stays the same | [everything else -- explicitly list key elements that remain constant] | Same | Same |
| Screenshot / mockup | [describe or attach] | [describe or attach] | [describe or attach] |
Variant design rules:
- Change one variable per test. If you change the headline and the CTA and the image, you will not know what worked.
- The control must be the current live experience, not a theoretical "ideal."
- Variants must be meaningfully different. Changing "Submit" to "Send" is unlikely to be detectable. Changing "Submit" to "Get my free report" is testable.
- If testing more than 2 variants (A/B/C or A/B/C/D), traffic requirements increase proportionally. Flag if traffic is insufficient.
Step 4: Traffic allocation and duration calculation
Traffic Requirements
For each test, calculate the required sample size and duration.
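Sample size formula (per variant, for a two-tailed test at 95% confidence and 80% power; this is a rule-of-thumb approximation in which the constant 16 rounds up 2 x (1.96 + 0.84)^2 ≈ 15.7):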
Sample per variant = 16 x p x (1 - p) / delta^2
Where:
- p = baseline conversion rate (as a decimal)
- delta = minimum detectable effect (as an absolute decimal)
Show the math for each test:
| Test | Baseline Rate (p) | MDE (relative) | MDE (absolute, delta) | Sample per Variant | Total Sample (all variants) | Weekly Traffic to Page | Estimated Duration |
|---|---|---|---|---|---|---|---|
| [test name] | [e.g., 0.032] | [e.g., 10%] | [e.g., 0.0032] | [calculated] | [sample x number of variants] | [traffic] | [weeks] |
Example calculation:
- Baseline rate: 3.2% (p = 0.032)
- MDE: 10% relative = 0.32 percentage points absolute (delta = 0.0032)
- Sample per variant: 16 x 0.032 x 0.968 / 0.0032^2 = 16 x 0.030976 / 0.00001024 = 48,400 per variant
- Total sample (2 variants): 96,800
- Weekly traffic to page: 12,000
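- Duration: 96,800 / 12,000 = 8.1 weeks, round up to 9 weeks (this exceeds the 8-week maximum in the duration guardrails, so in practice this test would need a larger MDE or a higher-traffic page)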
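The example calculation above can be reproduced with a short Python sketch. The function names `sample_per_variant` and `plan_duration` are hypothetical; the formula is the 16 x p x (1 - p) / delta^2 rule of thumb stated above, and the inputs are the example's own numbers.

```python
import math


def sample_per_variant(p: float, delta: float) -> int:
    """Rule-of-thumb sample size per variant (95% confidence, 80% power)."""
    raw = 16 * p * (1 - p) / delta ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 6))


def plan_duration(p: float, relative_mde: float, weekly_traffic: int,
                  n_variants: int = 2) -> dict:
    """Convert a relative MDE into sample size and whole weeks of traffic."""
    delta = p * relative_mde  # absolute MDE as a decimal
    per_variant = sample_per_variant(p, delta)
    total = per_variant * n_variants
    weeks = math.ceil(total / weekly_traffic)  # always round up
    return {"delta": delta, "per_variant": per_variant,
            "total": total, "weeks": weeks}


plan = plan_duration(p=0.032, relative_mde=0.10, weekly_traffic=12_000)
print(plan["per_variant"], plan["total"], plan["weeks"])
```

Passing `n_variants=3` shows how an A/B/C test scales the total sample, and therefore the duration, in proportion to the number of variants.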
Traffic allocation recommendations:
- 50/50 split (default): Equal traffic to control and variant. Fastest to reach significance.
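- 70/30 or 80/20 split: Use when the variant is risky and you want to limit exposure. Relative to a 50/50 split, total sample and duration grow by a factor of about 1 / (4 x r x (1 - r)), where r is the variant's share of traffic: roughly 19% longer at 70/30 and 56% longer at 80/20.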
- Multi-variant (A/B/C): Split traffic equally among variants. Requires proportionally more total traffic. Flag if total duration exceeds 8 weeks.
Duration guardrails:
- Minimum test duration: 2 full business weeks (to capture weekly patterns)
- Maximum recommended duration: 8 weeks (after that, external factors confound results)
- If a test requires > 8 weeks at current traffic, either: increase the MDE, focus on higher-traffic pages, or flag as not testable with current traffic
Step 5: Results documentation template
For each completed test, document results using this structure:
Test Results: [Test Name]
| Field | Value |
|---|---|
| Test name | [name] |
| Hypothesis | [if/then/because] |
| Date range | [start date] to [end date] |
| Duration | [weeks] |
| Traffic split | [allocation per variant] |
| Total sample | [visitors per variant] |
| Total conversions | [conversions per variant] |
Results
| Variant | Visitors | Conversions | Conversion Rate | Relative Lift vs. Control | Statistical Significance | Confidence |
|---|---|---|---|---|---|---|
| Control (A) | [n] | [n] | [%] | -- | -- | -- |
| Variant (B) | [n] | [n] | [%] | [+/- %] | [Yes/No] | [% confidence] |
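The Conversion Rate, Relative Lift, and Statistical Significance columns can be filled in with a standard-library sketch like the one below. It assumes a pooled two-proportion z-test with a two-tailed p-value; the template does not prescribe a test, and your testing tool may use a different method. The function name and the visitor and conversion counts are hypothetical.

```python
import math
from statistics import NormalDist


def compare_variants(n_a: int, conv_a: int, n_b: int, conv_b: int) -> dict:
    """Pooled two-proportion z-test: control (a) vs. variant (b)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed
    return {
        "control_rate": rate_a,
        "variant_rate": rate_b,
        "relative_lift": (rate_b - rate_a) / rate_a,
        "z": z,
        "p_value": p_value,
        "significant": p_value < 0.05,
    }


# Hypothetical counts, for illustration only.
result = compare_variants(n_a=10_000, conv_a=400, n_b=10_000, conv_b=460)
print(f"lift {result['relative_lift']:+.1%}, p = {result['p_value']:.4f}")
```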
Secondary Metrics
| Metric | Control | Variant | Change | Significant? |
|---|---|---|---|---|
| [metric 1] | [value] | [value] | [+/- %] | [Yes/No] |
| [metric 2] | [value] | [value] | [+/- %] | [Yes/No] |
Guardrail Check
| Guardrail Metric | Control | Variant | Change | Within Tolerance? |
|---|---|---|---|---|
| [metric] | [value] | [value] | [+/- %] | [Yes/No] |
Learnings
- What did we learn about user behavior?
- Did the hypothesis hold? Why or why not?
- What does this suggest for the next test?
Step 6: Decision framework
After a test concludes, apply this decision framework:
Ship / Iterate / Kill
| Decision | Criteria | Action |
|---|---|---|
| Ship | Variant won with statistical significance (p < 0.05), lift exceeds MDE, no guardrail violations | Deploy the variant to 100% of traffic. Document the win. |
| Iterate | Results are directionally positive but not statistically significant, or the lift is below MDE | Design a follow-up test with a bolder variant or higher MDE. Do not ship an inconclusive result. |
| Kill | Variant lost or showed no difference with sufficient sample size | Reject the hypothesis. Document the learning. Move to the next priority test. |
| Investigate | Unexpected results (variant won on primary but lost on guardrail, or secondary metrics tell a different story) | Dig into segments before deciding. Look at the result by device, traffic source, new vs. returning, etc. |
Common decision traps to avoid:
- Peeking at results early and stopping the test when it "looks good" (inflates false positive rate)
- Running past the planned duration hoping a losing test will turn around (sunk cost fallacy)
- Shipping a "flat" result because the variant is preferred by the team (opinion, not data)
- Ignoring guardrail violations because the primary metric improved ("we'll fix revenue later")
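The Ship / Iterate / Kill table can be approximated as a small decision function. This is a simplified sketch: the function name, parameters, and the extra "keep_running" outcome (for a losing or flat variant that has not yet reached its planned sample) are assumptions, and the "secondary metrics tell a different story" trigger for Investigate is left to human judgment.

```python
def decide(relative_lift: float, p_value: float, mde_relative: float,
           guardrails_ok: bool, sample_reached: bool,
           alpha: float = 0.05) -> str:
    """Map test results onto ship / iterate / kill / investigate."""
    significant = p_value < alpha
    won = significant and relative_lift > 0

    if won and not guardrails_ok:
        return "investigate"   # primary metric won, guardrail lost
    if won and relative_lift >= mde_relative:
        return "ship"
    if relative_lift > 0:
        return "iterate"       # positive but inconclusive or below MDE
    if sample_reached:
        return "kill"          # lost or flat with sufficient sample
    return "keep_running"      # assumption: planned sample not yet reached


print(decide(0.18, 0.01, 0.15, guardrails_ok=True, sample_reached=True))
print(decide(0.05, 0.30, 0.15, guardrails_ok=True, sample_reached=True))
```

Requiring `sample_reached` before a kill decision, and never shipping on a non-significant result, encodes the first two decision traps directly in the logic.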
Related skills: Pairs with /experiment-design for broader experimentation strategy beyond A/B tests. Receives test ideas from /cro-audit audit findings. Uses /funnel-analysis to identify which funnel steps need testing most.
Example Output
Input
- Test ideas: (1) Change the pricing page hero headline from "Simple, transparent pricing" to "Start free. Upgrade when you're ready." (2) Add a live chat widget to the pricing page (3) Replace the three-column plan comparison table with a single recommended plan highlighted with "Most Popular" badge and a toggle to see other plans (4) Change the primary CTA button color from grey to orange
- Traffic volume: 28,000 monthly unique visitors to the pricing page; roughly 6,500/week
- Current conversion rate: 4.1% of pricing page visitors start a free trial
- Minimum detectable effect: 15% relative lift (approximately 0.6 percentage points absolute)
- Context: Testing tool is VWO. Prior test history: one inconclusive CTA copy test run last quarter (stopped early at 12 days). Team has front-end dev capacity but no dedicated experimentation analyst. Preference for 95% confidence, 80% power.
Output
A/B Test Plan: Pricing Page Optimization
Client: [Redacted] | Prepared by: Kate Makrigiannis Consulting | Testing tool: VWO
Step 1: Hypothesis Formation
Test 1 — Pricing Page Hero Headline
| Element | Content |
|---|---|
| Test name | Pricing headline: passive description vs. action-oriented free trial framing |
| Hypothesis | If we replace "Simple, transparent pricing" with "Start free. Upgrade when you're ready." then free trial start rate will increase because the current headline describes the pricing page's format rather than reducing commitment anxiety — the most common reason visitors exit without converting (per exit survey data) |
| Primary metric | Free trial start rate (pricing page → trial signup) |
| Secondary metrics | Time on page, scroll depth past plan comparison table, plan tier selected |
| Guardrail metrics | Trial-to-paid conversion rate at 30 days (ensure we're not attracting lower-intent signups) |
Hypothesis quality check:
- "If" is specific: exact copy change, not "improve the headline"
- "Then" tied to one primary metric
- "Because" grounded in exit survey evidence
- One variable changed: headline text only
Test 2 — Live Chat Widget on Pricing Page
| Element | Content |
|---|---|
| Test name | Pricing page: no live chat vs. live chat widget (Intercom) |
| Hypothesis | If we add a live chat widget (Intercom, business hours only) to the pricing page, then free trial start rate will increase because visitors who reach the pricing page but don't convert often have unanswered plan comparison questions — removing that friction point should reduce exit rate |
| Primary metric | Free trial start rate |
| Secondary metrics | Chat engagement rate, support ticket volume, session duration |
| Guardrail metrics | Trial-to-paid 30-day conversion rate; customer support cost per pricing page session |
Hypothesis quality check:
- Specific change: Intercom widget, business hours, pricing page only
- Single primary metric
- Reasoning tied to observed behavior (pricing page exit without trial start)
- ⚠️ Flag: Live chat introduces a confound — results will be partially driven by chat response quality and agent availability, not just presence of the widget. Document agent coverage during test window.
Test 3 — Plan Comparison Layout: Full Table vs. Highlighted Recommendation
| Element | Content |
|---|---|
| Test name | Plan comparison: three-column table vs. single recommended plan with toggle |
| Hypothesis | If we replace the three-column plan comparison table with a single "Most Popular" highlighted plan and a "See all plans" toggle, then free trial start rate will increase because choice overload in three-way comparisons increases decision paralysis — reducing the default choice set to one recommended option lowers cognitive load at the moment of commitment |
| Primary metric | Free trial start rate |
| Secondary metrics | Plan tier distribution among new trials, "See all plans" toggle engagement rate, time-to-click-CTA |
| Guardrail metrics | Average trial plan tier (ensure we're not pushing lower-tier signups); revenue per new trial at 30 days |
Hypothesis quality check:
- Specific layout change described
- Tied to single primary metric
- "Because" grounded in choice overload research and funnel behavior
- One variable: layout structure (copy, pricing, and features remain identical)
Test 4 — CTA Button Color: Grey vs. Orange
| Element | Content |
|---|---|
| Test name | CTA button color: grey (#6B7280) vs. orange (#F97316) |
| Hypothesis | If we change the primary CTA button from grey to orange, then free trial start rate will increase because the grey button has low contrast against the white pricing card background — orange creates stronger visual salience and is consistent with the conversion best practice of high-contrast CTAs |
| Primary metric | Free trial start rate |
| Secondary metrics | CTA click-through rate (button click ÷ page visitors), bounce rate |
| Guardrail metrics | None flagged — low-risk visual change |
Hypothesis quality check:
- Exact hex values specified — not "a brighter color"
- One variable: button color only (copy, size, placement unchanged)
- Evidence: contrast ratio calculation + best practice
- ⚠️ Note: This is a low-effort test, but color tests on pages with this conversion rate often produce small absolute lifts. The 15% relative MDE may be ambitious for a single-element change. Flag for ICE/PIE scoring.
Step 2: Test Prioritization
ICE Scoring
| # | Test Name | Impact (1–10) | Confidence (1–10) | Ease (1–10) | ICE Score | Rank |
|---|---|---|---|---|---|---|
| 1 | Headline: action-oriented framing | 7 | 7 | 9 | 441 | 1 |
| 2 | Live chat widget | 6 | 5 | 4 | 120 | 4 |
| 3 | Plan comparison layout | 8 | 7 | 5 | 280 | 3 |
| 4 | CTA button color | 5 | 6 | 10 | 300 | 2 |
ICE scoring rationale:
- Test 1 — Headline: Impact 7 (headline is above the fold, seen by all visitors; moderate-to-large expected effect). Confidence 7 (exit survey data directly supports the anxiety hypothesis). Ease 9 (copy change in VWO, no dev required).
- Test 2 — Live chat: Impact 6 (chat can meaningfully reduce friction, but effect size is uncertain). Confidence 5 (plausible but no prior test or direct data). Ease 4 (requires Intercom integration, agent scheduling, and QA — multi-team effort).
- Test 3 — Plan layout: Impact 8 (affects the core decision moment on the page; choice overload is a high-confidence CRO lever). Confidence 7 (choice overload research + the current three-column table has no clear visual hierarchy). Ease 5 (requires front-end dev for toggle component; 2–3 days estimated).
- Test 4 — Button color: Impact 5 (isolated element; unlikely to move the needle 15% on its own). Confidence 6 (contrast principle is well-established). Ease 10 (VWO CSS change, launchable same day).
PIE Scoring
| # | Test Name | Potential (1–10) | Importance (1–10) | Ease (1–10) | PIE Score | Rank |
|---|---|---|---|---|---|---|
| 1 | Headline: action-oriented framing | 7 | 9 | 9 | 567 | 1 |
| 2 | Live chat widget | 6 | 9 | 4 | 216 | 4 |
| 3 | Plan comparison layout | 8 | 9 | 5 | 360 | 3 |
| 4 | CTA button color | 5 | 9 | 10 | 450 | 2 |