Use this when a team needs to make a data-informed decision but isn't sure which statistical method to apply. Common situations: comparing two variants, testing whether a metric changed after a launch, determining if a difference between segments is real or noise, or validating whether a correlation is meaningful. Produces a test selection with assumptions check, sample size guidance, and interpretation framework.
Related skills: Complements `/experiment-design` and `/ab-test-planner` for experiment analysis. Use `/causal-inference-guide` when the question is causal, not just "is this different?" Pair with `/data-quality-assessment` to verify data before testing.
Process
Step 1: Gather inputs
Ask the user to provide:
- The question -- what are you trying to determine? (e.g., "Is variant B better than A?" or "Did churn change after the pricing update?" or "Is there a relationship between usage frequency and NPS?")
- Data type -- what kind of data are you comparing? (Proportions/rates, continuous measurements, counts, time-to-event, ordinal/ranked.)
- Comparison structure -- are you comparing two groups, more than two groups, paired/repeated measurements, or looking for a relationship?
- Sample sizes -- how much data do you have for each group?
- Practical significance -- what size difference would actually matter for a decision? (Not statistical significance -- business significance.)
Step 2: Select the statistical test
Match the question to the right test:
## Statistical Test Selection -- {{question}}, {{date}}
### Decision matrix
| Your situation | Recommended test | Why |
|---------------|-----------------|-----|
| **Comparing two proportions** (e.g., conversion rates for A vs. B) | Z-test for proportions (or chi-square test of independence) | Standard for comparing rates between two groups |
| **Comparing two means** (e.g., average revenue per user in two segments) | Welch's t-test (not Student's t-test) | Robust to unequal variances, which is almost always the case in product data |
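| **Comparing two medians** (e.g., session duration, which is heavily skewed) | Mann-Whitney U test (Wilcoxon rank-sum) | Non-parametric; doesn't assume normal distribution. Strictly, it tests whether one group tends to have larger values, which amounts to a median comparison only when both distributions have a similar shape |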
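| **Comparing more than two groups** (e.g., revenue per user or conversion across 3+ variants) | ANOVA (if normal) or Kruskal-Wallis (if not) for means; chi-square test of independence for proportions such as conversion | One omnibus test avoids the inflated family-wise error rate of many uncorrected pairwise tests; follow up with corrected pairwise tests |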
| **Paired/before-after comparison** (e.g., same users before and after a change) | Paired t-test (if normal) or Wilcoxon signed-rank (if not) | Accounts for within-subject correlation |
| **Relationship between two continuous variables** (e.g., usage frequency vs. satisfaction) | Pearson correlation (if linear) or Spearman rank correlation (if not) | Quantifies association strength and direction |
| **Predicting an outcome from multiple factors** | Multiple regression (linear for continuous, logistic for binary) | Controls for confounders; quantifies individual factor contributions |
| **Comparing distributions** (e.g., is revenue distribution different between cohorts?) | Kolmogorov-Smirnov test | Detects any difference in distribution shape, not just mean |
| **Time-to-event** (e.g., time to first purchase, time to churn) | Log-rank test (comparing groups) or Cox regression (with covariates) | Handles censored data (users who haven't experienced the event yet) |
| **Count data** (e.g., number of support tickets per user) | Poisson regression or negative binomial regression | Models count outcomes correctly; handles overdispersion |
### Selected test: {{test_name}}
**Rationale:** (Why this test fits the data type and question.)
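A few rows of the decision matrix can be sketched as a lookup table. This is only an illustration: the key labels (`"proportion"`, `"two_groups"`, and so on) and the `suggest_test` helper are hypothetical names, not part of any library or of this skill.

```python
# Hypothetical lookup mirroring a few rows of the decision matrix.
# Keys are (data_type, structure); the labels are illustrative only.
TEST_MATRIX = {
    ("proportion", "two_groups"): "Z-test for proportions (or chi-square)",
    ("continuous", "two_groups"): "Welch's t-test",
    ("skewed_continuous", "two_groups"): "Mann-Whitney U test",
    ("continuous", "multiple_groups"): "ANOVA or Kruskal-Wallis",
    ("continuous", "paired"): "Paired t-test or Wilcoxon signed-rank",
    ("continuous", "relationship"): "Pearson or Spearman correlation",
    ("time_to_event", "two_groups"): "Log-rank test",
    ("count", "two_groups"): "Poisson or negative binomial regression",
}


def suggest_test(data_type: str, structure: str) -> str:
    """Return the matrix's recommendation, or a prompt to gather more inputs."""
    key = (data_type.lower(), structure.lower())
    return TEST_MATRIX.get(key, "No direct match: revisit the Step 1 inputs")


print(suggest_test("proportion", "two_groups"))
```

A lookup like this only narrows the choice; the assumptions check still decides whether the suggested test is appropriate.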
Step 3: Check assumptions
Every test has assumptions. Violating them can make results misleading:
### Assumptions check
| Assumption | Required by | How to check | Status |
|-----------|------------|-------------|--------|
| Independence | Most tests | Are observations independent? (Not clustered, not repeated measures, no network effects) | (Met / Violated / Partial) |
| Normality | t-test, ANOVA, Pearson | Histogram + Shapiro-Wilk test; with n > 30, Central Limit Theorem helps | (Met / Not needed / Violated) |
| Equal variances | Student's t-test | Levene's test; if violated, use Welch's t-test instead | (Met / Violated -- switching to Welch's) |
| Sufficient sample size | All tests | See power analysis below | (Met / Underpowered) |
| No extreme outliers | Mean-based tests | Box plot; consider Winsorizing or switching to median-based test | (Clean / Outliers present -- action taken) |
**If assumptions are violated:**
- Non-normal data with small n: use non-parametric alternative (Mann-Whitney, Kruskal-Wallis, Wilcoxon)
- Clustered data (e.g., users within accounts): use clustered standard errors or mixed-effects models
- Multiple comparisons: apply Benjamini-Hochberg (FDR) correction, not just Bonferroni (which is too conservative for most product work)
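To make the "switch to Welch's" fallback concrete, here is a minimal sketch of the Welch statistic and its approximate degrees of freedom using only the Python standard library. The function name `welch_t` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each group mean (sample variance / n)
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical samples, for illustration only
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The sketch stops at the statistic because the standard library has no t distribution. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the statistic together with a p-value.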
Step 4: Calculate sample size and power
### Power analysis
**For the selected test:**
| Parameter | Value | Source |
|-----------|-------|--------|
| Significance level (alpha) | 0.05 (standard) or (adjusted for multiple comparisons) | (Convention or business requirement) |
| Power (1 - beta) | 0.80 (standard) or 0.90 (for high-stakes decisions) | (How important is it to detect a real effect?) |
| Minimum detectable effect (MDE) | (The smallest difference that matters for the business) | (Business context -- not a statistical choice) |
| Baseline rate/mean | (Current value of the metric) | (Historical data) |
| Required sample size per group | (Calculated) | (Formula or tool) |
| Available sample size | (What you actually have) | (Data) |
| Achieved power | (Given actual sample size and the MDE, not the observed effect) | (Calculated from available n; power computed from the observed effect is circular) |
**Interpretation:**
- If underpowered (power < 0.80): a non-significant result does NOT mean "no effect." It means you can't tell.
- If overpowered (n >> required): even trivially small differences will be "significant." Focus on effect size, not p-value.
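For the common case of comparing two proportions, the "Required sample size per group" row is often filled in with the normal-approximation formula below. It is one standard approximation, assuming equal group sizes and a two-sided test; here $p_1$ is the baseline rate, $p_2 = p_1 + \text{MDE}$, and $z_q$ is the $q$-quantile of the standard normal distribution.

$$
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^2}
$$

With alpha = 0.05 and power = 0.80, the squared sum of quantiles is about 7.85, so halving the MDE roughly quadruples the required sample size.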
Step 5: Interpret results
### Results interpretation framework
| What to report | How to report it | Why it matters |
|---------------|-----------------|---------------|
| Effect size | Point estimate with units (e.g., "+2.3 percentage points" or "+$4.50 per user") | The magnitude of the difference -- what actually changes |
| Confidence interval | 95% CI around the effect (e.g., "[+0.8pp, +3.8pp]") | The range of plausible true effects given the data |
| Statistical significance | p-value AND whether it crosses your threshold | Whether the result is unlikely to be noise |
| Practical significance | Is the effect size large enough to matter for the business? | A statistically significant but tiny effect may not be worth acting on |
| Sample size and power | n per group, achieved power | Whether a null result is informative or just underpowered |
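For a two-proportion comparison, the effect size, confidence interval, and p-value rows could be computed along these lines. This is a sketch using the normal approximation (a Wald interval and a pooled z-test); the function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(x1, n1, x2, n2, conf_level=0.95):
    """Effect size, Wald CI, and two-sided z-test p-value for p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1
    # Unpooled standard error for the confidence interval
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard error for the test of "no difference"
    pooled = (x1 + x2) / (n1 + n2)
    se0 = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, ci, p_value


# Hypothetical counts: 100/1000 conversions vs. 130/1000
diff, (lo, hi), p = two_proportion_summary(100, 1000, 130, 1000)
print(f"effect = {diff * 100:+.1f}pp, "
      f"95% CI = [{lo * 100:+.1f}pp, {hi * 100:+.1f}pp], p = {p:.3f}")
```

Reporting all three numbers together keeps the magnitude and its uncertainty in view, rather than the p-value alone.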
### Common interpretation mistakes to avoid
- "Not significant" does NOT mean "no effect." It means you can't rule out noise with this sample.
- "Significant" does NOT mean "important." A p < 0.001 result with a 0.01% improvement is statistically real but negligible in business terms.
- Never compare p-values across tests ("this one is more significant"). Compare effect sizes and confidence intervals.
- Confidence intervals that include zero mean the true effect could be positive, negative, or zero.
- Post-hoc power analysis (after seeing the data) is circular. Use it only to plan the next test.
Step 6: Handle multiple comparisons
### Multiple comparisons correction
If you're running more than one test (e.g., testing 5 metrics, or comparing 4 segments):
| Scenario | Correction method | When to use |
|----------|------------------|-------------|
| Small number of pre-planned comparisons (2-5) | Bonferroni (alpha / number of tests) | Conservative; use when false positives are costly |
| Moderate number of comparisons (5-20) | Benjamini-Hochberg (FDR) | Controls false discovery rate; better balance of power and error control |
| Exploratory analysis (20+) | Report uncorrected p-values, flag as exploratory | Be transparent that these are hypothesis-generating, not confirmatory |
| One primary metric + secondary metrics | No correction on primary; correct or label secondaries as exploratory | Protects the main result while allowing exploration |
**Rule of thumb:** If you test 20 metrics at alpha = 0.05 and none of them truly changed, expect about 1 false positive by chance alone. Correct for this or be explicit about the risk.
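The two corrections in the table, and the arithmetic behind the rule of thumb, can be sketched in plain Python. The p-values below are invented for illustration, and the "at least one false positive" figure assumes the 20 tests are independent.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Rule of thumb: 20 tests at alpha = 0.05 with every null hypothesis true
expected_false_positives = 20 * 0.05
# Chance of at least one false positive, assuming independent tests
prob_at_least_one = 1 - (1 - 0.05) ** 20

# Hypothetical p-values for five metrics
p_vals = [0.001, 0.012, 0.021, 0.038, 0.300]
print(bonferroni(p_vals))
print(benjamini_hochberg(p_vals))
print(f"{expected_false_positives:.1f} expected, P(at least one) = {prob_at_least_one:.2f}")
```

On these made-up inputs Bonferroni rejects fewer hypotheses than Benjamini-Hochberg, which is the power trade-off the table describes.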
Step 7: Review
Ask the user:
- Is the question framed as a comparison, a relationship, or a prediction? (This determines the test family.)
- What decision does this test inform? (If no decision hinges on the result, the test may not be worth running.)
- What would you do if the result is significant? What if it's not? (If the answer is "nothing changes," reconsider the test.)
- Are you testing a pre-planned hypothesis or exploring data? (Exploratory analysis needs different correction and communication.)
- Who will see these results? (A data scientist wants the CI; a PM wants the business impact; an executive wants the decision.)
Output location
Present the test selection and interpretation as formatted text in the conversation. The results framework can be exported for team documentation or stakeholder presentation.
Example Output
Input
- Question: "Did our 7-day free trial conversion rate change after we switched from a credit-card-required signup to a no-card-required signup on March 12th?"
- Company & data: Lumen Analytics (B2B SaaS, project management tool); 4,200 signups in the 30 days before the change, 5,800 signups in the 30 days after; baseline conversion rate was 18.4%
- Data type: Proportions (converted to paid vs. did not convert)
- Comparison structure: Two independent groups (pre-change cohort vs. post-change cohort)
- Practical significance: A lift of ≥ 2 percentage points would justify keeping the no-card flow; smaller than that doesn't offset the projected increase in support volume from lower-intent signups
Output (abbreviated)
Statistical Test Selection — Trial Conversion Rate Change, Lumen Analytics
Selected test: Z-test for two proportions (chi-square test of independence)
Rationale: Both outcomes are binary (converted / did not convert), the two cohorts are independent (no overlap in users), and sample sizes are large enough that the normal approximation to the binomial is valid. Chi-square is equivalent and can be confirmed in any standard stats package.
Decision matrix
| Your situation | Recommended test | Why |
|---|---|---|
| Two proportions, independent groups, large n | Z-test for proportions / chi-square | Directly compares rates; large n satisfies normal approximation (np > 5 and n(1−p) > 5 in both groups) |
| Two proportions, small n (< ~30 per cell) | Fisher's Exact Test | Doesn't rely on approximation |
| Same users, before/after conversion | McNemar's test | Accounts for paired binary responses |
Assumptions check
| Assumption | Required by | How to check | Status |
|---|---|---|---|
| Independence of observations | Z-test, chi-square | Users in each cohort are distinct; no shared accounts or family plans affecting both groups | Met |
| Normality approximation | Z-test | n₁p₁ ≈ 773, n₁(1−p₁) ≈ 3,427; n₂p₂ and n₂(1−p₂) to be computed from post-period data (about 1,067 and 4,733 if the rate stays at baseline); all >> 5 | Met |
| No contamination across periods | Both | Confirm no users saw both flows (e.g., via cached signup pages or multi-device) | Partial — verify cache TTL with engineering |
| Stable external conditions | Causal inference | Check for concurrent campaigns, seasonality, or pricing changes in the same window | Review needed — March typically shows seasonal uptick in SaaS signups |
⚠️ Confounding flag: The pre/post design cannot isolate the card-removal effect from seasonal trends. If the conversion lift is marginal (near the 2pp threshold), consider a concurrent A/B test before committing to the no-card flow permanently. See `/causal-inference-guide` for a difference-in-differences approach if a control group can be constructed from a comparable segment.
Power analysis
| Parameter | Value | Source |
|---|---|---|
| Significance level (alpha) | 0.05 | Standard; single pre-planned test on one primary metric |
| Power (1 − beta) | 0.80 | Standard threshold; this is a reversible product decision |
| Minimum detectable effect (MDE) | 2.0 percentage points (18.4% → 20.4%) | Business requirement; stated threshold for action |
| Baseline conversion rate | 18.4% | 30-day pre-period actuals |
| Required sample size per group | ~6,100 | Calculated via two-proportion z-test power formula (two-sided) |
| Available sample size | 4,200 (pre) / 5,800 (post) | Actuals |
| Achieved power | ~0.71 at the 2.0pp MDE | Somewhat underpowered; the smallest lift detectable at 80% power is about 2.2pp |
Interpretation: Both cohorts fall short of the required sample size, so power to detect a 2pp lift is about 0.71 rather than 0.80. A non-significant result would therefore not rule out a lift near 2pp; it would mean the data cannot tell. In the other direction, an observed lift as small as about 1.6pp would still reach p < 0.05 at these sample sizes, so a statistically significant result could reflect a lift smaller than 2pp. Do not stop at the p-value: report the effect size and confidence interval and compare them to the 2pp business threshold.
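The "Required sample size per group" and "Achieved power" rows can be rechecked from the stated inputs with a short script. This is a sketch under the usual normal approximation for a two-sided two-proportion z-test; the function names are illustrative, and a dedicated power tool may differ slightly in rounding.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()


def required_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (equal group sizes)."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2


def achieved_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power to detect p2 vs. p1 with the available group sizes."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return norm.cdf(abs(p2 - p1) / se - z_alpha)


# Inputs taken from the example: 18.4% baseline, 2pp MDE, 4,200 / 5,800 signups
baseline, mde = 0.184, 0.02
n_pre, n_post = 4200, 5800

n_req = required_n_per_group(baseline, baseline + mde)
pwr = achieved_power(baseline, baseline + mde, n_pre, n_post)
print(f"required n per group: {n_req:,.0f}")
print(f"achieved power at the MDE: {pwr:.2f}")
```

Power here is evaluated at the pre-specified MDE rather than at the observed lift, which keeps it from being a restatement of the p-value.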
Results interpretation framework
| What to report | How to report it | Why it matters |
|---|---|---|
| Effect size | Point estimate in percentage points (e.g., "+3.1pp") | Raw conversion rate change in business terms |
| Confidence interval | 95% CI in pp (e.g., "[+2.2pp, +4.0pp]") | If the lower bound exceeds 2pp, the result is both statistically and practically significant |
| p-value | Report value + threshold (e.g., p = 0.003, threshold = 0.05) | Confirms result is unlikely to be noise |
| Practical significance | Compare point estimate to 2pp MDE | Drives the actual keep/revert decision |
| Confound risk | Flag seasonal/campaign overlap | Qualifies confidence in causal attribution |
Decision rules for stakeholders
| Result | Action |
|---|---|
| Lift ≥ 2pp AND p < 0.05 | Keep no-card flow; monitor support ticket volume for 30 days |
| Lift ≥ 2pp BUT p ≥ 0.05 | Should not happen given sample size — investigate data integrity |
| Lift < 2pp AND p < 0.05 | Statistically real but not business-meaningful; revert or run a longer test |
| Lift < 2pp AND p ≥ 0.05 | No detectable meaningful effect; revert to card-required flow |
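The decision table above can also be written as a small function so that the keep/revert rule is applied the same way each time. The function name `decide` and its defaults are illustrative; the thresholds come from the example's inputs.

```python
def decide(lift_pp: float, p_value: float,
           threshold_pp: float = 2.0, alpha: float = 0.05) -> str:
    """Map an observed lift (percentage points) and p-value to the table's action."""
    meaningful = lift_pp >= threshold_pp
    significant = p_value < alpha
    if meaningful and significant:
        return "Keep no-card flow; monitor support ticket volume for 30 days"
    if meaningful:
        return "Investigate data integrity"
    if significant:
        return "Revert or run a longer test"
    return "Revert to card-required flow"


# Illustrative values only, not observed results
print(decide(lift_pp=3.1, p_value=0.003))
```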
Multiple comparisons
You are testing one primary metric (7-day trial conversion rate). No correction needed.
If the team also examines secondary metrics (time-to-convert, plan tier selected, 30-day retention, support ticket rate), treat those as exploratory. Apply Benjamini-Hochberg correction across secondaries, and communicate findings as hypothesis-generating, not confirmatory.
Recommended next steps
- Pull the contingency table (converted / not converted × pre / post) and run the chi-square test in Python (`scipy.stats.chi2_contingency`) or R (`prop.test`).
- Check for cache contamination with engineering before finalizing cohort definitions.
- Overlay the signup volume trend against historical March data to assess seasonality exposure.
- If the result is near the 2pp boundary, escalate to `/experiment-design` to plan a concurrent randomized test before making the change permanent.