Statistical Test Selector - AI Agent Skill

Use this when a team needs to make a data-informed decision but isn't sure which statistical method to apply. Common situations: comparing two variants, testing whether a metric changed after a launch, determining if a difference between segments is real or noise, or validating whether a correlation is meaningful. Produces a test selection with assumptions check, sample size guidance, and interpretation framework.

Related skills: Complements /experiment-design and /ab-test-planner for experiment analysis. Use /causal-inference-guide when the question is causal, not just "is this different?" Pair with /data-quality-assessment to verify data before testing.

Process

Step 1: Gather inputs

Ask the user to provide:

- The question -- what are you trying to determine? (e.g., "Is variant B better than A?" or "Did churn change after the pricing update?" or "Is there a relationship between usage frequency and NPS?")
- Data type -- what kind of data are you comparing? (Proportions/rates, continuous measurements, counts, time-to-event, ordinal/ranked.)
- Comparison structure -- are you comparing two groups, more than two groups, paired/repeated measurements, or looking for a relationship?
- Sample sizes -- how much data do you have for each group?
- Practical significance -- what size difference would actually matter for a decision? (Not statistical significance -- business significance.)

Step 2: Select the statistical test

Match the question to the right test:

## Statistical Test Selection -- {{question}}, {{date}}

### Decision matrix

| Your situation | Recommended test | Why |
|---------------|-----------------|-----|
| **Comparing two proportions** (e.g., conversion rates for A vs. B) | Z-test for proportions (or chi-square test of independence) | Standard for comparing rates between two groups |
| **Comparing two means** (e.g., average revenue per user in two segments) | Welch's t-test (not Student's t-test) | Robust to unequal variances, which is almost always the case in product data |
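| **Comparing two medians** (e.g., session duration, which is heavily skewed) | Mann-Whitney U test (Wilcoxon rank-sum) | Non-parametric; doesn't assume normal distribution. Strictly, it tests whether one group tends to have larger values, which matches a median comparison when the two distributions have similar shape |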
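| **Comparing more than two groups** (e.g., average order value across 3+ variants) | ANOVA (if normal) or Kruskal-Wallis (if not); for proportions such as conversion rates, chi-square test of independence | One omnibus test avoids the inflated false-positive rate of many pairwise tests; follow up with corrected pairwise tests |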
| **Paired/before-after comparison** (e.g., same users before and after a change) | Paired t-test (if normal) or Wilcoxon signed-rank (if not) | Accounts for within-subject correlation |
| **Relationship between two continuous variables** (e.g., usage frequency vs. satisfaction) | Pearson correlation (if linear) or Spearman rank correlation (if not) | Quantifies association strength and direction |
| **Predicting an outcome from multiple factors** | Multiple regression (linear for continuous, logistic for binary) | Controls for confounders; quantifies individual factor contributions |
| **Comparing distributions** (e.g., is revenue distribution different between cohorts?) | Kolmogorov-Smirnov test | Detects any difference in distribution shape, not just mean |
| **Time-to-event** (e.g., time to first purchase, time to churn) | Log-rank test (comparing groups) or Cox regression (with covariates) | Handles censored data (users who haven't experienced the event yet) |
| **Count data** (e.g., number of support tickets per user) | Poisson regression or negative binomial regression | Models count outcomes correctly; handles overdispersion |

### Selected test: {{test_name}}

**Rationale:** (Why this test fits the data type and question.)
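
As a rough illustration, the decision matrix can be encoded as a lookup table. The sketch below is hypothetical: the `DECISION_MATRIX` keys, the category labels, and the `select_test` name are assumptions made for this example rather than part of any library, and only a subset of the rows is covered.

```python
# Hypothetical encoding of part of the decision matrix.
# Keys are (data_type, comparison_structure); the labels are illustrative.
DECISION_MATRIX = {
    ("proportion", "two_groups"): "Z-test for two proportions (or chi-square)",
    ("continuous", "two_groups"): "Welch's t-test",
    ("skewed", "two_groups"): "Mann-Whitney U test",
    ("continuous", "multi_group"): "ANOVA, or Kruskal-Wallis if not normal",
    ("continuous", "paired"): "Paired t-test, or Wilcoxon signed-rank if not normal",
    ("continuous", "relationship"): "Pearson, or Spearman if not linear",
    ("time_to_event", "two_groups"): "Log-rank test",
    ("count", "regression"): "Poisson or negative binomial regression",
}


def select_test(data_type: str, structure: str) -> str:
    """Return the matrix recommendation, or a fallback when nothing matches."""
    key = (data_type.strip().lower(), structure.strip().lower())
    return DECISION_MATRIX.get(key, "No direct match; review the decision matrix manually")


print(select_test("proportion", "two_groups"))
print(select_test("ordinal", "three_way"))
```

A lookup like this only narrows the choice; the assumptions check in Step 3 still decides whether the recommended test is appropriate.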

Step 3: Check assumptions

Every test has assumptions. Violating them can make results misleading:

### Assumptions check

| Assumption | Required by | How to check | Status |
|-----------|------------|-------------|--------|
| Independence | Most tests | Are observations independent? (Not clustered, not repeated measures, no network effects) | (Met / Violated / Partial) |
| Normality | t-test, ANOVA, Pearson | Histogram + Shapiro-Wilk test; with n > 30, Central Limit Theorem helps | (Met / Not needed / Violated) |
| Equal variances | Student's t-test | Levene's test; if violated, use Welch's t-test instead | (Met / Violated -- switching to Welch's) |
| Sufficient sample size | All tests | See power analysis below | (Met / Underpowered) |
| No extreme outliers | Mean-based tests | Box plot; consider Winsorizing or switching to median-based test | (Clean / Outliers present -- action taken) |

**If assumptions are violated:**
- Non-normal data with small n: use non-parametric alternative (Mann-Whitney, Kruskal-Wallis, Wilcoxon)
- Clustered data (e.g., users within accounts): use clustered standard errors or mixed-effects models
- Multiple comparisons: apply Benjamini-Hochberg (FDR) correction, not just Bonferroni (which is too conservative for most product work)
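
The outlier row of the assumptions table can be checked numerically as well as visually. The following is a minimal sketch, assuming the usual box-plot fences (1.5 × IQR beyond the quartiles); the function names and the sample data are made up for illustration.

```python
from statistics import quantiles


def box_plot_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Values that a box plot would draw as individual outlier points."""
    low, high = box_plot_fences(values, k)
    return [v for v in values if v < low or v > high]


def winsorize(values, k=1.5):
    """Clamp values to the fences instead of dropping them."""
    low, high = box_plot_fences(values, k)
    return [min(max(v, low), high) for v in values]


# Hypothetical session durations in minutes, with one extreme value.
durations = [10, 12, 11, 13, 12, 11, 14, 12, 13, 250]
print(iqr_outliers(durations))
print(max(winsorize(durations)))
```

Whether to Winsorize or switch to a rank-based test is a judgment call; record the choice in the Status column either way.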

Step 4: Calculate sample size and power

### Power analysis

**For the selected test:**

| Parameter | Value | Source |
|-----------|-------|--------|
| Significance level (alpha) | 0.05 (standard) or (adjusted for multiple comparisons) | (Convention or business requirement) |
| Power (1 - beta) | 0.80 (standard) or 0.90 (for high-stakes decisions) | (How important is it to detect a real effect?) |
| Minimum detectable effect (MDE) | (The smallest difference that matters for the business) | (Business context -- not a statistical choice) |
| Baseline rate/mean | (Current value of the metric) | (Historical data) |
| Required sample size per group | (Calculated) | (Formula or tool) |
| Available sample size | (What you actually have) | (Data) |
| Achieved power | (Given actual sample size and the MDE, not the observed effect) | (Calculated from the plan, before looking at results) |

**Interpretation:**
- If underpowered (power < 0.80): a non-significant result does NOT mean "no effect." It means you can't tell.
- If overpowered (n >> required): even trivially small differences will be "significant." Focus on effect size, not p-value.
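
For the common case of comparing two proportions, the required sample size and the power at a given MDE can be approximated with the standard normal-approximation formula. The sketch below uses only the Python standard library. The function names and the 10% baseline with a 2pp MDE are illustrative assumptions, and a dedicated power tool may give slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test with equal group sizes."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def power_at_mde(p1, p2, n1, n2, alpha=0.05):
    """Approximate power to detect a shift from p1 to p2 with the samples available.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return _STD_NORMAL.cdf(abs(p2 - p1) / se - z_alpha)


# Hypothetical plan: 10% baseline, 2 percentage point MDE.
n = required_n_per_group(0.10, 0.12)
print(f"required n per group: {n}")
print(f"power with that n:    {power_at_mde(0.10, 0.12, n, n):.2f}")
```

Note that power is evaluated at the MDE chosen in advance, not at the effect observed in the data.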

Step 5: Interpret results

### Results interpretation framework

| What to report | How to report it | Why it matters |
|---------------|-----------------|---------------|
| Effect size | Point estimate with units (e.g., "+2.3 percentage points" or "+$4.50 per user") | The magnitude of the difference -- what actually changes |
| Confidence interval | 95% CI around the effect (e.g., "[+0.8pp, +3.8pp]") | The range of plausible true effects given the data |
| Statistical significance | p-value AND whether it crosses your threshold | Whether the result is unlikely to be noise |
| Practical significance | Is the effect size large enough to matter for the business? | A statistically significant but tiny effect may not be worth acting on |
| Sample size and power | n per group, achieved power | Whether a null result is informative or just underpowered |

### Common interpretation mistakes to avoid
- "Not significant" does NOT mean "no effect." It means you can't rule out noise with this sample.
- "Significant" does NOT mean "important." A p < 0.001 result with a 0.01% improvement is real but negligible in business terms.
- Never compare p-values across tests ("this one is more significant"). Compare effect sizes and confidence intervals.
- Confidence intervals that include zero mean the true effect could be positive, negative, or zero.
- Post-hoc power analysis (after seeing the data) is circular. Use it only to plan the next test.
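
To make the reporting framework concrete, here is a minimal sketch that produces the effect size, confidence interval, and p-value for a difference between two proportions. It uses the standard library only. The function name and the counts are hypothetical, and it assumes the normal approximation holds (large counts in every cell).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(x1, n1, x2, n2, conf=0.95):
    """Difference in proportions (group 2 minus group 1) with CI and two-sided p-value."""
    std_normal = NormalDist()
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1

    # Confidence interval: unpooled standard error.
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = std_normal.inv_cdf(1 - (1 - conf) / 2)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)

    # Hypothesis test: pooled standard error under the null of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return {"diff": diff, "ci": ci, "z": z, "p_value": p_value}


# Hypothetical counts: 200 of 2,000 convert in A, 250 of 2,000 in B.
result = two_proportion_summary(200, 2000, 250, 2000)
low, high = result["ci"]
print(f"effect: {result['diff'] * 100:+.1f}pp")
print(f"95% CI: [{low * 100:+.1f}pp, {high * 100:+.1f}pp]")
print(f"p-value: {result['p_value']:.3f}")
```

Reporting all three together keeps the focus on the size of the effect and its uncertainty rather than on the p-value alone.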

Step 6: Handle multiple comparisons

### Multiple comparisons correction

If you're running more than one test (e.g., testing 5 metrics, or comparing 4 segments):

| Scenario | Correction method | When to use |
|----------|------------------|-------------|
| Small number of pre-planned comparisons (2-5) | Bonferroni (alpha / number of tests) | Conservative; use when false positives are costly |
| Moderate number of comparisons (5-20) | Benjamini-Hochberg (FDR) | Controls false discovery rate; better balance of power and error control |
| Exploratory analysis (20+) | Report uncorrected p-values, flag as exploratory | Be transparent that these are hypothesis-generating, not confirmatory |
| One primary metric + secondary metrics | No correction on primary; correct or label secondaries as exploratory | Protects the main result while allowing exploration |

**Rule of thumb:** If you test 20 metrics at alpha = 0.05 and none of them has a real effect, expect 1 false positive on average by chance alone. Correct for this or be explicit about the risk.
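
The two corrections in the table can be sketched in a few lines. The example below is a plain-Python illustration with hypothetical p-values; the function names are assumptions for this example, and a stats package would normally be used in practice.

```python
def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive across m independent true-null tests."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= rank/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five secondary metrics.
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
print(f"P(at least one false positive in 20 tests): {familywise_error(20):.2f}")
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these made-up values Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps four, which illustrates the power difference described in the table.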

Step 7: Review

Ask the user:

- Is the question framed as a comparison, a relationship, or a prediction? (This determines the test family.)
- What decision does this test inform? (If no decision hinges on the result, the test may not be worth running.)
- What would you do if the result is significant? What if it's not? (If the answer is "nothing changes," reconsider the test.)
- Are you testing a pre-planned hypothesis or exploring data? (Exploratory analysis needs different correction and communication.)
- Who will see these results? (A data scientist wants the CI; a PM wants the business impact; an executive wants the decision.)

Output location

Present the test selection and interpretation as formatted text in the conversation. The results framework can be exported for team documentation or stakeholder presentation.

Example Output

Input

Question: "Did our 7-day free trial conversion rate change after we switched from a credit-card-required signup to a no-card-required signup on March 12th?"
Company & data: Lumen Analytics (B2B SaaS, project management tool); 4,200 signups in the 30 days before the change, 5,800 signups in the 30 days after; baseline conversion rate was 18.4%
Data type: Proportions (converted to paid vs. did not convert)
Comparison structure: Two independent groups (pre-change cohort vs. post-change cohort)
Practical significance: A lift of ≥ 2 percentage points would justify keeping the no-card flow; smaller than that doesn't offset the projected increase in support volume from lower-intent signups

Output (abbreviated)

Statistical Test Selection — Trial Conversion Rate Change, Lumen Analytics

Selected test: Z-test for two proportions (chi-square test of independence)

Rationale: Both outcomes are binary (converted / did not convert), the two cohorts are independent (no overlap in users), and sample sizes are large enough that the normal approximation to the binomial is valid. The chi-square test of independence on the 2x2 table is equivalent (its statistic is the square of the z statistic when no continuity correction is applied), so either can be run in any standard stats package.

Decision matrix

| Your situation | Recommended test | Why |
|---------------|-----------------|-----|
| Two proportions, independent groups, large n | Z-test for proportions / chi-square | Directly compares rates; large n satisfies normal approximation (np > 5 in both cells) |
| Two proportions, small n (< ~30 per cell) | Fisher's Exact Test | Doesn't rely on approximation |
| Same users, before/after conversion | McNemar's test | Accounts for paired binary responses |

Assumptions check

| Assumption | Required by | How to check | Status |
|-----------|------------|-------------|--------|
| Independence of observations | Z-test, chi-square | Users in each cohort are distinct; no shared accounts or family plans affecting both groups | Met |
| Normality approximation | Z-test | n₁p₁ = 773, n₁(1−p₁) = 3,427; at the baseline rate, n₂p₂ ≈ 1,067 and n₂(1−p₂) ≈ 4,733; all >> 5 | Met |
| No contamination across periods | Both | Confirm no users saw both flows (e.g., via cached signup pages or multi-device) | Partial; verify cache TTL with engineering |
| Stable external conditions | Causal inference | Check for concurrent campaigns, seasonality, or pricing changes in the same window | Review needed; March typically shows seasonal uptick in SaaS signups |

⚠️ Confounding flag: The pre/post design cannot isolate the card-removal effect from seasonal trends. If the conversion lift is marginal (near the 2pp threshold), consider a concurrent A/B test before committing to the no-card flow permanently. See /causal-inference-guide for a difference-in-differences approach if a control group can be constructed from a comparable segment.

Power analysis

| Parameter | Value | Source |
|-----------|-------|--------|
| Significance level (alpha) | 0.05 | Standard; single pre-planned test on one primary metric |
| Power (1 − beta) | 0.80 | Standard threshold; this is a reversible product decision |
| Minimum detectable effect (MDE) | 2.0 percentage points (18.4% → 20.4%) | Business requirement; stated threshold for action |
| Baseline conversion rate | 18.4% | 30-day pre-period actuals |
| Required sample size per group | ~6,100 | Two-sided two-proportion z-test power formula, equal group sizes |
| Available sample size | 4,200 (pre) / 5,800 (post) | Actuals |
| Achieved power | ~0.71 for a 2.0pp lift | Slightly underpowered; the smallest lift detectable at 80% power is about 2.2pp |

Interpretation: The available samples (4,200 and 5,800) fall short of the roughly 6,100 per group needed to detect a 2pp lift with 80% power, so a true lift of exactly 2pp would reach significance only about 71% of the time. A non-significant result is therefore not strong evidence that the change had no meaningful effect. The standard error of the difference is about 0.8pp, which means any observed lift of 2pp or more will be statistically significant, while a significant result can also reflect an observed lift somewhat below 2pp. Do not stop at the p-value: report the effect size and confidence interval and compare them to the 2pp business threshold.

Results interpretation framework

| What to report | How to report it | Why it matters |
|---------------|-----------------|---------------|
| Effect size | Point estimate in percentage points (e.g., "+3.1pp") | Raw conversion rate change in business terms |
| Confidence interval | 95% CI in pp (e.g., "[+2.2pp, +4.0pp]") | If the lower bound exceeds 2pp, the result is both statistically and practically significant |
| p-value | Report value + threshold (e.g., p = 0.003, threshold = 0.05) | Confirms result is unlikely to be noise |
| Practical significance | Compare point estimate to 2pp MDE | Drives the actual keep/revert decision |
| Confound risk | Flag seasonal/campaign overlap | Qualifies confidence in causal attribution |

Decision rules for stakeholders

| Result | Action |
|--------|--------|
| Lift ≥ 2pp AND p < 0.05 | Keep no-card flow; monitor support ticket volume for 30 days |
| Lift ≥ 2pp BUT p ≥ 0.05 | Should not happen given sample size; investigate data integrity |
| Lift < 2pp AND p < 0.05 | Statistically real but not business-meaningful; revert or run a longer test |
| Lift < 2pp AND p ≥ 0.05 | No detectable meaningful effect; revert to card-required flow |
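
The four decision rules can be written as a small function so that the same logic is applied each time results are refreshed. This is a hypothetical sketch: the `decide` name, its parameters, and the sample inputs are assumptions for illustration, with the 2pp threshold and alpha of 0.05 taken from the example above.

```python
def decide(lift_pp: float, p_value: float, mde_pp: float = 2.0, alpha: float = 0.05) -> str:
    """Map an observed lift (in percentage points) and p-value to the agreed action."""
    meaningful = lift_pp >= mde_pp
    significant = p_value < alpha
    if meaningful and significant:
        return "Keep no-card flow; monitor support ticket volume for 30 days"
    if meaningful:
        return "Should not happen given sample size; investigate data integrity"
    if significant:
        return "Statistically real but not business-meaningful; revert or run a longer test"
    return "No detectable meaningful effect; revert to card-required flow"


# Hypothetical inputs, not results from the example data.
print(decide(lift_pp=3.1, p_value=0.003))
print(decide(lift_pp=1.7, p_value=0.03))
```

Fixing the thresholds in code before the data is pulled makes it harder to move the goalposts after seeing the result.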

Multiple comparisons

You are testing one primary metric (7-day trial conversion rate). No correction needed.

If the team also examines secondary metrics (time-to-convert, plan tier selected, 30-day retention, support ticket rate), treat those as exploratory. Apply Benjamini-Hochberg correction across secondaries, and communicate findings as hypothesis-generating, not confirmatory.

Recommended next steps

- Pull the contingency table (converted / not converted × pre / post) and run the chi-square test in Python (scipy.stats.chi2_contingency) or R (prop.test). Both apply a continuity correction to 2x2 tables by default; pass correction=False in Python or correct = FALSE in R to match the uncorrected z-test exactly.
- Check for cache contamination with engineering before finalizing cohort definitions.
- Overlay the signup volume trend against historical March data to assess seasonality exposure.
- If the result is near the 2pp boundary, escalate to /experiment-design to plan a concurrent randomized test before making the change permanent.
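
For readers without SciPy at hand, the chi-square step can be sketched with the standard library alone. The pre-period count of 773 conversions comes from the assumptions check above; the post-period count of 1,200 is a made-up placeholder, not a result. The function name is an assumption for this example, and no continuity correction is applied.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square (1 degree of freedom, no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 df, via the complementary error function.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Rows: pre, post. Columns: converted, not converted.
# The post row (1,200 of 5,800) is HYPOTHETICAL, used only to show the mechanics.
table = [[773, 3427], [1200, 4600]]
chi2, p_value = chi_square_2x2(table)

# The pooled two-proportion z statistic, for comparison with the chi-square value.
p_pre, p_post = 773 / 4200, 1200 / 5800
pooled = (773 + 1200) / (4200 + 5800)
z = (p_post - p_pre) / sqrt(pooled * (1 - pooled) * (1 / 4200 + 1 / 5800))

print(f"chi-square: {chi2:.3f}, z squared: {z ** 2:.3f}, p-value: {p_value:.4f}")
```

The chi-square statistic and the squared z statistic agree, which is the equivalence between the two tests noted in the rationale.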
