Causal Inference Guide - AI Agent Skill

Use this when you need to answer "did X cause Y?" and can't run a randomized experiment. Common situations: evaluating a feature that already launched without a holdout, measuring the impact of a policy change, understanding whether a marketing campaign drove conversions or just correlated with a seasonal trend. Produces a causal inference design with method selection, assumptions audit, and analysis plan.

Related skills: Complements /experiment-design (use that when randomization is possible; use this when it isn't). Upstream: /north-star-metric identifies the outcome to measure. Downstream: findings feed into /decision-brief for stakeholder communication. Pair with /data-quality-assessment to verify data before analysis.

Process

Step 1: Gather inputs

Ask the user to provide:

The causal question -- what treatment or intervention are you trying to measure the effect of? What outcome do you care about?
Why randomization failed -- why can't you run (or didn't you run) an A/B test? (Already launched, ethical constraints, too few units, management decision, retroactive analysis.)
Treatment assignment -- how were users/units assigned to the treatment? (Self-selected, rule-based cutoff, geographic rollout, time-based rollout, management decision.)
Available data -- what data do you have on the treatment group, control group, and outcome? How far back does it go? What covariates are available?
Confounders -- what other factors might explain the outcome besides the treatment? (Seasonality, concurrent launches, user self-selection, external events.)
Decision context -- what will you do with the answer? How precise does the estimate need to be?

Step 2: Select the causal method

Match the situation to the appropriate method:

## Causal Inference Design -- {{question}}, {{date}}

### Method selection

| Method | When to use | Key assumption | Your situation |
|--------|------------|----------------|---------------|
| Difference-in-Differences (DiD) | Treatment applied to one group at a known time; parallel pre-treatment trends available | Parallel trends: treated and control would have followed the same trajectory absent treatment | (Fit? Why or why not?) |
| Regression Discontinuity (RDD) | Treatment assigned by a threshold/cutoff (e.g., users above a score get the feature) | Continuity: units just above and below the cutoff are comparable | (Fit? Why or why not?) |
| Propensity Score Matching (PSM) | Treatment is self-selected but you have rich covariates to model selection | No unmeasured confounders: all variables driving selection are observed and included | (Fit? Why or why not?) |
| Instrumental Variables (IV) | A variable affects treatment assignment but not the outcome directly | Exclusion restriction: the instrument only affects the outcome through the treatment | (Fit? Why or why not?) |
| Interrupted Time Series (ITS) | Single group, treatment at a known time, long pre/post time series | No concurrent events: nothing else changed at the same time that could explain the shift | (Fit? Why or why not?) |
| Synthetic Control | One treated unit (e.g., a market or country), multiple untreated units for comparison | Pre-treatment fit: the synthetic control accurately reproduces the treated unit's pre-treatment trajectory | (Fit? Why or why not?) |

### Selected method: {{method_name}}

**Rationale:** (Why this method fits the situation better than alternatives.)

**Key risk:** (The biggest threat to validity with this method in this context.)
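The matching logic in the table can be sketched as a small first-pass lookup. This is only an illustration: the function name `shortlist_methods`, its parameters, and the assignment labels are hypothetical, and the rules are a deliberate simplification of the table. A shortlist like this narrows the options; it does not replace the assumptions audit in Step 3.

```python
def shortlist_methods(assignment, has_control_group, n_treated_units,
                      has_instrument=False):
    """Return candidate methods from the selection table (simplified rules).

    assignment: one of "cutoff", "self_selected", "geographic",
                "time_based", "management" (hypothetical labels).
    """
    candidates = []
    if assignment == "cutoff":
        # Threshold-based assignment points to regression discontinuity.
        candidates.append("RDD")
    if assignment == "self_selected":
        # Self-selection needs rich covariates to model who opted in.
        candidates.append("PSM")
    if has_instrument:
        candidates.append("IV")
    if assignment in ("geographic", "time_based", "management"):
        if has_control_group:
            candidates.append("DiD")
            if n_treated_units == 1:
                # A single treated unit with several untreated comparisons.
                candidates.append("Synthetic Control")
        else:
            # Single group with a known treatment date.
            candidates.append("ITS")
    return candidates


if __name__ == "__main__":
    print(shortlist_methods("geographic", True, 1))
```

Each shortlisted method still needs its "Fit? Why or why not?" cell filled in by hand, since the key assumption is what decides whether the method is credible.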

Step 3: Audit assumptions

Every causal method requires assumptions. Make them explicit and testable where possible:

### Assumptions audit

| Assumption | Required by | Testable? | Test or evidence | Status |
|-----------|------------|-----------|-----------------|--------|
| (e.g., Parallel trends) | DiD | Yes | Plot pre-treatment trends for treated vs. control; run placebo test on pre-period | (Pass / Fail / Untested) |
| (e.g., No spillover) | DiD | Partially | Check if control group behavior changed after treatment started | (Pass / Fail / Untested) |
| (e.g., No anticipation) | DiD | Partially | Check for behavior changes before the treatment date | (Pass / Fail / Untested) |
| (e.g., Overlap/common support) | PSM | Yes | Plot propensity score distributions; check for non-overlapping regions | (Pass / Fail / Untested) |

### Threats to validity
1. (Specific threat -- e.g., "Marketing campaign launched the same week as the feature, confounding the treatment effect")
2. (Specific threat -- e.g., "Power users self-selected into the beta, making treated users systematically different")
3. (Specific threat -- e.g., "Only 3 months of pre-treatment data, making parallel trends hard to verify")

### Mitigation strategies
- (For each threat: what can you do to reduce the risk or bound the bias?)
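As one example of turning an assumption into a test, the parallel-trends check can be sketched as follows: compute the treated-minus-control gap in each pre-treatment period and fit a straight line through it. A slope near zero is consistent with parallel pre-trends (it supports the assumption but cannot prove it). The function name `pretrend_gap_slope` and the numbers are hypothetical, and a full event-study regression with standard errors would be the more rigorous version.

```python
def pretrend_gap_slope(treated, control):
    """OLS slope of the (treated - control) gap on the period index.

    treated, control: equal-length lists of the outcome in each
    pre-treatment period, in chronological order.
    """
    if len(treated) != len(control) or len(treated) < 2:
        raise ValueError("need two equal-length series with >= 2 periods")
    gaps = [t - c for t, c in zip(treated, control)]
    n = len(gaps)
    t_mean = (n - 1) / 2          # mean of period indices 0..n-1
    g_mean = sum(gaps) / n
    num = sum((i - t_mean) * (g - g_mean) for i, g in enumerate(gaps))
    den = sum((i - t_mean) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    # Hypothetical pre-period outcomes: the gap stays at 10 throughout.
    parallel = pretrend_gap_slope([52, 50, 55, 53], [42, 40, 45, 43])
    # Hypothetical pre-period outcomes: the gap widens by 2 per period.
    diverging = pretrend_gap_slope([50, 52, 54, 56], [40, 40, 40, 40])
    print(parallel, diverging)
```

A clearly non-zero slope would move the "Parallel trends" row to Fail and should prompt a different control group or method.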

Step 4: Design the analysis plan

### Analysis plan

**Treatment definition:**
- Treatment group: (Who/what is treated? How are they identified in the data?)
- Control group: (Who/what is the comparison? How are they identified?)
- Treatment timing: (When did the treatment start? Is it binary or staggered?)

**Outcome variable:**
- Primary: (The main metric you're measuring the effect on)
- Secondary: (Other metrics to check for consistency or mechanism)

**Covariates:**
- (Variables to control for or match on -- list all that are available and relevant)

**Estimation approach:**
- (Specific regression specification, matching algorithm, or time series model)
- (Software/tool: R/Python package, SQL approach)

**Effect size and precision:**
- Expected effect size: (What change would be meaningful? What's the minimum you need to detect?)
- Standard errors: (Clustered? Heteroscedasticity-robust? Bootstrap?)
- Confidence intervals: (Report 95% CI, not just p-values)

**Robustness checks:**
1. (e.g., Placebo test: run the same analysis on a pre-treatment period where no effect should exist)
2. (e.g., Sensitivity analysis: how large would an unmeasured confounder need to be to explain away the result?)
3. (e.g., Alternative control group: repeat with a different comparison group)
4. (e.g., Dose-response: if treatment intensity varies, does the effect scale with exposure?)
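For the sensitivity-analysis check, one common way to express "how large would an unmeasured confounder need to be" is the E-value of VanderWeele and Ding. The guide does not name a specific technique, so treat this as one possible choice rather than the prescribed one. It applies to estimates on the risk-ratio scale; the function name `e_value` is an assumption for illustration.

```python
import math


def e_value(risk_ratio):
    """E-value for an observed risk ratio.

    The E-value is the minimum strength of association (risk-ratio scale)
    that an unmeasured confounder would need with both treatment and
    outcome to fully explain away the observed estimate.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))


if __name__ == "__main__":
    # An observed risk ratio of 2.0 gives an E-value of roughly 3.41.
    print(round(e_value(2.0), 2))
```

A larger E-value means a stronger hidden confounder would be needed to overturn the result; an E-value close to 1 means even weak confounding could explain it.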

Step 5: Plan the communication

### Communicating causal findings

**For technical audiences:**
- Report the point estimate with 95% confidence interval
- State the method and key assumptions explicitly
- Present robustness checks alongside the main result
- Quantify what would need to be true for the result to be wrong (sensitivity analysis)

**For non-technical stakeholders:**
- Lead with the business question and the answer
- State the confidence level in plain language: "We're reasonably confident that X caused a Y% increase in Z, with the caveat that [main limitation]"
- Use one visual: before/after trend with treatment marked, or treated vs. control comparison
- Be explicit about what this analysis can and cannot tell you

Step 6: Review

Ask the user:

Is the treatment cleanly defined? (If users could partially adopt or if treatment leaked to the control group, estimates will be biased.)
Is the control group a fair comparison? (If treated users are systematically different, the method needs to address selection.)
What would change your mind? (If no robustness check could shake your confidence, you're not being rigorous enough.)
Who needs to trust this analysis? (A data scientist will want robustness checks; an executive wants a clear bottom line.)
Is a directional answer sufficient, or do you need a precise estimate? (This determines how much methodological investment is warranted.)

Output location

Present the causal inference design as formatted text in the conversation. The analysis plan can be exported for the data team to execute.

Example Output

Input

Company & causal question: Meridian Health, a regional hospital network — did the rollout of an AI-powered triage chatbot on their patient portal cause a reduction in unnecessary ER visits?
Why randomization failed: Hospital leadership made a system-wide decision to launch the chatbot across all 14 facilities simultaneously in March 2024; no holdout was created
Treatment assignment: Geographic/facility-based rollout — all patients registered at Meridian facilities received access; neighboring hospital systems (non-Meridian) did not
Available data: Monthly ER visit counts per patient cohort going back 36 months (Jan 2021–present); patient demographics, chronic condition flags, insurance type, prior utilization history; competitor hospital ER volumes from state health board reports
Confounders: RSV/flu seasonality, a concurrent Medicaid expansion in the state (Jan 2024), media coverage about ER wait times discouraging visits, aging patient mix
Decision context: CFO needs to justify a $2.3M annual contract renewal for the chatbot vendor by Q3 2024; needs a credible estimate with defensible assumptions, not just a correlation

Output

Causal Inference Design — AI Triage Chatbot → Unnecessary ER Visits, Meridian Health, June 2024

Method selection

| Method | When to use | Key assumption | Your situation |
|--------|------------|----------------|---------------|
| Difference-in-Differences (DiD) | Treatment applied to one group at a known time; parallel pre-treatment trends available | Parallel trends: treated and control would have followed the same trajectory absent treatment | Strong fit. Meridian facilities launched at a known date (March 2024); non-Meridian hospitals in the same state serve as controls. 36 months of pre-treatment data allows trend verification. |
| Regression Discontinuity (RDD) | Treatment assigned by a threshold/cutoff | Continuity around cutoff | Not applicable. No eligibility cutoff: access was universal across all registered patients. |
| Propensity Score Matching (PSM) | Self-selected treatment with rich covariates | No unmeasured confounders | Weak fit as primary. Treatment was facility-level, not individual-level; matching on patient characteristics doesn't solve the facility-selection problem. Useful as a secondary check on patient-level covariates. |
| Instrumental Variables (IV) | Instrument affects treatment but not outcome directly | Exclusion restriction | No clean instrument available. No variable plausibly affects chatbot access without also affecting ER behavior. |
| Interrupted Time Series (ITS) | Single group, long pre/post series, known treatment time | No concurrent events | Partial fit, but risky here. The concurrent Medicaid expansion (Jan 2024, two months before launch) makes ITS unreliable as a standalone method because it cannot isolate the chatbot effect from the coverage expansion. |
| Synthetic Control | One treated unit, multiple untreated comparison units | Pre-treatment fit of synthetic control | Viable secondary method. Could construct a synthetic Meridian from a weighted combination of non-Meridian hospital systems. Best used to validate the DiD finding. |

Selected method: Difference-in-Differences (DiD) with Facility-Level Clustering

Rationale: Meridian's 14 facilities vs. comparable non-Meridian facilities in the same state create a natural treated/control structure with a clean treatment date. Thirty-six months of pre-treatment data is enough to test for pre-trend divergence; parallel pre-trends would support, though not prove, the parallel trends assumption. DiD removes time-invariant differences between Meridian and non-Meridian patient populations and absorbs shared seasonality through the control group, which is particularly important given the flu/RSV confound.

Key risk: The Medicaid expansion in January 2024 (two months before chatbot launch) may have independently reduced unnecessary ER visits if newly covered patients shifted to primary care. If the expansion affected Meridian and non-Meridian patients at different rates, parallel trends breaks down and DiD will attribute Medicaid effects to the chatbot.

Assumptions audit

| Assumption | Required by | Testable? | Test or evidence | Status |
|-----------|------------|-----------|-----------------|--------|
| Parallel trends | DiD | Yes | Plot monthly ER visit rates for Meridian vs. non-Meridian facilities Jan 2021–Feb 2024; run event-study regression showing no pre-trend divergence | Untested; run first |
| No spillover (SUTVA) | DiD | Partially | Check whether patients registered at Meridian facilities also use non-Meridian ERs; any chatbot-driven drop in their visits would show up in the control data and artificially deflate control-group visits | Untested; check state all-payer claims data |
| No anticipation | DiD | Partially | Check ER utilization in Jan–Feb 2024 for unusual pre-launch dips; chatbot was announced internally in Jan | Untested |
| Medicaid expansion absorbed equally | DiD | Yes | Compare Medicaid enrollment growth rates at Meridian vs. non-Meridian catchment areas; test for differential pre-post Medicaid visit trends | Untested; critical given timing |
| Stable facility composition | DiD | Yes | Confirm no Meridian facilities opened, closed, or changed service lines during the study period | Likely passes; verify with ops |

Threats to validity

Medicaid expansion confound (high risk): The January 2024 Medicaid expansion broadened primary care access two months before the chatbot launched. If Meridian's patient mix has higher Medicaid penetration than comparison facilities, DiD will overstate the chatbot's effect.
Media-driven ER avoidance (moderate risk): Statewide news coverage about ER wait times peaked in Q1 2024. This is partially absorbed by DiD (affects both groups) but not fully if Meridian received disproportionate local coverage due to its brand.
Patient cross-registration (low-moderate risk): Some Meridian patients may also be registered at non-Meridian systems. If they shift ER behavior due to the chatbot but show up in the "control" ER data, the control group absorbs some treatment effect, compressing the DiD estimate toward zero.

Mitigation strategies

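For Medicaid confound: Add a time-varying Medicaid expansion uptake rate (or the catchment-area uptake rate interacted with the post-period indicator) to the DiD regression; a time-invariant facility-level covariate would be absorbed by the facility fixed effects. Run a subsample analysis restricted to commercially insured patients (less affected by the expansion) as a robustness check.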
For media confound: Use Google Trends data on "ER wait times [state]" as a control variable, or restrict the post-treatment window to months where media coverage had subsided.
For cross-registration: If all-payer claims are available, flag patients with visits at both Meridian and non-Meridian facilities and exclude them from the primary analysis; compare results with and without exclusion.

Analysis plan

Treatment definition:

Treatment group: All patients with ≥1 registered encounter at a Meridian facility as of March 1, 2024 (chatbot activation date)
Control group: Patients registered at non-Meridian hospital systems in the same state with comparable service areas (exclude facilities >50 miles from Meridian catchment zones to reduce population heterogeneity)
Treatment timing: Binary, March 2024; not staggered (simultaneous rollout across all 14 facilities)

Outcome variable:

Primary: Monthly rate of unnecessary ER visits per 1,000 patient-months (define "unnecessary" as low-acuity visits triaged at Emergency Severity Index (ESI) levels 4–5)
Secondary: Total ER visits per 1,000 patient-months; same-day urgent care utilization (mechanism check — did chatbot redirect to urgent care?); 30-day readmission rate (ensure chatbot didn't cause under-triage)
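The rate definition used for the primary and secondary outcomes is simple arithmetic, sketched here for clarity. The function name `rate_per_1000_patient_months` and the counts in the example are hypothetical and are not Meridian data.

```python
def rate_per_1000_patient_months(visits, patient_months):
    """Visits per 1,000 patient-months of exposure."""
    if patient_months <= 0:
        raise ValueError("patient_months must be positive")
    return 1000 * visits / patient_months


if __name__ == "__main__":
    # Hypothetical: 180 low-acuity visits over 12,000 patient-months -> 15.0
    print(rate_per_1000_patient_months(180, 12_000))
```

Normalizing by patient-months rather than by patient count keeps the rate comparable when the registered population grows or shrinks from month to month.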

Covariates:

Patient-level: Age, sex, comorbidity burden (Charlson Comorbidity Index, CCI), insurance type, prior 12-month ER utilization
Facility-level: Urban/rural classification, facility size (beds), payer mix, Medicaid expansion uptake rate in catchment area
Time-level: Month-year fixed effects (absorb seasonality and shocks common to both groups; separate month and year effects would be redundant alongside them)

Estimation approach:

Two-way fixed effects DiD: ER_rate_it = α_i + γ_t + β(Meridian_i × Post_t) + δX_it + ε_it, where α_i = facility fixed effects, γ_t = month-year fixed effects, β = treatment effect estimate, X_it = covariates, and ε_it = error term
Standard errors clustered at the facility level (14
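The two-way fixed effects specification above can be sketched in plain Python for the simplest case: a balanced panel, no covariates, and a single simultaneous treatment date. Under those assumptions, subtracting unit means and period means (and adding back the grand mean) removes both sets of fixed effects, and β is the slope of the demeaned outcome on the demeaned treatment indicator. The function name `twfe_did` and the toy panel are hypothetical; the sketch returns only the point estimate, so clustered standard errors would still need a regression package.

```python
def twfe_did(panel, treated_units, post_periods):
    """Two-way fixed effects DiD estimate for a balanced panel.

    panel: dict mapping (unit, period) -> outcome.
    treated_units, post_periods: sets defining Treated_i x Post_t.
    """
    units = sorted({u for u, _ in panel})
    periods = sorted({t for _, t in panel})
    if len(panel) != len(units) * len(periods):
        raise ValueError("panel must be balanced")

    # Treatment indicator D_it = Treated_i * Post_t.
    d = {(u, t): float(u in treated_units and t in post_periods)
         for (u, t) in panel}

    def demean(x):
        # Remove unit and period fixed effects (valid for balanced panels).
        grand = sum(x.values()) / len(x)
        u_mean = {u: sum(x[(u, t)] for t in periods) / len(periods)
                  for u in units}
        t_mean = {t: sum(x[(u, t)] for u in units) / len(units)
                  for t in periods}
        return {(u, t): x[(u, t)] - u_mean[u] - t_mean[t] + grand
                for (u, t) in x}

    y_dm, d_dm = demean(panel), demean(d)
    den = sum(v * v for v in d_dm.values())
    if den == 0:
        raise ValueError("no treatment variation left after demeaning")
    return sum(d_dm[k] * y_dm[k] for k in panel) / den


if __name__ == "__main__":
    # Toy panel: outcome = unit effect + period effect - 2 * treated * post.
    unit_fx = {"A": 10.0, "B": 12.0, "C": 8.0, "D": 9.0}
    treated, post = {"A", "B"}, {3, 4}
    panel = {(u, t): fx + (t - 1) - 2.0 * (u in treated and t in post)
             for u, fx in unit_fx.items() for t in range(1, 5)}
    print(twfe_did(panel, treated, post))  # recovers approximately -2.0
```

With staggered rollout or time-varying covariates this shortcut no longer applies, and a full regression with facility and month-year dummies is the safer route.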
