Use this when you're stuck on a bug and want a systematic approach instead of random guessing. Works for runtime errors, unexpected behavior, test failures, and "it works on my machine" problems.
Process
Step 1: Capture the problem
Ask the engineer:
- What did you expect to happen?
- What actually happened? (Error message, wrong output, crash, hang — be specific.)
- When did it start? (After a deploy, a dependency update, a code change, or "it's always been like this.")
- Can you reproduce it reliably? (Always, sometimes, only in production, only on Tuesdays.)
- What have you already tried?
Step 2: Narrow the problem space
Based on the answers, classify the bug:
| Category | Signals | First moves |
|---|---|---|
| Crash / exception | Stack trace, error message | Read the stack trace bottom-up. Identify the throwing line. Check inputs to that function. |
| Wrong output | Code runs but produces incorrect results | Identify the last point where data is correct. Trace forward to where it diverges. |
| Performance | Slow, hanging, timeouts | Check for N+1 queries, unbounded loops, missing indexes, blocking I/O. |
| Intermittent | Flaky, race condition, timing-dependent | Look for shared mutable state, missing locks, order-dependent operations, time-sensitive logic. |
| Environment | Works locally, fails in CI/staging/prod | Compare env vars, dependency versions, OS differences, network access, file paths. |
Step 3: Form hypotheses
Generate 2-4 hypotheses ranked by likelihood. For each:
- Hypothesis: One sentence — what might be causing this.
- Evidence for: What supports this theory.
- Evidence against: What doesn't fit.
- Test: The fastest way to confirm or eliminate this hypothesis (a log statement, a test case, a config change — not a rewrite).
Present hypotheses and ask the engineer which to investigate first, or recommend the fastest-to-test option.
Step 4: Investigate
For the chosen hypothesis:
- Read the relevant code.
- Suggest the minimal investigation step (add a log, write a failing test, check a config value, inspect state at a breakpoint).
- If the hypothesis is confirmed — propose the fix.
- If eliminated — move to the next hypothesis.
- If investigation reveals new information — update hypotheses and re-rank.
Repeat until the root cause is identified.
Step 5: Fix and verify
Once the root cause is found:
- Write a failing test that reproduces the bug (if one doesn't exist).
- Propose the fix.
- Run the test — confirm it passes.
- Run the full test suite — confirm nothing else broke.
- Briefly explain the root cause so the team can learn from it.
Step 5b: Escalation assessment
Before closing the debugging session, evaluate whether this bug signals something larger.
Incident indicators -- escalate if any are true:
- User-facing impact affecting more than one user or session
- Data integrity compromise (wrong data written, data loss, corruption)
- Security implications (unauthorized access, data exposure)
- SLA breach or error budget burn
- The fix requires coordination across multiple teams or services
- The same root cause could affect other features or services
If escalation is warranted, recommend who to notify, what communication is needed (status page, Slack, stakeholder update), and whether to trigger a formal postmortem using /incident-postmortem.
Pattern recognition -- check for systemic signals:
- Has this category of bug occurred before? (Check incident history, bug tracker.)
- Is the root cause in a shared component, library, or infrastructure layer?
- Would this bug have been caught by existing tests if they existed?
- Does the fix address the symptom or the underlying cause?
If a pattern emerges, recommend whether this warrants a /tech-debt-assessment for the affected area or a /dx-assessment if the pattern points to tooling or workflow friction.
Output location
Present hypotheses, investigation results, and the fix in the conversation. Apply code changes directly to the codebase when approved.
Example Output
Input
- Engineer's report: "Our checkout API is returning 500 errors for roughly 30% of requests in production. No errors locally or in staging."
- Error message:
NullPointerException: Cannot read property 'taxRate' of undefinedinPricingService.calculateTotal()at line 84 - When it started: After Friday's deploy of
pricing-servicev2.4.1, which added multi-currency support - Reproducibility: Intermittent in production only — seems to happen more on orders from international customers
- Already tried: Rolled back a feature flag for the new currency selector UI, but errors continued
Output (abbreviated)
Debug Session: Checkout 500 Errors — taxRate NullPointerException
Step 1: Problem Summary
| Field | Detail |
|---|---|
| Expected | calculateTotal() returns a valid price object for all checkout requests |
| Actual | NullPointerException on taxRate — crashing ~30% of prod requests |
| Started | After deploy of pricing-service v2.4.1 on Friday |
| Reproducibility | Intermittent, production only, skews toward international orders |
| Tried | Feature flag rollback — no effect |
Bug category: Environment + Wrong Output (data-dependent crash)
The flag rollback ruling out the UI layer is a strong signal. The fault likely lives in the new pricing-service backend logic or its data dependencies.
Step 2: Hypotheses
Hypothesis 1 — Missing tax region data for non-US locales (Most likely)
- What: v2.4.1 introduced currency lookups but may not have backfilled
taxRatefor all regions in thetax_regionstable or config map. - Evidence for: Skews international; NullPointerException specifically on
taxRate; feature was added in this version. - Evidence against: Staging has no errors — but staging may only have US test fixtures.
- Test: Run
SELECT region_code, tax_rate FROM tax_regions WHERE tax_rate IS NULL;against the prod database. Also check if the code has a null guard before accessingtaxRate.
Hypothesis 2 — Race condition in currency resolution cache
- What: The new multi-currency layer caches exchange rates; a cache miss returns
undefinedinstead of a fallback, andcalculateTotal()doesn't guard against it. - Evidence for: Intermittent failure rate (not 100%) fits a cache miss pattern.
- Evidence against: Would expect a different error path — cache misses usually return a typed empty object, not
undefined. - Test: Add a log line immediately before line 84:
console.log('regionConfig:', regionConfig)— confirm whether the entire object isundefinedor just thetaxRateproperty.
Hypothesis 3 — Environment config missing in prod
- What: A new
TAX_CONFIG_PATHor similar env var introduced in v2.4.1 was set in staging but not added to the prod deployment manifest. - Evidence for: Classic "works in staging, fails in prod" pattern; env vars are a common deployment gap.
- Evidence against: Would expect 100% failure rate, not 30% — unless config controls a subset of code paths.
- Test:
printenv | grep TAXon a prod container vs. staging. Check the v2.4.1 release notes and deployment checklist for new required vars.
Hypothesis 4 — Null not handled for orders with no shipping address country
- What: Guest checkout or saved addresses without a
countryfield causes the region lookup to returnundefined. - Evidence for: ~30% could correlate with guest checkout rate.
- Evidence against: International framing suggests country is present, just unsupported.
- Test: Filter 500 error logs by
user_type: guest— check if error rate is disproportionate.
Step 3: Recommended First Move
Start with Hypothesis 1 — fastest, highest-signal, zero-risk.
-- Run against prod (read-only)
SELECT region_code, tax_rate
FROM tax_regions
WHERE tax_rate IS NULL OR tax_rate = 0;
Simultaneously, inspect line 84 of PricingService.calculateTotal():
// Current (suspected)
const total = subtotal * (1 + regionConfig.taxRate);
// Is there a null guard? If not, that's a latent bug regardless of root cause.
const total = subtotal * (1 + (regionConfig?.taxRate ?? 0));
Step 4: If Hypothesis 1 Confirmed — Proposed Fix
Immediate (stop the bleeding):
// PricingService.js, line 84
const taxRate = regionConfig?.taxRate ?? DEFAULT_TAX_RATE;
const total = subtotal * (1 + taxRate);
Backfill missing data:
INSERT INTO tax_regions (region_code, tax_rate, currency)
VALUES ('GB', 0.20, 'GBP'), ('DE', 0.19, 'EUR'), ('FR', 0.20, 'EUR')
-- ... remaining missing regions from audit query
ON CONFLICT (region_code) DO UPDATE SET tax_rate = EXCLUDED.tax_rate;
Regression test:
it('should not throw when taxRate is missing for a region', () => {
const mockRegionConfig = { currency: 'GBP' }; // no taxRate
expect(() => calculateTotal(100, mockRegionConfig)).not.toThrow();
expect(calculateTotal(100, mockRegionConfig)).toBe(100); // fallback to 0% tax
});
Step 5b: Escalation Assessment
⚠️ Escalation warranted.
| Indicator | Status |
|---|---|
| User-facing impact (>1 user) | ✅ Yes — ~30% of checkout attempts failing |
| Data integrity risk | ✅ Monitor — orders that did complete may have incorrect tax applied |
| SLA / revenue impact | ✅ Yes — checkout failures directly block revenue |
| Multi-service coordination needed | Possibly — if tax_regions is shared with invoice-service |
Recommended actions:
- Notify: #incidents Slack, eng lead, and head of payments immediately
- Status page: Post degraded checkout experience for international users
- Check invoice-service — if it reads the same
tax_regionstable, invoices generated since Friday may have incorrect tax calculations (data integrity issue) - Postmortem: Run
/incident-postmortem— this is the second pricing-related regression in 90 days
Pattern flag: This bug would have been caught by integration tests using non-US fixtures. Recommend a /dx-assessment for the pricing-service test strategy — staging data coverage appears systematically US-only.