Skip to main content
Engineering/debug-assist

Debug Assist

You need a hypothesis-driven debugging flow.

Use this when you're stuck on a bug and want a systematic approach instead of random guessing. Works for runtime errors, unexpected behavior, test failures, and "it works on my machine" problems.

Process

Step 1: Capture the problem

Ask the engineer:

  1. What did you expect to happen?
  2. What actually happened? (Error message, wrong output, crash, hang — be specific.)
  3. When did it start? (After a deploy, a dependency update, a code change, or "it's always been like this.")
  4. Can you reproduce it reliably? (Always, sometimes, only in production, only on Tuesdays.)
  5. What have you already tried?

Step 2: Narrow the problem space

Based on the answers, classify the bug:

CategorySignalsFirst moves
Crash / exceptionStack trace, error messageRead the stack trace bottom-up. Identify the throwing line. Check inputs to that function.
Wrong outputCode runs but produces incorrect resultsIdentify the last point where data is correct. Trace forward to where it diverges.
PerformanceSlow, hanging, timeoutsCheck for N+1 queries, unbounded loops, missing indexes, blocking I/O.
IntermittentFlaky, race condition, timing-dependentLook for shared mutable state, missing locks, order-dependent operations, time-sensitive logic.
EnvironmentWorks locally, fails in CI/staging/prodCompare env vars, dependency versions, OS differences, network access, file paths.

Step 3: Form hypotheses

Generate 2-4 hypotheses ranked by likelihood. For each:

  • Hypothesis: One sentence — what might be causing this.
  • Evidence for: What supports this theory.
  • Evidence against: What doesn't fit.
  • Test: The fastest way to confirm or eliminate this hypothesis (a log statement, a test case, a config change — not a rewrite).

Present hypotheses and ask the engineer which to investigate first, or recommend the fastest-to-test option.

Step 4: Investigate

For the chosen hypothesis:

  1. Read the relevant code.
  2. Suggest the minimal investigation step (add a log, write a failing test, check a config value, inspect state at a breakpoint).
  3. If the hypothesis is confirmed — propose the fix.
  4. If eliminated — move to the next hypothesis.
  5. If investigation reveals new information — update hypotheses and re-rank.

Repeat until the root cause is identified.

Step 5: Fix and verify

Once the root cause is found:

  1. Write a failing test that reproduces the bug (if one doesn't exist).
  2. Propose the fix.
  3. Run the test — confirm it passes.
  4. Run the full test suite — confirm nothing else broke.
  5. Briefly explain the root cause so the team can learn from it.

Step 5b: Escalation assessment

Before closing the debugging session, evaluate whether this bug signals something larger.

Incident indicators -- escalate if any are true:

  • User-facing impact affecting more than one user or session
  • Data integrity compromise (wrong data written, data loss, corruption)
  • Security implications (unauthorized access, data exposure)
  • SLA breach or error budget burn
  • The fix requires coordination across multiple teams or services
  • The same root cause could affect other features or services

If escalation is warranted, recommend who to notify, what communication is needed (status page, Slack, stakeholder update), and whether to trigger a formal postmortem using /incident-postmortem.

Pattern recognition -- check for systemic signals:

  • Has this category of bug occurred before? (Check incident history, bug tracker.)
  • Is the root cause in a shared component, library, or infrastructure layer?
  • Would this bug have been caught by existing tests if they existed?
  • Does the fix address the symptom or the underlying cause?

If a pattern emerges, recommend whether this warrants a /tech-debt-assessment for the affected area or a /dx-assessment if the pattern points to tooling or workflow friction.

Output location

Present hypotheses, investigation results, and the fix in the conversation. Apply code changes directly to the codebase when approved.

Example Output

Input

  • Engineer's report: "Our checkout API is returning 500 errors for roughly 30% of requests in production. No errors locally or in staging."
  • Error message: NullPointerException: Cannot read property 'taxRate' of undefined in PricingService.calculateTotal() at line 84
  • When it started: After Friday's deploy of pricing-service v2.4.1, which added multi-currency support
  • Reproducibility: Intermittent in production only — seems to happen more on orders from international customers
  • Already tried: Rolled back a feature flag for the new currency selector UI, but errors continued

Output (abbreviated)

Debug Session: Checkout 500 Errors — taxRate NullPointerException

Step 1: Problem Summary

FieldDetail
ExpectedcalculateTotal() returns a valid price object for all checkout requests
ActualNullPointerException on taxRate — crashing ~30% of prod requests
StartedAfter deploy of pricing-service v2.4.1 on Friday
ReproducibilityIntermittent, production only, skews toward international orders
TriedFeature flag rollback — no effect

Bug category: Environment + Wrong Output (data-dependent crash) The flag rollback ruling out the UI layer is a strong signal. The fault likely lives in the new pricing-service backend logic or its data dependencies.


Step 2: Hypotheses

Hypothesis 1 — Missing tax region data for non-US locales (Most likely)

  • What: v2.4.1 introduced currency lookups but may not have backfilled taxRate for all regions in the tax_regions table or config map.
  • Evidence for: Skews international; NullPointerException specifically on taxRate; feature was added in this version.
  • Evidence against: Staging has no errors — but staging may only have US test fixtures.
  • Test: Run SELECT region_code, tax_rate FROM tax_regions WHERE tax_rate IS NULL; against the prod database. Also check if the code has a null guard before accessing taxRate.

Hypothesis 2 — Race condition in currency resolution cache

  • What: The new multi-currency layer caches exchange rates; a cache miss returns undefined instead of a fallback, and calculateTotal() doesn't guard against it.
  • Evidence for: Intermittent failure rate (not 100%) fits a cache miss pattern.
  • Evidence against: Would expect a different error path — cache misses usually return a typed empty object, not undefined.
  • Test: Add a log line immediately before line 84: console.log('regionConfig:', regionConfig) — confirm whether the entire object is undefined or just the taxRate property.

Hypothesis 3 — Environment config missing in prod

  • What: A new TAX_CONFIG_PATH or similar env var introduced in v2.4.1 was set in staging but not added to the prod deployment manifest.
  • Evidence for: Classic "works in staging, fails in prod" pattern; env vars are a common deployment gap.
  • Evidence against: Would expect 100% failure rate, not 30% — unless config controls a subset of code paths.
  • Test: printenv | grep TAX on a prod container vs. staging. Check the v2.4.1 release notes and deployment checklist for new required vars.

Hypothesis 4 — Null not handled for orders with no shipping address country

  • What: Guest checkout or saved addresses without a country field causes the region lookup to return undefined.
  • Evidence for: ~30% could correlate with guest checkout rate.
  • Evidence against: International framing suggests country is present, just unsupported.
  • Test: Filter 500 error logs by user_type: guest — check if error rate is disproportionate.

Step 3: Recommended First Move

Start with Hypothesis 1 — fastest, highest-signal, zero-risk.

-- Run against prod (read-only)
SELECT region_code, tax_rate
FROM tax_regions
WHERE tax_rate IS NULL OR tax_rate = 0;

Simultaneously, inspect line 84 of PricingService.calculateTotal():

// Current (suspected)
const total = subtotal * (1 + regionConfig.taxRate);

// Is there a null guard? If not, that's a latent bug regardless of root cause.
const total = subtotal * (1 + (regionConfig?.taxRate ?? 0));

Step 4: If Hypothesis 1 Confirmed — Proposed Fix

Immediate (stop the bleeding):

// PricingService.js, line 84
const taxRate = regionConfig?.taxRate ?? DEFAULT_TAX_RATE;
const total = subtotal * (1 + taxRate);

Backfill missing data:

INSERT INTO tax_regions (region_code, tax_rate, currency)
VALUES ('GB', 0.20, 'GBP'), ('DE', 0.19, 'EUR'), ('FR', 0.20, 'EUR')
-- ... remaining missing regions from audit query
ON CONFLICT (region_code) DO UPDATE SET tax_rate = EXCLUDED.tax_rate;

Regression test:

it('should not throw when taxRate is missing for a region', () => {
  const mockRegionConfig = { currency: 'GBP' }; // no taxRate
  expect(() => calculateTotal(100, mockRegionConfig)).not.toThrow();
  expect(calculateTotal(100, mockRegionConfig)).toBe(100); // fallback to 0% tax
});

Step 5b: Escalation Assessment

⚠️ Escalation warranted.

IndicatorStatus
User-facing impact (>1 user)✅ Yes — ~30% of checkout attempts failing
Data integrity risk✅ Monitor — orders that did complete may have incorrect tax applied
SLA / revenue impact✅ Yes — checkout failures directly block revenue
Multi-service coordination neededPossibly — if tax_regions is shared with invoice-service

Recommended actions:

  1. Notify: #incidents Slack, eng lead, and head of payments immediately
  2. Status page: Post degraded checkout experience for international users
  3. Check invoice-service — if it reads the same tax_regions table, invoices generated since Friday may have incorrect tax calculations (data integrity issue)
  4. Postmortem: Run /incident-postmortem — this is the second pricing-related regression in 90 days

Pattern flag: This bug would have been caught by integration tests using non-US fixtures. Recommend a /dx-assessment for the pricing-service test strategy — staging data coverage appears systematically US-only.