Skip to main content
Engineering/incident-postmortem

Incident Postmortem

You need a structured blameless postmortem for a production incident – timeline, contributing factors, action items, and facilitation guide.

Use this when a production incident has been resolved and the team needs to learn from it -- or when you're facilitating a postmortem for a client team. Also use proactively to establish postmortem culture before the first incident hits. Also works as a project or engagement postmortem -- reflecting on completed work to capture lessons learned without incident-specific framing. Produces a postmortem document with timeline, contributing factors, severity classification (or goal assessment), and tracked action items.

Related skills: Use /delivery-diagnose when the problem is chronic delivery issues rather than a discrete incident. Pair with /instrumentation-plan to improve detection for future incidents. Use /observability-plan if the incident revealed blind spots in product-level monitoring. For significant or recurring problems that need corrective + preventive action with effectiveness verification, feed postmortem findings into /capa-design.

Process

Step 1: Gather inputs

Ask the user to provide: 0. Postmortem type -- incident (default) or project/engagement? Project mode replaces severity classification and incident timeline with outcome reflection and goal achievement assessment.

  1. Incident summary -- what happened, in plain language?
  2. Severity -- how bad was it? (Use the team's existing severity levels, or classify below.)
  3. Timeline data -- alerts, Slack messages, on-call pages, deploy logs, status page updates, customer reports. Raw is fine -- we'll structure it.
  4. Resolution -- what fixed it? Who was involved?
  5. Impact -- users affected, revenue impact, data loss, SLA breach, reputational cost.
  6. Team context -- who was on call? What's the team's postmortem experience level?

If severity levels don't exist yet, use this default scale:

LevelLabelDefinition
SEV-1CriticalService down for all users, data loss, or security breach
SEV-2MajorSignificant degradation, large subset of users affected
SEV-3MinorLimited impact, workaround available, small user subset
SEV-4LowCosmetic or edge case, no user-facing impact

If project/engagement postmortem: Skip the severity table. Instead gather:

  • Original goals and success criteria -- what were we trying to achieve?
  • What was delivered vs. planned -- scope, timeline, quality
  • Stakeholder feedback -- what did clients/sponsors say?
  • Team composition and dynamics -- who was involved, how did collaboration work?
  • Timeline -- planned vs. actual milestones

Step 2: Reconstruct the timeline

Build a chronological timeline from the raw data. For each entry, capture:

### Timeline

| Time (UTC) | Event | Source | Actor |
|-----------|-------|--------|-------|
| (timestamp) | (what happened) | (alert / Slack / deploy log / customer report) | (person or system) |

**Key timestamps:**
- **Detection:** (when did we first know?) -- (how: alert / customer report / engineer noticed)
- **Escalation:** (when did the right people get involved?)
- **Mitigation:** (when was the bleeding stopped?)
- **Resolution:** (when was the root fix applied?)
- **Communication:** (when were customers/stakeholders informed?)

**Time gaps to investigate:**
- Detection to escalation: (N minutes) -- (acceptable? too slow?)
- Escalation to mitigation: (N minutes) -- (what slowed this down?)

Flag any gaps where "we don't know what happened" -- those are learning opportunities, not failures.

Step 3: Identify contributing factors

Use a contributing factors analysis, not "root cause analysis." Incidents rarely have a single root cause -- they emerge from the interaction of multiple contributing factors.

Categorize contributing factors:

### Contributing factors

**Process factors:**
- (e.g., "Deployment happened on Friday afternoon with no one watching metrics")
- (e.g., "Runbook was outdated -- referenced services that were decommissioned 6 months ago")

**Technical factors:**
- (e.g., "No circuit breaker on the payment service dependency")
- (e.g., "Database connection pool exhaustion under load -- no connection limit alerting")

**Human factors:**
- (e.g., "On-call engineer was unfamiliar with the payment service -- joined the team 2 weeks ago")
- (e.g., "Alert fatigue -- 47 low-priority alerts in the previous 24 hours masked the real signal")

**Organizational factors:**
- (e.g., "No staging environment that mirrors production load patterns")
- (e.g., "Team responsible for the failing service was dissolved last quarter -- no clear owner")

Blameless framing rules:

  • Name roles, not people. "The on-call engineer" not "Alex."
  • Describe what happened, not what someone should have done. "The alert was not actionable" not "They ignored the alert."
  • Ask "what made this outcome likely?" not "who caused this?"
  • Treat every human action as reasonable given what they knew at the time.

Step 4: Assess detection and response

Evaluate how well the system and team responded:

### Detection and response assessment

| Phase | What happened | What worked | What didn't |
|-------|-------------|-------------|-------------|
| Detection | (how was it discovered?) | (good: automated alert fired in < 2 min) | (bad: alert was noisy, buried in other alerts) |
| Triage | (how was severity assessed?) | (good: clear severity rubric existed) | (bad: took 20 min to determine customer impact) |
| Communication | (who was notified, when?) | (good: status page updated within 5 min) | (bad: internal Slack channel was wrong, half the team missed it) |
| Mitigation | (how was bleeding stopped?) | (good: feature flag killed the bad code path in 3 min) | (bad: rollback took 45 min because deploy pipeline was queued) |
| Resolution | (how was it fully fixed?) | (good: fix was straightforward once understood) | (bad: no tests existed for this code path) |

Step 5: Generate the postmortem document

## Postmortem: {{incident title}}

**Date of incident:** {{date}}
**Severity:** {{SEV level and label}}
**Duration:** {{detection to resolution}}
**Author:** {{who wrote this}}
**Postmortem date:** {{when this review happened}}

### Summary
(2-3 sentences: what happened, what the impact was, how it was resolved.)

### Impact
- **Users affected:** (number or percentage)
- **Duration of impact:** (time from user-facing impact start to resolution)
- **Revenue impact:** (if measurable)
- **SLA impact:** (any SLA breach?)
- **Data impact:** (any data loss or corruption?)

### Timeline
(From Step 2)

### Contributing factors
(From Step 3)

### Detection and response assessment
(From Step 4)

### What went well
- (Things that worked during the incident -- call them out explicitly)
- (Fast detection, clear communication, effective mitigation, good teamwork)

### Action items

| ID | Action | Type | Owner | Due date | Status |
|----|--------|------|-------|----------|--------|
| 1 | (specific action) | Prevent / Detect / Mitigate / Process | (team or role) | (date) | Open |
| 2 | (specific action) | (type) | (team or role) | (date) | Open |

**Action item types:**
- **Prevent:** Stop this class of incident from happening (e.g., add input validation)
- **Detect:** Catch it faster next time (e.g., add alerting on connection pool usage)
- **Mitigate:** Reduce impact when it does happen (e.g., add circuit breaker)
- **Process:** Improve how we respond (e.g., update runbook, rotate on-call training)

### Follow-up schedule
- **Action item review:** (date -- 2 weeks after postmortem)
- **Retro check-in:** (date -- next retro to confirm actions are progressing)

If project/engagement postmortem, use this template instead:

## Project Postmortem: {{project or engagement name}}

**Engagement period:** {{start date}} -- {{end date}}
**Team:** {{team members and roles}}
**Author:** {{who wrote this}}
**Postmortem date:** {{when this review happened}}

### Summary
(2-3 sentences: what the project was, what was delivered, overall assessment.)

### Goals vs. Outcomes

| Goal | Target | Actual | Assessment |
|------|--------|--------|------------|
| (goal 1) | (success criteria) | (what happened) | Met / Partially met / Not met |

### What went well
- (Practices, decisions, or dynamics that worked)

### What we would do differently
- (Specific changes, not blame -- what would we adjust if starting over?)

### Lessons learned

**Process:** (what we learned about how we work)
**Technical:** (what we learned about the technology or architecture)
**People:** (what we learned about team dynamics, communication, collaboration)
**Client:** (what we learned about stakeholder management, expectations, engagement)

### Recommendations for future engagements
- (Reusable patterns, warnings, or advice for the next team)

### Action items

| ID | Action | Owner | Due date | Status |
|----|--------|-------|----------|--------|
| 1 | (specific action) | (person or team) | (date) | Open |

Step 6: Facilitate the postmortem meeting (optional)

If the user is facilitating a live postmortem, provide this agenda:

  1. Set the tone (2 min) -- "This is a blameless postmortem. We're here to learn, not to assign blame. Every action taken during the incident was reasonable given what people knew at the time."
  2. Walk through the timeline (10 min) -- present the timeline, ask for corrections and additions.
  3. Discuss contributing factors (15 min) -- present the factors, ask "what else made this outcome likely?"
  4. Identify action items (10 min) -- for each contributing factor, ask "what would reduce the likelihood or impact of this factor?"
  5. Prioritize actions (5 min) -- rank by leverage (prevent > detect > mitigate > process) and effort.
  6. Assign owners and dates (5 min) -- every action item gets an owner and a due date. No orphan items.
  7. Close (3 min) -- confirm follow-up schedule, thank the team.

Facilitation tips:

  • If someone starts blaming, redirect: "Let's focus on what made that outcome likely, not who did what."
  • If the group fixates on a single root cause, ask: "What else had to be true for this to happen?"
  • If action items are vague, push for specificity: "What does 'improve monitoring' mean concretely? What metric, what threshold, what alert?"

Step 7: Discuss

Ask the user:

  • Does this capture the incident accurately?
  • Are there contributing factors we missed?
  • Are the action items concrete enough to execute?
  • Should I draft a customer-facing summary or internal announcement?
  • Want me to set up a follow-up action item review prompt?

Output location

Present the postmortem document as formatted text in the conversation or save to a file if requested.

Example Output

Input

  • Incident summary: "Checkout flow was completely broken for roughly 90 minutes on Black Friday morning. Customers couldn't complete purchases -- carts were freezing at the payment step. Eventually traced to a Redis connection pool exhaustion triggered by a last-minute promo banner deploy that introduced an unintentional cache stampede."
  • Company / team context: Meridian Home Goods, mid-size e-commerce company. 6-person platform team, postmortem culture is informal -- they do verbal debriefs but have never written a structured postmortem document. On-call engineer had been with the team for 3 weeks.
  • Timeline data (raw): "Deploy went out at 07:14 ET. First customer complaint in Zendesk at 07:22. Datadog alert for elevated 5xx rate fired at 07:31. On-call paged at 07:33. Senior engineer joined at 07:51 after Slack DM (pager didn't reach her). Rolled back deploy at 08:19. Checkout confirmed stable at 08:47."
  • Impact: Approximately 4,200 failed checkout attempts, estimated $340K in lost GMV, 3 SLA breach notifications sent to enterprise wholesale accounts.
  • Resolution: Rolled back the promo banner deploy; Redis connection limit raised as emergency config change; feature flag added to decouple promo content from checkout critical path.

Output

Postmortem: Checkout Outage — Black Friday Cache Stampede

Date of incident: November 24, 2023 Severity: SEV-1 — Critical (checkout unavailable for all users) Duration: 85 minutes (07:22 ET — 08:47 ET, user-facing) Author: Platform Team Postmortem date: November 27, 2023


Summary

On the morning of Black Friday, a promo banner deploy at 07:14 ET introduced a cache stampede against Meridian's Redis cluster, exhausting the connection pool and causing checkout to fail with 5xx errors across all users. Roughly 4,200 transactions failed over 85 minutes before a deploy rollback and emergency Redis configuration change restored service at 08:47 ET. Detection was delayed by 9 minutes and escalation was slowed by an unreliable paging setup, extending total impact duration unnecessarily.


Impact

  • Users affected: ~4,200 failed checkout attempts (all users attempting payment during the window)
  • Duration of impact: 85 minutes (07:22 — 08:47 ET)
  • Revenue impact: ~$340,000 in lost GMV (estimated from avg order value × failed checkouts)
  • SLA impact: 3 enterprise wholesale account SLA breach notifications triggered
  • Data impact: No data loss or corruption; no orders partially committed

Timeline

Time (ET)EventSourceActor
07:14Promo banner deploy pushed to productionDeploy logCI/CD pipeline
07:22First customer complaint: "cart is frozen"Zendesk ticketCustomer
07:31Datadog 5xx rate alert fired (threshold: >2% error rate)Automated alertDatadog
07:33On-call engineer pagedPagerDutyPagerDuty
07:33–07:51On-call engineer investigating alone; unable to identify Redis as causeSlackOn-call engineer
07:51Senior engineer joined after direct Slack DM — pager had not reached herSlack DMOn-call engineer
07:58Redis connection pool exhaustion identified in Datadog APM tracesDatadogSenior engineer
08:03Status page updated: "We are investigating checkout issues"Status pageSenior engineer
08:08Decision made to roll back promo banner deploySlack war roomPlatform team
08:19Rollback completed; Redis connection limit raised via emergency configDeploy logSenior engineer
08:47Checkout confirmed stable; status page updated to resolvedStatus pageSenior engineer

Key timestamps:

  • Detection: 07:31 (automated alert) — 9 minutes after first customer report
  • Escalation: 07:51 — 20 minutes after alert fired; delayed by paging failure
  • Mitigation / Resolution: 08:19 rollback; 08:47 confirmed stable
  • Customer communication: 08:03 — 41 minutes after first complaint

Time gaps to investigate:

  • Detection to escalation: 20 minutes — pager failed to reach senior engineer; on-call engineer working the problem alone for 18 minutes without domain expertise
  • Alert to first customer communication: 32 minutes — status page update was delayed while team was heads-down diagnosing

Contributing Factors

Process factors:

  • The promo banner deploy was executed on Black Friday morning with no deployment freeze or elevated review process in place for peak trading periods
  • No rollback rehearsal had been run; the deploy pipeline had a 11-minute queue at the time of rollback due to other jobs
  • The on-call runbook did not cover Redis connection pool exhaustion as a known failure mode
  • Status page update process was not assigned to a specific role — it fell to whoever was least busy diagnosing

Technical factors:

  • The promo banner feature made synchronous Redis calls inside the checkout critical path, creating tight coupling that allowed a content change to affect transaction processing
  • Redis connection pool limit was set to a default value from initial cluster setup 18 months ago — never revisited as traffic grew
  • No alerting existed on Redis connection pool utilization; the 5xx alert was the only signal, which fires downstream of the actual cause
  • Cache stampede protection (e.g., probabilistic early expiration, request coalescing) was not implemented on the banner cache key

Human factors:

  • The on-call engineer had been on the team for 3 weeks and had not been through a SEV-1 before; no shadow on-call rotation had preceded independent on-call duty
  • Alert fatigue context: Datadog showed 31 low-priority alerts in the 24 hours preceding the incident; the SEV-1 alert appeared in the same channel

Organizational factors:

  • No deployment freeze policy existed for peak commerce events despite Black Friday being the highest-traffic day of the year
  • The promo banner feature had been built by a contractor engagement that concluded in September; institutional knowledge of the Redis coupling was not documented

Detection and Response Assessment

PhaseWhat happenedWhat workedWhat didn't
DetectionDatadog 5xx alert fired at 07:31Automated alerting did fire9-minute lag behind customer reports; no upstream Redis alert
TriageOn-call engineer triaged alone for 18 minEngineer correctly identified checkout as the failing surfaceNo Redis-specific runbook; root cause identification took 27 min from alert
CommunicationStatus page updated at 08:03Update did go out41 minutes after first complaint; enterprise accounts not directly notified
MitigationRollback executed at 08:19Clear decision once senior engineer was involvedPipeline queue added 11 min to rollback; rollback path wasn't pre-validated
ResolutionRedis config change + rollback stabilized at 08:47Feature flag decoupling added as immediate hardeningFix required senior engineer domain knowledge not documented anywhere

What Went Well

  • Automated 5xx alerting did fire and page the on-call engineer within 2 minutes of threshold crossing
  • Once the senior engineer joined, root cause was identified in 7 minutes from APM trace data — Datadog instrumentation was sufficient to find the problem quickly
  • The decision to roll back was made quickly once the team aligned; no prolonged debate
  • Feature flag decoupling was implemented as part of the resolution, not deferred — immediate hardening in the middle of an incident

Action Items

IDActionTypeOwnerDue dateStatus
1Implement deployment freeze window covering 48 hours around peak commerce events (Black Friday, Cyber Monday, major sales); codify in deployment policyPreventPlatform + Eng LeadershipDec 8Open
2Add Redis connection pool utilization alert (threshold: >75% pool usage) in Datadog; route to same SEV-1 channelDetectPlatform TeamDec 1Open