Use this when a production incident has been resolved and the team needs to learn from it, or when you're facilitating a postmortem for a client team. It can also be used proactively to establish a postmortem culture before the first incident hits, or as a project or engagement postmortem that reflects on completed work and captures lessons learned without incident-specific framing. Produces a postmortem document with a timeline, contributing factors, severity classification (or goal assessment), and tracked action items.
Related skills: Use /delivery-diagnose when the problem is chronic delivery issues rather than a discrete incident. Pair with /instrumentation-plan to improve detection for future incidents. Use /observability-plan if the incident revealed blind spots in product-level monitoring. For significant or recurring problems that need corrective + preventive action with effectiveness verification, feed postmortem findings into /capa-design.
Process
Step 1: Gather inputs
Ask the user to provide:
- Postmortem type -- incident (default) or project/engagement? Project mode replaces severity classification and incident timeline with outcome reflection and goal achievement assessment.
- Incident summary -- what happened, in plain language?
- Severity -- how bad was it? (Use the team's existing severity levels, or classify below.)
- Timeline data -- alerts, Slack messages, on-call pages, deploy logs, status page updates, customer reports. Raw is fine -- we'll structure it.
- Resolution -- what fixed it? Who was involved?
- Impact -- users affected, revenue impact, data loss, SLA breach, reputational cost.
- Team context -- who was on call? What's the team's postmortem experience level?
If severity levels don't exist yet, use this default scale:
| Level | Label | Definition |
|---|---|---|
| SEV-1 | Critical | Service down for all users, data loss, or security breach |
| SEV-2 | Major | Significant degradation, large subset of users affected |
| SEV-3 | Minor | Limited impact, workaround available, small user subset |
| SEV-4 | Low | Cosmetic or edge case, no user-facing impact |
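For teams that want the default scale applied consistently, the table can be read as an ordered set of checks, most severe first. The sketch below is one possible encoding, not a replacement for judgment; the function name and its boolean parameters are hypothetical and only mirror the wording of the table.

```python
def classify_severity(full_outage=False, data_loss=False, security_breach=False,
                      significant_degradation=False, user_facing=True):
    """Map incident facts to the default scale, checking the worst case first."""
    # SEV-1: service down for all users, data loss, or security breach.
    if full_outage or data_loss or security_breach:
        return "SEV-1"
    # SEV-2: significant degradation affecting a large subset of users.
    if significant_degradation:
        return "SEV-2"
    # SEV-3: limited but still user-facing impact.
    if user_facing:
        return "SEV-3"
    # SEV-4: cosmetic or edge case with no user-facing impact.
    return "SEV-4"


print(classify_severity(full_outage=True))   # checkout down for everyone
print(classify_severity(user_facing=False))  # cosmetic issue
```

Checking the most severe conditions first means an incident that matches several rows always lands on the highest applicable level.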
If project/engagement postmortem: Skip the severity table. Instead gather:
- Original goals and success criteria -- what were we trying to achieve?
- What was delivered vs. planned -- scope, timeline, quality
- Stakeholder feedback -- what did clients/sponsors say?
- Team composition and dynamics -- who was involved, how did collaboration work?
- Timeline -- planned vs. actual milestones
Step 2: Reconstruct the timeline
Build a chronological timeline from the raw data. For each entry, capture:
### Timeline
| Time (UTC) | Event | Source | Actor |
|-----------|-------|--------|-------|
| (timestamp) | (what happened) | (alert / Slack / deploy log / customer report) | (person or system) |
**Key timestamps:**
- **Detection:** (when did we first know?) -- (how: alert / customer report / engineer noticed)
- **Escalation:** (when did the right people get involved?)
- **Mitigation:** (when was the bleeding stopped?)
- **Resolution:** (when was the root fix applied?)
- **Communication:** (when were customers/stakeholders informed?)
**Time gaps to investigate:**
- Detection to escalation: (N minutes) -- (acceptable? too slow?)
- Escalation to mitigation: (N minutes) -- (what slowed this down?)
Flag any gaps where "we don't know what happened" -- those are learning opportunities, not failures.
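The gaps between key timestamps are simple subtraction, but it is easy to get them wrong by hand when a timeline is long. Here is a minimal sketch that computes them, assuming all timestamps are `HH:MM` strings on the same day and in the same time zone. The function names are illustrative; the sample values are the key timestamps from the Black Friday example later in this document.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def time_gaps(key_timestamps):
    """Gap in minutes between each consecutive pair of response phases."""
    phases = ["detection", "escalation", "mitigation", "resolution"]
    return {
        f"{a} to {b}": minutes_between(key_timestamps[a], key_timestamps[b])
        for a, b in zip(phases, phases[1:])
    }


# Key timestamps from the Black Friday example (ET).
example = {
    "detection": "07:31",
    "escalation": "07:51",
    "mitigation": "08:19",
    "resolution": "08:47",
}

gaps = time_gaps(example)
for label, minutes in gaps.items():
    print(f"{label}: {minutes} min")
```

An incident that crosses midnight or spans time zones would need full date-times rather than `HH:MM` strings.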
Step 3: Identify contributing factors
Use a contributing factors analysis, not "root cause analysis." Incidents rarely have a single root cause -- they emerge from the interaction of multiple contributing factors.
Categorize contributing factors:
### Contributing factors
**Process factors:**
- (e.g., "Deployment happened on Friday afternoon with no one watching metrics")
- (e.g., "Runbook was outdated -- referenced services that were decommissioned 6 months ago")
**Technical factors:**
- (e.g., "No circuit breaker on the payment service dependency")
- (e.g., "Database connection pool exhaustion under load -- no connection limit alerting")
**Human factors:**
- (e.g., "On-call engineer was unfamiliar with the payment service -- joined the team 2 weeks ago")
- (e.g., "Alert fatigue -- 47 low-priority alerts in the previous 24 hours masked the real signal")
**Organizational factors:**
- (e.g., "No staging environment that mirrors production load patterns")
- (e.g., "Team responsible for the failing service was dissolved last quarter -- no clear owner")
Blameless framing rules:
- Name roles, not people. "The on-call engineer" not "Alex."
- Describe what happened, not what someone should have done. "The alert was not actionable" not "They ignored the alert."
- Ask "what made this outcome likely?" not "who caused this?"
- Treat every human action as reasonable given what they knew at the time.
Step 4: Assess detection and response
Evaluate how well the system and team responded:
### Detection and response assessment
| Phase | What happened | What worked | What didn't |
|-------|-------------|-------------|-------------|
| Detection | (how was it discovered?) | (good: automated alert fired in < 2 min) | (bad: alert was noisy, buried in other alerts) |
| Triage | (how was severity assessed?) | (good: clear severity rubric existed) | (bad: took 20 min to determine customer impact) |
| Communication | (who was notified, when?) | (good: status page updated within 5 min) | (bad: internal Slack channel was wrong, half the team missed it) |
| Mitigation | (how was bleeding stopped?) | (good: feature flag killed the bad code path in 3 min) | (bad: rollback took 45 min because deploy pipeline was queued) |
| Resolution | (how was it fully fixed?) | (good: fix was straightforward once understood) | (bad: no tests existed for this code path) |
Step 5: Generate the postmortem document
## Postmortem: {{incident title}}
**Date of incident:** {{date}}
**Severity:** {{SEV level and label}}
**Duration:** {{detection to resolution}}
**Author:** {{who wrote this}}
**Postmortem date:** {{when this review happened}}
### Summary
(2-3 sentences: what happened, what the impact was, how it was resolved.)
### Impact
- **Users affected:** (number or percentage)
- **Duration of impact:** (time from user-facing impact start to resolution)
- **Revenue impact:** (if measurable)
- **SLA impact:** (any SLA breach?)
- **Data impact:** (any data loss or corruption?)
### Timeline
(From Step 2)
### Contributing factors
(From Step 3)
### Detection and response assessment
(From Step 4)
### What went well
- (Things that worked during the incident -- call them out explicitly)
- (Fast detection, clear communication, effective mitigation, good teamwork)
### Action items
| ID | Action | Type | Owner | Due date | Status |
|----|--------|------|-------|----------|--------|
| 1 | (specific action) | Prevent / Detect / Mitigate / Process | (team or role) | (date) | Open |
| 2 | (specific action) | (type) | (team or role) | (date) | Open |
**Action item types:**
- **Prevent:** Stop this class of incident from happening (e.g., add input validation)
- **Detect:** Catch it faster next time (e.g., add alerting on connection pool usage)
- **Mitigate:** Reduce impact when it does happen (e.g., add circuit breaker)
- **Process:** Improve how we respond (e.g., update runbook, rotate on-call training)
### Follow-up schedule
- **Action item review:** (date -- 2 weeks after postmortem)
- **Retro check-in:** (date -- next retro to confirm actions are progressing)
If project/engagement postmortem, use this template instead:
## Project Postmortem: {{project or engagement name}}
**Engagement period:** {{start date}} -- {{end date}}
**Team:** {{team members and roles}}
**Author:** {{who wrote this}}
**Postmortem date:** {{when this review happened}}
### Summary
(2-3 sentences: what the project was, what was delivered, overall assessment.)
### Goals vs. Outcomes
| Goal | Target | Actual | Assessment |
|------|--------|--------|------------|
| (goal 1) | (success criteria) | (what happened) | Met / Partially met / Not met |
### What went well
- (Practices, decisions, or dynamics that worked)
### What we would do differently
- (Specific changes, not blame -- what would we adjust if starting over?)
### Lessons learned
**Process:** (what we learned about how we work)
**Technical:** (what we learned about the technology or architecture)
**People:** (what we learned about team dynamics, communication, collaboration)
**Client:** (what we learned about stakeholder management, expectations, engagement)
### Recommendations for future engagements
- (Reusable patterns, warnings, or advice for the next team)
### Action items
| ID | Action | Owner | Due date | Status |
|----|--------|-------|----------|--------|
| 1 | (specific action) | (person or team) | (date) | Open |
Step 6: Facilitate the postmortem meeting (optional)
If the user is facilitating a live postmortem, provide this agenda:
- Set the tone (2 min) -- "This is a blameless postmortem. We're here to learn, not to assign blame. Every action taken during the incident was reasonable given what people knew at the time."
- Walk through the timeline (10 min) -- present the timeline, ask for corrections and additions.
- Discuss contributing factors (15 min) -- present the factors, ask "what else made this outcome likely?"
- Identify action items (10 min) -- for each contributing factor, ask "what would reduce the likelihood or impact of this factor?"
- Prioritize actions (5 min) -- rank by leverage (prevent > detect > mitigate > process) and effort.
- Assign owners and dates (5 min) -- every action item gets an owner and a due date. No orphan items.
- Close (3 min) -- confirm follow-up schedule, thank the team.
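The prioritization and ownership steps in the agenda can be sketched as a sort plus a check for orphan items. This is an illustrative sketch only: the `ActionItem` fields, the `effort_days` estimate, and the sample owners and dates are assumptions, while the action names and the leverage order come from this document.

```python
from dataclasses import dataclass

# Leverage order stated in the agenda: prevent > detect > mitigate > process.
LEVERAGE_RANK = {"Prevent": 0, "Detect": 1, "Mitigate": 2, "Process": 3}


@dataclass
class ActionItem:
    action: str
    kind: str          # one of the keys in LEVERAGE_RANK
    effort_days: int   # hypothetical effort estimate; lower is cheaper
    owner: str = ""
    due_date: str = ""


def prioritize(items):
    """Sort by leverage first, then by effort within the same type."""
    return sorted(items, key=lambda i: (LEVERAGE_RANK[i.kind], i.effort_days))


def orphans(items):
    """Items still missing an owner or a due date."""
    return [i for i in items if not i.owner or not i.due_date]


# Sample items; owners, dates, and effort values are made up for illustration.
items = [
    ActionItem("Update runbook", "Process", 1, "Platform team", "2024-01-15"),
    ActionItem("Add circuit breaker", "Mitigate", 5, "Platform team"),
    ActionItem("Add alerting on connection pool usage", "Detect", 2, "Platform team", "2024-01-10"),
    ActionItem("Add input validation", "Prevent", 3, "", "2024-01-20"),
    ActionItem("Implement deployment freeze window", "Prevent", 1, "Eng leadership", "2024-01-08"),
]

ranked = prioritize(items)
for item in ranked:
    print(f"{item.kind:<9} {item.effort_days}d  {item.action}")

for item in orphans(items):
    print(f"Needs owner or due date: {item.action}")
```

Sorting on a `(leverage, effort)` tuple keeps the leverage order strict and uses effort only as a tie-breaker; a team that weighs effort more heavily could swap in a scoring function instead.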
Facilitation tips:
- If someone starts blaming, redirect: "Let's focus on what made that outcome likely, not who did what."
- If the group fixates on a single root cause, ask: "What else had to be true for this to happen?"
- If action items are vague, push for specificity: "What does 'improve monitoring' mean concretely? What metric, what threshold, what alert?"
Step 7: Discuss
Ask the user:
- Does this capture the incident accurately?
- Are there contributing factors we missed?
- Are the action items concrete enough to execute?
- Should I draft a customer-facing summary or internal announcement?
- Want me to set up a follow-up action item review prompt?
Output location
Present the postmortem document as formatted text in the conversation, or save it to a file if requested.
Example Output
Input
- Incident summary: "Checkout flow was completely broken for roughly 90 minutes on Black Friday morning. Customers couldn't complete purchases -- carts were freezing at the payment step. Eventually traced to a Redis connection pool exhaustion triggered by a last-minute promo banner deploy that introduced an unintentional cache stampede."
- Company / team context: Meridian Home Goods, mid-size e-commerce company. 6-person platform team, postmortem culture is informal -- they do verbal debriefs but have never written a structured postmortem document. On-call engineer had been with the team for 3 weeks.
- Timeline data (raw): "Deploy went out at 07:14 ET. First customer complaint in Zendesk at 07:22. Datadog alert for elevated 5xx rate fired at 07:31. On-call paged at 07:33. Senior engineer joined at 07:51 after Slack DM (pager didn't reach her). Rolled back deploy at 08:19. Checkout confirmed stable at 08:47."
- Impact: Approximately 4,200 failed checkout attempts, estimated $340K in lost GMV, 3 SLA breach notifications sent to enterprise wholesale accounts.
- Resolution: Rolled back the promo banner deploy; Redis connection limit raised as emergency config change; feature flag added to decouple promo content from checkout critical path.
Output
Postmortem: Checkout Outage — Black Friday Cache Stampede
Date of incident: November 24, 2023
Severity: SEV-1 Critical (checkout unavailable for all users)
Duration: 85 minutes (07:22 to 08:47 ET, user-facing)
Author: Platform Team
Postmortem date: November 27, 2023
Summary
On the morning of Black Friday, a promo banner deploy at 07:14 ET introduced a cache stampede against Meridian's Redis cluster, exhausting the connection pool and causing checkout to fail with 5xx errors across all users. Roughly 4,200 checkout attempts failed over 85 minutes before a deploy rollback and an emergency Redis configuration change restored service at 08:47 ET. The automated alert fired 9 minutes after the first customer report, and escalation was slowed by a page that did not reach the senior engineer, which lengthened the impact.
Impact
- Users affected: ~4,200 failed checkout attempts (all users attempting payment during the window)
- Duration of impact: 85 minutes (07:22 — 08:47 ET)
- Revenue impact: ~$340,000 in lost GMV (estimated from avg order value × failed checkouts)
- SLA impact: 3 enterprise wholesale account SLA breach notifications triggered
- Data impact: No data loss or corruption; no orders partially committed
Timeline
| Time (ET) | Event | Source | Actor |
|---|---|---|---|
| 07:14 | Promo banner deploy pushed to production | Deploy log | CI/CD pipeline |
| 07:22 | First customer complaint: "cart is frozen" | Zendesk ticket | Customer |
| 07:31 | Datadog 5xx rate alert fired (threshold: >2% error rate) | Automated alert | Datadog |
| 07:33 | On-call engineer paged | PagerDuty | PagerDuty |
| 07:33–07:51 | On-call engineer investigating alone; unable to identify Redis as cause | Slack | On-call engineer |
| 07:51 | Senior engineer joined after direct Slack DM — pager had not reached her | Slack DM | On-call engineer |
| 07:58 | Redis connection pool exhaustion identified in Datadog APM traces | Datadog | Senior engineer |
| 08:03 | Status page updated: "We are investigating checkout issues" | Status page | Senior engineer |
| 08:08 | Decision made to roll back promo banner deploy | Slack war room | Platform team |
| 08:19 | Rollback completed; Redis connection limit raised via emergency config | Deploy log | Senior engineer |
| 08:47 | Checkout confirmed stable; status page updated to resolved | Status page | Senior engineer |
Key timestamps:
- Detection: 07:31 (automated alert) — 9 minutes after first customer report
- Escalation: 07:51 — 20 minutes after alert fired; delayed by paging failure
- Mitigation / Resolution: 08:19 rollback; 08:47 confirmed stable
- Customer communication: 08:03 — 41 minutes after first complaint
Time gaps to investigate:
- Detection to escalation: 20 minutes — pager failed to reach senior engineer; on-call engineer working the problem alone for 18 minutes without domain expertise
- Alert to first customer communication: 32 minutes — status page update was delayed while team was heads-down diagnosing
Contributing Factors
Process factors:
- The promo banner deploy was executed on Black Friday morning with no deployment freeze or elevated review process in place for peak trading periods
- No rollback rehearsal had been run; the deploy pipeline had an 11-minute queue at the time of rollback due to other jobs
- The on-call runbook did not cover Redis connection pool exhaustion as a known failure mode
- Status page update process was not assigned to a specific role — it fell to whoever was least busy diagnosing
Technical factors:
- The promo banner feature made synchronous Redis calls inside the checkout critical path, creating tight coupling that allowed a content change to affect transaction processing
- Redis connection pool limit was set to a default value from initial cluster setup 18 months ago — never revisited as traffic grew
- No alerting existed on Redis connection pool utilization; the 5xx alert was the only signal, which fires downstream of the actual cause
- Cache stampede protection (e.g., probabilistic early expiration, request coalescing) was not implemented on the banner cache key
Human factors:
- The on-call engineer had been on the team for 3 weeks and had not been through a SEV-1 before; no shadow on-call rotation had preceded independent on-call duty
- Alert fatigue context: Datadog showed 31 low-priority alerts in the 24 hours preceding the incident; the SEV-1 alert appeared in the same channel
Organizational factors:
- No deployment freeze policy existed for peak commerce events despite Black Friday being the highest-traffic day of the year
- The promo banner feature had been built by a contractor engagement that concluded in September; institutional knowledge of the Redis coupling was not documented
Detection and Response Assessment
| Phase | What happened | What worked | What didn't |
|---|---|---|---|
| Detection | Datadog 5xx alert fired at 07:31 | Automated alerting did fire | 9-minute lag behind customer reports; no upstream Redis alert |
| Triage | On-call engineer triaged alone for 18 min | Engineer correctly identified checkout as the failing surface | No Redis-specific runbook; root cause identification took 27 min from alert |
| Communication | Status page updated at 08:03 | Update did go out | 41 minutes after first complaint; enterprise accounts not directly notified |
| Mitigation | Rollback executed at 08:19 | Clear decision once senior engineer was involved | Pipeline queue added 11 min to rollback; rollback path wasn't pre-validated |
| Resolution | Redis config change + rollback stabilized at 08:47 | Feature flag decoupling added as immediate hardening | Fix required senior engineer domain knowledge not documented anywhere |
What Went Well
- Automated 5xx alerting did fire and page the on-call engineer within 2 minutes of threshold crossing
- Once the senior engineer joined, root cause was identified in 7 minutes from APM trace data — Datadog instrumentation was sufficient to find the problem quickly
- The decision to roll back was made quickly once the team aligned; no prolonged debate
- Feature flag decoupling was implemented as part of the resolution, not deferred — immediate hardening in the middle of an incident
Action Items
| ID | Action | Type | Owner | Due date | Status |
|---|---|---|---|---|---|
| 1 | Implement deployment freeze window covering 48 hours around peak commerce events (Black Friday, Cyber Monday, major sales); codify in deployment policy | Prevent | Platform + Eng Leadership | Dec 8 | Open |
| 2 | Add Redis connection pool utilization alert (threshold: >75% pool usage) in Datadog; route to same SEV-1 channel | Detect | Platform Team | Dec 1 | Open |