Use this when you need to design, audit, or improve system-level instrumentation for a product or service. This covers the SRE side of measurement: uptime, availability, latency, deployment health, error budgets, and alerting. If you're looking to measure user behavior, event tracking, or product analytics, use /observability-plan instead.
The distinction: Instrumentation answers "Is the system healthy?" Observability answers "Are users successful?" Both are needed. Start here if the system isn't reliably measurable yet.
For AI-powered features: This skill covers system-level instrumentation. If you need to monitor LLM-specific concerns (prompt quality, token costs, hallucination drift, model regression), use /llm-observability-plan.
Process
Step 1: Gather context
Ask the user to provide:
- System description — what does this system do? (services, APIs, background jobs, static sites, workers, etc.)
- Current instrumentation — what's already measured? (existing dashboards, alerts, logging, APM tools)
- Infrastructure stack — hosting, CI/CD, CDN, databases, message queues, third-party dependencies
- Deployment process — how code gets to production (manual, automated, frequency, rollback capability)
- Incident history — recent outages, degradations, or close calls (what broke, how was it detected, how long to recover?)
- Team context — who is on-call? Is there an existing SRE function or is this engineering-owned?
If the user doesn't have all of this, work with what's available. Flag gaps as assumptions.
Step 2: Define SLIs, SLOs, and error budgets
For each critical user journey or system capability, define:
Service Level Indicators (SLIs) — the measurable signals:
| SLI Category | What to measure | Example |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% of HTTP requests return non-5xx |
| Latency | Response time at key percentiles | p50 < 200ms, p95 < 800ms, p99 < 2s |
| Throughput | Request volume over time | Sustained 500 req/s during peak |
| Error rate | Errors / total operations | < 0.1% of API calls return 5xx |
| Freshness | Data age or staleness | Cache refreshes within 5 minutes |
| Correctness | Accurate results / total results | 100% of calculations match expected output |
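As a rough illustration of the Availability and Latency rows, both SLIs can be computed from raw request records. This is a minimal Python sketch, not a prescribed implementation: the `Request` record and both function names are hypothetical, and the nearest-rank percentile is only one of several valid definitions (APM tools may interpolate differently).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def availability(requests):
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: nine fast successes and one slow 503.
sample = [Request(200, 100.0 * i) for i in range(1, 10)] + [Request(503, 2000.0)]
print(availability(sample), percentile([r.latency_ms for r in sample], 95))
```

Returning `None` for an empty window keeps "no traffic" distinct from "fully available", which matters for low-volume services.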
Service Level Objectives (SLOs) — the targets:
- Set SLOs based on user expectations, not aspirational perfection
- A 99.9% availability SLO allows about 8.76 hours of downtime per year (0.1% of 8,760 hours)
- Start conservative (lower targets), tighten as you build confidence
- Every SLO needs an explicit measurement window (a rolling 30 days is a common default)
Error budgets — the tolerance:
- Error budget = 1 - SLO (e.g., 99.9% SLO = 0.1% error budget)
- When budget is consumed, freeze feature releases and focus on reliability
- Track burn rate: how fast is the error budget being spent?
Present these in a table:
| Journey / Capability | SLI | SLO Target | Measurement Window | Error Budget | Current State |
|---|---|---|---|---|---|
| (User-facing API) | Availability | 99.9% | Rolling 30 days | 43.2 min/month | (measured or unknown) |
| (User-facing API) | Latency (p95) | < 500ms | Rolling 30 days | — | (measured or unknown) |
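The Error Budget column follows directly from the SLO target and the measurement window. The sketch below shows the arithmetic behind the 43.2 min/month figure and one way to express burn rate. The function names are hypothetical, and it assumes an availability-style SLO measured as a ratio of bad events to total events.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for an availability SLO over the window."""
    return (1 - slo) * window


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    1.0 means the budget lasts exactly one window; above 1.0 it runs out early.
    """
    if total_events == 0:
        return 0.0  # assumption: no traffic spends no budget
    return (bad_events / total_events) / (1 - slo)


def budget_consumed(rate: float, elapsed: timedelta, window: timedelta) -> float:
    """Fraction of the window's budget spent after burning at `rate` for `elapsed`."""
    return rate * (elapsed / window)


budget = error_budget(0.999, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1), "minutes of budget per 30 days")
```

For example, `budget_consumed(6.0, timedelta(hours=6), timedelta(days=30))` evaluates to roughly 0.05: burning at 6x for six hours spends about 5% of a 30-day budget.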
Step 3: DORA metrics baseline
Assess the team's delivery health using the four DORA metrics:
| Metric | What it measures | Elite | High | Medium | Low |
|---|---|---|---|---|---|
| Deployment frequency | How often code ships to production | On-demand (multiple/day) | Daily to weekly | Weekly to monthly | Monthly+ |
| Lead time for changes | Commit to production | < 1 hour | 1 day – 1 week | 1 week – 1 month | 1 month+ |
| Change failure rate | % of deployments causing incidents | < 5% | 5–10% | 10–15% | 15%+ |
| Mean time to recovery (MTTR) | How long to restore service | < 1 hour | < 1 day | 1 day – 1 week | 1 week+ |
For each metric:
- Current state — what's the team's actual performance? (Measure or estimate.)
- Target state — where should they be in 90 days?
- How to measure — specific data source (CI/CD logs, incident tracker, deployment pipeline)
- Biggest blocker — what's preventing improvement?
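Once deploy records exist, the four metrics reduce to simple arithmetic. The sketch below assumes a hypothetical `Deploy` record that already links a commit time, a deploy time, and an incident outcome; in practice that linkage usually has to be assembled from CI/CD logs and the incident tracker. It also assumes recovery time is measured from the deploy that caused the failure, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # commit (or merge) timestamp
    deployed_at: datetime                   # production deploy completion
    caused_incident: bool                   # True if this deploy led to an incident
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_baseline(deploys, period_days):
    """Summarize the four DORA metrics for deploys observed over period_days."""
    if not deploys:
        return None
    failures = [d for d in deploys if d.caused_incident]
    # Assumption: the incident starts at the deploy that caused it.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deploys) / (period_days / 7),
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

The median is used for lead time because a single slow change would otherwise dominate a small sample; a mean is an equally defensible choice if the team prefers it.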
Step 4: Infrastructure health metrics
Define the standard infrastructure signals to monitor:
The Four Golden Signals (per Google SRE):
- Latency — time to serve a request (distinguish successful vs. failed request latency)
- Traffic — demand on the system (requests/sec, sessions, transactions)
- Errors — rate of failed requests (explicit 5xx, implicit timeout, wrong-answer errors)
- Saturation — how full the system is (CPU, memory, disk, queue depth, connection pool)
For each service or component, produce a monitoring table:
| Component | Latency Metric | Traffic Metric | Error Metric | Saturation Metric | Alert Threshold |
|---|---|---|---|---|---|
| (Web server) | p95 response time | req/sec | 5xx rate | CPU %, memory % | p95 > 1s for 5 min |
| (Database) | Query time p95 | Queries/sec | Failed queries | Connection pool %, disk I/O | Pool > 80% for 10 min |
| (Worker/queue) | Job duration p95 | Jobs enqueued/sec | Failed jobs | Queue depth | Depth > 1000 for 15 min |
| (CDN/static) | TTFB p95 | Bandwidth | 4xx/5xx rate | Cache hit ratio | Hit ratio < 90% |
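Thresholds such as "p95 > 1s for 5 min" are sustained conditions, not single-sample checks. The sketch below illustrates that evaluation, assuming samples arrive as `(timestamp_seconds, value)` pairs. The function name is hypothetical, and most monitoring tools provide this natively, so it only illustrates the logic.

```python
def sustained_breach(samples, threshold, duration_s, now):
    """True if every sample in the last duration_s seconds exceeds threshold.

    samples: iterable of (timestamp_seconds, value) pairs.
    Requires at least one sample at or before the window start, so a metric
    that only just began reporting cannot trigger the alert.
    """
    samples = list(samples)
    window_start = now - duration_s
    recent = [v for t, v in samples if window_start <= t <= now]
    has_history = any(t <= window_start for t, _ in samples)
    return bool(recent) and has_history and all(v > threshold for v in recent)


# Hypothetical p95 latency samples (ms), one per minute, all above 1000 ms.
p95_samples = [(t, 1200.0) for t in range(0, 301, 60)]
print(sustained_breach(p95_samples, threshold=1000.0, duration_s=300, now=300))
```

A single sample back under the threshold resets the condition, which is the behavior that keeps a brief spike from firing the alert.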
Step 5: Alerting strategy
Design alerts that are actionable, not noisy:
Alert tiers:
| Tier | Meaning | Response | Notification | Example |
|---|---|---|---|---|
| P1 — Critical | User-facing service is down or severely degraded | Immediate response, wake people up | PagerDuty / phone | API availability < 99% for 5 min |
| P2 — Warning | Degradation detected, not yet user-impacting | Investigate within 1 hour | Slack alert channel | Error rate > 1% for 15 min |
| P3 — Info | Trend worth watching | Review next business day | Dashboard / email digest | Disk usage > 70% |
Alert design rules:
- Every alert must have a documented response action (or it's noise)
- Use burn-rate alerts for SLO monitoring (not raw threshold alerts)
- Set alerts on symptoms (user impact), not causes (CPU spike) — unless cause alerts are the only early warning
- Require 2+ consecutive breaches before firing (avoid flapping)
- Review alert fatigue quarterly: if an alert fires > 10x/month without action, fix or delete it
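Two of these rules translate readily into code. The sketch below shows a multi-window burn-rate check (one common pattern, assumed here rather than required by this process) and a quarterly fatigue review that flags alerts firing more than 10 times in a month without action. Function names and the shape of the firing log are hypothetical.

```python
from collections import Counter


def should_page(long_window_burn, short_window_burn, threshold):
    """Page only if both the long and the short window burn at or above threshold.

    The long window shows the burn is significant; the short window shows it is
    still happening, so the alert clears soon after recovery.
    """
    return long_window_burn >= threshold and short_window_burn >= threshold


def noisy_alerts(firings, max_unactioned=10):
    """Names of alerts that fired more than max_unactioned times without action.

    firings: iterable of (alert_name, action_taken) pairs covering one month.
    """
    unactioned = Counter(name for name, acted in firings if not acted)
    return sorted(name for name, count in unactioned.items() if count > max_unactioned)


# Hypothetical month of firings: "disk" is noise, "api" always led to action.
log = [("disk", False)] * 11 + [("api", True)] * 20 + [("cpu", False)] * 10
print(noisy_alerts(log))
```

Alerts returned by `noisy_alerts` are candidates for a raised threshold, a demotion to a lower tier, or deletion.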
Step 6: Generate the instrumentation plan
Compile everything into a single document:
Instrumentation Plan — (Project name)
Generated: (date)
System: (brief description)
Current state: (summary of what's instrumented today)
SLIs, SLOs & Error Budgets
(Table from Step 2)
DORA Metrics Baseline
(Table from Step 3 — current state, targets, measurement sources)
Infrastructure Monitoring
(Table from Step 4 — golden signals per component)
Alerting Strategy
(Tiered alert definitions from Step 5)
Implementation Checklist
Priority-ordered list of what to instrument next:
- (P0) (Most critical gap — e.g., "No availability SLI exists for the primary API")
- (P0) (Second critical gap)
- (P1) (Important but not urgent — e.g., "DORA metrics not tracked; add deployment frequency counter")
- (P1) (Next important item)
- (P2) (Nice to have — e.g., "Add cache hit ratio monitoring for CDN")
Open Questions
(Anything that couldn't be resolved without more information)
Recommended Tools
(Based on the team's stack — only include if the user asked or if there's a clear gap)
Step 7: Review and refine
Ask the user:
- Are the SLOs realistic for your team's current maturity?
- Are any critical components missing from the monitoring table?
- Does the alerting strategy match your on-call setup? (No on-call = no P1 phone alerts.)
- Is the implementation checklist ordered correctly for your priorities?
- Any incidents in the last 6 months that this plan wouldn't have caught?
Adjust based on feedback.
Related skills
- /observability-plan — plan product-level observability for user behavior, events, and task completion
- /security-review — generate attacker stories and security acceptance criteria
- /debug-assist — hypothesis-driven debugging when something goes wrong
Output location
Present the plan as formatted text in the conversation. The user can copy it into their project wiki, engineering docs, or incident response runbook.
Example Output
Input
- System description: Multi-tenant B2B SaaS platform for construction project management — includes a REST API (Node.js), a PDF report generation worker (Python), a PostgreSQL database, and a React frontend served via CDN. Background jobs handle nightly cost-rollup calculations for ~4,000 active projects.
- Current instrumentation: Datadog APM on the API layer only; no alerting on the worker or database; one dashboard showing p50 latency and HTTP status codes; no SLOs defined; PagerDuty is licensed but unconfigured.
- Infrastructure stack: AWS (ECS Fargate for API and workers), RDS PostgreSQL (Multi-AZ), CloudFront CDN, SQS for job queuing, GitHub Actions for CI/CD, deploys ~3x per week via manual approval gate.
- Incident history: Two incidents in the past 90 days — (1) cost-rollup worker silently failed for 36 hours; customers saw stale data but no alert fired; (2) a bad deploy caused API p99 latency to spike to 12s for 45 minutes before a customer reported it on Slack.
- Team context: 6-person engineering team, no dedicated SRE; one engineer rotates on-call per week; on-call engineer carries a phone but alert fatigue from a previous job makes the team wary of noisy alerting.
Output (abbreviated)
Instrumentation Plan — Clearpath Construction Platform
Generated: 2025-07-14
System: Multi-tenant B2B SaaS — REST API, PDF report worker, cost-rollup background jobs, PostgreSQL, CloudFront CDN
Current state: Partial. Datadog APM covers API latency at p50 only. No SLOs, no worker monitoring, no database alerting, no deployment health tracking. PagerDuty licensed but inactive. Two recent incidents were detected by customers, not by tooling.
SLIs, SLOs & Error Budgets
| Journey / Capability | SLI | SLO Target | Window | Error Budget | Current State |
|---|---|---|---|---|---|
| API (all endpoints) | Availability (non-5xx / total) | 99.5% | Rolling 30 days | 3.6 hrs/month | Unknown — no SLI defined |
| API (all endpoints) | Latency p95 | < 800ms | Rolling 30 days | — | Unknown — only p50 tracked |
| API (all endpoints) | Latency p99 | < 2s | Rolling 30 days | — | Unknown — incident showed 12s spike undetected |
| Cost-rollup jobs | Freshness (data age) | < 2 hrs stale | Rolling 30 days | — | Unknown — silent 36-hr failure demonstrates zero coverage |
| Cost-rollup jobs | Error rate (failed jobs / total) | < 1% | Rolling 30 days | — | Unknown |
| PDF report generation | Availability (successful completions) | 99.0% | Rolling 30 days | 7.2 hrs/month | Unknown |
| PDF report generation | Latency p95 | < 30s | Rolling 30 days | — | Unknown |
Note: SLO targets are set conservatively given current measurement gaps. Recommend tightening API availability to 99.9% once you have 60 days of baseline data.
DORA Metrics Baseline
| Metric | Current State | Target (90 days) | How to Measure | Biggest Blocker |
|---|---|---|---|---|
| Deployment frequency | ~3x/week (estimated) | Daily | Tag GitHub Actions deploy events; count per day in Datadog | Manual approval gate slows each release; nothing prevents counting deploys today |
| Lead time for changes | Unknown — not tracked | < 1 day | Timestamp commit SHA at merge; compare to ECS deploy completion event | No tooling linking commits to deploy events today |
| Change failure rate | Unknown; roughly 3–6% on known data (1–2 incidents in 90 days across ~36 deploys, only one clearly deploy-caused), likely undercounted because both incidents were customer-detected | < 10%, confirmed by measurement | Flag deploys followed by P1/P2 alert within 1 hour as failures; track in GitHub Actions | No automated linkage between deploy events and incident signals |
| MTTR | Over 45 min for the latency spike (45 min passed before a customer reported it); 36+ hrs for the worker failure (no alert fired) | < 1 hour for all P1s | Track incident open → resolved timestamps in PagerDuty once configured | Silent failures have unbounded MTTR; freshness monitoring is a prerequisite |
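The change failure rate measurement above ("flag deploys followed by P1/P2 alert within 1 hour") could be sketched as follows. This assumes deploy and alert timestamps have already been exported as `datetime` values; the function names are hypothetical, and the 1-hour window is a parameter rather than a fixed rule.

```python
from datetime import datetime, timedelta


def failed_deploys(deploy_times, alert_times, window=timedelta(hours=1)):
    """Deploys followed by at least one P1/P2 alert within `window`."""
    return [
        d for d in deploy_times
        if any(d <= a <= d + window for a in alert_times)
    ]


def change_failure_rate(deploy_times, alert_times, window=timedelta(hours=1)):
    """Share of deploys flagged as failures; None when there were no deploys."""
    if not deploy_times:
        return None
    return len(failed_deploys(deploy_times, alert_times, window)) / len(deploy_times)
```

This is a correlation heuristic: an alert shortly after a deploy is treated as caused by it, so flagged deploys are worth a quick manual review before the number is reported.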
Infrastructure Monitoring — Golden Signals
| Component | Latency Metric | Traffic Metric | Error Metric | Saturation Metric | Alert Threshold |
|---|---|---|---|---|---|
| API (ECS Fargate) | p95, p99 response time per endpoint | req/sec by endpoint | 5xx rate, timeout rate | CPU %, memory %, ECS task restarts | p99 > 2s for 5 min; 5xx > 1% for 10 min |
| PostgreSQL (RDS) | Query time p95 (by query type) | Queries/sec, active connections | Failed queries, deadlocks | Connection pool %, IOPS, disk % | Pool > 80% for 5 min; disk > 75% |
| Cost-rollup worker (SQS + ECS) | Job duration p95 | Jobs enqueued/sec, jobs processed/sec | Failed jobs, DLQ depth | SQS queue depth, ECS task count | DLQ depth > 0 immediately; queue depth > 500 for 15 min; last successful run age > 2 hrs |
| PDF worker (ECS) | Render time p95 | Jobs submitted/sec | Failed renders | ECS task restarts, memory % | Render p95 > 60s for 10 min; task restart > 2 in 5 min |
| CloudFront CDN | TTFB p95 | Bandwidth, request count | 4xx rate, 5xx origin error rate | Cache hit ratio | Hit ratio < 85% for 30 min; 5xx origin rate > 2% for 5 min |
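The cost-rollup worker thresholds are the ones that would have caught the silent 36-hour failure. A minimal sketch of that check, assuming the last successful run time and the DLQ depth are already available as plain values; the function and alert names are hypothetical, and the 2-hour staleness limit is a parameter.

```python
from datetime import datetime, timedelta, timezone


def worker_alerts(last_success, dlq_depth, now, max_age=timedelta(hours=2)):
    """Return the names of worker alerts that should fire right now."""
    alerts = []
    if dlq_depth > 0:
        # Any dead-lettered message means a job failed without surfacing.
        alerts.append("dlq_not_empty")
    if last_success is None or now - last_success > max_age:
        # A missing or old success timestamp is treated as stale data.
        alerts.append("rollup_stale")
    return alerts


now = datetime(2025, 7, 14, 12, 0, tzinfo=timezone.utc)
print(worker_alerts(now - timedelta(hours=36), dlq_depth=0, now=now))
```

Treating a missing success timestamp as stale is deliberate: a worker that has never reported should alert rather than pass silently.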
Alerting Strategy
Team constraint: 6-person rotating on-call, high alert-fatigue sensitivity. Every P1 must be genuinely wake-up-worthy. P2s go to Slack only.
| Tier | Alert | Trigger | Response | Routing |
|---|---|---|---|---|
| P1 — Critical | API availability SLO breach | 6x burn rate over 6-hr window (consuming 5% of monthly budget) | Page on-call immediately; initiate incident | PagerDuty → phone |
| P1 — Critical | API p99 latency breach | p99 > 2s sustained for 5 min | Page on-call; check recent deploy, DB query times | PagerDuty → phone |
| P1 — Critical | Cost-rollup data stale | No successful job completion in 2 hrs | Page on-call; check DLQ, ECS task health | PagerDuty → phone |
| P1 — Critical | DLQ message received | Any message lands in SQS dead-letter queue | Page on-call; job failed silently — inspect immediately | PagerDuty → phone |
| P2 — Warning | API error rate elevated | 5xx > 1% for 10 min (below SLO breach) | Investigate within 1 hour | #alerts-engineering Slack |
| P2 — Warning | RDS connection pool high | Pool utilization > 80% for 5 min | Review query patterns, connection leak risk | #alerts-engineering Slack |
| P2 — Warning | PDF worker task restarts | > 2 restarts in 5 min | Check memory limits, inspect failed render logs | #alerts-engineering Slack |
| P2 — Warning | Deploy change failure signal | P1 alert fires within 60 min of deploy | Evaluate rollback; notify on-call | #deploys Slack |
| P3 — Info | RDS disk usage | > 75% | Review retention policy; plan capacity | Daily digest / Datadog dashboard |
| P3 — Info | CDN cache hit ratio drop | < 85% for 30 min | Review cache headers; not user-impacting yet |