
Instrumentation Plan

You need to plan or audit SRE instrumentation for reliability and operational health.

Use this when you need to design, audit, or improve system-level instrumentation for a product or service. This covers the SRE side of measurement: uptime, availability, latency, deployment health, error budgets, and alerting. If you're looking to measure user behavior, event tracking, or product analytics, use /observability-plan instead.

The distinction: Instrumentation answers "Is the system healthy?" Observability answers "Are users successful?" Both are needed. Start here if the system isn't reliably measurable yet.

For AI-powered features: This skill covers system-level instrumentation. If you need to monitor LLM-specific concerns (prompt quality, token costs, hallucination drift, model regression), use /llm-observability-plan.

Process

Step 1: Gather context

Ask the user to provide:

  1. System description — what does this system do? (services, APIs, background jobs, static sites, workers, etc.)
  2. Current instrumentation — what's already measured? (existing dashboards, alerts, logging, APM tools)
  3. Infrastructure stack — hosting, CI/CD, CDN, databases, message queues, third-party dependencies
  4. Deployment process — how code gets to production (manual, automated, frequency, rollback capability)
  5. Incident history — recent outages, degradations, or close calls (what broke, how was it detected, how long to recover?)
  6. Team context — who is on-call? Is there an existing SRE function or is this engineering-owned?

If the user doesn't have all of this, work with what's available. Flag gaps as assumptions.

Step 2: Define SLIs, SLOs, and error budgets

For each critical user journey or system capability, define:

Service Level Indicators (SLIs) — the measurable signals:

| SLI Category | What to measure | Example |
| --- | --- | --- |
| Availability | Successful requests / total requests | 99.9% of HTTP requests return non-5xx |
| Latency | Response time at key percentiles | p50 < 200ms, p95 < 800ms, p99 < 2s |
| Throughput | Request volume over time | Sustained 500 req/s during peak |
| Error rate | Errors / total operations | < 0.1% of API calls return 5xx |
| Freshness | Data age or staleness | Cache refreshes within 5 minutes |
| Correctness | Accurate results / total results | 100% of calculations match expected output |
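
As a rough illustration of how the availability and latency SLIs above might be computed from raw request records, here is a minimal Python sketch. The `Request` record and the nearest-rank percentile method are assumptions made for the example; a real APM tool will have its own data model and percentile estimator.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    status: int          # HTTP status code
    latency_ms: float    # time taken to serve the request


def availability(requests):
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return None  # no traffic means the SLI is undefined, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [
    Request(200, 120), Request(200, 180), Request(500, 950),
    Request(200, 210), Request(404, 90),
]
latencies = [r.latency_ms for r in sample]

print(availability(sample))       # 0.8 (the 404 counts as served, the 500 does not)
print(percentile(latencies, 50))  # 180
print(percentile(latencies, 95))  # 950
```

Counting 4xx responses as "good" follows the non-5xx definition in the table; whether that is right for a given journey is a judgment call to confirm with the team.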

Service Level Objectives (SLOs) — the targets:

  • Set SLOs based on user expectations, not aspirational perfection
  • A 99.9% availability SLO means ~8.8 hours of allowed downtime per year (0.1% of 8,760 hours), or about 43 minutes per 30 days
  • Start conservative (lower targets), tighten as you build confidence
  • Every SLO needs an explicit measurement window (rolling 30 days is standard)

Error budgets — the tolerance:

  • Error budget = 1 - SLO (e.g., 99.9% SLO = 0.1% error budget)
  • When budget is consumed, freeze feature releases and focus on reliability
  • Track burn rate: how fast is the error budget being spent?
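
The arithmetic behind these bullets can be sketched in a few lines of Python. The function names are made up for this example; burn rate is defined here as the observed error rate divided by the error budget, so a value of 1.0 spends the budget exactly over the full window.

```python
def error_budget_fraction(slo):
    """Error budget = 1 - SLO, e.g. 0.999 -> 0.001."""
    return 1.0 - slo


def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full downtime the budget allows in the window."""
    return error_budget_fraction(slo) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo):
    """How many times faster than 'sustainable' the budget is being spent."""
    return observed_error_rate / error_budget_fraction(slo)


def budget_consumed(rate, elapsed_hours, window_days=30):
    """Fraction of the window's budget spent after burning at `rate`
    for `elapsed_hours`."""
    return rate * elapsed_hours / (window_days * 24)


print(round(allowed_downtime_minutes(0.999), 1))   # 43.2
print(round(burn_rate(0.005, 0.999), 1))           # 5.0
print(round(budget_consumed(14.4, 1), 3))          # 0.02
```

The last line shows why a burn rate of 14.4 over one hour is a common paging threshold for a 30-day window: it uses up 2% of the budget in a single hour.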

Present these in a table:

| Journey / Capability | SLI | SLO Target | Measurement Window | Error Budget | Current State |
| --- | --- | --- | --- | --- | --- |
| (User-facing API) | Availability | 99.9% | Rolling 30 days | 43.2 min/month | (measured or unknown) |
| (User-facing API) | Latency (p95) | < 500ms | Rolling 30 days | | (measured or unknown) |

Step 3: DORA metrics baseline

Assess the team's delivery health using the four DORA metrics:

| Metric | What it measures | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- | --- |
| Deployment frequency | How often code ships to production | On-demand (multiple/day) | Daily to weekly | Weekly to monthly | Monthly+ |
| Lead time for changes | Commit to production | < 1 hour | 1 day – 1 week | 1 week – 1 month | 1 month+ |
| Change failure rate | % of deployments causing incidents | < 5% | 5–10% | 10–15% | 15%+ |
| Mean time to recovery (MTTR) | How long to restore service | < 1 hour | < 1 day | 1 day – 1 week | 1 week+ |

For each metric:

  1. Current state — what's the team's actual performance? (Measure or estimate.)
  2. Target state — where should they be in 90 days?
  3. How to measure — specific data source (CI/CD logs, incident tracker, deployment pipeline)
  4. Biggest blocker — what's preventing improvement?
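
For the "how to measure" item, the four metrics can be derived from two simple record types once deploy and incident timestamps are being captured. The sketch below assumes hypothetical `Deploy` and `Incident` records exported from a CI/CD pipeline and an incident tracker; the field names are illustrative, not any tool's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    caused_incident: bool   # assumed to be labeled during incident review


@dataclass
class Incident:
    opened_at: datetime
    resolved_at: datetime


def deployment_frequency_per_week(deploys, period_days):
    return len(deploys) / (period_days / 7)


def median_lead_time(deploys):
    """Median commit-to-production time."""
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deploys]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deploys):
    return sum(d.caused_incident for d in deploys) / len(deploys)


def mean_time_to_recovery(incidents):
    total = sum((i.resolved_at - i.opened_at for i in incidents), timedelta())
    return total / len(incidents)


t0 = datetime(2025, 1, 6, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2), False),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=6), True),
    Deploy(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=4), False),
]
incidents = [
    Incident(t0 + timedelta(days=1, hours=7),
             t0 + timedelta(days=1, hours=7, minutes=45)),
]

print(deployment_frequency_per_week(deploys, period_days=7))  # 3.0
print(median_lead_time(deploys))                              # 4:00:00
print(mean_time_to_recovery(incidents))                       # 0:45:00
```

Median is used for lead time so that one unusually slow change does not dominate the baseline; a team may prefer a percentile instead.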

Step 4: Infrastructure health metrics

Define the standard infrastructure signals to monitor:

The Four Golden Signals (per Google SRE):

  • Latency — time to serve a request (distinguish successful vs. failed request latency)
  • Traffic — demand on the system (requests/sec, sessions, transactions)
  • Errors — rate of failed requests (explicit 5xx, implicit timeout, wrong-answer errors)
  • Saturation — how full the system is (CPU, memory, disk, queue depth, connection pool)

For each service or component, produce a monitoring table:

| Component | Latency Metric | Traffic Metric | Error Metric | Saturation Metric | Alert Threshold |
| --- | --- | --- | --- | --- | --- |
| (Web server) | p95 response time | req/sec | 5xx rate | CPU %, memory % | p95 > 1s for 5 min |
| (Database) | Query time p95 | Queries/sec | Failed queries | Connection pool %, disk I/O | Pool > 80% for 10 min |
| (Worker/queue) | Job duration p95 | Jobs enqueued/sec | Failed jobs | Queue depth | Depth > 1000 for 15 min |
| (CDN/static) | TTFB p95 | Bandwidth | 4xx/5xx rate | Cache hit ratio | Hit ratio < 90% |
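
Thresholds in the table such as "p95 > 1s for 5 min" are duration-based: the condition must hold continuously, not just once. A minimal sketch of that evaluation is shown below, assuming a hypothetical `Sample` type holding one metric reading per scrape. Most monitoring systems provide this natively, so the code is only meant to make the semantics concrete.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float      # seconds since epoch (or any monotonic clock)
    value: float   # metric reading, e.g. p95 latency in seconds


def breached_for(samples, threshold, duration_s):
    """True if every reading in the most recent unbroken run exceeds
    `threshold` and that run spans at least `duration_s` seconds."""
    breach_start = None
    for s in sorted(samples, key=lambda s: s.ts):
        if s.value > threshold:
            if breach_start is None:
                breach_start = s.ts
        else:
            breach_start = None  # any healthy reading resets the run
    if breach_start is None:
        return False
    latest = max(s.ts for s in samples)
    return latest - breach_start >= duration_s


# One reading per minute; latency stays above 1s from t=60 to t=360.
sustained = [Sample(60 * i, v)
             for i, v in enumerate([0.8, 1.2, 1.3, 1.1, 1.4, 1.2, 1.5])]
# Same idea, but a dip at t=120 resets the run.
flapping = [Sample(60 * i, v)
            for i, v in enumerate([1.2, 1.3, 0.9, 1.4, 1.2])]

print(breached_for(sustained, threshold=1.0, duration_s=300))  # True
print(breached_for(flapping, threshold=1.0, duration_s=300))   # False
```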

Step 5: Alerting strategy

Design alerts that are actionable, not noisy:

Alert tiers:

| Tier | Meaning | Response | Notification | Example |
| --- | --- | --- | --- | --- |
| P1 — Critical | User-facing service is down or severely degraded | Immediate response, wake people up | PagerDuty / phone | API availability < 99% for 5 min |
| P2 — Warning | Degradation detected, not yet user-impacting | Investigate within 1 hour | Slack alert channel | Error rate > 1% for 15 min |
| P3 — Info | Trend worth watching | Review next business day | Dashboard / email digest | Disk usage > 70% |

Alert design rules:

  • Every alert must have a documented response action (or it's noise)
  • Use burn-rate alerts for SLO monitoring (not raw threshold alerts)
  • Set alerts on symptoms (user impact), not causes (CPU spike) — unless cause alerts are the only early warning
  • Require 2+ consecutive breaches before firing (avoid flapping)
  • Review alert fatigue quarterly: if an alert fires > 10x/month without action, fix or delete it
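
The quarterly fatigue review in the last rule can be partly automated if alert firings are logged along with whether anyone acted on them. The sketch below assumes a hypothetical `AlertFire` log record and reads the rule as "more than 10 firings in a month with no action taken"; both the record shape and that reading are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertFire:
    alert_name: str
    month: str           # e.g. "2025-06"
    action_taken: bool   # did a responder do anything as a result?


def noisy_alerts(fires, max_unactioned_per_month=10):
    """Names of alerts that fired without action more than the allowed
    number of times in any single month."""
    unactioned = Counter(
        (f.alert_name, f.month) for f in fires if not f.action_taken
    )
    return sorted({
        name
        for (name, _month), count in unactioned.items()
        if count > max_unactioned_per_month
    })


fires = (
    [AlertFire("disk-usage-70", "2025-06", False)] * 11
    + [AlertFire("api-p99-latency", "2025-06", True)] * 3
)

print(noisy_alerts(fires))  # ['disk-usage-70']
```

Anything this returns is a candidate for retuning or deletion, in line with the rule that an alert without a response action is noise.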

Step 6: Generate the instrumentation plan

Compile everything into a single document:


Instrumentation Plan — (Project name)

Generated: (date)
System: (brief description)
Current state: (summary of what's instrumented today)

SLIs, SLOs & Error Budgets

(Table from Step 2)

DORA Metrics Baseline

(Table from Step 3 — current state, targets, measurement sources)

Infrastructure Monitoring

(Table from Step 4 — golden signals per component)

Alerting Strategy

(Tiered alert definitions from Step 5)

Implementation Checklist

Priority-ordered list of what to instrument next:

  • (P0) (Most critical gap — e.g., "No availability SLI exists for the primary API")
  • (P0) (Second critical gap)
  • (P1) (Important but not urgent — e.g., "DORA metrics not tracked; add deployment frequency counter")
  • (P1) (Next important item)
  • (P2) (Nice to have — e.g., "Add cache hit ratio monitoring for CDN")

Open Questions

(Anything that couldn't be resolved without more information)

Recommended Tools

(Based on the team's stack — only include if the user asked or if there's a clear gap)


Step 7: Review and refine

Ask the user:

  • Are the SLOs realistic for your team's current maturity?
  • Are any critical components missing from the monitoring table?
  • Does the alerting strategy match your on-call setup? (No on-call = no P1 phone alerts.)
  • Is the implementation checklist ordered correctly for your priorities?
  • Any incidents in the last 6 months that this plan wouldn't have caught?

Adjust based on feedback.

Related skills

  • /observability-plan — plan product-level observability for user behavior, events, and task completion
  • /security-review — generate attacker stories and security acceptance criteria
  • /debug-assist — hypothesis-driven debugging when something goes wrong

Output location

Present the plan as formatted text in the conversation. The user can copy it into their project wiki, engineering docs, or incident response runbook.

Example Output

Input

  • System description: Multi-tenant B2B SaaS platform for construction project management — includes a REST API (Node.js), a PDF report generation worker (Python), a PostgreSQL database, and a React frontend served via CDN. Background jobs handle nightly cost-rollup calculations for ~4,000 active projects.
  • Current instrumentation: Datadog APM on the API layer only; no alerting on the worker or database; one dashboard showing p50 latency and HTTP status codes; no SLOs defined; PagerDuty is licensed but unconfigured.
  • Infrastructure stack: AWS (ECS Fargate for API and workers), RDS PostgreSQL (Multi-AZ), CloudFront CDN, SQS for job queuing, GitHub Actions for CI/CD, deploys ~3x per week via manual approval gate.
  • Incident history: Two incidents in the past 90 days — (1) cost-rollup worker silently failed for 36 hours; customers saw stale data but no alert fired; (2) a bad deploy caused API p99 latency to spike to 12s for 45 minutes before a customer reported it on Slack.
  • Team context: 6-person engineering team, no dedicated SRE; one engineer rotates on-call per week; on-call engineer carries a phone but alert fatigue from a previous job makes the team wary of noisy alerting.

Output (abbreviated)

Instrumentation Plan — Clearpath Construction Platform

Generated: 2025-07-14
System: Multi-tenant B2B SaaS — REST API, PDF report worker, cost-rollup background jobs, PostgreSQL, CloudFront CDN
Current state: Partial. Datadog APM covers API latency at p50 only. No SLOs, no worker monitoring, no database alerting, no deployment health tracking. PagerDuty licensed but inactive. Two recent incidents were detected by customers, not by tooling.


SLIs, SLOs & Error Budgets

| Journey / Capability | SLI | SLO Target | Window | Error Budget | Current State |
| --- | --- | --- | --- | --- | --- |
| API (all endpoints) | Availability (non-5xx / total) | 99.5% | Rolling 30 days | 3.6 hrs/month | Unknown — no SLI defined |
| API (all endpoints) | Latency p95 | < 800ms | Rolling 30 days | | Unknown — only p50 tracked |
| API (all endpoints) | Latency p99 | < 2s | Rolling 30 days | | Unknown — incident showed 12s spike undetected |
| Cost-rollup jobs | Freshness (data age past the scheduled nightly run) | < 2 hrs stale | Rolling 30 days | | Unknown — silent 36-hr failure demonstrates zero coverage |
| Cost-rollup jobs | Error rate (failed jobs / total) | < 1% | Rolling 30 days | | Unknown |
| PDF report generation | Availability (successful completions) | 99.0% | Rolling 30 days | 7.2 hrs/month | Unknown |
| PDF report generation | Latency p95 | < 30s | Rolling 30 days | | Unknown |

Note: SLO targets are set conservatively given current measurement gaps. Recommend tightening API availability to 99.9% once you have 60 days of baseline data.


DORA Metrics Baseline

| Metric | Current State | Target (90 days) | How to Measure | Biggest Blocker |
| --- | --- | --- | --- | --- |
| Deployment frequency | ~3x/week (estimated) | Daily | Tag GitHub Actions deploy events; count per day in Datadog | Manual approval gate adds friction; no blocker to counting |
| Lead time for changes | Unknown — not tracked | < 1 day | Timestamp commit SHA at merge; compare to ECS deploy completion event | No tooling linking commits to deploy events today |
| Change failure rate | Unknown — estimated ~6% given 2 incidents in 90 days across ~36 deploys; likely undercounted, since both incidents were customer-detected | < 10% | Flag deploys followed by P1/P2 alert within 1 hour as failures; track in GitHub Actions | No automated linkage between deploy events and incident signals |
| MTTR | ~45 min for latency spike (customer-reported); 36+ hrs for worker failure (not detected) | < 1 hour for all P1s | Track incident open → resolved timestamps in PagerDuty once configured | Silent failures have infinite MTTR; freshness monitoring is prerequisite |
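
The change-failure linkage described in the table (flag a deploy as failed when a P1/P2 alert follows within 1 hour) could be prototyped as below. This is a sketch under assumptions: deploy and alert timestamps are taken to be plain lists already exported from the pipeline and the alerting tool, and the function names are invented for the example.

```python
from datetime import datetime, timedelta


def failed_deploys(deploy_times, alert_times, window=timedelta(hours=1)):
    """Deploys that were followed by at least one alert within `window`."""
    return [
        d for d in deploy_times
        if any(d <= a <= d + window for a in alert_times)
    ]


def change_failure_rate(deploy_times, alert_times, window=timedelta(hours=1)):
    if not deploy_times:
        return None  # undefined with no deploys in the period
    return len(failed_deploys(deploy_times, alert_times, window)) / len(deploy_times)


deploy_times = [
    datetime(2025, 7, 7, 10, 0),
    datetime(2025, 7, 9, 14, 0),
    datetime(2025, 7, 11, 11, 0),
    datetime(2025, 7, 14, 16, 0),
]
alert_times = [
    datetime(2025, 7, 9, 14, 20),   # 20 min after the second deploy
    datetime(2025, 7, 12, 3, 0),    # not close to any deploy
]

print(failed_deploys(deploy_times, alert_times))       # [datetime.datetime(2025, 7, 9, 14, 0)]
print(change_failure_rate(deploy_times, alert_times))  # 0.25
```

Time proximity is only a heuristic: an alert shortly after a deploy may be unrelated, and a slow-burning regression may surface outside the window, so flagged deploys still need a human check.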

Infrastructure Monitoring — Golden Signals

| Component | Latency Metric | Traffic Metric | Error Metric | Saturation Metric | Alert Threshold |
| --- | --- | --- | --- | --- | --- |
| API (ECS Fargate) | p95, p99 response time per endpoint | req/sec by endpoint | 5xx rate, timeout rate | CPU %, memory %, ECS task restarts | p99 > 2s for 5 min; 5xx > 1% for 10 min |
| PostgreSQL (RDS) | Query time p95 (by query type) | Queries/sec, active connections | Failed queries, deadlocks | Connection pool %, IOPS, disk % | Pool > 80% for 5 min; disk > 75% |
| Cost-rollup worker (SQS + ECS) | Job duration p95 | Jobs enqueued/sec, jobs processed/sec | Failed jobs, DLQ depth | SQS queue depth, ECS task count | DLQ depth > 0 immediately; queue depth > 500 for 15 min; last successful run > 2 hrs past the nightly schedule |
| PDF worker (ECS) | Render time p95 | Jobs submitted/sec | Failed renders | ECS task restarts, memory % | Render p95 > 60s for 10 min; task restart > 2 in 5 min |
| CloudFront CDN | TTFB p95 | Bandwidth, request count | 4xx rate, 5xx origin error rate | Cache hit ratio | Hit ratio < 85% for 30 min; 5xx origin rate > 2% for 5 min |
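
The 2-hour freshness threshold for the cost-rollup worker is the check that would have caught the silent 36-hour failure. Because the rollup runs nightly, this sketch measures lateness against the scheduled run time plus a grace period; that reading of the threshold, and the `is_overdue` helper itself, are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_overdue(last_success, scheduled_at, now, grace=timedelta(hours=2)):
    """True when the run scheduled at `scheduled_at` has still not
    succeeded once the grace period has elapsed."""
    if now < scheduled_at + grace:
        return False  # still inside the grace period, too early to judge
    return last_success is None or last_success < scheduled_at


scheduled = datetime(2025, 7, 14, 1, 0)          # tonight's scheduled run
previous_night = datetime(2025, 7, 13, 1, 30)    # last known success

# One hour after schedule: inside the grace period, no alert yet.
print(is_overdue(previous_night, scheduled, scheduled + timedelta(hours=1)))   # False
# Three hours after schedule with no new success: alert.
print(is_overdue(previous_night, scheduled, scheduled + timedelta(hours=3)))   # True
# Tonight's run succeeded 40 minutes in: healthy.
print(is_overdue(scheduled + timedelta(minutes=40), scheduled,
                 scheduled + timedelta(hours=3)))                              # False
```

A check like this has to run from somewhere other than the worker itself, since a worker that has stopped cannot report its own absence.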

Alerting Strategy

Team constraint: 6-person team with one engineer on call per week and high sensitivity to alert fatigue. Every P1 must be genuinely wake-up-worthy. P2s go to Slack only.

| Tier | Alert | Trigger | Response | Routing |
| --- | --- | --- | --- | --- |
| P1 — Critical | API availability SLO breach | 14.4x burn rate over 1-hr window (consuming 2% of the 30-day budget) | Page on-call immediately; initiate incident | PagerDuty → phone |
| P1 — Critical | API p99 latency breach | p99 > 2s sustained for 5 min | Page on-call; check recent deploy, DB query times | PagerDuty → phone |
| P1 — Critical | Cost-rollup data stale | No successful job completion within 2 hrs of the scheduled nightly run | Page on-call; check DLQ, ECS task health | PagerDuty → phone |
| P1 — Critical | DLQ message received | Any message lands in SQS dead-letter queue | Page on-call; job failed silently — inspect immediately | PagerDuty → phone |
| P2 — Warning | API error rate elevated | 5xx > 1% for 10 min (2x burn rate; below the paging threshold) | Investigate within 1 hour | #alerts-engineering Slack |
| P2 — Warning | RDS connection pool high | Pool utilization > 80% for 5 min | Review query patterns, connection leak risk | #alerts-engineering Slack |
| P2 — Warning | PDF worker task restarts | > 2 restarts in 5 min | Check memory limits, inspect failed render logs | #alerts-engineering Slack |
| P2 — Warning | Deploy change failure signal | P1 alert fires within 60 min of deploy | Evaluate rollback; notify on-call | #deploys Slack |
| P3 — Info | RDS disk usage | > 75% | Review retention policy; plan capacity | Daily digest / Datadog dashboard |
| P3 — Info | CDN cache hit ratio drop | < 85% for 30 min | Review cache headers; not user-impacting yet | |