Instrumentation Plan - AI Agent Skill

Use this when you need to design, audit, or improve system-level instrumentation for a product or service. This covers the SRE side of measurement: uptime, availability, latency, deployment health, error budgets, and alerting. If you're looking to measure user behavior, event tracking, or product analytics, use /observability-plan instead.

The distinction: Instrumentation answers "Is the system healthy?" Observability answers "Are users successful?" Both are needed. Start here if the system isn't reliably measurable yet.

For AI-powered features: This skill covers system-level instrumentation. If you need to monitor LLM-specific concerns (prompt quality, token costs, hallucination drift, model regression), use /llm-observability-plan.

Process

Step 1: Gather context

Ask the user to provide:

System description -- what does this system do? (services, APIs, background jobs, static sites, workers, etc.)
Current instrumentation -- what's already measured? (existing dashboards, alerts, logging, APM tools)
Infrastructure stack -- hosting, CI/CD, CDN, databases, message queues, third-party dependencies
Deployment process -- how code gets to production (manual, automated, frequency, rollback capability)
Incident history -- recent outages, degradations, or close calls (what broke, how was it detected, how long to recover?)
Team context -- who is on-call? Is there an existing SRE function or is this engineering-owned?

If the user doesn't have all of this, work with what's available. Flag gaps as assumptions.

Step 2: Define SLIs, SLOs, and error budgets

For each critical user journey or system capability, define:

Service Level Indicators (SLIs) -- the measurable signals:

SLI Category	What to measure	Example
Availability	Successful requests / total requests	99.9% of HTTP requests return non-5xx
Latency	Response time at key percentiles	p50 < 200ms, p95 < 800ms, p99 < 2s
Throughput	Request volume over time	Sustained 500 req/s during peak
Error rate	Errors / total operations	< 0.1% of API calls return 5xx
Freshness	Data age or staleness	Cache refreshes within 5 minutes
Correctness	Accurate results / total results	100% of calculations match expected output
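
As a rough sketch of how the availability and error-rate rows above reduce to ratios over request counts: the example assumes a hypothetical list of HTTP status codes pulled from access logs, and the function names are illustrative rather than taken from any monitoring tool.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests in window; SLI is undefined")
    good = sum(1 for code in codes if code < 500)
    return good / len(codes)


def error_rate_sli(status_codes: Iterable[int]) -> float:
    """Fraction of requests that returned a 5xx status."""
    return 1.0 - availability_sli(status_codes)


# Hypothetical sample: 997 successes, 2 client errors, 1 server error.
sample = [200] * 997 + [404] * 2 + [503]
print(f"availability: {availability_sli(sample):.3%}")
print(f"error rate:   {error_rate_sli(sample):.3%}")
```

Note that 4xx responses count as "good" here, matching the non-5xx definition in the table; a team that treats some 4xx codes as failures would need a different predicate.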

Service Level Objectives (SLOs) -- the targets:

Set SLOs based on user expectations, not aspirational perfection
A 99.9% availability SLO allows about 8.76 hours of downtime per year, or about 43 minutes per 30 days
Start conservative (lower targets), tighten as you build confidence
Every SLO needs an explicit measurement window (rolling 30 days is standard)

Error budgets -- the tolerance:

Error budget = 1 - SLO (e.g., 99.9% SLO = 0.1% error budget)
When budget is consumed, freeze feature releases and focus on reliability
Track burn rate: how fast is the error budget being spent?

Present these in a table:

Journey / Capability	SLI	SLO Target	Measurement Window	Error Budget	Current State
(User-facing API)	Availability	99.9%	Rolling 30 days	43.2 min/month	(measured or unknown)
(User-facing API)	Latency (p95)	< 500ms	Rolling 30 days	--	(measured or unknown)
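
The error budget column follows directly from the SLO and the measurement window. A minimal sketch of that arithmetic, assuming a time-based budget (every minute is either fully good or fully bad) and hypothetical function names:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a given SLO (0 < slo < 1)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget already spent in the window."""
    return bad_minutes / error_budget_minutes(slo, window_days)


print(round(error_budget_minutes(0.999), 1))            # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.999, 365) / 60, 2))  # 8.76 hours per year
print(round(budget_consumed(10, 0.999), 3))             # share spent after 10 bad minutes
```

A request-based budget (bad requests / total requests) is often a better fit for APIs with uneven traffic; the same ratio applies with request counts in place of minutes.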

Step 3: DORA metrics baseline

Assess the team's delivery health using the four DORA metrics. Tier thresholds have shifted between annual DORA reports, so treat the bands below as approximate:

Metric	What it measures	Elite	High	Medium	Low
Deployment frequency	How often code ships to production	On-demand (multiple/day)	Daily to weekly	Weekly to monthly	Monthly+
Lead time for changes	Commit to production	< 1 hour	1 day – 1 week	1 week – 1 month	1 month+
Change failure rate	% of deployments causing incidents	< 5%	5–10%	10–15%	15%+
Mean time to recovery (MTTR)	How long to restore service	< 1 hour	< 1 day	1 day – 1 week	1 week+

For each metric:

Current state -- what's the team's actual performance? (Measure or estimate.)
Target state -- where should they be in 90 days?
How to measure -- specific data source (CI/CD logs, incident tracker, deployment pipeline)
Biggest blocker -- what's preventing improvement?
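
One way the four metrics could be derived once deploy and incident data are joined. This is a sketch under assumptions: the `Deploy` record and its fields are hypothetical, and time to restore is measured from the deploy timestamp rather than from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime                  # when the change was merged
    deployed_at: datetime                   # when it reached production
    caused_incident: bool = False
    restored_at: Optional[datetime] = None  # set only if caused_incident


def dora_summary(deploys: list[Deploy], period_days: int) -> dict:
    """Summarize the four DORA metrics over a period of deploy records."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.caused_incident]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deploys) / (period_days / 7),
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }


# Hypothetical week of deploys, one of which caused a 45-minute incident.
t0 = datetime(2025, 1, 6, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=30)),
    Deploy(t0 + timedelta(days=4), t0 + timedelta(days=4, hours=2),
           caused_incident=True,
           restored_at=t0 + timedelta(days=4, hours=2, minutes=45)),
    Deploy(t0 + timedelta(days=6), t0 + timedelta(days=6, hours=6)),
]
summary = dora_summary(deploys, period_days=7)
print(summary)
```

The median is used for lead time so that a single slow change does not dominate the figure; a mean would work too if the team prefers it.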

Step 4: Infrastructure health metrics

Define the standard infrastructure signals to monitor:

The Four Golden Signals (per Google SRE):

Latency -- time to serve a request (distinguish successful vs. failed request latency)
Traffic -- demand on the system (requests/sec, sessions, transactions)
Errors -- rate of failed requests (explicit 5xx, implicit timeout, wrong-answer errors)
Saturation -- how full the system is (CPU, memory, disk, queue depth, connection pool)

For each service or component, produce a monitoring table:

Component	Latency Metric	Traffic Metric	Error Metric	Saturation Metric	Alert Threshold
(Web server)	p95 response time	req/sec	5xx rate	CPU %, memory %	p95 > 1s for 5 min
(Database)	Query time p95	Queries/sec	Failed queries	Connection pool %, disk I/O	Pool > 80% for 10 min
(Worker/queue)	Job duration p95	Jobs enqueued/sec	Failed jobs	Queue depth	Depth > 1000 for 15 min
(CDN/static)	TTFB p95	Bandwidth	4xx/5xx rate	Cache hit ratio	Hit ratio < 90%
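
Thresholds such as "p95 > 1s for 5 min" combine two steps: computing a percentile per interval, then checking that the breach is sustained. A minimal sketch, assuming per-minute aggregation and the nearest-rank percentile definition (monitoring tools often interpolate instead, so values may differ slightly); the function names are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def sustained_breach(per_minute_values: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` values all exceed the threshold."""
    if len(per_minute_values) < minutes:
        return False
    return all(value > threshold for value in per_minute_values[-minutes:])


# Hypothetical data: 100 latency samples (ms) and six per-minute p95 values (s).
latencies_ms = list(range(1, 101))
p95_by_minute = [0.8, 1.2, 1.3, 1.1, 1.4, 1.6]
print(percentile(latencies_ms, 95))
print(sustained_breach(p95_by_minute, threshold=1.0, minutes=5))
```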

Step 5: Alerting strategy

Design alerts that are actionable, not noisy:

Alert tiers:

Tier	Meaning	Response	Notification	Example
P1 -- Critical	User-facing service is down or severely degraded	Immediate response, wake people up	PagerDuty / phone	API availability < 99% for 5 min
P2 -- Warning	Degradation detected, not yet user-impacting	Investigate within 1 hour	Slack alert channel	Error rate > 1% for 15 min
P3 -- Info	Trend worth watching	Review next business day	Dashboard / email digest	Disk usage > 70%

Alert design rules:

Every alert must have a documented response action (or it's noise)
Use burn-rate alerts for SLO monitoring (not raw threshold alerts)
Set alerts on symptoms (user impact), not causes (CPU spike) -- unless cause alerts are the only early warning
Require 2+ consecutive breaches before firing (avoid flapping)
Review alert fatigue quarterly: if an alert fires > 10x/month without action, fix or delete it
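
To make the burn-rate rule concrete: burn rate is the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a 30-day SLO window and uses a 14.4x paging threshold, a commonly cited value because holding it for one hour spends 2% of a 30-day budget. The long/short window pairing and the function names are illustrative assumptions, not a prescribed configuration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def budget_fraction_spent(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Share of the whole budget consumed if `rate` is held for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


# Hypothetical readings against a 99.9% SLO: 2% errors over 1 hour, 3% over the last 5 minutes.
print(round(burn_rate(0.02, 0.999), 1))
print(round(budget_fraction_spent(14.4, window_hours=1), 3))
print(should_page(0.02, 0.03, slo=0.999))
```

Requiring the short window to agree keeps the alert from continuing to fire after the problem has already stopped, which addresses flapping without a fixed count of consecutive breaches.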

Step 6: Generate the instrumentation plan

Compile everything into a single document:

Instrumentation Plan -- (Project name)

Generated: (date)
System: (brief description)
Current state: (summary of what's instrumented today)

SLIs, SLOs & Error Budgets

(Table from Step 2)

DORA Metrics Baseline

(Table from Step 3 -- current state, targets, measurement sources)

Infrastructure Monitoring

(Table from Step 4 -- golden signals per component)

Alerting Strategy

(Tiered alert definitions from Step 5)

Implementation Checklist

Priority-ordered list of what to instrument next:

(P0) (Most critical gap -- e.g., "No availability SLI exists for the primary API")
(P0) (Second critical gap)
(P1) (Important but not urgent -- e.g., "DORA metrics not tracked; add deployment frequency counter")
(P1) (Next important item)
(P2) (Nice to have -- e.g., "Add cache hit ratio monitoring for CDN")

Open Questions

(Anything that couldn't be resolved without more information)

Recommended Tools

(Based on the team's stack -- only include if the user asked or if there's a clear gap)

Step 7: Review and refine

Ask the user:

Are the SLOs realistic for your team's current maturity?
Are any critical components missing from the monitoring table?
Does the alerting strategy match your on-call setup? (No on-call = no P1 phone alerts.)
Is the implementation checklist ordered correctly for your priorities?
Any incidents in the last 6 months that this plan wouldn't have caught?

Adjust based on feedback.

Related skills

/observability-plan -- plan product-level observability for user behavior, events, and task completion
/security-review -- generate attacker stories and security acceptance criteria
/debug-assist -- hypothesis-driven debugging when something goes wrong

Output location

Present the plan as formatted text in the conversation. The user can copy it into their project wiki, engineering docs, or incident response runbook.

Example Output

Input

System description: Multi-tenant B2B SaaS platform for construction project management — includes a REST API (Node.js), a PDF report generation worker (Python), a PostgreSQL database, and a React frontend served via CDN. Background jobs handle nightly cost-rollup calculations for ~4,000 active projects.
Current instrumentation: Datadog APM on the API layer only; no alerting on the worker or database; one dashboard showing p50 latency and HTTP status codes; no SLOs defined; PagerDuty is licensed but unconfigured.
Infrastructure stack: AWS (ECS Fargate for API and workers), RDS PostgreSQL (Multi-AZ), CloudFront CDN, SQS for job queuing, GitHub Actions for CI/CD, deploys ~3x per week via manual approval gate.
Incident history: Two incidents in the past 90 days — (1) cost-rollup worker silently failed for 36 hours; customers saw stale data but no alert fired; (2) a bad deploy caused API p99 latency to spike to 12s for 45 minutes before a customer reported it on Slack.
Team context: 6-person engineering team, no dedicated SRE; one engineer rotates on-call per week; on-call engineer carries a phone but alert fatigue from a previous job makes the team wary of noisy alerting.

Output (abbreviated)

Instrumentation Plan — Clearpath Construction Platform

Generated: 2025-07-14
System: Multi-tenant B2B SaaS (REST API, PDF report worker, cost-rollup background jobs, PostgreSQL, CloudFront CDN)
Current state: Partial. Datadog APM covers API latency at p50 only. No SLOs, no worker monitoring, no database alerting, no deployment health tracking. PagerDuty licensed but inactive. Two recent incidents were detected by customers, not by tooling.

SLIs, SLOs & Error Budgets

Journey / Capability	SLI	SLO Target	Window	Error Budget	Current State
API (all endpoints)	Availability (non-5xx / total)	99.5%	Rolling 30 days	3.6 hrs/month	Unknown — no SLI defined
API (all endpoints)	Latency p95	< 800ms	Rolling 30 days	—	Unknown — only p50 tracked
API (all endpoints)	Latency p99	< 2s	Rolling 30 days	—	Unknown — incident showed 12s spike undetected
Cost-rollup jobs	Freshness (delay past the scheduled nightly run)	Completed within 2 hrs of schedule	Rolling 30 days	--	Unknown; the silent 36-hr failure shows there is no coverage
Cost-rollup jobs	Error rate (failed jobs / total)	< 1%	Rolling 30 days	—	Unknown
PDF report generation	Availability (successful completions)	99.0%	Rolling 30 days	7.2 hrs/month	Unknown
PDF report generation	Latency p95	< 30s	Rolling 30 days	—	Unknown

Note: SLO targets are set conservatively given current measurement gaps. Recommend tightening API availability to 99.9% once you have 60 days of baseline data.

DORA Metrics Baseline

Metric	Current State	Target (90 days)	How to Measure	Biggest Blocker
Deployment frequency	~3x/week (estimated)	Daily	Tag GitHub Actions deploy events; count per day in Datadog	Manual approval gate adds friction; no blocker to counting
Lead time for changes	Unknown — not tracked	< 1 day	Timestamp commit SHA at merge; compare to ECS deploy completion event	No tooling linking commits to deploy events today
Change failure rate	Unknown; roughly 3–6% if only the known incidents count (1 to 2 across ~36 deploys in 90 days), and probably undercounted because failures are not tracked	< 10%	Flag deploys followed by P1/P2 alert within 1 hour as failures; track in GitHub Actions	No automated linkage between deploy events and incident signals
MTTR	~45 min for latency spike (customer-reported); 36+ hrs for worker failure (not detected)	< 1 hour for all P1s	Track incident open → resolved timestamps in PagerDuty once configured	Silent failures have infinite MTTR; freshness monitoring is prerequisite

Infrastructure Monitoring — Golden Signals

Component	Latency Metric	Traffic Metric	Error Metric	Saturation Metric	Alert Threshold
API (ECS Fargate)	p95, p99 response time per endpoint	req/sec by endpoint	5xx rate, timeout rate	CPU %, memory %, ECS task restarts	p99 > 2s for 5 min; 5xx > 1% for 10 min
PostgreSQL (RDS)	Query time p95 (by query type)	Queries/sec, active connections	Failed queries, deadlocks	Connection pool %, IOPS, disk %	Pool > 80% for 5 min; disk > 75%
Cost-rollup worker (SQS + ECS)	Job duration p95	Jobs enqueued/sec, jobs processed/sec	Failed jobs, DLQ depth	SQS queue depth, ECS task count	DLQ depth > 0 immediately; queue depth > 500 for 15 min; no successful run within 2 hrs of the nightly schedule
PDF worker (ECS)	Render time p95	Jobs submitted/sec	Failed renders	ECS task restarts, memory %	Render p95 > 60s for 10 min; task restart > 2 in 5 min
CloudFront CDN	TTFB p95	Bandwidth, request count	4xx rate, 5xx origin error rate	Cache hit ratio	Hit ratio < 85% for 30 min; 5xx origin rate > 2% for 5 min
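
The staleness threshold on the cost-rollup worker is the check that would have caught the silent 36-hour failure. A minimal sketch of such a freshness check, assuming the worker records a timestamp on each successful completion (where that timestamp is stored is not specified here, and `is_stale` is a hypothetical helper). For a job on a fixed schedule, `max_age` would typically be the schedule interval plus a grace period.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_success: Optional[datetime], now: datetime, max_age: timedelta) -> bool:
    """True if no success has ever been recorded or the last one is older than max_age."""
    if last_success is None:
        return True
    return now - last_success > max_age


# Hypothetical nightly job: 24-hour interval plus a 2-hour grace period.
now = datetime(2025, 7, 14, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=24) + timedelta(hours=2)
print(is_stale(now - timedelta(hours=36), now, max_age))  # a 36-hour gap is flagged
print(is_stale(now - timedelta(hours=10), now, max_age))  # a recent success is not
```

Treating a missing timestamp as stale is deliberate: a worker that has never reported success should alert rather than pass silently.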

Alerting Strategy

Team constraint: 6-person rotating on-call, high alert-fatigue sensitivity. Every P1 must be genuinely wake-up-worthy. P2s go to Slack only.

Tier	Alert	Trigger	Response	Routing
P1 -- Critical	API availability SLO breach	14.4x burn rate over 1-hr window (consuming 2% of the 30-day budget)	Page on-call immediately; initiate incident	PagerDuty → phone
P1 -- Critical	API p99 latency breach	p99 > 2s sustained for 5 min	Page on-call; check recent deploy, DB query times	PagerDuty → phone
P1 -- Critical	Cost-rollup data stale	No successful job completion within 2 hrs of the scheduled nightly run	Page on-call; check DLQ, ECS task health	PagerDuty → phone
P1 -- Critical	DLQ message received	Any message lands in SQS dead-letter queue	Page on-call; job failed silently, inspect immediately	PagerDuty → phone
P2 -- Warning	API error rate elevated	5xx > 1% for 10 min (about 2x burn rate against the 99.5% SLO, below the paging threshold)	Investigate within 1 hour	#alerts-engineering Slack
P2 -- Warning	RDS connection pool high	Pool utilization > 80% for 5 min	Review query patterns, connection leak risk	#alerts-engineering Slack
P2 -- Warning	PDF worker task restarts	> 2 restarts in 5 min	Check memory limits, inspect failed render logs	#alerts-engineering Slack
P2 -- Warning	Deploy change failure signal	P1 alert fires within 60 min of deploy	Evaluate rollback; notify on-call	#deploys Slack
P3 -- Info	RDS disk usage	> 75%	Review retention policy; plan capacity	Daily digest / Datadog dashboard
P3 -- Info	CDN cache hit ratio drop	< 85% for 30 min	Review cache headers; not user-impacting yet	Daily digest / Datadog dashboard
