Use this when a team or organization needs to define SLAs for a service -- internal platform, customer-facing product, or vendor contract -- and needs to set appropriate targets, establish measurement methodology, and build governance for monitoring and enforcement.

Related skills: Use /instrumentation-plan for SLI/SLO technical implementation. Use /observability-plan for monitoring strategy. Use /stakeholder-comms for communicating SLA changes. Use /incident-postmortem when SLA breaches occur. SLA outputs feed into /vendor-evaluation contract terms.

Process

Step 1: Gather inputs

Collect from the service owner and stakeholders:

Service description -- what the service does, at a technical and business level
Consumers -- who depends on this service (internal teams, external customers, partners)
Current reliability data -- existing uptime, latency, and error rate numbers (if available)
Business criticality -- what happens when this service is down (revenue impact, user impact, compliance impact)
Existing informal expectations -- what consumers already expect, even if not documented
Contractual obligations -- any existing commitments in customer contracts or partner agreements
Budget for reliability -- engineering and infrastructure budget available for reliability improvements

Step 2: Define SLA tiers

Classify services by criticality. Each tier gets different target levels:

Tier	Description	Examples	Typical availability target
Tier 1	Revenue-critical, customer-facing	Payment processing, core API, auth	99.95% - 99.99%
Tier 2	Important internal or secondary services	Internal dashboards, async processing, search	99.9% - 99.95%
Tier 3	Non-critical services	Dev tools, internal wikis, batch jobs	99% - 99.9%

Assign the service being designed to the appropriate tier. If consumers disagree on the tier, that disagreement is a useful signal -- resolve it before setting targets.

Step 3: Select metrics

For the service, choose 3-5 SLIs (Service Level Indicators) that actually matter. Don't measure everything -- measure what consumers care about.

Common SLIs by category:

Category	Metric	Good for
Availability	Uptime percentage	Request-serving systems
Latency	p50, p95, p99 response times	User-facing APIs and UIs
Throughput	Requests per second capacity	Systems with known load patterns
Error rate	Percentage of failed requests	APIs and data pipelines
Durability	Data loss probability	Storage and database systems
Freshness	Data staleness (max age)	Async systems, caches, search indexes
Completeness	Percentage of data processed	ETL pipelines, event processing

For each selected metric, document:

What exactly is measured (e.g., "HTTP 5xx responses as a percentage of total requests, excluding health checks")
Where the measurement happens (at the load balancer, application, or client)
Why this metric matters to consumers
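As an illustration of how precise an SLI definition needs to be, here is a minimal Python sketch of the error-rate example above. The `Request` record, the `EXCLUDED_PATHS` set, and the sample data are hypothetical; real inputs would come from your load balancer or application logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    status: int


# Hypothetical exclusion list; mirror whatever the SLA text actually excludes.
EXCLUDED_PATHS = frozenset({"/health"})


def error_rate_percent(requests, excluded_paths=EXCLUDED_PATHS):
    """HTTP 5xx responses as a percentage of total requests, excluding health checks."""
    counted = [r for r in requests if r.path not in excluded_paths]
    if not counted:
        return 0.0  # assumption: no counted traffic means no errors
    failed = sum(1 for r in counted if 500 <= r.status <= 599)
    return 100.0 * failed / len(counted)


sample = [
    Request("/bookings", 200),
    Request("/bookings", 503),
    Request("/health", 500),  # excluded from both numerator and denominator
    Request("/slots", 404),   # counted in the total, not as a failure
    Request("/slots", 200),
]
print(error_rate_percent(sample))  # 25.0
```

Writing the definition as code forces decisions the prose can leave open, such as whether excluded paths leave the denominator and how 4xx responses are treated.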

Step 4: Set targets

For each metric, set targets using current baseline data and industry benchmarks:

## SLA targets -- {{service_name}}

### Availability
- **Target:** {{99.9%}} measured over {{monthly}} windows
- **Current baseline:** {{X%}} over the last {{N}} months
- **What this means:** {{43 minutes}} of allowable downtime per month
- **Exclusions:** Planned maintenance (with 72-hour notice), force majeure

### Latency
- **p50 target:** {{X}}ms
- **p95 target:** {{X}}ms
- **p99 target:** {{X}}ms
- **Current baseline:** p50={{X}}ms, p95={{X}}ms, p99={{X}}ms
- **Measurement point:** {{load balancer / application / client}}

### Error rate
- **Target:** < {{0.1%}} of requests over {{monthly}} windows
- **Current baseline:** {{X%}}
- **Definition:** HTTP 5xx responses / total responses, excluding {{exclusions}}

The cost of nines: Make this explicit in every SLA discussion.

Availability	Downtime/month	Downtime/year	Relative cost
99%	7.3 hours	3.65 days	Baseline
99.9%	43 minutes	8.77 hours	~3-5x baseline
99.95%	22 minutes	4.38 hours	~5-10x baseline
99.99%	4.3 minutes	52.6 minutes	~10-30x baseline
99.999%	26 seconds	5.26 minutes	~100x+ baseline

Each additional nine costs exponentially more. The target should match the business value, not engineering pride.
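The downtime figures in the table follow directly from the availability percentage and the window length. A minimal Python sketch, assuming a 30-day month by default (the function name and default are illustrative, and results shift slightly with the month length you choose):

```python
def allowed_downtime_minutes(availability_percent: float, window_days: float = 30.0) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_percent) / 100.0


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per 30-day month")
```

For example, 99.9% over a 30-day month works out to about 43.2 minutes, and 99% over a 365-day year to about 3.65 days.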

Step 5: Design measurement methodology

For each metric, define:

## Measurement methodology

### {{Metric name}}
- **Data source:** {{monitoring system, logs, synthetic checks}}
- **Calculation:** {{exact formula}}
- **Reporting cadence:** {{real-time dashboard + monthly report}}
- **Measurement window:** {{calendar month, rolling 30 days}}

### Edge case handling
- **Planned maintenance:** Excluded if announced {{72}} hours in advance via {{channel}}
- **Dependent service failures:** {{Excluded / Included with note}}
- **Partial outages:** {{How degraded service is counted -- binary or proportional}}
- **Dispute resolution:** {{Who arbitrates disagreements about whether the SLA was met}}

Agree on the measurement methodology before the targets from Step 4 are finalized: a target is only meaningful once everyone accepts how it will be measured. An SLA without clear measurement is just a hope.
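To show what an "exact formula" can look like once exclusions are involved, here is a minimal Python sketch of an availability calculation that excludes announced maintenance. It assumes that excluded maintenance time is removed from both the downtime and the measured window, and that the intervals within each list do not overlap one another; both are assumptions to confirm with consumers, not a standard.

```python
from datetime import datetime, timedelta

ZERO = timedelta(0)


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (zero if they do not meet)."""
    return max(min(a_end, b_end) - max(a_start, b_start), ZERO)


def availability_percent(window_start, window_end, outages, maintenance):
    """Availability over a window; outages and maintenance are lists of (start, end)."""
    # Excluded time: announced maintenance that falls inside the window.
    excluded = sum((overlap(s, e, window_start, window_end) for s, e in maintenance), ZERO)
    downtime = ZERO
    for start, end in outages:
        # Clip each outage to the window, then subtract the part covered by maintenance.
        start, end = max(start, window_start), min(end, window_end)
        covered = sum((overlap(start, end, m_s, m_e) for m_s, m_e in maintenance), ZERO)
        downtime += max(end - start, ZERO) - covered
    measured = (window_end - window_start) - excluded
    if measured <= ZERO:
        raise ValueError("no measurable time in window")
    return 100.0 * (1 - downtime / measured)


window_start = datetime(2025, 1, 1)
window_end = window_start + timedelta(days=30)
maintenance = [(datetime(2025, 1, 12, 2, 0), datetime(2025, 1, 12, 4, 0))]
outages = [
    (datetime(2025, 1, 5, 9, 0), datetime(2025, 1, 5, 9, 30)),    # unplanned, counts
    (datetime(2025, 1, 12, 2, 30), datetime(2025, 1, 12, 3, 30)),  # inside maintenance
]
result = availability_percent(window_start, window_end, outages, maintenance)
print(round(result, 3))  # about 99.93
```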

Step 6: Build governance processes

## Governance

### Breach notification
- **Detection:** Automated alerting when {{metric}} crosses {{warning threshold}} (before SLA breach)
- **Notification:** {{Who is notified, through what channel, within what timeframe}}
- **Escalation:** If breach continues beyond {{X}} minutes, escalate to {{role/team}}

### Breach response
- **Incident process:** Follow standard incident management process
- **Root cause analysis:** Required for any SLA breach exceeding {{X}} minutes
- **Improvement plan:** Required if SLA is breached {{N}} times in a {{quarter}}

### Credit/penalty mechanisms (for external SLAs)
- **Breach of availability target:** {{X%}} credit on monthly invoice
- **Breach of latency target:** {{X%}} credit on monthly invoice
- **Credit cap:** {{Maximum credit per month, typically 30% of monthly fee}}
- **Claim process:** {{How customers submit claims, response timeline}}

### Regular reviews
- **Monthly:** SLA performance report to {{stakeholders}}
- **Quarterly:** SLA review meeting -- assess targets, adjust if needed
- **Annual:** Full SLA redesign review
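The detection rule in the template (alert at a warning threshold before the SLA is breached) can be sketched as a check on how much of the downtime budget has been used. This is a minimal Python sketch; the `warn_fraction` of 0.75 and the 30-day window are illustrative assumptions, not recommended values.

```python
def budget_status(target_percent: float, bad_minutes_so_far: float,
                  window_days: float = 30.0, warn_fraction: float = 0.75) -> str:
    """Classify the current window as 'ok', 'warning', or 'breach'."""
    budget_minutes = window_days * 24 * 60 * (100.0 - target_percent) / 100.0
    if budget_minutes <= 0:
        raise ValueError("target leaves no downtime budget")
    consumed = bad_minutes_so_far / budget_minutes
    if consumed >= 1.0:
        return "breach"   # the SLA for this window is already missed
    if consumed >= warn_fraction:
        return "warning"  # alert now, while there is still budget left
    return "ok"


# With a 99.9% target, the 30-day budget is about 43.2 minutes.
print(budget_status(99.9, 10))  # ok
print(budget_status(99.9, 35))  # warning
print(budget_status(99.9, 50))  # breach
```

The warning state is what makes escalation possible before credits or penalties apply.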

Step 7: Generate the SLA document

# Service Level Agreement -- {{service_name}}

**Version:** {{1.0}}
**Effective date:** {{date}}
**Review date:** {{date + 1 year}}
**Owner:** {{team/person}}

## 1. Service description
{{What the service does, who it serves}}

## 2. Metrics and targets
{{From Step 4 -- metrics, targets, measurement windows}}

## 3. Measurement methodology
{{From Step 5 -- how each metric is measured}}

## 4. Exclusions
{{Planned maintenance, force majeure, dependent services}}

## 5. Breach procedures
{{From Step 6 -- notification, escalation, response}}

## 6. Credits and remedies
{{From Step 6 -- credit terms, claim process, caps}}

## 7. Review schedule
{{Monthly reporting, quarterly review, annual redesign}}

## 8. Approval
{{Signatures/approvals from service owner and key consumers}}

Step 8: Review

Before finalizing, ask:

Are the targets achievable with current infrastructure and team capacity?
Who pays for the additional reliability? (More nines = exponentially more cost in engineering and infrastructure.)
Are the exclusions reasonable from the consumer's perspective?
Does legal need to review the contract language for external SLAs?
Is the measurement methodology automated, or does it depend on manual data collection?
Can you actually alert on these metrics before a breach occurs?
What happens during the first breach -- is the governance process clear enough that people know what to do?

Output location

Save to deliverables/sla-{{service_name}}-{{date}}.md. For external SLAs, the final version will likely need legal review and a different format -- this output serves as the technical foundation.

Example Output

Input

Service description: Meridian Health's patient-facing appointment booking API — handles scheduling, cancellation, and real-time slot availability for 340+ clinic locations across the Pacific Northwest; consumed by the patient web portal, mobile app, and third-party EHR integrations
Consumers: 1.2M registered patients (external), 14 internal engineering teams, 3 EHR vendor partners (Epic, Athenahealth, eClinicalWorks)
Current reliability data: 99.71% uptime over last 6 months; p50 latency 210ms, p95 latency 890ms, p99 latency 2,400ms; error rate averaging 0.43%
Business criticality: Missed appointments cost $180–$340 each in no-show revenue; booking failures during peak hours (7–9am, 4–6pm) directly affect patient care access; HIPAA audit obligations require data completeness
Contractual obligations: EHR vendor contracts currently have no formal SLA; patient-facing terms of service reference "best efforts" availability only

Output (abbreviated)

Service Level Agreement — Meridian Health Appointment Booking API

Version: 1.0
Effective date: 2025-02-01
Review date: 2026-02-01
Owner: Platform Engineering, Meridian Digital Health
Classification: Tier 1 (revenue-critical, patient-facing)

Tier Assignment

Factor	Assessment
Revenue impact	~$2,100–$4,080 lost per hour of booking downtime at avg. volume
Patient impact	Direct care access disruption; appointments missed or duplicated
Compliance impact	HIPAA availability and audit trail requirements apply
EHR partner dependency	Epic/Athenahealth integrations block clinic workflows if API fails
Assigned tier	Tier 1; initial target range 99.9% – 99.95%, below the typical Tier 1 range of 99.95% – 99.99% (see the note that follows)

Current baseline (99.71%) falls short of Tier 1 thresholds. A reliability improvement roadmap is required before 99.95% can be committed externally. Initial external SLA set at 99.9%; internal engineering SLO set at 99.95% as a buffer.

Metrics and Targets

Availability

Target: 99.9% measured over calendar month windows
Current baseline: 99.71% (6-month average)
Allowable downtime: 43.2 minutes per month
Exclusions: Scheduled maintenance with ≥72-hour notice via status page and email to EHR partners; force majeure

Latency

p50 target: ≤ 250ms
p95 target: ≤ 750ms (current p95 at 890ms — requires optimization sprint before target is contractually committed)
p99 target: ≤ 1,500ms (current p99 at 2,400ms — phased target: ≤2,000ms at launch, ≤1,500ms by Q3 2025)
Measurement point: AWS Application Load Balancer (ALB), excluding health check endpoints
Peak-hour note: Targets apply at all times, including 7–9am and 4–6pm PT peak windows

Error Rate

Target: < 0.1% of requests per calendar month
Current baseline: 0.43% — improvement required
Definition: HTTP 5xx responses ÷ total responses, excluding requests to /health and /status; client-side 4xx responses are not counted as errors
EHR integration note: Integration-originated errors tracked separately; included in overall rate but broken out in monthly report

Data Freshness (Slot Availability)

Target: Slot availability data ≤ 30 seconds stale at p95
Rationale: Double-booking risk if cache is not refreshed; directly affects patient experience and clinic ops
Measurement: Timestamp delta between source-of-truth DB write and API response cache, sampled every 60 seconds
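As a side illustration (not part of the SLA text), the p95 freshness check can be sketched in Python with a nearest-rank percentile. Nearest-rank is an assumption here; monitoring tools often interpolate and may report slightly different values. The sample staleness values are made up.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical staleness samples in seconds (response time minus DB write time).
staleness_seconds = [4, 6, 5, 9, 12, 7, 8, 31, 6, 5, 10, 11, 7, 6, 9, 8, 5, 14, 6, 7]
p95 = percentile_nearest_rank(staleness_seconds, 95)
print(p95, p95 <= 30)  # 14 True
```

A single 31-second outlier does not breach the target here, which is the point of stating the target at p95 rather than as a maximum.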

Measurement Methodology

Metric	Data Source	Calculation	Window	Cadence
Availability	AWS ALB access logs + synthetic checks (every 2 min from 3 regions)	(Total minutes − downtime minutes) ÷ total minutes	Calendar month	Real-time dashboard + monthly report
Latency	ALB request logs, percentile aggregation	p50/p95/p99 of response time field, all non-health endpoints	Rolling 24h (dashboard), calendar month (SLA)	Real-time + monthly
Error rate	ALB logs filtered to 5xx	Count(5xx) ÷ Count(all requests)	Calendar month	Real-time + monthly
Freshness	Custom CloudWatch metric emitted by cache invalidation service	p95 of (response_timestamp − db_write_timestamp)	Rolling 1h	Real-time

Edge case handling:

Planned maintenance: Excluded from availability calculation if announced ≥72h in advance on status.meridianhealth.com and via email to registered EHR API contacts
Upstream dependencies: AWS RDS or ElastiCache outages are included in availability calculation (Meridian owns the stack); noted separately in breach reports
Partial outages: Degraded service (>10% of requests affected, non-zero success rate) counted proportionally using the formula: downtime_equivalent = (error_rate − baseline) × affected_duration
Dispute resolution: Engineering VP and EHR partner technical lead review raw ALB logs jointly within 5 business days of dispute filing
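As a worked illustration of the partial-outage formula (not part of the SLA text), here is a minimal Python sketch. It assumes error_rate and baseline are fractions of requests failing, and that degradation at or below the 10% threshold contributes no downtime; the SLA text does not state either point explicitly. A full outage would be counted as ordinary downtime rather than through this formula.

```python
def downtime_equivalent_minutes(error_rate: float, baseline_error_rate: float,
                                affected_minutes: float,
                                degraded_threshold: float = 0.10) -> float:
    """downtime_equivalent = (error_rate - baseline) x affected_duration."""
    if error_rate <= degraded_threshold:
        return 0.0  # assumption: below the threshold nothing is counted
    return (error_rate - baseline_error_rate) * affected_minutes


# Hypothetical incident: 40% of requests fail for 30 minutes, baseline 0.1%.
print(round(downtime_equivalent_minutes(0.40, 0.001, 30), 2))  # 11.97
```

In this example a 30-minute degradation consumes roughly 12 minutes of the 43.2-minute monthly allowance.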

Governance

Breach Detection and Notification

Condition	Warning Threshold	SLA Breach Threshold	Who is Notified	Channel	Notification Deadline
Availability	< 99.97% in rolling 1h	< 99.9% in calendar month	On-call engineer, Platform EM	PagerDuty + Slack #incidents	Immediate (auto-alert)
p95 latency	> 600ms rolling 15-min avg	> 750ms sustained 30+ min	On-call engineer	PagerDuty	Immediate
Error rate	> 0.07% rolling 1h	> 0.1% in calendar month	On-call + Product lead	PagerDuty + Slack	Immediate
EHR partner impact	Any Tier 1 breach affecting integration	Same	EHR partner technical contacts	Email + status page	Within 15 minutes of confirmation

Breach Response

RCA requirement: Required for any breach involving more than 15 minutes of downtime, or an error rate above 0.2% sustained for 1+ hours
RCA delivery: Within 5 business days of incident resolution, shared with EHR partners
Improvement plan: Required if SLA is breached 2+ times in a rolling quarter; reviewed with EHR partners within 10 business days

Credit and Penalty Mechanism (EHR Vendor Contracts)

Availability Achieved	Monthly Credit
≥ 99.5% and < 99.9%	10% of monthly API access fee
≥ 99.0% and < 99.5%	20% of monthly API access fee
< 99.0%	30% of monthly API access fee

Latency breach credit: 5% of monthly fee if p95 exceeds target for >4 cumulative hours in a month
Credit cap: 30% of monthly fee per calendar month
Claim process: EHR partners submit claims via vendor portal within 30 days of month end; Meridian responds within 10 business days with ALB log evidence
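As an illustration of how these credit terms combine (not part of the SLA text), here is a minimal Python sketch. It assumes the lower bound of each availability band is inclusive and that the availability and latency credits add together before the cap applies; the terms above do not spell out either point, so contract language should.

```python
def monthly_credit_percent(availability_percent: float, p95_breach_hours: float) -> int:
    """Credit as a percentage of the monthly API access fee."""
    if availability_percent >= 99.9:
        credit = 0   # availability target met
    elif availability_percent >= 99.5:
        credit = 10
    elif availability_percent >= 99.0:
        credit = 20
    else:
        credit = 30
    if p95_breach_hours > 4:
        credit += 5  # latency credit: p95 over target for more than 4 cumulative hours
    return min(credit, 30)  # credit cap per calendar month


print(monthly_credit_percent(99.95, 0))  # 0
print(monthly_credit_percent(99.7, 5))   # 15
print(monthly_credit_percent(98.5, 6))   # 30
```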

Cost of Nines — Decision Rationale

Target	Downtime/month	Engineering cost estimate	Decision
99.71% (current)	~2.1 hours	Baseline	Not contractually acceptable
99.9%	43 min	~1.5x current infra + on-call	External SLA commitment
99.95%	22 min	~2.5x	Internal SLO (engineering buffer)
99.99%	4.3 min	~8x	Not justified at current revenue scale

*Moving from 99.71% to 99.9% requires: ALB redundancy improvements, read
