Use this when a team or organization needs to define SLAs for a service -- internal platform, customer-facing product, or vendor contract -- and needs to set appropriate targets, establish measurement methodology, and build governance for monitoring and enforcement.
Related skills: Use /instrumentation-plan for SLI/SLO technical implementation. Use /observability-plan for monitoring strategy. Use /stakeholder-comms for communicating SLA changes. Use /incident-postmortem when SLA breaches occur. SLA outputs feed into /vendor-evaluation contract terms.
Process
Step 1: Gather inputs
Collect from the service owner and stakeholders:
- Service description -- what the service does, at a technical and business level
- Consumers -- who depends on this service (internal teams, external customers, partners)
- Current reliability data -- existing uptime, latency, and error rate numbers (if available)
- Business criticality -- what happens when this service is down (revenue impact, user impact, compliance impact)
- Existing informal expectations -- what consumers already expect, even if not documented
- Contractual obligations -- any existing commitments in customer contracts or partner agreements
- Budget for reliability -- engineering and infrastructure budget available for reliability improvements
Step 2: Define SLA tiers
Classify services by criticality. Each tier gets different target levels:
| Tier | Description | Examples | Typical availability target |
|---|---|---|---|
| Tier 1 | Revenue-critical, customer-facing | Payment processing, core API, auth | 99.95% - 99.99% |
| Tier 2 | Important internal or secondary services | Internal dashboards, async processing, search | 99.9% - 99.95% |
| Tier 3 | Non-critical services | Dev tools, internal wikis, batch jobs | 99% - 99.9% |
Assign the service being designed to the appropriate tier. If consumers disagree on the tier, that disagreement is a useful signal -- resolve it before setting targets.
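As a rough illustration, the tier table can be encoded as data so that a service's current baseline is checked against its tier before targets are discussed. This is a minimal Python sketch, not a prescribed tool: the `TIERS` mapping and the `baseline_gap` helper are hypothetical names, and the ranges are copied from the table above.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    low: float   # lower bound of the typical availability target, in percent
    high: float  # upper bound of the typical availability target, in percent


# Ranges copied from the tier table; adjust to your organization's tiers.
TIERS = {
    1: Tier("Tier 1", 99.95, 99.99),
    2: Tier("Tier 2", 99.9, 99.95),
    3: Tier("Tier 3", 99.0, 99.9),
}


def baseline_gap(tier: int, baseline_pct: float) -> float:
    """Percentage points by which the baseline falls short of the tier's
    lower bound (0.0 if the baseline already meets it)."""
    return max(0.0, round(TIERS[tier].low - baseline_pct, 4))
```

For a Tier 1 service with a 99.71% baseline, `baseline_gap(1, 99.71)` returns 0.24: the service sits 0.24 percentage points below the lower end of the typical Tier 1 range, which is worth surfacing before anyone commits to a target.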
Step 3: Select metrics
For the service, choose 3-5 SLIs (Service Level Indicators) that actually matter. Don't measure everything -- measure what consumers care about.
Common SLIs by category:
| Category | Metric | Good for |
|---|---|---|
| Availability | Uptime percentage | Request-serving systems |
| Latency | p50, p95, p99 response times | User-facing APIs and UIs |
| Throughput | Requests per second capacity | Systems with known load patterns |
| Error rate | Percentage of failed requests | APIs and data pipelines |
| Durability | Data loss probability | Storage and database systems |
| Freshness | Data staleness (max age) | Async systems, caches, search indexes |
| Completeness | Percentage of data processed | ETL pipelines, event processing |
For each selected metric, document:
- What exactly is measured (e.g., "HTTP 5xx responses as a percentage of total requests, excluding health checks")
- Where the measurement happens (at the load balancer, application, or client)
- Why this metric matters to consumers
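To make "what exactly is measured" concrete, the error-rate example above can be written as a small calculation. This is a sketch under stated assumptions: the `Request` record, the `EXCLUDED_PATHS` set, and the `/health` path are hypothetical stand-ins for whatever your logs and health-check endpoints actually look like.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    path: str
    status: int  # HTTP status code of the response


# Assumption: health checks are served at /health; replace with your own paths.
EXCLUDED_PATHS = frozenset({"/health"})


def error_rate_pct(requests: Iterable[Request],
                   excluded_paths: frozenset = EXCLUDED_PATHS) -> float:
    """HTTP 5xx responses as a percentage of total requests,
    excluding health-check traffic from both numerator and denominator."""
    counted = [r for r in requests if r.path not in excluded_paths]
    if not counted:
        return 0.0  # no qualifying traffic; treat as no errors
    failed = sum(1 for r in counted if 500 <= r.status <= 599)
    return 100.0 * failed / len(counted)
```

Note that 4xx responses stay in the denominator but never count as failures here. Whether that is right for your service is exactly the kind of decision this step should write down.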
Step 4: Set targets
For each metric, set targets using current baseline data and industry benchmarks:
## SLA targets -- {{service_name}}
### Availability
- **Target:** {{99.9%}} measured over {{monthly}} windows
- **Current baseline:** {{X%}} over the last {{N}} months
- **What this means:** {{43 minutes}} of allowable downtime per month
- **Exclusions:** Planned maintenance (with 72-hour notice), force majeure
### Latency
- **p50 target:** {{X}}ms
- **p95 target:** {{X}}ms
- **p99 target:** {{X}}ms
- **Current baseline:** p50={{X}}ms, p95={{X}}ms, p99={{X}}ms
- **Measurement point:** {{load balancer / application / client}}
### Error rate
- **Target:** < {{0.1%}} of requests over {{monthly}} windows
- **Current baseline:** {{X%}}
- **Definition:** HTTP 5xx responses / total responses, excluding {{exclusions}}
The cost of nines: Make this explicit in every SLA discussion.
| Availability | Downtime/month | Downtime/year | Relative cost |
|---|---|---|---|
| 99% | 7.3 hours | 3.65 days | Baseline |
| 99.9% | 43 minutes | 8.77 hours | ~3-5x baseline |
| 99.95% | 22 minutes | 4.38 hours | ~5-10x baseline |
| 99.99% | 4.3 minutes | 52.6 minutes | ~10-30x baseline |
| 99.999% | 26 seconds | 5.26 minutes | ~100x+ baseline |
Each additional nine costs exponentially more. The target should match the business value, not engineering pride.
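The downtime columns in the table follow directly from the target and the window length. A minimal Python sketch of that arithmetic (the function name `allowed_downtime` is illustrative, not part of any tool referenced here):

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Downtime budget implied by an availability target over a window."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the window is the downtime budget.
    return window * (1.0 - availability_pct / 100.0)
```

For example, `allowed_downtime(99.9, timedelta(days=30))` comes out at about 43 minutes, matching the 99.9% row. The exact figure depends on whether the window is a 30-day month, a calendar month, or an average month, so state the window alongside the number.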
Step 5: Design measurement methodology
For each metric, define:
## Measurement methodology
### {{Metric name}}
- **Data source:** {{monitoring system, logs, synthetic checks}}
- **Calculation:** {{exact formula}}
- **Reporting cadence:** {{real-time dashboard + monthly report}}
- **Measurement window:** {{calendar month, rolling 30 days}}
### Edge case handling
- **Planned maintenance:** Excluded if announced {{72}} hours in advance via {{channel}}
- **Dependent service failures:** {{Excluded / Included with note}}
- **Partial outages:** {{How degraded service is counted -- binary or proportional}}
- **Dispute resolution:** {{Who arbitrates disagreements about whether the SLA was met}}
Agree on the measurement methodology before the targets from Step 4 are finalized; a target is only meaningful once everyone accepts how it is measured. An SLA without clear measurement is just a hope.
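One way to pin down the "exact formula" and the planned-maintenance rule is to express them as code that both sides can read. The sketch below is an assumption-laden example, not a standard: the `Outage` record and function names are hypothetical, outages are assumed not to overlap each other, and excluded maintenance is simply not counted as downtime (it stays in the total window).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

MIN_NOTICE = timedelta(hours=72)  # notice period from the template; adjust per SLA


@dataclass(frozen=True)
class Outage:
    start: datetime
    end: datetime
    planned: bool = False
    announced_at: Optional[datetime] = None  # when planned maintenance was announced


def is_excluded(outage: Outage) -> bool:
    """Planned maintenance is excluded only if announced at least MIN_NOTICE ahead."""
    if not outage.planned or outage.announced_at is None:
        return False
    return outage.start - outage.announced_at >= MIN_NOTICE


def availability_pct(window_start: datetime, window_end: datetime,
                     outages: Iterable[Outage]) -> float:
    """(total time - counted downtime) / total time, as a percentage."""
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    downtime = 0.0
    for outage in outages:
        if is_excluded(outage):
            continue
        # Count only the part of the outage that falls inside the window.
        overlap_start = max(outage.start, window_start)
        overlap_end = min(outage.end, window_end)
        if overlap_end > overlap_start:
            downtime += (overlap_end - overlap_start).total_seconds()
    return 100.0 * (total - downtime) / total
```

Maintenance announced with less than the agreed notice counts as ordinary downtime in this sketch. If your SLA treats it differently, change `is_excluded` and document the rule in the edge case section.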
Step 6: Build governance processes
## Governance
### Breach notification
- **Detection:** Automated alerting when {{metric}} crosses {{warning threshold}} (before SLA breach)
- **Notification:** {{Who is notified, through what channel, within what timeframe}}
- **Escalation:** If breach continues beyond {{X}} minutes, escalate to {{role/team}}
### Breach response
- **Incident process:** Follow standard incident management process
- **Root cause analysis:** Required for any SLA breach exceeding {{X}} minutes
- **Improvement plan:** Required if SLA is breached {{N}} times in a {{quarter}}
### Credit/penalty mechanisms (for external SLAs)
- **Breach of availability target:** {{X%}} credit on monthly invoice
- **Breach of latency target:** {{X%}} credit on monthly invoice
- **Credit cap:** {{Maximum credit per month, typically 30% of monthly fee}}
- **Claim process:** {{How customers submit claims, response timeline}}
### Regular reviews
- **Monthly:** SLA performance report to {{stakeholders}}
- **Quarterly:** SLA review meeting -- assess targets, adjust if needed
- **Annual:** Full SLA redesign review
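The "warning threshold before SLA breach" idea in the breach notification template can be sketched as a check on how much of the downtime budget has been used so far in the window. This is an illustrative Python example; the function names and the 75% warning level are assumptions, not values taken from this process.

```python
def budget_consumed(downtime_min: float, target_pct: float, window_min: float) -> float:
    """Fraction of the window's downtime budget already used (1.0 = fully spent)."""
    budget_min = window_min * (1.0 - target_pct / 100.0)
    if budget_min <= 0:
        raise ValueError("a 100% target leaves no downtime budget")
    return downtime_min / budget_min


def alert_level(consumed: float, warning_at: float = 0.75) -> str:
    """Classify budget usage; warning_at is an assumed default, tune it per service."""
    if consumed >= 1.0:
        return "breach"
    if consumed >= warning_at:
        return "warning"
    return "ok"
```

With a 99.9% target over a 30-day window (43,200 minutes), 35 minutes of downtime puts the service in "warning" under these assumptions, which leaves time to react before the SLA itself is breached.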
Step 7: Generate the SLA document
# Service Level Agreement -- {{service_name}}
**Version:** {{1.0}}
**Effective date:** {{date}}
**Review date:** {{date + 1 year}}
**Owner:** {{team/person}}
## 1. Service description
{{What the service does, who it serves}}
## 2. Metrics and targets
{{From Step 4 -- metrics, targets, measurement windows}}
## 3. Measurement methodology
{{From Step 5 -- how each metric is measured}}
## 4. Exclusions
{{Planned maintenance, force majeure, dependent services}}
## 5. Breach procedures
{{From Step 6 -- notification, escalation, response}}
## 6. Credits and remedies
{{From Step 6 -- credit terms, claim process, caps}}
## 7. Review schedule
{{Monthly reporting, quarterly review, annual redesign}}
## 8. Approval
{{Signatures/approvals from service owner and key consumers}}
Step 8: Review
Before finalizing, ask:
- Are the targets achievable with current infrastructure and team capacity?
- Who pays for the additional reliability? (More nines = exponentially more cost in engineering and infrastructure.)
- Are the exclusions reasonable from the consumer's perspective?
- Does legal need to review the contract language for external SLAs?
- Is the measurement methodology automated, or does it depend on manual data collection?
- Can you actually alert on these metrics before a breach occurs?
- What happens during the first breach -- is the governance process clear enough that people know what to do?
Output location
Save to deliverables/sla-{{service_name}}-{{date}}.md. For external SLAs, the final version will likely need legal review and a different format -- this output serves as the technical foundation.
Example Output
Input
- Service description: Meridian Health's patient-facing appointment booking API — handles scheduling, cancellation, and real-time slot availability for 340+ clinic locations across the Pacific Northwest; consumed by the patient web portal, mobile app, and third-party EHR integrations
- Consumers: 1.2M registered patients (external), 14 internal engineering teams, 3 EHR vendor partners (Epic, Athenahealth, eClinicalWorks)
- Current reliability data: 99.71% uptime over last 6 months; p50 latency 210ms, p95 latency 890ms, p99 latency 2,400ms; error rate averaging 0.43%
- Business criticality: Missed appointments cost $180–$340 each in no-show revenue; booking failures during peak hours (7–9am, 4–6pm) directly affect patient care access; HIPAA audit obligations require data completeness
- Contractual obligations: EHR vendor contracts currently have no formal SLA; patient-facing terms of service reference "best efforts" availability only
Output (abbreviated)
Service Level Agreement — Meridian Health Appointment Booking API
Version: 1.0
Effective date: 2025-02-01
Review date: 2026-02-01
Owner: Platform Engineering, Meridian Digital Health
Classification: Tier 1 (revenue-critical, patient-facing)
Tier Assignment
| Factor | Assessment |
|---|---|
| Revenue impact | ~$2,100–$4,080 lost per hour of booking downtime at avg. volume |
| Patient impact | Direct care access disruption; appointments missed or duplicated |
| Compliance impact | HIPAA availability and audit trail requirements apply |
| EHR partner dependency | Epic/Athenahealth integrations block clinic workflows if API fails |
| Assigned tier | Tier 1, with an interim target range of 99.9% – 99.95% (below the typical Tier 1 range of 99.95% – 99.99%) |
Current baseline (99.71%) falls short of Tier 1 thresholds. A reliability improvement roadmap is required before 99.95% can be committed externally. Initial external SLA set at 99.9%; internal engineering SLO set at 99.95% as a buffer.
Metrics and Targets
Availability
- Target: 99.9% measured over calendar month windows
- Current baseline: 99.71% (6-month average)
- Allowable downtime: 43.2 minutes per month
- Exclusions: Scheduled maintenance with ≥72-hour notice via status page and email to EHR partners; force majeure
Latency
- p50 target: ≤ 250ms
- p95 target: ≤ 750ms (current p95 at 890ms — requires optimization sprint before target is contractually committed)
- p99 target: ≤ 1,500ms (current p99 at 2,400ms — phased target: ≤2,000ms at launch, ≤1,500ms by Q3 2025)
- Measurement point: AWS ALB (application load balancer), excluding health check endpoints
- Peak-hour note: Targets apply at all times, including 7–9am and 4–6pm PT peak windows
Error Rate
- Target: < 0.1% of requests per calendar month
- Current baseline: 0.43% — improvement required
- Definition: HTTP 5xx responses ÷ total responses, excluding /health, /status, and client-side 4xx errors
- EHR integration note: Integration-originated errors tracked separately; included in overall rate but broken out in monthly report
Data Freshness (Slot Availability)
- Target: Slot availability data ≤ 30 seconds stale at p95
- Rationale: Double-booking risk if cache is not refreshed; directly affects patient experience and clinic ops
- Measurement: Timestamp delta between source-of-truth DB write and API response cache, sampled every 60 seconds
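As an illustration of the freshness measurement, the p95 of sampled staleness values can be computed with the nearest-rank method. This is a sketch only: the SLA above does not say which percentile method the monitoring stack uses, and `percentile_nearest_rank` is a hypothetical helper name.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Given twenty staleness samples in seconds, eighteen at 5, one at 28, and one at 45, the p95 is 28 seconds, so the 30-second target is met even though a single sample exceeded it.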
Measurement Methodology
| Metric | Data Source | Calculation | Window | Cadence |
|---|---|---|---|---|
| Availability | AWS ALB access logs + synthetic checks (every 2 min from 3 regions) | (Total minutes − downtime minutes) ÷ total minutes | Calendar month | Real-time dashboard + monthly report |
| Latency | ALB request logs, percentile aggregation | p50/p95/p99 of response time field, all non-health endpoints | Rolling 24h (dashboard), calendar month (SLA) | Real-time + monthly |
| Error rate | ALB logs filtered to 5xx | Count(5xx) ÷ Count(all requests), excluding /health and /status | Calendar month | Real-time + monthly |
| Freshness | Custom CloudWatch metric emitted by cache invalidation service | p95 of (response_timestamp − db_write_timestamp) | Rolling 1h | Real-time |
Edge case handling:
- Planned maintenance: Excluded from availability calculation if announced ≥72h in advance on status.meridianhealth.com and via email to registered EHR API contacts
- Upstream dependencies: AWS RDS or ElastiCache outages are included in availability calculation (Meridian owns the stack); noted separately in breach reports
- Partial outages: Degraded service (>10% of requests affected, non-zero success rate) counted proportionally using the formula downtime_equivalent = (error_rate − baseline) × affected_duration
- Dispute resolution: Engineering VP and EHR partner technical lead review raw ALB logs jointly within 5 business days of dispute filing
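The proportional rule for partial outages can be worked through in a few lines of Python. This sketch makes several assumptions the SLA text leaves open: error rates are expressed as fractions (0.25 for 25%), "baseline" means the normal background error rate, degradation at or below the 10% threshold contributes no downtime equivalent, and a 100% error rate is treated as a full outage.

```python
def downtime_equivalent_minutes(error_rate: float, baseline: float,
                                affected_minutes: float,
                                degraded_threshold: float = 0.10) -> float:
    """Downtime equivalent of a partial outage, in minutes.

    Applies downtime_equivalent = (error_rate - baseline) * affected_duration
    when more than degraded_threshold of requests are affected.
    """
    if error_rate >= 1.0:
        return affected_minutes  # assumption: zero success rate is a full outage
    if error_rate <= degraded_threshold:
        return 0.0  # assumption: below the threshold, not counted as degraded
    return (error_rate - baseline) * affected_minutes
```

Under these assumptions, 40 minutes at a 25% error rate against a 0.1% baseline counts as (0.25 − 0.001) × 40 = 9.96 minutes of downtime.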
Governance
Breach Detection and Notification
| Condition | Warning Threshold | SLA Breach Threshold | Who is Notified | Channel | Notification Timeframe |
|---|---|---|---|---|---|
| Availability | < 99.97% in rolling 1h | < 99.9% in calendar month | On-call engineer, Platform EM | PagerDuty + Slack #incidents | Immediate (auto-alert) |
| p95 latency | > 600ms rolling 15-min avg | > 750ms sustained 30+ min | On-call engineer | PagerDuty | Immediate |
| Error rate | > 0.07% rolling 1h | > 0.1% in calendar month | On-call + Product lead | PagerDuty + Slack | Immediate |
| EHR partner impact | Any Tier 1 breach affecting integration | Same | EHR partner technical contacts | Email + status page | Within 15 minutes of confirmation |
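The "sustained 30+ min" condition for p95 latency needs a precise reading before it can be alerted on. One possible interpretation, sketched in Python, is 30 or more consecutive one-minute p95 readings above the threshold. The per-minute sampling and the function names are assumptions, not something the table specifies.

```python
from typing import Iterable


def longest_run_above(values: Iterable[float], threshold: float) -> int:
    """Length of the longest run of consecutive values strictly above threshold."""
    longest = current = 0
    for value in values:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any reading at or below the threshold resets the run
    return longest


def latency_breached(p95_per_minute: Iterable[float],
                     threshold_ms: float = 750.0,
                     sustained_minutes: int = 30) -> bool:
    """True if p95 stayed above threshold_ms for sustained_minutes in a row."""
    return longest_run_above(p95_per_minute, threshold_ms) >= sustained_minutes
```

Under this reading, a single minute back under 750ms resets the clock. If the intent is cumulative minutes instead, the definition in the table should say so explicitly.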
Breach Response
- RCA requirement: Any breach exceeding 15 minutes of downtime or 0.2% error rate sustained for 1+ hours
- RCA delivery: Within 5 business days of incident resolution, shared with EHR partners
- Improvement plan: Required if SLA is breached 2+ times in a rolling quarter; reviewed with EHR partners within 10 business days
Credit and Penalty Mechanism (EHR Vendor Contracts)
| Availability Achieved | Monthly Credit |
|---|---|
| ≥ 99.5% and < 99.9% | 10% of monthly API access fee |
| ≥ 99.0% and < 99.5% | 20% of monthly API access fee |
| < 99.0% | 30% of monthly API access fee |
- Latency breach credit: 5% of monthly fee if p95 exceeds target for >4 cumulative hours in a month
- Credit cap: 30% of monthly fee per calendar month
- Claim process: EHR partners submit claims via vendor portal within 30 days of month end; Meridian responds within 10 business days with ALB log evidence
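The credit schedule above can be expressed as a small function, which also forces the ambiguous cases into the open. This is a sketch with labeled assumptions: the lower bound of each availability band is treated as inclusive, the latency credit is assumed to stack on top of the availability credit, and the 30% cap applies to the sum. The function name is hypothetical.

```python
def monthly_credit_pct(availability_pct: float, p95_breach_hours: float = 0.0) -> int:
    """Credit owed for a month, as a percentage of the monthly API access fee."""
    # Availability bands from the credit table (lower bounds assumed inclusive).
    if availability_pct >= 99.9:
        credit = 0
    elif availability_pct >= 99.5:
        credit = 10
    elif availability_pct >= 99.0:
        credit = 20
    else:
        credit = 30
    # Latency credit: p95 over target for more than 4 cumulative hours.
    if p95_breach_hours > 4:
        credit += 5  # assumption: additive with the availability credit
    return min(credit, 30)  # credit cap per calendar month
```

For a month at 99.7% availability with 5 cumulative hours of p95 breach, this yields 15%. At 98.0% availability the result is 30% regardless of latency, because the cap already applies.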
Cost of Nines — Decision Rationale
| Target | Downtime/month | Engineering cost estimate | Decision |
|---|---|---|---|
| 99.71% (current) | ~2.1 hours | Baseline | Not contractually acceptable |
| 99.9% | 43 min | ~1.5x current infra + on-call | External SLA commitment |
| 99.95% | 22 min | ~2.5x | Internal SLO (engineering buffer) |
| 99.99% | 4.3 min | ~8x | Not justified at current revenue scale |
*Moving from 99.71% to 99.9% requires: ALB redundancy improvements, read