SLA Design

Design SLA frameworks with metrics, targets, measurement methodology, and governance.

Use this when a team or organization needs to define SLAs for a service -- internal platform, customer-facing product, or vendor contract -- and needs to set appropriate targets, establish measurement methodology, and build governance for monitoring and enforcement.

Related skills: Use /instrumentation-plan for SLI/SLO technical implementation. Use /observability-plan for monitoring strategy. Use /stakeholder-comms for communicating SLA changes. Use /incident-postmortem when SLA breaches occur. SLA outputs feed into /vendor-evaluation contract terms.

Process

Step 1: Gather inputs

Collect from the service owner and stakeholders:

  • Service description -- what the service does, at a technical and business level
  • Consumers -- who depends on this service (internal teams, external customers, partners)
  • Current reliability data -- existing uptime, latency, and error rate numbers (if available)
  • Business criticality -- what happens when this service is down (revenue impact, user impact, compliance impact)
  • Existing informal expectations -- what consumers already expect, even if not documented
  • Contractual obligations -- any existing commitments in customer contracts or partner agreements
  • Budget for reliability -- engineering and infrastructure budget available for reliability improvements

Step 2: Define SLA tiers

Classify services by criticality. Each tier gets different target levels:

| Tier | Description | Examples | Typical availability target |
| --- | --- | --- | --- |
| Tier 1 | Revenue-critical, customer-facing | Payment processing, core API, auth | 99.95% - 99.99% |
| Tier 2 | Important internal or secondary services | Internal dashboards, async processing, search | 99.9% - 99.95% |
| Tier 3 | Non-critical services | Dev tools, internal wikis, batch jobs | 99% - 99.9% |

Assign the service being designed to the appropriate tier. If consumers disagree on the tier, that disagreement is a useful signal -- resolve it before setting targets.
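
As a rough illustration, the tier table can be encoded as data so that a proposed target is checked against the typical range for its tier. This is a minimal sketch: the `Tier` class, the `TIERS` mapping, and `check_target` are hypothetical names, and the ranges are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_target: float  # lower bound of the typical availability range, in percent
    max_target: float  # upper bound of the typical availability range, in percent


# Ranges taken from the tier table; adjust them to your organization's policy.
TIERS = {
    1: Tier("Tier 1", 99.95, 99.99),
    2: Tier("Tier 2", 99.9, 99.95),
    3: Tier("Tier 3", 99.0, 99.9),
}


def check_target(tier: int, target_pct: float) -> str:
    """Report whether a proposed availability target fits the tier's typical range."""
    t = TIERS[tier]
    if target_pct < t.min_target:
        return "below range"
    if target_pct > t.max_target:
        return "above range"
    return "within range"
```

A "below range" or "above range" result is not necessarily wrong, but it is worth discussing with consumers before the target is set.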

Step 3: Select metrics

For the service, choose 3-5 SLIs (Service Level Indicators) that actually matter. Don't measure everything -- measure what consumers care about.

Common SLIs by category:

| Category | Metric | Good for |
| --- | --- | --- |
| Availability | Uptime percentage | Request-serving systems |
| Latency | p50, p95, p99 response times | User-facing APIs and UIs |
| Throughput | Requests per second capacity | Systems with known load patterns |
| Error rate | Percentage of failed requests | APIs and data pipelines |
| Durability | Data loss probability | Storage and database systems |
| Freshness | Data staleness (max age) | Async systems, caches, search indexes |
| Completeness | Percentage of data processed | ETL pipelines, event processing |

For each selected metric, document:

  • What exactly is measured (e.g., "HTTP 5xx responses as a percentage of total requests, excluding health checks")
  • Where the measurement happens (at the load balancer, application, or client)
  • Why this metric matters to consumers
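
To make the "what exactly is measured" point concrete, here is a minimal sketch of the example SLI quoted above (HTTP 5xx responses as a percentage of total requests, excluding health checks). The `Request` record and the `HEALTH_CHECK_PATHS` set are assumptions made for illustration; a real implementation would read from whatever measurement point you documented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    status: int


# Assumption: these paths are health checks and are excluded from the SLI.
HEALTH_CHECK_PATHS = {"/health", "/status"}


def error_rate_pct(requests):
    """HTTP 5xx responses as a percentage of total requests, excluding health checks."""
    counted = [r for r in requests if r.path not in HEALTH_CHECK_PATHS]
    if not counted:
        return 0.0  # no traffic in the window; treated as no errors in this sketch
    failed = sum(1 for r in counted if 500 <= r.status <= 599)
    return 100.0 * failed / len(counted)


sample = [
    Request("/book", 200),
    Request("/book", 200),
    Request("/slots", 200),
    Request("/book", 500),
    Request("/book", 404),    # client error: counted in the total, not as a failure
    Request("/health", 503),  # health check: excluded entirely
]
print(error_rate_pct(sample))
```

Writing the definition as code forces the decisions that prose tends to leave open, such as whether 4xx responses count toward the total and how an empty window is reported.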

Step 4: Set targets

For each metric, set targets using current baseline data and industry benchmarks:

## SLA targets -- {{service_name}}

### Availability
- **Target:** {{99.9%}} measured over {{monthly}} windows
- **Current baseline:** {{X%}} over the last {{N}} months
- **What this means:** {{43 minutes}} of allowable downtime per month
- **Exclusions:** Planned maintenance (with 72-hour notice), force majeure

### Latency
- **p50 target:** {{X}}ms
- **p95 target:** {{X}}ms
- **p99 target:** {{X}}ms
- **Current baseline:** p50={{X}}ms, p95={{X}}ms, p99={{X}}ms
- **Measurement point:** {{load balancer / application / client}}

### Error rate
- **Target:** < {{0.1%}} of requests over {{monthly}} windows
- **Current baseline:** {{X%}}
- **Definition:** HTTP 5xx responses / total responses, excluding {{exclusions}}

The cost of nines: Make this explicit in every SLA discussion.

| Availability | Downtime/month | Downtime/year | Relative cost |
| --- | --- | --- | --- |
| 99% | 7.3 hours | 3.65 days | Baseline |
| 99.9% | 43 minutes | 8.77 hours | ~3-5x baseline |
| 99.95% | 22 minutes | 4.38 hours | ~5-10x baseline |
| 99.99% | 4.3 minutes | 52.6 minutes | ~10-30x baseline |
| 99.999% | 26 seconds | 5.26 minutes | ~100x+ baseline |

Each additional nine costs exponentially more. The target should match the business value, not engineering pride.
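
The downtime columns follow directly from the availability percentage and the window length. The sketch below reproduces that arithmetic; `HOURS_PER_YEAR` uses an average year of 365.25 days, which is an assumption, so results can differ slightly from the rounded table values.

```python
# Assumption: an average year of 365.25 days; an average month is one twelfth of that.
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_minutes(availability_pct: float, window_hours: float) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * window_hours * 60.0


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = allowed_downtime_minutes(target, HOURS_PER_YEAR / 12)
    per_year = allowed_downtime_minutes(target, HOURS_PER_YEAR)
    print(f"{target}%: {per_month:.1f} min/month, {per_year:.1f} min/year")
```

The window definition matters: a fixed 30-day window at 99.9% allows 43.2 minutes, while an average month allows about 43.8. State which one the SLA uses.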

Step 5: Design measurement methodology

For each metric, define:

## Measurement methodology

### {{Metric name}}
- **Data source:** {{monitoring system, logs, synthetic checks}}
- **Calculation:** {{exact formula}}
- **Reporting cadence:** {{real-time dashboard + monthly report}}
- **Measurement window:** {{calendar month, rolling 30 days}}

### Edge case handling
- **Planned maintenance:** Excluded if announced {{72}} hours in advance via {{channel}}
- **Dependent service failures:** {{Excluded / Included with note}}
- **Partial outages:** {{How degraded service is counted -- binary or proportional}}
- **Dispute resolution:** {{Who arbitrates disagreements about whether the SLA was met}}

Agree on the measurement methodology before committing to the targets drafted in Step 4; a target cannot be verified until everyone accepts how it is measured. An SLA without clear measurement is just a hope.
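
As one possible way to pin down the "exact formula" field, the sketch below computes availability with announced maintenance excluded. Removing the excluded maintenance from the window length (rather than counting it as uptime) is an assumed convention, and the function and parameter names are hypothetical; whichever convention you choose should be written into the methodology.

```python
def availability_pct(
    window_minutes: float,
    unplanned_downtime_minutes: float,
    excluded_maintenance_minutes: float = 0.0,
) -> float:
    """Availability over a window, with announced maintenance removed from the window.

    Assumed convention: excluded maintenance shortens the measured window, and only
    unplanned downtime counts against the target.
    """
    measured = window_minutes - excluded_maintenance_minutes
    if measured <= 0:
        raise ValueError("measurement window is empty after exclusions")
    return 100.0 * (measured - unplanned_downtime_minutes) / measured


# A 30-day window with 60 minutes of announced maintenance and about 43 minutes
# of unplanned downtime.
print(round(availability_pct(30 * 24 * 60, 43.14, 60), 3))
```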

Step 6: Build governance processes

## Governance

### Breach notification
- **Detection:** Automated alerting when {{metric}} crosses {{warning threshold}} (before SLA breach)
- **Notification:** {{Who is notified, through what channel, within what timeframe}}
- **Escalation:** If breach continues beyond {{X}} minutes, escalate to {{role/team}}

### Breach response
- **Incident process:** Follow standard incident management process
- **Root cause analysis:** Required for any SLA breach exceeding {{X}} minutes
- **Improvement plan:** Required if SLA is breached {{N}} times in a {{quarter}}

### Credit/penalty mechanisms (for external SLAs)
- **Breach of availability target:** {{X%}} credit on monthly invoice
- **Breach of latency target:** {{X%}} credit on monthly invoice
- **Credit cap:** {{Maximum credit per month, typically 30% of monthly fee}}
- **Claim process:** {{How customers submit claims, response timeline}}

### Regular reviews
- **Monthly:** SLA performance report to {{stakeholders}}
- **Quarterly:** SLA review meeting -- assess targets, adjust if needed
- **Annual:** Full SLA redesign review
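
The detection rule in the template (alert at a warning threshold before the SLA threshold is crossed) can be sketched as a small classifier. This is an illustration only: `classify` and its parameters are hypothetical, and the thresholds in the usage lines are placeholder values, not recommendations.

```python
def classify(
    value: float,
    warning_threshold: float,
    breach_threshold: float,
    higher_is_better: bool = True,
) -> str:
    """Return "ok", "warning", or "breach" for a metric reading.

    For metrics where lower is better (latency, error rate), pass
    higher_is_better=False. A value exactly on a threshold is treated as
    not having crossed it.
    """
    if not higher_is_better:
        # Flip signs so the comparisons below work for both directions.
        value = -value
        warning_threshold = -warning_threshold
        breach_threshold = -breach_threshold
    if value < breach_threshold:
        return "breach"
    if value < warning_threshold:
        return "warning"
    return "ok"


# Availability: warn below 99.95%, breach below 99.9% (placeholder values).
print(classify(99.93, 99.95, 99.9))
# Error rate: warn above 0.07%, breach above 0.1% (placeholder values).
print(classify(0.08, 0.07, 0.1, higher_is_better=False))
```

The gap between the two thresholds is the time the on-call engineer has to act before a breach, so it should be sized against realistic response times.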

Step 7: Generate the SLA document

# Service Level Agreement -- {{service_name}}

**Version:** {{1.0}}
**Effective date:** {{date}}
**Review date:** {{date + 1 year}}
**Owner:** {{team/person}}

## 1. Service description
{{What the service does, who it serves}}

## 2. Metrics and targets
{{From Step 4 -- metrics, targets, measurement windows}}

## 3. Measurement methodology
{{From Step 5 -- how each metric is measured}}

## 4. Exclusions
{{Planned maintenance, force majeure, dependent services}}

## 5. Breach procedures
{{From Step 6 -- notification, escalation, response}}

## 6. Credits and remedies
{{From Step 6 -- credit terms, claim process, caps}}

## 7. Review schedule
{{Monthly reporting, quarterly review, annual redesign}}

## 8. Approval
{{Signatures/approvals from service owner and key consumers}}

Step 8: Review

Before finalizing, ask:

  • Are the targets achievable with current infrastructure and team capacity?
  • Who pays for the additional reliability? (More nines = exponentially more cost in engineering and infrastructure.)
  • Are the exclusions reasonable from the consumer's perspective?
  • Does legal need to review the contract language for external SLAs?
  • Is the measurement methodology automated, or does it depend on manual data collection?
  • Can you actually alert on these metrics before a breach occurs?
  • What happens during the first breach -- is the governance process clear enough that people know what to do?

Output location

Save to deliverables/sla-{{service_name}}-{{date}}.md. For external SLAs, the final version will likely need legal review and a different format -- this output serves as the technical foundation.

Example Output

Input

  • Service description: Meridian Health's patient-facing appointment booking API — handles scheduling, cancellation, and real-time slot availability for 340+ clinic locations across the Pacific Northwest; consumed by the patient web portal, mobile app, and third-party EHR integrations
  • Consumers: 1.2M registered patients (external), 14 internal engineering teams, 3 EHR vendor partners (Epic, Athenahealth, eClinicalWorks)
  • Current reliability data: 99.71% uptime over last 6 months; p50 latency 210ms, p95 latency 890ms, p99 latency 2,400ms; error rate averaging 0.43%
  • Business criticality: Missed appointments cost $180–$340 each in no-show revenue; booking failures during peak hours (7–9am, 4–6pm) directly affect patient care access; HIPAA audit obligations require data completeness
  • Contractual obligations: EHR vendor contracts currently have no formal SLA; patient-facing terms of service reference "best efforts" availability only

Output (abbreviated)

Service Level Agreement — Meridian Health Appointment Booking API

Version: 1.0
Effective date: 2025-02-01
Review date: 2026-02-01
Owner: Platform Engineering, Meridian Digital Health
Classification: Tier 1 -- Revenue-critical, patient-facing


Tier Assignment

| Factor | Assessment |
| --- | --- |
| Revenue impact | ~$2,100–$4,080 lost per hour of booking downtime at avg. volume |
| Patient impact | Direct care access disruption; appointments missed or duplicated |
| Compliance impact | HIPAA availability and audit trail requirements apply |
| EHR partner dependency | Epic/Athenahealth integrations block clinic workflows if API fails |
| Assigned tier | Tier 1 (standard range 99.95% – 99.99%); initial target range 99.9% – 99.95% |

Current baseline (99.71%) falls short of Tier 1 thresholds. A reliability improvement roadmap is required before 99.95% can be committed externally. Initial external SLA set at 99.9%; internal engineering SLO set at 99.95% as a buffer.


Metrics and Targets

Availability

  • Target: 99.9% measured over calendar month windows
  • Current baseline: 99.71% (6-month average)
  • Allowable downtime: 43.2 minutes per month
  • Exclusions: Scheduled maintenance with ≥72-hour notice via status page and email to EHR partners; force majeure

Latency

  • p50 target: ≤ 250ms
  • p95 target: ≤ 750ms (current p95 at 890ms — requires optimization sprint before target is contractually committed)
  • p99 target: ≤ 1,500ms (current p99 at 2,400ms — phased target: ≤2,000ms at launch, ≤1,500ms by Q3 2025)
  • Measurement point: AWS ALB (application load balancer), excluding health check endpoints
  • Peak-hour note: Targets apply at all times, including 7–9am and 4–6pm PT peak windows

Error Rate

  • Target: < 0.1% of requests per calendar month
  • Current baseline: 0.43% — improvement required
  • Definition: HTTP 5xx responses ÷ total responses, excluding /health, /status, and client-side 4xx errors
  • EHR integration note: Integration-originated errors tracked separately; included in overall rate but broken out in monthly report

Data Freshness (Slot Availability)

  • Target: Slot availability data ≤ 30 seconds stale at p95
  • Rationale: Double-booking risk if cache is not refreshed; directly affects patient experience and clinic ops
  • Measurement: Timestamp delta between source-of-truth DB write and API response cache, sampled every 60 seconds

Measurement Methodology

| Metric | Data Source | Calculation | Window | Cadence |
| --- | --- | --- | --- | --- |
| Availability | AWS ALB access logs + synthetic checks (every 2 min from 3 regions) | (Total minutes − downtime minutes) ÷ total minutes | Calendar month | Real-time dashboard + monthly report |
| Latency | ALB request logs, percentile aggregation | p50/p95/p99 of response time field, all non-health endpoints | Rolling 24h (dashboard), calendar month (SLA) | Real-time + monthly |
| Error rate | ALB logs filtered to 5xx | Count(5xx) ÷ Count(all requests) | Calendar month | Real-time + monthly |
| Freshness | Custom CloudWatch metric emitted by cache invalidation service | p95 of (response_timestamp − db_write_timestamp) | Rolling 1h | Real-time |
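
The latency and freshness rows depend on how percentiles are calculated, which the table does not specify. The sketch below uses the nearest-rank method as one simple, reproducible definition; this is an assumption, and monitoring tools may interpolate or approximate instead, so the chosen method should be named in the SLA. The sample values are made up for illustration.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples in the measurement window")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based; multiply before dividing to limit floating-point error.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
latencies_ms = [120, 180, 210, 250, 300, 450, 600, 890, 1200, 2400]
print(percentile_nearest_rank(latencies_ms, 50))
print(percentile_nearest_rank(latencies_ms, 95))
```

With small sample counts the high percentiles are dominated by single slow requests, which is one reason to fix a minimum sample size per window.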

Edge case handling:

  • Planned maintenance: Excluded from availability calculation if announced ≥72h in advance on status.meridianhealth.com and via email to registered EHR API contacts
  • Upstream dependencies: AWS RDS or ElastiCache outages are included in availability calculation (Meridian owns the stack); noted separately in breach reports
  • Partial outages: Degraded service (>10% of requests affected, non-zero success rate) counted proportionally using the formula: downtime_equivalent = (error_rate − baseline) × affected_duration
  • Dispute resolution: Engineering VP and EHR partner technical lead review raw ALB logs jointly within 5 business days of dispute filing
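
The proportional formula for partial outages can be worked through as follows. This is a sketch under two assumptions that the text does not spell out: error rates are expressed as fractions (0.25 for 25%), and `baseline` means the normal error rate outside the incident. The function name is hypothetical.

```python
def downtime_equivalent_minutes(
    error_rate: float,
    baseline_error_rate: float,
    affected_minutes: float,
) -> float:
    """downtime_equivalent = (error_rate - baseline) x affected_duration.

    Rates are fractions between 0 and 1. An error rate at or below the
    baseline contributes no downtime.
    """
    excess = max(0.0, error_rate - baseline_error_rate)
    return excess * affected_minutes


# Hypothetical incident: 25% of requests failing for 40 minutes, baseline 0.1%.
print(round(downtime_equivalent_minutes(0.25, 0.001, 40), 2))
```

Under these assumptions a 40-minute incident at a 25% error rate counts as roughly 10 minutes of downtime. The sketch does not apply the ">10% of requests affected" qualifier; how incidents below that level are counted is left open.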

Governance

Breach Detection and Notification

| Condition | Warning Threshold | SLA Breach Threshold | Who is Notified | Channel | Notification deadline |
| --- | --- | --- | --- | --- | --- |
| Availability | < 99.97% in rolling 1h | < 99.9% in calendar month | On-call engineer, Platform EM | PagerDuty + Slack #incidents | Immediate (auto-alert) |
| p95 latency | > 600ms rolling 15-min avg | > 750ms sustained 30+ min | On-call engineer | PagerDuty | Immediate |
| Error rate | > 0.07% rolling 1h | > 0.1% in calendar month | On-call + Product lead | PagerDuty + Slack | Immediate |
| EHR partner impact | Any Tier 1 breach affecting integration | Same | EHR partner technical contacts | Email + status page | Within 15 minutes of confirmation |

Breach Response

  • RCA requirement: Any breach exceeding 15 minutes of downtime or 0.2% error rate sustained for 1+ hours
  • RCA delivery: Within 5 business days of incident resolution, shared with EHR partners
  • Improvement plan: Required if SLA is breached 2+ times in a rolling quarter; reviewed with EHR partners within 10 business days

Credit and Penalty Mechanism (EHR Vendor Contracts)

| Availability Achieved | Monthly Credit |
| --- | --- |
| 99.5% to < 99.9% | 10% of monthly API access fee |
| 99.0% to < 99.5% | 20% of monthly API access fee |
| < 99.0% | 30% of monthly API access fee |

  • Latency breach credit: 5% of monthly fee if p95 exceeds target for >4 cumulative hours in a month
  • Credit cap: 30% of monthly fee per calendar month
  • Claim process: EHR partners submit claims via vendor portal within 30 days of month end; Meridian responds within 10 business days with ALB log evidence
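
The credit schedule can be expressed as a short function, which also makes the interaction between the latency credit and the cap explicit. This is a sketch: the function name is hypothetical, and it assumes each band includes its lower bound (exactly 99.5% earns the 10% credit) and that the latency credit is added to the availability credit before the cap is applied.

```python
def monthly_credit_pct(availability_pct: float, p95_breach_hours: float = 0.0) -> int:
    """Credit as a percentage of the monthly API access fee."""
    if availability_pct >= 99.9:
        credit = 0   # availability target met
    elif availability_pct >= 99.5:
        credit = 10
    elif availability_pct >= 99.0:
        credit = 20
    else:
        credit = 30
    # Latency credit: p95 above target for more than 4 cumulative hours in the month.
    if p95_breach_hours > 4:
        credit += 5
    return min(credit, 30)  # credit cap per calendar month


print(monthly_credit_pct(99.7))       # availability breach only
print(monthly_credit_pct(99.2, 5.0))  # availability and latency breach
print(monthly_credit_pct(98.5, 6.0))  # capped
```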

Cost of Nines — Decision Rationale

| Target | Downtime/month | Engineering cost estimate | Decision |
| --- | --- | --- | --- |
| 99.71% (current) | ~2.1 hours | Baseline | Not contractually acceptable |
| 99.9% | 43 min | ~1.5x current infra + on-call | External SLA commitment |
| 99.95% | 22 min | ~2.5x | Internal SLO (engineering buffer) |
| 99.99% | 4.3 min | ~8x | Not justified at current revenue scale |

*Moving from 99.71% to 99.9% requires: ALB redundancy improvements, read