Migration Plan

Plan a system migration: pattern selection, risk mitigation, data validation, cutover checklist, and rollback strategy.

Use this when the team needs to migrate a database, service, API, infrastructure component, or data model and wants a structured plan that minimizes risk and downtime. Also use when evaluating whether a migration is worth the investment or when a migration is already in trouble and needs a recovery plan.

Related skills: Use /tech-debt-assessment to evaluate whether a migration is the right remediation for a debt item. Use /system-diagram to visualize before and after states. Use /pre-mortem to stress-test the migration plan before execution. Use /architecture-discovery for complex migrations that require understanding bounded contexts.

Process

Step 1: Gather inputs

Ask the user to provide:

  1. What's being migrated -- database, service, API, infrastructure, data model, or combination?
  2. From/to -- current state and target state. Be specific: versions, platforms, architectures.
  3. Why -- what's driving the migration? (End of life, scaling limits, cost, compliance, tech debt, acquisition.)
  4. Scope -- everything at once, or can it be phased?
  5. Consumers -- who and what depends on the thing being migrated? (Services, teams, external clients, integrations.)
  6. Constraints -- downtime budget, compliance requirements, data retention rules, team capacity, deadline.
  7. Current state health -- is the existing system stable or already failing? (Affects urgency and risk tolerance.)

Step 2: Select the migration pattern

Evaluate which migration pattern fits the situation:

### Migration pattern evaluation -- {{system}}, {{date}}

| Pattern | Description | Best when | Risk level | Duration |
|---------|-----------|-----------|-----------|----------|
| **Strangler Fig** | Incrementally replace old with new, routing traffic gradually | Large systems, many consumers, low downtime tolerance | Low | Weeks to months |
| **Parallel Run** | Run old and new simultaneously, compare outputs, switch when confident | Data integrity is critical, complex business logic | Low-Medium | Weeks to months |
| **Blue-Green** | Stand up complete new environment, switch all traffic at once | Infrastructure migrations, stateless services | Medium | Days to weeks |
| **Big Bang** | Replace everything at once during a maintenance window | Small scope, acceptable downtime, simple dependencies | High | Hours to days |
| **Trickle Migration** | Move data/traffic incrementally by segment (by customer, region, entity type) | Large data sets, heterogeneous consumers | Low | Weeks to months |

**Selected pattern:** {{pattern}}
**Rationale:** (why this pattern fits the constraints and risk tolerance)
**Rejected alternatives:** (which patterns were considered and why they don't fit)

Step 3: Map dependencies and blast radius

### Dependency map

| Dependent | Type | Coupling | Impact if migration breaks | Notification needed |
|-----------|------|---------|--------------------------|-------------------|
| (service/team/integration) | Direct consumer / Indirect / Data reader | Tight / Loose | (what happens?) | (who needs to know, when?) |

### Blast radius assessment
- **Direct impact:** (systems that will stop working if migration fails)
- **Indirect impact:** (systems that may degrade or produce incorrect results)
- **Data impact:** (risk of data loss, corruption, or inconsistency)
- **Customer impact:** (who sees what during migration and during failure)

Step 4: Design the data migration (if applicable)

### Data migration plan

**Volume:** (rows, GB, number of records)
**Migration approach:** (bulk export/import, streaming replication, dual-write, CDC)

**Validation strategy:**
| Check | Method | Tolerance | Action if failed |
|-------|--------|-----------|-----------------|
| Row count match | COUNT(*) comparison | 0% tolerance | STOP -- investigate |
| Data integrity | Checksum or sample comparison | (acceptable error rate) | (action) |
| Referential integrity | FK validation on target | 0% tolerance | STOP -- investigate |
| Business logic validation | Run known-output queries on both | Exact match | STOP -- investigate |

**Handling the gap:**
- How do you handle writes to the old system during migration?
- (Dual-write / queue and replay / maintenance window / CDC stream)
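
The row-count and checksum checks in the validation table can be sketched as follows. This is a minimal illustration, not a production validator: two in-memory SQLite databases stand in for the source and target, and the `orders` table, its columns, and the helper names `row_count`, `table_checksum`, and `validate_table` are all hypothetical.

```python
import hashlib
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Return COUNT(*) for a table (the table name is trusted, not user input)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def table_checksum(conn: sqlite3.Connection, table: str, key: str) -> str:
    """Hash every row in key order so both sides are hashed in the same sequence."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def validate_table(source, target, table: str, key: str) -> list[str]:
    """Return the failed checks; an empty list means the table passed."""
    failures = []
    if row_count(source, table) != row_count(target, table):
        failures.append("row count mismatch")
    if table_checksum(source, table, key) != table_checksum(target, table, key):
        failures.append("checksum mismatch")
    return failures


def make_db(rows):
    """Build a throwaway in-memory database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 10.0), (2, 25.5), (3, 7.25)])
good_target = make_db([(1, 10.0), (2, 25.5), (3, 7.25)])
bad_target = make_db([(1, 10.0), (2, 99.0)])  # one row missing, one row altered

print(validate_table(source, good_target, "orders", "id"))  # []
print(validate_table(source, bad_target, "orders", "id"))
```

On large tables a full-table hash may be too slow, which is why the template also allows sample comparison; the same structure applies with a sampled `WHERE` clause on both sides.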

Step 5: Define the rollback strategy

### Rollback plan

**Rollback trigger:** (what conditions trigger a rollback?)
- (e.g., "Error rate > 1% on migrated traffic")
- (e.g., "Data validation fails on > 0.1% of records")
- (e.g., "Customer-reported issues within first 30 minutes")

**Rollback procedure:**
1. (Step-by-step: what to do, in what order, who does it)
2. (Include estimated time for each step)
3. (Include verification after rollback)

**Rollback window:** (how long after cutover can we still roll back?)
**What makes rollback impossible:** (at what point is rollback no longer viable? Why?)

**Rollback testing:** (when and how will the rollback procedure be tested before the real migration?)

Step 6: Build the cutover plan

### Cutover checklist

**Pre-cutover (T minus 1 week):**
- [ ] All data migration validation passing
- [ ] Rollback procedure tested
- [ ] Communication sent to affected teams/customers
- [ ] Monitoring dashboards configured for migration-specific metrics
- [ ] On-call schedule confirmed for cutover window
- [ ] Runbook reviewed and updated

**Cutover (T zero):**
- [ ] (Step 1: specific action -- who, what, expected duration)
- [ ] (Step 2: verification check -- what to look for)
- [ ] (Step 3: traffic shift -- how much, how fast)
- [ ] (Step 4: monitoring checkpoint -- what metrics to watch, for how long)
- [ ] (Step 5: go/no-go decision point -- criteria for proceeding vs. rolling back)

**Post-cutover (T plus 1 hour / 1 day / 1 week):**
- [ ] Error rates within normal bounds
- [ ] Performance metrics stable
- [ ] Data consistency checks passing
- [ ] Old system decommission scheduled (don't rush this)
- [ ] Retrospective scheduled

Step 7: Communication plan

### Communication plan

| Audience | Message | When | Channel | Owner |
|----------|---------|------|---------|-------|
| Engineering teams | Migration plan + timeline + what they need to do | T minus 2 weeks | (Slack/email/meeting) | (who) |
| Customer-facing teams | What customers might see, FAQ for support | T minus 1 week | (channel) | (who) |
| Customers (if needed) | Maintenance window, expected impact, what to do if issues | T minus 3 days | (email/status page) | (who) |
| Leadership | Status update, risk summary, go/no-go | T minus 1 day | (channel) | (who) |
| All | Cutover started / completed / issues | T zero | (status page/Slack) | (who) |

Step 8: Define success criteria

### Success criteria

**Migration is successful when:**
- [ ] All data validated and reconciled (from Step 4 checks)
- [ ] Error rate at or below pre-migration baseline for 48 hours
- [ ] Performance metrics (e.g., p99 latency) at or better than pre-migration baseline for 48 hours
- [ ] No customer-reported issues related to migration for 1 week
- [ ] Old system safely decommissioned (or decommission scheduled)

**Monitoring during migration:**
| Metric | Pre-migration baseline | Alert threshold | Dashboard |
|--------|----------------------|-----------------|-----------|
| (error rate) | (current value) | (threshold) | (link) |
| (latency p99) | (current value) | (threshold) | (link) |
| (data consistency) | 100% | < 99.9% | (link) |

Step 9: Discuss

Ask the user:

  • Does the pattern selection match your constraints?
  • Are there dependencies I missed?
  • Is the rollback window realistic?
  • Who needs to approve the go/no-go decision?
  • Want me to break the cutover steps into sprint stories?
  • Should I run a /pre-mortem on this plan?

Output location

Present the migration plan as formatted text in the conversation or save to a file if requested.

Example Output

Input

  • What's being migrated: PostgreSQL 11 database (primary transactional store) to Aurora PostgreSQL 15, including a legacy stored-procedure-heavy schema with ~40 custom functions
  • From/to: Self-managed PostgreSQL 11 on EC2 (r5.4xlarge, single AZ) → Amazon Aurora PostgreSQL 15 (Multi-AZ cluster, 2 read replicas)
  • Why: PostgreSQL 11 reached community end of life in November 2023; current setup has had two unplanned outages this quarter due to storage I/O saturation; insurance compliance audit flagged single-AZ as a DR gap
  • Consumers: 3 internal microservices (OrderService, InventoryService, BillingService), 1 third-party ERP integration (SAP), nightly ETL pipeline to Redshift
  • Constraints: Max 15-minute downtime window (SLA with enterprise customers), migration must complete before November 1 audit deadline, 2 engineers available (mid-sprint), 4.2 TB database
  • Company: Vanterra Logistics, B2B freight management platform

Output (abbreviated)

Migration Plan: PostgreSQL 11 (EC2) → Aurora PostgreSQL 15

System: Vanterra Logistics -- Primary Transactional DB
Date: October 2024
Authors: Platform Engineering


Migration Pattern Evaluation

| Pattern | Description | Best when | Risk Level | Duration |
|---------|-------------|-----------|------------|----------|
| Strangler Fig | Incrementally replace old with new | Large systems, many consumers | Low | Weeks–months |
| Parallel Run | Run both simultaneously, compare outputs | Data integrity critical | Low–Medium | Weeks–months |
| Blue-Green | Stand up full new environment, cut over at once | Infrastructure migrations, low-statefulness | Medium | Days–weeks |
| Big Bang | Full replacement in maintenance window | Small scope, acceptable downtime | High | Hours |
| Trickle Migration | Move data by segment incrementally | Large data sets | Low | Weeks–months |

Selected pattern: Parallel Run → Blue-Green cutover

Rationale: 4.2 TB volume and 40 stored procedures require extended validation before cutover. AWS DMS continuous replication enables parallel run with near-zero lag, then a Blue-Green switch inside the 15-minute downtime window once parity is confirmed over 2 weeks.

Rejected alternatives:

  • Big Bang — 4.2 TB cannot be reliably exported/imported in 15 minutes; too high risk given compliance deadline
  • Strangler Fig — all three services share a monolithic schema with no clean seam to route incrementally; would require months of refactoring not available in sprint capacity
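
As a rough check on the Big Bang rejection, the arithmetic below estimates the sustained throughput a full copy would need in order to fit inside the downtime window. It assumes decimal units (1 TB = 1000 GB) and that the entire 15 minutes is spent copying, which is optimistic; the function name is illustrative.

```python
def required_throughput_gb_per_s(size_tb: float, window_minutes: float) -> float:
    """Sustained GB/s needed to move size_tb within window_minutes (decimal units)."""
    size_gb = size_tb * 1000       # 1 TB = 1000 GB
    window_s = window_minutes * 60
    return size_gb / window_s


rate = required_throughput_gb_per_s(4.2, 15)
print(f"{rate:.2f} GB/s, about {rate * 8:.1f} Gbit/s")  # 4.67 GB/s, about 37.3 Gbit/s
```

Restore, index rebuild, and validation time would all come on top of this figure, which is consistent with the plan's choice to replicate continuously and keep only the final switch inside the window.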

Dependency Map

| Dependent | Type | Coupling | Impact if Migration Breaks | Notification Needed |
|-----------|------|----------|----------------------------|---------------------|
| OrderService | Direct consumer (read/write) | Tight | Order creation and status updates fail; customer-facing | Platform Eng + Product, T−2 weeks |
| InventoryService | Direct consumer (read/write) | Tight | Stock availability queries fail; cascades to OrderService | Platform Eng, T−2 weeks |
| BillingService | Direct consumer (read) | Medium | Invoice generation delayed; not real-time | Platform Eng + Finance, T−2 weeks |
| SAP ERP integration | Direct consumer (read via JDBC) | Tight | Freight cost sync breaks; SAP team must update connection string | SAP team lead, T−3 weeks |
| Redshift ETL pipeline | Indirect (nightly batch) | Loose | Next-day reporting delayed; recoverable by re-run | Data Engineering, T−1 week |

Blast Radius Assessment

  • Direct impact: OrderService and InventoryService go down if Aurora endpoint is unreachable or stored procedures fail post-migration
  • Indirect impact: SAP sync produces stale freight cost data; Redshift reports lag by one day
  • Data impact: 4.2 TB — risk of replication lag gap during final cutover if DMS task falls behind; PostgreSQL 15 behavior changes in 3 stored procedures (identified in pre-migration audit) could silently corrupt calculation outputs
  • Customer impact: Enterprise customers on freight booking portal see errors during cutover window; status page update required
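
One way to make the blast radius concrete is to walk the dependency map as a graph. The sketch below is illustrative only: the edge list is a hand transcription of the dependency map above, `primary_db` is a made-up node name for the database being migrated, and `blast_radius` is a hypothetical helper rather than part of any tool.

```python
from collections import deque

# Edges point from a component to the things that depend on it.
DEPENDENTS = {
    "primary_db": [
        "OrderService",
        "InventoryService",
        "BillingService",
        "SAP ERP integration",
        "Redshift ETL pipeline",
    ],
    "InventoryService": ["OrderService"],  # stock query failures cascade to orders
}


def blast_radius(failed: str, dependents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning everything affected if `failed` breaks."""
    seen = {failed}
    affected = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


print(blast_radius("primary_db", DEPENDENTS))
print(blast_radius("InventoryService", DEPENDENTS))  # ['OrderService']
```

Keeping the map in a machine-readable form like this makes it easier to re-check the blast radius when a new consumer is discovered late in planning.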

Data Migration Plan

Volume: 4.2 TB, ~1.8 billion rows across 14 core tables

Migration approach: AWS DMS continuous replication (CDC) for ongoing sync after initial full load via pg_dump parallel jobs; dual-endpoint validation before cutover

Validation Strategy

| Check | Method | Tolerance | Action if Failed |
|-------|--------|-----------|------------------|
| Row count match | COUNT(*) on all 14 tables, both endpoints | 0% | STOP -- investigate DMS task lag or missed transactions |
| Stored procedure output | Run 12 known-query fixtures against both DBs | Exact match | STOP -- audit PG15 behavior changes in affected functions |
| Referential integrity | FK constraint validation script on Aurora | 0% violations | STOP -- trace DMS ordering issue |
| Data checksum (spot) | MD5 on 10K random row samples, 3 largest tables | <0.001% variance | Investigate before proceeding |
| BillingService invoice totals | Run last 30 days of billing queries on both | Exact match | STOP -- escalate to Finance |

Handling the gap:

  • DMS replication lag target: < 5 seconds during parallel run phase
  • At T−0, OrderService and InventoryService are put into read-only mode (feature flag) for 8 minutes while final DMS flush completes and row counts are verified
  • Writes queue in SQS; drained to Aurora immediately after cutover confirmed
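
The queue-and-replay approach described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions: a `deque` stands in for SQS, plain dicts stand in for the databases, and the `WriteBuffer` class and its methods are hypothetical names, not anything taken from the services in this plan.

```python
from collections import deque


class WriteBuffer:
    """Holds writes while the database is read-only, then replays them in order."""

    def __init__(self):
        self.read_only = False
        self.pending = deque()

    def write(self, db: dict, key: str, value) -> None:
        if self.read_only:
            self.pending.append((key, value))  # defer until cutover is confirmed
        else:
            db[key] = value

    def drain(self, db: dict) -> int:
        """Replay queued writes first-in, first-out; return how many were applied."""
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            db[key] = value
            applied += 1
        return applied


old_db = {"order-1": "created"}
new_db = {"order-1": "created"}
buffer = WriteBuffer()

buffer.read_only = True                     # cutover starts
buffer.write(old_db, "order-1", "shipped")  # queued, not applied
buffer.write(old_db, "order-2", "created")  # queued, not applied

buffer.read_only = False                    # cutover confirmed
applied = buffer.drain(new_db)              # replay against the new target
print(applied, new_db)  # 2 {'order-1': 'shipped', 'order-2': 'created'}
```

Replay order matters here: applying the two `order-1` states out of sequence would leave stale data. SQS standard queues offer only best-effort ordering, so a FIFO queue or per-entity sequencing may be worth considering if writes to the same record can queue up during the window.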

Rollback Plan

Rollback triggers:

  • Error rate on any service > 1% for 5 consecutive minutes post-cutover
  • Any stored procedure returning results that differ from baseline fixtures
  • DMS replication lag > 30 seconds at time of cutover decision
  • SAP JDBC connection failure not resolved within 10 minutes of cutover
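
The triggers above translate naturally into an automated check. The following is a sketch only: the `Snapshot` fields and the `rollback_reasons` function are hypothetical, and the per-minute error-rate samples are assumed to come from whatever monitoring system is already in place.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rates: list[float]   # per-minute error rate (0.02 == 2%), oldest first
    fixtures_match: bool       # stored procedure outputs equal baseline fixtures
    dms_lag_seconds: float     # replication lag at the cutover decision
    sap_down_minutes: float    # minutes the SAP JDBC connection has been failing


def sustained_above(samples: list[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` samples all exceed the threshold."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(s > threshold for s in recent)


def rollback_reasons(s: Snapshot) -> list[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons = []
    if sustained_above(s.error_rates, 0.01, 5):
        reasons.append("error rate > 1% for 5 consecutive minutes")
    if not s.fixtures_match:
        reasons.append("stored procedure output differs from baseline")
    if s.dms_lag_seconds > 30:
        reasons.append("DMS replication lag > 30 seconds")
    if s.sap_down_minutes >= 10:
        reasons.append("SAP JDBC failure unresolved after 10 minutes")
    return reasons


# A single one-minute spike does not trigger; five bad minutes in a row do.
healthy = Snapshot([0.002, 0.03, 0.004, 0.003, 0.002], True, 1.5, 0)
failing = Snapshot([0.02, 0.03, 0.02, 0.04, 0.05], True, 45.0, 0)

print(rollback_reasons(healthy))  # []
print(rollback_reasons(failing))
```

Returning the list of reasons, rather than a bare boolean, gives the go/no-go owner something concrete to record in the incident timeline.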

Rollback procedure:

  1. Flip OrderService, InventoryService, BillingService connection strings back to EC2 PostgreSQL 11 endpoint via Parameter Store (estimated: 2 minutes)
  2. Drain SQS write queue back to EC2 instance (estimated: 1–3 minutes depending on volume)
  3. Verify row counts on EC2 DB match pre-cutover snapshot (estimated: 2 minutes)
  4. Notify SAP team to revert JDBC connection string (estimated: 5 minutes with their on-call)
  5. Post status page update; alert on-call rotation (estimated: 1 minute)

Rollback window: Up to 60 minutes post-cutover (before Redshift ETL job triggers at 02:00 UTC)

What makes rollback impossible: If Redshift ETL has already ingested Aurora data and downstream reports have been generated, rollback creates a reporting split that requires manual reconciliation. Avoid cutover after 00:00 UTC.

Rollback testing: Full dry-run during off-hours maintenance window, October 12, including deliberate failure injection and rollback execution; target < 12 minutes end-to-end
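
As a quick sanity check of the dry-run target against the step estimates in the rollback procedure, the sketch below sums the best-case and worst-case durations. It assumes the five steps run strictly one after another; the step labels are shorthand and the tuple layout is just for illustration.

```python
# (step, best-case minutes, worst-case minutes), taken from the rollback procedure
STEPS = [
    ("flip connection strings", 2, 2),
    ("drain SQS queue to EC2", 1, 3),
    ("verify row counts", 2, 2),
    ("SAP reverts JDBC string", 5, 5),
    ("status page and on-call alert", 1, 1),
]

TARGET_MINUTES = 12  # dry-run target: under 12 minutes end-to-end

best = sum(low for _, low, _ in STEPS)
worst = sum(high for _, _, high in STEPS)

print(f"sequential best case:  {best} min")    # 11 min
print(f"sequential worst case: {worst} min")   # 13 min
print(f"worst case within target: {worst < TARGET_MINUTES}")  # False
```

Run sequentially, the worst case misses the target by a minute or more, so meeting it probably depends on overlapping steps, for example contacting the SAP on-call while the queue drains. The dry-run is a good place to confirm which steps can safely run in parallel.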


Cutover Checklist (Condensed)

Pre-cutover (T−1 week):

  • DMS replication lag < 5 seconds sustained for 48 hours
  • All 12 stored procedure fixtures passing on Aurora
  • SAP team has new JDBC connection string and tested on staging
  • Read-only feature flag tested on all three services
  • Rollback dry-run completed October 12 — results documented
  • Status page incident template drafted
  • On-call: 2 platform engineers + 1 SAP contact confirmed for cutover night

Cutover (T−0, target: Saturday October 26, 22:00 UTC):

  • Enable read-only mode on OrderService + InventoryService via feature flag (T+0:00, Eng 1)
  • Confirm DMS lag = 0, run final row count validation (T+0:02, Eng 2 — must pass in < 4 min or abort)
  • Flip connection strings in Parameter Store for all three services (T+0:06, Eng 1)
  • SAP team updates JDBC endpoint (T+0:07, SAP contact)