Use this when the team needs to migrate a database, service, API, infrastructure component, or data model and wants a structured plan that minimizes risk and downtime. Also use when evaluating whether a migration is worth the investment or when a migration is already in trouble and needs a recovery plan.
Related skills: Use /tech-debt-assessment to evaluate whether a migration is the right remediation for a debt item. Use /system-diagram to visualize before and after states. Use /pre-mortem to stress-test the migration plan before execution. Use /architecture-discovery for complex migrations that require understanding bounded contexts.
Process
Step 1: Gather inputs
Ask the user to provide:
- What's being migrated -- database, service, API, infrastructure, data model, or combination?
- From/to -- current state and target state. Be specific: versions, platforms, architectures.
- Why -- what's driving the migration? (End of life, scaling limits, cost, compliance, tech debt, acquisition.)
- Scope -- everything at once, or can it be phased?
- Consumers -- who and what depends on the thing being migrated? (Services, teams, external clients, integrations.)
- Constraints -- downtime budget, compliance requirements, data retention rules, team capacity, deadline.
- Current state health -- is the existing system stable or already failing? (Affects urgency and risk tolerance.)
Step 2: Select the migration pattern
Evaluate which migration pattern fits the situation:
### Migration pattern evaluation -- {{system}}, {{date}}
| Pattern | Description | Best when | Risk level | Duration |
|---------|-----------|-----------|-----------|----------|
| **Strangler Fig** | Incrementally replace old with new, routing traffic gradually | Large systems, many consumers, low downtime tolerance | Low | Weeks to months |
| **Parallel Run** | Run old and new simultaneously, compare outputs, switch when confident | Data integrity is critical, complex business logic | Low-Medium | Weeks to months |
| **Blue-Green** | Stand up complete new environment, switch all traffic at once | Infrastructure migrations, stateless services | Medium | Days to weeks |
| **Big Bang** | Replace everything at once during a maintenance window | Small scope, acceptable downtime, simple dependencies | High | Hours to days |
| **Trickle Migration** | Move data/traffic incrementally by segment (by customer, region, entity type) | Large data sets, heterogeneous consumers | Low | Weeks to months |
**Selected pattern:** {{pattern}}
**Rationale:** (why this pattern fits the constraints and risk tolerance)
**Rejected alternatives:** (which patterns were considered and why they don't fit)
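The "Best when" column above can be turned into a rough first-pass heuristic. The sketch below is illustrative only: the `MigrationConstraints` fields, the order of the rules, and the cutoffs (two consumers, 240 minutes of downtime) are assumptions made for this example, not rules defined by this skill. Treat the result as a starting point for the rationale, not as the decision.

```python
from dataclasses import dataclass


@dataclass
class MigrationConstraints:
    downtime_budget_minutes: int   # maximum tolerated downtime
    consumer_count: int            # services, teams, integrations that depend on the system
    stateless: bool                # True if the component holds no persistent state
    data_integrity_critical: bool  # True if wrong results are worse than a slow migration
    can_segment: bool              # True if data/traffic splits cleanly (customer, region, ...)


def suggest_pattern(c: MigrationConstraints) -> str:
    """Return a first-guess pattern name; every threshold here is an assumption."""
    if c.data_integrity_critical:
        return "Parallel Run"
    if c.stateless:
        return "Blue-Green"
    if c.can_segment:
        return "Trickle Migration"
    if c.consumer_count <= 2 and c.downtime_budget_minutes >= 240:
        return "Big Bang"
    # Default for large, stateful systems with many consumers and little downtime budget.
    return "Strangler Fig"


if __name__ == "__main__":
    print(suggest_pattern(MigrationConstraints(15, 5, False, True, False)))
```

Whatever the heuristic returns, the Rationale and Rejected alternatives fields still need to be filled in by hand.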
Step 3: Map dependencies and blast radius
### Dependency map
| Dependent | Type | Coupling | Impact if migration breaks | Notification needed |
|-----------|------|---------|--------------------------|-------------------|
| (service/team/integration) | Direct consumer / Indirect / Data reader | Tight / Loose | (what happens?) | (who needs to know, when?) |
### Blast radius assessment
- **Direct impact:** (systems that will stop working if migration fails)
- **Indirect impact:** (systems that may degrade or produce incorrect results)
- **Data impact:** (risk of data loss, corruption, or inconsistency)
- **Customer impact:** (who sees what during migration and during failure)
Step 4: Design the data migration (if applicable)
### Data migration plan
**Volume:** (rows, GB, number of records)
**Migration approach:** (bulk export/import, streaming replication, dual-write, CDC)
**Validation strategy:**
| Check | Method | Tolerance | Action if failed |
|-------|--------|-----------|-----------------|
| Row count match | COUNT(*) comparison | 0% tolerance | STOP -- investigate |
| Data integrity | Checksum or sample comparison | (acceptable error rate) | (action) |
| Referential integrity | FK validation on target | 0% tolerance | STOP -- investigate |
| Business logic validation | Run known-output queries on both | Exact match | STOP -- investigate |
**Handling the gap:**
- How do you handle writes to the old system during migration?
- (Dual-write / queue and replay / maintenance window / CDC stream)
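The row count and checksum checks in the validation table can be sketched as below. This is an assumption-laden illustration: it uses in-memory SQLite as a stand-in for the real source and target, a hypothetical `orders` table, and a full-table SHA-256 digest, which would be replaced by sampling or per-chunk checksums on large tables.

```python
import hashlib
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    # Table names come from a trusted, hard-coded list, never from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def table_checksum(conn: sqlite3.Connection, table: str, key: str) -> str:
    """Hash every row in primary-key order so both sides are compared identically."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def validate(source: sqlite3.Connection, target: sqlite3.Connection,
             table: str, key: str) -> list[str]:
    """Return the names of failed checks; an empty list means the table matches."""
    failures = []
    if row_count(source, table) != row_count(target, table):
        failures.append("row_count")
    if table_checksum(source, table, key) != table_checksum(target, table, key):
        failures.append("checksum")
    return failures


def make_db(rows: list[tuple[int, float]]) -> sqlite3.Connection:
    """Build a throwaway database with a hypothetical `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    source = make_db([(1, 10.0), (2, 20.5)])
    target = make_db([(1, 10.0), (2, 99.0)])  # one corrupted value
    print(validate(source, target, "orders", "id"))  # prints ['checksum']
```

Any non-empty result maps to the "STOP -- investigate" action in the table.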
Step 5: Define the rollback strategy
### Rollback plan
**Rollback trigger:** (what conditions trigger a rollback?)
- (e.g., "Error rate > 1% on migrated traffic")
- (e.g., "Data validation fails on > 0.1% of records")
- (e.g., "Customer-reported issues within first 30 minutes")
**Rollback procedure:**
1. (Step-by-step: what to do, in what order, who does it)
2. (Include estimated time for each step)
3. (Include verification after rollback)
**Rollback window:** (how long after cutover can we still roll back?)
**What makes rollback impossible:** (at what point is rollback no longer viable? Why?)
**Rollback testing:** (when and how will the rollback procedure be tested before the real migration?)
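Rollback triggers are easier to act on under pressure when they are written as explicit checks. The sketch below encodes the three example triggers from the template (1% error rate, 0.1% validation failures, customer issues in the first 30 minutes); the `CutoverMetrics` fields and function name are hypothetical, and the default thresholds should be replaced with the values agreed for the real migration.

```python
from dataclasses import dataclass


@dataclass
class CutoverMetrics:
    error_rate: float               # fraction of failed requests on migrated traffic
    validation_failure_rate: float  # fraction of records failing data validation
    customer_issues: int            # customer-reported issues so far
    minutes_since_cutover: float


def rollback_reasons(
    m: CutoverMetrics,
    *,
    max_error_rate: float = 0.01,
    max_validation_failure_rate: float = 0.001,
    issue_window_minutes: float = 30,
) -> list[str]:
    """Return every trigger that fired; an empty list means keep going."""
    reasons = []
    if m.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if m.validation_failure_rate > max_validation_failure_rate:
        reasons.append("data validation failures above threshold")
    if m.customer_issues > 0 and m.minutes_since_cutover <= issue_window_minutes:
        reasons.append("customer-reported issues inside the watch window")
    return reasons


if __name__ == "__main__":
    print(rollback_reasons(CutoverMetrics(0.02, 0.0, 0, 10)))
```

Returning all fired triggers, rather than a single boolean, gives the person making the rollback call the reason to record in the incident log.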
Step 6: Build the cutover plan
### Cutover checklist
**Pre-cutover (T minus 1 week):**
- [ ] All data migration validation passing
- [ ] Rollback procedure tested
- [ ] Communication sent to affected teams/customers
- [ ] Monitoring dashboards configured for migration-specific metrics
- [ ] On-call schedule confirmed for cutover window
- [ ] Runbook reviewed and updated
**Cutover (T zero):**
- [ ] (Step 1: specific action -- who, what, expected duration)
- [ ] (Step 2: verification check -- what to look for)
- [ ] (Step 3: traffic shift -- how much, how fast)
- [ ] (Step 4: monitoring checkpoint -- what metrics to watch, for how long)
- [ ] (Step 5: go/no-go decision point -- criteria for proceeding vs. rolling back)
**Post-cutover (T plus 1 hour / 1 day / 1 week):**
- [ ] Error rates within normal bounds
- [ ] Performance metrics stable
- [ ] Data consistency checks passing
- [ ] Old system decommission scheduled (don't rush this)
- [ ] Retrospective scheduled
Step 7: Communication plan
### Communication plan
| Audience | Message | When | Channel | Owner |
|----------|---------|------|---------|-------|
| Engineering teams | Migration plan + timeline + what they need to do | T minus 2 weeks | (Slack/email/meeting) | (who) |
| Customer-facing teams | What customers might see, FAQ for support | T minus 1 week | (channel) | (who) |
| Customers (if needed) | Maintenance window, expected impact, what to do if issues | T minus 3 days | (email/status page) | (who) |
| Leadership | Status update, risk summary, go/no-go | T minus 1 day | (channel) | (who) |
| All | Cutover started / completed / issues | T zero | (status page/Slack) | (who) |
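Once a cutover date is fixed, the "When" column can be converted into calendar dates. A minimal sketch, assuming the T-minus offsets from the table above and a hypothetical `communication_dates` helper:

```python
from datetime import date, timedelta

# Days before cutover, taken from the "When" column of the communication plan.
OFFSETS_DAYS = {
    "Engineering teams": 14,
    "Customer-facing teams": 7,
    "Customers (if needed)": 3,
    "Leadership": 1,
    "All": 0,
}


def communication_dates(cutover: date) -> dict[str, date]:
    """Map each audience to the date its message should go out."""
    return {audience: cutover - timedelta(days=days)
            for audience, days in OFFSETS_DAYS.items()}


if __name__ == "__main__":
    for audience, send_on in communication_dates(date(2024, 10, 26)).items():
        print(f"{send_on.isoformat()}  {audience}")
```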
Step 8: Define success criteria
### Success criteria
**Migration is successful when:**
- [ ] All data validated and reconciled (from Step 4 checks)
- [ ] Error rate at or below pre-migration baseline for 48 hours
- [ ] Performance metrics (e.g., p99 latency) at or better than the pre-migration baseline for 48 hours
- [ ] No customer-reported issues related to migration for 1 week
- [ ] Old system safely decommissioned (or decommission scheduled)
**Monitoring during migration:**
| Metric | Pre-migration baseline | Alert threshold | Dashboard |
|--------|----------------------|-----------------|-----------|
| (error rate) | (current value) | (threshold) | (link) |
| (latency p99) | (current value) | (threshold) | (link) |
| (data consistency) | 100% | < 99.9% | (link) |
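The "at or below baseline for 48 hours" criteria and the data consistency alert can be expressed as simple checks over collected samples. This is a sketch under assumptions: the function names are hypothetical, samples are assumed to cover the full observation window already, and the 99.9% threshold is the one shown in the table.

```python
def within_baseline(samples: list[float], baseline: float) -> bool:
    """True only if there is at least one sample and none exceeds the baseline."""
    return bool(samples) and all(s <= baseline for s in samples)


def data_consistency_alert(matched: int, total: int, threshold: float = 0.999) -> bool:
    """True when the share of matching records drops below the alert threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total < threshold


if __name__ == "__main__":
    hourly_error_rates = [0.002, 0.003, 0.001]  # hypothetical samples
    print(within_baseline(hourly_error_rates, baseline=0.003))
    print(data_consistency_alert(matched=9_989, total=10_000))
```

An empty sample list is treated as a failure on purpose: missing monitoring data should not count as a passed criterion.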
Step 9: Discuss
Ask the user:
- Does the pattern selection match your constraints?
- Are there dependencies I missed?
- Is the rollback window realistic?
- Who needs to approve the go/no-go decision?
- Want me to break the cutover steps into sprint stories?
- Should I run a /pre-mortem on this plan?
Output location
Present the migration plan as formatted text in the conversation or save to a file if requested.
Example Output
Input
- What's being migrated: PostgreSQL 11 database (primary transactional store) to Aurora PostgreSQL 15, including a legacy stored-procedure-heavy schema with ~40 custom functions
- From/to: Self-managed PostgreSQL 11 on EC2 (r5.4xlarge, single AZ) → Amazon Aurora PostgreSQL 15 (Multi-AZ cluster, 2 read replicas)
- Why: PostgreSQL 11 reached community end of life in November 2023; the current setup has had two unplanned outages this quarter due to storage I/O saturation; an insurance compliance audit flagged the single-AZ deployment as a DR gap
- Consumers: 3 internal microservices (OrderService, InventoryService, BillingService), 1 third-party ERP integration (SAP), nightly ETL pipeline to Redshift
- Constraints: Max 15-minute downtime window (SLA with enterprise customers), migration must complete before November 1 audit deadline, 2 engineers available (mid-sprint), 4.2 TB database
- Company: Vanterra Logistics, B2B freight management platform
Output (abbreviated)
Migration Plan: PostgreSQL 11 (EC2) → Aurora PostgreSQL 15
System: Vanterra Logistics -- Primary Transactional DB
Date: October 2024
Authors: Platform Engineering
Migration Pattern Evaluation
| Pattern | Description | Best when | Risk Level | Duration |
|---|---|---|---|---|
| Strangler Fig | Incrementally replace old with new | Large systems, many consumers | Low | Weeks–months |
| Parallel Run | Run both simultaneously, compare outputs | Data integrity critical | Low–Medium | Weeks–months |
| Blue-Green | Stand up full new environment, cut over at once | Infrastructure migrations, low-statefulness | Medium | Days–weeks |
| Big Bang | Full replacement in maintenance window | Small scope, acceptable downtime | High | Hours |
| Trickle Migration | Move data by segment incrementally | Large data sets | Low | Weeks–months |
Selected pattern: Parallel Run → Blue-Green cutover
Rationale: 4.2 TB volume and 40 stored procedures require extended validation before cutover. AWS DMS continuous replication enables parallel run with near-zero lag, then a Blue-Green switch inside the 15-minute downtime window once parity is confirmed over 2 weeks.
Rejected alternatives:
- Big Bang — 4.2 TB cannot be reliably exported/imported in 15 minutes; too high risk given compliance deadline
- Strangler Fig — all three services share a monolithic schema with no clean seam to route incrementally; would require months of refactoring not available in sprint capacity
Dependency Map
| Dependent | Type | Coupling | Impact if Migration Breaks | Notification Needed |
|---|---|---|---|---|
| OrderService | Direct consumer (read/write) | Tight | Order creation and status updates fail; customer-facing | Platform Eng + Product, T−2 weeks |
| InventoryService | Direct consumer (read/write) | Tight | Stock availability queries fail; cascades to OrderService | Platform Eng, T−2 weeks |
| BillingService | Direct consumer (read) | Medium | Invoice generation delayed; not real-time | Platform Eng + Finance, T−2 weeks |
| SAP ERP integration | Direct consumer (read via JDBC) | Tight | Freight cost sync breaks; SAP team must update connection string | SAP team lead, T−3 weeks |
| Redshift ETL pipeline | Indirect (nightly batch) | Loose | Next-day reporting delayed; recoverable by re-run | Data Engineering, T−1 week |
Blast Radius Assessment
- Direct impact: OrderService and InventoryService go down if Aurora endpoint is unreachable or stored procedures fail post-migration
- Indirect impact: SAP sync produces stale freight cost data; Redshift reports lag by one day
- Data impact: 4.2 TB — risk of replication lag gap during final cutover if DMS task falls behind; PostgreSQL 15 behavior changes in 3 stored procedures (identified in pre-migration audit) could silently corrupt calculation outputs
- Customer impact: Enterprise customers on freight booking portal see errors during cutover window; status page update required
Data Migration Plan
Volume: 4.2 TB, ~1.8 billion rows across 14 core tables
Migration approach: AWS DMS continuous replication (CDC) for ongoing sync after initial full load via pg_dump parallel jobs; dual-endpoint validation before cutover
Validation Strategy
| Check | Method | Tolerance | Action if Failed |
|---|---|---|---|
| Row count match | COUNT(*) on all 14 tables, both endpoints | 0% | STOP — investigate DMS task lag or missed transactions |
| Stored procedure output | Run 12 known-query fixtures against both DBs | Exact match | STOP — audit PG15 behavior changes in affected functions |
| Referential integrity | FK constraint validation script on Aurora | 0% violations | STOP — trace DMS ordering issue |
| Data checksum (spot) | MD5 on 10K random row samples, 3 largest tables | <0.001% variance | Investigate before proceeding |
| BillingService invoice totals | Run last 30 days of billing queries on both | Exact match | STOP — escalate to Finance |
Handling the gap:
- DMS replication lag target: < 5 seconds during parallel run phase
- At T−0, OrderService and InventoryService are put into read-only mode (feature flag) for 8 minutes while final DMS flush completes and row counts are verified
- Writes queue in SQS; drained to Aurora immediately after cutover confirmed
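The queue-and-replay approach described above can be illustrated in miniature. In this sketch an in-memory `deque` stands in for the SQS queue and a plain callable stands in for each database; the `WriteBuffer` class and its method names are hypothetical, and no real AWS calls are made.

```python
from collections import deque
from typing import Any, Callable


class WriteBuffer:
    """Applies writes directly, or queues them while the system is read-only."""

    def __init__(self, apply_write: Callable[[Any], None]) -> None:
        self._apply = apply_write      # writes go to the old system at first
        self._pending: deque = deque()
        self.read_only = False

    def submit(self, write: Any) -> str:
        if self.read_only:
            self._pending.append(write)  # held until cutover is confirmed
            return "queued"
        self._apply(write)
        return "applied"

    def drain(self, new_apply_write: Callable[[Any], None]) -> int:
        """Point at the new target, replay queued writes in order, resume writes."""
        self._apply = new_apply_write
        replayed = 0
        while self._pending:
            self._apply(self._pending.popleft())
            replayed += 1
        self.read_only = False
        return replayed


if __name__ == "__main__":
    old_db, new_db = [], []
    buf = WriteBuffer(old_db.append)
    buf.submit("order-1")
    buf.read_only = True           # T-0: freeze writes
    buf.submit("order-2")
    print(buf.drain(new_db.append), old_db, new_db)
```

For rollback, the same `drain` call can be pointed back at the old system's writer, which mirrors the "drain SQS write queue back to EC2" step in the rollback procedure.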
Rollback Plan
Rollback triggers:
- Error rate on any service > 1% for 5 consecutive minutes post-cutover
- Any stored procedure returning results that differ from baseline fixtures
- DMS replication lag > 30 seconds at time of cutover decision
- SAP JDBC connection failure not resolved within 10 minutes of cutover
Rollback procedure:
1. Flip OrderService, InventoryService, BillingService connection strings back to EC2 PostgreSQL 11 endpoint via Parameter Store (estimated: 2 minutes)
2. Drain SQS write queue back to EC2 instance (estimated: 1–3 minutes depending on volume)
3. Verify row counts on EC2 DB match pre-cutover snapshot (estimated: 2 minutes)
4. Notify SAP team to revert JDBC connection string (estimated: 5 minutes with their on-call)
5. Post status page update; alert on-call rotation (estimated: 1 minute)
Rollback window: Up to 60 minutes post-cutover (before Redshift ETL job triggers at 02:00 UTC)
What makes rollback impossible: If Redshift ETL has already ingested Aurora data and downstream reports have been generated, rollback creates a reporting split that requires manual reconciliation -- avoid cutover after 00:00 UTC
Rollback testing: Full dry-run during off-hours maintenance window, October 12, including deliberate failure injection and rollback execution; target < 12 minutes end-to-end
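It is worth checking the per-step estimates in the rollback procedure against the dry-run target. The sketch below sums the estimates as written, assuming the steps run strictly one after another (the plan does not say whether any can overlap); the step labels are shortened for readability.

```python
# (label, minimum minutes, maximum minutes), copied from the rollback procedure.
ROLLBACK_STEPS = [
    ("Flip connection strings", 2, 2),
    ("Drain SQS queue to EC2", 1, 3),
    ("Verify row counts", 2, 2),
    ("SAP reverts JDBC string", 5, 5),
    ("Status page and on-call alert", 1, 1),
]

TARGET_MINUTES = 12  # dry-run target: under 12 minutes end-to-end


def sequential_bounds(steps: list[tuple[str, int, int]]) -> tuple[int, int]:
    """Best-case and worst-case totals if every step runs in sequence."""
    return sum(s[1] for s in steps), sum(s[2] for s in steps)


if __name__ == "__main__":
    best, worst = sequential_bounds(ROLLBACK_STEPS)
    print(best, worst)  # prints: 11 13
    print("worst case within target:", worst < TARGET_MINUTES)
```

Under that sequential assumption the best case (11 minutes) fits the target but the worst case (13 minutes) does not, so the dry run should confirm whether the SAP notification can run in parallel with the queue drain and row count verification.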
Cutover Checklist (Condensed)
Pre-cutover (T−1 week):
- DMS replication lag < 5 seconds sustained for 48 hours
- All 12 stored procedure fixtures passing on Aurora
- SAP team has new JDBC connection string and tested on staging
- Read-only feature flag tested on all three services
- Rollback dry-run completed October 12 — results documented
- Status page incident template drafted
- On-call: 2 platform engineers + 1 SAP contact confirmed for cutover night
Cutover (T−0, target: Saturday October 26, 22:00 UTC):
- Enable read-only mode on OrderService + InventoryService via feature flag (T+0:00, Eng 1)
- Confirm DMS lag = 0, run final row count validation (T+0:02, Eng 2 — must pass in < 4 min or abort)
- Flip connection strings in Parameter Store for all three services (T+0:06, Eng 1)
- SAP team updates JDBC endpoint (T+0:07, SAP contact)