Migration Plan

Use this when the team needs to migrate a database, service, API, infrastructure component, or data model and wants a structured plan that minimizes risk and downtime. Also use it when evaluating whether a migration is worth the investment, or when a migration is already in trouble and needs a recovery plan.

Related skills: Use /tech-debt-assessment to evaluate whether a migration is the right remediation for a debt item. Use /system-diagram to visualize before and after states. Use /pre-mortem to stress-test the migration plan before execution. Use /architecture-discovery for complex migrations that require understanding bounded contexts.

Process

Step 1: Gather inputs

Ask the user to provide:

- **What's being migrated** -- database, service, API, infrastructure, data model, or a combination?
- **From/to** -- current state and target state. Be specific: versions, platforms, architectures.
- **Why** -- what's driving the migration? (End of life, scaling limits, cost, compliance, tech debt, acquisition.)
- **Scope** -- everything at once, or can it be phased?
- **Consumers** -- who and what depends on the thing being migrated? (Services, teams, external clients, integrations.)
- **Constraints** -- downtime budget, compliance requirements, data retention rules, team capacity, deadline.
- **Current state health** -- is the existing system stable or already failing? (Affects urgency and risk tolerance.)

Step 2: Select the migration pattern

Evaluate which migration pattern fits the situation:

### Migration pattern evaluation -- {{system}}, {{date}}

| Pattern | Description | Best when | Risk level | Duration |
|---------|-----------|-----------|-----------|----------|
| **Strangler Fig** | Incrementally replace old with new, routing traffic gradually | Large systems, many consumers, low downtime tolerance | Low | Weeks to months |
| **Parallel Run** | Run old and new simultaneously, compare outputs, switch when confident | Data integrity is critical, complex business logic | Low-Medium | Weeks to months |
| **Blue-Green** | Stand up complete new environment, switch all traffic at once | Infrastructure migrations, stateless services | Medium | Days to weeks |
| **Big Bang** | Replace everything at once during a maintenance window | Small scope, acceptable downtime, simple dependencies | High | Hours to days |
| **Trickle Migration** | Move data/traffic incrementally by segment (by customer, region, entity type) | Large data sets, heterogeneous consumers | Low | Weeks to months |

**Selected pattern:** {{pattern}}
**Rationale:** (why this pattern fits the constraints and risk tolerance)
**Rejected alternatives:** (which patterns were considered and why they don't fit)
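
The incremental patterns (Strangler Fig, Trickle Migration) depend on routing each consumer to the same side consistently while the migrated share grows. A minimal sketch of one way to do that, assuming a stable string key such as a customer ID; the names `bucket_for` and `route` are hypothetical and not part of any particular router or proxy:

```python
import hashlib


def bucket_for(key: str, buckets: int = 100) -> int:
    """Map a stable key (e.g. a customer ID) to a bucket in [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(key: str, migrated_percent: int) -> str:
    """Return "new" when the key's bucket falls inside the migrated share.

    Assumption: traffic is split by percentage. A key that is routed to
    "new" at 20% stays on "new" at any higher percentage, because its
    bucket never changes.
    """
    if not 0 <= migrated_percent <= 100:
        raise ValueError("migrated_percent must be between 0 and 100")
    return "new" if bucket_for(key) < migrated_percent else "old"
```

Hashing the key instead of picking randomly per request keeps a given customer on one system at a time, which simplifies debugging and rollback.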

Step 3: Map dependencies and blast radius

### Dependency map

| Dependent | Type | Coupling | Impact if migration breaks | Notification needed |
|-----------|------|---------|--------------------------|-------------------|
| (service/team/integration) | Direct consumer / Indirect / Data reader | Tight / Loose | (what happens?) | (who needs to know, when?) |

### Blast radius assessment
- **Direct impact:** (systems that will stop working if migration fails)
- **Indirect impact:** (systems that may degrade or produce incorrect results)
- **Data impact:** (risk of data loss, corruption, or inconsistency)
- **Customer impact:** (who sees what during migration and during failure)
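
The split between direct and indirect impact can be derived from the dependency map by walking it outward from the system being migrated. A rough sketch, assuming the map is available as a dictionary from each system to the systems that consume it directly; the function name and the sample system names in the usage note are illustrative only:

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], target: str) -> tuple[set[str], set[str]]:
    """Return (direct, indirect) dependents of `target`.

    `dependents` maps a system to the systems that consume it directly.
    Indirect dependents are reached only through another dependent.
    """
    direct = set(dependents.get(target, []))
    seen = set(direct) | {target}
    queue = deque(direct)
    indirect: set[str] = set()
    while queue:
        node = queue.popleft()
        for consumer in dependents.get(node, []):
            if consumer not in seen:  # also guards against dependency cycles
                seen.add(consumer)
                indirect.add(consumer)
                queue.append(consumer)
    return direct, indirect
```

For example, with `{"orders-db": ["order-service"], "order-service": ["checkout-ui"]}` and target `"orders-db"`, `order-service` is a direct dependent and `checkout-ui` an indirect one.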

Step 4: Design the data migration (if applicable)

### Data migration plan

**Volume:** (rows, GB, number of records)
**Migration approach:** (bulk export/import, streaming replication, dual-write, CDC)

**Validation strategy:**
| Check | Method | Tolerance | Action if failed |
|-------|--------|-----------|-----------------|
| Row count match | COUNT(*) comparison | 0% tolerance | STOP -- investigate |
| Data integrity | Checksum or sample comparison | (acceptable error rate) | (action) |
| Referential integrity | FK validation on target | 0% tolerance | STOP -- investigate |
| Business logic validation | Run known-output queries on both | Exact match | STOP -- investigate |
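
The row count and checksum checks above could be automated along these lines. This is a sketch under stated assumptions: it uses `sqlite3` so the example is self-contained, the helper names are hypothetical, and `table` and `key` are interpolated into SQL, so they must be trusted identifiers. Comparing across different database engines would also need value normalization (for example, decimals and timestamps) before hashing.

```python
import hashlib
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """COUNT(*) for one table. `table` must be a trusted identifier."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def table_checksum(conn: sqlite3.Connection, table: str, key: str) -> str:
    """Hash every row in key order so both sides are read in the same order."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def validate_table(source: sqlite3.Connection, target: sqlite3.Connection,
                   table: str, key: str) -> list[str]:
    """Return a list of failed checks; an empty list means the table matches."""
    failures = []
    if row_count(source, table) != row_count(target, table):
        failures.append("row count mismatch")
    if table_checksum(source, table, key) != table_checksum(target, table, key):
        failures.append("checksum mismatch")
    return failures
```

A full-table hash is only practical for small tables; for large ones, the sample comparison named in the table is the more realistic option.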

**Handling the gap:**
- How do you handle writes to the old system during migration?
- (Dual-write / queue and replay / maintenance window / CDC stream)
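
To make the "queue and replay" option concrete, here is a minimal in-memory sketch. Plain dictionaries stand in for the old and new stores, and `WriteBuffer` is a hypothetical name; a real migration would use a durable queue or a CDC stream rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBuffer:
    """Record writes made to the old store during the bulk copy, then replay them."""

    pending: list[tuple[str, str, object]] = field(default_factory=list)

    def record(self, op: str, key: str, value: object = None) -> None:
        if op not in ("put", "delete"):
            raise ValueError(f"unsupported operation: {op}")
        self.pending.append((op, key, value))

    def replay(self, target: dict) -> int:
        """Apply buffered writes to the target in arrival order; return the count."""
        applied = 0
        for op, key, value in self.pending:
            if op == "put":
                target[key] = value
            else:
                target.pop(key, None)
            applied += 1
        self.pending.clear()
        return applied
```

Replaying in arrival order matters: a put followed by a delete of the same key must leave the key absent on the target.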

Step 5: Define the rollback strategy

### Rollback plan

**Rollback trigger:** (what conditions trigger a rollback?)
- (e.g., "Error rate > 1% on migrated traffic")
- (e.g., "Data validation fails on > 0.1% of records")
- (e.g., "Customer-reported issues within first 30 minutes")
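
Rollback triggers are easier to act on under pressure when they are written as explicit checks. The sketch below encodes the three example triggers above; the default thresholds are those example values, not recommendations, and the `Snapshot` fields are hypothetical names for whatever your monitoring exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float            # fraction of failed requests on migrated traffic
    invalid_record_rate: float   # fraction of records failing data validation
    customer_reports: int        # customer-reported issues since cutover
    minutes_since_cutover: float


def rollback_reasons(s: Snapshot,
                     max_error_rate: float = 0.01,
                     max_invalid_record_rate: float = 0.001,
                     report_window_minutes: float = 30) -> list[str]:
    """Return the triggers that fired; an empty list means keep going."""
    reasons = []
    if s.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if s.invalid_record_rate > max_invalid_record_rate:
        reasons.append("data validation failures above threshold")
    if s.customer_reports > 0 and s.minutes_since_cutover <= report_window_minutes:
        reasons.append("customer-reported issues inside the report window")
    return reasons
```

Returning the list of reasons, not just a boolean, gives the person making the rollback call something to record in the incident log.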

**Rollback procedure:**
1. (Step-by-step: what to do, in what order, who does it)
2. (Include estimated time for each step)
3. (Include verification after rollback)

**Rollback window:** (how long after cutover can we still roll back?)
**What makes rollback impossible:** (at what point is rollback no longer viable? Why?)

**Rollback testing:** (when and how will the rollback procedure be tested before the real migration?)

Step 6: Build the cutover plan

### Cutover checklist

**Pre-cutover (T minus 1 week):**
- [ ] All data migration validation passing
- [ ] Rollback procedure tested
- [ ] Communication sent to affected teams/customers
- [ ] Monitoring dashboards configured for migration-specific metrics
- [ ] On-call schedule confirmed for cutover window
- [ ] Runbook reviewed and updated

**Cutover (T zero):**
- [ ] (Step 1: specific action -- who, what, expected duration)
- [ ] (Step 2: verification check -- what to look for)
- [ ] (Step 3: traffic shift -- how much, how fast)
- [ ] (Step 4: monitoring checkpoint -- what metrics to watch, for how long)
- [ ] (Step 5: go/no-go decision point -- criteria for proceeding vs. rolling back)

**Post-cutover (T plus 1 hour / 1 day / 1 week):**
- [ ] Error rates within normal bounds
- [ ] Performance metrics stable
- [ ] Data consistency checks passing
- [ ] Old system decommission scheduled (don't rush this)
- [ ] Retrospective scheduled
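
The traffic shift, monitoring checkpoint, and go/no-go steps in the cutover list form a loop: raise the share, check health, then continue or revert. A minimal sketch of that loop follows. `set_percent` and `healthy` are hypothetical callables that you would wire to your load balancer and monitoring, and a real run would wait for a soak period before each health check.

```python
from typing import Callable, Sequence


def ramp_traffic(stages: Sequence[int],
                 set_percent: Callable[[int], None],
                 healthy: Callable[[], bool]) -> int:
    """Shift traffic stage by stage and return the final percentage.

    At the first failed checkpoint, traffic is set back to 0% on the new
    system and the ramp stops (the no-go path).
    """
    for percent in stages:
        set_percent(percent)
        if not healthy():
            set_percent(0)
            return 0
    return stages[-1] if stages else 0
```

For example, `ramp_traffic([5, 25, 50, 100], set_percent, healthy)` returns 100 when every checkpoint passes and 0 as soon as one fails.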

Step 7: Communication plan

### Communication plan

| Audience | Message | When | Channel | Owner |
|----------|---------|------|---------|-------|
| Engineering teams | Migration plan + timeline + what they need to do | T minus 2 weeks | (Slack/email/meeting) | (who) |
| Customer-facing teams | What customers might see, FAQ for support | T minus 1 week | (channel) | (who) |
| Customers (if needed) | Maintenance window, expected impact, what to do if issues | T minus 3 days | (email/status page) | (who) |
| Leadership | Status update, risk summary, go/no-go | T minus 1 day | (channel) | (who) |
| All | Cutover started / completed / issues | T zero | (status page/Slack) | (who) |
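
Once a cutover date is fixed, the "When" column can be turned into calendar dates. A small sketch, assuming the offsets from the table above and treating "T zero" as the cutover date itself; the dictionary and function names are illustrative:

```python
from datetime import date, timedelta

# Days before cutover for each audience, taken from the table above.
OFFSETS_DAYS = {
    "Engineering teams": 14,
    "Customer-facing teams": 7,
    "Customers (if needed)": 3,
    "Leadership": 1,
    "All": 0,
}


def send_dates(cutover: date) -> dict[str, date]:
    """Return the date each audience should receive its message."""
    return {audience: cutover - timedelta(days=days)
            for audience, days in OFFSETS_DAYS.items()}
```

If the cutover date slips, recomputing from a single date keeps every notification in step with it.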

Step 8: Define success criteria

### Success criteria

**Migration is successful when:**
- [ ] All data validated and reconciled (from Step 4 checks)
- [ ] Error rate at or below pre-migration baseline for 48 hours
- [ ] Latency (e.g., p99) at or below pre-migration baseline for 48 hours
- [ ] No customer-reported issues related to migration for 1 week
- [ ] Old system safely decommissioned (or decommission scheduled)

**Monitoring during migration:**
| Metric | Pre-migration baseline | Alert threshold | Dashboard |
|--------|----------------------|-----------------|-----------|
| (error rate) | (current value) | (threshold) | (link) |
| (latency p99) | (current value) | (threshold) | (link) |
| (data consistency) | 100% | < 99.9% | (link) |
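
The "at or below baseline for 48 hours" criteria can be checked mechanically against metric samples. A rough sketch, assuming samples arrive as `(hours_since_cutover, value)` pairs; the function name is hypothetical, and it applies to metrics where lower is better, such as error rate or latency:

```python
def held_at_or_below(samples: list[tuple[float, float]],
                     baseline: float,
                     window_hours: float = 48.0) -> bool:
    """True when samples cover the full window and none exceeds the baseline.

    Each sample is (hours_since_cutover, value). The check fails if no
    sample has been taken at or after `window_hours`, so it cannot pass
    before the window has elapsed.
    """
    in_window = [value for hours, value in samples if 0 <= hours <= window_hours]
    covered = any(hours >= window_hours for hours, _ in samples)
    return covered and bool(in_window) and all(value <= baseline for value in in_window)
```

Requiring coverage of the whole window prevents the criterion from being ticked off after a few good hours.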

Step 9: Discuss

Ask the user:

- Does the pattern selection match your constraints?
- Are there dependencies I missed?
- Is the rollback window realistic?
- Who needs to approve the go/no-go decision?
- Want me to break the cutover steps into sprint stories?
- Should I run a /pre-mortem on this plan?

Output location

Present the migration plan as formatted text in the conversation, or save it to a file if requested.

| Pattern | Description | Best when | Risk level | Duration |
|---------|-----------|-----------|-----------|----------|
| Strangler Fig | Incrementally replace old with new | Large systems, many consumers | Low | Weeks–months |
| Parallel Run | Run both simultaneously, compare outputs | Data integrity critical | Low–Medium | Weeks–months |
| Blue-Green | Stand up full new environment, cut over at once | Infrastructure migrations, low-statefulness | Medium | Days–weeks |
| Big Bang | Full replacement in maintenance window | Small scope, acceptable downtime | High | Hours |
| Trickle Migration | Move data by segment incrementally | Large data sets | Low | Weeks–months |

| Dependent | Type | Coupling | Impact if migration breaks | Notification needed |
|-----------|------|---------|--------------------------|-------------------|
| OrderService | Direct consumer (read/write) | Tight | Order creation and status updates fail; customer-facing | Platform Eng + Product, T−2 weeks |
| InventoryService | Direct consumer (read/write) | Tight | Stock availability queries fail; cascades to OrderService | Platform Eng, T−2 weeks |
| BillingService | Direct consumer (read) | Medium | Invoice generation delayed; not real-time | Platform Eng + Finance, T−2 weeks |
| SAP ERP integration | Direct consumer (read via JDBC) | Tight | Freight cost sync breaks; SAP team must update connection string | SAP team lead, T−3 weeks |
| Redshift ETL pipeline | Indirect (nightly batch) | Loose | Next-day reporting delayed; recoverable by re-run | Data Engineering, T−1 week |

| Check | Method | Tolerance | Action if failed |
|-------|--------|-----------|-----------------|
| Row count match | `COUNT(*)` on all 14 tables, both endpoints | 0% | STOP -- investigate DMS task lag or missed transactions |
| Stored procedure output | Run 12 known-query fixtures against both DBs | Exact match | STOP -- audit PG15 behavior changes in affected functions |
| Referential integrity | FK constraint validation script on Aurora | 0% violations | STOP -- trace DMS ordering issue |
| Data checksum (spot) | MD5 on 10K random row samples, 3 largest tables | <0.001% variance | Investigate before proceeding |
| BillingService invoice totals | Run last 30 days of billing queries on both | Exact match | STOP -- escalate to Finance |

Example Output (abbreviated)

Migration Plan: PostgreSQL 11 (EC2) → Aurora PostgreSQL 15