What This Template Is For
A blue-green deployment is a release strategy that runs two identical production environments (blue and green) side by side. At any point, one environment serves live traffic while the other sits idle or runs the next release. Deploying means switching the router to send traffic from the current environment to the updated one. If the new version has problems, you switch back in seconds.
This approach eliminates the maintenance window. Users never see downtime during a deployment because the switch between environments is nearly instantaneous. It also gives teams a reliable rollback path. If monitoring detects errors after the cutover, reverting traffic to the previous environment takes one DNS or load balancer change instead of a full redeploy.
The template below helps teams plan and execute blue-green deployments with confidence. It covers environment topology, health check configuration, traffic routing, database migration strategy, and a step-by-step cutover runbook. For teams evaluating deployment strategies more broadly, the Technical PM Handbook covers release engineering patterns in depth. If you are also considering progressive rollouts, the canary release template provides a complementary approach where traffic shifts gradually instead of all at once.
How to Use This Template
- Start with the Environment Inventory section. Document both environments (blue and green) with their infrastructure details. If you do not have a second environment yet, the template guides you through what to provision.
- Fill in the Health Check Configuration. Blue-green deployments depend on automated health checks to verify the new environment before routing traffic. Define what "healthy" means for each service.
- Complete the Database Migration Strategy. This is the hardest part of blue-green deployments. The template walks through backward-compatible migration patterns that allow both environments to share a database during the transition.
- Write the Cutover Runbook. This is the step-by-step procedure your team follows during the actual deployment. Each step should be executable by any on-call engineer, not just the person who wrote the plan.
- Define the Rollback Criteria and procedure. Specify the metrics that trigger a rollback and how long to monitor before considering the deployment stable.
- Review the plan with your team. Use the architecture decision record template to document why blue-green was chosen over other strategies.
The Template
Deployment Overview
| Field | Details |
|---|---|
| Service Name | [Name of the service being deployed] |
| Deployment Date | [Planned deployment date and time, in UTC] |
| Deployment Owner | [Name and contact info of the deployment lead] |
| Release Version | [Version being deployed, e.g., v2.14.0] |
| Previous Version | [Currently running version, e.g., v2.13.2] |
| Estimated Cutover Time | [Time from start to traffic switch, e.g., 15 minutes] |
| Rollback Time | [Time to revert to previous version, e.g., 30 seconds] |
Environment Inventory
Blue Environment (Currently Active)
| Component | Details |
|---|---|
| Environment ID | [e.g., prod-blue-us-east-1] |
| Running Version | [e.g., v2.13.2] |
| Instance Type / Size | [e.g., 4x c6i.xlarge] |
| Load Balancer | [e.g., ALB arn:aws:...] |
| Database | [Shared or dedicated, connection string] |
| Last Health Check | [Date and result] |
Green Environment (Deployment Target)
| Component | Details |
|---|---|
| Environment ID | [e.g., prod-green-us-east-1] |
| Target Version | [e.g., v2.14.0] |
| Instance Type / Size | [Must match blue, e.g., 4x c6i.xlarge] |
| Load Balancer | [e.g., ALB arn:aws:...] |
| Database | [Shared or dedicated, connection string] |
| Provisioning Status | [Ready / In Progress / Not Started] |
Health Check Configuration
| Check | Endpoint | Expected Response | Timeout | Interval |
|---|---|---|---|---|
| HTTP readiness | [e.g., /healthz] | [200 OK, body: {"status":"ready"}] | [5s] | [10s] |
| Database connectivity | [e.g., /healthz/db] | [200 OK] | [3s] | [30s] |
| Dependency reachability | [e.g., /healthz/deps] | [200 OK, all dependencies up] | [10s] | [60s] |
| Smoke test suite | [e.g., POST /api/v1/test/smoke] | [All assertions pass] | [30s] | [On deploy] |
| Synthetic transaction | [e.g., end-to-end checkout flow] | [Order created successfully] | [60s] | [On deploy] |
Health check pass criteria for cutover:
- ☐ All HTTP health checks return 200 for at least [5 consecutive checks]
- ☐ Smoke test suite passes with 100% success rate
- ☐ Synthetic transaction completes within [acceptable latency threshold]
- ☐ No error log entries in the past [5 minutes]
- ☐ Memory and CPU usage within normal range
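The pass criteria above can be turned into an automated go/no-go gate. The following is a minimal sketch, not a definitive implementation: the function names and parameters are hypothetical, "5 consecutive checks" is interpreted as the five most recent checks, and the CPU and memory criterion is left out because "normal range" is specific to each service.

```python
from typing import Sequence


def has_consecutive_passes(status_codes: Sequence[int], required: int = 5) -> bool:
    """Return True if the most recent `required` checks all returned HTTP 200."""
    if required <= 0:
        raise ValueError("required must be positive")
    if len(status_codes) < required:
        return False
    return all(code == 200 for code in status_codes[-required:])


def ready_for_cutover(
    status_codes: Sequence[int],
    smoke_passed: int,
    smoke_total: int,
    synthetic_latency_s: float,
    max_latency_s: float,
    recent_error_log_count: int,
    required_consecutive: int = 5,
) -> bool:
    """Combine the checklist items into a single go/no-go decision."""
    return (
        has_consecutive_passes(status_codes, required_consecutive)
        and smoke_total > 0
        and smoke_passed == smoke_total              # 100% smoke test success
        and synthetic_latency_s <= max_latency_s     # synthetic transaction latency
        and recent_error_log_count == 0              # no error log entries in the window
    )
```

A single failed check inside the window resets the gate, which is the point of requiring consecutive passes: one green result after a run of failures should not be enough to route traffic.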
Database Migration Strategy
Migration approach: [Choose one]
- ☐ Shared database, backward-compatible migrations. Both environments connect to the same database. All schema changes are backward-compatible (additive columns, new tables). No destructive changes until the old environment is decommissioned.
- ☐ Expand-and-contract pattern. Step 1: Deploy schema expansion (new columns, dual-write logic). Step 2: Cutover traffic. Step 3: Contract (remove old columns, stop dual-write). Each step is a separate deployment.
- ☐ Separate databases with replication. Each environment has its own database. Changes are replicated during the transition period. Higher complexity but full isolation.
Migration details:
| Step | Description | Reversible? | Estimated Time |
|---|---|---|---|
| 1 | [e.g., Add new_column to orders table, nullable, default null] | [Yes] | [2 min] |
| 2 | [e.g., Deploy application code that writes to both old and new columns] | [Yes] | [5 min] |
| 3 | [e.g., Backfill new_column for existing rows] | [Yes] | [15 min] |
| 4 | [e.g., After stable cutover: drop old column, remove dual-write code] | [No] | [Next release] |
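To illustrate why an additive, nullable column keeps both environments working, here is a small sketch using an in-memory SQLite database. It assumes a simplified `orders` table; `old_column` and the two insert helpers are hypothetical names standing in for the blue and green application code, and a production database would need its own migration tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, old_column TEXT)")


def old_version_insert(value: str) -> None:
    """Old (blue) code: only knows about old_column."""
    conn.execute("INSERT INTO orders (old_column) VALUES (?)", (value,))


def new_version_insert(value: str) -> None:
    """New (green) code: dual-writes to old_column and new_column."""
    conn.execute(
        "INSERT INTO orders (old_column, new_column) VALUES (?, ?)",
        (value, value),
    )


old_version_insert("before-migration")

# Step 1 (expand): additive, nullable column. Existing rows read as NULL.
conn.execute("ALTER TABLE orders ADD COLUMN new_column TEXT DEFAULT NULL")

# Step 2: both versions can write against the expanded schema.
old_version_insert("blue-after-migration")
new_version_insert("green-after-migration")

# Step 3 (backfill): fill new_column for rows written by the old code.
conn.execute("UPDATE orders SET new_column = old_column WHERE new_column IS NULL")

rows = conn.execute(
    "SELECT old_column, new_column FROM orders ORDER BY id"
).fetchall()
```

The old code never references `new_column`, so it keeps working after the expansion. Dropping `old_column` (step 4) is the only step that would break it, which is why the contract step waits until blue is no longer a rollback target.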
Traffic Routing Configuration
Routing mechanism: [Choose one]
- ☐ DNS-based (Route 53 weighted records, Cloudflare load balancing)
- ☐ Load balancer target group swap (ALB, NLB)
- ☐ Service mesh routing (Istio, Linkerd, Consul Connect)
- ☐ CDN origin switch (CloudFront, Fastly)
Routing details:
| Parameter | Value |
|---|---|
| Router / LB | [e.g., AWS ALB prod-api-alb] |
| Current target | [e.g., Target group: tg-blue-api] |
| New target | [e.g., Target group: tg-green-api] |
| TTL (if DNS) | [e.g., 60 seconds] |
| Draining timeout | [e.g., 30 seconds for in-flight requests] |
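For the load balancer option, the switch can be reduced to one call that repoints a listener at a target group. The sketch below assumes an AWS ALB and a client shaped like boto3's `elbv2` client, whose `modify_listener` call accepts `ListenerArn` and `DefaultActions`. The ARNs, `switch_traffic`, and `RecordingClient` are hypothetical and exist only for illustration and dry runs.

```python
from typing import Any, Dict, List, Protocol


class ListenerClient(Protocol):
    """The one method this sketch needs from a boto3 `elbv2` client."""

    def modify_listener(self, **kwargs: Any) -> Any: ...


def switch_traffic(client: ListenerClient, listener_arn: str, target_group_arn: str) -> None:
    """Forward the listener's default traffic to a single target group."""
    client.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )


class RecordingClient:
    """Dry-run stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def modify_listener(self, **kwargs: Any) -> None:
        self.calls.append(kwargs)


# Hypothetical ARNs; substitute the values from the routing table.
LISTENER = "arn:aws:elasticloadbalancing:example:listener/prod-api-alb"
TG_BLUE = "arn:aws:elasticloadbalancing:example:targetgroup/tg-blue-api"
TG_GREEN = "arn:aws:elasticloadbalancing:example:targetgroup/tg-green-api"

dry_run = RecordingClient()
switch_traffic(dry_run, LISTENER, TG_GREEN)  # cutover
switch_traffic(dry_run, LISTENER, TG_BLUE)   # rollback is the same call
```

Against a real account, the client would come from `boto3.client("elbv2")`. Note that this only changes the listener's default action; if the listener has additional rules that forward to the blue target group, those would need the same treatment.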
Pre-Cutover Checklist
- ☐ Green environment provisioned and matches blue environment spec
- ☐ New version deployed to green environment
- ☐ All health checks passing on green environment
- ☐ Smoke test suite passing on green environment
- ☐ Database migrations applied and verified backward-compatible
- ☐ Monitoring dashboards open and shared with deployment team
- ☐ Alerting thresholds reviewed and appropriate for deployment
- ☐ Rollback procedure documented and accessible
- ☐ Communication sent to stakeholders (if applicable)
- ☐ On-call engineer confirmed available during cutover window
Cutover Runbook
| Step | Action | Owner | Expected Duration | Verification |
|---|---|---|---|---|
| 1 | Announce deployment start in [#deployments channel] | Deployment lead | 1 min | Message confirmed |
| 2 | Run final health check on green environment | Deployment lead | 2 min | All checks pass |
| 3 | Switch traffic routing from blue to green | Deployment lead | [1 min for LB / 1-5 min for DNS] | Traffic visible on green dashboards |
| 4 | Monitor error rates for [10 minutes] | On-call engineer | 10 min | Error rate < [0.1%] |
| 5 | Monitor latency P50/P95/P99 for [10 minutes] | On-call engineer | 10 min | Within [baseline + 10%] |
| 6 | Run synthetic transaction against production | QA engineer | 5 min | Transaction succeeds |
| 7 | Confirm blue environment is receiving zero traffic | Deployment lead | 2 min | Request count = 0 |
| 8 | Announce deployment complete | Deployment lead | 1 min | Message confirmed |
Rollback Criteria
Trigger an immediate rollback if any of the following occur:
- ☐ Error rate exceeds [0.5%] for more than [2 minutes]
- ☐ P99 latency exceeds [baseline x 3] for more than [3 minutes]
- ☐ Health checks fail on green environment
- ☐ Critical alerts fire on any monitored metric
- ☐ Customer-facing functionality is broken (confirmed by synthetic tests)
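The "exceeds X for more than Y minutes" triggers above describe a sustained breach, not a single spike. A minimal sketch of that check follows, assuming the monitoring system can supply time-ordered `(timestamp, value)` samples; the function name and sample format are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def sustained_breach(
    samples: Sequence[Sample], threshold: float, min_duration_s: float
) -> bool:
    """Return True if the metric stayed above `threshold` continuously for
    longer than `min_duration_s`. Samples must be ordered by timestamp."""
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any healthy sample resets the clock
    return False
```

For the first criterion, a call such as `sustained_breach(error_rate_samples, 0.5, 120)` would express "error rate exceeds 0.5% for more than 2 minutes", with the error rate given in percent.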
Rollback procedure:
| Step | Action | Owner | Expected Duration |
|---|---|---|---|
| 1 | Switch traffic routing back to blue environment | Deployment lead | [30 seconds for LB / 1-5 min for DNS] |
| 2 | Verify traffic flowing to blue environment | On-call engineer | 2 min |
| 3 | Verify error rates returning to baseline | On-call engineer | 5 min |
| 4 | Announce rollback in [#deployments channel] | Deployment lead | 1 min |
| 5 | Create incident report (if customer impact occurred) | Deployment lead | Within 24 hours |
Post-Deployment Monitoring
| Metric | Baseline | Alert Threshold | Monitoring Duration |
|---|---|---|---|
| Error rate (5xx) | [e.g., 0.02%] | [e.g., > 0.1%] | [24 hours] |
| Latency P99 | [e.g., 180ms] | [e.g., > 500ms] | [24 hours] |
| CPU utilization | [e.g., 35%] | [e.g., > 80%] | [24 hours] |
| Memory utilization | [e.g., 60%] | [e.g., > 90%] | [24 hours] |
| Request throughput | [e.g., 2,400 rps] | [e.g., < 2,000 rps] | [24 hours] |
Stabilization period: [24-48 hours post-cutover before decommissioning old environment]
Old environment decommission date: [Date, typically 1-2 weeks after stable cutover]
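One detail in the monitoring table is easy to miss when automating it: most metrics alert when they rise above the threshold, but request throughput alerts when it falls below. The sketch below encodes that direction explicitly. It uses the table's "e.g." values as placeholder thresholds, and the metric keys are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Threshold(NamedTuple):
    limit: float
    alert_when_above: bool  # False means alert when the value drops below limit


# Placeholder thresholds copied from the table's "e.g." values.
THRESHOLDS: Dict[str, Threshold] = {
    "error_rate_pct": Threshold(0.1, True),
    "latency_p99_ms": Threshold(500, True),
    "cpu_pct": Threshold(80, True),
    "memory_pct": Threshold(90, True),
    "throughput_rps": Threshold(2000, False),
}


def breached_metrics(observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics that cross their alert threshold."""
    breached = []
    for name, threshold in THRESHOLDS.items():
        value = observed[name]
        if threshold.alert_when_above:
            if value > threshold.limit:
                breached.append(name)
        elif value < threshold.limit:
            breached.append(name)
    return breached
```

Feeding the function the baseline column (0.02% errors, 180 ms P99, 35% CPU, 60% memory, 2,400 rps) returns an empty list, which makes the baseline a convenient sanity check before the deployment starts.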
Filled Example: E-Commerce Checkout Service
Deployment Overview
| Field | Details |
|---|---|
| Service Name | checkout-service |
| Deployment Date | 2026-03-12, 14:00 UTC (Thursday, lowest traffic window) |
| Deployment Owner | Priya Sharma, Senior SRE |
| Release Version | v3.8.0 (new payment processor integration) |
| Previous Version | v3.7.4 |
| Estimated Cutover Time | 20 minutes |
| Rollback Time | 45 seconds (ALB target group swap) |
Why Blue-Green for This Release
The v3.8.0 release introduces a new payment processor (Adyen) alongside the existing Stripe integration. This is a high-risk change because payment processing failures directly affect revenue. The team chose blue-green over a canary release because the payment processor switch needs to be all-or-nothing per environment. Routing 10% of checkout traffic to a different payment processor would create reconciliation complexity that outweighs the risk-reduction benefit of a gradual rollout.
Database Migration (Expand-and-Contract)
| Step | Description | Reversible? | Timing |
|---|---|---|---|
| 1 | Add payment_processor column to orders table (nullable, default 'stripe') | Yes | Pre-deploy, 3 min |
| 2 | Add adyen_transaction_id column to payments table (nullable) | Yes | Pre-deploy, 2 min |
| 3 | Deploy v3.8.0 to green. Dual-write: populate both stripe_charge_id and adyen_transaction_id based on processor used | Yes | Deploy step |
| 4 | After 1 week stable: make payment_processor NOT NULL, remove dual-write fallback | No | Post-deploy release |
Cutover Results (Post-Deployment Notes)
| Metric | Baseline (Blue) | After Cutover (Green) | Status |
|---|---|---|---|
| Error rate | 0.018% | 0.022% | Within threshold |
| P99 latency | 165ms | 172ms | Within threshold |
| Payment success rate | 99.4% | 99.3% | Within threshold |
| Checkout conversion | 68.2% | 68.0% | Within threshold |
Outcome: Successful cutover. Blue environment kept running for 10 days as rollback target. Decommissioned on 2026-03-22 after payment reconciliation confirmed zero discrepancies.
Common Mistakes to Avoid
- Skipping the database migration strategy. A leading cause of failed blue-green deployments is database schema incompatibility between the two environments. If the new version requires a schema change that breaks the old version, you lose your rollback path. Always use expand-and-contract migrations.
- Using DNS with long TTLs for the traffic switch. If your DNS records have a 300-second TTL, your "instant" cutover can take 5 minutes or longer as resolver caches expire worldwide. For fast rollback, use load balancer target group swaps, which take effect within seconds, instead of DNS changes. If DNS is your only option, lower the TTL to 60 seconds at least 24 hours before the deployment.
- Not matching environment specifications. If blue runs on 4 instances and green runs on 2, you will see performance degradation after cutover and mistake it for an application problem. The two environments must be identical in compute, memory, and network configuration.
- Forgetting to drain in-flight requests. When switching traffic, existing requests on the old environment need time to complete. Set a connection draining timeout (typically 30-60 seconds) so active requests finish before the load balancer closes their connections to the old environment. The monitoring and alerting template covers health check design in more detail.
- Decommissioning the old environment too quickly. Keep the old environment running for at least 24-48 hours after cutover. Some issues only surface at scale during peak traffic periods that may not occur in your initial monitoring window.
Key Takeaways
- Blue-green deployments eliminate downtime by maintaining two identical production environments and switching traffic between them
- The database migration strategy is the most critical part of the plan. Use expand-and-contract patterns to maintain backward compatibility
- Use load balancer target group swaps instead of DNS changes for faster, more reliable cutover and rollback
- Define clear rollback criteria with specific metric thresholds before the deployment starts
- Keep the old environment running for at least 24-48 hours post-cutover to ensure a reliable rollback path
About This Template
Created by: Tim Adair
Last Updated: 3/5/2026
Version: 1.0.0
License: Free for personal and commercial use
