What This Template Is For
A canary release sends a new version of your product to a small percentage of users before rolling it out to everyone. If the canary group shows healthy metrics, you increase the percentage. If error rates spike or performance degrades, you roll back before most users are affected. This approach turns deployment from a binary "ship and hope" into a controlled, measurable process.
This template structures canary releases with traffic allocation stages, success and failure metrics, monitoring checkpoints, rollback procedures, and communication plans. It prevents the two most common canary failures: rolling out too fast (before enough data accumulates) and not defining rollback triggers upfront (so the team debates whether to roll back while users are affected).
Use this template for any change that carries risk: backend migrations, UI redesigns, pricing changes, algorithm updates, or infrastructure upgrades. It pairs naturally with the test strategy template for pre-deployment quality assurance. The Technical PM Handbook covers progressive delivery patterns from a PM perspective. For defining what to measure during the canary, the product metrics glossary provides standard metric definitions. Teams running feature flags can use this template to plan the rollout stages for each flag.
How to Use This Template
- Define the change being canaried. Be specific about which components, services, or features are in scope.
- Choose your canary population carefully. Random selection works for most cases. For geographic or segment-specific features, filter the canary group accordingly.
- Set traffic allocation stages. A typical progression is 1% → 5% → 25% → 50% → 100%. Each stage needs a minimum bake time (how long you wait before advancing).
- Define success metrics (what "healthy" looks like) and failure metrics (what triggers a rollback). These must be measurable, not subjective. "Looks fine" is not a success metric.
- Assign monitoring ownership. Someone must watch dashboards during each stage transition and be empowered to roll back without waiting for approval.
- Document the rollback procedure before the canary starts. When something goes wrong at 2am, you do not want to be figuring out how to revert.
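The advance decision described in these steps can be sketched in code. This is a minimal illustration, not a specific monitoring SDK; `StageMetrics` and the threshold names are placeholders you would wire to your own dashboards:

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    error_rate: float      # fraction of requests returning 5xx
    latency_p95_ms: float  # 95th-percentile latency in milliseconds
    bake_hours: float      # time elapsed at the current stage

def can_advance(m: StageMetrics, max_error_rate: float,
                max_p95_ms: float, min_bake_hours: float) -> bool:
    """Advance only when the minimum bake time has elapsed AND
    every success metric is inside its acceptable range."""
    return (m.bake_hours >= min_bake_hours
            and m.error_rate < max_error_rate
            and m.latency_p95_ms < max_p95_ms)
```

The key property is the AND: a healthy error rate before the bake time elapses is not enough, because short windows underrepresent slow-burning failures.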
The Template
Canary Release Overview
| Field | Details |
|---|---|
| Change Description | [What is being released] |
| Release Owner | [Name, on-call contact] |
| PM | [Name] |
| Engineering Lead | [Name] |
| Start Date | [Date and time, with timezone] |
| Target Full Rollout | [Date] |
| Feature Flag Name | [e.g., enable-new-checkout-v2] |
| Rollback Contact | [Name, phone, Slack channel] |
Pre-Canary Checklist
Complete before starting the canary rollout.
- ☐ Change passes all automated tests (unit, integration, E2E)
- ☐ QA signoff completed on staging environment
- ☐ Feature flag configured and tested (on/off toggle verified)
- ☐ Monitoring dashboards set up with baseline metrics
- ☐ Alerting rules configured for failure thresholds
- ☐ Rollback procedure documented and tested
- ☐ On-call engineer identified for the rollout window
- ☐ Communication sent to support team about the canary
- ☐ Database migrations (if any) are backward-compatible
- ☐ Rollback does not require data migration reversal
Traffic Allocation Stages
| Stage | Traffic % | User Count (est.) | Min Bake Time | Advance Criteria | Rollback Trigger |
|---|---|---|---|---|---|
| 0 | 0% (internal only) | [Team members] | [1 hour] | All functional tests pass, no errors in logs | Any functional failure |
| 1 | [1%] | [~X users] | [4 hours] | Error rate < [0.5%], latency P95 < [Xms] | Error rate > [2%] or latency P95 > [Xms] |
| 2 | [5%] | [~X users] | [12 hours] | Error rate < [0.5%], latency P95 < [Xms], no support tickets | Error rate > [1%] or latency P95 > [Xms] |
| 3 | [25%] | [~X users] | [24 hours] | All success metrics within acceptable range | Any success metric degrades > [X%] |
| 4 | [50%] | [~X users] | [24 hours] | All success metrics within acceptable range | Any success metric degrades > [X%] |
| 5 | [100%] | [All users] | [48 hours monitoring] | Full rollout stable, feature flag cleaned up | Emergency only: any failure metric breached |
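The stage progression above can also be encoded as data, so tooling (or a runbook script) always knows the next step. The percentages and bake times below mirror the example table and are placeholders, not recommendations:

```python
# Each stage: traffic percentage and minimum bake time in hours.
# Values mirror the template's example progression; substitute your own.
STAGES = [
    {"pct": 0,   "bake_h": 1},    # stage 0: internal only
    {"pct": 1,   "bake_h": 4},
    {"pct": 5,   "bake_h": 12},
    {"pct": 25,  "bake_h": 24},
    {"pct": 50,  "bake_h": 24},
    {"pct": 100, "bake_h": 48},   # post-rollout monitoring window
]

def next_stage(current_pct: int):
    """Return the next traffic percentage, or None at full rollout."""
    pcts = [s["pct"] for s in STAGES]
    i = pcts.index(current_pct)
    return pcts[i + 1] if i + 1 < len(pcts) else None
```

Keeping the stages in one config object means the dashboard, the alerting rules, and the release owner all agree on what "next" means.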
Success Metrics
Define what "healthy" looks like. Metrics should be measurable and have clear thresholds.
| Metric | Baseline (current) | Acceptable Range | Measurement Source | Check Frequency |
|---|---|---|---|---|
| Error rate (5xx) | [e.g., 0.1%] | [< 0.5%] | [e.g., Datadog, CloudWatch] | [Every 15 min] |
| API latency (P95) | [e.g., 320ms] | [< 500ms] | [APM tool] | [Every 15 min] |
| API latency (P99) | [e.g., 800ms] | [< 1200ms] | [APM tool] | [Every 15 min] |
| Conversion rate | [e.g., 3.2%] | [> 2.8%] | [Analytics] | [Every 4 hours] |
| Client-side errors (JS) | [e.g., 12/hour] | [< 25/hour] | [Sentry, LogRocket] | [Every 30 min] |
| Core Web Vitals (LCP) | [e.g., 1.8s] | [< 2.5s] | [RUM, Vercel Analytics] | [Every 4 hours] |
| [Custom metric] | [Baseline] | [Range] | [Source] | [Frequency] |
Failure Metrics and Rollback Triggers
If any of these thresholds are breached, execute the listed response immediately, without waiting for consensus.
| Trigger | Threshold | Response | Response Time |
|---|---|---|---|
| Error rate spike | > [2%] for > [5 minutes] | Immediate rollback | < 5 minutes |
| Latency degradation | P95 > [1000ms] for > [10 minutes] | Immediate rollback | < 5 minutes |
| Crash rate increase | > [0.5%] increase over baseline | Immediate rollback | < 5 minutes |
| Conversion drop | > [15%] drop vs. control group | Pause rollout, investigate | < 30 minutes |
| Support ticket spike | > [3x] normal rate for canary feature | Pause rollout, investigate | < 1 hour |
| Data integrity issue | Any data corruption or loss detected | Immediate rollback | < 5 minutes |
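Several triggers above require a breach to be sustained (e.g., "> 2% for > 5 minutes") so a single noisy data point does not force a rollback. A minimal sketch of that check, assuming metric samples arrive at a fixed interval:

```python
def sustained_breach(samples, threshold, window):
    """True only if the most recent `window` samples ALL exceed
    `threshold`. With samples taken once per minute, window=5
    implements a '> threshold for > 5 minutes' trigger and guards
    against rolling back on a single noisy reading."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])
```
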
Rollback Procedure
| Step | Action | Owner | Time Estimate |
|---|---|---|---|
| 1 | Set feature flag to 0% (disable for all users) | [On-call engineer] | [< 1 minute] |
| 2 | Verify flag change propagated (check 3 sample requests) | [On-call engineer] | [< 2 minutes] |
| 3 | Monitor error rate returning to baseline | [On-call engineer] | [5-10 minutes] |
| 4 | If flag rollback insufficient, revert deployment | [On-call engineer] | [5-15 minutes] |
| 5 | Post an incident notification in #engineering Slack | [Release owner] | [< 30 minutes] |
| 6 | Create incident ticket with root cause analysis | [Release owner] | [Within 24 hours] |
| 7 | Schedule post-mortem (if user impact occurred) | [Engineering lead] | [Within 48 hours] |
Monitoring Dashboard Checklist
Ensure these panels are visible on your monitoring dashboard before starting the canary.
- ☐ Error rate time series (canary vs. control, side by side)
- ☐ Latency percentiles (P50, P95, P99) for canary vs. control
- ☐ Request throughput (canary traffic volume)
- ☐ Feature flag activation count (confirms correct traffic split)
- ☐ Client-side error count (canary vs. control)
- ☐ Business metrics (conversion, revenue, engagement) for canary vs. control
- ☐ Infrastructure metrics (CPU, memory, queue depth) for canary hosts
- ☐ Deployment status (current version on canary vs. control)
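The "feature flag activation count" panel exists to confirm the traffic split actually matches the configured percentage. A simple tolerance check like the one below (illustrative thresholds, not a statistical test) catches misconfigured flags early:

```python
def split_ok(canary_reqs, total_reqs, target_pct, tolerance=0.25):
    """Confirm the observed canary share is within `tolerance`
    (relative) of the configured percentage. A 5% target with the
    default 25% tolerance accepts shares between 3.75% and 6.25%.
    For low traffic volumes, widen the tolerance or wait for more
    requests before judging."""
    if total_reqs == 0:
        return False
    observed_pct = 100 * canary_reqs / total_reqs
    return abs(observed_pct - target_pct) <= tolerance * target_pct
```
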
Communication Plan
| When | Who | Channel | Message |
|---|---|---|---|
| Before canary starts | Support team | [Slack #support] | [Brief description of change, what customers might notice, escalation path] |
| Each stage advance | Engineering team | [Slack #releases] | [Stage update: "Canary at X%, metrics healthy, advancing to Y%"] |
| On rollback | Engineering + Support + PM | [Slack #incidents] | [Rollback notification: what happened, user impact, next steps] |
| Full rollout complete | All stakeholders | [Slack #releases + email] | [Rollout complete, metrics summary, feature flag cleanup timeline] |
Post-Rollout Cleanup
- ☐ Feature flag removed from codebase (not just set to 100%)
- ☐ Old code path deleted
- ☐ Monitoring thresholds adjusted for new baseline
- ☐ Release notes published
- ☐ Canary metrics summary shared with team
- ☐ Lessons learned documented (what worked, what to improve)
Filled Example: New Pricing Page Rollout
Overview
| Field | Details |
|---|---|
| Change Description | Redesigned pricing page with new tier names, updated feature comparison table, annual discount badge |
| Release Owner | Marcus Chen, Senior Engineer |
| PM | Jordan Lee |
| Start Date | March 10, 2026 at 10:00 AM PDT |
| Target Full Rollout | March 17, 2026 |
| Feature Flag | pricing-page-v2 |
| Rollback Contact | Marcus Chen, +1-555-0199, #pricing-rollout Slack |
Traffic Stages (Filled)
| Stage | Traffic | Users | Bake Time | Advance | Rollback |
|---|---|---|---|---|---|
| 0 | Internal only | 15 team members | 2 hours | Manual QA pass | Any visual or functional bug |
| 1 | 2% | ~400 visitors/day | 6 hours | Error rate < 0.3%, no layout bugs | Error rate > 1% |
| 2 | 10% | ~2,000 visitors/day | 24 hours | Signup rate > 2.5% (baseline 2.8%) | Signup rate < 2.0% |
| 3 | 50% | ~10,000 visitors/day | 48 hours | Signup rate within 10% of baseline | Signup rate drops > 15% |
| 4 | 100% | All visitors | 72 hours monitoring | Stable metrics, positive qualitative feedback | Emergency: any metric breach |
Key Findings
Stage 1: Clean. No errors, layout rendered correctly across browsers.
Stage 2: Signup rate at 2.9% (above 2.8% baseline). Annual plan selection rate increased from 34% to 41%. Page load time increased by 120ms due to new comparison table. Acceptable.
Stage 3: Signup rate at 3.1%. Support received 2 tickets asking about removed "Starter" tier name. Updated FAQ section on pricing page to address tier renaming. Advanced to 100%.
Key Takeaways
- Start canaries at the smallest practical traffic percentage and increase gradually
- Define measurable success and failure metrics before the rollout begins
- Document the rollback procedure and test it before starting the canary
- Allow enough bake time at each stage to collect statistically meaningful data
- Clean up feature flags after full rollout. Stale flags accumulate as technical debt.
About This Template
Created by: Tim Adair
Last Updated: 3/5/2026
Version: 1.0.0
License: Free for personal and commercial use
