What This Template Is For
Containers package an application and its dependencies into a single, portable unit that runs consistently across development, staging, and production environments. The promise is simple: if it runs in a container on your laptop, it runs the same way in production. The reality requires planning. Teams that adopt containers without a strategy end up with inconsistent base images, no vulnerability scanning, sprawling container registries, and production incidents caused by misconfigured networking or resource limits.
This template helps product and engineering teams plan a containerization strategy that covers the full lifecycle: which runtime to use, how to build and manage images, how to handle secrets and configuration, how to scan for vulnerabilities, and how to prioritize which services to containerize first. It is designed for teams that are either adopting containers for the first time or formalizing an ad-hoc container setup into a production-grade platform.
For teams that have already containerized and are evaluating orchestration, the Kubernetes adoption template picks up where this template leaves off. For the broader context of infrastructure decisions, the Technical PM Handbook covers platform engineering strategy. To document individual infrastructure decisions, use the architecture decision record template.
How to Use This Template
- Start with the Current State Assessment. Before deciding where to go, document where you are. List all services, their current deployment method, and their containerization readiness.
- Complete the Container Runtime and Tooling section. Choose your runtime, build tool, and registry. These decisions cascade through everything else.
- Define your Image Management Standards. This is where most container strategies fail. Without consistent base images, tagging conventions, and vulnerability scanning, your container environment becomes harder to manage than the VMs it replaced.
- Plan the Security and Compliance section. Container security is not optional. Define scanning, secrets management, and runtime security policies before your first production container.
- Fill in the Migration Prioritization matrix. Not every service should be containerized at once. Prioritize based on value, complexity, and risk.
- Write the Networking and Storage strategy. Container networking is different from VM networking. Document your approach before services start communicating.
The Template
Strategy Overview
| Field | Details |
|---|---|
| Organization | [Team or company name] |
| Strategy Owner | [Name, title] |
| Target Timeline | [e.g., Q2-Q4 2026] |
| Current State | [e.g., 80% VM-based, 20% already containerized] |
| Goal | [e.g., 90% of production services containerized by Q4 2026] |
| Orchestration Target | [e.g., Kubernetes on EKS / Docker Compose / Nomad / ECS] |
Current State Assessment
Service Inventory
| Service | Language/Runtime | Current Deployment | Database Dependencies | Traffic (rps) | Container Ready? |
|---|---|---|---|---|---|
| [Service 1] | [e.g., Node.js 20] | [e.g., EC2 + PM2] | [e.g., PostgreSQL, Redis] | [e.g., 500] | [Yes / Partial / No] |
| [Service 2] | [e.g., Python 3.12] | [e.g., Lambda] | [e.g., DynamoDB] | [e.g., 200] | [Yes / Partial / No] |
| [Service 3] | [e.g., Java 21] | [e.g., ECS Fargate] | [e.g., MySQL, Elasticsearch] | [e.g., 1,200] | [Yes / Partial / No] |
| [Service 4] | [e.g., Go 1.22] | [e.g., bare metal] | [e.g., PostgreSQL] | [e.g., 3,000] | [Yes / Partial / No] |
Containerization Readiness Criteria
A service is "container ready" when it meets all of the following:
- ☐ Stateless or has externalized state (no local file system dependencies)
- ☐ Configuration via environment variables or mounted config files
- ☐ Logs written to stdout/stderr (not local files)
- ☐ Health check endpoint available (HTTP or TCP)
- ☐ Graceful shutdown handling (SIGTERM)
- ☐ No hard-coded hostnames or IP addresses
- ☐ Build process is automated and reproducible
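The SIGTERM item above can be sketched as a minimal shell entrypoint. This is an illustrative skeleton, not a prescribed implementation: the handler body is a placeholder for whatever drain logic your service actually needs (stop accepting requests, flush in-flight work, close connections).

```shell
#!/bin/sh
# Graceful-shutdown sketch: trap SIGTERM, mark the service as draining.
# A real handler would stop accepting new requests and wait for
# in-flight work before exiting.
SHUTDOWN=0
graceful_stop() {
  SHUTDOWN=1
  echo "SIGTERM received, draining..."
}
trap graceful_stop TERM

# Simulate the container runtime sending SIGTERM to PID 1:
kill -s TERM $$
```

Note that the signal must reach your process: if your ENTRYPOINT wraps the app in a shell without `exec`, SIGTERM may never be delivered to the application itself.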
Container Runtime and Tooling
Runtime Selection
| Option | Pros | Cons | Team Decision |
|---|---|---|---|
| Docker (containerd) | Industry standard, mature tooling, largest community | Daemon-based, larger attack surface | [Selected / Rejected] |
| Podman | Daemonless, rootless by default, Docker CLI compatible | Smaller community, some Docker Compose gaps | [Selected / Rejected] |
| containerd (standalone) | Lightweight, Kubernetes native, no Docker overhead | Less developer-friendly CLI | [Selected / Rejected] |
Selected runtime: [e.g., Docker with containerd backend]
Rationale: [2-3 sentences explaining the choice]
Build Tooling
| Tool | Use Case | Decision |
|---|---|---|
| Docker Build (BuildKit) | Standard image builds, multi-stage, layer caching | [Use / Skip] |
| Buildpacks (Paketo/Heroku) | Auto-detect language, generate images without Dockerfiles | [Use / Skip] |
| Kaniko | In-cluster builds (no Docker daemon needed in CI) | [Use / Skip] |
| Bazel / Buck | Monorepo builds with granular caching | [Use / Skip] |
Selected build tool: [e.g., Docker Build with BuildKit, Kaniko for CI pipelines]
Container Registry
| Option | Features | Cost Model | Decision |
|---|---|---|---|
| Amazon ECR | AWS-native, IAM integration, vulnerability scanning | Per-GB storage + transfer | [Selected / Rejected] |
| Google Artifact Registry | GCP-native, multi-format, vulnerability scanning | Per-GB storage + transfer | [Selected / Rejected] |
| Docker Hub | Universal compatibility, public images | Free tier + per-seat pricing | [Selected / Rejected] |
| GitHub Container Registry | GitHub Actions integration, GHCR packages | Free for public, included in Enterprise | [Selected / Rejected] |
| Self-hosted (Harbor) | Full control, policy engine, replication | Infrastructure + maintenance cost | [Selected / Rejected] |
Selected registry: [e.g., Amazon ECR for production, Docker Hub for public base images]
Image Management Standards
Base Image Policy
| Language/Runtime | Approved Base Image | Tag | Update Frequency |
|---|---|---|---|
| Node.js | [e.g., node:20-slim] | [Pinned digest or version] | [Monthly] |
| Python | [e.g., python:3.12-slim-bookworm] | [Pinned digest] | [Monthly] |
| Java | [e.g., eclipse-temurin:21-jre-jammy] | [Pinned digest] | [Monthly] |
| Go | [e.g., gcr.io/distroless/static-debian12] | [Pinned digest] | [Quarterly] |
| Generic | [e.g., ubuntu:24.04] | [Pinned digest] | [Monthly] |
Base image rules:
- ☐ All production images use approved base images from this table
- ☐ Base images are pinned to a specific digest (not just a tag like `latest`)
- ☐ Base image updates are tested in staging before rolling to production
- ☐ Custom base images (if any) are rebuilt and scanned weekly
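A digest-pinned FROM line looks like the sketch below. The sha256 value is a placeholder, not a real `node:20-slim` digest; resolve the actual digest from your registry or with `docker buildx imagetools inspect`.

```dockerfile
# Keep the tag for human readability, but pin to the digest so the
# build is immutable. The sha256 below is a PLACEHOLDER -- substitute
# the digest of the image you actually vetted.
FROM node:20-slim@sha256:0000000000000000000000000000000000000000000000000000000000000000
```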
Tagging Convention
| Tag Pattern | Example | Purpose |
|---|---|---|
| Git SHA | checkout-svc:a1b2c3d | Immutable, traceable to exact commit |
| Semantic version | checkout-svc:3.8.0 | Human-readable release version |
| Environment | checkout-svc:staging | Mutable, points to current env deploy |
| Latest | checkout-svc:latest | Development only, never in production |
Tagging rules:
- ☐ Production images tagged with both git SHA and semantic version
- ☐ The `latest` tag is prohibited in production manifests
- ☐ Tags are immutable once pushed (no overwriting existing tags)
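The dual-tag rule can be applied in CI roughly as follows. The image name and values are illustrative; in a real pipeline the SHA comes from `git rev-parse --short HEAD` and the version from your release process.

```shell
# Derive both production tags for one image build. Values are
# hardcoded here for illustration; CI would compute them.
IMAGE="checkout-svc"
GIT_SHA="a1b2c3d"
VERSION="3.8.0"

SHA_TAG="${IMAGE}:${GIT_SHA}"
VER_TAG="${IMAGE}:${VERSION}"

# In the pipeline, both tags point at the same image:
#   docker build -t "$SHA_TAG" .
#   docker tag "$SHA_TAG" "$VER_TAG"
#   docker push "$SHA_TAG" && docker push "$VER_TAG"
echo "$SHA_TAG $VER_TAG"
```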
Image Size Targets
| Service Type | Target Size | Technique |
|---|---|---|
| Go services | < 20 MB | Multi-stage build, distroless base |
| Node.js services | < 150 MB | Multi-stage build, slim base, .dockerignore |
| Python services | < 200 MB | Multi-stage build, slim base, no dev deps |
| Java services | < 250 MB | Multi-stage build, JRE-only base, jlink |
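For the Go row, the multi-stage technique looks roughly like this. The module layout and binary path are hypothetical; adapt them to your repository.

```dockerfile
# Stage 1: build with the full Go toolchain.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
# CGO disabled so the binary is static and runs on distroless/static.
RUN CGO_ENABLED=0 go build -o /out/server ./cmd/server

# Stage 2: ship only the compiled binary on a distroless base.
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/server /server
USER nonroot
ENTRYPOINT ["/server"]
```

The final image contains no shell, package manager, or toolchain, which is what keeps it under the 20 MB target.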
Security and Compliance
Vulnerability Scanning
| Stage | Tool | Severity Threshold | Action on Failure |
|---|---|---|---|
| Build (CI) | [e.g., Trivy, Snyk Container] | [e.g., Block on Critical/High] | [Block merge / Warning] |
| Registry | [e.g., ECR scanning, Harbor] | [e.g., Alert on Medium+] | [Alert security team] |
| Runtime | [e.g., Falco, Aqua] | [e.g., Terminate on exploit attempt] | [Kill pod, alert, audit log] |
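As one sketch of the build-stage gate, a Trivy scan in a generic GitHub Actions job might look like this (job name and image name are placeholders, and the step assumes Trivy is installed on the runner). `--exit-code 1` fails the job on any finding at or above the listed severities, which blocks the merge.

```yaml
# Illustrative CI fragment: fail the pipeline on Critical/High CVEs.
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - run: trivy image --exit-code 1 --severity CRITICAL,HIGH checkout-svc:${{ github.sha }}
```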
Secrets Management
| Secret Type | Storage | Injection Method |
|---|---|---|
| Database credentials | [e.g., AWS Secrets Manager] | [e.g., Sidecar, init container, env var from secret store] |
| API keys | [e.g., HashiCorp Vault] | [e.g., Vault Agent injector] |
| TLS certificates | [e.g., cert-manager, ACM] | [e.g., Mounted volume] |
| Environment config | [e.g., ConfigMaps, Parameter Store] | [e.g., Environment variables] |
Security rules:
- ☐ No secrets baked into images (no hardcoded credentials in Dockerfiles)
- ☐ All images run as non-root user
- ☐ Read-only root filesystem where possible
- ☐ Resource limits (CPU, memory) set on all containers
- ☐ Network policies restrict inter-container communication to explicit allow-lists
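The non-root rule from the checklist, sketched for a Debian-based image (user and group names are arbitrary):

```dockerfile
# Create an unprivileged user and drop root before the app starts.
FROM node:20-slim
RUN groupadd --system app && useradd --system --gid app --create-home app
WORKDIR /home/app
COPY --chown=app:app . .
USER app
CMD ["node", "server.js"]
```

Test the image under this user before shipping: applications that write to privileged paths or bind low ports will surface the failure immediately.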
Networking Strategy
| Concern | Approach | Details |
|---|---|---|
| Service discovery | [e.g., Kubernetes DNS, Consul, AWS Cloud Map] | [How services find each other] |
| Load balancing | [e.g., Kubernetes Service, ALB Ingress, Envoy] | [L4/L7, internal/external] |
| Ingress / API Gateway | [e.g., NGINX Ingress, AWS API Gateway, Traefik] | [External traffic routing] |
| Service mesh | [e.g., Istio, Linkerd, None] | [mTLS, traffic management, observability] |
| Network policies | [e.g., Calico, Cilium NetworkPolicy] | [Inter-pod/container traffic rules] |
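If the team lands on Kubernetes NetworkPolicy for the last row, an allow-list policy looks roughly like the sketch below. Namespace, labels, and port are hypothetical; pair this with a default-deny policy so that anything not explicitly allowed is blocked.

```yaml
# Allow only api-gateway pods to reach checkout-svc on port 8080;
# all other ingress to checkout-svc pods is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: checkout-allow-gateway
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: checkout-svc
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
```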
For teams managing API infrastructure alongside containerization, the API gateway template covers API routing patterns in detail.
Migration Prioritization
| Service | Business Value | Complexity | Risk | Dependencies | Priority | Target Sprint |
|---|---|---|---|---|---|---|
| [Service 1] | [High/Med/Low] | [High/Med/Low] | [High/Med/Low] | [List blockers] | [P0/P1/P2] | [Sprint N] |
| [Service 2] | [High/Med/Low] | [High/Med/Low] | [High/Med/Low] | [List blockers] | [P0/P1/P2] | [Sprint N] |
| [Service 3] | [High/Med/Low] | [High/Med/Low] | [High/Med/Low] | [List blockers] | [P0/P1/P2] | [Sprint N] |
Prioritization criteria:
- High business value: Frequent deployments, scaling bottlenecks, or developer productivity impact
- Low complexity: Stateless, 12-factor compliant, minimal dependencies
- Low risk: Non-critical path, easy to test, rollback is straightforward
Recommended migration order:
- Start with stateless, low-risk services to build team confidence and tooling
- Move to high-value services where containerization enables deployment frequency or scaling
- Tackle stateful services last (databases, message queues, file storage)
Resource Limits and Requests
| Service | CPU Request | CPU Limit | Memory Request | Memory Limit | Replicas |
|---|---|---|---|---|---|
| [Service 1] | [e.g., 250m] | [e.g., 500m] | [e.g., 256Mi] | [e.g., 512Mi] | [e.g., 3] |
| [Service 2] | [e.g., 500m] | [e.g., 1000m] | [e.g., 512Mi] | [e.g., 1Gi] | [e.g., 2] |
| [Service 3] | [e.g., 100m] | [e.g., 250m] | [e.g., 128Mi] | [e.g., 256Mi] | [e.g., 5] |
Resource rules:
- ☐ All containers have both requests and limits set
- ☐ Memory limits are at most 2x requests (prevents OOM thrashing)
- ☐ CPU limits are at most 2x requests (prevents CPU throttling surprises)
- ☐ Horizontal Pod Autoscaler configured for services with variable load
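In a Kubernetes manifest, the first example row above translates to a container `resources` block like this fragment (values carried over from the table):

```yaml
# Limits are exactly 2x requests, the ceiling set by the rules above.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```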
Filled Example: B2B SaaS Platform (TaskFlow)
Strategy Overview
| Field | Details |
|---|---|
| Organization | TaskFlow Engineering |
| Strategy Owner | David Park, Staff Platform Engineer |
| Target Timeline | Q2-Q3 2026 |
| Current State | 12 services: 4 containerized (ECS), 6 on EC2 (PM2/systemd), 2 on Lambda |
| Goal | 10 of 12 services containerized on EKS by end of Q3 (keep 2 Lambda functions) |
| Orchestration Target | Amazon EKS (Kubernetes 1.29) |
Migration Wave Plan
Wave 1 (Q2, Sprint 1-3): notification-service, webhook-processor, report-generator. All stateless, low traffic, minimal dependencies. Purpose: validate tooling, CI/CD pipeline, monitoring.
Wave 2 (Q2, Sprint 4-6): api-gateway, auth-service, search-service. Core services with higher traffic. Purpose: validate scaling, networking, load balancing.
Wave 3 (Q3, Sprint 7-9): billing-service, analytics-pipeline, admin-api, file-processor. Complex services with database dependencies and stateful processing.
Not migrating: Two Lambda functions (thumbnail-generator, email-sender) remain on Lambda. They are event-driven, bursty workloads where serverless is the better fit. See the capacity planning template for the analysis that informed this decision.
Image Standards Applied
All services use multi-stage Dockerfiles with the pattern:
- Stage 1: Full SDK image for building
- Stage 2: Slim runtime image with only the compiled artifact
- All images scan clean on Trivy (zero Critical, zero High)
- Average image size reduced from 800MB (EC2 AMI) to 120MB (container)
Common Mistakes to Avoid
- Containerizing without changing the deployment pipeline. Putting your app in a Docker image but still deploying it by SSHing into a server and running `docker pull` misses the point. Containers enable immutable, automated deployments. Invest in CI/CD automation alongside containerization.
- Ignoring image size. A 2GB Docker image can add 30 seconds of pull time on a fresh node, which means 30 seconds of cold start delay. Use multi-stage builds, slim base images, and `.dockerignore` files to keep images small. Every megabyte matters for scaling speed.
- Running containers as root. The default Docker behavior runs processes as root inside the container. If a container is compromised, the attacker has root access. Always add `USER nonroot` to your Dockerfiles and test that the application works without root privileges.
- Storing state inside containers. Containers are ephemeral. Any data written to the container filesystem is lost when the container restarts. Externalize all state to databases, object storage, or mounted volumes.
- Skipping resource limits. A container without memory limits can consume all available memory on the host and crash other containers. A container without CPU limits can starve neighbors. Always set both requests and limits.
Key Takeaways
- Assess containerization readiness before migrating. Services need externalized state, environment-based configuration, and health check endpoints
- Standardize base images, tagging conventions, and vulnerability scanning before your first production container
- Migrate in waves, starting with stateless low-risk services. Build tooling confidence before tackling critical-path services
- Set resource limits on every container. Unbounded containers cause cascading failures
- Not every workload belongs in a container. Evaluate serverless and VM alternatives for each service independently
About This Template
Created by: Tim Adair
Last Updated: 3/5/2026
Version: 1.0.0
License: Free for personal and commercial use
