What This Template Is For
Chaos engineering is the practice of intentionally injecting failures into a system to verify that it handles them gracefully. The premise is straightforward: if you do not know how your system behaves when a database fails over, a network partition occurs, or a third-party API goes down, you will find out during a real incident, when the stakes are highest and the pressure is greatest.
Most teams discover their resilience gaps during production incidents. Chaos engineering reverses the order. You choose the failure, control the blast radius, and observe the result in a safe environment first. This template structures each chaos experiment as a testable hypothesis with defined scope, safety controls, and success criteria. It prevents the common failure mode of chaos engineering: running experiments without a plan and causing the very outages you were trying to prevent.
Use this alongside the Service Reliability Template to connect chaos findings back to your SLO targets. The Disaster Recovery Template covers the broader recovery procedures that chaos experiments validate. The Incident Response Template provides the runbook format for handling actual failures once you have identified gaps. The Technical PM Handbook explains how product managers should evaluate the risk-vs-reliability trade-offs that chaos experiments inform.
When to Use This Template
- You are preparing for a reliability milestone (moving from 99.9% to 99.95% uptime)
- A production incident revealed a failure mode that was not handled gracefully
- You are deploying a new service and want to verify its failure behavior before production traffic
- You are onboarding a new team and want to build confidence in the system's resilience
- A dependency (database, message queue, third-party API) has changed and you want to re-validate failover behavior
- Compliance or audit requirements mandate documented resilience testing
How to Use This Template
- Start with the Hypothesis. State what you expect to happen when a specific failure is introduced.
- Define the blast radius. Limit the experiment to a specific service, traffic percentage, or environment.
- Set abort conditions. Define the signals that require immediate experiment termination.
- Run the experiment. Inject the failure, observe the system, and record the results.
- Document findings. Compare actual behavior to the hypothesis. Log any surprises.
- Create follow-up actions. If the system did not behave as expected, file tickets to fix the gaps.
The Template
Experiment Overview
| Field | Details |
|---|---|
| Experiment Name | [Descriptive name, e.g., "Database primary failover under load"] |
| Author | [Engineer name] |
| Reviewer | [SRE lead or senior engineer] |
| Date | [Planned execution date] |
| Status | Planned / Approved / Running / Completed / Aborted |
| Environment | Staging / Production (canary) / Production (full) |
| Related Incident | [Link to incident that motivated this experiment, or N/A] |
Hypothesis
Steady state. [Describe normal system behavior with specific metrics. Example: "The checkout API handles 400 rps with p95 latency under 200ms and error rate under 0.1%."]
Injection. [Describe the failure you will introduce. Example: "Kill the primary database instance, forcing automatic failover to the replica."]
Expected behavior. [What should happen? Example: "The application detects the failover within 5 seconds. Queries fail during the failover window (estimated 3-10 seconds). After failover, the system returns to steady state within 30 seconds. No data loss. Error rate spike is transient and does not trigger customer-facing error pages."]
Null hypothesis. [What would disprove your hypothesis? Example: "The system does not recover within 60 seconds, or data is lost, or the error rate remains elevated after failover completes."]
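When steady state is defined by concrete metrics, it is worth scripting the check so the baseline is measured the same way every time. Below is a minimal sketch, assuming the metrics are exposed through a Prometheus-style query API; the endpoint, metric names, and thresholds are placeholders for this template, not a prescribed setup.

```python
# steady_state_check.py - verify steady-state metrics before injecting a failure.
# Assumes a Prometheus-compatible HTTP query API; the endpoint, PromQL expressions,
# and thresholds below are placeholders, not a specific production configuration.
import sys
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical endpoint

CHECKS = [
    # (PromQL expression, comparison, threshold)
    ('sum(rate(http_requests_total{service="checkout"}[5m]))', ">=", 400.0),
    ('histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))', "<=", 0.200),
    ('sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))'
     ' / sum(rate(http_requests_total{service="checkout"}[5m]))', "<=", 0.001),
]

def query(expr: str) -> float:
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError(f"no data returned for query: {expr}")
    return float(result[0]["value"][1])

def main() -> int:
    ok = True
    for expr, op, threshold in CHECKS:
        value = query(expr)
        passed = value >= threshold if op == ">=" else value <= threshold
        print(f"{'PASS' if passed else 'FAIL'}  {value:.4f} {op} {threshold}  ({expr})")
        ok = ok and passed
    return 0 if ok else 1  # a non-zero exit code blocks the experiment

if __name__ == "__main__":
    sys.exit(main())
```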
Blast Radius
| Dimension | Scope |
|---|---|
| Services affected | [List specific services] |
| Traffic percentage | [X% of production traffic, or staging only] |
| Users affected | [Internal only / X% of users / Specific region] |
| Duration | [X minutes maximum] |
| Time window | [Day and time, e.g., Tuesday 2pm-3pm ET (low traffic)] |
Abort Conditions
If any of these conditions are met, the experiment must be terminated immediately:
- ☐ Error rate exceeds [X%] for more than [X seconds]
- ☐ p95 latency exceeds [X ms] for more than [X seconds]
- ☐ Any downstream service reports a cascading failure
- ☐ Data integrity check fails (missing writes, duplicate records)
- ☐ On-call receives a page from a monitor not related to the experiment
- ☐ [Custom condition specific to this experiment]
Abort procedure. [How to stop the experiment. Example: "Run `chaos-monkey stop experiment-123` or kill the fault injection process. If automated abort fails, manually restart the affected service."]
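Abort conditions only help if something is actively watching them while the failure is injected. Below is a minimal watcher sketch that polls an error-rate metric and runs the stop command when a pre-agreed threshold is breached. It assumes a Prometheus-style query API and reuses the `chaos-monkey stop` example above; the query, thresholds, and command are placeholders.

```python
# abort_watcher.py - poll an error-rate metric during the experiment and trigger
# the abort procedure when the pre-agreed threshold is breached for too long.
# The metric query, thresholds, and stop command are placeholders for this template.
import subprocess
import time

import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[1m]))'
    ' / sum(rate(http_requests_total[1m]))'
)
ERROR_RATE_LIMIT = 0.05   # abort if error rate exceeds 5%...
BREACH_SECONDS = 30       # ...for more than 30 consecutive seconds
POLL_INTERVAL = 5
STOP_COMMAND = ["chaos-monkey", "stop", "experiment-123"]  # example from the abort procedure

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def watch() -> None:
    breached_since = None
    while True:
        rate = error_rate()
        if rate > ERROR_RATE_LIMIT:
            breached_since = breached_since or time.monotonic()
            if time.monotonic() - breached_since >= BREACH_SECONDS:
                print(f"ABORT: error rate {rate:.2%} above limit for {BREACH_SECONDS}s")
                subprocess.run(STOP_COMMAND, check=True)
                return
        else:
            breached_since = None
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    watch()
```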
Prerequisites Checklist
- ☐ Experiment reviewed and approved by [SRE lead / engineering manager]
- ☐ On-call engineer notified and aware of the experiment window
- ☐ Monitoring dashboards open (list specific dashboards)
- ☐ Abort procedure tested and verified
- ☐ Rollback procedure tested and verified
- ☐ Communication sent to affected teams (if production)
- ☐ Recent backup verified (if experiment involves data services)
- ☐ Steady-state baseline measured within the last 24 hours
Experiment Execution
Step 1: Baseline measurement.
| Metric | Baseline Value | Measured At |
|---|---|---|
| [Request rate] | [X rps] | [Timestamp] |
| [p95 latency] | [X ms] | [Timestamp] |
| [Error rate] | [X%] | [Timestamp] |
| [Custom metric] | [X] | [Timestamp] |
Step 2: Inject failure.
| Time | Action | Observer |
|---|---|---|
| [T+0] | [Inject failure: describe exact command or action] | [Name] |
| [T+Xs] | [Observe: what metrics to watch] | [Name] |
| [T+Xs] | [Verify: check specific behavior] | [Name] |
| [T+Xs] | [End: stop injection or allow recovery] | [Name] |
Step 3: Recovery measurement.
| Metric | During Injection | After Recovery | Recovery Time |
|---|---|---|---|
| [Request rate] | [X rps] | [X rps] | [X seconds] |
| [p95 latency] | [X ms] | [X ms] | [X seconds] |
| [Error rate] | [X%] | [X%] | [X seconds] |
Results
Hypothesis confirmed? [Yes / Partially / No]
Summary. [2-3 sentences describing what happened.]
Surprises. [List anything that did not match the hypothesis.]
Artifacts. [Links to dashboards, logs, screenshots captured during the experiment.]
Follow-Up Actions
| Finding | Severity | Action | Owner | Ticket | Due Date |
|---|---|---|---|---|---|
| [What was discovered] | [Critical / High / Medium / Low] | [What needs to change] | [Name] | [Link] | [Date] |
Filled Example: Database Primary Failover Under Load
Experiment Overview
| Field | Details |
|---|---|
| Experiment Name | Database primary failover under sustained checkout traffic |
| Author | Sarah Chen |
| Reviewer | Marcus Rivera (SRE Lead) |
| Date | 2026-03-12 |
| Status | Completed |
| Environment | Staging (production-equivalent traffic replay) |
| Related Incident | INC-2024-089: 4-minute checkout outage during unplanned DB failover |
Hypothesis
Steady state. The checkout API handles 400 rps with p95 latency under 200ms and error rate under 0.1%. Background jobs (order processing, notifications) complete within 30 seconds of submission.
Injection. Kill the Postgres primary instance (RDS), forcing automatic failover in which the read replica is promoted to primary. DNS endpoint update propagation takes 15-30 seconds per AWS documentation.
Expected behavior. Connection pool detects stale connections within 5 seconds. Retry logic handles transient failures during the 15-30 second DNS propagation window. Error rate spikes to 5-10% during failover, then returns to baseline within 60 seconds. No committed transactions are lost. Background jobs retry automatically and complete within 2 minutes of failover.
Null hypothesis. The system does not recover within 120 seconds, committed transactions are lost, or the connection pool does not recover without a manual restart.
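For this injection, the managed way to approximate killing the primary is a forced reboot-with-failover. Below is a minimal sketch using boto3, assuming a Multi-AZ RDS instance; the identifier `checkout-db` is hypothetical, and a forced reboot only approximates an abrupt instance kill.

```python
# inject_failover.py - force an RDS Multi-AZ failover for the injection step.
# A sketch assuming boto3 credentials are configured and the instance is Multi-AZ;
# the instance identifier "checkout-db" is hypothetical, and a forced
# reboot-with-failover is an approximation of killing the primary outright.
import time

import boto3

rds = boto3.client("rds")
INSTANCE_ID = "checkout-db"  # hypothetical identifier

start = time.monotonic()
print(f"T+0s: forcing failover of {INSTANCE_ID}")
rds.reboot_db_instance(DBInstanceIdentifier=INSTANCE_ID, ForceFailover=True)

# Wait until the instance reports "available" again, then record the window.
# DNS propagation to clients is measured separately from this status check.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier=INSTANCE_ID)
elapsed = time.monotonic() - start
print(f"T+{elapsed:.0f}s: {INSTANCE_ID} reports available")
```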
Blast Radius
| Dimension | Scope |
|---|---|
| Services affected | Checkout API, Order Processor, Notification Service |
| Traffic percentage | Staging (replayed production traffic at 80% volume) |
| Users affected | None (staging) |
| Duration | 10 minutes maximum |
| Time window | Wednesday 3pm-4pm ET |
Results
Hypothesis confirmed? Partially.
Summary. Failover completed in 22 seconds. The checkout API recovered, but the connection pool in the Order Processor service did not detect stale connections. It continued attempting queries on dead connections for 4 minutes until the connection pool's max lifetime timer triggered a refresh. During those 4 minutes, all background order processing was stalled.
Surprises.
- Order Processor connection pool (HikariCP) had `maxLifetime` set to 30 minutes. Stale connections were not detected until the pool cycled them out.
- The Notification Service recovered in 8 seconds because it uses a different connection library with built-in health checks.
- One committed transaction was replayed (duplicate order confirmation email) because the retry logic did not check idempotency keys (see the sketch below).
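The duplicate confirmation email is the gap behind ENG-4522 in the follow-up actions. The usual pattern is to record an idempotency key before dispatch and treat a repeated key as already handled. Below is a minimal sketch of that pattern, assuming a Redis-backed dedup store; the store, key format, and function names are assumptions, not the service's actual code.

```python
# idempotent_dispatch.py - skip duplicate notification sends after a retry.
# A sketch of the idempotency-key pattern behind ENG-4522; the Redis dedup store
# and function names are assumptions, not the Notification Service's actual code.
import redis

dedup = redis.Redis(host="redis.internal", port=6379)  # hypothetical dedup store
DEDUP_TTL_SECONDS = 7 * 24 * 3600  # keep keys long enough to cover any retry window

def send_order_confirmation(order_id: str, email: str) -> bool:
    """Send at most one confirmation per order, even if the caller retries."""
    key = f"notify:order-confirmation:{order_id}"
    # SET NX succeeds only for the first caller; later retries see a falsy result.
    first_attempt = dedup.set(key, "sent", nx=True, ex=DEDUP_TTL_SECONDS)
    if not first_attempt:
        return False  # already dispatched; treat the retry as a no-op success
    # Note: if delivery fails after the key is set, a compensating cleanup or
    # a pending/sent state transition would be needed; omitted for brevity.
    deliver_email(email, subject=f"Order {order_id} confirmed")
    return True

def deliver_email(to: str, subject: str) -> None:
    # Placeholder for the real email-provider call.
    print(f"sending '{subject}' to {to}")
```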
Follow-Up Actions
| Finding | Severity | Action | Owner | Ticket | Due Date |
|---|---|---|---|---|---|
| Order Processor stale connections | Critical | Add a connection validation query (SELECT 1) on checkout from the pool | James Park | ENG-4521 | 2026-03-19 |
| Duplicate notification on retry | High | Add idempotency key check to notification dispatch | Lisa Wong | ENG-4522 | 2026-03-19 |
| No failover runbook for Order Processor | Medium | Document manual connection pool reset procedure | Sarah Chen | OPS-891 | 2026-03-26 |
Key Takeaways
- State a falsifiable hypothesis before every experiment. "Let's see what happens" is not chaos engineering. "The system recovers within 60 seconds with no data loss" is a testable hypothesis.
- Start in staging, graduate to production. Run every new experiment type in staging first. Only move to production after the system passes in staging and you have verified abort procedures.
- Define abort conditions before you start. When the experiment is running and metrics are spiking, you need pre-agreed thresholds for stopping. Deciding in the moment leads to either premature aborts or prolonged outages.
- The value is in the surprises. If every chaos experiment confirms the hypothesis perfectly, either your system is unusually resilient or your experiments are not aggressive enough. The most useful experiments are the ones that reveal unexpected failure modes.
- Create follow-up tickets immediately. A chaos experiment that discovers a gap but does not result in a fix is wasted effort. Log findings as tickets with owners and due dates before closing the experiment.
