
Chaos Engineering Experiment Template

Plan chaos engineering experiments with hypothesis, blast radius, abort conditions, and rollback procedures.

Last updated 2026-03-05

What This Template Is For

Chaos engineering is the practice of intentionally injecting failures into a system to verify that it handles them gracefully. The premise is straightforward: if you do not know how your system behaves when a database fails over, a network partition occurs, or a third-party API goes down, you will find out during a real incident when the stakes are highest and the pressure is worst.

Most teams discover their resilience gaps during production incidents. Chaos engineering reverses the order. You choose the failure, control the blast radius, and observe the result in a safe environment first. This template structures each chaos experiment as a testable hypothesis with defined scope, safety controls, and success criteria. It prevents the common failure mode of chaos engineering: running experiments without a plan and causing the very outages you were trying to prevent.

Use this alongside the Service Reliability Template to connect chaos findings back to your SLO targets. The Disaster Recovery Template covers the broader recovery procedures that chaos experiments validate. The Incident Response Template provides the runbook format for handling actual failures once you have identified gaps. The Technical PM Handbook explains how product managers should evaluate the risk-vs-reliability trade-offs that chaos experiments inform.


When to Use This Template

  • You are preparing for a reliability milestone (e.g., moving from 99.9% to 99.95% uptime)
  • A production incident revealed a failure mode that was not handled gracefully
  • You are deploying a new service and want to verify its failure behavior before production traffic
  • You are onboarding a new team and want to build confidence in the system's resilience
  • A dependency (database, message queue, third-party API) has changed and you want to re-validate failover behavior
  • Compliance or audit requirements mandate documented resilience testing

How to Use This Template

  1. Start with the Hypothesis. State what you expect to happen when a specific failure is introduced.
  2. Define the blast radius. Limit the experiment to a specific service, traffic percentage, or environment.
  3. Set abort conditions. Define the signals that require immediate experiment termination.
  4. Run the experiment. Inject the failure, observe the system, and record the results.
  5. Document findings. Compare actual behavior to the hypothesis. Log any surprises.
  6. Create follow-up actions. If the system did not behave as expected, file tickets to fix the gaps.
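
The six steps above can be sketched as a small planning object that refuses to run until the safety fields are filled in. This is an illustrative structure only, assuming no particular chaos tool; every name here (ChaosExperiment, is_runnable, etc.) is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str            # step 1: expected behavior, stated up front
    blast_radius: str          # step 2: e.g. "staging only" or "5% of canary traffic"
    abort_conditions: list[str] = field(default_factory=list)  # step 3
    findings: list[str] = field(default_factory=list)          # steps 5-6

    def is_runnable(self) -> bool:
        """Refuse to run (step 4) until hypothesis, scope, and aborts are defined."""
        return bool(self.hypothesis and self.blast_radius and self.abort_conditions)

exp = ChaosExperiment(
    name="db-failover",
    hypothesis="System recovers within 60s with no data loss",
    blast_radius="staging only",
)
assert not exp.is_runnable()    # no abort conditions yet, so not safe to run
exp.abort_conditions.append("error rate > 5% for 30s")
assert exp.is_runnable()
```

The point of the guard is that "run the experiment" is the fourth step, not the first: the object models the discipline of writing the hypothesis and safety controls down before any failure is injected.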

The Template

Experiment Overview

| Field | Details |
| --- | --- |
| Experiment Name | [Descriptive name, e.g., "Database primary failover under load"] |
| Author | [Engineer name] |
| Reviewer | [SRE lead or senior engineer] |
| Date | [Planned execution date] |
| Status | Planned / Approved / Running / Completed / Aborted |
| Environment | Staging / Production (canary) / Production (full) |
| Related Incident | [Link to incident that motivated this experiment, or N/A] |

Hypothesis

Steady state. [Describe normal system behavior with specific metrics. Example: "The checkout API handles 400 rps with p95 latency under 200ms and error rate under 0.1%."]

Injection. [Describe the failure you will introduce. Example: "Kill the primary database instance, forcing automatic failover to the replica."]

Expected behavior. [What should happen? Example: "The application detects the failover within 5 seconds. Queries fail during the failover window (estimated 3-10 seconds). After failover, the system returns to steady state within 30 seconds. No data loss. Error rate spike is transient and does not trigger customer-facing error pages."]

Null hypothesis. [What would disprove your hypothesis? Example: "The system does not recover within 60 seconds, or data is lost, or the error rate remains elevated after failover completes."]
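
A steady-state definition is only useful if it is checkable. As a minimal sketch, the example thresholds above can be turned into an executable predicate; the metrics dict is a hypothetical stand-in for whatever your monitoring API (Prometheus, Datadog, etc.) returns.

```python
# Thresholds mirror the example steady state: 400 rps, p95 < 200ms, errors < 0.1%.
STEADY_STATE = {
    "rps_min": 400,           # sustained request rate floor
    "p95_ms_max": 200,        # p95 latency ceiling
    "error_rate_max": 0.001,  # 0.1% error budget
}

def steady_state_holds(metrics: dict) -> bool:
    """Return True only if every steady-state bound is satisfied."""
    return (
        metrics["rps"] >= STEADY_STATE["rps_min"]
        and metrics["p95_ms"] <= STEADY_STATE["p95_ms_max"]
        and metrics["error_rate"] <= STEADY_STATE["error_rate_max"]
    )

assert steady_state_holds({"rps": 410, "p95_ms": 180, "error_rate": 0.0005})
assert not steady_state_holds({"rps": 410, "p95_ms": 350, "error_rate": 0.0005})
```

Running this check before injection (to confirm the baseline) and after recovery (to confirm the return to steady state) makes "hypothesis confirmed?" a yes/no question rather than a judgment call.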

Blast Radius

| Dimension | Scope |
| --- | --- |
| Services affected | [List specific services] |
| Traffic percentage | [X% of production traffic, or staging only] |
| Users affected | [Internal only / X% of users / Specific region] |
| Duration | [X minutes maximum] |
| Time window | [Day and time, e.g., Tuesday 2pm-3pm ET (low traffic)] |

Abort Conditions

If any of these conditions are met, the experiment must be terminated immediately:

  • Error rate exceeds [X%] for more than [X seconds]
  • p95 latency exceeds [X ms] for more than [X seconds]
  • Any downstream service reports a cascading failure
  • Data integrity check fails (missing writes, duplicate records)
  • On-call receives a page from a monitor not related to the experiment
  • [Custom condition specific to this experiment]

Abort procedure. [How to stop the experiment. Example: "Run chaos-monkey stop experiment-123 or kill the fault injection process. If automated abort fails, manually restart the affected service."]
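
The "exceeds X for more than X seconds" conditions above can be evaluated mechanically rather than by eyeballing a dashboard. A sketch, assuming one metric sample per second and hypothetical placeholder thresholds:

```python
ERROR_RATE_MAX = 0.05     # abort if error rate > 5% ...
SUSTAINED_SECONDS = 30    # ... for more than 30 consecutive seconds

def should_abort(samples: list[float], threshold: float, window: int) -> bool:
    """Abort when the last `window` samples (one per second) all breach the threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

# 29 consecutive bad seconds is not yet an abort; the 30th sample triggers it.
samples = [0.08] * 29
assert not should_abort(samples, ERROR_RATE_MAX, SUSTAINED_SECONDS)
samples.append(0.09)
assert should_abort(samples, ERROR_RATE_MAX, SUSTAINED_SECONDS)
```

In practice a watcher like this would poll your monitoring API in a loop and invoke the abort procedure automatically; requiring a sustained breach (rather than a single sample) avoids aborting on a transient spike the experiment is expected to cause.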

Prerequisites Checklist

  • Experiment reviewed and approved by [SRE lead / engineering manager]
  • On-call engineer notified and aware of the experiment window
  • Monitoring dashboards open (list specific dashboards)
  • Abort procedure tested and verified
  • Rollback procedure tested and verified
  • Communication sent to affected teams (if production)
  • Recent backup verified (if experiment involves data services)
  • Steady-state baseline measured within the last 24 hours

Experiment Execution

Step 1: Baseline measurement.

| Metric | Baseline Value | Measured At |
| --- | --- | --- |
| [Request rate] | [X rps] | [Timestamp] |
| [p95 latency] | [X ms] | [Timestamp] |
| [Error rate] | [X%] | [Timestamp] |
| [Custom metric] | [X] | [Timestamp] |

Step 2: Inject failure.

| Time | Action | Observer |
| --- | --- | --- |
| [T+0] | [Inject failure: describe exact command or action] | [Name] |
| [T+Xs] | [Observe: what metrics to watch] | [Name] |
| [T+Xs] | [Verify: check specific behavior] | [Name] |
| [T+Xs] | [End: stop injection or allow recovery] | [Name] |

Step 3: Recovery measurement.

| Metric | During Injection | After Recovery | Recovery Time |
| --- | --- | --- | --- |
| [Request rate] | [X rps] | [X rps] | [X seconds] |
| [p95 latency] | [X ms] | [X ms] | [X seconds] |
| [Error rate] | [X%] | [X%] | [X seconds] |

Results

Hypothesis confirmed? [Yes / Partially / No]

Summary. [2-3 sentences describing what happened.]

Surprises. [List anything that did not match the hypothesis.]

Artifacts. [Links to dashboards, logs, screenshots captured during the experiment.]

Follow-Up Actions

| Finding | Severity | Action | Owner | Ticket | Due Date |
| --- | --- | --- | --- | --- | --- |
| [What was discovered] | [Critical / High / Medium / Low] | [What needs to change] | [Name] | [Link] | [Date] |

Filled Example: Database Primary Failover Under Load

Experiment Overview

| Field | Details |
| --- | --- |
| Experiment Name | Database primary failover under sustained checkout traffic |
| Author | Sarah Chen |
| Reviewer | Marcus Rivera (SRE Lead) |
| Date | 2026-03-12 |
| Status | Completed |
| Environment | Staging (production-equivalent traffic replay) |
| Related Incident | INC-2024-089: 4-minute checkout outage during unplanned DB failover |

Hypothesis

Steady state. The checkout API handles 400 rps with p95 latency under 200ms and error rate under 0.1%. Background jobs (order processing, notifications) complete within 30 seconds of submission.

Injection. Kill the Postgres primary instance (RDS), forcing automatic failover to the read replica promoted to primary. DNS endpoint update propagation takes 15-30 seconds per AWS documentation.

Expected behavior. Connection pool detects stale connections within 5 seconds. Retry logic handles transient failures during the 15-30 second DNS propagation window. Error rate spikes to 5-10% during failover, then returns to baseline within 60 seconds. No committed transactions are lost. Background jobs retry automatically and complete within 2 minutes of failover.

Null hypothesis. The system does not recover within 120 seconds, committed transactions are lost, or the connection pool does not recover without a manual restart.

Blast Radius

| Dimension | Scope |
| --- | --- |
| Services affected | Checkout API, Order Processor, Notification Service |
| Traffic percentage | Staging (replayed production traffic at 80% volume) |
| Users affected | None (staging) |
| Duration | 10 minutes maximum |
| Time window | Wednesday 3pm-4pm ET |

Results

Hypothesis confirmed? Partially.

Summary. Failover completed in 22 seconds. The checkout API recovered, but the connection pool in the Order Processor service did not detect stale connections. It continued attempting queries on dead connections for 4 minutes until the connection pool's max lifetime timer triggered a refresh. During those 4 minutes, all background order processing was stalled.

Surprises.

  1. Order Processor connection pool (HikariCP) had maxLifetime set to 30 minutes. Stale connections were not detected until the pool cycled them out.
  2. The Notification Service recovered in 8 seconds because it uses a different connection library with built-in health checks.
  3. One committed transaction was replayed (duplicate order confirmation email) because the retry logic did not check idempotency keys.

Follow-Up Actions

| Finding | Severity | Action | Owner | Ticket | Due Date |
| --- | --- | --- | --- | --- | --- |
| Order Processor stale connections | Critical | Add connection validation query (SELECT 1) on checkout from pool | James Park | ENG-4521 | 2026-03-19 |
| Duplicate notification on retry | High | Add idempotency key check to notification dispatch | Lisa Wong | ENG-4522 | 2026-03-19 |
| No failover runbook for Order Processor | Medium | Document manual connection pool reset procedure | Sarah Chen | OPS-891 | 2026-03-26 |

Key Takeaways

  • State a falsifiable hypothesis before every experiment. "Let's see what happens" is not chaos engineering. "The system recovers within 60 seconds with no data loss" is a testable hypothesis.
  • Start in staging, graduate to production. Run every new experiment type in staging first. Only move to production after the system passes in staging and you have verified abort procedures.
  • Define abort conditions before you start. When the experiment is running and metrics are spiking, you need pre-agreed thresholds for stopping. Deciding in the moment leads to either premature aborts or prolonged outages.
  • The value is in the surprises. If every chaos experiment confirms the hypothesis perfectly, either your system is unusually resilient or your experiments are not aggressive enough. The most useful experiments are the ones that reveal unexpected failure modes.
  • Create follow-up tickets immediately. A chaos experiment that discovers a gap but does not result in a fix is wasted effort. Log findings as tickets with owners and due dates before closing the experiment.

Frequently Asked Questions

How do I get buy-in for chaos engineering from leadership?
Frame it in terms of incident prevention and cost. Every chaos experiment that finds a gap before production is an incident that did not happen. Incidents cost engineering time, customer trust, and potentially revenue. Start with low-risk experiments in staging to build confidence, then present the findings (gaps discovered, fixes applied) to justify expanding to production.
What tools should I use for chaos engineering?
For Kubernetes: Chaos Mesh, LitmusChaos, or Gremlin. For AWS: AWS Fault Injection Simulator (FIS). For general-purpose: Gremlin (commercial, multi-platform). For simple experiments: a bash script that kills processes or blocks network ports is a valid starting point. The tool matters less than the process. This template works regardless of tooling.
How often should I run chaos experiments?
Run experiments monthly on critical services and quarterly on supporting services. After any major architecture change, run the relevant experiments again to re-validate. Game days (larger-scale chaos exercises involving multiple teams) should happen quarterly. The goal is to make resilience testing routine, not a one-time event.
What if a chaos experiment causes a real outage?
This is why abort conditions and blast radius controls exist. If an experiment causes impact beyond the defined scope, activate the abort procedure immediately, then treat it as a real incident with a post-mortem. The post-mortem should cover both the system failure and the experiment's safety controls. Tighten the blast radius for future experiments.
Should PMs be involved in chaos engineering?
PMs should understand the reliability posture of their product and the business impact of potential failures. They do not need to attend experiments, but they should review the findings. If a chaos experiment reveals that database failover causes 4 minutes of checkout downtime, the PM needs to know that and factor it into reliability investment decisions.
