Definition
Chaos engineering is the practice of deliberately introducing failures, latency, and disruptions into production systems to verify that they behave as expected under stress. The discipline was pioneered by Netflix in 2011 with the creation of Chaos Monkey, a tool that randomly terminated virtual machine instances in production. The reasoning was straightforward: if Netflix's systems could survive random instance failures during business hours when engineers were available, they could survive unexpected failures at any time.
The core process follows the scientific method. Teams form a hypothesis about how the system should behave when a specific failure occurs ("If the recommendation service goes down, the homepage should still load within 2 seconds using cached data"). They design an experiment to test that hypothesis by injecting the failure in a controlled way. They observe the actual behavior and compare it to the hypothesis. When reality does not match the hypothesis, the team has found a weakness to fix before customers discover it the hard way.
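The hypothesis-experiment-observe loop described above can be sketched in a few lines of Python. Everything here is hypothetical for illustration (the function names, the simulated load times, and the 2-second threshold are assumptions, not a real system):

```python
# Minimal sketch of one chaos experiment expressed as the scientific method.
# All names and numbers are hypothetical, not from any real system.

def homepage_load_seconds(recommendations_up: bool) -> float:
    """Simulated homepage load time: falls back to cached data
    when the recommendation service is down."""
    return 0.8 if recommendations_up else 1.5  # cached path is slower but works

def run_experiment() -> bool:
    """Inject the failure, observe behavior, compare to the hypothesis."""
    hypothesis_max_seconds = 2.0  # "homepage should load within 2 seconds"
    observed = homepage_load_seconds(recommendations_up=False)  # inject the failure
    return observed <= hypothesis_max_seconds  # does reality match the hypothesis?
```

If `run_experiment()` returns `False`, the team has found a weakness; the value of the exercise is entirely in that comparison step, not in the injection itself.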
Chaos engineering extends beyond random instance termination. Modern practices test network partitions, DNS failures, database slowdowns, third-party API outages, and even data corruption scenarios. The discipline connects closely to DevOps culture and CI/CD practices by treating reliability as something that must be continuously tested, not just designed. Teams that use canary releases and feature flags already have the infrastructure needed to limit the blast radius of chaos experiments.
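One common way to test a third-party API outage without touching the dependency itself is to wrap the call site so a controlled fraction of calls fail. This is a hypothetical sketch, not the API of any real chaos tool; passing in a seeded random generator is one way to keep the blast radius deterministic in tests:

```python
import random

def inject_faults(call, failure_rate: float, rng: random.Random):
    """Wrap a dependency call so a controlled fraction of calls raise,
    simulating a third-party API outage (hypothetical helper)."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:  # rng.random() is in [0.0, 1.0)
            raise TimeoutError("injected fault: dependency unavailable")
        return call(*args, **kwargs)
    return wrapped

# Usage: setting failure_rate to 1.0 forces the outage path so the
# fallback behavior can be exercised on demand.
flaky_api = inject_faults(lambda: "ok", failure_rate=1.0, rng=random.Random(0))
```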
Why It Matters for Product Managers
Reliability is a product feature. When a PM promises 99.9% uptime on a pricing page or in an SLA, chaos engineering is what turns that promise from aspiration into evidence. Without proactive testing, reliability claims are based on hope. With chaos experiments, they are based on data.
PMs benefit from chaos engineering in two practical ways. First, it reduces unplanned work. Teams that discover and fix weaknesses proactively spend less time in firefighting mode and more time on the roadmap. Second, it builds customer trust. Products that degrade gracefully during partial outages (showing cached content, queuing transactions, displaying informative error messages) retain users better than products that crash completely. Understanding the results of chaos experiments helps PMs make informed decisions about where to invest in reliability versus feature development. For technical PMs, this knowledge is essential when managing technical debt tradeoffs.
How to Apply It
Start small. Pick a non-critical service dependency and test what happens when it becomes unavailable. Before running the experiment, document your hypothesis and define the abort criteria (what conditions will cause you to immediately stop the test). Run the experiment during business hours when the team is available to respond. After the experiment, compare actual behavior to expected behavior and create action items for any gaps.
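The elements above (a documented hypothesis, an injection, abort criteria, and a guaranteed rollback) can be captured in a small data structure. This is a sketch of one possible shape, not the schema of any particular chaos tool:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ChaosExperiment:
    """Hypothetical structure for a documented chaos experiment."""
    hypothesis: str                   # written down before the run
    inject: Callable[[], None]        # turn the failure on
    restore: Callable[[], None]       # turn it off again
    steady_state: Callable[[], bool]  # does behavior match the hypothesis?
    abort_criteria: List[Callable[[], bool]] = field(default_factory=list)

    def run(self) -> bool:
        self.inject()
        try:
            if any(check() for check in self.abort_criteria):
                return False  # an abort condition tripped: stop and record a finding
            return self.steady_state()
        finally:
            self.restore()  # always put the system back, even on abort or error
```

Putting the rollback in a `finally` block mirrors the operational rule: no matter how the experiment ends, the injected failure is removed before anyone walks away.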
Gradually increase scope. Move from testing individual service failures to testing network partitions between services, then to testing regional failover. Build a library of experiments that run on a regular schedule. The goal is to make chaos experiments as routine as unit tests. Use the results to inform your roadmap: if a chaos experiment reveals that your checkout flow has a single point of failure, that is a high-priority fix that should be planned into the next sprint. For a deeper dive into managing technical infrastructure decisions, see the technical PM handbook.
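A scheduled library of experiments can be as simple as a named collection run in sequence, with failures surfaced as roadmap candidates. The experiment names and structure below are illustrative assumptions:

```python
def run_suite(experiments):
    """Run a library of chaos experiments (name -> zero-argument callable
    returning True if the hypothesis held); return the names of any that
    failed, as candidate action items for the roadmap. Hypothetical sketch."""
    return [name for name, run in experiments.items() if not run()]

# Usage: each entry would normally wrap a real experiment run.
suite = {
    "recommendations-down": lambda: True,   # hypothesis held
    "checkout-db-slow": lambda: False,      # reveals a single point of failure
}
```

Running such a suite on a schedule, like a nightly CI job, is what makes chaos experiments as routine as unit tests: the output is not a pass/fail gate so much as a prioritized list of gaps.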