
Chaos Engineering

Definition

Chaos engineering is the practice of deliberately introducing failures, latency, and disruptions into production systems to verify that they behave as expected under stress. The discipline was pioneered by Netflix in 2011 with the creation of Chaos Monkey, a tool that randomly terminated virtual machine instances in production. The reasoning was straightforward: if Netflix's systems could survive random instance failures during business hours when engineers were available, they could survive unexpected failures at any time.

The core process follows the scientific method. Teams form a hypothesis about how the system should behave when a specific failure occurs ("If the recommendation service goes down, the homepage should still load within 2 seconds using cached data"). They design an experiment to test that hypothesis by injecting the failure in a controlled way. They observe the actual behavior and compare it to the hypothesis. When reality does not match the hypothesis, the team has found a weakness to fix before customers discover it the hard way.
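This hypothesis-experiment-observe loop can be sketched in a few lines. The snippet below is a minimal illustration, not a real chaos tool: the homepage renderer, the cache, and the 2-second threshold are all hypothetical stand-ins for the example in the text.

```python
import time

# Hypothetical homepage renderer that falls back to cached data when the
# recommendation service is unavailable. All names here are illustrative.
CACHE = {"recommendations": ["cached-item-1", "cached-item-2"]}

def load_homepage(recommendation_service_up: bool) -> dict:
    start = time.monotonic()
    if recommendation_service_up:
        items = ["live-item-1", "live-item-2"]  # stand-in for a live call
    else:
        items = CACHE["recommendations"]        # graceful degradation path
    return {"items": items, "latency_s": time.monotonic() - start}

def run_experiment() -> bool:
    """Hypothesis: with the recommendation service down, the homepage
    still loads within 2 seconds using cached data."""
    result = load_homepage(recommendation_service_up=False)  # inject the failure
    # Compare observed behavior to the hypothesis.
    return result["latency_s"] < 2.0 and result["items"] == CACHE["recommendations"]
```

If `run_experiment()` returned `False`, the team would have found a weakness (slow fallback, or no fallback at all) to fix before a real outage surfaces it.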

Chaos engineering extends beyond random instance termination. Modern practices test network partitions, DNS failures, database slowdowns, third-party API outages, and even data corruption scenarios. The discipline connects closely to DevOps culture and CI/CD practices by treating reliability as something that must be continuously tested, not just designed. Teams that use canary releases and feature flags already have the infrastructure needed to limit the blast radius of chaos experiments.
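Limiting the blast radius often reuses the same deterministic bucketing that canary releases and feature flags rely on: only a fixed slice of traffic ever sees the injected failure. A minimal sketch, with hypothetical function names:

```python
import hashlib

def in_blast_radius(user_id: str, percent: float) -> bool:
    """Deterministically assign a user to a bucket from 0-99, so the same
    user is always inside or outside the experiment. Only users whose
    bucket falls below `percent` get the fault injected."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent

# Example: gate fault injection to 1% of traffic; the other 99% of users
# take the normal code path untouched.
```

Because the bucketing is a pure function of the user ID, the experiment's scope stays stable across requests, which makes observed behavior easier to attribute to the injected failure.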

Why It Matters for Product Managers

Reliability is a product feature. When a PM promises 99.9% uptime on a pricing page or in an SLA, chaos engineering is what turns that promise from aspiration into evidence. Without proactive testing, reliability claims are based on hope. With chaos experiments, they are based on data.

PMs benefit from chaos engineering in two practical ways. First, it reduces unplanned work. Teams that discover and fix weaknesses proactively spend less time in firefighting mode and more time on the roadmap. Second, it builds customer trust. Products that degrade gracefully during partial outages (showing cached content, queuing transactions, displaying informative error messages) retain users better than products that crash completely. Understanding the results of chaos experiments helps PMs make informed decisions about where to invest in reliability versus feature development. For technical PMs, this knowledge is essential when managing technical debt tradeoffs.

How to Apply It

Start small. Pick a non-critical service dependency and test what happens when it becomes unavailable. Before running the experiment, document your hypothesis and define the abort criteria (what conditions will cause you to immediately stop the test). Run the experiment during business hours when the team is available to respond. After the experiment, compare actual behavior to expected behavior and create action items for any gaps.
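Abort criteria work best when they are written down as explicit, machine-checkable thresholds before the experiment starts. A minimal sketch; the metric names and threshold values are illustrative assumptions, not from any specific tool:

```python
# Hypothetical abort criteria documented before the experiment runs.
# If either threshold is crossed, the experiment stops immediately.
ABORT_CRITERIA = {
    "max_error_rate": 0.05,    # abort above 5% request errors
    "max_p99_latency_s": 2.0,  # abort above 2s p99 latency
}

def should_abort(error_rate: float, p99_latency_s: float) -> bool:
    """Check observed metrics against the pre-agreed abort criteria."""
    return (
        error_rate > ABORT_CRITERIA["max_error_rate"]
        or p99_latency_s > ABORT_CRITERIA["max_p99_latency_s"]
    )
```

During the experiment, the team polls live metrics and calls a check like this on each interval; a `True` result triggers the kill switch and ends the test.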

Gradually increase scope. Move from testing individual service failures to testing network partitions between services, then to testing regional failover. Build a library of experiments that run on a regular schedule. The goal is to make chaos experiments as routine as unit tests. Use the results to inform your roadmap: if a chaos experiment reveals that your checkout flow has a single point of failure, that is a high-priority fix that should be planned into the next sprint. For a deeper dive into managing technical infrastructure decisions, see the technical PM handbook.

Frequently Asked Questions

Is it safe to break things in production on purpose?
Chaos experiments are designed to be controlled and limited in scope. Teams start with a hypothesis ("if this service fails, traffic should automatically reroute to the backup"), define a small blast radius (one region, a small percentage of traffic), and have an immediate kill switch to stop the experiment. The experiments reveal whether your systems actually behave the way you think they do. The alternative is discovering failures at 3 AM during a real outage when the stakes are much higher. Netflix, Amazon, Google, and Microsoft all run chaos experiments in production.
How does chaos engineering relate to product management?
Product managers should care about chaos engineering for two reasons. First, it directly affects the customer experience. If your payment processing service cannot handle a dependency failure gracefully, customers lose money and trust. Chaos experiments prove that your reliability claims are real, not theoretical. Second, it affects roadmap planning. Teams that invest in chaos engineering spend less time on emergency incident response and more time building features. PMs who support chaos engineering investments get more predictable velocity in return.
What tools do teams use for chaos engineering?
Netflix's Chaos Monkey (randomly terminates instances) is the most famous tool. AWS Fault Injection Simulator provides managed chaos experiments for AWS infrastructure. Gremlin offers a commercial platform with a library of attack types. LitmusChaos is an open-source option for Kubernetes environments. Smaller teams can start without tools by manually shutting down a non-critical service during business hours and observing what happens. The tool matters less than the practice of forming hypotheses and testing them.
