Quick Answer (TL;DR)
Mean Time to Recovery (MTTR) measures the average time it takes to restore service after a failure. The formula is total downtime divided by the number of incidents. Commonly cited benchmark for high performers: <1 hour. Track this metric when measuring operational resilience.
What Is Mean Time to Recovery (MTTR)?
Mean Time to Recovery (MTTR) is the average time it takes to restore service after a failure or outage. It is one of the core metrics in the operational metrics category and essential for any product team serious about data-driven decision making.
Mean Time to Recovery (MTTR) measures the health and efficiency of your product infrastructure and team operations. While not a customer-facing metric, it directly impacts user experience and your team's ability to ship improvements.
Understanding Mean Time to Recovery (MTTR) in context, alongside related metrics like Change Failure Rate and Deployment Frequency, gives you a more complete picture than tracking it in isolation. Use it as part of a balanced metrics dashboard.
The Formula
MTTR = Total downtime / Number of incidents
How to Calculate It
Track the start and resolution timestamps for each incident, then sum the downtime and divide by the number of incidents. If you measure five incidents with durations of 2, 4, 5, 8, and 11 hours, total downtime is 30 hours across five incidents, so MTTR is 30 / 5 = 6 hours. Because a few unusually long outages can inflate the mean, it is worth tracking the median alongside it (here, 5 hours) to see the typical recovery time.
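A minimal sketch of this calculation in Python, assuming you already capture start and resolution timestamps per incident (the sample data below is illustrative):

```python
from datetime import datetime
from statistics import mean, median

# Illustrative incident records: (started_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),   # 2 h
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 18, 0)),  # 4 h
    (datetime(2024, 2, 1, 8, 0), datetime(2024, 2, 1, 13, 0)),   # 5 h
    (datetime(2024, 2, 20, 22, 0), datetime(2024, 2, 21, 6, 0)), # 8 h
    (datetime(2024, 3, 5, 1, 0), datetime(2024, 3, 5, 12, 0)),   # 11 h
]

# Duration of each incident in hours.
durations = [(end - start).total_seconds() / 3600 for start, end in incidents]

mttr = mean(durations)       # total downtime / number of incidents
typical = median(durations)  # robust to a few very long outages

print(f"MTTR: {mttr:.1f} h, median recovery: {typical:.1f} h")
# MTTR: 6.0 h, median recovery: 5.0 h
```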
Benchmarks
<1 hour for top performers. This matches the elite tier in DORA-style benchmarks, where restoring service in under an hour marks the highest-performing teams.
Benchmarks vary significantly by industry, company stage, business model, and customer segment. Use these ranges as starting points and calibrate to your own historical data over 2-3 quarters. Your trend matters more than any absolute number --- consistent improvement is the goal.
When to Track Mean Time to Recovery (MTTR)
When measuring operational resilience. Specifically, prioritize this metric when:
You are building or reviewing your metrics dashboard and need operational indicators
Leadership or investors ask about operational performance
You suspect a change in product, pricing, or go-to-market strategy has affected this area
You are running experiments that could impact Mean Time to Recovery (MTTR)
You need a quantitative baseline before making a strategic decision
How to Improve
Reduce unnecessary steps in incident response. Map the process from detection to resolution and eliminate anything that does not directly contribute to restoring service. Fewer steps mean faster recovery.
Automate monitoring and alerting. Do not rely on manual checks. Set up automated alerts that trigger when this metric crosses a threshold so your team can respond immediately; a minimal sketch follows this list.
Invest in infrastructure and tooling. Operational metrics improve when you invest in better CI/CD pipelines, monitoring tools, and incident response processes.
Set clear SLAs and track compliance. Define service-level agreements for this metric and hold teams accountable. What gets measured and targeted gets improved.
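To make the alerting advice above concrete, here is a minimal sketch that recomputes MTTR over recent incidents and posts to a webhook when it crosses a threshold. The webhook URL, threshold value, and function name are placeholder assumptions, not references to any particular monitoring product:

```python
import json
from statistics import mean
from urllib import request

MTTR_THRESHOLD_HOURS = 1.0  # assumed target; calibrate to your own baseline
ALERT_WEBHOOK = "https://example.com/hooks/oncall"  # placeholder URL

def check_mttr(recent_durations_hours: list[float]) -> None:
    """Recompute MTTR over the most recent incidents and alert on breach."""
    if not recent_durations_hours:
        return
    mttr = mean(recent_durations_hours)
    if mttr > MTTR_THRESHOLD_HOURS:
        payload = json.dumps({
            "alert": "MTTR above threshold",
            "mttr_hours": round(mttr, 2),
            "threshold_hours": MTTR_THRESHOLD_HOURS,
        }).encode()
        req = request.Request(
            ALERT_WEBHOOK,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)  # notify the on-call channel

# Example: the five incidents from the calculation section.
check_mttr([2, 4, 5, 8, 11])  # MTTR = 6.0 h > 1.0 h, so this would alert
```

In practice you would run a check like this on a schedule and point the webhook at your on-call channel.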
Common Pitfalls
Relying on the average alone. Time-based metrics are often skewed by outliers: a few extremely slow recoveries can inflate the mean and mask the typical experience. Track the median alongside MTTR for a more accurate picture.
Setting thresholds too tightly or too loosely. Overly sensitive alerts cause alarm fatigue, while loose thresholds miss real issues. Calibrate against historical baselines and adjust as the system matures; one simple calibration approach is sketched after this list.
Measuring without acting. Tracking this metric is only valuable if you have a process for reviewing it regularly and a playbook for responding when it moves outside acceptable ranges.
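One simple way to calibrate that threshold, assuming you have a few quarters of historical incident durations on hand (the data below is illustrative), is to alert only when recovery time exceeds a high percentile of its own history:

```python
from statistics import quantiles

# Historical per-incident recovery times in hours (illustrative data).
history_hours = [1, 2, 2, 3, 4, 5, 5, 6, 8, 11]

# Use the 90th percentile of history as the alert threshold, so only
# genuinely unusual recovery times fire an alert (less alarm fatigue).
p90 = quantiles(history_hours, n=10)[-1]

print(f"Alert threshold: {p90:.1f} h")
# Alert threshold: 10.7 h
```

Recompute the threshold periodically so it tightens as your recovery times improve.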
Related Metrics
Lead Time for Changes --- time from code commit to production deployment
Change Failure Rate --- percentage of deployments causing a failure
Deployment Frequency --- how often code is deployed to production
Sprint Velocity --- amount of work completed per sprint
Product Metrics Cheat Sheet --- complete reference of 100+ metrics