Quick Answer (TL;DR)
Mean Time to Recovery (MTTR) measures the average time it takes to restore service after a failure. The formula is total downtime divided by the number of incidents. Commonly cited benchmark for high performers: <1 hour. Track this metric when measuring operational resilience.
What Is Mean Time to Recovery (MTTR)?
Mean Time to Recovery (MTTR) is the average time it takes to restore service after a failure or outage. It is one of the core metrics in the operational metrics category and essential for any product team serious about data-driven decision making.
Mean Time to Recovery (MTTR) measures the health and efficiency of your product infrastructure and team operations. While not a customer-facing metric, it directly impacts user experience and your team's ability to ship improvements.
Understanding Mean Time to Recovery (MTTR) in context, alongside related metrics like Change Failure Rate and Deployment Frequency, gives you a more complete picture than tracking it in isolation. Use it as part of a balanced metrics dashboard.
The Formula
MTTR = Total downtime / Number of incidents
How to Calculate It
Track the start and resolution timestamps for each incident, then sum the downtime and divide by the number of incidents. If you measure five incidents with durations of 2, 4, 5, 8, and 11 hours, total downtime is 30 hours across five incidents, so MTTR is 30 / 5 = 6 hours. Because a few unusually long outages can inflate the mean, it is worth tracking the median alongside it (here, 5 hours) to see the typical recovery time.
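A minimal sketch of this calculation in Python, assuming you already capture start and resolution timestamps per incident (the sample data below is illustrative):

```python
from datetime import datetime
from statistics import mean, median

# Illustrative incident records: (started_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),   # 2 h
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 18, 0)),  # 4 h
    (datetime(2024, 2, 1, 8, 0), datetime(2024, 2, 1, 13, 0)),   # 5 h
    (datetime(2024, 2, 20, 22, 0), datetime(2024, 2, 21, 6, 0)), # 8 h
    (datetime(2024, 3, 5, 1, 0), datetime(2024, 3, 5, 12, 0)),   # 11 h
]

# Duration of each incident in hours.
durations = [(end - start).total_seconds() / 3600 for start, end in incidents]

mttr = mean(durations)       # total downtime / number of incidents
typical = median(durations)  # robust to a few very long outages

print(f"MTTR: {mttr:.1f} h, median recovery: {typical:.1f} h")
# MTTR: 6.0 h, median recovery: 5.0 h
```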
Benchmarks
<1 hour for top performers. This matches the elite tier in DORA-style benchmarks, where restoring service in under an hour marks the highest-performing teams.
Benchmarks vary significantly by industry, company stage, business model, and customer segment. Use these ranges as starting points and calibrate to your own historical data over 2-3 quarters. Your trend matters more than any absolute number --- consistent improvement is the goal.
When to Track Mean Time to Recovery (MTTR)
When measuring operational resilience. Specifically, prioritize this metric when:
You are building or reviewing your metrics dashboard and need operational indicators
Leadership or investors ask about operational performance
You suspect a change in product, pricing, or go-to-market strategy has affected this area
You are running experiments that could impact Mean Time to Recovery (MTTR)
You need a quantitative baseline before making a strategic decision
How to Improve
Reduce unnecessary steps in incident response. Map the process from detection to resolution and eliminate anything that does not directly contribute to restoring service. Fewer steps mean faster recovery.
Automate monitoring and alerting. Do not rely on manual checks. Set up automated alerts that trigger when this metric crosses a threshold so your team can respond immediately; a minimal sketch follows this list.
Invest in infrastructure and tooling. Operational metrics improve when you invest in better CI/CD pipelines, monitoring tools, and incident response processes.
Set clear SLAs and track compliance. Define service-level agreements for this metric and hold teams accountable. What gets measured and targeted gets improved.
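To make the alerting advice above concrete, here is a minimal sketch that recomputes MTTR over recent incidents and posts to a webhook when it crosses a threshold. The webhook URL, threshold value, and function name are placeholder assumptions, not references to any particular monitoring product:

```python
import json
from statistics import mean
from urllib import request

MTTR_THRESHOLD_HOURS = 1.0  # assumed target; calibrate to your own baseline
ALERT_WEBHOOK = "https://example.com/hooks/oncall"  # placeholder URL

def check_mttr(recent_durations_hours: list[float]) -> None:
    """Recompute MTTR over the most recent incidents and alert on breach."""
    if not recent_durations_hours:
        return
    mttr = mean(recent_durations_hours)
    if mttr > MTTR_THRESHOLD_HOURS:
        payload = json.dumps({
            "alert": "MTTR above threshold",
            "mttr_hours": round(mttr, 2),
            "threshold_hours": MTTR_THRESHOLD_HOURS,
        }).encode()
        req = request.Request(
            ALERT_WEBHOOK,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)  # notify the on-call channel

# Example: the five incidents from the calculation section.
check_mttr([2, 4, 5, 8, 11])  # MTTR = 6.0 h > 1.0 h, so this would alert
```

In practice you would run a check like this on a schedule and point the webhook at your on-call channel.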
Common Pitfalls
Relying on the average alone. Time-based metrics are often skewed by outliers: a few extremely slow recoveries can inflate the mean and mask the typical experience. Track the median alongside MTTR for a more accurate picture.
Setting thresholds too tightly or too loosely. Overly sensitive alerts cause alarm fatigue, while loose thresholds miss real issues. Calibrate against historical baselines and adjust as the system matures; one simple calibration approach is sketched after this list.
Measuring without acting. Tracking this metric is only valuable if you have a process for reviewing it regularly and a playbook for responding when it moves outside acceptable ranges.
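One simple way to calibrate that threshold, assuming you have a few quarters of historical incident durations on hand (the data below is illustrative), is to alert only when recovery time exceeds a high percentile of its own history:

```python
from statistics import quantiles

# Historical per-incident recovery times in hours (illustrative data).
history_hours = [1, 2, 2, 3, 4, 5, 5, 6, 8, 11]

# Use the 90th percentile of history as the alert threshold, so only
# genuinely unusual recovery times fire an alert (less alarm fatigue).
p90 = quantiles(history_hours, n=10)[-1]

print(f"Alert threshold: {p90:.1f} h")
# Alert threshold: 10.7 h
```

Recompute the threshold periodically so it tightens as your recovery times improve.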
Related Metrics
Lead Time for Changes --- time from code commit to production deployment
Change Failure Rate --- percentage of deployments causing a failure
Deployment Frequency --- how often code is deployed to production
Sprint Velocity --- amount of work completed per sprint
Product Metrics Cheat Sheet --- complete reference of 100+ metrics