
Mean Time to Recovery (MTTR): Definition, Formula & Benchmarks

Learn how to measure and reduce Mean Time to Recovery (MTTR). Includes the formula, benchmarks (<1 hour), and strategies to improve speed and efficiency.

By Tim Adair • Published 2026-02-08

Quick Answer (TL;DR)

Mean Time to Recovery (MTTR) measures the average time it takes to recover from a failure. The formula is total downtime divided by the number of incidents. A common industry benchmark is under 1 hour. Track this metric when measuring operational resilience.


What Is Mean Time to Recovery (MTTR)?

MTTR is the average time it takes to recover from a failure. It is one of the core operational metrics and is essential for any product team serious about data-driven decision making.

Mean Time to Recovery (MTTR) measures the health and efficiency of your product infrastructure and team operations. While not a customer-facing metric, it directly impacts user experience and your team's ability to ship improvements.

Understanding MTTR in context, alongside related metrics, gives you a more complete picture than tracking it in isolation. Use it as part of a balanced metrics dashboard.


The Formula

MTTR = Total downtime / Number of incidents

How to Calculate It

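Record a timestamp when each incident starts and another when service is restored; the difference is that incident's downtime. If five incidents lasted 2, 4, 5, 8, and 11 hours, total downtime is 30 hours and MTTR is 30 / 5 = 6 hours. The median of the same five durations is 5 hours. Because time-based metrics are often skewed by outliers, report the median alongside the mean.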
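
As a minimal sketch of that calculation, the Python snippet below derives downtime from start and restore timestamps and then computes both the mean (MTTR) and the median. The incident log and the helper name `downtime_hours` are hypothetical; the timestamps were chosen only to reproduce the 2, 4, 5, 8, and 11 hour durations from the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 11, 0)),     # 2 h
    (datetime(2026, 1, 12, 14, 0), datetime(2026, 1, 12, 18, 0)),  # 4 h
    (datetime(2026, 1, 20, 8, 0), datetime(2026, 1, 20, 13, 0)),   # 5 h
    (datetime(2026, 2, 2, 22, 0), datetime(2026, 2, 3, 6, 0)),     # 8 h
    (datetime(2026, 2, 9, 1, 0), datetime(2026, 2, 9, 12, 0)),     # 11 h
]


def downtime_hours(log):
    """Convert (start, end) timestamp pairs into downtime in hours."""
    return [(end - start).total_seconds() / 3600 for start, end in log]


durations = downtime_hours(incidents)

# MTTR = total downtime / number of incidents
mttr_hours = sum(durations) / len(durations)
median_hours = median(durations)

print(f"MTTR:   {mttr_hours:.1f} h")
print(f"Median: {median_hours:.1f} h")
```

With this data the script reports an MTTR of 6.0 hours and a median of 5.0 hours.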


Benchmarks

<1 hour

Benchmarks vary significantly by industry, company stage, business model, and customer segment. Use this figure as a starting point and calibrate to your own historical data over 2-3 quarters. Your trend matters more than any absolute number: consistent improvement is the goal.
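
One way to watch the trend is to group incidents by calendar quarter and compute MTTR for each. The sketch below assumes you already have a list of (date, downtime in hours) records; the data and the function name `mttr_by_quarter` are illustrative, not taken from any particular tool.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (date of incident, downtime in hours).
records = [
    (date(2025, 7, 14), 3.0),
    (date(2025, 8, 30), 2.0),
    (date(2025, 10, 9), 2.5),
    (date(2025, 11, 21), 1.5),
    (date(2026, 1, 17), 1.0),
    (date(2026, 2, 3), 0.5),
]


def mttr_by_quarter(rows):
    """Return {(year, quarter): MTTR in hours}, ordered chronologically."""
    buckets = defaultdict(list)
    for day, hours in rows:
        quarter = (day.month - 1) // 3 + 1
        buckets[(day.year, quarter)].append(hours)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


trend = mttr_by_quarter(records)
for (year, quarter), value in trend.items():
    print(f"{year} Q{quarter}: MTTR {value:.2f} h")
```

In this made-up data set MTTR falls from 2.50 hours to 2.00 hours to 0.75 hours across three quarters, which is the kind of steady improvement to look for.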


When to Track Mean Time to Recovery (MTTR)

Track MTTR when you are measuring operational resilience. Specifically, prioritize this metric when:

  • You are building or reviewing your metrics dashboard and need operational indicators
  • Leadership or investors ask about operational performance
  • You suspect a change in product, pricing, or go-to-market strategy has affected this area
  • You are running experiments that could impact MTTR
  • You need a quantitative baseline before making a strategic decision

How to Improve

  • Reduce unnecessary steps. Map the recovery process from detection to resolution and eliminate anything that does not directly contribute to restoring service. Fewer steps mean faster recovery.
  • Automate monitoring and alerting. Do not rely on manual checks. Set up automated alerts that trigger when this metric crosses a threshold so your team can respond immediately.
  • Invest in infrastructure and tooling. Operational metrics improve when you invest in better CI/CD pipelines, monitoring tools, and incident response processes.
  • Set clear SLAs and track compliance. Define service-level agreements for this metric and hold teams accountable. What gets measured and targeted gets improved.
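
To make SLA tracking concrete, the sketch below computes the share of incidents recovered within a target. The 1-hour default mirrors the benchmark above; the function name `sla_compliance` and the sample durations are assumptions for illustration only.

```python
def sla_compliance(durations_hours, target_hours=1.0):
    """Return the fraction of incidents recovered within target_hours."""
    if not durations_hours:
        raise ValueError("no incidents to evaluate")
    within_target = sum(1 for d in durations_hours if d <= target_hours)
    return within_target / len(durations_hours)


# Hypothetical recovery times in hours for one reporting period.
recovery_times = [0.4, 0.9, 1.5, 0.7, 2.2]

rate = sla_compliance(recovery_times)
print(f"{rate:.0%} of incidents met the 1-hour target")
```

Here three of five incidents meet the target, so compliance is 60%. Reviewing this figure together with MTTR shows whether a good average is hiding a few long recoveries.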

Common Pitfalls

  • Using averages instead of medians. Time-based metrics are often skewed by outliers. A few extremely slow cases can inflate the average and mask the typical experience. Use medians for a more accurate picture.
  • Setting thresholds too tightly or loosely. Overly sensitive alerts cause alarm fatigue while loose thresholds miss real issues. Calibrate against historical baselines and adjust as the system matures.
  • Measuring without acting. Tracking this metric is only valuable if you have a process for reviewing it regularly and a playbook for responding when it moves outside acceptable ranges.
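
The first pitfall is easy to demonstrate. In the sketch below, all recovery times are invented for illustration: adding a single 20-hour incident to five typical ones moves the mean dramatically while the median barely changes.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes.
typical = [30, 35, 40, 45, 50]
with_outlier = typical + [1200]  # one 20-hour incident

typical_mean = mean(typical)
typical_median = median(typical)
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

print(f"Typical:      mean {typical_mean:.1f} min, median {typical_median:.1f} min")
print(f"With outlier: mean {outlier_mean:.1f} min, median {outlier_median:.1f} min")
```

The mean jumps from 40.0 to about 233.3 minutes, while the median moves only from 40.0 to 42.5 minutes.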

Related Metrics

  • Lead Time for Changes: time from code commit to production deployment
  • Change Failure Rate: percentage of deployments causing a failure
  • Deployment Frequency: how often code is deployed to production
  • Sprint Velocity: amount of work completed per sprint
  • Product Metrics Cheat Sheet: complete reference of 100+ metrics