Calculate service recovery time from incident detection to resolution. Classify your DORA restoration tier and benchmark recovery speed.
Service recovery time measures the elapsed time from when an incident is detected to when the service is fully restored. Also known as Time to Restore Service in the DORA framework, it is a critical reliability metric that directly impacts user experience and SLA compliance.
Elite teams restore service in under one hour, while teams in the lowest tier take more than a month to recover from failures. Recovery speed often matters more than failure frequency, because users experience the duration of an outage rather than the failure event itself.
This calculator computes recovery time from detection and resolution timestamps, classifies your DORA tier, and provides the breakdown in minutes, hours, and days. Tracking recovery time across incidents helps identify patterns and justify investments in observability, runbooks, and automated remediation.
Integrating this calculation into monitoring and reporting workflows ensures that engineering decisions are grounded in real data rather than assumptions about system behavior.
Fast recovery limits how long each incident affects users. Even when failures occur, rapid restoration reduces downtime costs, SLA violations, and customer churn. This calculator helps teams benchmark their recovery capability against the DORA tiers and track improvement over time. Recorded recovery times also show whether spending on monitoring, runbooks, or automation is actually shortening outages.
Recovery Time = Incident Resolved Timestamp − Incident Detected Timestamp. DORA tiers: Elite < 1 hour, High < 1 day, Medium < 1 week, Low < 1 month, Very Low > 1 month.
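The formula and tier thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function names are hypothetical, "1 month" is assumed to mean 30 days, and a duration of exactly one month (which the tier list leaves unspecified) is assumed to fall into Very Low.

```python
from datetime import datetime, timedelta

# Upper bounds for each tier, in ascending order.
# Assumption: "1 month" is treated as 30 days.
TIER_LIMITS = [
    ("Elite", timedelta(hours=1)),
    ("High", timedelta(days=1)),
    ("Medium", timedelta(weeks=1)),
    ("Low", timedelta(days=30)),
]


def recovery_time(detected: datetime, resolved: datetime) -> timedelta:
    """Return resolved minus detected, rejecting negative durations."""
    if resolved < detected:
        raise ValueError("resolved timestamp precedes detected timestamp")
    return resolved - detected


def classify_tier(duration: timedelta) -> str:
    """Return the first tier whose upper bound is strictly above the duration."""
    for name, limit in TIER_LIMITS:
        if duration < limit:
            return name
    # Assumption: anything at or beyond the last bound is Very Low.
    return "Very Low"


detected = datetime(2024, 1, 1, 10, 0)
resolved = datetime(2024, 1, 1, 11, 55)
elapsed = recovery_time(detected, resolved)
print(elapsed, classify_tier(elapsed))  # 1:55:00 High
```

Keeping the thresholds in an ordered list means the boundaries can be adjusted in one place if your organization uses different cut-offs.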
Result: 115 minutes (1.92 hours) — High tier
An incident detected 120 minutes ago and resolved 5 minutes ago has a recovery time of 120 − 5 = 115 minutes (1 hour 55 minutes). This falls in the High DORA tier, toward its fast end. Because Elite requires a recovery time under 60 minutes, the team would need to cut more than 55 minutes to reach Elite status.
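The arithmetic in this example can be checked with a short Python sketch. The helper names `breakdown` and `minutes_over_limit` are hypothetical, and rounding to two decimal places is an assumption chosen to match the 1.92-hour figure shown in the result.

```python
from datetime import timedelta


def breakdown(duration: timedelta) -> dict:
    """Express one duration in minutes, hours, and days (two decimals)."""
    seconds = duration.total_seconds()
    return {
        "minutes": round(seconds / 60, 2),
        "hours": round(seconds / 3600, 2),
        "days": round(seconds / 86400, 2),
    }


def minutes_over_limit(duration: timedelta, limit: timedelta) -> float:
    """Minutes by which the duration reaches or exceeds a tier's upper bound.

    The bound is strict, so the recovery time must shrink by MORE than
    this value to land inside the faster tier.
    """
    return max(0.0, (duration - limit).total_seconds() / 60)


elapsed = timedelta(minutes=120) - timedelta(minutes=5)
print(breakdown(elapsed))
# {'minutes': 115.0, 'hours': 1.92, 'days': 0.08}
print(minutes_over_limit(elapsed, timedelta(hours=1)))
# 55.0
```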
Time to restore service is one of the four DORA metrics that separate elite engineering organizations from the rest. It measures not whether you fail, but how quickly you recover when failures occur — a far more practical measure of operational excellence.
Break recovery time into distinct phases: detection (from failure to first alert), triage (from alert to incident ownership), diagnosis (from ownership to root cause identification), remediation (from root cause to fix deployment), and verification (from fix to confirmed restoration). Each phase can be optimized independently.
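One way to see which phase dominates is to record a timestamp at each phase boundary and subtract neighbors. The sketch below assumes a hypothetical timeline dictionary with made-up field names and sample times; real incident tooling will expose these differently.

```python
from datetime import datetime, timedelta

# Each phase as (name, start field, end field), in the order listed above.
# The field names are illustrative, not taken from any particular tool.
PHASES = [
    ("detection", "failed_at", "alerted_at"),
    ("triage", "alerted_at", "owned_at"),
    ("diagnosis", "owned_at", "root_cause_at"),
    ("remediation", "root_cause_at", "fix_deployed_at"),
    ("verification", "fix_deployed_at", "verified_at"),
]


def phase_durations(timeline: dict) -> dict:
    """Return the elapsed time of every phase for one incident."""
    return {name: timeline[end] - timeline[start] for name, start, end in PHASES}


# Made-up sample incident.
timeline = {
    "failed_at": datetime(2024, 1, 1, 9, 52),
    "alerted_at": datetime(2024, 1, 1, 10, 0),
    "owned_at": datetime(2024, 1, 1, 10, 10),
    "root_cause_at": datetime(2024, 1, 1, 11, 0),
    "fix_deployed_at": datetime(2024, 1, 1, 11, 40),
    "verified_at": datetime(2024, 1, 1, 11, 55),
}

durations = phase_durations(timeline)
slowest = max(durations, key=durations.get)
for name, elapsed in durations.items():
    print(f"{name:<13}{elapsed}")
print("slowest phase:", slowest)
```

In this sample, diagnosis takes the longest, which would point toward observability tooling as the first place to invest.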
Fast recovery is a skill that requires practice. Regular game days, chaos engineering experiments, and incident response drills build the muscle memory teams need to respond quickly under pressure. Teams that practice recover faster.
Track recovery time as a rolling median over 30 and 90 days. Monitor trends by severity level (SEV1 vs SEV2 vs SEV3) and by service. Look for improvements after investing in runbooks, automation, or observability tooling.
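A rolling median needs only the standard library. The sketch below assumes each incident is stored as a hypothetical `(resolved_at, recovery_minutes)` pair and that an incident belongs to a window if it was resolved within the last N days; the sample data is invented.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


def rolling_median_minutes(incidents, as_of: datetime, window_days: int) -> Optional[float]:
    """Median recovery time of incidents resolved in the window ending at as_of.

    Returns None when the window contains no incidents.
    """
    start = as_of - timedelta(days=window_days)
    values = [
        minutes
        for resolved_at, minutes in incidents
        if start < resolved_at <= as_of
    ]
    return median(values) if values else None


# Invented sample data: (resolved_at, recovery time in minutes).
incidents = [
    (datetime(2024, 1, 10), 300),
    (datetime(2024, 2, 15), 180),
    (datetime(2024, 3, 10), 115),
    (datetime(2024, 3, 20), 45),
    (datetime(2024, 3, 28), 60),
]

as_of = datetime(2024, 4, 1)
print("30-day median:", rolling_median_minutes(incidents, as_of, 30))
print("90-day median:", rolling_median_minutes(incidents, as_of, 90))
```

The median is used rather than the mean so that a single very long incident does not dominate the trend. Filtering the input list by severity before calling the function gives the per-severity view.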
Service recovery time is the elapsed time from when an incident is detected (typically when an alert fires) to when the service is fully restored and verified operational. It covers the incident lifecycle from detection through confirmed restoration.
MTTR (Mean Time to Repair) is an average across many incidents. Service recovery time can be measured per incident. In the DORA framework, Time to Restore Service typically uses the median across incidents.
Elite teams restore service in under one hour. This requires comprehensive observability, well-maintained runbooks, automated remediation for common failures, and practiced incident response procedures.
DORA measures from detection, not from failure start. However, improving detection speed (reducing time from failure to alert) is equally important. Consider tracking both failure-to-detection and detection-to-restoration times.
To reduce recovery time, invest in three areas: faster detection through better monitoring and alerting, faster diagnosis through observability tools (logs, metrics, traces), and faster remediation through runbooks and automation. Comparing your results against the DORA tiers shows how far you are from the next tier and whether those investments are paying off.
For incidents that involve multiple teams, track the total wall-clock time, not each team's individual effort. Multi-team incidents often have longer recovery times because of coordination overhead. Clear incident command processes and communication channels help reduce this.
Yes, recovery time is worth tracking per service. Different services have different recovery characteristics based on their complexity, team expertise, and tooling. Per-service tracking helps prioritize investments where they will have the most impact.
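Per-service tracking can be as simple as grouping recovery times by service name and taking the median of each group. The sketch below uses invented service names and sample values, and `median_by_service` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def median_by_service(incidents) -> dict:
    """Group (service, recovery_minutes) pairs and return each group's median."""
    grouped = defaultdict(list)
    for service, minutes in incidents:
        grouped[service].append(minutes)
    return {service: median(values) for service, values in grouped.items()}


# Invented sample data: (service name, recovery time in minutes).
incidents = [
    ("checkout", 115),
    ("checkout", 45),
    ("search", 300),
    ("search", 240),
    ("checkout", 60),
]

medians = median_by_service(incidents)
# List the slowest-recovering services first.
for service, value in sorted(medians.items(), key=lambda item: item[1], reverse=True):
    print(f"{service}: {value} min")
```

Sorting slowest first puts the services that would benefit most from runbooks or automation at the top of the list.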