7 Tips for faster MTTI and MTTR

Observability, cloud-native solutions, and automation can all help developers improve these important metrics

Estimated time to read: 8.5 minutes

MTTI and MTTR are two of the main metrics used to measure your incident response capability. Mean time to identify (MTTI) is the average time taken to identify an incident, while mean time to recover (MTTR) is the average total time that elapses before an incident is successfully resolved.

Optimizing these values enhances software quality and reliability: Shortening the incident resolution process by reducing MTTI and MTTR minimizes downtime for customers, increasing satisfaction with your product. However, it can be difficult to achieve improvements when DevOps teams lack the tools and processes required to accurately measure these values and gain visibility into where problems are occurring.

This article will discuss seven tips for refining your incident response strategy and accelerating MTTI and MTTR. Your team will be more equipped to effectively deal with incidents and developers will be freed up to focus on productive work.

Why MTTI and MTTR matter

MTTI and MTTR measure how quickly you’re responding to new incidents. Their values — and the trends seen within them — let you determine whether your incident response is improving and where you’re struggling during the process.

MTTI is relevant to the start of incidents. MTTI measures the time required for an incident to be recognized as such after it has begun. Improving this value minimizes the delay before response teams start working on a problem, which lowers the chance that customers will be the first to report an incident. MTTI is usually relatively easy to improve by deploying observability suites that provide continuous monitoring and alerting for your apps.

MTTR represents the total time taken to deal with incidents. This encompasses everything from when they begin through verifying that your resolution has been successful. MTTR will always be higher than MTTI because it includes the time spent diagnosing and mitigating incidents after they’ve been detected.

Tracking MTTR lets you assess your team's ability to troubleshoot issues and apply fixes. The relationship between MTTI and MTTR provides insight, too. If MTTI is low but MTTR is much higher, then you're able to detect issues promptly but struggle to mitigate them. Conversely, an MTTR that's not much higher than MTTI suggests you can quickly triage and fix incidents but lack the observability tools to spot them as they start. In this situation, identification is the longest part of your response.
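As a rough illustration of how these two values and their ratio could be computed, here is a minimal Python sketch. The Incident record, its field names, and the sample timestamps are assumptions made for this example; a real incident platform will expose its own data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime     # when the fault began
    identified: datetime  # when it was recognized as an incident
    resolved: datetime    # when the fix was verified


def mtti(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to identification."""
    seconds = mean((i.identified - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to verified resolution."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def identification_share(incidents: list[Incident]) -> float:
    """Fraction of MTTR that elapses before the incident is identified."""
    return mtti(incidents) / mttr(incidents)


# Made-up sample data: two incidents starting at the same moment.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=65)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=35)),
]
print(mtti(sample), mttr(sample), round(identification_share(sample), 2))
```

With this sample data, MTTI is 10 minutes and MTTR is 50 minutes, so identification accounts for about a fifth of the response. A share close to 1.0 would point to a detection problem, while a share close to 0 would point to slow mitigation.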

Tips for faster MTTI and MTTR

DevOps teams are continuously seeking to reduce MTTI and MTTR. Of course, incidents are best avoided altogether, but this isn't achievable in real-world scenarios. It's critical that when issues do arise, you can respond efficiently, first by pinpointing the problem and then by applying an effective mitigation. The following tips will set you up to be more proactive in your incident responses, leading to lower MTTI and MTTR.

1. Use real-time monitoring solutions

Implementing real-time monitoring via the use of observability solutions is one of the simplest ways to achieve an immediate improvement in MTTI and, in turn, MTTR. Knowing when an incident has begun is key to achieving a fast response, so it's important that your environments are continually monitored for anomalies that can point to an emerging incident.

Observability platforms such as Cisco AppDynamics collate critical metrics from your applications (such as response time, performance, and code-level traces) so you can detect when service degradation occurs. Because the monitoring is continuous, you don't need to manually inspect your services; instead, when an alert is received, you can immediately move to triage the incident.

App monitoring suites can help MTTR by providing vital context for detected incidents, such as relevant new entries in log files. By integrating with your app's environment and deployment systems such as CI/CD pipelines, it's also possible to extract vital context for developers to reference; this could be a list of changes that were deployed immediately prior to the incident, or an analysis of suspected anomalous events affecting your infrastructure.

2. Obtain complete coverage

Full coverage of your complete application inventory is required to minimize time spent dealing with incidents. If only a subset of your apps is monitored, then some incidents could still go unnoticed until they're manually flagged.

Obtaining complete coverage can be challenging when apps, teams, and environments frequently change. To address this, seek observability solutions that can automatically discover new endpoints through direct connections to your cloud platforms. This will ensure constant coverage of your systems and all incidents, even when your app inventory changes.
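To make the idea of a coverage gap concrete, the following sketch compares the services discovered in your cloud accounts against the ones that currently report to your observability suite. The function name, the returned keys, and the service names are all hypothetical; in practice, both sets would come from your cloud provider's inventory and your monitoring platform.

```python
def coverage_report(discovered: set[str], monitored: set[str]) -> dict:
    """Compare discovered services with monitored ones (illustrative only)."""
    unmonitored = sorted(discovered - monitored)  # running but not observed
    stale = sorted(monitored - discovered)        # observed but no longer running
    covered = len(discovered & monitored)
    ratio = covered / len(discovered) if discovered else 1.0
    return {"unmonitored": unmonitored, "stale": stale, "coverage": ratio}


# Made-up inventory for demonstration.
discovered = {"checkout", "search", "payments", "auth"}
monitored = {"checkout", "search", "legacy-cart"}
report = coverage_report(discovered, monitored)
print(report)
```

Here, two of the four discovered services are unmonitored, so coverage is 0.5, and any incident in "auth" or "payments" would only surface once someone reports it manually.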

It's also important to properly instrument your services so your observability suites can collect meaningful data. Raw performance metrics like CPU and memory usage don't always tell the full story; spikes in business values like average checkout completion time can point to incidents too. Monitoring these business metrics alongside resource metrics will help you discover more problems earlier, reducing MTTI and MTTR.
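One simple way to flag a spike in a business metric is to compare the latest reading against the mean and standard deviation of recent readings. The sketch below assumes a hypothetical series of average checkout completion times in seconds and a threshold of three standard deviations; commercial observability platforms typically use more sophisticated baselining than this.

```python
from statistics import mean, stdev


def is_spike(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if `latest` sits more than `threshold` standard
    deviations above the mean of `history` (a basic z-score check)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest > baseline  # flat history: any increase stands out
    return (latest - baseline) / spread > threshold


# Made-up average checkout completion times, in seconds.
recent_checkout_seconds = [40.0, 42.0, 41.0, 39.0, 43.0, 40.0]
print(is_spike(recent_checkout_seconds, 41.5))
print(is_spike(recent_checkout_seconds, 95.0))
```

A reading of 41.5 seconds falls within normal variation for this sample, while 95 seconds is far outside it and would be worth alerting on even if CPU and memory look healthy.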

3. Implement cloud-native solutions

Cloud operations require a cloud-native approach to observability. This accelerates incident response by permitting instant, automated correlation of events from across your cloud providers, even when multiple services are used.

Cisco Cloud Observability lets you connect to your AWS, Azure, and Google Cloud accounts. It aggregates the data from all your cloud resources and correlates trends to the infrastructure in your account. You can then make a holistic assessment of overall cloud performance.

This model provides a simpler, more intuitive picture of your cloud inventory, making it easier to analyze trends and spot potential incidents. The ability to observe all cloud activity in one solution lifts data out of platform silos, reducing the amount of context-switching required. This allows you to efficiently spot incidents and obtain insights to analyze their cause, ultimately reducing MTTI and MTTR.

4. Use automated remediation

While the three points above primarily concern incident detection and analysis, minimizing MTTR also mandates rapid resolution workflows that support developers in applying effective fixes. This is best achieved using tools that can automate common remediations to decrease the time spent by developers.

To be useful, incident reports should be actionable. Beyond knowing an incident has started, developers need the context to understand what has gone wrong and which services are affected. This information can be provided by platforms that understand the connections between services and how an outage in one is likely to affect the others.

Even more helpful are one-click mitigation options shown within the platforms where developers are working (including IDEs, chat apps, and issue trackers). This can permit incident resolution via a single click, cutting the recovery time down to seconds compared to minutes or hours when manual actions are required.

For example, if an incident is caused by excessive memory consumption on one of your services, your observability suite could suggest an automated workflow that adds capacity by spinning up another replica of the service. You could even make the action apply automatically, without initially prompting a developer. Your MTTR will improve since more incidents will be ended without having to wait for teams to respond.
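The memory example above could be expressed as a small decision function. This is only a sketch under stated assumptions: ServiceState, ScaleUp, the 90% threshold, and the replica cap are hypothetical, and the function merely plans an action. Applying it would go through whatever orchestration API your platform provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceState:
    """Hypothetical snapshot of a service, as reported by monitoring."""
    name: str
    replicas: int
    memory_used_fraction: float  # 0.0 to 1.0, averaged across replicas


@dataclass
class ScaleUp:
    """A proposed remediation: add one replica to the service."""
    service: str
    target_replicas: int
    requires_approval: bool


def plan_remediation(
    state: ServiceState,
    memory_threshold: float = 0.9,
    max_replicas: int = 10,
    auto_apply: bool = False,
) -> Optional[ScaleUp]:
    """Suggest adding a replica when memory use crosses the threshold."""
    if state.memory_used_fraction < memory_threshold:
        return None  # healthy: nothing to do
    if state.replicas >= max_replicas:
        return None  # at the cap: leave the decision to a human responder
    return ScaleUp(
        service=state.name,
        target_replicas=state.replicas + 1,
        requires_approval=not auto_apply,
    )


print(plan_remediation(ServiceState("checkout", 3, 0.95)))
```

Keeping a replica cap and an approval flag in the plan is a deliberate choice: a memory leak can trigger repeated scale-ups, so fully automatic application is safest when it is bounded.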

5. Practice root cause analysis

Resolution efforts need to be followed by a thorough post-mortem where you can analyze the root cause of the incident and implement additional mitigations that improve stability. This is important so you can preempt similar incidents in the future. It also ensures that all incidents eventually receive a permanent fix.

In the case of the "out of memory" scenario described above, the root cause could be a general increase in user activity, signaling that the service needs more capacity to support future load. Alternatively, you could determine that the memory consumption was caused by a poorly optimized code section that created a memory leak, requiring the development of an appropriate patch.

Root cause analysis improves quality by preventing you from becoming dependent on the temporary fixes applied when an incident is first detected. It also contributes to MTTR reductions by giving you the knowledge to anticipate the causes of incidents. Recognizing that an incident is similar to a previous one lets you jump straight to applying mitigation, meaning less time spent in the research and diagnosis stages.

6. Improve team collaboration

MTTI and MTTR improvements can be driven by collaborative methods too. Improving training and knowledge sharing for incident response engineers will help unify how incidents are tackled; it also ensures everyone is productive as soon as the pager rings, facilitating faster incident response.

Clearly documented processes, standard toolchains, and reliable communication platforms are key to achieving this. Incidents can be stressful, but using a consistent workflow will help everyone stay focused. Centralizing and automating your method also makes it easier for all parties to contribute, even if they lack specialist training for the particular service that's affected.

Rehearsing your response process is an effective way to maximize your preparedness ahead of encountering a real incident for the first time. Using techniques like chaos testing to simulate incidents by randomly disabling parts of your infrastructure is an invaluable way to practice resolving issues—without the pressure of real events. Drawing on the lessons learned, you can then make iterative improvements to your process, which will enable faster MTTI and MTTR when an actual incident hits.

7. Analyze where time is spent during incident response

Finally, remember to implement instrumentation that's active during your incident responses. You should time not only the overall start-to-finish duration, but also the time spent in each response phase—i.e., diagnosis, analysis, mitigation, and verification.

Tracking these metrics means you can identify any weak areas in need of improvement. For example, is it taking too long for developers to build and test fixes after the source of an incident is discovered? Although there could be a myriad of reasons for an individual stage to be slower than the others, knowing which ones you're performing well at versus those creating bottlenecks will help you identify new tools and processes to increase resolution speed.

Ideally, this data should be collected automatically as you interact with your incident response workflow. For example, once an automated alert is fed into your platform, the incident timer needs to begin running automatically. An alert acknowledged by a team member could then represent the next event, followed by the time when a suggested mitigation is applied or the incident is assigned to a developer for manual remediation.
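The event sequence described above can be sketched as a small timeline recorder that derives per-phase durations from timestamps. The event names and the IncidentTimeline class are assumptions made for illustration; an incident management platform would normally capture these events for you.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical event names, in the order they are expected to occur.
EVENT_ORDER = [
    "alert_received",
    "acknowledged",
    "mitigation_applied",
    "resolution_verified",
]


class IncidentTimeline:
    """Records when each response event happened for one incident."""

    def __init__(self) -> None:
        self.events: dict[str, datetime] = {}

    def record(self, event: str, at: Optional[datetime] = None) -> None:
        """Store the event time, defaulting to the current UTC time."""
        if event not in EVENT_ORDER:
            raise ValueError(f"unknown event: {event}")
        self.events[event] = at or datetime.now(timezone.utc)

    def phase_durations(self) -> dict[str, timedelta]:
        """Time elapsed between each pair of consecutive recorded events."""
        seen = [(name, self.events[name]) for name in EVENT_ORDER if name in self.events]
        return {f"{a} -> {b}": tb - ta for (a, ta), (b, tb) in zip(seen, seen[1:])}

    def slowest_phase(self) -> Optional[str]:
        """Name of the longest phase, or None if fewer than two events exist."""
        durations = self.phase_durations()
        return max(durations, key=durations.get) if durations else None


# Made-up incident: alert at 03:00 UTC, verified 40 minutes later.
start = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
timeline = IncidentTimeline()
timeline.record("alert_received", start)
timeline.record("acknowledged", start + timedelta(minutes=4))
timeline.record("mitigation_applied", start + timedelta(minutes=34))
timeline.record("resolution_verified", start + timedelta(minutes=40))
print(timeline.slowest_phase())
```

In this sample, the 30 minutes between acknowledgment and mitigation dominate the 40-minute incident, which suggests that diagnosis and fix development are where tooling or process changes would pay off most.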

Summary

MTTI and MTTR are two of the main metrics for understanding how quickly—and by extension, how effectively—you can respond to incidents that affect your systems. High-performing DevOps teams should prioritize achieving low MTTI and MTTR results, meaning incidents are being rapidly triaged and resolved with minimal customer impact.

Accelerating MTTI and MTTR requires you to first obtain full visibility into your app inventory so you can detect new incidents as they happen. To do this, you need robust cloud-native monitoring tools that let you diagnose faults by inspecting logs, metrics, and correlated events across your infrastructure. Such a platform equips you to resolve incidents and analyze root causes, preferably using automated mitigations to reduce the time developers spend testing fixes.

Ready to start using your data for faster MTTI and MTTR?

Cisco's Full-Stack Observability platform provides visibility into what's happening in your applications—across cloud, hybrid, and on-premises environments. You gain multi-domain insights into your system’s behavior, performance, and security issues. The Cisco Full-Stack Observability platform even integrates business context analysis, using Cisco services such as AppDynamics.

Debugging with DevAdvocates

Insight into effective incident management

During my DevOps days, I was responsible, as many developers are, for both proactive development and incident response. With incident response in particular, I experienced first-hand how a process for knowledge-sharing across the team saves time in the long run.

The way our organization worked was, if you were assigned the incident, you were in charge of the response. That meant, even if it was the middle of the night and you wanted to get started coding, building, and testing the solution, you first had to log the incident and update the status page. Now, AppDynamics and ThousandEyes have made updating status easier with real-time status, but no matter how you’re communicating the status of the application, it will save time to have this documented somewhere to share with stakeholders.

By publishing status on a webpage, such as status.myapplication.com, the developer can avoid interruptions to their work and long conference calls with stakeholders who are wondering about the status of the application or services. A well-known public status page answers these questions and provides push updates without engaging the developer, ultimately freeing the developer to start or continue working on the solution. For example, you can see the status of services at WebEx and Intersight by visiting status.webex.com and status.intersight.com.

In some cases, it may be tempting to simply resolve the issue and move on, especially if the incident has a run-book with a known solution or if the developer knows the solution from recent memory. However, as part of our operational discipline, we first set the status page and opened a ticket to track the issue before working on resolving the issue itself.

The second practice that is crucial to reducing MTTR is an up-to-date run-book. A run-book is a living document that lists past incidents and the steps taken to resolve each one. It allows the incident responder to resolve an issue more quickly by referring to previous similar issues. Ultimately, a run-book helps teams identify and resolve known issues faster, while building on the collective knowledge base.

From experience, I can relate to the way developers’ minds work. It isn’t always instinctive to want to spend time logging an incident, updating status, and recording in a run-book. Instead, a developer often wants to dive in and quickly fix the issue. However, only once these systems are in place can an organization reduce MTTR.

– Mel Delgado

Resources

Hub: Full-Stack Observability

Documentation: Cisco Observability Platform

eBook: Full-stack observability