Metric loss

Reassessing MTTR as a Key Indicator of Operational Performance – The New Stack

Ali Siddiqui

Ali is Product Manager for BMC Software Inc. In this role he has end-to-end responsibility for the company’s entire product portfolio, including BMC Helix Suite solutions, Control-M, and Automated Mainframe Intelligence (AMI). Ali received a Bachelor of Science degree in Electrical Engineering from the California Institute of Technology and a Master of Science degree from Stanford University.

While the technology for monitoring systems and applications has changed dramatically over the years, the way we measure performance and availability hasn’t changed much. But maybe it’s time to think differently about the metrics we use to manage our IT systems.

Most IT organizations use fairly standard metrics to assess operational performance: application performance and availability, adherence to service level agreements (SLAs), number and severity of incidents, and mean time to repair (MTTR).

When those numbers look good, we know our systems are generally stable, our teams and workflows are well balanced, we handle issues competently, and we recover quickly when problems do occur.

With these numbers in hand, IT can effectively demonstrate its value to the business, the business can better plan its workload and deliverables, and both can look for data-driven ways to make changes and improvements.

Within IT teams, these numbers are frequently used to set benchmarks and reward those who surpass them, because if we continually improve how quickly we respond to and resolve issues, we certainly improve the customer experience and customers’ perception of our business.

But with the growing reach and use of artificial intelligence for IT operations, or AIOps, at least one of those metrics may soon be seen differently.

AIOps enables efficient use of data

While you may not yet have explicitly adopted it in your own organization, in its most basic form AIOps was developed to help better manage today’s staggering volumes and varieties of data.

What’s wrong with so much data? As with everything, too much of a good thing isn’t really a good thing. Too much data means more time spent sifting through it to find any kind of actionable information.

If you have an outage and 100 alerts go off, how much time do you waste investigating 99 false alarms before you get to the one that can tell you what really went wrong?

Enter AIOps. By combining big data and machine learning to automate the types of IT operations processes that have so far required a lot of time and effort, AIOps creates efficiencies at scale, enables visibility across your infrastructure, and helps your team gain the insights needed to make powerful, data-driven business decisions more easily.

When event correlation, anomaly detection, and root cause determination are essentially removed from your team’s workload, thanks to AIOps’ analytical capabilities, IT teams will end up with more time to spend on more interesting and more productive projects.

Oh the irony

But there is a catch. Think about the powerful problem-solving abilities you get with AIOps. With the increased efficiency, visibility, and insight provided by AIOps’ machine learning capabilities, your MTTR numbers can actually increase.

So if you’ve assessed your team’s performance based on incremental reductions in the time it takes to restore services, you might want a new metric soon. Here’s why:

As AIOps-enabled solutions automate routine testing and investigation, proactively suggest fixes, and potentially correct problems, all without human intervention or oversight, those disruptions will in effect cease to exist. Your AIOps solution ended the outage before it even happened.

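So what is left? The biggest and most complex service and operational issues that can’t be automated, the ones that may actually require the talent of your operations staff and possibly a lot more time. Because MTTR is averaged only over the incidents that still reach your team, removing the quick fixes from the mix pushes the average up.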
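The arithmetic behind that shift is easy to sketch. The example below is a minimal illustration in Python, using made-up incident data and a hypothetical `automatable` flag (neither comes from a real system), of how MTTR behaves when the quick incidents are resolved automatically and never reach the team.

```python
# Hypothetical repair times (in minutes) for one month of incidents.
# The "automatable" flag is an assumption made for this illustration.
incidents = [
    {"name": "disk full", "minutes": 10, "automatable": True},
    {"name": "stuck job", "minutes": 15, "automatable": True},
    {"name": "expired certificate", "minutes": 20, "automatable": True},
    {"name": "service restart", "minutes": 15, "automatable": True},
    {"name": "cross-region data drift", "minutes": 120, "automatable": False},
    {"name": "cascading dependency failure", "minutes": 240, "automatable": False},
]


def total_minutes(records):
    """Total hands-on repair time across the given incidents."""
    return sum(r["minutes"] for r in records)


def mttr(records):
    """Mean time to repair: total repair time divided by incident count."""
    return total_minutes(records) / len(records)


# Before automation, the team handles every incident.
before = mttr(incidents)

# After automation, only the incidents that cannot be automated remain.
remaining = [r for r in incidents if not r["automatable"]]
after = mttr(remaining)

print(f"MTTR before automation: {before:.0f} min over {len(incidents)} incidents")
print(f"MTTR after automation:  {after:.0f} min over {len(remaining)} incidents")
print(f"Total repair time: {total_minutes(incidents)} -> {total_minutes(remaining)} min")
```

With these invented numbers, MTTR rises from 70 to 180 minutes even though the team’s total repair time falls from 420 to 360 minutes. The team is doing less firefighting overall, yet the headline metric looks worse.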

All is not lost, however. While these remaining challenges may be thornier, they are also the types of problems that engineering minds love, that you actually want to pay those competitive salaries for, and that ultimately lead to innovation.

Metrics moving forward

If MTTR is not going to accurately describe the success of an operations team, then what is the metric to watch in an AIOps-enabled future? Size of issue resolved? Complexity index? A calculated ratio between the severity of the problem and the time to resolution? Or have we really entered an era where the axiom “if you can’t measure it, you can’t manage it” no longer fits?
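To make one of those candidates concrete, here is a rough sketch of a severity-to-resolution-time ratio. It is only one possible reading of the idea, not an established metric: the 1-to-5 severity scale, the function name, and the sample incidents are all assumptions made for illustration.

```python
def minutes_per_severity_point(records):
    """Total repair minutes divided by total severity points.

    Each record is a (severity, minutes) pair, where severity is a
    hypothetical score from 1 (trivial) to 5 (critical). A lower result
    means the team spends less time per unit of problem severity.
    """
    total_severity = sum(severity for severity, _ in records)
    if total_severity == 0:
        raise ValueError("need at least one incident with nonzero severity")
    total_minutes = sum(minutes for _, minutes in records)
    return total_minutes / total_severity


# Invented sample data: two severe incidents and one trivial one.
sample = [(5, 240), (4, 120), (1, 10)]
print(f"{minutes_per_severity_point(sample):.1f} minutes per severity point")
```

Unlike a plain average, a ratio like this gives a team credit for the difficulty of what it resolves, although it depends entirely on how consistently severity is scored.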

This may be the next puzzle for your team to solve. No matter how you set the metrics for progress in this next era, the beauty is that we all win: fewer small, mundane problems, more big, interesting problems, and greater overall efficiency. Those are the numbers that count.

Photo by RODNAE Productions from Pexels.