Incident Management in IT Operations 101

In the previous episodes of this series, we explored some basic concepts of Incident Management in IT Operations. Part 1 covered what an incident is and why it is essential in supporting different kinds of computer systems. In Part 2, we looked into the incident lifecycle and the connection between incident handling and the quality of IT services.

In this part, I will discuss the KPIs that are usually used to measure the performance of the Incident Management process implementation and what happens with incidents after they have been resolved.

I want it all, and I want it now!

What are FTR, TTR, TTA and other metrics?

As I said before, taking care of incidents in a fast and efficient way directly impacts the stability of business operations in general. The reasonable question that arises, therefore, is: How do we know whether we are performing well enough to comply with a service level agreement? One of the possible solutions is to have some KPIs, or in other words, metrics, that will allow you to keep an eye on the process itself and on SLA compliance specifically.

This set of metrics can be as simple or as complex as you like. There is no single solution that would satisfy the needs of every organization. However, some time-proved indicators are worth considering when defining your measurement system. So, let’s talk about them.

The most important metric is Time To Resolve, or TTR shortly. It is basically used to track the actual time it took to resolve an incident. Moreover, it is common for SLAs to define target resolution time for different kinds of incidents explicitly. Depending on incident priority, the target objective for this metric usually varies: incidents with higher priority should be resolved in a shorter time, while low-priority incidents can be handled during a more extended period.

Another widely used metric is Time To Response (TTR) or First Time Response (FTR). Despite the variation in terminology, these metrics mean almost the same and are used to measure the time it takes to react or respond to an incident. Technically, it means that a support representative in some way acknowledges the incident by changing its status or contacting the incident reporter via a comment in ITSM, chat, email, phone, etc.

One more noteworthy metric is Time To Assign (TTA). It is less used than the two previous ones; nevertheless, it might be useful for understanding how fast incoming incidents are assigned to support engineers. Of course, it is good when incidents are automatically routed to defined support teams, but as I said, that is not always the case. In situations where you have some manual routines to dispatch incident tickets, measuring this time will help you control the efficiency of this part of the process.

Why are the metrics essential? As Peter Drucker said:

You can't manage what you can't measure.

Imagine that you are driving a car without any indicators on the dashboard. You can accelerate or brake, turn a steering wheel and switch gears to control the movement direction, but you have no idea about the speed or amount of fuel in your tank. The same is true for a process without metrics, which can be seen as a mechanism without any feedback about its state or performance.

In addition to the apparent role of metrics as indicators on your process control dashboard, they have more significant meaning. When, for instance, you come to a bank, a shop, a cafeteria, etc., and their payment processing system is not working, the first thing you would like to know in this situation is how much time you should wait until being served. For business, knowing that time and sharing it with customers is all about managing customer expectations. I suppose, for you as a customer, it might be more convenient to understand whether the expected system restore time is 15 minutes or 15 hours. In the first case, you might decide to take your time and just wait for 15 minutes, in the other one it might be more reasonable to postpone your visit or look for alternatives.

Here lies…

Incident postmortem reports

So, a computer system is back online after a crash, all indicators are green, and life is good or somewhere close to that. What else can be done? Can you learn something from this unpleasant situation when you worked for 20 hours straight spinning up servers, restoring user data and listening to complaints from your users and company management? Or, possibly, you already experienced firsthand the process of filling up countless report forms and explaining what happened. Let’s look at how you can benefit from incidents.

First of all, if you keep the records of your incidents, which I really hope you do, and use ticketing and monitoring tools that allow you to relate incident records to the specific parts of your infrastructure (server, database, website or another configuration item), you can significantly reduce resolution time for future similar incidents. If you, as a support engineer, receive an incident to handle, and this incident is logically connected with other ones from the same piece of hardware or software, and these other ones have some notes on what to do to fix the issue, it might save you tons of googling and troubleshooting time.

Secondly, and more importantly, each incident is a signal about flaws in your architecture, configuration, or operation itself. Therefore, if properly read, these signals can be used as inputs in the ITIL Problem Management process, which is targeted to detect flaws and prevent future incidents from occurring.

Every interruption in normal system operation provides us with some bits of useful data. For a large-scale infrastructure, it can be really Big Data. If you are overwhelmed by hundreds of big and small incidents per day, it becomes hard to find time to return to already resolved ones and scrutinize them in retrospect. So, in practice, not all incidents undergo postmortem analysis, but usually only significant ones. It can be major incidents, which affected a large part of the infrastructure or user base, or failures in operation that caused critical data loss or extended periods of downtime – it will depend on the specifics of your IT operation and the established incident management process.

The key idea of postmortems is to extract the most valuable information from your incident flow. By valuable information, I mean the information that will provide you with useful insights about possible improvement areas. It can be information about the incident impact area, the severity of the failure, the chronological course of events, assumptions about what caused the incident, suggested improvements to ensure that similar situations will not be repeated in the future, etc.

If you do not perform incident postmortems, you just do not learn from your mistakes.

(Just) One big happy family

Problem, Change and Event Management

As you might have already noticed from the previous posts in this series, Incident Management is not the only process in the ITIL framework. Yes, it can be seen as a cornerstone of IT Operation, but this one stone is not enough to support a whole building.

Apart from just fixing what has broken and keeping your systems afloat, there is a lot more work to do. As Lewis Carroll wrote in his famous novel “Alice in Wonderland”:

“…here we must run as fast as we can, just to stay in place. And if you wish to go anywhere, you must run twice as fast as that.”

If you do not take actions to prevent incidents from happening, to change your computer systems to be more resilient to failures and more accessible to its users, to understand better what is going on with your system when it is under heavy load or under a hacker attack, you will unlikely succeed in business. There will always be some competitors on the market who will try to make their products or services more appealing to customers by making them more available, faster, with better user support and for less money. Therefore, if your business wants to remain profitable and competitive, it will also need to find ways to make its offerings better.

So, what are the other essential components of effective IT Operation? I already mentioned the Problem Management process, which is intended to look into the reasons why incidents happen and how to reduce their numbers and severity to some acceptable levels. Problem Management is a natural extension of Incident Management and uses information about incidents as the input data source.

When you have investigated a problem and are planning the improvements in your IT infrastructure, it is high time to consider the Change Management process. Change Management’s primary focus is to perform changes in an organized and predictable way to reduce the risks of unplanned service interruptions and downtimes. Why should you pay attention to changes too? According to different research, approximately 80% of incidents can be traced to changes in the configuration of your application or underlying infrastructure.

After the change has been performed, it is the right time to check your monitoring tools and look into system logs for events that might tell you whether everything is working as expected or there are some traces of undesirable behavior, i.e., new incidents.

If you want to hear more about Incident Management or have questions, please share your feedback in the comments! 👇