In the first part of this series, I talked about what an incident is in IT Operations, why you should distinguish incidents from other types of support tickets, and how to prioritize them.
In this post, I will continue exploring the basic concepts of Incident Management.
If it's not written down, it never happened
Incident lifecycle, workflow and states
Now that we know, at least in theory, what an incident is and how to sort the incident queue, it is time to take a look at the incident lifecycle. Let’s step back for a moment and ask ourselves a question: “How do we know that there is an incident?”
To work on incidents, we need records of them: a place to note all the information that might be useful for their resolution. It is also helpful to share this information with your teammates, so they know what is going on and what the current status of each issue is. Without an incident record, it is really hard to track changes and follow status updates. So formally, an incident becomes known when the information about it has been logged and a corresponding record has been created. In IT Operations, the proven best practice for incident logging is to use ITSM tools that are specifically designed for this purpose.
After an incident record has been created, support engineers become aware of the issue and can start working towards its resolution. After the issue has been resolved, the incident record should be updated with resolution information, so other engineers know that this specific issue has already been addressed and there is no need to look into it, at least for now. On the way from point A (an incident happened) to point B (the incident was resolved), the incident record can transition through different states and status updates, e.g.:
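To make the idea of an incident record more concrete, here is a minimal sketch in Python. It is not how any particular ITSM tool stores incidents; the class name, the field names and the `INC-0001` identifier are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """A hypothetical minimal incident record; all field names are illustrative."""

    incident_id: str
    summary: str
    status: str = "active"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Chronological notes so teammates can follow what has been done so far.
    updates: list[tuple[datetime, str]] = field(default_factory=list)

    def add_update(self, note: str) -> None:
        """Append a timestamped note to the record."""
        self.updates.append((datetime.now(timezone.utc), note))

    def resolve(self, resolution: str) -> None:
        """Record how the issue was fixed and mark the incident as resolved."""
        self.add_update(f"Resolved: {resolution}")
        self.status = "resolved"


# Example usage with made-up data.
record = IncidentRecord("INC-0001", "Payment page returns HTTP 500")
record.add_update("Restarted the application server")
record.resolve("Rolled back the faulty deployment")
```

The point of the sketch is only that every change leaves a trace: anyone who opens the record later can see what was tried, when, and how the incident ended.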
- incident acknowledged – a support representative verified and confirmed that the issue indeed exists;
- incident assigned to an engineer – incident record has been assigned to a support engineer for investigation and resolution;
- work in progress – a support engineer or a team is working on the issue and trying to fix it;
- awaiting confirmation – a support representative was not able to verify the issue and is waiting for additional information;
- customer pending – a support engineer has made some changes and is waiting for customer feedback to confirm that the issue is indeed fixed, etc.
The exact incident status names and definitions, as well as the workflow described in the state diagram for incident records, may vary from system to system and from one process implementation to another. Nevertheless, the underlying idea remains the same – to have an explicitly defined workflow for processing all incidents according to the same standard rules.
In most scenarios, a simple workflow that defines only two incident states, “active” and “resolved,” works fine. It is good enough for 80% of incident management process implementations. If it is insufficient in your specific situation, remember that workflows are not carved in stone; they can and should be tailored to your needs.
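An explicitly defined workflow can be sketched as a small transition table. The example below is an assumption for illustration only: the two-state table follows the “active”/“resolved” workflow described above, while the detailed table, including the extra “new” state and the specific transitions allowed, is one possible arrangement of the statuses listed earlier and not a standard.

```python
# A workflow maps each state to the set of states it may move to next.
Workflow = dict[str, set[str]]

# The simple two-state workflow.
SIMPLE_WORKFLOW: Workflow = {
    "active": {"resolved"},
    "resolved": set(),
}

# A more detailed workflow; these transitions are an assumed example.
DETAILED_WORKFLOW: Workflow = {
    "new": {"acknowledged", "awaiting confirmation"},
    "awaiting confirmation": {"acknowledged", "resolved"},
    "acknowledged": {"assigned"},
    "assigned": {"work in progress"},
    "work in progress": {"customer pending", "resolved"},
    "customer pending": {"work in progress", "resolved"},
    "resolved": set(),
}


def transition(workflow: Workflow, current: str, target: str) -> str:
    """Return the target state if the move is allowed, otherwise raise."""
    allowed = workflow.get(current, set())
    if target not in allowed:
        raise ValueError(f"illegal transition: {current!r} -> {target!r}")
    return target


# Walk one incident through the detailed workflow.
state = "new"
for step in ("acknowledged", "assigned", "work in progress", "resolved"):
    state = transition(DETAILED_WORKFLOW, state, step)
```

Because the workflow is plain data, tailoring it to your needs means editing the table and nothing else; the `transition` function stays the same for both variants.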
It is not my business
Incident owning, assigning and dispatching
As in my example with patients at the hospital, it is good to know who your doctor is. In other words, who is responsible for taking care of a specific incident? If no one is in charge, then there is a high chance that nobody will do anything to solve the issue. This has nothing to do with technology but rather with human psychology. It is much easier for many of us to say, “It’s not my business…” than to take responsibility and accept the related risk of making things worse. Of course, someone might say that it is a matter of organizational culture, character, and personal attitude, but when you have strict SLAs to comply with (more on this later in this post), you just do not have the luxury of checking every time whether someone is going to fix that new dumb issue.
So, how are you going to ensure that every incident has its own “doctor”? You have two options. The first is to have a person (or a few people) who dispatches incoming incidents to the appropriate support teams or even assigns them to specific engineers right away. To do this, the dispatcher needs some rules for routing incidents. The second is to have an automated system with a defined algorithm that processes registered incidents and forwards them to the relevant technicians. The rules may vary from team to team and from organization to organization, but the key point is to have an algorithm or procedure in place that everybody understands and follows.
From my experience, there is no single best solution for incident dispatching. In some cases, it is possible to have one-to-one relationships between computer systems and corresponding support teams and fully automate incident assignments based on this knowledge. In others, the boundaries between systems and responsibility zones might be blurred, and manual dispatching works best. In still others, it is just a straightforward process of assigning all incoming incidents to the first tier of support (L1) and later escalating them to a higher tier if needed. Of course, it would be great to automate incident assignments as much as possible, although there might be organizational, experience and process differences that will not allow this to happen. So, I advise you to experiment and choose the approach that works best for you.
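As a rough illustration of the automated option, here is a minimal rule-based dispatcher in Python. The team names, the incident fields (`system`, `summary`) and the rules themselves are hypothetical; a real routing table would come from your own organization.

```python
from typing import Callable

# Each rule pairs a predicate over the incident with a target team.
Rule = tuple[Callable[[dict], bool], str]

# Hypothetical routing rules, checked in order; the first match wins.
ROUTING_RULES: list[Rule] = [
    (lambda inc: inc.get("system") == "payments", "payments-support"),
    (lambda inc: inc.get("system") == "email", "messaging-support"),
    (lambda inc: "network" in inc.get("summary", "").lower(), "network-ops"),
]

# Assumed fallback: anything unmatched goes to a general first-line queue.
DEFAULT_TEAM = "service-desk-l1"


def dispatch(incident: dict) -> str:
    """Return the first team whose rule matches, or the default team."""
    for matches, team in ROUTING_RULES:
        if matches(incident):
            return team
    return DEFAULT_TEAM


assigned_team = dispatch({"system": "payments", "summary": "Card payments fail"})
```

The same ordered list of rules could just as well be a printed checklist for a human dispatcher; what matters is that the rules are explicit and that every incident ends up with an owner, even if only the fallback queue.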
We will look into your issue as soon as we can…
Incidents and SLA
For business operations, it is crucial that supporting processes are predictable. A Service Level Agreement, or SLA for short, is an agreement between a consumer, usually a business, and a supplier (in other words, a service provider) that sets such expectations by defining service-level requirements. These requirements may determine, for example, that your workplace should be operational 99% of business hours. If there are no such expectations, it becomes really hard to run a business. Let me illustrate this point.
For example, you, as an office employee, come to work every working day to carry out your duties. You are assigned some tasks and plan your time to complete them on time. How do you do that? Basically, you assume that each working day you have 8 (or any other number) working hours to work on tasks. You estimate the number of hours each task will take, book time for it on your schedule and start working. But what if, each day when you come to work, you don’t know exactly how many hours you will be able to work? How can you then make any plans and commitments to complete the tasks by a due date?
Another example is on a business level. When you go to a bank, you fully expect that you will be able to make your payments during the bank’s opening hours. However, what can you do if the bank’s payment processing system is not working and there is no forecast of when it will become operational again? I assume that you might decide to transfer your money to another bank that can at least provide you with information on recovery time in case of outages.
To ensure predictability, SLAs can define a target system availability as well as recovery time objectives. To put it another way, there are usually obligations to resolve incidents and recover computer systems from failures within a specified time.
When an incident happens and you cannot do what you intended to, it is at least helpful to know when the issue is going to be resolved so you can adjust your plans.
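To get a feeling for what an availability target means in practice, it helps to turn the percentage into a downtime budget. The sketch below assumes a month of 21 working days with 8 business hours each; both numbers and the function names are my assumptions for the sake of the arithmetic.

```python
def downtime_budget_minutes(availability_pct: float, business_hours: float) -> float:
    """Maximum downtime, in minutes, that still meets the availability target."""
    return business_hours * 60 * (100 - availability_pct) / 100


def measured_availability_pct(business_hours: float, downtime_minutes: float) -> float:
    """Availability actually achieved, given the downtime observed in the period."""
    return 100 * (1 - downtime_minutes / (business_hours * 60))


# Assumed month: 21 working days of 8 business hours each (168 hours).
HOURS_PER_MONTH = 21 * 8

budget = downtime_budget_minutes(99.0, HOURS_PER_MONTH)
achieved = measured_availability_pct(HOURS_PER_MONTH, downtime_minutes=240)
```

Under these assumptions, a 99% target over 168 business hours leaves a budget of roughly 100.8 minutes per month, so a single four-hour outage (240 minutes) would already push the measured availability below the target.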
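Answering the question of when an issue must be resolved can be as simple as adding a target duration to the time the incident was logged. In the sketch below, the priority names and target durations are hypothetical, and the calculation uses plain calendar time rather than business hours to keep the example short.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real values come from your SLA.
RESOLUTION_TARGETS = {
    "critical": timedelta(hours=4),
    "high": timedelta(hours=8),
    "normal": timedelta(days=3),
}


def resolution_deadline(created_at: datetime, priority: str) -> datetime:
    """Latest moment at which the incident can be resolved without a breach."""
    return created_at + RESOLUTION_TARGETS[priority]


def is_breached(created_at: datetime, priority: str, now: datetime) -> bool:
    """True if the resolution target has already been missed at time `now`."""
    return now > resolution_deadline(created_at, priority)


# Example with made-up timestamps.
logged_at = datetime(2024, 1, 15, 9, 0)
deadline = resolution_deadline(logged_at, "critical")
```

A deadline like this is something you can communicate to the affected user right after the incident is logged, which is often what they need in order to replan their work.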
In the next part of this series, I will talk more about SLA metrics to measure process performance and service levels. Additionally, we will look into the outcomes of the incident management process and how they can be connected with other activities in IT Operations.
So, stay tuned and ask your questions in the comments!