Contrary to a common misconception, monitoring in a DevOps-aligned organization is not about “Hey, let’s use Prometheus, ELK, Grafana or whatever else and output nice dashboards on big wall-mounted screens!” In fact, monitoring is not about tools and underlying technologies at all.
Many teams that host applications in Azure App Service set up availability monitoring for their apps using Application Insights web tests. It is also common practice to put an application into maintenance mode before applying any changes, so as not to be woken up at 2 am.
In this part, I will talk a little bit more about the KPIs that are usually used to measure the performance of an Incident Management process implementation and what happens to incidents after they have been resolved.
In the first part of this series, I talked about what an incident is in IT Operations, why you should distinguish incidents from other types of support tickets, and how to put them in order. In this post, I will continue exploring the basic concepts of Incident Management.
With this post, I want to start a series of publications on dealing with incidents in IT Operations and typical mistakes people make in this