Recently, I was reviewing my notes from last year’s client meetings and noticed that the same questions about IT monitoring came up in almost half of the records. There was nothing special about them: just a basic checklist that many IT Operations managers use when onboarding new services. However, the clients’ answers to those questions revealed some unpleasant mistakes that even mature companies make in their operations. So, let’s talk about them.
Mistake #1: No monitoring
It is sad to hear, but many organizations, especially small ones, still underestimate the importance of continuous service health monitoring. Some people see network, infrastructure or application monitoring as unnecessary overhead for their business processes. Others firmly believe that modern technologies and cloud services are so reliable that there is no need to worry about their availability.
Unfortunately, such irresponsibility has its price. If you do not know whether your computer systems operate as needed, you bear the risk of loss or damage to your business. The more critical the system, the more significant the impact of an outage. Those who survive this impact the first time usually change their minds about monitoring very quickly. The others, well, they just do not stay in business long enough to share their experience.
Mistake #2: Insufficient coverage
The next most common mistake is not having end-to-end monitoring of all your IT components. A computer system is like an onion: its work depends on multiple layers, including the network, operating systems, databases, web servers, application components and even end-user devices and web browsers. Monitoring your system on just one or two layers is not enough to see the whole picture. For example, from an infrastructure perspective everything may seem fine and all lights are green, yet your users report multiple errors when accessing the application itself. This is a strong sign that application performance monitoring and synthetic transaction checks are missing.
This mistake is generally observed in organizations where the departments that support computer systems work separately and do not agree on unified principles of operation. A typical scenario involves Dev and Ops teams that have different views on the core processes of running applications in production. Developers believe that their responsibility ends once an application has passed all unit, integration, UI and other tests, while Operations engineers feel responsible only for providing the underlying infrastructure for the application. As a result, you have two pieces, an application package and a bunch of servers or other IT stuff, that are fully OK on their own but totally useless as a business application.
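One way to spot such coverage gaps is to list, for each layer mentioned above, the checks that actually exist for a service. The snippet below is only a minimal sketch: the layer names follow the onion analogy, while the function name, the data layout and the example check names are assumptions made for illustration, not part of any particular monitoring tool.

```python
# Layers taken from the onion analogy; adjust the list to your own stack.
LAYERS = [
    "network",
    "operating system",
    "database",
    "web server",
    "application",
    "end-user experience",
]


def find_uncovered_layers(checks_by_layer, layers=LAYERS):
    """Return the layers that have no monitoring checks configured."""
    return [layer for layer in layers if not checks_by_layer.get(layer)]


# Hypothetical inventory of checks for one service.
inventory = {
    "network": ["ping", "interface errors"],
    "operating system": ["cpu", "memory", "disk"],
    "database": ["replication lag"],
    "web server": [],
}

for layer in find_uncovered_layers(inventory):
    print(f"No checks configured for layer: {layer}")
```

In this made-up inventory the infrastructure layers are green, but nothing watches the web server, the application or the end-user experience, which is exactly the situation described above.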
Mistake #3: Too much data
This one is the root of many unsuccessful monitoring implementations. In the beginning, it seems that the more information you gather about your computer systems, the better. However, this is not true. When you start receiving thousands of events every minute, it becomes really hard to make sense of all these red lights, flags, exclamation marks and whatever else. I have seen too many monitoring projects fail because of overblown requirements to monitor everything.
This usually happens for two reasons. The first one is us, computer geeks. You know, we are so excited about all the cool, shiny new features in the latest release of our favorite program that we want to try out every switch, configuration option and tons of other stuff. The second one is the managers. Yes, I mean all those Dilbert-style bosses who make no effort to clarify objectives or communicate effectively with tech nerds. These two statements may sound harsh, but as you can see, they both point to people and not to the monitoring tools themselves.
So, do not blame the tools; look in the mirror if you face such an issue. Start small and add new data only if it has value for you. The rule here is simple: if a monitoring alert, warning or event is of no value to you, get rid of it. If you do nothing with this data, you do not need it.
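The "get rid of it" rule can be applied to an export of past alerts. The following is a rough sketch, assuming each alert record carries a rule name and a flag telling whether anyone acted on it; the field names `rule` and `action_taken`, the function name and the threshold are all hypothetical and would depend on your monitoring system.

```python
from collections import Counter


def find_noise_rules(alert_history, min_alerts=20):
    """Return rules that fired at least min_alerts times and never led to an action."""
    fired = Counter()
    acted = Counter()
    for alert in alert_history:
        fired[alert["rule"]] += 1
        if alert["action_taken"]:
            acted[alert["rule"]] += 1
    return sorted(
        rule
        for rule, count in fired.items()
        if count >= min_alerts and acted[rule] == 0
    )


# Hypothetical alert history exported from a monitoring system.
history = (
    [{"rule": "cpu_above_50_percent", "action_taken": False}] * 30
    + [{"rule": "disk_almost_full", "action_taken": True}] * 3
)

for rule in find_noise_rules(history):
    print(f"Candidate for removal: {rule}")
```

Rules reported here are only candidates: the final decision to drop or retune them still belongs to the people who own the service.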
Mistake #4: Collecting the wrong data
Having a lot of data from the monitoring system does not necessarily mean that you collect the data you actually need, and sometimes people forget about the real monitoring objectives in all this informational noise.
“There is nothing quite so useless,” said management expert Peter Drucker, “as doing with great efficiency something that should not be done at all.”
For example, you might have a server with a website running on it. You need to monitor the website’s availability for its users in different regions. Novice engineers might configure an ICMP probe that pings the server every X minutes and raises an alert if the ping fails after Y attempts. Simple and effective solution, right? Wrong! An ICMP ping only tells you that your server can be reached over the network; it does not indicate whether the server listens for HTTP/HTTPS requests and serves the website content. Moreover, if you check your server’s availability only from the network where the monitoring system is located, you will not know when the website becomes inaccessible to users in other locations because of network routing issues. Basically, in this example, you are getting the wrong data, which does not reflect actual website availability.
The purpose of monitoring is not to get some alerts, but to get the information that is relevant to your goal.
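To make the difference from a ping concrete, here is a minimal sketch of a check that requests the page over HTTP(S) and verifies both the status code and a piece of expected content. It uses only the Python standard library; the function names, the example URL and the expected text are assumptions for illustration. To cover the regional aspect, the same script would have to run from probes in several locations, which is outside this sketch.

```python
import urllib.error
import urllib.request


def evaluate_response(status, body, expected_text):
    """Decide whether an HTTP response counts as 'website available'."""
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        return False, "page loaded but expected content is missing"
    return True, "ok"


def check_website(url, expected_text, timeout=10.0):
    """Fetch the URL and return (ok, detail) for a single availability check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return evaluate_response(response.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return evaluate_response(exc.code, "", expected_text)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout and similar problems.
        return False, f"connection failed: {exc}"


if __name__ == "__main__":
    # Hypothetical URL and marker text; replace with your own site.
    ok, detail = check_website("https://www.example.com/", "Example Domain")
    print("UP" if ok else "DOWN", "-", detail)
```

A server that answers pings but returns an error page, or no page at all, fails this check, which is the behavior the website’s users actually care about.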
Mistake #5: No monitoring routine
Unfortunately, many people do not understand that monitoring is not about alerts, agents, consoles and dashboards. All of these are merely useless pieces of software unless there is an established Event Management process. If an alert gets no reaction, no follow-up and no automated remediation, your monitoring is good for nothing.
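A very small sketch of this idea: every alert rule is mapped to a defined response, and any alert without one is reported as a gap in the process. The rule names, the handler functions and the `runbook` mapping are hypothetical; in practice the responses would open tickets, page the on-call engineer or trigger remediation scripts instead of returning strings.

```python
def restart_service(alert):
    """Placeholder for an automated remediation step."""
    return f"automated remediation: restart service on {alert['host']}"


def page_on_call(alert):
    """Placeholder for escalating to a human."""
    return f"ticket opened and on-call engineer paged for {alert['host']}"


# Hypothetical runbook: which response belongs to which alert rule.
runbook = {
    "web_service_down": restart_service,
    "database_replication_broken": page_on_call,
}


def handle_alert(alert, runbook):
    """Run the defined response for an alert, or flag the missing process."""
    action = runbook.get(alert["rule"])
    if action is None:
        return f"UNHANDLED: no response defined for '{alert['rule']}'"
    return action(alert)


print(handle_alert({"rule": "web_service_down", "host": "web01"}, runbook))
print(handle_alert({"rule": "fan_speed_warning", "host": "web01"}, runbook))
```

The second alert has no entry in the runbook, so it is either missing a response or should not be raised in the first place.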
Imagine that you invested millions of dollars and built an airport. It was constructed to the most up-to-date standards and uses the most recent technologies and devices to monitor flights from its air traffic control tower. However, your ATC personnel do not have clearly defined and unified rules for assisting planes with landings and take-offs. Each flight is handled differently, if it is handled at all. Some planes land successfully, some do not, and some crash into each other in the air. Looks like a bad dream, right?
I used this example to illustrate the point that having even the best tools without established monitoring processes and well-trained staff can sometimes be worse than not having monitoring at all. Well-implemented IT monitoring is about people, processes and tools all together. Remove one component, and I won’t give you a dime for the system you “monitor.”
What typical mistakes in IT monitoring or operation have you observed? Share your thoughts in the comments!