Monitoring from the DevOps perspective

Despite the common misconception, monitoring in a DevOps-aligned organization is not about “Hey, let’s use Prometheus, ELK, Grafana or whatever else and output nice dashboards on big wall-mounted screens!” Monitoring is not even about tools and underlying technologies. First and foremost, monitoring is about the value you get from the combination of tools, processes and people in your teams.

From my experience, implementations of even best-in-class monitoring solutions have failed when they didn’t correspond to the internal processes, the organizational structure and the actual needs of their consumers. So, this post will be about organizational and architectural aspects rather than technical ones.

Spoiler: there is no best monitoring tool or solution on the market, regardless of what salespeople may tell you.

Traditional IT monitoring

Historically, monitoring was something that most people thought of as a purely IT Operations thing. Developers wrote code, created applications and handed them over to Ops people for deployment and running. In the best cases, both parties clarified the requirements for monitoring an app before launching it into production. In other cases, Ops did their best to understand what a healthy state looked like for a specific application and configured their monitoring tools accordingly. If a new version containing breaking changes was released, the only way to avoid false-positive alerting in Operations was to deploy it into a staging environment first and verify that it did not cause an alert storm in the monitoring console.

Using as few monitoring tools as possible was common practice to minimize costs and maintenance overhead. Typically, an organization incorporated a single primary monitoring solution that all teams had to use as the authoritative source of operational metrics and events. If there was a need to dig into some context-specific area such as security, network or application performance monitoring, other tools were brought in on demand and usually limited to one functional team.

The primary advantage of such a monitoring implementation is that it can be run with a relatively small workforce: in general, one system requires less maintenance effort than a dozen different tools. Also, having only one system to operate makes it easier to ensure a unified user experience and standard processes for alert handling.

On the other hand, the implementation effort for these unified monitoring tools remained quite high. As infrastructure and application environments show a lot of variety, especially at a large scale, configuring monitoring rules for all of them usually requires trade-offs: you either make additional investments to tailor unsupported monitoring scenarios to your needs or take the tool “as is,” with limited functionality for some of your components. Apart from that, due to their versatile nature, such “one-to-rule-them-all” monitoring tools are pretty sophisticated and require substantial expertise to configure. As a result, the small group of IT engineers who support them usually becomes a bottleneck for the inquiries from development teams to put a new application on monitoring or change the configuration for an existing one.

The monitoring system becomes a monolithic application that is hard to reconfigure, has many dependencies and accumulates technical debt like a high-interest credit card.

But don’t get me wrong. The traditional approach to IT monitoring described here should not be considered ineffective or outdated. It can be a perfect fit in environments with a low velocity of change, such as big, stable enterprises or highly regulated organizations. However, it cannot keep up with the fast flow of small and frequent changes promoted by the DevOps methodology.

The new era

A new approach to software development, initially promoted by the Agile movement, started gaining traction in the 2010s. Shorter development intervals, continuous changes to the product/project based on feedback from the previous cycle, and the involvement of people from different areas created new challenges for monitoring solutions, which now had to satisfy not only the needs of operations engineers but also those of developers, testers, security and business representatives. Moreover, each role had a different point of view on the information it wanted from monitoring. What is of high interest to Ops, like CPU or memory metrics, has little or no value for people from marketing or sales, who are more interested in tracking the number of newly acquired customers or submitted orders.
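To make the contrast concrete, here is a minimal sketch using the Prometheus Python client (prometheus_client), one of the stacks mentioned above, that exposes an operational metric and a business metric side by side from the same service. The metric names, port and stubbed values are illustrative assumptions, not a reference implementation:

```python
# Minimal sketch: one service exposing metrics for two audiences.
# Requires the prometheus_client package; names, port and values
# below are hypothetical.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# What Ops cares about: resource consumption of the process.
HEAP_USED_BYTES = Gauge("app_heap_used_bytes", "Approximate heap memory in use")

# What marketing and sales care about: business events.
ORDERS_SUBMITTED = Counter("orders_submitted_total", "Orders submitted by customers")

def handle_order() -> None:
    # ... real order-processing logic would live here ...
    ORDERS_SUBMITTED.inc()  # one more order for the business dashboard

if __name__ == "__main__":
    start_http_server(8000)  # both metrics served at :8000/metrics
    while True:
        HEAP_USED_BYTES.set(random.randint(10**6, 10**7))  # stub value
        handle_order()
        time.sleep(5)
```

Both numbers come from the same process, but they end up on very different dashboards: the gauge on an Ops resource view, the counter on a sales funnel chart.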

So, agile teams demanded new monitoring solutions that they could easily use and reconfigure on their own, without depending on an external party. Some vendors tried to pack more and more functionality into their tools to cover the needs of every product/project role. However, in my experience that didn’t work well: it added more complexity and created multitools that can do almost anything, but none of it as well as specialized solutions. A tool designed for security event management is hardly applicable to tracking business metrics.

Apart from that, you should also consider personal preferences in different teams. If a team is familiar with a specific monitoring solution and can use it efficiently in its current context, why enforce some enterprise-scale solution that will slow the team down tenfold? Such top-down enforcement will only make your teams resistant to it; it is much better to give them freedom in their choice of toolkit. People are more likely to do their best work out of their own initiative than out of external motivation.

That doesn’t mean, however, that a team’s choice of tools will always be the best one, but the essence of Agile software development is that you learn from experience and adapt. During the application lifecycle, the requirements and product vision can change significantly from what you started with, and at some point you might want to reevaluate your monitoring toolbox. That is totally fine and even desirable. The key idea is that your monitoring should change and evolve with your application or service.

DevOps challenges

One of the ideas promoted by DevOps is to make your product or project as self-sufficient as possible. Ideally, you should be able to do most of the tasks locally in your team, with no or little dependency on external parties (see my Notes on “The Unicorn Project”). Besides, the idea of running services independently became very popular after Jeff Bezos’s famous “API Manifesto” and Amazon’s subsequent success, creating an organizational pattern where many teams operate autonomously by their own internal rules.

Additionally, although it is good to have all the required expertise (Devs, Ops, InfoSec, QA, Product Owners) on board in high-performing, independent teams, this setup created a new challenge: ensuring that all the applications and services these teams produce, especially customer-facing ones, are delivered according to unified service quality levels that satisfy customer expectations.

Hint: your customers don’t care about your technical, organizational or any other internal specifics; they use your product to solve their problems.

So, on the one hand, you can now have a range of applications and services that use different technology stacks, have different monitoring requirements and, what’s more, might each use a couple of specialized tools: one to monitor network performance, another to track security events, yet another to monitor user behavior, and so on. On the other hand, having a dedicated operations center (aka NOC) for each product/service team is inefficient, considering the relatively small size of development teams (remember the two-pizza team rule) and the overhead of establishing and running a NOC. So, somehow, you should find a balance between the variety of monitoring tools your teams use and the efficiency of your NOC.

That challenge created market demand for a proper solution, and companies such as PagerDuty (f. 2009), OpsGenie (f. 2012), VictorOps (f. 2012) and others responded to it with a new class of monitoring applications: incident response, or incident management, platforms. They were not monitoring tools per se but rather aggregators that consume information from different monitoring solutions and streamline it into a single incident management process.
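To illustrate the aggregator idea, here is a hedged sketch that forwards an alert from an arbitrary upstream monitoring tool to PagerDuty through its public Events API v2. The routing key, alert fields and severity value are assumptions made for the example:

```python
# Sketch: normalizing an alert from an arbitrary monitoring tool into
# a PagerDuty incident event via the public Events API v2.
# The routing key and alert fields are hypothetical.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR-INTEGRATION-ROUTING-KEY"  # issued per service in PagerDuty

def forward_alert(summary: str, source: str, severity: str = "critical") -> str:
    """Wrap a tool-specific alert into the common event format."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",  # or "acknowledge" / "resolve"
        "payload": {
            "summary": summary,    # human-readable description
            "source": source,      # which system raised the alert
            "severity": severity,  # critical / error / warning / info
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]  # handle for a later resolve

# Any tool that can fire a webhook or run a script can feed the same pipeline:
# forward_alert("Checkout latency above SLO", source="grafana-alerting")
```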

Hint: DevOps doesn’t replace ITIL, and lots of ITIL practices are necessary for a DevOps organization.

Using these aggregators, you can still employ different monitoring tools and simultaneously have top-down visibility into incidents across all the services and applications you run. The same applies to applications such as Grafana, Kibana and other data visualization solutions: they are useless on their own but powerful at displaying data from the various monitoring tools they integrate.
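As a small taste of that integration layer, the sketch below registers a Prometheus server as a Grafana datasource through Grafana’s HTTP API. The Grafana address, token and names are hypothetical placeholders:

```python
# Sketch: wiring a team's monitoring backend into a shared Grafana
# via Grafana's HTTP API. Host names and the token are placeholders.
import requests

GRAFANA_URL = "http://grafana.internal:3000"
API_TOKEN = "YOUR-GRAFANA-SERVICE-ACCOUNT-TOKEN"

def add_prometheus_datasource(name: str, prometheus_url: str) -> None:
    """Register a Prometheus server as a Grafana datasource."""
    response = requests.post(
        f"{GRAFANA_URL}/api/datasources",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={
            "name": name,
            "type": "prometheus",
            "url": prometheus_url,
            "access": "proxy",  # Grafana proxies queries to the backend
        },
        timeout=10,
    )
    response.raise_for_status()

# Each team can plug its own backend into the shared dashboards:
# add_prometheus_datasource("team-checkout", "http://prom-checkout:9090")
```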

Even Microsoft eventually gave up on having a single operational team that monitors all company services. Instead, they created a monitoring toolkit enabling other development groups to assemble custom monitoring solutions that suit their needs.

Final thoughts

Now, you might be thinking about how complicated all that stuff has become. If you followed the story and put all the pieces of the puzzle together, you can see that monitoring in a DevOps organization is not only about selecting the fanciest tools. It has a lot to do with your organization as well: what your monitoring processes and routines should look like, who will be using the system, what data is essential for your application, which metrics your team members want to track, how you are going to inform your internal and external customers about outages and performance degradations in your services, and many other essential aspects.

How you design your monitoring solution to satisfy the business objectives of your product or service is also crucial for the future success of your enterprise. When you can track how many new users signed up for your service or how much money a new feature earned you over the last period, you can stop guessing and start building your product strategy on real customer feedback.
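For instance, once sign-ups are instrumented as a counter (as in the earlier sketch), answering “how many new users did we get last week?” becomes a single query. This sketch uses Prometheus’s standard HTTP query API; the server address and metric name are illustrative assumptions:

```python
# Sketch: pulling a business KPI out of Prometheus via its HTTP API.
# The server address and metric name are hypothetical.
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"

def signups_last_week() -> float:
    """Total increase of the sign-up counter over the last 7 days."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "sum(increase(signups_total[7d]))"},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# print(f"New sign-ups this week: {signups_last_week():.0f}")
```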

As Adam Savage argues in his book “Every Tool’s a Hammer,” it is a person’s hands and craftsmanship that turn a tool into a masterpiece. So, don’t rely blindly on yet another cool new monitoring application, and always apply your most valuable asset: logical thinking.