I work a lot with clients who need their computer systems up and running. That is hardly unusual: today it is hard to imagine a business that does not rely on computers in its operation. Accounting, sales, production and other departments all use computer systems to perform their core functions.

When I discuss the terms of a service agreement with my clients, I often hear phrases like “We need a four (or five) nines SLA…” or, better yet, “We need maximum uptime…”. However, they can rarely explain their reasons until we have discussed the following questions:

  • How do you understand uptime / SLA?
  • Do you know what it means for your system?
  • How much are you willing to pay?
  • How much money will you lose each hour your system is down?
  • What is a reasonable uptime?

In this post, I am going to walk through these questions and show you why a higher SLA is not always better.

What is uptime?

In the context of computer systems, the terms “uptime,” “SLA” and “availability” are tightly connected with the concept of high availability. Often they are used synonymously and this can cause some misunderstanding.

To speak the same language, let’s start with some definitions. Uptime is a metric of computer system performance that is usually measured in days, hours, minutes, etc., and represents the time when a computer system is operational. The opposite metric is downtime, which is also measured in units of time and represents the time when the system is down (non-operational). Availability is closely related to uptime, and the two terms are often used interchangeably, but availability is expressed as a percentage: the ratio of the system’s operational time to the total time the system should be functional. All these metrics are usually measured over some period: a year, month, week or day.

An SLA, or service level agreement, is a document, usually a legal one, that specifies aspects of the service provided, in particular the target value for the system’s availability. Because availability is the most commonly used key metric in service level agreements, it has become a synonym for SLA, which is not always correct.

What does it really mean in terms of time?

As we have already figured out, uptime and availability are time-based metrics. You can use a calculator or a conversion cheat sheet to get the figures for specific target levels. For example, 99.99% uptime means that a system should be available 99.99% of its designated operational time. The acceptable downtime to reach this goal is 52 minutes per year or 4 minutes per month, assuming that we are talking about a system that is required to operate continuously all year round. If we consider 99.9% uptime, the acceptable downtime will be 8h 46m per year and 43m per month respectively, a target that is ten times less demanding.
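
If you prefer to compute these figures rather than look them up, the conversion can be sketched in a few lines of Python. This is only a minimal illustration: the function name `allowed_downtime` and the constant `HOURS_PER_YEAR` are my own, and the sketch assumes a non-leap year of continuous operation with a month taken as one twelfth of a year.

```python
from datetime import timedelta

# Assumption: the system must run continuously through a non-leap year.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime(availability_percent: float, period_hours: float) -> timedelta:
    """Return the downtime budget for a target availability over a period."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(hours=period_hours * unavailable_fraction)


for target in (99.9, 99.99):
    yearly = allowed_downtime(target, HOURS_PER_YEAR)
    monthly = allowed_downtime(target, HOURS_PER_YEAR / 12)
    print(f"{target}% -> {yearly} per year, {monthly} per month")
```

For 99.99% this gives a yearly budget of a little under 53 minutes, in line with the figures quoted above.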

Why do I talk about all these hours and minutes? Imagine that you have an e-commerce business that relies heavily on its website to sell products to your customers and, ultimately, to make you money. When your site is running, customers can access it and purchase products. When the site is down, you are losing money on the orders that could have been processed during this time. In real-world scenarios, a business might depend on dozens of interconnected computer systems with different availability levels, but I use this example for the sake of simplicity.

It is highly unlikely that you as a business owner or a CEO want to lose money. Therefore, you need your website running ideally 100% of the time or somewhat close to that, which could be 99.99% as in our example. That literally means that you accept downtime for your business of no more than 52 minutes per year or 4 minutes per month.

For now, it is all we need to know about time, so let’s switch to the money side.

How much does it cost to have 99.99% uptime?

First of all, when you design a new computer system or rebuild an existing one, you should always consider its total cost of ownership, or TCO. What does it mean? To put it simply, TCO is the money you have to spend on implementing or changing the system plus the cost of supporting that system over time, usually from 3 to 5 years. As for the first part of the TCO equation, implementing a better SLA or higher uptime will cost you more money. Besides, many clients who are inexperienced in IT forget or omit the second part, which nowadays may be a much more substantial sum due to the subscription nature of many infrastructure, platform and software services required to run your system.

Now, let’s go back to our example with the e-commerce website. Assume that the site’s infrastructure consists of a web app powering the site, a middleware application server to process orders and a backend database to store all the information.

If you are using Microsoft Azure to host all these components, each tier will have the following SLA (all figures are current as of the time of writing):

  • Azure Web App – 99.95%;
  • Virtual Machine as an application server (with premium disks) – 99.9%;
  • Azure SQL Database – 99.99%.

Now the question is “What is the resulting SLA for this service as a whole?” In our example, if one of the tiers fails, the whole service becomes non-operational for its end users. Therefore, we will calculate the overall SLA as system availability in series:

Overall SLA = Web Tier SLA * Middleware Tier SLA * Backend Tier SLA = 99.95% * 99.9% * 99.99% = 99.84% or 99.8% roughly

As you can see, the resulting SLA is lower than the lowest SLA in the series. A 99.8% SLA means that your acceptable downtime will be 17 and a half hours per year or almost an hour and a half per month. In real life, these figures will be even worse because I don’t consider the probabilities of human, process and software errors and stick simply to infrastructure availability.
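
The series calculation is easy to reproduce for any number of tiers. The sketch below uses a helper name of my own, `series_availability`, and assumes that the tiers fail independently and that the service is up only when every tier is up.

```python
from math import prod


def series_availability(*tiers_percent: float) -> float:
    """Availability (in percent) of tiers in series: all must be up at once."""
    return prod(p / 100 for p in tiers_percent) * 100


# SLA figures for the web, middleware and backend tiers from the example.
tiers = (99.95, 99.9, 99.99)
overall = series_availability(*tiers)

print(f"Overall availability: {overall:.2f}%")  # prints "Overall availability: 99.84%"
print(f"Lower than the weakest tier: {overall < min(tiers)}")
```

Adding another tier to the chain can only lower the result, which is why long dependency chains erode availability so quickly.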

To calculate the cost of the described infrastructure, which will provide you with a 99.8% SLA, you can use the Azure Pricing Calculator. For our example, assume it is $2,000 per month.

What can be done to increase the overall availability level of our sample infrastructure to the targeted 99.99%? We can either look for options to increase the availability level of each tier or find ways to make the whole infrastructure more resilient. In our case, to reach the target availability level you can duplicate each tier or, in effect, replicate the entire infrastructure.

By implementing this architecture, you can reach a service availability of 99.999%, which should satisfy your initial goal of having a service with a 99.99% SLA.
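
Where does the 99.999% come from? With two copies of the infrastructure, the service is down only when both copies are down at the same time. A rough sketch of that arithmetic follows. It assumes that the two copies fail independently and that failover between them is instantaneous and flawless, which is optimistic in practice. The function name `parallel_availability` is my own.

```python
def parallel_availability(replica_percent: float, replicas: int = 2) -> float:
    """Availability (in percent) when any one of several independent replicas is enough."""
    # The service fails only if every replica is down simultaneously.
    all_down_probability = (1 - replica_percent / 100) ** replicas
    return (1 - all_down_probability) * 100


single_stack = 99.8  # overall availability of one copy of the infrastructure
duplicated = parallel_availability(single_stack, replicas=2)

print(f"Duplicated infrastructure: {duplicated:.4f}%")
```

The result is slightly above 99.999%, so the duplicated design clears the 99.99% target under these assumptions.
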
“Wait! But this means that my monthly infrastructure cost will be twice as high! I will have to pay 4 grand instead of only 2!”, you might say. Well, not exactly. You will actually have to pay even more, because you will need load balancing, traffic management and other failover mechanisms to make this infrastructure more robust and autonomous. So, your yearly expenditure on infrastructure can quickly grow from $24,000 to $60,000 or $72,000. To make it worse, the cost of supporting such a complex infrastructure will also increase because you will need more skilled professionals to operate it.

In practice, to add one more “nine” to your current SLA, you will have to pay from 3 to 5 times more money than you spend on your current availability level.

What is the cost of downtime?

Now that we have figured out the cost of achieving “four nines” in our example, it is time to return to the downtime figures I mentioned earlier. To simplify the comparison, I have put all the figures in the following table:

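| SLA    | Acceptable yearly downtime | Yearly infrastructure cost | Decrease in downtime | Increase in cost |
|--------|----------------------------|----------------------------|----------------------|------------------|
| 99.8%  | 17h 31m                    | $24,000                    | -                    | -                |
| 99.99% | 0h 52m                     | $60,000                    | 16h 39m              | $36,000          |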

I intentionally used time and money values in this table to demonstrate the idea that in our case time is indeed money. To gain an additional 16 hours and 39 minutes of operational time, you will have to pay $36,000 each year. That is money you will actually be giving up.

To be precise, you are losing money in both cases: with 99.8% uptime you lose it by not processing orders, and with 99.99% uptime you lose it by buying additional time to handle them. The question now is in which case you lose more.

Do you really need it?

At this point, it is high time to ask yourself: “How much money does my system make each hour on average, and how much will it earn me in an additional 16 hours and 39 minutes?” If this sum is less than $36,000, you are just throwing money away. Yes, you heard that correctly. You are pouring money down the drain. It is not economically feasible to increase the uptime of your system or service unless you are sure that it will bring you more money than you spend on such an increase. Moreover, in some cases it might be more practical, from a financial point of view, to lower your uptime and cut your operational costs.
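
To turn this question into a number, you can compute the hourly revenue at which the extra spending breaks even. The sketch below uses the figures from the table. The function name is my own, and it assumes that lost orders are the only cost of downtime and that customers do not simply come back later to buy.

```python
def break_even_hourly_revenue(extra_yearly_cost: float, downtime_avoided_hours: float) -> float:
    """Hourly revenue at which the extra yearly cost equals the revenue saved."""
    return extra_yearly_cost / downtime_avoided_hours


extra_cost = 60_000 - 24_000   # yearly cost increase from the table, in dollars
hours_saved = 16 + 39 / 60     # 16h 39m of downtime avoided per year

threshold = break_even_hourly_revenue(extra_cost, hours_saved)
print(f"Break-even revenue: ${threshold:,.0f} per hour")
```

With these inputs the threshold is a little over $2,000 per hour. If your site earns less than that on average, the upgrade to “four nines” costs more than the downtime it prevents.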

I hope that, after following me through this discussion, you can see that targeting a higher SLA without sound reasoning behind the decision is not as good an idea as it might seem.

Have you ever tried to do such math and find the optimal SLA for your systems? Share your uptime figures in the comments!