Software systems can either work or not work. When they work users are happy, we make money and everything is great. When they're not working, though, we miss a chance to make money and users get angry (rightfully so).
The system works = Good.
The system doesn't work = Bad.
So, we just make the system work all the time and we make more money, right? Well, yeah, possibly, kind of, but you should know how expensive that is.
We're going to talk about making a system work more often and fail less often, how to achieve that, and how much it can cost (spoiler: the cost isn't linear, it becomes A LOT).
In more technical terms, we're going to talk about what system availability and system downtime are, how to measure the availability of a service, what high availability systems look like, how to architect a system for higher availability, and what high availability costs.
What is System Availability?
Let's go with a few technical terms first. We call System Availability the percentage of time that a system is working. By the way, working means it can serve user requests. We measure System Availability in nines (9s), where two nines (2 9s) means the system is working 99% of the time. If the system is working 99% of the time, it's not working 1% of the time.
There are more states than working and not working. For example, the system could be presenting old data when it can't fetch the latest, or it could be working for only some users, or with increased latency. These are not necessarily bad things, they're actually strategies to deal with parts of a system being down. However, to keep this discussion simple, we're just going to consider the two states of up (everything is working) and down (nothing is working).
This feels like a good spot to warn you that this discussion requires a bit of math. For now we'll go with some numbers so you get a feeling of how much actual downtime we're talking about, but we'll be multiplying some probabilities later.
95% availability = 5% downtime = 1 hour 12 minutes of downtime per day.
99% availability = 1% downtime = 14.4 minutes of downtime per day, or 3.65 days per year.
99.5% availability = 0.5% downtime = 3.65 hours of downtime per month, or 1.8 days per year.
99.9% availability = 0.1% downtime = 8 hours 45 minutes of downtime per year.
99.99% availability = 0.01% downtime = 52 minutes of downtime per year.
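To get a feel for these conversions, here's a minimal sketch (the helper name is my own, not from any library) that turns an availability percentage into allowed downtime:

```python
# Sketch: convert an availability percentage into allowed downtime.
def downtime(availability_pct: float) -> dict:
    """Allowed downtime for a given availability percentage."""
    down_fraction = 1 - availability_pct / 100
    return {
        "per_day_minutes": round(down_fraction * 24 * 60, 1),
        "per_year_hours": round(down_fraction * 365 * 24, 1),
    }

for pct in (95, 99, 99.5, 99.9, 99.99):
    print(pct, downtime(pct))
# 99% allows 14.4 minutes/day; 99.99% allows under an hour per year
```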
We measure availability in 9s and operate in probabilities, but I wanted you to understand how much downtime our values for availability really mean. Going from 95% to 99% availability doesn't seem too crazy, but can you imagine what you'd need to do to achieve 99.99% availability? Probably rearchitect the whole system and even the whole operations area of the company, at a minimum!
How to Calculate System Availability
Calculating system availability can be done in two ways:
If the system is already implemented, we can measure system availability with the formula 1 - % of failed requests = system availability.
If we're designing a high availability system, we need to combine the probabilities of each component working (or failing). Let's see an example.
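The first approach, measuring an already-implemented system, is just arithmetic over your request logs. A minimal sketch, with made-up numbers:

```python
def measured_availability(total_requests: int, failed_requests: int) -> float:
    """Measured availability = 1 - fraction of failed requests."""
    return 1 - failed_requests / total_requests

# e.g. 1,000,000 requests in the measurement period, 500 of them failed:
print(measured_availability(1_000_000, 500))  # 0.9995, i.e. 99.95%
```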
Designing a System for High Availability
Let's start with a basic system and see how to measure system availability. Here's what our system looks like:
We'll make it serverless, because it makes the math easier
We're going to use AWS
We'll leave the webpage out of this example, and just deal with the backend
There are GET, POST and PUT requests that will always be handled by AWS API Gateway -> AWS Lambda -> AWS RDS with a single instance.
If any part of the system fails, it will return an error response. There are no caches and no automatic retries (these are areas where we can improve system availability!).
I'm simplifying things a bit so you can understand how to calculate system availability. In the real world, there are a lot more moving parts, such as a VPC for RDS and Secrets Manager, and a lot of What Ifs that can happen like a whole AWS region failing. But let's enjoy a moment of blissful ignorance, at least for this article.
If our code works all the time, we could assume our system works all the time, right? Well, we'd be wrong. It turns out infrastructure can fail as well! And AWS even tells us how often it fails, through its Service Level Agreements (SLAs).
What is a Service Level Agreement (SLA)?
SLAs, or Service Level Agreements, are contracts that outline the level of service that a provider is expected to deliver to a customer. SLAs often specify the availability, reliability, and performance of a service, as well as the terms and conditions for service delivery and any penalties or credits that may be applied if the service does not meet the agreed-upon level of quality.
We can check AWS's SLAs in this link. There, we can see that both API Gateway and Lambda have 99.95% availability, and RDS single instance has 99.5% availability.
We're not actually guaranteed that services will have the availability stated in the SLA. We're only guaranteed that if they don't, AWS will give us some money back (which will probably not cover our own losses). An SLA is really a contract that says "If the service works less than X% of the time, we'll pay you $Y." No guarantee that it will work, just a promise to pay you if it doesn't.
I'm trying to keep things simple so we're just going to take those numbers at face value. This is actually enough for most systems, but a handful of industries and use cases require a much deeper analysis.
Calculating System Availability
First, we're going to define the rules:
Our system's components are AWS API Gateway, AWS Lambda and AWS RDS.
If we went into more detail we should consider at least VPC and Secrets Manager, but let's ignore them for the sake of simplicity.
The availability of each service is:
API Gateway: 99.95% availability
Lambda: 99.95% availability
RDS single instance: 99.5% availability
We can say that our system is available if and only if every component is working
Failure events are statistically independent (i.e. they don't depend on other components failing)
The combined probability of every component working is the product of the probabilities of each component working. So, the availability of our system is 0.9995 × 0.9995 × 0.995 ≈ 0.994 = 99.4%.
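As a sketch, here's that serial calculation in code (function and key names are my own; the availabilities come from the SLAs above):

```python
from math import prod

def serial_availability(components: dict) -> float:
    """System is up only if every component is up (independent failures),
    so multiply the individual availabilities."""
    return prod(components.values())

system = {
    "api_gateway": 0.9995,
    "lambda": 0.9995,
    "rds_single_instance": 0.995,
}
print(round(serial_availability(system), 4))  # 0.994
```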
For the record, that's almost 2 days of downtime a year. Can we live with that? Honestly, most systems can live with that level of availability. So at this point, I want you to stop panicking. You're fine! But for the sake of learning, let's explore how to increase it.
How to Increase System Availability
If we take a look at the numbers, we'll notice our system's availability is dominated by the lowest service availability: AWS RDS. Taking a page out of the Theory of Constraints, the only meaningful improvements to system availability come from improving the constraint.
How to Make AWS RDS Highly Available
If you opened the RDS SLA page earlier, you might have noticed that Multi-AZ DB Instances have an SLA of 99.95%. If we use a Multi-AZ RDS Cluster instead of a single instance, then our system's availability goes up significantly: 0.9995 × 0.9995 × 0.9995 ≈ 0.9985 = 99.85%.
That's approximately 13 hours of downtime per year. Much better than our previous number. Good job!
Can we go further up? Yes, we can!
How to Increase System Availability Further
We can use AWS ElastiCache as a cache for reads. That way, read operations can be served from the cache even if our database fails. Redis multi-AZ has 99.9% service availability (from the ElastiCache SLA).
For write operations, we can detect whether the database is failing and push writes to an SQS queue if that happens, so they can be completed at a later time when the database is back up. SQS queues have 99.9% service availability (from the SQS SLA).
The probability of our Multi-AZ RDS Cluster failing is 1 - 0.9995. The probability of our Redis cluster failing is 1 - 0.999. If we combine both, we get the probability of them failing at the same time: (1 - 0.9995) × (1 - 0.999) = 0.0000005. And then we can calculate the probability of either of them working (in which case, our read operations are working): 1 - ((1 - 0.9995) × (1 - 0.999)) = 0.9999995.
The same numbers can be used for our write operations, since the service availability of an SQS queue is also 99.9%.
Finally, our system availability is now 0.9995 × 0.9995 × (1 - ((1 - 0.9995) × (1 - 0.999))) ≈ 0.9990 = 99.90%. It's the same number for reads and writes even though we're using different services (ElastiCache for reads and SQS for writes), because those two services have the same service availability. If they had different values, our system would have a different availability for reads than for writes. Which is not necessarily a bad thing.
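Putting the redundancy math together, a sketch of the read-path calculation (helper names are hypothetical, availabilities from the SLAs):

```python
from math import prod

def parallel_availability(paths: list) -> float:
    """Up if at least one redundant path is up: 1 - P(all paths down)."""
    return 1 - prod(1 - a for a in paths)

# RDS Multi-AZ (0.9995) backed by ElastiCache (0.999) for reads:
db_or_cache = parallel_availability([0.9995, 0.999])  # 0.9999995

# Full read path: API Gateway -> Lambda -> (RDS or cache), in series:
reads = 0.9995 * 0.9995 * db_or_cache
print(round(reads, 4))  # ~0.999
```

Swapping SQS's 99.9% in for ElastiCache gives the identical number for the write path, which is why reads and writes end up with the same availability here.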
Can we go even further up? Maybe... But at what cost?
How much does system availability cost?
For the first improvement we made, the cost calculation is pretty simple: we pay twice as much per month for RDS. A Multi-AZ cluster with two instances costs twice as much as a single instance (which makes sense, because you get two instances). The change is also pretty simple to make while designing the system, thanks to RDS being a managed service.
Adding ElastiCache and SQS is not that simple, and it's definitely not cheap. Both ElastiCache and SQS have their own monthly costs (not too high), but you'll also need to make a few changes to your app:
You'll have to modify it to write to ElastiCache (which increases Lambda execution time and response time) and read from ElastiCache when the DB is not working.
You'll need to modify it to write to SQS when the database is not working.
You'll need an app to consume from the SQS queue and write to the DB when it's back up.
It's not something completely unfeasible. But it's significantly harder and more expensive than what we had to do to use multi-AZ RDS. And the increase in availability is smaller. That's because the more highly available your system is, the harder and more expensive it is to increase system availability.
How much system availability is enough?
As you asymptotically approach 100% system availability, you need to know when to stop. More is not always better: at some point your users won't even notice the change. But I assure you they will notice the price increase that you'll need to apply to stay in business because you went overboard and quadrupled your operational costs.
Tips for Highly Available Systems in AWS
Avoid using a single instance (e.g. one EC2 instance or one RDS instance). Going multiple instances is usually cheap. And do so in multiple AZs when you can (which is 99% of the time).
Don't go multi-cloud unless you have a good reason to do so. It's going to cost you a lot more. Avoiding cloud vendor lock-in is not a good enough reason.
Don't even go multi-region unless you really have to. If you must do it, use different disaster recovery strategies for different parts of your service. Use pilot light for most of them if you can.
Don't overengineer for an idealized number. Netflix is down pretty often and we're still using it. If you're not sure whether you're fine or you need higher system availability, high chances are you're perfectly fine.
Reality is not black and white, with HTTP 200 = user happy and HTTP 500 = user sad. Implement retries and/or degraded responses, and don't lose focus on other quality aspects like response latency. More often than not, using a cache is a good idea.
Users only get really mad when they lose money. The rest of the time they rarely even care. Protect their money and their ability to make money.
System availability is the percentage of time your system is working and making money.
You can calculate system availability by multiplying the service availability of every component in the request path (and combining failure probabilities for redundant components).
High system availability is better, but also more expensive.
Find a level of system availability that works for your users (look at the downtime, think of the impact), and aim for that.
Don't overpay, don't overengineer.
Implement mitigation strategies (retries, degraded responses).
Don't forget about other quality aspects like latency.
You should focus on what the users actually care about.
Ask them what they care about.
System works = Good.
System doesn't work = Bad.
Paying a bit for the system to work much more often = Good.
Paying too much for the system to work only slightly more often = Bad.
Don't overpay, don't overengineer.
Thanks for reading!
If you're interested in building on AWS, check out my newsletter: Simple AWS.
It's free, runs every Monday, and tackles one use case at a time, with all the best practices you need.
If you want to know more about me, visit my website, www.guilleojeda.com