Balancing System Availability and Cost: A Guide to Designing Highly Available Systems
Table of contents
- Understanding System Availability
- How to Calculate System Availability
- Designing a System for High Availability
- What is a Service Level Agreement (SLA)
- Calculating System Availability
- How to Increase System Availability
- How to Make AWS RDS Highly Available
- How to Increase System Availability Further
- How Much Does System Availability Cost?
- How Much System Availability is Enough?
- Tips for Highly Available Systems in AWS
Software systems either work or they don't. When they work, users are satisfied and businesses make money. When they don't, revenue is lost and users are unhappy. A highly available system is essential to keep users happy and businesses profitable. In this article, we will explore what system availability means, how to measure it, and how to achieve high system availability.
Understanding System Availability
System availability refers to the percentage of time that a system is functional and serving user requests. The uptime of a system is measured in nines, where 99% uptime equals two nines (2 9s) and means the system is not working 1% of the time, 99.9% equals three nines, and so on. When a system is made of several components, its availability depends on the availability of each of those components.
There are more states than working and not working. For example, the system could be presenting old data when it can't fetch the latest, or it could be working for only some users, or with increased latency. These are not necessarily bad things. They're actually strategies to deal with parts of a system being down. However, to keep this discussion simple, we're just going to consider the two states of up (everything is working) and down (nothing is working).
For reference, this is how much downtime you get for different percentages of availability:
- 95% availability = 5% downtime = 1 hour 12 minutes of downtime per day.
- 99% availability = 1% downtime = about 14 minutes of downtime per day or 3.65 days per year.
- 99.5% availability = 0.5% downtime = about 3.6 hours of downtime per month or 1.8 days per year.
- 99.9% availability = 0.1% downtime = about 8 hours 46 minutes of downtime per year.
- 99.99% availability = 0.01% downtime = about 53 minutes of downtime per year.
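These figures are simple to reproduce. Here's a minimal Python sketch, assuming a 365-day year; the function name `expected_downtime` is just an illustrative choice, not part of any library.

```python
from datetime import timedelta

DAY = timedelta(days=1)
YEAR = timedelta(days=365)  # assumption: non-leap year


def expected_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the expected downtime within `period` for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


# Print the downtime per day and per year for each availability level.
for pct in (95, 99, 99.5, 99.9, 99.99):
    print(f"{pct}%: {expected_downtime(pct, DAY)} per day, "
          f"{expected_downtime(pct, YEAR)} per year")
```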
How to Calculate System Availability
System availability can be calculated in two ways:
- If the system is already implemented, we can measure it with the formula system availability = 1 - % of failed requests.
- If we're designing a high availability system, we need to combine the availability of every component. For a system that needs all of its components to work, that's the probability of all of them working at the same time. The next sections walk through an example.
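Before moving on to the design example, here's the measurement formula as a minimal sketch. The request counts are made-up numbers for illustration; in practice they'd come from your own metrics or access logs.

```python
def measured_availability(total_requests: int, failed_requests: int) -> float:
    """Availability as 1 minus the fraction of failed requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return 1 - failed_requests / total_requests


# Hypothetical month: 1,000,000 requests, 600 of them failed.
print(f"{measured_availability(1_000_000, 600):.4%}")  # prints 99.9400%
```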
Designing a System for High Availability
To design a highly available system, it is essential to understand the components of the system, their availability, and their impact on the overall system. Let's start designing a basic system and calculate its availability. Here's what our system looks like:
- We'll make it serverless, because it makes the math easier.
- We're going to use AWS.
- We'll leave the webpage out of this example, and focus on the backend.
- There are GET, POST and PUT requests that will always be handled by AWS API Gateway -> AWS Lambda -> AWS RDS with a single instance.
- If any part of the system fails, it will return an error response. There are no caches and no automatic retries (these are areas where we can improve system availability!).
I'm simplifying things a bit so you can understand how to calculate system availability. In the real world, there are a lot more moving parts, such as a VPC for RDS and Secrets Manager, and a lot of What Ifs that can happen like a whole AWS region failing. But let's enjoy a moment of blissful ignorance, at least for this article.
If our code works all the time, we could assume our system works all the time, right? We'd be wrong, though. Infrastructure can fail as well! And AWS even tells us how often it fails, through its Service Level Agreements (SLAs).
What is a Service Level Agreement (SLA)
SLAs, or Service Level Agreements, are contracts that outline the level of service that a provider is expected to deliver to a customer. SLAs often specify the availability, reliability, and performance of a service, as well as the terms and conditions for service delivery and any penalties or credits that may be applied if the service does not meet the agreed-upon level of quality.
We can check the SLAs that AWS publishes for each service. There, we can see that both API Gateway and Lambda have 99.95% availability, and an RDS single instance has 99.5% availability.
We're not guaranteed that services will have the availability stated in the SLA. We're only guaranteed that if they don't, AWS will give us some money back (which will probably not cover our own losses if it fails). SLAs are actually a contract that says "If the service works for less than X% of the time, we'll pay you $Y." No guarantees that it will work, just a promise to pay you if it doesn't.
I'm trying to keep things simple so we're just going to take those numbers at face value. This is enough for most systems, but a handful of industries and use cases require a much deeper analysis.
Calculating System Availability
First, we're going to understand our premises:
- Our system's components are AWS API Gateway, AWS Lambda and AWS RDS.
  - If we went into more detail we should consider at least VPC and Secrets Manager, but let's ignore them for the sake of simplicity.
- The availability of each service is:
  - API Gateway: 99.95% availability
  - Lambda: 99.95% availability
  - RDS single instance: 99.5% availability
- We can say that our system is available if and only if every component is working.
- Failure events are statistically independent (i.e. they don't depend on other components failing).
The combined probability of every component working is the product of the probability of every single component working. So, the availability of our system is 0.9995 * 0.9995 * 0.995 = 0.994 = 99.4%.
That's a little over 2 days of downtime a year (about 52.5 hours). Most systems can live with that level of availability. But for the sake of learning, let's explore how to increase it.
How to Increase System Availability
If we take a look at the numbers, we'll notice our system's availability is dominated by the component with the lowest availability: AWS RDS. Taking a page out of the Theory of Constraints, we can determine that the only meaningful increases in system availability can come from improving the constraint.
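One way to see this is a what-if analysis: pretend each component, one at a time, never fails, and check how much the whole system improves. This is only a sketch under the same independence assumption, and a perfect component is of course hypothetical.

```python
from math import prod

# SLA figures quoted earlier in the article.
components = {
    "API Gateway": 0.9995,
    "Lambda": 0.9995,
    "RDS single instance": 0.995,
}


def system_availability(parts: dict[str, float]) -> float:
    """All components are required, so the availabilities multiply."""
    return prod(parts.values())


baseline = system_availability(components)
constraint = min(components, key=components.get)  # lowest availability

gains = {}
for name in components:
    what_if = {**components, name: 1.0}  # hypothetical: this component never fails
    gains[name] = system_availability(what_if) - baseline
    print(f"perfect {name}: +{gains[name]:.4%}")

print(f"constraint: {constraint}")
```

Making RDS perfect helps roughly ten times more than making API Gateway or Lambda perfect, which is why the database is the place to start.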
How to Make AWS RDS Highly Available
If you opened the RDS SLA page earlier, you might have noticed that Multi-AZ DB Instances have an SLA of 99.95%. If we use a Multi-AZ RDS Cluster instead of a single instance, then our system's availability goes up significantly: 0.9995 * 0.9995 * 0.9995 = 0.9985 = 99.85%.
That's approximately 13 hours of downtime per year. Much better than our previous number. Good job! Can we go further up? Yes, we can!
How to Increase System Availability Further
We can use AWS ElastiCache as a cache for reads. That way, read operations can be served from the cache even if our database fails. Redis multi-AZ has 99.9% service availability (from the ElastiCache SLA).
For write operations, we can detect whether the database is failing and push writes to an SQS queue if that happens, so they can be completed at a later time when the database is back up. SQS queues have 99.9% service availability (from the SQS SLA).
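Both fallbacks follow the same pattern: try the database first, and switch to the backup path only if it fails. Here's a minimal sketch of that pattern. Everything in it is a stand-in: `DatabaseUnavailable`, `read_db`, `write_db` and `enqueue` are hypothetical names, a plain dict plays the role of the cache, and in a real system they'd wrap your database driver, a Redis client and an SQS client.

```python
from typing import Callable, Optional


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the data layer when the database is down."""


def read_with_cache_fallback(
    key: str, read_db: Callable[[str], str], cache: dict
) -> Optional[str]:
    """Read from the database; if it's down, serve the cached value instead."""
    try:
        value = read_db(key)
    except DatabaseUnavailable:
        return cache.get(key)  # possibly stale, or None on a cache miss
    cache[key] = value  # keep the cache warm for future outages
    return value


def write_with_queue_fallback(
    record: dict, write_db: Callable[[dict], None], enqueue: Callable[[dict], None]
) -> str:
    """Write to the database; if it's down, queue the write for later."""
    try:
        write_db(record)
        return "written"
    except DatabaseUnavailable:
        enqueue(record)
        return "queued"
```

Note that a cached read can only succeed for keys that were read before the outage, so the cache has to be populated during normal operation.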
The probability of our Multi-AZ RDS Cluster failing is 1 - 0.9995. The probability of our Redis cluster failing is 1 - 0.999. If we combine both, we get the probability of them failing at the same time: (1 - 0.9995) * (1 - 0.999) = 0.0000005. And then we can calculate the probability of either of them working (in which case, our read operations are working): 1 - ((1 - 0.9995) * (1 - 0.999)) = 0.9999995.
The same numbers can be used for our write operations, since the service availability of an SQS queue is also 99.9%.
Finally, our system availability is now 0.9995 * 0.9995 * (1 - ((1 - 0.9995) * (1 - 0.999))) = 0.9989998, or roughly 99.9%. That's about 8 hours 46 minutes of downtime per year. It's the same number for reads and writes even though we're using different services (ElastiCache for reads and SQS for writes), because those two services have the same service availability. If they had different values, our system would have a different availability for reads than for writes, which is not necessarily a bad thing.
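Here's the same calculation as a short sketch, again assuming independent failures and SLA numbers taken at face value. The helper name `parallel_availability` is illustrative.

```python
def parallel_availability(*components: float) -> float:
    """Availability of redundant components where at least one must be working."""
    all_fail = 1.0
    for availability in components:
        all_fail *= 1 - availability  # probability that this one is down too
    return 1 - all_fail


# Multi-AZ RDS backed by ElastiCache (reads) or SQS (writes)
data_layer = parallel_availability(0.9995, 0.999)

# API Gateway and Lambda are still in series with the data layer.
system = 0.9995 * 0.9995 * data_layer

print(f"data layer: {data_layer:.7f}")
print(f"system: {system:.4%}")
```

Notice that the data layer is now so reliable that API Gateway and Lambda have become the constraint.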
Can we go even further up? Maybe... But at what cost?
How Much Does System Availability Cost?
The cost of system availability depends on the services used and the design complexity. The cost of these modifications and their maintenance can be high.
For the first improvement that we made, the cost calculation is pretty simple: we need to pay twice as much per month for RDS. A Multi-AZ cluster with two instances costs twice as much as a single instance (which makes sense, because you get two instances). The change is also pretty simple to make while we're designing the system, thanks to RDS being a managed service.
Adding ElastiCache and SQS is not that simple, and it's not cheap. Both ElastiCache and SQS have their own monthly costs (not too high), but you'll also need to make a few changes to your app:
- You'll have to modify it to write to ElastiCache (which increases Lambda execution time and response time) and read from ElastiCache when the DB is not working.
- You'll need to modify it to write to SQS when the database is not working.
- You'll need an app to consume from the SQS queue and write to the DB when it's back up.
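To give an idea of what that last piece involves, here's a minimal sketch of a consumer that replays queued writes once the database is back. An in-memory `deque` stands in for the SQS queue, and `DatabaseUnavailable` and `write_db` are hypothetical names, so this shows the logic rather than the AWS integration.

```python
from collections import deque
from typing import Callable


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the data layer when the database is down."""


def drain_queue(queue: deque, write_db: Callable[[dict], None]) -> int:
    """Replay queued writes in order and return how many were written.

    Stops at the first failure, leaving the remaining records queued
    so they can be retried on the next run.
    """
    replayed = 0
    while queue:
        record = queue[0]  # peek, so the record isn't lost if the write fails
        try:
            write_db(record)
        except DatabaseUnavailable:
            break
        queue.popleft()  # remove only after a successful write
        replayed += 1
    return replayed
```

With a real SQS standard queue, messages can be delivered more than once, so the writes would also need to be idempotent.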
These changes are not unfeasible, but they're significantly harder and more expensive than what we had to do to use multi-AZ RDS, and the increase in availability is smaller. The reason behind that is that the more highly available your system is, the harder and more expensive it is to increase system availability.
How Much System Availability is Enough?
As you asymptotically approach 100% system availability, you need to know when to stop. More is not always better: at some point your users won't even notice the change. But I assure you they will notice the price increase that you'll need to apply to stay in business because you went overboard and quadrupled your operational costs.
Tips for Highly Available Systems in AWS
Some tips for achieving a highly available system:
- Avoid using a single instance; use multiple instances across multiple availability zones.
- Avoid multi-cloud or multi-region unless necessary. Avoiding cloud vendor lock-in is not a good enough reason.
- Implement retries and/or degraded responses.
- Don't overpay or overengineer.
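As an illustration of the retries tip, here's a minimal sketch of retrying with exponential backoff and jitter. The function name, the defaults and the choice to retry on any exception are assumptions for the example; real code should retry only on errors that are known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait a random time up to base_delay * 2^(attempt - 1) seconds.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

The jitter spreads retries out over time, so many clients retrying at once don't hit a recovering service in lockstep.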
Achieving high system availability requires understanding the components of the system, their availability, and their impact on the overall system. By identifying bottlenecks and using appropriate services, the availability can be increased. It is important to remember that high system availability comes at a cost, and the level of availability should be tailored to the users' needs without overpaying or overengineering the system.
Thanks for reading!
If you'd like to know more about me, you can find me at www.guilleojeda.com