Balancing High Availability and Cost in AWS Architectures

Software systems have two possible states: either they work or they don't. When they work, users are satisfied and businesses make money. When they don't, the result is lost revenue and unhappy users. Keeping users happy and the business profitable requires a highly available system. In this article, we will explore what system availability means, how to measure it, and how to achieve it.

Understanding System Availability

System availability is the percentage of time that a system is functional and serving user requests. It is usually expressed in nines: 99% uptime is two nines (2 9s) and means the system is down 1% of the time. The availability of the system as a whole depends on the availability of each of its components. For those diving into AWS, understanding its architecture becomes crucial. Understanding Architecture in AWS provides a deep dive into AWS architectural concepts.

In reality, there are more states than working and not working. For example, the system could be presenting old data when it can't fetch the latest, or it could be working for only some users, or with increased latency. These are not necessarily bad things. They're actually strategies to deal with parts of a system being down. However, to keep this discussion simple, we're just going to consider the two states of up (everything is working) and down (nothing is working).

For reference, this is how much downtime you get for different percentages of availability:

95% availability = 5% downtime = 1 hour 12 minutes of downtime per day.
99% availability = 1% downtime = 14 minutes of downtime per day or 3.65 days per year.
99.5% availability = 0.5% downtime = 3.6 hours of downtime per month or 1.8 days per year.
99.9% availability = 0.1% downtime = 8 hours 45 minutes of downtime per year.
99.99% availability = 0.01% downtime = 52 minutes of downtime per year.
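
The figures above come from multiplying the downtime fraction by the length of the period. Here is a minimal sketch in Python, assuming a 365-day year (the function name downtime is only for illustration):

```python
from datetime import timedelta

DAY = timedelta(days=1)
YEAR = timedelta(days=365)  # assumption: non-leap year


def downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Expected downtime over `period` for a given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * (1 - availability_pct / 100)


for pct in (95, 99, 99.5, 99.9, 99.99):
    print(f"{pct}% -> {downtime(pct, DAY)} per day, {downtime(pct, YEAR)} per year")
```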

How to Calculate System Availability

Calculating system availability can be done in two ways:

If the system is already implemented, we can measure its availability as 1 - (failed requests / total requests).
If we're designing a highly available system, we need to calculate the probability that all of its components are working at the same time, based on the availability of each component. The rest of this article works through an example.
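
Before moving on to the design case, the measurement formula for an existing system can be sketched like this. The request counts are made-up numbers, and in practice they would come from your own metrics or logs:

```python
def measured_availability(total_requests: int, failed_requests: int) -> float:
    """Availability as the fraction of requests that did not fail."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return 1 - failed_requests / total_requests


# Hypothetical month: 1,000,000 requests, 600 of them failed.
availability = measured_availability(1_000_000, 600)
print(f"{availability:.2%}")
```

With those hypothetical counts the system sits just under three and a half nines (99.94%).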


Designing For High Availability in AWS

To design a highly available system, it is essential to understand the components of the system, their availability, and their impact on the overall system. Let's start designing a basic system and calculate its availability. Here's what our system looks like:

We'll make it serverless, because it makes the math easier.
We're going to use AWS.
We'll leave the webpage out of this example, and focus on the backend.
There are GET, POST and PUT requests that will always be handled by AWS API Gateway -> AWS Lambda -> AWS RDS with a single instance.
If any part of the system fails, it will return an error response. There are no caches and no automatic retries (these are areas where we can improve system availability!).
I'm simplifying things a bit so you can understand how to calculate system availability. In the real world, there are a lot more moving parts, such as a VPC for RDS and Secrets Manager, and a lot of What Ifs that can happen like a whole AWS region failing. But let's enjoy a moment of blissful ignorance, at least for this article.

If our code works all the time, we could assume our system works all the time, right? We'd be wrong, though. Infrastructure can fail as well! And AWS even tells us how often it fails, through its Service Level Agreements (SLAs).

If you're curious about the specifics of AWS services and their SLAs, check out AWS's official documentation on Service Level Agreements.

What is a Service Level Agreement (SLA)

SLAs, or Service Level Agreements, are contracts that outline the level of service that a provider is expected to deliver to a customer. SLAs often specify the availability, reliability, and performance of a service, as well as the terms and conditions for service delivery and any penalties or credits that may be applied if the service does not meet the agreed-upon level of quality.

We can check AWS's SLAs in this link. There, we can see that both API Gateway and Lambda have 99.95% availability, and RDS single instance has 99.5% availability.

The SLA doesn't guarantee that a service will actually reach the stated availability. It only guarantees that if it doesn't, AWS will compensate us, typically with service credits (which will probably not cover our own losses). An SLA is really a contract that says "If the service is available less than X% of the time, we'll pay you $Y." No guarantees that it will work, just a promise to pay you if it doesn't.

I'm trying to keep things simple so we're just going to take those numbers at face value. This is enough for most systems, but a handful of industries and use cases require a much deeper analysis.

Calculating System Availability in AWS

First, we're going to understand our premises:

Our system's components are AWS API Gateway, AWS Lambda and AWS RDS.
If we went into more detail we should consider at least VPC and Secrets Manager, but let's ignore them for the sake of simplicity.
The availability of each service is:
- API Gateway: 99.95% availability
- Lambda: 99.95% availability
- RDS single instance: 99.5% availability
We can say that our system is available if and only if every component is working.
Failure events are statistically independent (i.e. they don't depend on other components failing).

The combined probability of every component working is the product of the probability of every single component working. So, the availability of our system is 0.9995 * 0.9995 * 0.995 ≈ 0.994 = 99.4%.
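
The same multiplication as a small Python sketch, using the SLA figures listed in the premises and assuming independent failures (series_availability is an illustrative name):

```python
import math


def series_availability(availabilities) -> float:
    """Availability of a chain where every component must work (independent failures)."""
    return math.prod(availabilities)


components = {
    "API Gateway": 0.9995,
    "Lambda": 0.9995,
    "RDS single instance": 0.995,
}

system = series_availability(components.values())
print(f"System availability: {system:.2%}")
print(f"Expected downtime: {(1 - system) * 365:.1f} days per year")
```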

That's a little over 2 days of downtime a year (about 52 hours). Most systems can live with that level of availability. But for the sake of learning, let's explore how to increase it.

How to Increase System Availability in AWS

If we take a look at the numbers, we'll notice our system's availability is dominated by the component with the lowest availability: AWS RDS. Taking a page out of Theory of Constraints, we can determine that the only meaningful gains in system availability can come from improving the constraint.
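
One way to see the constraint in numbers is to ask what the system's availability would be if a single component never failed while everything else stayed the same. This is a rough sketch under the same independence assumption, with illustrative function names:

```python
import math

components = {
    "API Gateway": 0.9995,
    "Lambda": 0.9995,
    "RDS single instance": 0.995,
}


def system_availability(avail: dict) -> float:
    """Every component must work, so availabilities multiply."""
    return math.prod(avail.values())


def if_perfect(avail: dict, name: str) -> float:
    """System availability if `name` never failed and nothing else changed."""
    return system_availability({**avail, name: 1.0})


constraint = min(components, key=components.get)
print(f"Constraint: {constraint}")
for name in components:
    print(f"Perfect {name}: {if_perfect(components, name):.4%}")
```

Even a perfect API Gateway or Lambda would leave the system around 99.45%, while a perfect database would lift it to about 99.9%.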

How to Make AWS RDS Highly Available

If you opened the RDS SLA page earlier, you might have noticed that Multi-AZ DB Instances have an SLA of 99.95%. If we use a Multi-AZ RDS Cluster instead of a single instance, then our system's availability goes up significantly: 0.9995 * 0.9995 * 0.9995 ≈ 0.9985 = 99.85%.

That's approximately 13 hours of downtime per year. Much better than our previous number. Good job! Can we go further up? Yes, we can!

How to Increase System Availability Further

We can use AWS ElastiCache as a cache for reads. That way, read operations can be served from the cache even if our database fails. Redis multi-AZ has 99.9% service availability (from the ElastiCache SLA).

For write operations, we can detect whether the database is failing and push writes to an SQS queue if that happens, so they can be completed at a later time when the database is back up. SQS queues have 99.9% service availability (from the SQS SLA).

The probability of our Multi-AZ RDS Cluster failing is 1 - 0.9995. The probability of our Redis cluster failing is 1 - 0.999. If we combine both, we get the probability of them failing at the same time: (1 - 0.9995) * (1 - 0.999) = 0.0000005. And then we can calculate the probability of at least one of them working (in which case, our read operations are working): 1 - ((1 - 0.9995) * (1 - 0.999)) = 0.9999995.

The same numbers can be used for our write operations, since the service availability of an SQS queue is also 99.9%.

Finally, our system availability is now 0.9995 * 0.9995 * (1 - ((1 - 0.9995) * (1 - 0.999))) ≈ 0.9989998, or about 99.9%. It's the same number for reads and writes even though we're using different services (ElastiCache for reads and SQS for writes), because those two services have the same service availability. If they had different values, our system would have a different availability for reads than for writes. Which is not necessarily a bad thing.
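
The calculation combines two rules: components that must all work multiply their availabilities, and redundant components multiply their failure probabilities. Here is a minimal sketch of both, assuming independent failures and a fallback that can fully stand in for the database (function names are illustrative):

```python
import math


def all_must_work(*availabilities: float) -> float:
    """Series: the chain works only if every component works."""
    return math.prod(availabilities)


def any_can_work(*availabilities: float) -> float:
    """Redundancy: the group fails only if every member fails at the same time."""
    return 1 - math.prod(1 - a for a in availabilities)


# Multi-AZ RDS (99.95%) backed by ElastiCache or SQS (99.9% each).
data_layer = any_can_work(0.9995, 0.999)

# API Gateway -> Lambda -> data layer.
system = all_must_work(0.9995, 0.9995, data_layer)

print(f"Data layer: {data_layer:.5%}")
print(f"System:     {system:.3%}")
```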

Can we go even further up? Maybe... But at what cost?

How Much Does High Availability Cost?

The cost of system availability depends on the services used and the design complexity. The cost of these modifications and their maintenance can be high.

For the first improvement that we made, the cost calculation is pretty simple: we need to pay twice as much per month for RDS. A Multi-AZ cluster with two instances costs twice as much as a single instance (which makes sense, because you get two instances). The change is also pretty simple to make while we're designing the system, thanks to RDS being a managed service.

Adding ElastiCache and SQS is not that simple, and it's not cheap. Both ElastiCache and SQS have their own monthly costs (not too high), but you'll also need to make a few changes to your app:

You'll have to modify it to write to ElastiCache (which increases Lambda execution time and response time) and read from ElastiCache when the DB is not working.
You'll need to modify it to write to SQS when the database is not working.
You'll need an app to consume from the SQS queue and write to the DB when it's back up.
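
To give a feel for the size of those changes, here is a rough sketch of the read and write paths. It runs entirely in memory: FakeDatabase, a plain dict, and a deque are hypothetical stand-ins for RDS, ElastiCache, and SQS, and DatabaseUnavailable is an invented error type. Real code would use the respective clients and handle their specific errors.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Hypothetical error raised by the data layer when the database is down."""


class FakeDatabase:
    """In-memory stand-in for RDS; flip `up` to simulate an outage."""

    def __init__(self):
        self.rows = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise DatabaseUnavailable
        return self.rows.get(key)

    def put(self, key, value):
        if not self.up:
            raise DatabaseUnavailable
        self.rows[key] = value


class ResilientStore:
    def __init__(self, db, cache, pending_writes):
        self.db = db
        self.cache = cache                    # dict standing in for ElastiCache
        self.pending_writes = pending_writes  # deque standing in for an SQS queue

    def read(self, key):
        try:
            value = self.db.get(key)
        except DatabaseUnavailable:
            return self.cache.get(key)        # possibly stale, None on a cache miss
        self.cache[key] = value               # keep the cache warm for future outages
        return value

    def write(self, key, value):
        self.cache[key] = value               # extra work on every write
        try:
            self.db.put(key, value)
        except DatabaseUnavailable:
            self.pending_writes.append((key, value))  # complete it later

    def drain_pending_writes(self):
        """Consumer side: replay queued writes once the database is back."""
        while self.pending_writes:
            key, value = self.pending_writes[0]
            self.db.put(key, value)           # raises again if the DB is still down
            self.pending_writes.popleft()     # only drop the message after success


db = FakeDatabase()
store = ResilientStore(db, cache={}, pending_writes=deque())
store.write("user:1", "Ada")
db.up = False                                 # simulate an outage
store.write("user:2", "Grace")                # queued instead of failing
print(store.read("user:1"), len(store.pending_writes))
```

Even this toy version leaves open questions that a real system has to answer, such as stale reads, the ordering of queued writes, and what to tell the user about a write that hasn't reached the database yet.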

These changes are not unfeasible, but they're significantly harder and more expensive than what we had to do to use multi-AZ RDS, and the increase in availability is smaller. The reason behind that is that the more highly available your system is, the harder and more expensive it is to increase system availability.
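
The diminishing returns show up clearly when each design is converted to expected downtime per year. A small sketch using the three availability figures calculated earlier (the labels and function names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24

designs = [
    ("Single-instance RDS", 0.9995 * 0.9995 * 0.995),
    ("Multi-AZ RDS", 0.9995 * 0.9995 * 0.9995),
    ("Multi-AZ RDS + cache/queue fallback",
     0.9995 * 0.9995 * (1 - (1 - 0.9995) * (1 - 0.999))),
]


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year."""
    return (1 - availability) * HOURS_PER_YEAR


def marginal_gains(steps) -> list:
    """Hours of yearly downtime removed by each design relative to the previous one."""
    hours = [downtime_hours(a) for _, a in steps]
    return [prev - cur for prev, cur in zip(hours, hours[1:])]


for (name, availability), gain in zip(designs[1:], marginal_gains(designs)):
    print(f"{name}: {downtime_hours(availability):.1f} h/year down, "
          f"{gain:.1f} h/year better than the previous design")
```

The first, cheaper change removes roughly 39 hours of downtime per year. The second, more expensive one removes roughly 4.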

How Much System Availability is Enough?

As you asymptotically approach 100% system availability, you need to know when to stop. More is not always better: at some point your users won't even notice the change. But I assure you they will notice the price increase that you'll need to apply to stay in business because you went overboard and quadrupled your operational costs.

Strategies to Improve System Availability

Several strategies can be employed to improve system availability. Some of these include:

Redundancy: Duplicate critical components of the system, so if one fails, another takes over. This can be achieved by setting up multiple instances of services, using load balancers, or having backup databases.
Failover: Implement failover mechanisms to automatically switch to a backup component or system when a failure occurs. This can be done using DNS failover, database replication, or deploying multiple instances of a service behind a load balancer.
Caching: Use caching to store frequently accessed data, reducing the need to access the main data source. This can help minimize downtime during database outages, reduce latency, and improve overall system availability.
Monitoring and Alerting: Continuously monitor the system and set up alerts to notify you when there is a component failure or when the system is underperforming. This enables you to quickly identify and resolve issues, minimizing downtime.
Load Testing and Capacity Planning: Regularly perform load testing to ensure the system can handle increasing user traffic and plan for capacity upgrades accordingly.
Graceful Degradation: Design the system in such a way that when a component fails, the system continues to operate, albeit with reduced functionality. This can be done by implementing fallback mechanisms, retries, or serving stale data when fresh data is not available.

Tips for Architecting Highly Available Systems in AWS

To achieve a highly available system, some tips are:

Avoid using a single instance; use multiple instances across multiple availability zones.
Avoid multi-cloud or multi-region unless necessary. Avoiding cloud vendor lock-in is not a good enough reason.
Implement retries and/or degraded responses.
Don't overpay or overengineer.
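
As an illustration of the retries tip, here is a minimal sketch of a retry helper with exponential backoff and jitter. The function name, defaults, and the choice to retry on any exception are assumptions for the example. Real code would retry only on errors known to be transient, and only for operations that are safe to repeat.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` up to `attempts` times, backing off exponentially with jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                         # out of attempts: surface the error
            # Wait 0.1s, 0.2s, 0.4s, ... scaled by a random factor to spread retries out.
            sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(call_with_retries(flaky, sleep=lambda seconds: None))  # skip real sleeping in the demo
```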

Conclusion

Achieving high system availability requires understanding the components of the system, their availability, and their impact on the overall system. By identifying bottlenecks and using appropriate services, availability can be increased. It is important to remember that high system availability comes at a cost, and the level of availability should be tailored to the users' needs without overpaying or overengineering the system.

If you'd like to know more about me, you can find me on LinkedIn or at www.guilleojeda.com