Understanding Azure Availability Sets and Availability Zones

Azure Availability Zones (AZs) are gradually rolling out around the globe, and a common question I see is: what is the difference between the existing Azure Availability Set construct and the new Availability Zone construct? Can they co-exist? Does one replace the other? There are lots of questions!

In this post I'm going to explore the two concepts and hopefully provide some clarity.

Azure Regions

First we need to understand what an Azure Region is.

The official Microsoft documentation states:

"A [Region is a] set of datacenters deployed within a latency-defined perimeter and connected through a dedicated regional low-latency network".

Regions have been the standard way Microsoft makes Azure available somewhere in the world.

Typically there are multiple Azure Regions in a Geography (or Geopolitical area), and Regions are paired for redundancy of some low-level Azure services such as Storage Accounts. In Australia we are lucky to have four Regions: Australia East and Australia Southeast (paired), along with Australia Central and Australia Central 2 (also paired).

Azure Regions differ slightly from how AWS defines and deploys Regions. Typically AWS has a single Region in a Geography - so in Australia you have just one - APAC Sydney. There are some cases of multiple Regions in a Geography such as in Japan, but these are in the minority.

Traditionally, if you wanted to build a highly-available service in Azure you would have to use inter-Regional designs that duplicated services between Regions (e.g. Australia East and Australia Southeast), and then decide how to manage data synchronisation between those Regions. For people coming from multi-DC environments on-premises this wasn't a massive leap.

Service Level Agreements (SLAs)

Let's take a quick detour to talk about SLAs for a moment because they play a part in Availability Sets and Availability Zones.

Many people fixate on SLA numbers as some form of concrete commitment to meet or exceed availability of a service in Azure. As is stated in the Virtual Machine SLA document:

"If we do not achieve and maintain the Service Levels for each Service as described in this SLA, then you may be eligible for a credit towards a portion of your monthly service fees."

So the important takeaway here is that an SLA is an instrument under which you can receive compensation in the event that the Azure platform fails to meet the threshold set out in the SLA. The SLA does not mean that any Virtual Machine (in this case) will always have availability at or above the stated SLA target.

Things can and do happen which may impact a VM's ability to function as expected. Where it is determined that Azure was the root cause, the SLA defines what impacted parties can expect to receive as a remedy (in this instance a credit on your Azure account).

One final point on SLAs: the vast majority of SLAs for Azure services relate to the availability of a service. They do not relate to the performance of a service (i.e. how quickly it responds). Cosmos DB is one of the few services that has performance as a part of its SLA.

Right, with these two basic ideas out of the way let's dig into availability designs and how Availability Sets and Availability Zones help.

Single Virtual Machine (Azure VM SLA = 99.9%)

Let's start with an example Virtual Machine (VM) hosted workload. This VM will have a single disk and a single network interface.

If we simply "lifted and shifted" this workload from on-premises into Azure and ran it on a single VM with no special features enabled we could achieve a maximum Azure-backed SLA for availability of 95% which represents ~1.5 days a month of downtime being within acceptable boundaries.

If that SLA meets your business needs then you could stop here, but in many cases this wouldn't suffice.

We could make some configuration changes around the Disks used for the VM and increase the SLA to 99.9%, which is better, with ~43 minutes of monthly downtime being acceptable.
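To make the downtime numbers a little more concrete, here is a small sketch of the arithmetic behind them. It assumes a flat 30-day month (the real SLA is measured over the actual billing month, so the figures quoted in this post are rounded), and the function name is just something I made up for illustration.

```python
# Assumption: a flat 30-day month. The real SLA is measured over the
# actual billing month, so treat these numbers as approximations.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the monthly downtime (in minutes) that still meets the SLA."""
    return MINUTES_PER_MONTH * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # The SLA tiers discussed in this post.
    for sla in (95.0, 99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.1f} minutes (~{minutes / 60:.1f} hours)")
```

Running it for 95% gives 2,160 minutes, which is the 1.5 days mentioned above.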

Let's dig a little deeper and find out how we can improve this.

High Availability

I touched on how you could achieve high availability in Azure by using a multi-Region deployment model. Clearly in some cases this represents unnecessary complexity or cost for the style of workload being run.

If you want to run a VM-based workload in a single Azure Region, then the way you achieve improved availability is to use an Availability Set, which we'll look at next.

Availability Sets (Azure VM SLA = 99.95%)

An Availability Set allows you to improve the availability of a Virtual Machine (VM) workload by deploying multiple copies of the VM as a group. The Azure management plane then places the VMs so that the hosted workload(s) are resilient to Azure updates and faults.

It's important to note that Availability Sets can only be used for VM workloads (so Infrastructure-as-a-Service) within a single Azure Region. PaaS services such as Azure App Service don't require you to know about Availability Sets, even if they use them under the hood. It's also not possible to span Availability Sets across Regions.

When creating a VM you must select the Availability Set into which you wish to deploy. If you don't define this at VM creation time you cannot go back later and change this. If you want to move a VM into or out of an Availability Set you'll have to recreate the VM.

I think for Availability Sets to make sense we need to talk about some underlying Azure concepts - namely Fault and Update Domains.

Fault and Update Domains

There are many thousands of racks of servers and other hardware that provide both Virtual Machine capacity and other services such as storage for Azure.

All of these components need to be periodically updated (patches for hardware drivers and hypervisor software for example) and given the volume of hardware it's not uncommon for some components to periodically fail (memory failure, ephemeral disk failure / corruption, etc).

Now consider the impact on an application hosted on a single VM where an update requires a hypervisor reboot to install a patch or a hardware failure that causes the VM host to crash.

This is where Fault and Update Domains (and Availability Sets) come into play.

When you define an Availability Set and you place multiple VMs into it the Azure management plane will ensure that each VM instance is deployed to a different Fault and Update Domain.

The result is that whether hardware fails or updates are pushed out, your hosted application should remain operational.

By default you have 3 Fault Domains and 5 Update Domains you can use in an Availability Set. If you need more Update Domains you can request support to increase the number you can use up to 20.
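If it helps to picture this, here is a toy model of how VMs in an Availability Set could be spread across Fault and Update Domains. To be clear, the round-robin placement is my own simplifying assumption for illustration, not a description of how the Azure management plane actually decides placement, and the function names are made up.

```python
def place_vms(vm_names, fault_domains=3, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair.

    Assumption: simple round-robin placement, purely for illustration.
    """
    return {
        name: (index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    }


def survivors(placement, failed_fault_domain=None, rebooting_update_domain=None):
    """Return the VMs still running when one domain is unavailable."""
    return [
        name
        for name, (fd, ud) in placement.items()
        if fd != failed_fault_domain and ud != rebooting_update_domain
    ]


if __name__ == "__main__":
    placement = place_vms(["web-0", "web-1", "web-2"])
    # A hardware fault takes out Fault Domain 0.
    print(survivors(placement, failed_fault_domain=0))
    # A platform update reboots Update Domain 2.
    print(survivors(placement, rebooting_update_domain=2))
```

In both cases two of the three instances stay up, which is the whole point of the construct: a single fault or a single update wave should never take out every copy of the workload.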

Availability Set SLA uplift

So if we place a VM into an Availability Set with at least one other VM running the same workload, the SLA lifts from 99.9% to 99.95%, which represents ~22 minutes of downtime a month being acceptable.

While Availability Sets are free to use, you will incur the cost of running multiple VMs and their associated artefacts such as disks and public IP addresses (if you use them).

Each VM is uniquely addressable, so if you want to spread load across the hosts then you'd have to run a Load Balancer or Application Gateway instance as well. This would mean your workload also needs to support load balancing, or you need a mechanism to manage outages due to a single instance restarting / being unavailable.

Now let's take a look at Availability Zones.

Availability Zones (Azure VM SLA = 99.99%)

The Availability Set construct has been with Azure for many years and works for a lot of use cases. While 22 minutes of downtime a month doesn't sound like much there might be use cases where you need higher availability.

Your choice in this case is to look at Availability Zones. When you deploy at least two instances of a VM workload to two different Zones you have access to a 99.99% SLA which represents less than 5 minutes a month being acceptable downtime.

If you're coming from AWS you most likely understand the concept of an Availability Zone already. If not, let's spend a little time exploring them. This is the definition from the official Microsoft documentation:

"Unique physical locations within a Region. Each Zone is made up of one or more datacenters equipped with independent power, cooling, and networking."

Sounds similar to Regions in some respects, right? Where they differ substantially is how you achieve high availability when deploying workloads.

If we used the inter-Region model we would have to design cross-Region networking support, live with data charges, and think about how the other services in our architecture would be made highly available between Regions.

Now with Availability Zones we can start to use zone-redundant or zone-aware services in Azure to simplify our designs. We do still incur costs such as inter-zone network traffic charges, but we have less virtual networking infrastructure to manage and likely less cost as a result of that overall.

VMs in Availability Zones

We also simplify our network design by being able to use the same VNet. You can now use zone-aware versions of other Azure services such as Load Balancers and Application Gateways that make managing full stack availability much more transparent and straightforward.

Data services like Azure SQL Database also support Availability Zones which means we have a consistent way to manage availability regardless of whether the workload is IaaS or PaaS.

Note: Availability Zones are not available in all Azure Regions (yet), nor supported by all Azure services. Availability Zones are not simply a logical overlay on existing Azure infrastructure and require hardware and datacentre locations to be provisioned before being available in existing Regions. Services also need to be updated to be zone-aware. In Australia only the Australia East Region (at time of writing) has Zones available.

There are at least three Availability Zones in a Region that supports them. The Zone numbers are mapped to physical Zones differently for each subscription to ensure use is balanced across all infrastructure (who knew people would just use 'Zone 1' as their default? 🤷‍♂️)

So when do I use what?

The following criteria should help you decide:

Can my workload support multiple load-balanced copies?
  If not, then a Single VM it is.
What SLA do I need to support for my application components?
  99.9% = Single VM (with Premium SSD or Ultra Disk);
  99.95% = Availability Set;
  99.99% = Availability Zones.
Are Availability Zones available in a Region for a service I can or want to use?
  If not, use an Availability Set.
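If you prefer to see those criteria as logic, here is a rough sketch of them as a single function. The function and parameter names are hypothetical, and it only encodes the three questions above. Cost, data residency and everything else you'd weigh up in a real design are deliberately left out.

```python
def recommend_deployment(
    supports_load_balancing: bool,
    required_sla: float,
    zones_available: bool,
) -> str:
    """Map the three questions above to a deployment option.

    Hypothetical helper for illustration only, not an official decision tool.
    """
    # A workload that cannot run as multiple load-balanced copies
    # can only ever be a Single VM.
    if not supports_load_balancing:
        return "Single VM"

    if required_sla <= 99.9:
        # Needs Premium SSD or Ultra Disk to qualify for 99.9%.
        return "Single VM"
    if required_sla <= 99.95:
        return "Availability Set"

    # Above 99.95% you want Zones. Where they are not offered, an
    # Availability Set is the fallback, but its SLA tops out at 99.95%.
    return "Availability Zones" if zones_available else "Availability Set"


if __name__ == "__main__":
    print(recommend_deployment(True, 99.99, zones_available=True))
    print(recommend_deployment(True, 99.99, zones_available=False))
    print(recommend_deployment(False, 99.99, zones_available=True))
```

Note the fallback case: if you need 99.99% but Zones aren't available in your Region, the Availability Set is the best single-Region option you have, and it won't get you to that SLA.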

The final question you might come up against is choosing a Set over a Zone where the SLA or the cost isn't necessarily the driving differentiator.

I think starting new work using Availability Zones is the right step because you can use Zones for much more than just Virtual Machines which leads to a longer term consistent management experience.

If you're already using Azure with Availability Sets and Regions then you likely need to go back to grass roots and revisit your cloud architecture foundations to figure out how (or if you need) to move to Availability Zones.

Happy (available) days! 😎