
What is AWS Multi-AZ? Configuration examples, advantages/disadvantages, and design considerations


When considering availability measures for Amazon Web Services (AWS), a Multi-AZ configuration is a good choice because it allows you to avoid single points of failure by distributing resources across multiple Availability Zones.

However, simply adopting Multi-AZ often does not deliver the recovery you expect. If the design or operational assumptions are wrong, a service can still go down even though the configuration is redundant.

In this article, we clarify where Multi-AZ fits in AWS and cover the following:

  • Failures that can be handled by Multi-AZ and those that cannot be handled

  • Pitfalls in design and operation that can easily occur in practice

  • Relationship with DR measures and the turning point for determining configuration

Before proceeding with the design assuming Multi-AZ, let's first review the important points to keep in mind.

Failures that can be handled by Multi-AZ and those that cannot be handled

When considering a multi-AZ configuration, you must first determine what can be protected and what cannot be protected. Multi-AZ is a basic means of increasing availability, but it is not an all-purpose DR solution. If you get the scope of protection wrong, the risk of outages remains even if you think you have achieved redundancy.

Failures that Multi-AZ is designed to handle

Multi-AZ is intended to ensure that the entire service can continue even if a failure occurs in a specific Availability Zone.

For example, if you distribute application servers and databases across multiple AZs and use a load balancer to switch to a healthy AZ, you can reduce the risk of relying on a single location.

The core idea is to design for availability so that processing can continue in the remaining AZs even if one AZ becomes unavailable.

Failures that Multi-AZ does not cover (regional failures, operational errors)

However, because Multi-AZ distributes resources within a single region, it does not automatically cover region-wide failures or large-scale disasters. And even with a Multi-AZ configuration, recovery may not be possible if the operational design is inadequate.

If the switchover procedure is not well established, or if dependent services remain in a single AZ, the system will not behave as expected when a failure occurs. Multi-AZ is only a prerequisite; it becomes effective only when design and operation are worked out together.

A typical way multi-AZ configurations fail: designs that lack static stability

Even with a multi-AZ configuration, a common failure pattern is that the moment one AZ goes down, the remaining AZs become overloaded and services fail in a chain reaction. The underlying cause is a design that assumes "we can scale out after the failure occurs."

In a multi-AZ environment, you need to keep enough capacity provisioned during normal operation that the remaining resources alone can handle the required load when one AZ fails. This is the concept of static stability.

To increase availability, it is essential not only to distribute the deployment but also to consider the capacity that can withstand failures and the switching design together.
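
As a rough illustration of static stability, the following Python sketch computes how many instances to keep running in each AZ so that the surviving AZs can carry peak load without scaling out. The function names, the request rate, and the per-instance capacity are hypothetical values chosen for the example, not figures from AWS.

```python
import math


def instances_per_az(peak_load: float, capacity_per_instance: float,
                     az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to run in every AZ so the surviving AZs absorb peak load."""
    surviving_azs = az_count - az_failures_tolerated
    if surviving_azs < 1:
        raise ValueError("at least one AZ must survive the failure")
    # Total instances needed to serve peak load at 100% utilization.
    total_needed = math.ceil(peak_load / capacity_per_instance)
    # Spread that total over only the AZs that are assumed to survive.
    return math.ceil(total_needed / surviving_azs)


def utilization_after_az_loss(peak_load: float, capacity_per_instance: float,
                              per_az: int, az_count: int,
                              az_failures: int = 1) -> float:
    """Utilization of the surviving fleet at peak load (1.0 means 100%)."""
    surviving_instances = per_az * (az_count - az_failures)
    return peak_load / (surviving_instances * capacity_per_instance)


if __name__ == "__main__":
    # Hypothetical workload: 900 req/s at peak, 100 req/s per instance, 3 AZs.
    print(instances_per_az(900, 100, az_count=3))              # 5 per AZ
    # Sizing only for normal operation (3 per AZ) overloads the survivors.
    print(utilization_after_az_loss(900, 100, per_az=3, az_count=3))  # 1.5
```

With these assumed numbers, a fleet sized only for normal operation (9 instances in total) would run at 150% after losing one AZ, while the statically stable fleet (15 instances) stays at 90%. The difference is the extra capacity that static stability asks you to pay for in advance.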

Organizing the relationship between regions and availability zones

To properly understand Multi-AZ, you must first understand how AWS's infrastructure is divided. If the difference between "Regions" and "Availability Zones (AZs)" remains unclear, the assumptions for availability design will be incorrect.

Availability Zones are physically separate units of availability.

An AWS Region is a geographical area that contains multiple Availability Zones. AZs are not simply logical divisions; each consists of one or more discrete data centers with independent power, networking, and facilities.

Therefore, even within the same region, if a failure occurs in one AZ, it is less likely to affect other AZs. The multi-AZ configuration is an availability design based on the premise that "failures can be isolated on an AZ basis."

What does distance and latency mean for AZ independence?

AZs are physically separated from each other, with distances ranging from several kilometers to several tens of kilometers. However, because they are in the same region, they are connected by dedicated low-latency links.

This distance matters because it is far enough to isolate failures, yet close enough to make synchronous replication and fast failover practical.

While multi-region handles distribution at the scale of disasters, multi-AZ is positioned as a design that "isolates failures while maintaining low latency."

Fundamental differences between single-AZ and multi-AZ configurations

A single AZ configuration is a design that consolidates systems into a single AZ. While this configuration is simple and helps keep costs down, if that AZ goes down, the entire service will be affected.

On the other hand, a multi-AZ configuration distributes resources across multiple AZs so that services can continue even if one AZ fails. However, the more distributed the resources are, the more complex the design becomes, and the network and operations also have to be coordinated.

The first step to improving availability is not simply to "span AZs," but to understand the roles of regions and AZs and use them appropriately.

Advantages and disadvantages of a multi-AZ configuration

A multi-AZ configuration is a basic method for increasing availability in AWS, but simply adopting it does not automatically make a system safe. You need to weigh the cost and design effort against the benefit.

Key benefits of a Multi-AZ configuration

The biggest benefit is that it avoids a single point of failure. By distributing your system across multiple AZs, you increase the chances that your entire service will continue to function even if a failure occurs in a specific AZ.

By making your infrastructure redundant, you can reduce the reliance on human intervention to recover from failures. By combining this with load balancer switching and automatic database failover, you can minimize downtime.

Major disadvantages of a Multi-AZ configuration

On the other hand, this approach comes with increased costs and complexity. Placing servers and databases in multiple AZs essentially duplicates the necessary resources. Additionally, communication and data transfer between AZs increases operational costs.

The more distributed the configuration, the more difficult it is to design and operate. If dependencies remain in a single AZ, or if switchover procedures and monitoring are insufficient, recovery will not go as expected even with a multi-AZ configuration.

Multi-AZ is an effective means of increasing availability, but introducing it does not by itself make a system safe; design and operation have to be part of the implementation.

Understanding the Basics of Multi-AZ Design through Configuration Patterns

To utilize Multi-AZ in practice, it is important not only to understand it as a concept, but also to consider the design based on typical configuration patterns. Many AWS services are built on the premise of Multi-AZ, but where distribution is required and what is likely to become a single point of failure will vary depending on the configuration. Here we will organize the most common patterns.

A typical multi-AZ configuration for a web application

The most common configuration for web applications is to distribute application servers across multiple AZs under a load balancer.

For example, by using an Application Load Balancer (ALB) as the entry point and distributing traffic to EC2 instances and containers placed in each AZ, processing can continue in the healthy AZs even if a failure occurs in one AZ.

What matters here is not just spreading servers across AZs, but designing capacity so that the necessary processing power remains even if one AZ is lost. Multi-AZ is not only a question of placement; it has to be paired with a design that can tolerate that loss.

Points to note about session and state management

One common stumbling block in multi-AZ design is deciding where to store the application state.

If session information and temporary data are stored locally on the server, user status cannot be retained in the event of an AZ failure or scale-out, making it difficult to continue service.

Therefore, in a multi-AZ environment, it is essential to design applications to be as stateless as possible and to move state out to an external store. Where needed, use services such as ElastiCache or DynamoDB to manage sessions so that they stay consistent across AZs.
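
To make the idea of externalized state concrete, here is a minimal Python sketch. The `SessionStore` interface, the `InMemorySessionStore` class, and the `AppServer` class are all hypothetical names invented for this example; the in-memory class only stands in for an external service such as ElastiCache or DynamoDB and is not how those services are accessed in practice.

```python
import copy
from typing import Dict, List, Optional, Protocol


class SessionStore(Protocol):
    """Minimal interface an external session store is assumed to offer."""

    def get(self, session_id: str) -> Optional[dict]: ...
    def put(self, session_id: str, data: dict) -> None: ...


class InMemorySessionStore:
    """Stand-in for an external store shared by servers in every AZ."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        # Return a copy so callers cannot mutate stored state by accident.
        data = self._data.get(session_id)
        return copy.deepcopy(data) if data is not None else None

    def put(self, session_id: str, data: dict) -> None:
        self._data[session_id] = copy.deepcopy(data)


class AppServer:
    """Stateless server: all session state lives in the shared store."""

    def __init__(self, az: str, store: SessionStore) -> None:
        self.az = az
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        session = self.store.get(session_id) or {"cart": []}
        session["cart"].append(item)
        self.store.put(session_id, session)
        return session["cart"]


if __name__ == "__main__":
    store = InMemorySessionStore()
    server_a = AppServer("az-a", store)
    server_b = AppServer("az-b", store)
    server_a.add_to_cart("session-1", "book")
    # The next request lands in another AZ and still sees the cart.
    print(server_b.add_to_cart("session-1", "pen"))  # ['book', 'pen']
```

Because neither server keeps the cart in local memory, losing the server or the AZ that handled the first request does not lose the user's state, provided the store itself is redundant across AZs.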

Multi-AZ thinking at the database layer

At the database layer, it is important not to confuse "Multi-AZ configuration" with "RDS Multi-AZ."

RDS's Multi-AZ feature places the standby in a different AZ and provides a mechanism for automatic failover in the event of a failure. However, unless the application's connection design and post-recovery operations are coordinated, there are cases where the switchover does not go as expected.

Read replicas may also be used alongside Multi-AZ, but their purpose is different: they serve read performance and scaling in addition to availability. Multi-AZ should be seen not as "duplicating the database," but as a data layer design that keeps business operations running when a failure occurs.
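
One part of the application's connection design mentioned above can be sketched as a reconnect loop with exponential backoff. This is a generic Python sketch, not an RDS API: `connect_with_retry` and its parameters are hypothetical, and the `connect` callable is assumed to open a fresh connection using the database endpoint hostname on every call, so that the name is resolved again after a failover.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T],
                       attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, backing off between attempts.

    connect is assumed to build a brand-new connection each time rather
    than reuse a cached socket or a cached IP address.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... base_delay before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_connect() -> str:
        # Simulated failover: the first two attempts are refused.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("failover in progress")
        return "connected"

    print(connect_with_retry(flaky_connect, base_delay=0.01))  # connected
```

The retry budget (attempts and delays) should be chosen to exceed the failover time you actually observe in tests; the defaults here are placeholders.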

Design mistakes that can easily cause a multi-AZ configuration to fail

Even if the configuration looks correct, there are cases where the system stops when a failure occurs. In many cases, the cause is that a single point of failure actually remains even though the system was meant to be deployed across AZs.

Single point of failure in the network

The network is a single point of failure that is often overlooked in multi-AZ configurations. For example, if you consolidate NAT gateways in one AZ, outbound access from the other AZs may become impossible the moment that AZ goes down.

You should also be careful with interface-type VPC endpoints. If an endpoint exists only in a specific AZ, there is a risk that communication to AWS services from other AZs will be cut off if that AZ fails.

In a multi-AZ design, you need to ensure that not only your applications but also your network routes are redundant across AZs.
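
A check for the NAT gateway case can be sketched in a few lines of Python. The input dictionaries are assumed to have been exported from your own inventory or infrastructure-as-code definitions; the function name and all identifiers such as `subnet-a` and `nat-a` are hypothetical.

```python
from typing import Dict, List, Tuple


def cross_az_nat_dependencies(subnet_az: Dict[str, str],
                              subnet_nat: Dict[str, str],
                              nat_az: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Find subnets whose default route uses a NAT gateway in another AZ.

    Returns (subnet, subnet AZ, NAT gateway AZ) for each mismatch.
    """
    findings = []
    for subnet, nat in sorted(subnet_nat.items()):
        if subnet_az[subnet] != nat_az[nat]:
            findings.append((subnet, subnet_az[subnet], nat_az[nat]))
    return findings


if __name__ == "__main__":
    # Hypothetical layout: three private subnets, but only one NAT gateway.
    subnet_az = {"subnet-a": "az-a", "subnet-b": "az-b", "subnet-c": "az-c"}
    nat_az = {"nat-a": "az-a"}
    subnet_nat = {"subnet-a": "nat-a", "subnet-b": "nat-a", "subnet-c": "nat-a"}
    for subnet, own_az, gateway_az in cross_az_nat_dependencies(
            subnet_az, subnet_nat, nat_az):
        print(f"{subnet} in {own_az} depends on a NAT gateway in {gateway_az}")
```

In this assumed layout, the subnets in az-b and az-c are flagged: they would lose outbound access if az-a failed. The same comparison can be applied to interface VPC endpoints by mapping each subnet to the AZs in which the endpoint has a network interface.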

Dependencies are pinned to a specific AZ

Even if your workloads are distributed across multiple AZs, if important processes or dependent services are concentrated in one AZ, switching will not be possible in the event of a failure.

Typical examples are management batch jobs that exist only in a specific AZ, or storage or authentication infrastructure that remains in a single AZ.

Even if the configuration diagram looks redundant, any dependency that is not actually redundant across AZs becomes a bottleneck in the event of a failure. At the design stage, it is important to break the system down and confirm "what will cause business operations to stop if it goes down."

Cases where monitoring, switchover, and recovery procedures assume a single AZ

Availability is not determined by configuration alone; it is achieved in conjunction with operational design.

For example, even if automatic failover is assumed, if there are no procedures in place to check operation after switching or to return to normal after recovery, the result may be long outages.

If monitoring cannot correctly detect a failure confined to one AZ, it is easy to end up unable to respond even though something is visibly wrong. To make Multi-AZ work in production, you need to verify failure behavior during normal operations and make switchover and recovery part of routine operations.
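
The point about detecting a failure in a single AZ can be illustrated by grouping health-check results by AZ instead of looking only at the fleet-wide figure. The function name, the 50% threshold, and the sample data in this Python sketch are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def unhealthy_azs(targets: Iterable[Tuple[str, bool]],
                  threshold: float = 0.5) -> List[str]:
    """Return AZs whose fraction of healthy targets is below threshold.

    targets is an iterable of (az, is_healthy) pairs, one per target.
    """
    totals: Dict[str, int] = defaultdict(int)
    healthy: Dict[str, int] = defaultdict(int)
    for az, is_healthy in targets:
        totals[az] += 1
        if is_healthy:
            healthy[az] += 1
    return sorted(az for az in totals if healthy[az] / totals[az] < threshold)


if __name__ == "__main__":
    # Hypothetical snapshot: 4 of 6 targets are healthy fleet-wide,
    # but every target in az-c is failing.
    snapshot = [
        ("az-a", True), ("az-a", True),
        ("az-b", True), ("az-b", True),
        ("az-c", False), ("az-c", False),
    ]
    print(unhealthy_azs(snapshot))  # ['az-c']
```

A fleet-wide alarm set at, say, 60% healthy would stay silent on this snapshot, while the per-AZ view shows that one AZ has stopped serving entirely.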

Multi-AZ design considerations from a cost perspective

Multi-AZ configurations are also designs that can easily increase costs. Rather than simply thinking of redundancy as safety, it is important to understand which costs will increase and strike a balance that meets your business requirements.

Breakdown of increased costs with Multi-AZ

In a multi-AZ environment, resources are distributed across multiple AZs, so the infrastructure is essentially duplicated. Not only does the number of application servers increase, but databases, load balancers, and other systems are also designed with redundant configurations in mind.

In addition, increasing the number of monitored targets also increases operational costs. The more complex the configuration, the higher the total cost, including troubleshooting and testing efforts. Multi-AZ costs must be considered not just in terms of the price list, but also in terms of the operational load.

Important points to note regarding inter-AZ communication and data transfer

In multi-AZ environments, communication occurs between AZs, which may increase network transfer costs.

For example, if an application and database communicate across different AZs, traffic between AZs will occur even under normal circumstances. The impact will be greater for systems with large amounts of data transfer, including synchronous replication and log transfer.

At the design stage, check whether prioritizing availability has produced a configuration in which inter-AZ communication is routine even during normal operation.
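
A back-of-the-envelope estimate helps when judging whether routine inter-AZ traffic matters for a given system. In the Python sketch below, the per-GB price and the assumption that traffic is billed in both directions are placeholders supplied by the caller; check current AWS pricing for the services and region involved, since some managed replication traffic may be billed differently.

```python
def monthly_inter_az_cost(gb_per_day: float,
                          price_per_gb_each_direction: float,
                          days: int = 30) -> float:
    """Estimate monthly inter-AZ transfer cost.

    Assumes each GB crossing an AZ boundary is billed once on the sending
    side and once on the receiving side (an assumption to verify).
    """
    if gb_per_day < 0 or price_per_gb_each_direction < 0:
        raise ValueError("inputs must be non-negative")
    return gb_per_day * days * price_per_gb_each_direction * 2


if __name__ == "__main__":
    # Hypothetical: 500 GB/day between app and database in different AZs,
    # at an assumed price of 0.01 USD per GB in each direction.
    cost = monthly_inter_az_cost(500, 0.01)
    print(f"{cost:.2f} USD per month")  # 300.00 USD per month
```

Running the same calculation with measured traffic volumes makes it easier to decide whether to keep chatty components in the same AZ during normal operation or to accept the transfer cost for the sake of simpler failover.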

Design decisions that balance cost and availability

Multi-AZ is not a "must-have" configuration, but rather a design decision to be made based on the availability you require. It is a standard premise for systems that cannot tolerate business interruptions, but applying it across the board to environments where the impact of interruptions is limited could result in excessive investment.

The key is to determine, based on RTO, RPO, and the business impact, how much of the service must keep running in the event of a failure, and then implement redundancy appropriate to that scope. A design that lets you control both availability and cost leads to realistic use of multi-AZ.

How to think about the relationship between multi-AZ and DR measures

Multi-AZ is a basic method for designing availability in AWS, but it cannot replace all DR measures. Availability and disaster recovery are often discussed in similar contexts, but the scale of anticipated failures and recovery requirements differ. This section explains when Multi-AZ is sufficient and what additional configurations should be considered.

Cases where Multi-AZ is sufficient

Multi-AZ is useful when you want to prepare for AZ failures or equipment failures that occur within a region.

For example, if you want to avoid service interruptions due to a single AZ outage in a business system or web service, a multi-AZ configuration is the standard choice.

By combining automatic failover and load balancing, downtime can be minimized, making it an effective availability measure as long as the region itself remains operational.

When to consider multi-region

On the other hand, if you are considering a region-wide failure or a large-scale disaster, multi-AZ alone is not enough.

To prepare for the possibility of a service becoming unavailable on a regional basis, a multi-region configuration is required, including standby systems, backups, and switchover designs in other regions.

Additionally, if business continuity requirements state that the service "cannot be down for more than a few hours" even in a regional failure, or that "regional distribution is essential," then a DR design should be considered in addition to multi-AZ.

Configuration decisions based on RTO and RPO

When deciding whether to use multi-AZ or multi-region, it is important to work backwards from your recovery requirements rather than from a technical standpoint. By clarifying your RTO (the amount of downtime you can tolerate before recovery) and RPO (the amount of data loss you can tolerate), you can determine the level of redundancy you need.

If the goal is immediate recovery from AZ failure, multi-AZ is the main approach, but if regional failures are also taken into consideration, the design must also include a backup strategy and cross-region recovery.
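
Working backwards from recovery requirements can be written down as a simple decision rule. The following Python sketch is only one possible way to encode the reasoning in this section; the function, the enum, and the inputs (how long a restore from backup takes and how often backups are taken) are hypothetical, and real decisions would also weigh cost and business impact.

```python
from enum import Enum


class Redundancy(Enum):
    SINGLE_AZ = "single AZ with backups"
    MULTI_AZ = "multi-AZ"
    MULTI_REGION = "multi-AZ plus cross-region DR"


def choose_redundancy(rto_minutes: float,
                      rpo_minutes: float,
                      region_failure_in_scope: bool,
                      restore_from_backup_minutes: float,
                      backup_interval_minutes: float) -> Redundancy:
    """Pick the smallest redundancy level that meets RTO and RPO."""
    if rto_minutes <= 0 or rpo_minutes < 0:
        raise ValueError("RTO must be positive and RPO non-negative")
    if region_failure_in_scope:
        # Multi-AZ stays inside one region, so it cannot cover this case.
        return Redundancy.MULTI_REGION
    restore_too_slow = restore_from_backup_minutes > rto_minutes
    backups_too_sparse = backup_interval_minutes > rpo_minutes
    if restore_too_slow or backups_too_sparse:
        # Rebuilding from backup after an AZ failure would miss RTO or RPO.
        return Redundancy.MULTI_AZ
    return Redundancy.SINGLE_AZ


if __name__ == "__main__":
    # Hypothetical production system: 15-minute RTO, 5-minute RPO,
    # restore takes 2 hours, backups run daily, regional failure out of scope.
    choice = choose_redundancy(15, 5, False, 120, 1440)
    print(choice.value)  # multi-AZ
```

The value of writing the rule down is less the code itself than the inputs it forces you to state: a measured restore time, a backup interval, and an explicit decision about whether regional failure is in scope.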

Practical points to keep in mind during the design and operation phases

A multi-AZ configuration is not just about looking redundant on a blueprint. Whether it can truly continue in the event of a failure depends on the consistency of the design and operational preparation. Here we will summarize practical points that are likely to make a difference after construction.

Points to check during design review

In a multi-AZ design, you need to review not just whether resources are distributed, but also whether the structure will function in the event of a failure. The key questions are whether a single point of failure remains and whether dependencies are pinned to a specific AZ.

Also check whether there is enough capacity to withstand the loss of one AZ and whether the concept of static stability is reflected in the design. What matters is whether the system as a whole can continue to function, not whether each component is redundant on its own.

The importance of testing and training assuming failures

Availability is only truly put to the test at the moment a failure occurs. Even if you use Multi-AZ, unless you check in advance that switchover works as expected, it may not function in production.

Only through testing and training does it become clear whether the application works normally after failover, whether performance holds up after switchover, and what the procedure is for returning to normal after recovery. Beyond building the configuration, rehearsing failure behavior during normal operations is what makes the difference in practice.

Operational design concepts based on multi-AZ

In a multi-AZ environment, operations cannot be designed around a single AZ. Monitoring must be designed to detect anomalies on a per-AZ basis, and post-switchover check items and escalation procedures must also be organized.

In addition, regular maintenance and updates need to be carried out without disrupting service, which requires precise change management and work planning. To run Multi-AZ in production, it is essential to prepare configuration, monitoring, and recovery procedures as a set and keep them in working order.

FAQ

Does using Multi-AZ guarantee that there will be no service interruption?

No. Multi-AZ increases tolerance to a single-AZ failure, but an outage can still occur if the design or operation is inadequate.

If there is not enough capacity when one AZ is lost, or if the network or dependent services remain in a single AZ, the service cannot continue even with a multi-AZ configuration. Beyond the configuration itself, it is important to ensure that the design actually works in the event of a failure.

Is Multi-AZ sufficient as a DR solution?

Whether it is sufficient as a DR measure depends on the scope of the anticipated failures. Multi-AZ mainly covers AZ failures within a region, and is not designed to handle failures of the entire region or large-scale disasters.

If your business continuity requirements are stringent, you will need to consider a separate DR design, including multi-region and backup.

How much does multi-AZ increase costs?

The extent of the increase will vary depending on the configuration, but basically costs will increase due to resource duplication and inter-AZ communication. In addition to redundancy of application servers and databases, network costs will also have an impact in systems with high transfer volumes.

It is important to check whether the redundancy is excessive for your availability requirements and to limit the design to what is necessary.

Are there cases where a single AZ is acceptable?

Yes. A single AZ may be reasonable for development environments, testing environments, and small systems where the impact of outages is limited. However, if business outages cannot be tolerated in a production environment, a single AZ configuration will result in a single point of failure.

The important thing is not whether Multi-AZ is the right choice, but rather choosing the level of availability required for your business requirements.

Conclusion

AWS's multi-AZ configuration is a basic design that isolates failures within a region and reduces the risk of service outages due to a single-AZ failure. It is a standard availability measure for web applications and business systems.

However, simply adopting a multi-AZ configuration does not guarantee safety. It will only be effective if you have a capacity design that takes static stability into account, eliminate single points of failure including the network, and have an operational design that includes switching procedures and training.

Furthermore, if regional failures are anticipated, multi-AZ configuration is insufficient, and DR design and multi-region configurations must be considered based on RTO and RPO. It is important to determine the appropriate redundancy level according to availability requirements.


Author: Kazuki Kato

Serverworks Co., Ltd., Marketing Department, Marketing Section 1

After working as a sales representative for an independent ISP and SIer, optimizing customer systems and networks, he joined Serverworks. Since joining the company, he has worked on development standardization projects for an electric power carrier and proposed and implemented an in-station reading system for a railway operator. He is currently in charge of event marketing and inside sales. His hobby is washing cars.

AWS Certified Database – Specialty (DBS)
