SAS Viya Multi Availability Zone Deployment on AWS

This reference architecture provides an overview of how a Viya environment can be deployed to an AWS EKS cluster with multiple Availability Zones to minimize downtime in case an Availability Zone goes down.

Architecture Overview

Scenario

This reference architecture describes the recommended approach for deploying SAS Viya on AWS across multiple Availability Zones.

This architecture provides enhanced recovery in case of the following disruptions:

Single Pod failures: by running multiple instances of all services, the system is protected against pod failures.
Single Node failures: by spreading multiple instances over multiple nodes, the system is protected against node failures.
Availability Zone failures: by ensuring all required persisted data is available in a secondary Availability Zone, the Viya environment can be quickly restarted in case of an Availability Zone failure.

Note that protection against single pod and node failures can also be achieved in single Availability Zone setups.

This reference architecture can be combined with other reference architectures to provide additional resilience in the form of Backup / Restore and Disaster Recovery functionality.

End-User experience

When an Availability Zone goes down, users will experience a service disruption while the SAS Viya environment is restarted in the secondary Availability Zone. Once the service is available again, users can resume working as normal. They should, however, be aware that:

Compute sessions will have terminated and any work that was in-progress at the time of the disruption will have to be restarted.
CAS data will have to be reloaded into memory before it can be used again.

Considerations for cross Availability Zone deployments

Although Amazon EKS provides the ability to run workloads across Availability Zones, running a SAS Viya environment spread across multiple zones at the same time is not recommended. The main reasons for this are:

Performance Cost: Although cross-AZ latency is lower than cross-region latency, the increase compared to same-zone deployments can have a negative impact on the performance of high-performance analytical platforms like SAS Viya.
Infrastructure Cost: To maintain the same level of performance as a single-AZ deployment, additional infrastructure must be deployed so that the remaining zones can handle the full application load when an Availability Zone goes down.
Data Transfer Cost: Data transmitted between Availability Zones is charged. For analytical applications such as SAS Viya, this cost may be significant.
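
To get a feel for the data transfer cost, the arithmetic can be sketched as below. This is only a rough illustration: the daily volume and the per-GB rate are assumed placeholder values, not figures from this reference architecture, and the actual rate should be taken from current AWS pricing.

```python
def cross_az_transfer_cost(gb_per_day: float, rate_per_gb: float, days: int = 30) -> float:
    """Estimate the charge for traffic that crosses an Availability Zone boundary.

    rate_per_gb is a placeholder; look up the real value in current AWS pricing.
    """
    if gb_per_day < 0 or rate_per_gb < 0 or days < 0:
        raise ValueError("inputs must be non-negative")
    return gb_per_day * rate_per_gb * days


# Hypothetical workload: 2000 GB per day moving between zones,
# at an assumed combined rate of 0.02 per GB.
monthly = cross_az_transfer_cost(gb_per_day=2000, rate_per_gb=0.02, days=30)
print(f"Estimated monthly cross-AZ transfer cost: {monthly:.2f}")
```

Because the cost scales linearly with volume, workloads that repeatedly load large tables into CAS from storage in another zone are the ones most likely to be affected.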

Solution overview

Assumption

Networking infrastructure has been set up so that end users can reach the SAS Viya platform, and the platform can reach its data sources, regardless of the Availability Zone in which the application is running.

Components

The following key components make up the reference architecture:

EKS Node Pools: Separate EKS Node Pools are deployed in subnets in at least two Availability Zones. All node pools are labeled and tainted according to the SAS documentation. If following the recommended workload placement strategy, this means at least 10 node pools will be created:
- 2 default node pools
- 2 stateless node pools
- 2 stateful node pools
- 2 compute node pools
- 2 CAS node pools
Five of these node pools will be scaled down to zero nodes in normal operation. In case of an Availability Zone failure, these node pools can be scaled up to the required number of nodes.
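
The node pool layout and the scale-up step can be sketched as plain data, independent of any provisioning tool. This is a minimal sketch under stated assumptions: the zone names and node counts are hypothetical, the `workload.sas.com/class` label and taint key reflects our understanding of the SAS workload placement documentation and should be verified there, and leaving the default pool unlabeled and scaling the failed zone's pools to zero are choices made for this example.

```python
from dataclasses import dataclass

# Workload classes from the recommended placement strategy.
WORKLOAD_CLASSES = ["default", "stateless", "stateful", "compute", "cas"]
LABEL_KEY = "workload.sas.com/class"  # assumption: verify against the SAS documentation


@dataclass
class NodePool:
    name: str
    zone: str
    workload_class: str
    desired_size: int
    labels: dict
    taints: list


def build_node_pools(active_zone, standby_zone, active_sizes):
    """Return one pool per workload class per zone; standby pools start at zero nodes."""
    pools = []
    for zone in (active_zone, standby_zone):
        for cls in WORKLOAD_CLASSES:
            tagged = cls != "default"  # the default pool gets no SAS label or taint here
            pools.append(NodePool(
                name=f"{cls}-{zone}",
                zone=zone,
                workload_class=cls,
                desired_size=active_sizes[cls] if zone == active_zone else 0,
                labels={LABEL_KEY: cls} if tagged else {},
                taints=[f"{LABEL_KEY}={cls}:NoSchedule"] if tagged else [],
            ))
    return pools


def fail_over(pools, failed_zone, target_zone, sizes):
    """Set pools in the failed zone to zero and pools in the target zone to full size."""
    for pool in pools:
        if pool.zone == failed_zone:
            pool.desired_size = 0
        elif pool.zone == target_zone:
            pool.desired_size = sizes[pool.workload_class]
    return pools


# Hypothetical zone names and node counts.
sizes = {"default": 1, "stateless": 2, "stateful": 2, "compute": 1, "cas": 3}
pools = build_node_pools("eu-west-1a", "eu-west-1b", sizes)
print(len(pools), sum(1 for p in pools if p.desired_size == 0))

# Simulate losing the active zone: the standby pools take over the full load.
fail_over(pools, "eu-west-1a", "eu-west-1b", sizes)
```

The resulting list only describes the desired state; applying it to the cluster would be done with whatever infrastructure tooling is already in use.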
RDS PostgreSQL: A multi-AZ RDS PostgreSQL database is deployed. This can be either an RDS instance with a standby in a secondary AZ, or an RDS cluster with two readable standbys in a secondary and a tertiary AZ. In case of an Availability Zone failure, the RDS database will automatically fail over to a secondary Availability Zone, allowing the SAS Viya platform to be restarted with minimal delay.
FSx ONTAP: Amazon FSx for NetApp ONTAP is deployed with the Multi-AZ deployment type. SAS Viya requires both RWO block storage and RWX shared storage. Amazon FSx for NetApp ONTAP meets both storage requirements with a deployment type that makes this storage available across Availability Zones. This again ensures the SAS Viya platform can be restarted with minimal delay in case of an Availability Zone failure.
Elastic Container Registry: Although not strictly required, removing the dependency on upstream container image repositories reduces the time needed to restart your environment in a different Availability Zone. Using an Elastic Container Registry (ECR) removes this dependency. The ECR should mirror not only the SAS container registry, but also any other images required to run the supporting services in the EKS cluster, such as the Ingress controller and CSI providers.
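
When mirroring images, each upstream reference needs a matching location in the private registry. The sketch below shows one possible mapping that keeps the repository path and tag and swaps only the registry host. The account ID, region, and image names are hypothetical, and the host-detection rule is a simplification for this example.

```python
def ecr_image_uri(upstream_image: str, account_id: str, region: str) -> str:
    """Map an upstream image reference to the equivalent path in a private ECR registry.

    The registry host is dropped and the remaining repository path and tag
    are kept, so manifests only need the registry prefix changed.
    """
    first, sep, rest = upstream_image.partition("/")
    # Treat the first path segment as a registry host only if it looks like one.
    has_host = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_host else upstream_image
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{path}"


# Hypothetical account, region, and image names.
images = [
    "cr.sas.com/example-repo/example-service:1.0.0",
    "registry.k8s.io/ingress-nginx/controller:v1.9.4",
]
for image in images:
    print(image, "->", ecr_image_uri(image, "123456789012", "eu-west-1"))
```

Keeping the upstream path unchanged means the same list can drive both the copy step and the image overrides applied to the deployment.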

Additional Resources

Please also have a look at the related resources for this reference architecture: