Managing vSAN Storage Policies for Azure VMware Solution Private Clouds
Recently, the same vSAN question came up multiple times: What is the best approach to vSAN storage policies on deployment and expansion of the Azure VMware Solution private cloud?
The typical lifecycle of deploying and growing Azure VMware Solution (AVS) starts with the initial three (3) node deployment; then the migration of virtual machines from on-premises to AVS begins, or workloads grow organically. Once the three-node cluster fills, more nodes are needed.
The ability to grow and shrink the cluster on demand is the beauty of cloud-scale, and more specifically, of running VMware in the cloud. No need to overprovision the cluster; consume hosts as they are needed! How many reading this can relate to a cluster that is overprovisioned because you’d rather have it and not need it than need it and not have it? So many issues come along with that, but that is a topic for another post. 😊
In a three-node cluster, the default storage configuration is RAID-1 with FTT=1, so the cluster can tolerate a single host failure without losing data. RAID-1 is the least space-efficient option. Generally speaking, the amount of physical disk consumed is twice the size of the VMDKs.
Most production implementations of Azure VMware Solution will grow to use a configuration of RAID-5 or RAID-6 with either 1 or 2 Failures to Tolerate. OK, great, so what is the problem? When expanding the cluster to four or more nodes, users can fall into one of two traps. The first is continuing to deploy virtual machines with a RAID-1, FTT=1 configuration even though a more storage-efficient option is available (see table below). The second is using RAID-5 or RAID-6 only on newly deployed VMs and forgetting to go back and modify the pre-existing VMs (the ones deployed when the cluster had only three nodes).
The table below is from the VMware vSAN Design Guide. As you can see, it outlines the high availability options as the cluster grows.
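To put rough numbers on the space difference, here is a minimal Python sketch. The multipliers and minimum host counts are the commonly cited figures for vSAN's original storage architecture (RAID-1 FTT-1 at 2x on three hosts, RAID-5 FTT-1 at about 1.33x on four, RAID-6 FTT-2 at 1.5x on six); treat them as assumptions and check them against the design guide for your version. The names `VsanPolicy`, `raw_capacity_gb`, and `usable_policies` are made up for illustration, and the math ignores slack space, deduplication, and compression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VsanPolicy:
    name: str
    min_hosts: int               # smallest cluster that can satisfy the policy
    capacity_multiplier: float   # raw capacity consumed per unit of VMDK


# Assumed figures for vSAN's original storage architecture.
POLICIES = {
    "RAID-1 FTT-1": VsanPolicy("RAID-1 FTT-1", 3, 2.0),    # two full mirrors
    "RAID-5 FTT-1": VsanPolicy("RAID-5 FTT-1", 4, 4 / 3),  # 3 data + 1 parity
    "RAID-6 FTT-2": VsanPolicy("RAID-6 FTT-2", 6, 1.5),    # 4 data + 2 parity
}


def raw_capacity_gb(vmdk_gb: float, policy_name: str) -> float:
    """Raw vSAN capacity consumed by a VMDK of the given size."""
    return vmdk_gb * POLICIES[policy_name].capacity_multiplier


def usable_policies(host_count: int) -> list[str]:
    """Policies a cluster with this many hosts can satisfy."""
    return [p.name for p in POLICIES.values() if p.min_hosts <= host_count]


if __name__ == "__main__":
    for name in POLICIES:
        raw = raw_capacity_gb(1000, name)
        print(f"{name}: 1000 GB of VMDK consumes about {raw:.0f} GB raw")
```

Under these assumptions, moving 1,000 GB of VMDKs from RAID-1 to RAID-5 frees roughly a third of the raw capacity they consume.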
When a virtual machine is deployed to a vSAN cluster (in this case, the AVS cluster), it is assigned the vSAN Default Storage Policy by default.
That vSAN Default Storage Policy is set to RAID-1 (Mirroring), FTT=1, and thick provisioning. Unless the default storage policy is adjusted or a new policy is applied to the virtual machines (existing or new), the cluster will continue to grow with this configuration. Not ideal.
If you plan on expanding to four or more hosts soon after initial deployment, deploy four straight away. As virtual machines are deployed to the AVS cluster, change the disk’s storage policy in the VM settings to either RAID-5 FTT-1 or RAID-6 FTT-2 (keep in mind that RAID-6 needs at least six hosts). The VMs’ disk settings won’t have to change later.
Alternatively, if you are unsure whether the cluster will grow to four or more hosts soon, or ever, deploy using the default policy. If and when the cluster does grow to four or more hosts, start setting new VMs to RAID-5 FTT-1 or RAID-6 FTT-2. Remember, you will also need to go back and modify the storage policies of the existing virtual machine disks, as they will still be RAID-1 FTT-1 because they were deployed when the cluster had only three nodes.
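The "go back and modify" step is easy to forget, so it helps to list the stragglers explicitly. The sketch below is plain Python over a hypothetical inventory (a dict of VM name to policy name that you would export from vCenter yourself); it does not call any VMware API. The function name `vms_to_repolicy` and the minimum host counts are assumptions for illustration.

```python
# Assumed minimum host counts per policy (vSAN original storage architecture).
MIN_HOSTS = {"RAID-1 FTT-1": 3, "RAID-5 FTT-1": 4, "RAID-6 FTT-2": 6}


def vms_to_repolicy(vm_policies: dict[str, str], host_count: int, target: str) -> list[str]:
    """Return the VMs still on RAID-1 FTT-1 that could move to `target`.

    Raises ValueError when the cluster is too small for the target policy.
    """
    if host_count < MIN_HOSTS[target]:
        raise ValueError(
            f"{target} needs {MIN_HOSTS[target]} hosts, cluster has {host_count}"
        )
    return sorted(vm for vm, policy in vm_policies.items() if policy == "RAID-1 FTT-1")


if __name__ == "__main__":
    # Hypothetical inventory: two VMs from the three-node days, one newer VM.
    inventory = {
        "app01": "RAID-1 FTT-1",
        "db01": "RAID-1 FTT-1",
        "web07": "RAID-5 FTT-1",
    }
    print(vms_to_repolicy(inventory, host_count=4, target="RAID-5 FTT-1"))
```

Raising an error for an undersized cluster is a deliberate choice here: it is better to find out before assigning a policy the cluster cannot satisfy.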
Modifying the storage policy of an existing virtual machine disk does cause some disk churn while vSAN rebuilds the data in its new layout. My recommendation would be to change the storage policies in small groups so there isn’t any impact on performance. If you want to do it the easy way, then check out my colleague Chris Nakagaki’s script to change the storage policies of existing VMs.
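If you would rather roll your own, the "smaller groups" idea can be sketched in a few lines of Python. This is not Chris's script, just a minimal illustration: `apply_policy` and `wait_for_resync` are hypothetical callbacks you would supply (for example, wrappers around whatever tooling you use to assign a policy and to check that resync has finished), and the batch size of five is an arbitrary starting point.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def apply_in_batches(
    vm_names: Iterable[str],
    apply_policy: Callable[[str], None],   # hypothetical: assigns the new policy to one VM
    wait_for_resync: Callable[[], None],   # hypothetical: blocks until vSAN resync settles
    batch_size: int = 5,
) -> int:
    """Change policies one small group at a time; return the number of batches."""
    count = 0
    for batch in batched(vm_names, batch_size):
        for vm in batch:
            apply_policy(vm)
        wait_for_resync()  # let the churn from this group finish before the next
        count += 1
    return count


if __name__ == "__main__":
    # Dry run with print-only callbacks.
    apply_in_batches(
        ["app01", "app02", "db01"],
        apply_policy=lambda vm: print(f"would re-policy {vm}"),
        wait_for_resync=lambda: print("would wait for resync"),
        batch_size=2,
    )
```

Waiting between groups, rather than firing every change at once, keeps the resync traffic bounded to one batch at a time.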