Why You Should Convert Azure Clusters To Managed Disks
You may have heard about the recent storage outage that impacted some instances in the US East region back on March 16th. A root cause analysis of the outage is posted here. March 16th US East Storage Outage
Customer Impact
A subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing their storage account in a single Storage scale unit.
You might be asking, “What is a single Storage scale unit”. Well, you can think of it as a single storage cluster, or single SAN, or however you want to think about it. I don’t think Azure publishes their exact infrastructure. Although you can probably assume that behind the scenes they are using Scale Out File Servers for backend storage.
Survive The Outage With Minimal Downtime
So the question is, how could I have survived this outage with minimal downtime? If you read further down that root cause analysis you come across this little nugget.
Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident.
Hence, it is time to Convert Azure Clusters To Managed Disks
What’s Managed Disks?
On February 8th Corey Sanders announced the GA of Managed Disks.
Managed Disks would have helped in this outage. Because by leveraging an Availability Set combined with Managed Disks, each of the instances in your Availability Set are connected to a different “Storage scale unit”. So in this particular case, only one of the cluster nodes would have failed, leaving the remaining nodes to take over the workload.
Prior to Managed Disks being available (anything deployed before 2/8/2016), there was no way to ensure that the storage attached to your servers resided on different Storage scale units. Sure, you could use different storage accounts for each instances. But in reality that did not guarantee that those Storage Accounts provisioned storage on different Storage scale units. More reasons to Convert Azure Clusters To Managed Disks.
So while an Availability Set ensured that your instances reside in different Fault Domains and Update Domains to ensure the availability of the instance itself, the additional storage attached to each instance really represented a single point of failure. Although the storage itself is highly resilient, with three copies of your data and geo-redundant options available, in this case with a power failure the entire Storage scale unit went down along with all the servers attached to it.
So long story short… Convert Azure Clusters To Managed Disks as soon as possible in order to help minimize downtime
And if you really want to minimize downtime you should consider Hybrid Cloud Deployments that span cloud providers or on-prem to cloud!
Reproduced with permission from Clusteringformeremortals.com