vSAN Fault Domains

I have recently been working with a customer in EMEA running HP Synergy and VCF. In fact I have been working with this customer for a long time now.

The customer has been looking at resilience, which led me to look at vSAN fault domains for them.

First of all, let's clear up some confusion I have seen about what vSAN fault domains are:

What are vSAN Fault domains?

The vSAN fault domain feature is key to vSAN resilience. It tells vSAN how its data would be affected if the physical hardware platform hosting the vSAN cluster encounters a failure.

This allows vSAN to place the components of a single object across different fault domains, so that data is protected against the failures you have identified.

A great example, and one that relates to this post, is a blade enclosure. You may want vSAN to protect against a blade enclosure failure and place the components of a single object across multiple enclosures, so that if an enclosure fails your data is still protected.

When to use vSAN Fault domains?

This is a key question you should ask, and we have identified an instance in the above section. Adding anything like this to a solution introduces management overhead and arguably complexity as well.

“What are the failure scenarios you are protecting against?” This is what I ask customers. It is important because not all customers have the same availability requirements, and many customers have different capabilities in their datacentres. A good example is that some customers have never experienced a rack failure in their DC and so don’t protect against it. I often debate whether this is good practice or not!

If you can identify a failure scenario in which vSAN could have placed all the components of an object inside the part of the platform that fails, then you need to tell vSAN not to do this. This is where fault domains come into play.

The diagram below shows a vSAN cluster with no vSAN fault domains configured by the customer, so vSAN applies the default of each host being its own fault domain. In the example there are 2 racks with 3 Synergy frames in each. The vSAN cluster has 9 hosts distributed across all 6 frames, and one of the frames holds 4 hosts.

With RAID1 FTT1 this could result in all the components of an object being located on the 4 hosts in the single frame. If this frame fails the customer could suffer data loss. This is where we should use vSAN fault domains to let vSAN know it should not place all the components of a single object on these hosts.

9 hosts randomly placed
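
To make the risk concrete, here is a minimal Python sketch. The host names, frame labels and the `survives_frame_failure` helper are all hypothetical and invented for illustration. It models a RAID1 FTT1 object as three components (two replicas and a witness) and assumes, as a simplification, that the object stays available while more than half of its components remain.

```python
# Hypothetical layout: frame name -> hosts in that frame (9 hosts, 6 frames).
FRAMES = {
    "frame1": ["esx01", "esx02", "esx03", "esx04"],
    "frame2": ["esx05"],
    "frame3": ["esx06"],
    "frame4": ["esx07"],
    "frame5": ["esx08"],
    "frame6": ["esx09"],
}


def frame_of(host):
    """Return the frame that contains the given host."""
    for frame, hosts in FRAMES.items():
        if host in hosts:
            return frame
    raise KeyError(host)


def survives_frame_failure(component_hosts):
    """True if losing any single frame still leaves a majority of components.

    Simplified model (an assumption): every component carries one vote and
    the object stays available while more than half of the votes remain.
    """
    total = len(component_hosts)
    for frame in FRAMES:
        remaining = sum(1 for h in component_hosts if frame_of(h) != frame)
        if remaining * 2 <= total:
            return False
    return True


# Host-level fault domains only guarantee three different hosts,
# so this placement is allowed even though all three hosts share frame1.
print(survives_frame_failure(["esx01", "esx02", "esx03"]))  # False

# The same object spread across three different frames.
print(survives_frame_failure(["esx01", "esx05", "esx06"]))  # True
```

The first placement is the one the default configuration cannot rule out. Defining a fault domain per frame is what tells vSAN to avoid it.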

Impact of using vSAN Fault domains!

When using vSAN fault domains there could be an impact on your vSAN policy. Carrying on with the above example of 9 hosts, and following the points about identifying the failure scenarios you want to protect against, this customer has decided to protect against frame failures. They are happy that rack failure is not a scenario they want to protect against.

As an Architect I of course make them aware of the possible risks and we record and agree etc.

If we take the vSAN cluster with 9 hosts in the above diagram, we are able to use any of the FTT policies. But when we move these hosts into vSAN fault domains we lose some of the FTT policies we can use. In the example below we have moved the hosts to a single rack and we are placing 3 hosts in each vSAN fault domain.

9 hosts three vSAN Fault domains

Why have we placed an equal number of hosts in each vSAN fault domain? This ensures we can scale: each vSAN fault domain has the same capacity, so we can distribute and host the components as needed. With a different number of hosts in each vSAN fault domain, one fault domain could run out of capacity, leaving us unable to protect components and meet policy.

The other point here is that we may not be able to use the FTT values used when vSAN Fault domains were not in place. But we can influence this with the placement of hosts across vSAN fault domains.
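
As a rough illustration of how the number of fault domains limits the policies on offer, the sketch below uses the commonly documented minimums for vSAN's original storage architecture (RAID1 needs 2 x FTT + 1 fault domains, RAID5 needs 4, RAID6 needs 6). Treat these figures as an assumption and check them against your vSAN version. The `usable_policies` helper is hypothetical.

```python
# Minimum number of fault domains per policy.
# Assumed figures: RAID1 needs 2 * FTT + 1, RAID5 needs 4, RAID6 needs 6.
MIN_FAULT_DOMAINS = {
    "RAID1 FTT1": 3,
    "RAID5 FTT1": 4,
    "RAID1 FTT2": 5,
    "RAID6 FTT2": 6,
    "RAID1 FTT3": 7,
}


def usable_policies(fault_domains):
    """Return the policies whose minimum fault domain count is satisfied."""
    return [
        policy
        for policy, minimum in MIN_FAULT_DOMAINS.items()
        if fault_domains >= minimum
    ]


# 3 fault domains of 3 hosts, 6 frame-level fault domains,
# and the default of 9 single-host fault domains.
for count in (3, 6, 9):
    print(count, usable_policies(count))
```

With 3 fault domains only RAID1 FTT1 is left, with 6 everything except FTT3 is available, and with 9 every policy in the table can be used.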

Hardware impact

Keeping with the 9 host example we can see that we need to make a number of changes to make this work.

  • Identify the failure scenarios we want to protect against
  • Move hosts so that each vSAN fault domain holds the same number of hosts, aligned with the failure scenarios
  • We can only use RAID1 FTT1

However, we need to use RAID5 FTT1 for capacity reasons. In this configuration we cannot do this, because RAID5 needs at least four fault domains and we only have three. We are left with two options to meet the RAID5 capacity requirement.

We can adjust the vSAN fault domains to a single host in each of 9 frames, which is essentially the default of a single host per vSAN fault domain. This, however, would require 9 Synergy frames, and this may not be a possibility for customers.

The other option may be to separate the hosts across 6 Synergy frames. Below is the layout of this configuration.

9 hosts 6 fault domains

In this example we can see there are 3 vSAN fault domains with 2 hosts and 3 vSAN fault domains with a single host. Based on the details earlier, this is not good practice as we have unbalanced capacity in the vSAN fault domains.

This option does, however, allow us to use all the FTT policies with the exception of FTT3, and this customer has no requirement for that policy. But due to the unbalanced vSAN fault domain capacity, this is not an option I could recommend to the customer.

The recommendation for this customer, based on their requirements, is to scale this cluster to a 12 node cluster. This would allow the cluster to have a balanced number of hosts in each vSAN fault domain for capacity, and allow the customer to have 6 vSAN fault domains to protect against Synergy frame failure. This would also allow each vSAN fault domain to sustain a host failure and stay online.

12 hosts 6 fault domains
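
The arithmetic behind the 12 node recommendation can be sketched in a few lines of Python. Both helper functions are hypothetical and only illustrate the balancing logic described above; they say nothing about capacity per host.

```python
def hosts_per_fault_domain(hosts, fault_domains):
    """Spread hosts as evenly as possible and return the per-domain counts."""
    base, extra = divmod(hosts, fault_domains)
    return [base + 1] * extra + [base] * (fault_domains - extra)


def next_balanced_size(hosts, fault_domains):
    """Smallest host count >= hosts that divides evenly across the domains."""
    per_domain = -(-hosts // fault_domains)  # ceiling division
    return per_domain * fault_domains


print(hosts_per_fault_domain(9, 3))   # [3, 3, 3]
print(hosts_per_fault_domain(9, 6))   # [2, 2, 2, 1, 1, 1]
print(next_balanced_size(9, 6))       # 12
print(hosts_per_fault_domain(12, 6))  # [2, 2, 2, 2, 2, 2]
```

Nine hosts only balance across three fault domains. Six frame-level fault domains need the next multiple of six, which is twelve hosts at two per frame.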

Obviously this has an impact on a number of elements outside of the vSAN fault domain discussion. It is an increase in hardware for this vSAN cluster and as a result has an impact on things like licensing, capacity planning, physical DC resources etc. This was all noted and is being worked through with the customer. For the purpose of vSAN fault domains and the requirements the customer presented to us, this solution met all the requirements.

Closing

One thing that I think stands out here, and at a large number of the customers I currently visit, is that this is often not designed at the start of the deployment. Often that is for good reasons: a vSAN cluster doesn’t have enough hosts, or is contained in a single rack or blade chassis etc.

Fitting fault domains into a fully running production environment is very disruptive. It will result in large amounts of data moving around the vSAN cluster, which will have an impact on the network as data is copied, as well as on the performance of VMs running in the vSAN cluster.

There is a point in a vSAN environment's lifecycle when this becomes important. I have yet to quantify when this is, but I am hoping that some of my ongoing work with my customers will let me identify when this point may be, to at least give some guidance.

These examples are real life, but you still need to do the design work for your own environment, identify failure scenarios, and identify what you want to protect against.
