vSAN write IO flow

I have been working with a customer on a detailed vSAN planning exercise recently and I thought I would share some thoughts in a post. I will break this topic down into a few posts as it is quite lengthy and breaking it down will let me give the right level of detail to each point.

What I want to achieve in these posts is awareness of where vSAN performance is gained, and where vSAN performance is impacted. To do this I am going to look at the conceptual flow of a write in vSAN.

Below is a conceptual diagram of the topics I am going to cover in the series of posts.

Conceptual vSAN host

Overview

Before I start to cover the elements in the conceptual layout above, I wanted to set the scene of how and where I came across this data.

The customer I have been working with is a very large vSAN customer. The customer has a requirement to bring some heavy IO and large capacity workloads into a new vSAN cluster. The customer wants to understand whether they have enough existing capacity or whether they need to scale. We have also recommended that the customer reviews the services they are using in the vSAN cluster, such as deduplication and compression.

It’s Not Just Capacity

One thing that has been clear with many of the customers I am working with is that anything other than storage capacity is often overlooked when designing a vSAN cluster. This often results in performance challenges in vSAN clusters.

The areas to focus on in this post are:

  • Throughput of workloads: Understand what the throughput requirements for the workloads are. This type of data can be collected from vROps or other monitoring tools. If the workloads coming into the vSAN cluster are physical servers, the stats can be captured with an appropriate tool.
  • IO: This is another key metric that needs to be collected. The same sort of toolset you use to collect the throughput can be used to collect this data.

The two metrics above should be captured as average (AVG) and maximum (MAX) values over as long a period as possible for the source workloads. Added up across all workloads, this gives a cluster-wide requirement for each metric.
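To make the AVG and MAX capture concrete, here is a minimal Python sketch. The workload names, the sample readings and the `summarise` helper are all made up for illustration, and in practice the readings would come from an export out of your monitoring tool. Summing the per-workload maximums assumes every workload peaks at the same time, which is a deliberately conservative assumption.

```python
from statistics import mean

# Hypothetical readings per workload: (IOPS, throughput in MB/s).
# Real data would be exported from a monitoring tool over a long period.
samples = {
    "app-01": [(1200, 45.0), (1500, 60.0), (900, 30.0)],
    "db-01": [(4000, 210.0), (6500, 380.0), (3000, 190.0)],
}

def summarise(readings):
    """Return the average and maximum IOPS and throughput for one workload."""
    iops = [r[0] for r in readings]
    tput = [r[1] for r in readings]
    return {
        "iops_avg": mean(iops),
        "iops_max": max(iops),
        "tput_avg": mean(tput),
        "tput_max": max(tput),
    }

per_workload = {name: summarise(r) for name, r in samples.items()}

# Cluster-wide requirement: add each metric up across all workloads.
# Summing the maximums assumes all workloads peak together (worst case).
metrics = ("iops_avg", "iops_max", "tput_avg", "tput_max")
cluster = {m: sum(w[m] for w in per_workload.values()) for m in metrics}

for metric, value in cluster.items():
    print(f"{metric}: {value}")
```

The longer the collection period, the more likely it is that the maximum values include events such as month-end processing or backups.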

Sanity Check The Values

The colleague I was working with created some fantastic formulas in Excel to format the data into some more digestible values. Simply taking the raw values and calculating with them as they are will let a handful of values skew the results.

As an example, if 2000 workloads have values within 5% of each other, but 3 additional workloads have values 100x higher, those 3 workloads will skew the averages. These types of workloads often need to be examined to see if their values are valid, and they may need to be removed from the exercise if they push the cluster requirement too high and invalidate a business use case.

What I have tended to see is these workloads are removed from these types of exercises and are then grouped together in a dedicated vSAN cluster that is smaller in nodes but higher in performance. This is often a cost-effective way to host these workloads.
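The skew is easy to show with a few lines of Python. This is only a sketch of the idea and not the Excel formulas we used. The workload names and values are invented, the 2000 typical workloads are given identical values to keep the arithmetic simple, and the "10x the median" threshold in the hypothetical `split_outliers` helper is an arbitrary assumption you would tune for your own data.

```python
from statistics import mean, median

# Hypothetical data: 2000 typical workloads and 3 that are 100x higher.
workloads = {f"vm-{i:04d}": 500 for i in range(2000)}
workloads.update({"big-01": 50_000, "big-02": 50_000, "big-03": 50_000})

def split_outliers(values, factor=10):
    """Separate workloads whose value exceeds `factor` times the median.

    The factor is an illustrative assumption, not a vSAN rule.
    """
    mid = median(values.values())
    outliers = {n: v for n, v in values.items() if v > factor * mid}
    typical = {n: v for n, v in values.items() if n not in outliers}
    return typical, outliers

typical, outliers = split_outliers(workloads)

print("Average with outliers:   ", round(mean(workloads.values()), 1))
print("Average without outliers:", round(mean(typical.values()), 1))
print("Workloads to review:     ", sorted(outliers))
```

Three workloads out of 2003 are enough to lift the average by around 15% in this example. The flagged workloads are the ones to validate and, if they are genuine, to consider for a dedicated cluster.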

The data set

We now have data on IO, throughput and capacity for each workload that we want to migrate. The question now is how we format that data so that we can use it in vSAN capacity planning. This will be customer specific, as customers will have business cases built on different values and/or run their estate in different ways. One example is a service provider model, where the infrastructure provides a set level of performance.

For this example, I will use some generic values. In the capture below you can see the throughput values that were calculated from the data.

Example Throughput values

Also see the IOPS values we calculated below:

Example IOPS Values

Now in these examples, we are simply looking at the peak values for both the IO and the throughput as well as looking at an average value for each of these. The average values can then be broken down further into the below.

By breaking the values down into different percentiles, we can then evaluate how much of the workload we can accommodate.

This is important as we may find that we have workloads that skew the values and, as pointed out in the previous section, we may want to check whether those workloads' values are valid. We may also find that sizing a vSAN cluster for 100% of these workloads increases cost massively, whereas sizing the cluster for 95% of the workloads reduces the cost significantly. If this is the case, then a business decision needs to be taken as to whether the cluster is to accommodate the remaining workloads or whether an alternative vSAN cluster is designed in a cost-effective way to host them.
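As a rough illustration of the percentile breakdown, the Python sketch below uses the nearest-rank method to find the cut-off value at a given percentile and then adds up the demand of the workloads at or below it. The IOPS figures and the `percentile` and `sizing` helpers are hypothetical, and other percentile definitions (such as the interpolated one Excel uses) will give slightly different cut-offs.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def sizing(values, pct):
    """Return the cut-off value and the total demand of the workloads
    at or below that cut-off."""
    cutoff = percentile(values, pct)
    return cutoff, sum(v for v in values if v <= cutoff)

# Hypothetical average IOPS for 20 workloads, one of them very heavy.
iops = [200, 250, 300, 300, 350, 400, 400, 450, 500, 550,
        600, 600, 650, 700, 750, 800, 900, 1000, 1200, 20000]

for pct in (50, 90, 95, 100):
    cutoff, total = sizing(iops, pct)
    print(f"P{pct}: workloads up to {cutoff} IOPS, total {total} IOPS")
```

With these made-up numbers, covering 95% of the workloads needs 10,900 IOPS, while covering 100% needs 30,900 IOPS because of the single heavy workload. That gap is the kind of difference that drives the business decision.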

Capacity

Capacity is a little simpler to calculate and I am sure the majority of vSAN customers know how to do this. This is also not the focus of this post. I have included this short section to highlight that this is also a consideration, and although I am not going to cover how to calculate this, it is still a factor you need to consider and include in any planning.

What Next

After collecting and calculating the above requirements for the workloads, we need to look at how we size a cluster to accommodate the requirements. It is important to remember we are going to be calculating the requirements for an entire cluster, not just a single host.

How do we calculate this?

Stay tuned for my next post…
