Monday, December 14, 2015

Troubleshooting Storage Performance in vSphere

I frequently present at various VMware User Group (VMUG) meetings, VMworld, and partner conferences. If you have ever attended one of my talks, you will know it is like trying to drink from a fire hose; it is hard to cover everything in just a 45-minute session. Therefore, I will take the time here to write a few blog posts that go over the concepts discussed in these talks in more detail (or at least more slowly). One of the most popular, yet very fast-paced, talks I present is Troubleshooting Storage Performance in vSphere. I’ll slow things down a bit and discuss each topic here. This might be just a review for some of you, but hopefully as we get into more details there will be some new nuggets of VMware-specific information that can help even the more advanced storage folks.
Today’s post is just the basics.  What is bad storage performance and where do I measure it?
Poor storage performance is generally the result of high I/O latency. vCenter or esxtop will report the various latencies at each level in the storage stack from the VM down to the storage hardware.  vCenter cannot provide information for the actual latency seen by the application since that includes the latency at the Guest OS and the application itself, and these items are not visible to vCenter. vCenter can report on the following storage stack I/O latencies in vSphere.
[Figure: Latency in the storage stack. Storage stack components in a vSphere environment]
• GAVG (Guest Average Latency): total latency as seen from vSphere.
• KAVG (Kernel Average Latency): time an I/O request spent waiting inside the vSphere storage stack.
• QAVG (Queue Average Latency): time spent waiting in a queue inside the vSphere storage stack.
• DAVG (Device Average Latency): latency coming from the physical hardware, HBA and storage device.


To provide some rough guidance, for most application workloads (typically 8 KB I/O size, 80% random, 80% read) we generally say anything greater than 20 to 30 ms of I/O latency may be a performance concern. Of course, as with all things performance related, some applications are more sensitive to I/O latency than others, so the 20-30 ms figure is rough guidance rather than a hard rule. So we expect that GAVG, the total latency as seen from vCenter, should be less than 20 to 30 ms. As seen in the picture, GAVG is made up of KAVG and DAVG. Ideally we would like all our I/O to get out onto the wire quickly and spend no significant amount of time sitting in the vSphere storage stack, so we would like to see KAVG very low. As a rough guideline, KAVG should usually be 0 ms, and anything greater than 2 ms may be an indicator of a performance issue.
So what are the rule-of-thumb indicators of bad storage performance?
• High Device Latency: Device Average Latency (DAVG) consistently greater than 20 to 30 ms may cause a performance problem for your typical application.
• High Kernel Latency: Kernel Average Latency (KAVG) should usually be 0 in an ideal environment, but anything greater than 2 ms may be a performance problem.
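The two rules of thumb above can be written as a small check. This is only a sketch: the class and function names are made up for illustration, the 20 ms cut-off is the lower end of the 20-30 ms guidance, and "consistently" is interpreted here as "every sample in the window", which is an assumption rather than a VMware definition.

```python
from dataclasses import dataclass

# Rule-of-thumb thresholds from the post, in milliseconds.
DAVG_WARN_MS = 20.0  # lower end of the 20-30 ms guidance
KAVG_WARN_MS = 2.0


@dataclass
class LatencySample:
    """One latency reading (hypothetical container, not an esxtop format)."""
    davg_ms: float  # device latency: HBA and storage hardware
    kavg_ms: float  # time spent inside the vSphere storage stack

    @property
    def gavg_ms(self) -> float:
        # GAVG is made up of KAVG and DAVG.
        return self.davg_ms + self.kavg_ms


def check_latency(samples):
    """Return rule-of-thumb warnings for a window of samples."""
    warnings = []
    # "Consistently high" is taken to mean: above threshold in every sample.
    if samples and all(s.davg_ms > DAVG_WARN_MS for s in samples):
        warnings.append("High device latency")
    # Any sample above 2 ms of kernel latency is flagged.
    if any(s.kavg_ms > KAVG_WARN_MS for s in samples):
        warnings.append("High kernel latency")
    return warnings


if __name__ == "__main__":
    window = [LatencySample(25.0, 0.5), LatencySample(31.0, 3.0)]
    for s in window:
        print(f"GAVG={s.gavg_ms} ms (DAVG={s.davg_ms}, KAVG={s.kavg_ms})")
    print(check_latency(window))
```

In practice you would feed this from whatever latency numbers you export from vCenter or esxtop; the thresholds should be adjusted for applications that are more or less sensitive to latency.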
So what can cause bad storage performance, and how do you address it? Well, that is for next time…
Poor storage performance is generally the result of high I/O latency, but what can cause poor storage performance and how do you address it? There are a lot of things that can cause poor storage performance…
– Undersized storage arrays/devices unable to provide the needed performance
– I/O stack queue congestion
– I/O bandwidth saturation, link/pipe saturation
– Host CPU saturation
– Guest-level driver and queuing interactions
– Incorrectly tuned applications
– Undersized storage arrays (Did I say that twice?)
As I mentioned in the previous post, the key storage performance indicators to look out for are (1) high device latency (DAVG consistently greater than 20 to 30 ms) and (2) high kernel latency (KAVG greater than 2 ms). Once you have identified that you have high latency, you can proceed to understanding why the latency is high and what is causing the poor storage performance. In this post, we will look at the top reason for high device latency.
The top reason for high device latency is simply not having enough storage hardware to meet your application’s needs (yes, I have said it a third time now); that is a surefire way to have storage performance issues. It may seem basic, but too often administrators size their storage only on the capacity they need to support their environment and not on the performance (IOPS, latency, throughput) they need. When sizing your environment, you really should consult your application and storage vendors’ best practices and sizing guidelines to understand what storage performance your application will need and what your storage hardware can deliver.
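To make the capacity-versus-performance point concrete, here is a minimal sketch that sizes the same workload both ways and takes the larger answer. The function name and every number in the example are hypothetical placeholders, not vendor figures, and the sketch ignores RAID overhead, caching and hot spares.

```python
import math


def drives_needed(capacity_gb, required_iops, drive_capacity_gb, drive_iops):
    """Size a drive count by capacity and by IOPS; the larger count wins."""
    by_capacity = math.ceil(capacity_gb / drive_capacity_gb)
    by_performance = math.ceil(required_iops / drive_iops)
    return {
        "by_capacity": by_capacity,
        "by_performance": by_performance,
        "drives": max(by_capacity, by_performance),
    }


if __name__ == "__main__":
    # Hypothetical workload: 10 TB of data that needs 5,000 IOPS,
    # on hypothetical 2 TB drives that each deliver 150 IOPS.
    print(drives_needed(10_000, 5_000, 2_000, 150))
```

With these made-up inputs, capacity alone suggests 5 drives while the IOPS requirement suggests 34, which is exactly the gap that capacity-only sizing misses.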
How you configure your storage hardware, the type of drives you use, the RAID configuration, the number of disk spindles in the array, and so on will all affect the maximum storage performance your hardware can deliver. Your storage vendor will be able to provide the most accurate model and advice for the particular storage product you own, but if you need some rough guidance you can use the guidance provided in the chart below.
[Chart: general IOPS and read/write throughput per spindle, by RAID configuration and drive type]

The slide shows the general IOPS and read and write throughput you can expect per spindle depending on the RAID configuration and/or drive type you have in your array. I’m also frequently asked what the typical I/O profile for a VM is. The guidance varies greatly depending on the applications running in your environment, but a “typical” I/O workload for a VM would roughly be 8 KB I/O size, 80% random, 80% read. Storage-intensive applications like databases, mail servers and media streaming have their own I/O profiles that may differ greatly from this “typical” profile.
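As a rough illustration of how the “typical” profile interacts with the RAID configuration, the sketch below converts front-end IOPS into the back-end IOPS the spindles must absorb, and into throughput. The write-penalty table is the commonly quoted approximation (it is not taken from this post or its chart), and the function names and the 1,000 IOPS input are assumptions for the example.

```python
# Commonly quoted back-end I/Os per front-end write (an approximation;
# real arrays with caching and full-stripe writes can do better).
RAID_WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}


def backend_iops(frontend_iops, read_pct, raid_level):
    """Back-end IOPS the spindles see for a given front-end load."""
    reads = frontend_iops * read_pct / 100
    writes = frontend_iops * (100 - read_pct) / 100
    return reads + writes * RAID_WRITE_PENALTY[raid_level]


def throughput_mb_s(iops, io_size_kb):
    """Throughput in MB/s implied by an IOPS rate and an I/O size."""
    return iops * io_size_kb / 1024


if __name__ == "__main__":
    # "Typical" VM profile from the post: 8 KB I/O size, 80% read.
    for level in RAID_WRITE_PENALTY:
        print(level, backend_iops(1000, 80, level))
    print(throughput_mb_s(1000, 8), "MB/s")
```

The point of the sketch is only that the same 1,000 front-end IOPS costs the array noticeably more on a parity RAID level than on a mirrored one, and that small-block workloads are usually limited by IOPS long before they are limited by throughput.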
One good way to make sure your storage can handle the demands of your datacenter is to benchmark it. There are several free and open source tools, like IOmeter, that can be used to stress test and benchmark your storage. If you haven’t already looked at the I/O Analyzer tool delivered as a VMware Fling, you might want to take a peek at it. I/O Analyzer is a virtual appliance that provides a simple and standardized approach to storage performance analysis in VMware vSphere virtualized environments ( http://labs.vmware.com/flings/io-analyzer ).
Also, when sizing your storage, make sure your storage workloads are balanced “appropriately” across the paths in the environment and across the controllers and storage processors in the array, and spread across the appropriate number of spindles in the array. I’ll talk a bit more about “appropriately” balanced later on in this series, as it varies depending on your storage array and your particular goals and needs.
Simply sizing your storage correctly for the expected workload, in terms of both capacity and performance capabilities, will go very far toward making sure you don’t run into storage performance problems and that your device latency (DAVG) stays below that 20-30 ms guidance. There are other things to consider, which we will see in future posts, but sizing your storage is key.

Troubleshooting Storage Performance in vSphere (Part 3) – SSD Performance

While presenting the storage performance talks, I frequently get asked about Solid State Device (SSD) performance in a virtualized environment. Obviously, SSDs or EFDs (Enterprise Flash Disks) are great for performance, especially if you have storage-intensive workloads. As seen in the previous post in this series, SSDs can provide significantly more IOPS and significantly lower latencies. But the two big questions when using SSDs in a virtualized environment are “how much of a gain might I expect” and “how much SSD storage do I need to achieve that gain”.
There are two studies that do a great job at painting a good picture for the performance of SSDs in a virtualized environment, and answering those two questions. Both studies use VMware’s VMmark benchmark, a virtualization platform benchmark for x86-based computers used mostly by our hardware partners to determine the performance of their hardware platforms when running in a virtualized environment.
The first study answers the question of how much of a gain I might be able to achieve from using SSDs in my environment. As with all performance data, your mileage may vary, but by using VMmark, which simulates a collection of different workloads typically seen in a vSphere environment, we can form a general idea of the impact of SSDs in a “typical” virtualized environment.
The results of the study showed that the average improvement in score for the SSD configuration was approximately 25% when compared to traditional rotating storage. SSDs also allowed for more consolidation: the traditional storage could not support the level of consolidation at the high end (while meeting the QoS required in the VMmark benchmark). The SSD configuration not only supported the higher consolidation of six VMmark workload tiles while meeting the QoS requirements, but it also improved the overall VMmark score slightly while supporting the heavier consolidation load.
The second study provides guidance on the question of how many SSDs I need. Again using the VMmark benchmark as the workload, our VMmark performance team studied the performance impact of SSDs using the auto-tiering capabilities of the storage array. Unlike the previous test, where ALL the traditional storage was replaced with SSDs, this study only had SSD capacity for approximately 8% of the storage footprint of the test. So only 8% of the storage required for the workload could fit in the SSDs, and the rest had to use the traditional rotating storage. The storage array, using its auto-tiering capability, intelligently detected which storage blocks were hot and promoted those hot blocks into the SSD storage.
The result was that with only 8% of the workload’s storage footprint able to fit in the faster SSD tier, the VMmark workload still achieved the 25%-plus improvement seen in the previous study, where all the storage was replaced with SSDs. A 90/10 rule is observed here: roughly 90% of a typical workload’s IOPS are generated from just 10% of that workload’s storage footprint.
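The 90/10 observation can be illustrated with a toy model of hot-block promotion: sort blocks by how often they are accessed, keep as many of the hottest ones as fit in the fast tier, and see what share of the I/O they cover. This is only a sketch of the idea, not how any particular array’s auto-tiering works; the function name and the access counts are invented for the example.

```python
def io_share_of_fast_tier(access_counts, fast_tier_pct):
    """Share of all I/O served by the hottest blocks that fit in the fast tier.

    access_counts: I/O count per storage block (hypothetical data).
    fast_tier_pct: fast-tier capacity as a whole-number percentage
                   of the total number of blocks.
    """
    total = sum(access_counts)
    if total == 0:
        return 0.0
    # Number of blocks the fast tier can hold.
    n_fast = len(access_counts) * fast_tier_pct // 100
    # Promote the most frequently accessed blocks first.
    hottest = sorted(access_counts, reverse=True)[:n_fast]
    return sum(hottest) / total


if __name__ == "__main__":
    # Invented skewed workload: 10 hot blocks and 90 cold blocks.
    counts = [81] * 10 + [1] * 90
    for pct in (8, 10, 50):
        share = io_share_of_fast_tier(counts, pct)
        print(f"{pct}% of capacity on SSD covers {share:.0%} of the I/O")
```

With a skewed access pattern like this one, adding fast-tier capacity beyond the hot set buys very little extra coverage, which is consistent with the study’s result that a small SSD tier captured most of the benefit.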
Again these studies just provide some guidance which greatly depends on your workloads, but the two studies help answer those two big questions of “how much better” and “how many do I need”.

