Wednesday, April 26, 2023

Troubleshooting ESXi Host SAN/Network Performance Issues: A Step-By-Step Guide with Sample Error Messages and Expected Values : "state in doubt"



Introduction:


Degraded performance in virtualized environments can have a significant impact on business operations. In this blog post, we will walk through a real-world scenario in which an organization experienced performance issues and successfully diagnosed and resolved them. The trouble began with degraded performance across all systems, which led to checking the SAN, finding nothing wrong there, and eventually examining the ESXi logs. This step-by-step guide covers how to identify and resolve network performance issues in ESXi environments, focusing on fiber connections and SFPs, and provides sample error messages and correct expected values.


Identifying the Issue:

In our scenario, the first clue was found in the vmkwarning.log on the ESXi hosts, where the following error message appeared: "state in doubt; requested fast path state update." This message suggested that there might be a communication issue between the ESXi hosts and the storage array.


Sample error message: "2023-04-26T09:32:15Z cpu3:2097242)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.5006016341e01555" state in doubt; requested fast path state update..."
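When this warning appears, it helps to confirm how often it occurs and which devices it affects. Here is a minimal sketch that tallies "state in doubt" warnings per NMP device; the log lines in it are illustrative samples modeled on the message above, not output captured from a real host:

```python
import re
from collections import Counter

def count_state_in_doubt(lines):
    """Tally 'state in doubt' warnings per NMP device in vmkwarning.log lines."""
    pattern = re.compile(r'NMP device "([^"]+)" state in doubt')
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Illustrative vmkwarning.log excerpts (sample data, not a real capture)
sample = [
    '2023-04-26T09:32:15Z cpu3:2097242)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: '
    'NMP device "naa.5006016341e01555" state in doubt; requested fast path state update...',
    '2023-04-26T09:33:02Z cpu1:2097242)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: '
    'NMP device "naa.5006016341e01555" state in doubt; requested fast path state update...',
]
print(count_state_in_doubt(sample))
```

If one naa device dominates the tally while the others are clean, the problem is more likely a single path or LUN; warnings spread evenly across all devices point at the fabric.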


Checking the Fiber Switch:

Upon investigating the fiber switch, frame drops were noticed on the ports corresponding to the ESXi hosts exhibiting the error message. Additionally, CRC errors were observed, indicating potential issues with the network connectivity or SFPs.


Sample error message: "Port 5: Rx CRC error counter: 150"
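Counters like this are easiest to reason about in bulk. The sketch below parses port statistics in the illustrative format shown above and flags ports with nonzero CRC counters; real switch CLIs (for example Brocade's porterrshow) print a different layout, so the parsing pattern here is an assumption:

```python
import re

def flag_crc_ports(port_stats, threshold=0):
    """Return {port: crc_count} for ports whose Rx CRC counter exceeds threshold.

    port_stats: lines like 'Port 5: Rx CRC error counter: 150'
    (an illustrative format; adapt the regex to your switch's output).
    """
    pattern = re.compile(r'Port (\d+): Rx CRC error counter: (\d+)')
    flagged = {}
    for line in port_stats:
        match = pattern.search(line)
        if match and int(match.group(2)) > threshold:
            flagged[int(match.group(1))] = int(match.group(2))
    return flagged

# Sample counters: port 5 shows the CRC errors from the error message above
stats = [
    "Port 4: Rx CRC error counter: 0",
    "Port 5: Rx CRC error counter: 150",
]
print(flag_crc_ports(stats))
```

A nonzero and, more importantly, a rising CRC counter on a port generally points at a failing optic, a dirty or damaged fiber, or a speed/wavelength mismatch on that link.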


Inspecting the SFPs on the Fiber Switch:

The SFPs on the fiber switch were inspected against the correct specifications: speed, wavelength, type, fiber mode, and distance rating. SFPs with out-of-spec values were replaced, which resolved the majority of the errors. However, a few connections still exhibited issues.


Correct expected values (example): Speed - 10Gbps, Wavelength - 850nm (multimode fiber), Type - SFP+, Fiber mode - MMF (multimode fiber), Distance rating - 300 meters
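Checking a stack of SFPs by hand is error-prone, so it can help to diff each optic's reported attributes against the expected values above. A small sketch; the attribute names and the mismatched sample optic are hypothetical:

```python
def check_sfp(sfp, expected):
    """Compare an SFP's reported attributes against expected values.

    Returns {attribute: (actual, expected)} for every mismatch.
    """
    return {k: (sfp.get(k), v) for k, v in expected.items() if sfp.get(k) != v}

# Expected values from the guide above
expected = {
    "speed": "10Gbps",
    "wavelength_nm": 850,
    "type": "SFP+",
    "fiber_mode": "MMF",
    "distance_m": 300,
}

# Hypothetical suspect optic: a 1310nm single-mode SFP mixed into an 850nm MMF link
suspect = {
    "speed": "10Gbps",
    "wavelength_nm": 1310,
    "type": "SFP+",
    "fiber_mode": "SMF",
    "distance_m": 10000,
}
print(check_sfp(suspect, expected))
```

Any nonempty result is a candidate for replacement; in the scenario above, swapping the out-of-spec optics cleared most of the CRC errors.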


Inspecting the SFPs on the Server/Host Side:

After addressing the SFPs on the fiber switch side, attention was turned to the server/host side. The SFPs on the server/host side were inspected and compared to the switch-side SFP specifications. Any discrepancies or damaged SFPs were replaced, and the fiber connectors were cleaned.


Correct expected values (example): Speed - 10Gbps, Wavelength - 850nm (multimode fiber), Type - SFP+, Fiber mode - MMF (multimode fiber), Distance rating - 300 meters


Monitoring the Connections:

Following the replacement of the SFPs on the server/host side, the connections were monitored for improvements in CRC errors, dropped frames, and overall performance. In our scenario, the remaining issues were resolved, and system performance returned to normal.


Sample improvement: "Port 5: Rx CRC error counter: 0"


Conclusion:


In this blog post, we have provided a step-by-step guide to diagnosing and resolving network performance issues in ESXi environments, complete with sample error messages and correct expected values. By systematically investigating the fiber switch, SFPs, and network connections, it is possible to identify and resolve problems that may be affecting system performance. This approach can help IT administrators maintain optimal performance in their virtualized environments and ensure smooth business operations.


#ESXiPerformance #VMwareTroubleshooting #FiberSwitch #SFPs #NetworkConnectivity #DataCenter #Virtualization #StorageArray #NetworkErrors #ITInfrastructure

Monday, April 24, 2023

Unleashing the Power of vCenter: Top 5 Third-Party Tools You Should Know About



Introduction:

VMware's vCenter Server is undoubtedly an impressive virtualization management platform, but did you know that there's a vibrant ecosystem of third-party tools that can make your vCenter experience even better? In this editorial, we'll explore five fantastic tools that have won the hearts of vCenter users by offering powerful features, deep insights, and invaluable enhancements to virtual infrastructure management.


Veeam Backup & Replication: The Guardian of Your Virtual World

When it comes to safeguarding your precious virtual data, Veeam Backup & Replication stands as a sentinel, armed with robust and efficient data protection capabilities. Seamlessly integrating with vCenter, Veeam offers users fast backup and recovery, instant VM recovery, and automated testing. With Veeam watching over your virtual kingdom, you can sleep soundly, knowing your data is secure and recoverable.


SolarWinds Virtualization Manager: The All-Seeing Oracle

For those seeking deeper understanding and mastery of their virtual environment, SolarWinds Virtualization Manager offers unparalleled visibility. This comprehensive tool comes equipped with powerful alerting and reporting features, capacity planning capabilities, and an intuitive dashboard. With SolarWinds by your side, you'll gain the wisdom and foresight to optimize your vCenter environment and avert potential performance pitfalls.


Turbonomic: The Master of Resource Equilibrium

In the ever-changing world of virtual resources, Turbonomic (formerly VMTurbo) stands as a beacon of balance, automatically adjusting resource allocation to ensure optimal performance and efficiency. Users extol its ability to prevent bottlenecks, reduce resource contention, and lower infrastructure costs. Turbonomic is like having a skilled maestro orchestrating your vCenter environment, fine-tuning resources in harmony with your applications' needs.


Runecast Analyzer: The Proactive Protector

For those who believe in anticipating and averting problems before they arise, Runecast Analyzer is the perfect ally. This proactive monitoring and analysis tool helps you identify and mitigate potential issues in your virtual infrastructure. Users appreciate its ability to detect configuration problems, security vulnerabilities, and compliance violations. With Runecast Analyzer in your corner, you'll be well-prepared to tackle the challenges lurking in your vCenter environment.


RVTools: The Virtualization Cartographer

Mapping the vast landscape of your vCenter environment can be a daunting task, but RVTools is here to help. This lightweight utility provides extensive information on virtual machines, hosts, datastores, and networks, making it easier than ever to navigate your virtual infrastructure. Users love its simplicity, ease of use, and ability to export data to various formats, such as Excel and CSV. With RVTools, you'll always have an up-to-date map of your virtual world at your fingertips.


vCenter Server is a powerful platform on its own, but these five third-party tools can take your virtualization management experience to new heights. Each offers unique benefits and capabilities, making them invaluable additions to any vCenter user's arsenal. By exploring and embracing these tools, you can unlock the full potential of your virtual infrastructure, optimize performance, and ensure the security and stability of your virtual environment.

Troubleshooting flow chart: Identifying Fibre Channel, iSCSI, and NFS storage issues on ESX/ESXi hosts

Identifying Fibre Channel, iSCSI, and NFS storage issues on ESX/ESXi hosts (1003659)

Symptoms

  • No targets from an array can be seen by:
    • All of the ESX/ESXi hosts
    • All of the ESX/ESXi hosts on a specific connection or LUN
    • One ESX/ESXi host
  • Targets on the storage array are visible, but one or more LUNs are not
  • LUN not visible
  • LUN cannot connect
  • Connectivity issues to the storage array
  • ESX/ESXi host initiators are not logging into the array
  • The share cannot be mounted by the ESX/ESXi host
  • The share is mounted, but nothing can be written to it
  • You see one or more of the following errors:
    • Unknown inaccessible
    • SCSI: 4506: Cannot find a path to device vmhbax:x:x in a good state
    • WARNING: LVM: 4844: vmhbaH:T:L:P detected as a snapshot device. Disallowing access to the LUN since resignaturing is turned off.
    • Date esx vmkernel: Time cpu3: 10340 SCSI: 5637: status SCSI LUN is in snapshot state, rstatus 0xc0de00 for vmhbax:x:x. residual R 999, CR 8-, ER3
    • Date esx vmkernel: Time cpu3: world ID SCSI 6624: Device vmhbax:x:x. is a deactivated snapshot

Purpose

This article helps you identify problems related to the storage subsystem of ESX/ESXi.

Resolution

Troubleshooting ESX host storage issues begins with identifying how far-reaching the problem is (its scope). In many cases, a detected problem may be misidentified until the scope has been ascertained.
 
To identify the scope of the problem:
  1. Verify whether the storage device cannot be seen by all of the ESX cluster or only by a subset of it. If so, continue with the troubleshooting steps for the appropriate storage technology (Fibre Channel, iSCSI, or NFS).
  2. Verify whether no more than a single ESX host cannot see the shared storage. If so, continue with the troubleshooting steps for the appropriate storage technology (Fibre Channel, iSCSI, or NFS).
  3. Verify that the LUN is presented and available. For more information, see Troubleshooting LUN connectivity issues (1003955).
  4. Verify whether the ESX host can see the datastore.

Additional Information

Troubleshooting flow chart: see the flow chart image in the original KB article (link below).

Tags

shared-storage  storage-connectivity  iscsi-connectivity  nfs-connectivity  fibre-channel-connectivity  lun-connectivity


http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003659

VCP5-DCV Certification

What is the process for registering for a VMware certification exam?
  • Step 1: Go to certification landing page such as “vmware.com/go/vcp” and click “Register.”
  • Step 2: Sign in to myLearn or create a myLearn account. (New account creation includes an authentication code mailed to your email address.)
  • Step 3: Complete the registration request by confirming that your First and Last Name are the same as on your government-issued ID.
  • Step 4: Create a new Pearson VUE (VMware testing) web account by providing First Name, Last Name, and Candidate ID as it was displayed during Step 3.
    For returning VMware test takers, go to pearsonvue.com/vmware and sign-in using your existing Pearson VUE (VMware testing) web account username and password.
  • Step 5: Proceed with the selection of the location, date and time of your exam, as well as the payment process as directed by Pearson VUE.

VMware Certification Exam Registration Details in English.


What resources are available to assist in certification exam preparation?
Exam preparation resources include:

swap to SSD

Swap to host cache aka swap to SSD?

Before we dive into it, let's spell out the actual name of the feature: "Swap to host cache". Remember that: swap to host cache!
I've seen multiple people mention this feature, and I saw William post a hack on how to fool vSphere (the feature is part of vSphere 5, to be clear) into thinking it has access to SSD disks when that might not be the case. One thing I noticed is that there seems to be a misunderstanding of what swap to host cache actually is and does, probably because some tend to call it swap to SSD. Yes, it is true that ultimately your VM would be swapping to SSD, but it is not just a swap file on SSD; better said, it is NOT a regular virtual machine swap file on SSD.
When I logged in to my environment, the first thing I noticed was that my SSD-backed datastore was not tagged as SSD. The first thing I wanted to do was tag it as SSD; as mentioned, William already described this in his article, and it is well documented in our own documentation as well, so I followed it. This is what I did to get it working:
  • Check the NAA ID in the vSphere UI
  • Opened up an SSH session to my ESXi host
  • Validate which SATP claimed the device:
    esxcli storage nmp device list
    In my case: VMW_SATP_ALUA_CX
  • Verify it is currently not recognized as SSD by typing the following command:
    esxcli storage core device list -d naa.60060160916128003edc4c4e4654e011
    should say: “Is SSD : False”
  • Set “Is SSD” to true:
    esxcli storage nmp satp rule add -s VMW_SATP_ALUA_CX --device naa.60060160916128003edc4c4e4654e011 --option=enable_ssd
  • I reloaded claim rules and ran them using the following commands:
    esxcli storage core claimrule load
    esxcli storage core claimrule run
  • Validate it is set to true:
    esxcli storage core device list -d naa.60060160916128003edc4c4e4654e011
  • Now the device should be listed as SSD
Next would be to enable the feature… When you go to your host and click on the “Configuration Tab” there should be a section called “Host Cache Configuration” on the left. When you’ve correctly tagged your SSD it should look like this:
Please note that I already had a VM running on the device, hence the reason it is showing some of the space as being in use; normally I would recommend using a drive dedicated to swap. The next step is enabling the feature, which you can do by opening the pop-up window (right-click your datastore and select "Properties"). This is what I did:
  • Tick “Allocate space for host cache”
  • Select “Custom size”
  • Set the size to 25GB
  • Click “OK”
Now there is no science to this value, as I just wanted to enable and test the feature. What happened when we enabled it? We allocated space on this LUN, so something must have been done with it. I opened up the datastore browser and noticed a new folder had been created on this particular VMFS volume:
Not only did it create a folder structure, but it also created 25 x 1GB .vswp files. Now before we go any further, please note that this is a per-host setting. Each host will need to have its own host cache assigned, so it probably makes more sense to use a local SSD drive instead of a SAN volume. Some of you might ask, what about resiliency? Well, if your host fails, the VMs will need to restart anyway, so that data is no longer relevant; in terms of disk resiliency you should definitely consider a RAID-1 configuration. Generally speaking, SAN volumes are much more expensive than local volumes, and using local volumes also removes the latency caused by the storage network. Compared to the latency of an SSD (less than 100 μs), network latency can be significant. So let's recap that in a nice design principle:
Basic design principle
Using "Swap to host cache" will severely reduce the performance impact of VMkernel swapping. It is recommended to use a local SSD drive to eliminate any network latency and to optimize for performance.
How does it work? Well, fairly straightforward, actually. When there is severe memory pressure and the hypervisor needs to swap memory pages to disk, it will swap to the .vswp files on the SSD drive instead. Each of these files (25, in my case) is shared amongst the VMs running on this host. Now you will probably wonder how you know whether the host is using this host cache or not; that can simply be validated by looking at the performance statistics within vCenter. It contains a couple of new metrics, of which "Swap in from host cache" and "Swap out to host cache" (and their "rate" variants) are the most important to monitor. (Yes, esxtop has metrics to monitor it as well, namely LLSWR/s and LLSWW/s.)
What if you want to resize your host cache and it is already in use? Simply said, the host cache is optimized to allow for this scenario. If the host cache is completely filled, memory pages will need to be copied to the regular .vswp file. This could mean the process takes longer than expected, and of course it is not a recommended practice, as it will decrease performance for your VMs: these pages will more than likely need to be swapped in again at some point. Resizing, however, can be done on the fly; there is no need to vMotion away your VMs. Just adjust the slider and wait for the process to complete. If you decide to completely remove all host cache for whatever reason, all relevant data will be migrated to the regular .vswp.
What if the host cache is full? Normally it shouldn't even reach that state, but when you run out of space in the host cache, pages will be migrated from the host cache to your regular .vswp file on a first-in, first-out basis, which should be the right policy for most workloads. Now, the chances of having memory pressure to the extent that you fill up a local SSD are small, but it is good to realize what the impact is. If you are going down the path of local SSD drives with host cache enabled and will be overcommitting, it might be good to do the math and ensure that you have enough cache available to keep these pages in cache rather than on rotating media. I prefer to keep it simple, though, and would probably recommend matching the size of your host's memory. In the case of a host with 128GB of RAM, that would be a 128GB SSD. Yes, this might be overkill, but the price difference between 64GB and 128GB is probably negligible.
Basic design principle
Monitor swap usage. Although "Swap to host cache" will reduce the impact of VMkernel swapping, it will not eliminate it. Take your expected consolidation ratio into account, including your HA (N-X) strategy, and size accordingly. Or keep it simple and just use the same size as physical memory.
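To make the sizing guidance concrete, here is a rough back-of-the-envelope sketch. It assumes an N-X HA strategy in which surviving hosts absorb the failed hosts' VM memory; the overcommit ratio and cluster figures are illustrative, and matching physical RAM, as suggested above, remains the simple upper bound:

```python
def host_cache_size_gb(host_ram_gb, hosts, ha_failures_tolerated, overcommit_ratio):
    """Rough host-cache sizing: worst-case swappable VM memory on a surviving host.

    With N hosts tolerating X failures, the survivors absorb the failed hosts'
    load, so configured VM memory per host can approach:
        host_ram_gb * overcommit_ratio * N / (N - X)
    Only the excess over physical RAM can be swapped, and the simple rule from
    the text -- match physical RAM -- serves as a safe ceiling.
    """
    n, x = hosts, ha_failures_tolerated
    worst_case_vm_mem = host_ram_gb * overcommit_ratio * n / (n - x)
    swappable = max(0.0, worst_case_vm_mem - host_ram_gb)
    return min(swappable, host_ram_gb)  # cap at the "match physical RAM" rule

# Illustrative cluster: 8 hosts with 128GB each, N-1 HA, 1.5x memory overcommit
print(host_cache_size_gb(128, 8, 1, 1.5))
```

Run the numbers for your own cluster; if the result keeps landing at the cap, simply sizing the SSD to match physical memory is the safer and simpler choice.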
One interesting use case could be to place all regular swap files on very cheap shared storage (RAID-5 of SATA drives) or even local SATA storage using the "VM swapfile location" (aka host-local swap) feature, and then install a host cache on any host these VMs can be migrated to. This should give you the performance of an SSD while maintaining most of the cost savings of the cheap storage. Please note that the host cache is a per-host feature; hence, during a vMotion, all data from the cache will need to be transferred to the destination host. This will impact the time the vMotion takes. Unless your vMotions are time-critical, this should not be an issue, though. I have been told that VMware will publish a KB article with advice on how to buy the right SSDs for this feature.
Summarizing: "Swap to SSD" is what people have been calling this feature, and that is not what it is. It is a mechanism that caches memory pages to SSD and should be referred to as "Swap to host cache". Depending on how you do the math, all memory pages can be swapped to and from SSD. If there is insufficient space available, memory pages will move over to the regular .vswp file. Use local SSD drives to avoid any latency associated with your storage network and to minimize costs.

Is vCenter Right for You? Weighing the Pros and Cons



Introduction:

In the vast realm of virtualization, VMware's vCenter Server stands as a towering presence, offering a comprehensive suite of tools to manage and optimize your virtual infrastructure. But is it the right choice for your organization? In this editorial, we'll explore the pros and cons of vCenter to help you decide if it's the perfect fit for your virtualization needs.


Pros:


Centralized Management:

One of vCenter's most significant selling points is its ability to provide centralized management for your entire VMware vSphere environment. This single pane of glass approach streamlines administrative tasks and simplifies the overall management of your virtual infrastructure.


Advanced Features:

vCenter offers a treasure trove of advanced features, such as vMotion, Storage vMotion, Distributed Resource Scheduler (DRS), and High Availability (HA), which can elevate your virtual environment's efficiency, performance, and resilience.


Scalability:

vCenter is designed to grow with your organization, supporting thousands of virtual machines and ESXi hosts. This scalability ensures that your virtual infrastructure can evolve to meet changing demands and business needs.


Extensive Ecosystem:

VMware's vast ecosystem of partners, integrations, and third-party solutions means that vCenter can easily fit into your existing IT landscape. This extensive ecosystem allows you to leverage additional tools and services to further enhance and customize your virtual environment.


Robust Security:

vCenter includes robust security features like role-based access control (RBAC), which allows you to define granular permissions for users and groups. This level of control helps maintain the security and compliance of your virtual infrastructure.


Cons:


Cost:

One potential drawback of vCenter is its cost. Depending on your organization's size and requirements, the licensing fees and hardware investments associated with vCenter can be substantial. It's essential to weigh the benefits against the costs when considering vCenter as your virtualization management solution.


Complexity:

vCenter's myriad of features and capabilities can be both a blessing and a curse. While powerful, the complexity of vCenter may present a steep learning curve for IT teams that are new to virtualization or transitioning from a different platform.


Vendor Lock-In:

By choosing vCenter, you are committing to VMware's virtualization ecosystem. While this has its benefits, it can also limit your flexibility in terms of adopting alternative virtualization solutions or integrating with non-VMware platforms.


Resource Consumption:

vCenter can consume significant resources, particularly if deployed as a virtual appliance. This consumption may impact the performance of your virtual infrastructure, especially if hardware resources are limited or not properly allocated.


Conclusion:

vCenter Server is undeniably a powerful and feature-rich solution for managing your virtual infrastructure. However, it's crucial to consider both its advantages and drawbacks when determining if it's the right fit for your organization. By carefully weighing the pros and cons, you can make an informed decision that best aligns with your organization's unique virtualization needs and goals.

The Perilous Path to vCenter Nirvana: Top 10 Pitfalls and How to Avoid Them




The Perils of Poor Planning:

The road to vCenter nirvana begins with a solid plan, lest you end up in the wilderness of inefficient resource allocation and suboptimal performance. Invest time in mapping out your journey, considering factors like scalability, networking, and storage. Keep VMware's best practices close to your heart, and the path shall be clear.


The Hardware Hunger Games:

Skimping on hardware resources can lead to sluggish performance and instability in your vCenter kingdom. Ensure your hardware is up to the challenge, meeting minimum requirements and leaving room for growth. May the odds be ever in your favor!


Network Nightmares:

Misconfigured networks can transform your virtual dreams into a horrifying performance hellscape. Don your armor and wield your knowledge of VLANs, subnets, and firewall rules to vanquish these demons and create a network that's the envy of the virtual realm.


The Haunting of High Availability:

Disregard high availability and redundancy at your peril, for downtime and data loss are the specters that await the unprepared. Banish these phantoms by invoking the powerful spells of vCenter High Availability (vCHA) and vSphere High Availability (HA).


The Security Slip-up:

Neglecting security in your vCenter fortress is an open invitation to nefarious intruders. Strengthen your defenses with role-based access control (RBAC), timely software updates, and a steadfast adherence to security best practices.


Storage Shenanigans:

Storage misconfiguration is a mischievous imp that can wreak havoc on your vCenter domain. Master the art of Storage vMotion and embrace the magical powers of vSAN to optimize storage and banish bottlenecks to the abyss.


The Overcommitment Ogres:

Overcommitting resources like CPU and memory can summon the ogres of poor performance and instability. Keep a watchful eye on resource utilization and set appropriate boundaries to maintain harmony in your virtual kingdom.


The Mystery of Monitoring:

Without proper monitoring, your vCenter realm risks falling into chaos and disarray. Arm yourself with tools like vRealize Operations Manager and vSphere Update Manager (VUM) to maintain vigilance and restore order in times of trouble.


The Documentation Dilemma:

Forsaking documentation and training leaves your team stranded in the dark woods of inefficiency and confusion. Illuminate the path with up-to-date documentation and ensure your fellow adventurers are well-versed in the ways of vCenter.


The Forgotten Fountain of Knowledge:

Ignoring the wisdom of VMware's support and the wider community is like refusing to drink from a fountain of knowledge. Quench your thirst for success by tapping into official documentation, support, and community forums.


Conclusion:

The journey to vCenter nirvana may be fraught with challenges and pitfalls, but with careful planning, cunning strategies, and the support of your fellow adventurers, you can conquer the virtual landscape. So, strap on your armor, pick up your trusty VMware manual, and prepare to embark on the epic quest for a robust, secure, and efficient vCenter infrastructure.