Monday, May 12, 2014

VMKcore partitions on ESXi hosts with non-local Disks

When running ESXi from local storage, a VMKcore partition is created during install.
If your ESXi 5.0/5.1/5.5 host experiences a Purple Screen of Death (PSOD), it hopefully creates a diagnostic coredump. This coredump contains useful information for root cause analysis.

When a PSOD occurs, you can retrieve the dump information from a shell session using the esxcfg-dumppart command: esxcfg-dumppart --log <ESX dump file> or esxcfg-dumppart -L <ESX dump file>.

If there is no disk partition available for a coredump on your ESXi host, such as in Auto-Deploy or "USB/Memcard installs" where there are no local disks, you will get the following error message:

“No vmkcore disk partition is available and no network coredump server has been configured. Host core dumps cannot be saved.”

In such configurations, it is better to move the core dumps to a datastore. This has to be a VMFS volume, which rules out NFS. Since the vmkcore dump partition has to be available at boot time, software iSCSI is ruled out too. Only hardware iSCSI or FC LUNs are possible.

Setting VMKcore partition
The following steps are needed to configure the vmkcore partition. In my example I’m using a 10GB LUN provisioned by iSCSI.
Create the LUN
On my shared storage I created a 10GB iSCSI target and assigned it to my ESXi host. On the ESXi host, add the iSCSI target and do a rescan. Then add the LUN as you would normally add a new datastore, using the “Add Storage” option in the Storage menu on the Configuration tab. Choose to add a Disk/LUN and name it something like: vmkcore-esx01. After a rescan the LUN should be available in your storage view.
Change the partition type
Now the datastore needs to have its partition type changed. To do this you will have to log on to the ESXi host using Tech Support mode. After you are logged in, list all partitions using the fdisk -l command. You will now see a list of partitions in which you should search for your 10GB disk. In my case it looked like:
Disk /dev/disks/naa.5000144f33903730: 10.7 GB, 10737418240 bytes
255 heads, 63 sectors/track, 1305 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
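The geometry figures in this listing can be cross-checked with plain shell arithmetic, which is a quick way to confirm you are looking at the right 10GB disk. This is only a small sketch: the numbers are copied from the fdisk output above, the function names are my own, and it assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Values are copied from the fdisk listing; function names are illustrative only.

# Bytes per cylinder = heads * sectors per track * bytes per sector.
bytes_per_cylinder() {
    echo $(( $1 * $2 * $3 ))
}

# Whole cylinders that fit on a disk of the given size (integer division).
cylinders() {
    echo $(( $1 / $2 ))
}

unit=$(bytes_per_cylinder 255 63 512)
echo "bytes per cylinder: $unit"                      # 8225280
echo "cylinders: $(cylinders 10737418240 "$unit")"    # 1305
```

Both results match the "Units" and "cylinders" values that fdisk reports.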
Now to change the partition type, run fdisk /dev/disks/naa.5000144f33903730 (copy the device name from your fdisk list). In fdisk hit “t” to change the partition’s ID, then enter “fc” to change the partition type to VMKcore. Now hit “w” to write the partition table and exit fdisk.
Set and activate the partition
The last step is to tell ESXi to use the new vmkcore partition. First we double check for suitable vmkcore partitions:
esxcfg-dumppart -f
If the fdisk action went well, you should now see the /dev/disks/naa.5000144f33903730 partition again in the list. To set the partition use the following command:
esxcfg-dumppart -s naa.5000144f33903730:1
Now, after playing with this in my lab for over 2 hours I received the message: “Unable to set dump partition naa.5000144f33903730:1. Error Message was: Unsupported disk type: Software iSCSI LUNs are not supported”. So the last part below follows the docs and is untested in my lab.
Last step is now to activate the partition using the following command:
esxcfg-dumppart -a naa.5000144f33903730:1
Reading the dump file
After a PSOD has occurred, log in to the ESXi host using Tech Support mode. The first step is to list the active dump partition, then copy the dump to a different volume and extract the logs.
-          esxcfg-dumppart -l
-          esxcfg-dumppart --copy --devname /vmfs/devices/disks/naa.xxxxx:x --newonly --zdumpname /vmfs/volumes/nfs-StorCent03/esxdump/esxdump
-          cd /vmfs/volumes/nfs-StorCent03/esxdump
-          esxcfg-dumppart -L /vmfs/volumes/nfs-StorCent03/esxdump/esxdump
You will now find a vmkernel-log.1 file that you can use to examine why the PSOD happened.
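To avoid typos in the long device and volume paths, the steps above can be generated for a given dump partition and target file. The following is only a sketch: print_dump_steps is a hypothetical helper name of mine, it prints the commands rather than running them, and it assumes the extracted log is written to the current directory, which is why it changes to the folder that holds the dump file.

```shell
#!/bin/sh
# Sketch only: prints the commands, it does not run them.
# print_dump_steps is a hypothetical helper, not part of ESXi.
print_dump_steps() {
    dev="$1"    # dump partition, e.g. /vmfs/devices/disks/naa.xxxxx:x
    dump="$2"   # target file for the copied dump
    echo "esxcfg-dumppart -l"
    echo "esxcfg-dumppart --copy --devname $dev --newonly --zdumpname $dump"
    # Assumption: the log is extracted into the current directory.
    echo "cd $(dirname "$dump")"
    echo "esxcfg-dumppart -L $dump"
}

# Placeholder device name as in the steps above; replace it with your own.
print_dump_steps "/vmfs/devices/disks/naa.xxxxx:x" \
    "/vmfs/volumes/nfs-StorCent03/esxdump/esxdump"
```

Review the printed lines and paste them into the shell one at a time once the device name is correct.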


Other useful shell or ssh console commands related to this issue:

You can display the currently active diagnostic partition with the following command (via console session to your ESXi host):
        esxcli system coredump partition get
You will see output like this:
[screenshot: coredump1]
If you want your ESXi host to select and activate an accessible partition automatically, use the following command (you need a partition with at least 100 MB of free space):
         esxcli system coredump partition set --enable=true --smart
If you want to define a dedicated partition for the diagnostic coredump use these commands:
First list all accessible diagnostic partitions:
          esxcli system coredump partition list
You will see output like this:
[screenshot: coredump2]
Now specify a partition you want:
          esxcli system coredump partition set --partition="device_path_name"
In this example we configure mpx.vmhba32:C0:T0:L0:7 as a coredump partition:
          esxcli system coredump partition set --partition="mpx.vmhba32:C0:T0:L0:7"
And now we activate the specified partition using:
          esxcli system coredump partition set --enable true
To validate our configuration use:
          esxcli system coredump partition list
We should get something like:
[screenshot: coredump4]



The following is VMware's recommendation on diagnostic partitions:

A 100MB diagnostic partition for each host is recommended. If more than one ESX/ESXi host uses the same LUN as the diagnostic partition, that LUN must be zoned so that all the ESX/ESXi hosts can access it. Each host needs 100MB of space, so the size of the LUN determines how many servers can share it. Each ESX/ESXi host is mapped to a diagnostic slot. VMware recommends at least 16 slots (1600MB) of disk space if servers share a diagnostic partition. You can set up a SAN LUN with Fibre Channel or hardware iSCSI. SAN LUNs accessed through a software iSCSI initiator are not supported.

Caution: If two hosts that share a diagnostic partition fail and save core dumps to the same slot, the core dumps might be lost. To collect core dump data, reboot a host and extract log files immediately after the host fails. If another host fails before you collect the diagnostic data of the first host, the second host does not save the core dump.
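The sizing rule in this recommendation is simple arithmetic, so it is easy to check how many hosts a shared LUN can serve. A small sketch based on the 100MB-per-host figure quoted above; the function names are my own and the 10240MB value is just my 10GB lab LUN as an example:

```shell
#!/bin/sh
# Sizing sketch based on the 100MB-per-host figure from the recommendation.
SLOT_MB=100

# Number of hosts (slots) a shared diagnostic LUN of the given size in MB can hold.
slots_for_lun() {
    echo $(( $1 / $SLOT_MB ))
}

# LUN size in MB needed for the given number of hosts.
lun_mb_for_hosts() {
    echo $(( $1 * $SLOT_MB ))
}

echo "16 hosts need $(lun_mb_for_hosts 16) MB"               # 1600
echo "a 10240 MB LUN holds $(slots_for_lun 10240) slots"     # 102
```

So the 10GB LUN from my example is far larger than a single host needs, and would comfortably cover the recommended 16 slots.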

More information can be found in VMware's vSphere 5 documentation center here:
http://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.wssdk.pg.doc_50%2FPG_Ch8_Storage.10.10.html



