
Tuesday, 7 October 2014

PDL & APD in VMware

All Paths Down (APD) is an issue which has come up time and time again, and has impacted a number of our customers. Let's start with a brief description of what APD actually is, how it occurs, and what impact it has on the host. Then we'll get into how we have improved the behaviour in vSphere 5.0.
A brief overview of APD
APD is what occurs on an ESX host when a storage device is removed from the host in an uncontrolled manner (or the device fails), and the VMkernel core storage stack does not know how long the loss of device access will last. A typical way of getting into APD would be a Fibre Channel switch failure or (in the case of an iSCSI array) a network connectivity issue. Note that giving careful consideration to the redundancy of your multi-pathing solution, and following our best practices for switch and HBA redundancy, goes a long way towards avoiding this situation.
The APD condition could be transient, since the device or switch might come back. Or it could be permanent, in that the device might never come back. In the past, we kept the I/O queued indefinitely, and this resulted in I/Os to the device hanging (never acknowledged). This became particularly problematic when someone issued a rescan of the SAN from a host or cluster, which was typically the first thing customers tried when they found a device was missing. The rescan operation caused hostd to block waiting for a response from the devices (which never came). Because hostd was blocked waiting on these responses, it couldn't be used by other services, like the vpx agent (vpxa), which is responsible for communication between the host and vCenter. The end result was the host becoming disconnected from vCenter. And if the device is never coming back, well we're in a bit of a pickle! :-(
It should be noted that hostd could also grind to a halt even without a rescan of the SAN being initiated. The problem is that hostd has a limited number of worker threads. If enough of these threads get stuck waiting for I/O to a device that is not responding, hostd will eventually be unable to communicate with anything else, including healthy devices, because it doesn't have any free worker threads left to do any work.
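To make the worker-thread problem concrete, here is a toy Python sketch. It is not how hostd is implemented; the pool size, the function name simulate_worker_starvation and the two fake I/O functions are all made up for illustration. It only shows how a fixed-size pool stops serving healthy devices once every worker is stuck on a device that never answers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def simulate_worker_starvation(workers=2):
    """Toy model: a fixed pool of workers, all stuck on a dead device."""
    device_back = threading.Event()  # stays unset while the device is in APD

    def io_to_dead_device():
        device_back.wait()  # hangs like I/O that is never acknowledged
        return "dead device answered"

    def io_to_healthy_device():
        return "healthy device answered"

    pool = ThreadPoolExecutor(max_workers=workers)
    # Occupy every worker with I/O to the non-responding device.
    for _ in range(workers):
        pool.submit(io_to_dead_device)
    # This request targets a healthy device, but no worker is free to run it.
    healthy = pool.submit(io_to_healthy_device)
    try:
        healthy.result(timeout=0.5)
        starved = False
    except TimeoutError:
        starved = True
    # The device comes back: blocked workers are released and the queue drains.
    device_back.set()
    result = healthy.result(timeout=5)
    pool.shutdown()
    return starved, result


if __name__ == "__main__":
    print(simulate_worker_starvation())
```

The request to the healthy device only completes once the blocked workers are released, which is the same shape of failure as hostd losing the ability to service vpxa.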
 
APD Handling before vSphere 5.0
To alleviate some of the issues arising from APD, a number of advanced settings were added which could be tweaked by the admin to mitigate this behaviour. Basically this involved not blocking hostd when it was scanning the SAN, even when it came across a non-responding device (i.e. a device in APD state). This setting was automatically added in ESX 4.1 Update 1 and ESX 4.0 Update 3. Previous versions required customers to manually add the setting. This is discussed in greater detail in KB 1016626. 
 
APD Handling Enhancements in vSphere 5.0 – Introducing Permanent Device Loss (PDL)
In vSphere 5.0, we have made improvements to how we handle APD. First, what we've tried to do is differentiate between a device to which connectivity is permanently lost, and a device to which connectivity is transiently or temporarily lost. We now refer to the case where a device is never coming back as Permanent Device Loss (PDL).
  • APD is now considered a transient condition, i.e. the device may come back at some time in the future.
  • PDL is considered a permanent condition where the device is never coming back.
As mentioned earlier, I/O to devices which were APD would be queued indefinitely. With PDL devices (those devices which are never coming back), we will now fail the I/Os to those devices immediately. This means that we will not end up in a situation where processes such as hostd get blocked waiting on I/O to these devices, which also means that we don't end up in the situation where the host disconnects from vCenter.
This raises the question: how do we differentiate between devices which are APD and devices which are PDL?
The answer is via SCSI sense codes. SCSI devices can indicate PDL state with a number of sense codes returned on a SCSI command. One such sense code is Sense Key 5h / ASC=25h / ASCQ=0 (ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED). The sense code returned is a function of the device. The array is in the best position to determine if the requests are for a device that no longer exists, or for a device that just has an error/problem. In the case of APD, by contrast, we do not get any sense code back from the device at all.
When the last path to the device in PDL returns the appropriate sense code, internally the VMkernel changes the device state to one which indicates a PDL. It is important to note that ALL paths must be in PDL for the DEVICE to become PDL. Once a device is in this state, commands issued to that device will return with VMK_PERM_DEV_LOSS. In other words, we now fail the I/Os immediately rather than have I/Os hanging. This will mean that with PDL, hostd should never become blocked, and no hosts should disconnect from vCenter. Hurrah!
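The rule described above can be sketched in a few lines of Python. This is a simplified model, not VMkernel code: the function name classify_device, the representation of a path response as a (sense key, ASC, ASCQ) tuple or None, and the "UNDETERMINED" fallback are my own assumptions. The only PDL sense code included is the one quoted above; arrays may return others.

```python
from typing import Optional, Sequence, Tuple

SenseCode = Tuple[int, int, int]  # (sense key, ASC, ASCQ)

# The single PDL sense code named in the text: 5h / 25h / 0h
# (ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED).
PDL_SENSE_CODES = {(0x5, 0x25, 0x00)}


def classify_device(path_responses: Sequence[Optional[SenseCode]]) -> str:
    """Classify a device from the last response seen on each of its paths.

    None means the path returned nothing at all; a tuple is a sense code.
    """
    if not path_responses:
        raise ValueError("a device needs at least one path")
    # ALL paths must report a PDL sense code for the device to be PDL.
    if all(resp in PDL_SENSE_CODES for resp in path_responses):
        return "PDL"
    # No sense code on any path: we cannot tell how long the loss will last.
    if all(resp is None for resp in path_responses):
        return "APD"
    # Anything else is outside this simplified model.
    return "UNDETERMINED"


if __name__ == "__main__":
    pdl = (0x5, 0x25, 0x00)
    print(classify_device([pdl, pdl]))    # every path reports PDL
    print(classify_device([None, None]))  # every path is silent
    print(classify_device([pdl, None]))   # mixed, so not PDL
```

Note that a single path returning the PDL sense code is not enough in this model, which matches the point that the device only becomes PDL when the last path reports it.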
None of these distinctions between APD and PDL are directly visible to the Virtual Machines that are issuing I/Os. SCSI commands from a VM to an APD/PDL device are simply not responded to by the VMkernel, and when they time out, this elicits a retransmit attempt from the host. In other words, VMs retry their I/O indefinitely. That is why in most cases, if the device which was in APD does come back, the VMs continue to run from where they left off. This is one of the really great features of virtualization in my opinion (and has saved many an admin who inadvertently disconnected the wrong cable or offlined the wrong device) :-)
 
Best Practice to correctly remove a device & avoid APD/PDL
In the past, there has never been an intuitive or well-defined way to remove a storage device from an ESX host. Now we have a controlled procedure to do this in 5.0.
5.0 introduces two new storage device management techniques. You now have the ability to mount/unmount a VMFS volume & attach/detach a device. Therefore, if you want to remove the device completely, the first step is to unmount the volume. In this example, I have a NetApp device on which a VMFS-5 filesystem has been created. First, get the NAA id of the underlying LUN as you will need this later. This can be found by clicking on the Properties of the datastore, & looking at the extent details:
[Screenshot: NAA id in the datastore's extent details]
Before proceeding, make sure that there are no running VMs on the datastore, that the datastore is not part of a datastore cluster (i.e. it is not managed by Storage DRS), is not used by vSphere HA as a heartbeat datastore & does not have Storage I/O Control enabled.
With the NAA id noted, right click on the volume in the Configuration > Storage > Datastores view, and select Unmount:
[Screenshot: Unmount option]
When the unmount operation is selected, a number of checks are done on the volume to ensure that it is in a state that allows it to be unmounted, i.e. no running VMs, Storage I/O Control not enabled, not used as a Heartbeat Datastore, not managed by Storage DRS.
[Screenshot: Confirm unmount checks]
If the checks pass, click OK & then the volume is unmounted.
[Screenshot: volume unmounted]
The CLI command to do an unmount is esxcli storage filesystem unmount if you prefer to do it from the ESXi shell. Now that the volume is safely unmounted, the next step is to detach it from the host. This can also be done either via the CLI or via the UI in the Configuration > Storage window, but the view must be changed to Devices rather than Datastores. Click on the Devices button, select the correct device using the NAA id noted previously, right click and select Detach:
[Screenshot: Detach option in the Devices view]
The detach will check that the volume is indeed in a state that allows it to be detached:
[Screenshot: detach checks]
The same task can be done via the ESXi shell. The command that I need to use to do a detach is esxcli storage core device set --state=off. Note how the Status changes from on to off in the following commands:
~ # esxcli storage core device list -d naa.60a98000572d54724a34642d71325763
naa.60a98000572d54724a34642d71325763
   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34642d71325763)
   Has Settable Display Name: true
   Size: 3145923
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.60a98000572d54724a34642d71325763
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a34642d713257634c554e202020
 
~ # esxcli storage core device set --state=off -d naa.60a98000572d54724a34642d71325763
 
~ # esxcli storage core device list -d naa.60a98000572d54724a34642d71325763
naa.60a98000572d54724a34642d71325763
   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34642d71325763)
   Has Settable Display Name: true
   Size: 3145923
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path:
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: off
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a34642d713257634c554e202020
~ #
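If you want to script the Status check rather than eyeball it, the output of esxcli storage core device list is easy to parse. The following Python sketch assumes the output has the shape shown in the listing above (indented "Key: value" lines under the device id); the function names parse_device_fields and device_is_detached are my own, not part of any VMware tooling.

```python
def parse_device_fields(output: str) -> dict:
    """Turn the indented 'Key: value' lines of the device listing into a dict."""
    fields = {}
    for line in output.splitlines():
        # Only the indented lines carry fields; the prompt line and the
        # device-id header line start in column 0 and are skipped.
        if not line.startswith(" ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def device_is_detached(output: str) -> bool:
    """True when the listing reports 'Status: off' for the device."""
    return parse_device_fields(output).get("Status") == "off"


if __name__ == "__main__":
    sample = (
        "naa.60a98000572d54724a34642d71325763\n"
        "   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34642d71325763)\n"
        "   Devfs Path:\n"
        "   Status: off\n"
    )
    print(device_is_detached(sample))
```

Splitting on the first colon only keeps values such as the Devfs Path intact, and an empty Devfs Path simply parses to an empty string.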
This device is now successfully detached from the host. It remains visible in the UI at this point:
[Screenshot: detached device still listed in the UI]
You can now do a rescan of the SAN and this will safely remove the device from the host.
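The whole unmount, detach, rescan sequence can be captured as an ordered list of commands. The Python sketch below only builds the command lines; it does not run them. The helper name removal_commands is hypothetical. The -l (volume label) option on the unmount command and the esxcli storage core adapter rescan --all command are not shown earlier in this post, so treat them as assumptions and check them against the esxcli help output on your own build before using them.

```python
import shlex


def removal_commands(volume_label: str, naa_id: str) -> list:
    """Return the controlled-removal steps, in order, as argument lists."""
    if not naa_id.startswith("naa."):
        raise ValueError("expected an NAA id such as naa.60a98000...")
    return [
        # Step 1: unmount the VMFS volume (assumed -l = volume label).
        ["esxcli", "storage", "filesystem", "unmount", "-l", volume_label],
        # Step 2: detach the underlying device, as shown in the post.
        ["esxcli", "storage", "core", "device", "set", "--state=off", "-d", naa_id],
        # Step 3: rescan so the device is removed (assumed CLI equivalent
        # of the rescan that the post performs).
        ["esxcli", "storage", "core", "adapter", "rescan", "--all"],
    ]


if __name__ == "__main__":
    # "my-datastore" is a made-up label for illustration.
    for cmd in removal_commands("my-datastore", "naa.60a98000572d54724a34642d71325763"):
        print(shlex.join(cmd))
```

Keeping the steps as data makes the ordering explicit: the detach must never be issued before the unmount has succeeded.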
 
APD may still occur
The PDL state is definitely a major step in handling APD conditions. The unmount/detach mechanism should also alleviate certain APD conditions. However there is still a chance that APD conditions can occur. For instance, if the LUN fails in a way which does not return the sense codes expected by PDL, then you could still experience APD in your environment. VMware is continuing to work on the APD behaviour to mitigate the impact it might have on your infrastructure.
 
Detecting PDL Caveat
There is one important caveat to our ability to detect PDL. Some iSCSI arrays map a LUN to a target as a one-to-one relationship, i.e. there is only ever a single LUN per target.
In this case, the iSCSI arrays do not return the appropriate SCSI sense code, so we cannot detect PDL on these array types.
However, most other storage arrays on our HCL should be able to provide the SCSI sense codes that enable the VMkernel to detect PDL.
Following on from the 5.0 APD handling improvements, what we want to achieve in vSphere 5.1 is as follows:
  • Handle more complex transient APD conditions, and not have hostd getting stuck indefinitely when devices are removed in an uncontrolled manner.
  • Introduce some sort of PDL method for those iSCSI arrays which present only one LUN per target. These arrays were problematic for APD handling, since once the LUN went away, so did the target, and we had no way of getting back any SCSI sense codes.
It should be noted that in vSphere 5.0U1, we fixed an issue so that vSphere correctly detects PDL and restarts the affected VMs on other hosts in a vSphere HA cluster which may not be affected by the device loss. This enhancement is also in 5.1.
Complex APD
As I have already mentioned, All Paths Down affects more than just Virtual Machine I/O. It can also affect hostd worker threads, leading to host disconnects from vCenter in worst case scenarios.  It can also affect vmx I/O when updating Virtual Machine configuration files. On occasion, we have observed scenarios where the .vmx file was affected by an APD condition.
In vSphere 5.1, a new timeout value for APD is being introduced. There will be a new global setting for this feature called Misc.APDHandlingEnable. If this value is set to 0, the current (5.0) behavior of retrying  failing I/Os forever will be used. If Misc.APDHandlingEnable is set to 1 (default), APD Handling will be enabled to follow the new model using the time out value Misc.APDTimeout.
Misc.APDTimeout defaults to 140 seconds and is tunable. [The lower limit is 20 seconds but this is only for testing]. These settings (Misc.APDHandlingEnable & Misc.APDTimeout) are exposed in the vSphere UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout. Any further I/Os are fast-failed with a status of NO_CONNECT. This is the same status observed when an FC cable is disconnected from an FC HBA. This fast failing of I/Os prevents hostd from getting stuck waiting on I/O. If any of the paths to the device recovers, subsequent I/Os to the device are issued normally and the special APD treatment finishes.
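The timeout behaviour described above is essentially a small state machine, sketched here in Python. This is a model of the described behaviour, not VMkernel code: the class name ApdTimer, its method names, the "ISSUED" and "RETRY" labels and the use of plain seconds for time are all assumptions. The 140-second default, the 20-second floor and the NO_CONNECT status come from the text.

```python
class ApdTimer:
    """Toy model of per-device APD handling as described for vSphere 5.1."""

    MIN_TIMEOUT_S = 20  # lower limit quoted in the text (testing only)

    def __init__(self, handling_enabled=True, timeout_s=140):
        if timeout_s < self.MIN_TIMEOUT_S:
            raise ValueError("timeout below the 20 second lower limit")
        self.handling_enabled = handling_enabled  # models Misc.APDHandlingEnable
        self.timeout_s = timeout_s                # models Misc.APDTimeout
        self.apd_since = None                     # time at which APD was detected

    def all_paths_down(self, now):
        """APD detected: start the timer unless it is already running."""
        if self.apd_since is None:
            self.apd_since = now

    def path_recovered(self):
        """Any path coming back ends the special APD treatment."""
        self.apd_since = None

    def submit_io(self, now):
        if self.apd_since is None:
            return "ISSUED"        # device reachable, I/O goes out normally
        timed_out = now - self.apd_since >= self.timeout_s
        if self.handling_enabled and timed_out:
            return "NO_CONNECT"    # fast-fail after APD Timeout
        return "RETRY"             # keep retrying (the only 5.0 behaviour)


if __name__ == "__main__":
    timer = ApdTimer()
    timer.all_paths_down(now=0)
    print(timer.submit_io(now=60))   # still inside the 140 second window
    print(timer.submit_io(now=140))  # window expired, fast-fail
    timer.path_recovered()
    print(timer.submit_io(now=141))  # normal I/O again
```

With handling_enabled set to False the model never returns NO_CONNECT, which corresponds to setting Misc.APDHandlingEnable to 0 and keeping the 5.0 behaviour of retrying forever.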
Single-LUN, Single-Target
We also wanted to extend the PDL (Permanent Device Loss) detection to those arrays that only have a single LUN per Target. On these arrays, when the LUN disappears, so does the target so we could never get back a SCSI Sense Code as mentioned earlier.
Now in 5.1, the iSCSI initiator attempts to re-login to the target after a dropped session. If the device is not accessible, the storage system rejects our effort to access the storage. Depending on the response from the array, we can say the device is in PDL, not just unreachable.
I’m very pleased to see these APD enhancements in vSphere 5.1. The more that is done to mitigate the impact of APD, the better.