
Wednesday, 22 August 2012

Investigating virtual machine file locks on ESX/ESXi

Solution

To prevent concurrent changes to critical virtual machine files and file systems, ESX hosts establish locks on these files. In certain circumstances these locks may not be released when the virtual machine is powered off. The files cannot be accessed by the servers while locked, and the virtual machine is unable to power on.
These virtual machine files are commonly locked for runtime:
  • <VMNAME>.vswp
  • <DISKNAME>-flat.vmdk
  • <DISKNAME>-<ITERATION>-delta.vmdk
  • <VMNAME>.vmx
  • <VMNAME>.vmxf
  • vmware.log  

Identifying the locked file

To identify the locked file, attempt to power on the virtual machine. During the power on process, an error may display or be written to the virtual machine's logs. The error and the log entry identify the virtual machine and files:
  1. Where applicable, open and connect the VMware Infrastructure (VI) or vSphere Client to the respective ESX host, VirtualCenter Server, or the vCenter Server hostname or IP address.
  2. Locate the affected virtual machine, and attempt to power it on.
  3. Open a remote console window for the virtual machine.
  4. If the virtual machine is unable to power on, an error on the remote console screen displays with the name of the affected file.

    Note: If an error does not display, proceed to these steps to review the vmware.log file of the virtual machine.

    1. Log in as root to the ESX host using an SSH client.
    2. Confirm that the virtual machine is registered on the server and obtain the full path to the virtual machine:

      • On an ESX host, run this command:

        # vmware-cmd -l

        The output returns a list of the virtual machines registered to the ESX host. Each line contains the full path of a virtual machine's .vmx file. The output will be similar to:

        /vmfs/volumes/<UUID>/<VMDIR>/<VMNAME>.vmx

        Note: Record this information as it is required in the remainder of this process. This is the <path.vmx> referenced in the remainder of the article. The path is case-sensitive.
      • On an ESXi host, run this command:

        # vim-cmd vmsvc/getallvms

        The output returns a list of the virtual machines registered to the ESXi host. Each line contains the datastore and the location, within that datastore, of a virtual machine's .vmx file. The output will be similar to:

        [<datastore>] <VMDIR>/<VMNAME>.vmx

        Verify that the affected virtual machine appears in this list. If it is not listed, the virtual machine is not registered on this ESX/ESXi host. The host on which the virtual machine is registered typically holds the lock. Ensure that you are connected to the proper host before proceeding.
    3. Move to the virtual machine's directory:

      # cd /vmfs/volumes/<datastore>/<VMDIR>
    4. Use a text viewer to read the contents of the vmware.log file. At the end of the file, look for error messages that identify the affected file.
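
The log review in step 4 can be sketched as a small shell helper. This is a minimal sketch, not part of the original procedure: the function name show_lock_errors, the search keyword "lock", and the 200-line window are all assumptions chosen for illustration, and the relevant error may be worded differently in your log.

```shell
#!/bin/sh
# Hypothetical helper: print lines mentioning "lock" (case-insensitive)
# from the last N lines of a vmware.log file. Keyword and default window
# of 200 lines are assumptions for illustration only.
show_lock_errors() {
    log_file=$1
    lines=${2:-200}
    tail -n "$lines" "$log_file" | grep -i 'lock'
}

# Example (placeholders as used in this article):
# show_lock_errors /vmfs/volumes/<datastore>/<VMDIR>/vmware.log
```

If the helper prints nothing, read the end of the log manually, as described in step 4.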

Using the touch utility to determine if the file can be locked

The touch utility is designed to update the access and modification time stamp of the specified file or directory. As such, the command can be used to test the file and directory locking mechanism in the VMFS filesystem, where the procedure is expected to fail on locked files. Using touch is the preferred method because the changes to the resource are minimal.
To test the file or directory locking functionality, run this command:
# touch <filename>
Note: Performing a "touch *" command performs the operation on all files in the current directory.
The touch * command can result in these outcomes:
  • If the touch * command succeeds, the command changed the date/time stamp of each file, which verifies that each file could be locked and was then unlocked. At this point, retry the virtual machine power-on operation to see if it succeeds.
  • If the touch * command fails with a device or resource busy message, it indicates that a process is maintaining a lock on the file or directory. This may be on any of the ESX hosts which have access to the file. If the message is reported, proceed to the next section.
  • If another error message is reported, it may indicate that the metadata pertaining to file or directory locking on VMFS is invalid or corrupt. If this is the case, collect diagnostic information from the VMware ESX host and submit a support request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to Submit a Support Request.
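
The touch test above can also be run file by file, so that the failing file is reported together with its error message. The following is a minimal sketch under that assumption; the function name touch_test and the OK/FAILED labels are hypothetical and not part of any VMware tool. Like touch *, it updates the time stamps of every file it succeeds on.

```shell
#!/bin/sh
# Hypothetical helper: touch each regular file in a directory and report
# the outcome per file. A "Device or resource busy" error in the FAILED
# line suggests that the file is locked.
touch_test() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if err=$(touch "$f" 2>&1); then
            echo "OK      $f"
        else
            echo "FAILED  $f: $err"
        fi
    done
}

# Example (placeholders as used in this article):
# touch_test /vmfs/volumes/<datastore>/<VMDIR>
```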

Locating the lock and removing it

Because a virtual machine can be moved between hosts, the host where the virtual machine is currently registered may not be the host maintaining the file lock. The lock must be released by the ESX host that owns the lock. This host is identified by the MAC address of the primary Service Console interface.

Note: Locked files can also be caused by backup programs keeping a lock on the file while backing up the virtual machine. If there are any issues with the backup it may result in the lock not being removed correctly.
In some cases you may need to disable your backup application or reboot the backup server to clear the hung backup.
This lock can be maintained by either the VMkernel (ESX/ESXi) or the Service Console (ESX) for any hosts connected to the same storage.
Note: VMware ESXi does not utilize a separate Service Console Operating System. This reduces the amount of lock troubleshooting to just the VMkernel. For example, Console OS troubleshooting methods such as using the lsof utility are not applicable to VMware ESXi hosts.
Start with identifying the server whose VMkernel may be locking the file. To identify the server:
  1. Report the MAC address of the lock holder by running the command (except on NFS volume):

    # vmkfstools -D /vmfs/volumes/<UUID>/<VMDIR>/<LOCKEDFILE.xxx>

    Note: Run this command on all commonly locked virtual machine files (as listed at the start of the Solution section) to ensure that all locked files are identified.
  2. For servers prior to VMware ESX/ESXi 4.1, this command writes the output of the above command to the system's logs. From ESX/ESXi 4.1, the output is also displayed on-screen. Included in this output is the MAC address of any host that is locking the .vmdk file. To locate this information, run this command:

    # tail /var/log/vmkernel (For ESX)
    # tail /var/log/messages (For ESXi)
    Note: If there is a high amount of logging activity and you are unable to locate lines similar to the example below, use the less command in lieu of tail. Once the file is open, press G to scroll immediately to the bottom of the log's output. Use your arrow, page, or scroll keys to locate the relevant output.
    Look for lines similar to this:

    Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Lock [type 10c00001 offset 13058048 v 20, hb offset 3499520
    Hostname vmkernel: gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462]
    Hostname vmkernel: 17:00:38:46.977 cpu1:1033)Addr <4, 136, 2>, gen 19, links 1, type reg, flags 0x0, uid 0, gid 0, mode 600
    Hostname vmkernel: 17:00:38:46.977 cpu1:1033)len 297795584, nb 142 tbz 0, zla 1, bs 2097152
    Hostname vmkernel: 17:00:38:46.977 cpu1:1033)FS3: 132: <END supp167-w2k3-VC-a3112729.vswp>
    The second line displays the MAC address as the last segment of the value that follows the word owner. In this example, the MAC address of the Service Console or vswif0 interface of the offending ESX Server is 00:13:72:66:E2:00. After logging into the server, the process maintaining the lock can be analyzed.

    Note: If this process does not reveal the MAC address, or the owner identifier is all zeroes, it is possible that it is a Service Console-based lock, an NFS lock, or a lock generated by another system or product that can use or read VMFS file systems. In other circumstances, the file is locked by a VMkernel child or cartel world and the offending host running the process/world must be rebooted to clear it.
  3. To determine if the MAC address corresponds to the host that you are currently logged into, see Identifying the ESX Service Console MAC address (1001167). If it does not, you must establish a console or SSH connection to each host that has access to this virtual machine's files.

    After identifying the host, unregister the virtual machine from the existing host, register it on the host holding the lock, and then attempt to power on the virtual machine. You may have to set DRS to manual to ensure that the virtual machine powers on on the correct host.
    If the virtual machine still does not get powered on, complete these procedures while logged in to the offending host.

    Note: If you have already identified a vmkernel lock on the file, skip the rest of the steps in this section.
  4. To check for Service Console-based locks on non-ESXi servers, run this command:

    # lsof | grep <name of locked file>
    You will see an output similar to:

    COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
    71fd60b6- 3631 root 4r REG 0,9 10737418240 23533 <name of locked file>
    Note: If there is no Service Console process locking the file, the command prints no output. If you do receive results, file a Support Request to identify the process and to determine the root cause. If it is a third-party process, contact the appropriate vendor to determine the root cause before killing the process ID, as the lock may occur again in the future.

    Stop the process ID and its lock using the kill command. From the above example, the process ID is 3631:

    # kill 3631
    Warning: Using the kill command abruptly terminates all running processes for the virtual machine without generating a core dump for later analysis, so use this command with care if you intend to troubleshoot the virtual machine's state. To collect diagnostic information by crashing a virtual machine, see KB 2005715.

    After stopping the process, you can attempt to power on the virtual machine or access the file/resource.
  5. Check if the virtual machine still has a World ID assigned to it:

    For ESX/ESXi 4.x, run these commands on all ESX/ESXi hosts:

    # cd /tmp
    # vm-support -x

    The output will be similar to:

    Available worlds to debug:
    wid=<world id> <name of VM with locked file>

    On the ESX/ESXi host where the virtual machine is still running, kill the virtual machine, which releases the lock on the file. To kill the virtual machine, run this command:

    # vm-support -X <world id>
    Where the <world id> is the World ID of the virtual machine with the locked file.

    Note: This command takes 5-10 minutes to complete. Answer No to "Can I include a screenshot of the VM", and answer Yes to all subsequent questions.

    After stopping the process, you can power on the virtual machine or access the file/resource.

    For ESXi 5.0, the esxcli command-line utility can be used locally or remotely to display a list of the virtual machines which are currently running on the host.
    Obtain a list of all running virtual machines, identified by their World ID, Cartel ID, display name and path to the .vmx configuration file using this command:

    # esxcli vm process list
    The output appears similar to this:

    VirtualMachineName
    World ID: 1268395
    Process ID: 0
    VMX Cartel ID: 1264298
    UUID: ab cd ef ...
    Display Name: <VirtualMachineName>
    Config File: </path/VirtualMachineName>.vmx

    Two worlds are listed. The first world number (in this example, 1268395) is the Virtual Machine Monitor (VMM) for vCPU 0. The second world number (in this example, 1264298) is the virtual machine cartel ID.
    On the ESX/ESXi host where the virtual machine is still running, kill the virtual machine, which releases the lock on the file. To kill the virtual machine, run this command:

    # esxcli vm process kill --type soft --world-id 1268395
    For additional information, see Mapping a virtual machine world number to a virtual machine name (1001101).
  6. In ESXi 5.0 and ESXi 4.1, to find the owner of the locked file of a virtual machine, run this command:

    # vmkvsitools lsof | grep <Virtual Machine Name>
    You see an output similar to:

    11773 vmx 12 46 /vmfs/volumes/<Datastore Name>/VirtualMachineName/VirtualMachineName-flat.vmdk

    You can then run this command to get the PID of the process for the virtual machine:

    ps | grep <PID>

    You can kill the process with this command:

    kill -9 <PID>

    To generate a core dump when killing a running virtual machine that is hung and unresponsive, use the command kill -6 <PID> or kill -11 <PID>.

    Note: In ESXi 4.1 and ESXi 5.0, you can use the k command in esxtop to send a signal to, and kill, a running virtual machine process. On the ESXi console, enter Tech Support mode and log in as root. For more information, see Tech Support Mode for Emergency Support (1003677).

    1. Run the esxtop utility using the esxtop command.
    2. Press c to switch to the CPU resource utilization screen.
    3. Press Shift+f to display the list of fields.
    4. Press c to add the column for the Leader World ID.
    5. Identify the target virtual machine by its Name and Leader World ID (LWID).
    6. Press k.
    7. At the World to kill prompt, type in the Leader World ID from step 5 and press Enter.
    8. Wait 30 seconds and validate that the process is no longer listed.
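
Two of the manual readings in this section can be sketched as text filters: taking the MAC address from the owner field in step 2, and finding the World ID for a display name in the esxcli output in step 5. This is a minimal sketch that assumes the output layouts shown in this article; the function names lock_owner_mac and world_id_for_vm are hypothetical and not part of any VMware tool.

```shell
#!/bin/sh
# Hypothetical helper: read vmkfstools -D output (or the vmkernel log) on
# stdin and print the MAC address taken from the last 12 hex digits of
# each "owner" value, as described in step 2.
lock_owner_mac() {
    sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\).*/\1/p' |
        sed 's/../&:/g; s/:$//' |
        tr 'abcdef' 'ABCDEF'
}

# Hypothetical helper: read "esxcli vm process list" output on stdin and
# print the World ID of the entry whose Display Name equals $1.
# Assumes the "World ID:" line precedes "Display Name:" within each entry,
# as in the sample output in step 5.
world_id_for_vm() {
    awk -v vm="$1" '
        /^[[:space:]]*World ID:/ { wid = $NF }
        /^[[:space:]]*Display Name:/ {
            name = $0
            sub(/^[[:space:]]*Display Name:[[:space:]]*/, "", name)
            if (name == vm) print wid
        }'
}

# Examples (placeholders as used in this article):
# tail /var/log/vmkernel | lock_owner_mac
# esxcli vm process list | world_id_for_vm "<VirtualMachineName>"
```

For the sample log line in step 2, lock_owner_mac prints 00:13:72:66:E2:00. An address of 00:00:00:00:00:00 corresponds to the all-zeroes owner case described in the note under step 2.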
Removing the .lck file (NFS Only)
The virtual machine's files may be locked via NFS storage. You can identify this by files denoted with .lck.#### (where #### refers to the World ID that has the file lock) at the end of the filename. This is an NFS file lock. Because it is a hidden file, it is listed only when using the ls -la command.
Caution: These can be removed safely only if the virtual machine is not running.
Note: VMFS volumes do not have .lck files. The locking mechanism for VMFS volumes is handled within VMFS metadata on the volume.
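
Listing the NFS lock files can be sketched as follows. This is a minimal sketch: the function name list_lck_files is hypothetical, and it simply filters a hidden-file listing for names containing .lck. It only lists the files and does not remove anything.

```shell
#!/bin/sh
# Hypothetical helper: list entries, including hidden ones, whose names
# contain ".lck" in the given directory (default: current directory).
list_lck_files() {
    dir=${1:-.}
    ls -a "$dir" | grep '\.lck'
}

# Example (placeholders as used in this article):
# list_lck_files /vmfs/volumes/<datastore>/<VMDIR>
```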

Determining if the file is being used by a running virtual machine

If the file is being accessed by a running virtual machine, the lock cannot be usurped or removed. It is possible that the lockholder host is running the virtual machine and has become unresponsive, or another running virtual machine has the disk incorrectly added to its configuration prior to power-on attempts.
To determine if the virtual machine processes are running:
  1. Determine if the virtual machine is registered on the host:

    • For ESX, run this command as the root user:

      # vmware-cmd -l
      Note: If the virtual machine is registered on more than one ESX host, see Virtual Machines appear to be running or registered on multiple ESX Servers (1005051).

    • For ESXi, run this command as the root user:

      # vim-cmd vmsvc/getallvms

      The output lists the Vmid for each virtual machine registered. Record this information as it is required in the remainder of this process on the ESXi server. 
  2. Assess the virtual machine's current state on the host:

    • For ESX, run this command:

      # vmware-cmd <path.vmx> getstate

    • For ESXi, run this command:

      # vim-cmd vmsvc/power.getstate <vmid>
  3. To stop the virtual machine process, see Powering off an unresponsive virtual machine on an ESX host (1004340).
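
On ESXi, steps 1 and 2 can be chained by looking up the Vmid from the getallvms listing. The following is a minimal sketch; the function name vmid_for_name is hypothetical, and it assumes that the Vmid is the first column, the name is the second column, and the virtual machine name contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read "vim-cmd vmsvc/getallvms" output on stdin and
# print the Vmid of the virtual machine named $1.
# Assumes a numeric Vmid in column 1 and a name without spaces in column 2.
vmid_for_name() {
    awk -v vm="$1" '$1 ~ /^[0-9]+$/ && $2 == vm { print $1 }'
}

# Example on an ESXi host (VM name is a placeholder):
# vmid=$(vim-cmd vmsvc/getallvms | vmid_for_name "<VMNAME>")
# vim-cmd vmsvc/power.getstate "$vmid"
```

If the name contains spaces, read the Vmid from the listing manually instead.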

Determining if the .vmdk file is in use by other virtual machines

A lock on the .vmdk file can prevent a virtual machine from starting. However, since virtual machine disk files can be configured for use with any virtual machine, the file may be locked by another virtual machine that is currently running.
To determine if the virtual machine's disk file is configured for use on more than one virtual machine, run this command:
# egrep -i <DISKNAME>.vmdk /vmfs/volumes/*/*/*.vmx
Notes:
  • This command attempts to locate the specified disk name among all .vmx configuration files for the virtual machines that are visible to the ESX host. A Device or resource busy message is printed for each virtual machine that is running but not registered to this ESX host. You must run this command on each ESX host in the infrastructure or specifically on ESX hosts that have access to the storage containing the virtual machine's files.
  • If any additional virtual machines are configured to use the disk, determine if they are currently running. Powering off the other virtual machine using the disk file releases the lock. You must determine which virtual machine should have ownership of the file, then reconfigure your virtual machines to prevent this error from occurring again.
  • As part of their operation, many virtual machine backup solutions temporarily attach the virtual machine's .vmdk files to themselves. In such cases, if the backup fails and/or the host shuts down, the backup virtual machine may still have another virtual machine's vmdk file(s) attached. If that is the case, the other virtual machine is usually powered on first, which then creates a locked file condition when the backup virtual machine is attempted to be powered on. Check using Edit Settings on your backup solution's virtual machine to see if it has a hard disk attached that belongs to a different virtual machine. If it does, power down the backup virtual machine, select the appropriate disk and choose Remove to remove the disk from the virtual machine.

    Warning: Do not delete files from disk.
If the .vmdk file is not used by other virtual machines, confirm that there are no VMkernel or Service Console processes locking the file, per the above section, Locating the lock and removing it. If the host can be determined but the specific offending VMkernel child process ID cannot be identified, the server must be rebooted to clear the lock.
Note: You can also try to migrate the virtual machine to another host and power it on. If that ESX host has the lock for the virtual machine, it should allow you to power it on.
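
The egrep search in this section can be sketched as a loop that prints only the configuration files that reference the disk. This is a minimal sketch; the function name vmx_using_disk is hypothetical, and the disk name is matched as a literal, case-insensitive string. Error messages such as Device or resource busy are left visible on purpose, because they indicate virtual machines that are running but not registered to this host.

```shell
#!/bin/sh
# Hypothetical helper: print every .vmx file under <root>/*/*/ that
# mentions the given disk name (literal, case-insensitive match).
# The root defaults to /vmfs/volumes, as in the egrep command above.
vmx_using_disk() {
    disk=$1
    root=${2:-/vmfs/volumes}
    for vmx in "$root"/*/*/*.vmx; do
        [ -f "$vmx" ] || continue
        if grep -qiF -- "$disk" "$vmx"; then
            echo "$vmx"
        fi
    done
}

# Example (placeholder as used in this article):
# vmx_using_disk "<DISKNAME>.vmdk"
```

More than one line of output means that the disk is configured on more than one virtual machine visible to this host.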

Rebooting the ESX host which is locking the files

By this stage, you have already investigated for identifiable VMkernel and Service Console processes which have maintained locks upon the required files, but an unidentified child process still maintains the lock. You have identified the server via the vmkfstools -D command in earlier steps, the lsof utility yields no offending processes, and no other virtual machines are locking the file.
The server should be restarted to allow the virtual machine to be powered on again.
Note: Collect diagnostic information prior to rebooting if you wish to pursue a root-cause analysis with VMware Technical Support.
Migrate the virtual machines from the server and restart it using these processes:
  1. Migrate or vMotion all virtual machines from the host to alternate hosts.
  2. When the virtual machines have been evacuated, place the host into maintenance mode and reboot it.

    Note: If you have only one ESX host or do not have the ability to vMotion or migrate virtual machines, you must schedule downtime for the affected virtual machines prior to rebooting. When the host has rebooted, start the affected virtual machine again.

Check the integrity of the virtual machine configuration file (.vmx)

For more information on checking the integrity of the virtual machine configuration file, see Verifying ESX/ESXi virtual machine file integrity (1003743).  

Opening a Support Request

If your problem still exists after attempting the steps in this article, contact VMware Technical Support and file a Support Request.
