Tuesday, May 22, 2018

vSphere: Process Lock Issue

Recently, we faced an issue in vSphere 6.5, after cancelling an OVF export mid-process. In this blog we will go through the troubleshooting process in detail and the resolution. 

The Issue

One of our customers exported a Virtual Machine from vCenter and cancelled the process before completion. Then the VM could not be powered ON. The error received was:

 Error Stack:   
   An error was received from the ESX host while powering on VM TEST-VM.  
  Failed to start the virtual machine.  
  Module 'Disk' power on failed.   
  Cannot open the disk '/vmfs/volumes/5753f99f-d235ca6c-a280-d89d6779a96c/TEST-VM/TEST-VM_1.vmdk' or one of the snapshot disks it depends on.   
  Failed to lock the file  
  File system specific implementation of LookupAndOpen[file] failed  
  File system specific implementation of LookupAndOpen[file] failed  
  File system specific implementation of LookupAndOpen[file] failed  
  File system specific implementation of OpenFile[file] failed  

On checking the vmkernel log file we find the below issue.

 [root@test:/var/log] tail -f vmkernel.log   
 num 1 gblnum 0 gblgen 0 gblbrk 0] alloc owner 3833856  
 2019-11-12T11:38:08.515Z cpu18:72666)DLX: 4312: vol 'RAID5_DATASTORE2', lock at 94683136: [Req mode 1] Checking liveness:  
 2019-11-12T11:38:08.515Z cpu18:72666)[type 10c00001 offset 3763970 v 55, hb offset 3686970  
 gen 745, mode 2, owner 00000000-00000000-0000-000000000000 mtime 6167 num 1 gblnum 0 gblgen 0 gblbrk 0]  

On running the following command we can see that there is a Lock of type 10c00001 and mode 0. However the owner is 00000000-00000000-0000-000000000000.

 [root@test:/vmfs/volumes/5753f99f-d235ca6c-a280-d89d6779a96c/TEST-VM] vmkfstools -D /vmfs/volumes/5753f99f-d235ca6c-a280-d89d6779a96c/TEST-VM/TEST-VM.vmx   

 Lock [type 10c00001 offset 60387328 v 456, hb offset 3686970 gen 3, mode 0, owner 00000000-00000000-0000-000000000000 mtime 12601 num 0 gblnum 0 gblgen 0 gblbrk 0]   
 Addr <4, 121, 70>, gen 142, links 1, type reg, flags 0, uid 0, gid 0, mode 100755 len 3486, nb 1 tbz 0, cow 0, newSinceEpoch 1, zla 2, bs 8192  

The different mode types described on VMware site are as follows:

  • mode 0 = no lock
  • mode 1 = is an exclusive lock (vmx file of a powered on VM, the currently used disk (flat ir delta), *vswp, etc)
  • mode 2 = is a read-only lock (e.g. on the ..-flat.vmdk of a running VM with snapshots)
  • mode 3 = is a multi-writer lock. (e.g. used for MSCS clusters disks or FT VMs)

We can determine the owner (host) where the process is locked by running the following command with the offset value from previous output (offset 3686970).

 [root@test: /vmfs/volumes/5753f99f-d235ca6c-a280-d89d6779a96c/TEST-VM] hexdump -C /vmfs/volumes/5753f99f-d235ca6c-a280-d89d6779a96c/.vh.sf -n 512 -s 3686970  

 003ba000 00 40 5d 7e 00 00 00 00 00 00 00 00 89 00 00 00 |……;………|  
 003ba010 00 00 00 00 c9 dd 7f a1 5c 01 00 00 89 e8 ba 55 |….._../…….|  
 003ba020 e1 ff 30 43 c1 dd 7f 8e 3d 3a 21 1a a4 a6 8a 99 |..m....S...….|  
 003ba030 e0 00 00 00 9d 04 00 00 00 00 00 00 00 00 00 00 |…......……..|  
 003ba040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |…………….|  

By checking the MAC address of the hosts, we can verify the host where the process is locked. In this case it was the same host where the VM was located. The MAC address of one the host's NIC included (7f 8e 3d 3a), as indicated in the above output.

Troubleshooting Steps

First, we removed the VM (TEST-VM) from inventory. Then registered the VM to the same host. The problem was still persistent.

Second, removed the VM from inventory and created a new VM with the same virtual drives from the problematic VM (TEST-VM). While trying to power ON the new VM the problem still persisted.

Third, we restarted the ESXi host daemon 'hostd' and vCenter Agent 'vpxa' services on the host to no avail.

 /etc/init.d/hostd restart  
 /etc/init.d/vpxa restart  

Fourth, we restarted all the management services on the host. However, the problem was not resolved.

 services.sh restart  

Fifth, we rebooted the host and tried to power ON the VM. The issue was still there.

The Resolution

Since the OVF export process was run from vCenter, there was still a process that was locking the VM in vCenter and we were unable to power on the VM. We had already rebooted the host, so we could conclude that the host was not locking the VM.

We rebooted the vCenter and finally we could power ON the VM without any issues.

References


No comments:

Post a Comment

Note: Only a member of this blog may post a comment.