Troubleshooting (Virtual Machine)


See also Virtual Centre Troubleshooting

If all else fails you can always raise a VMware Service Request

Can't Connect to VM Console

Error connecting: Cannot connect to host... or Can't connect to MKS...

  • This is caused by a TCP connection failure to the ESX server the VM is hosted on. Using telnet or a port test utility, confirm you can connect on both TCP 902 and 443 from your machine to the ESX server.
  • If the problem is affecting a single ESX that previously worked, restart the management services on that ESX
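Where telnet isn't to hand, the same port test can be sketched in Python from your own machine. This is a minimal sketch only; the function name `check_tcp_ports` and the host name in the usage comment are placeholders, not anything VMware provides.

```python
import socket

def check_tcp_ports(host, ports=(902, 443), timeout=3.0):
    """Return {port: True/False} showing whether a TCP connection could be opened."""
    results = {}
    for port in ports:
        try:
            # create_connection completes the TCP handshake, much as telnet would
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution failures
            results[port] = False
    return results

# Usage (placeholder host name):
#   check_tcp_ports("esx01.example.com")
```

A False result for either port points at a firewall or network problem between your machine and the ESX, rather than at the VM itself.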

Can't Deploy VM

The VirtualCenter server is unable to decrypt passwords stored in the customization specification

  • Bizarrely, this is caused by the Virtual Centre server running out of disk space. Free up some space and all will be well.

A general system error occurred: Failed to create journal file provider

  • Check ESX disks are not full

Customization of the guest operating system 'winLonghornGuest' is not supported in this configuration. Microsoft Vista (TM) and Linux guests with Logical Volume Manager are supported only for recent ESX host and VMware Tools versions.

  • Caused by deploying a guest-customised Windows 2008 template where the OS of the source template is set to Windows 2008(!). Essentially, Win2008 is only barely supported in ESX3.5. Setting the OS of the source machine to Vista should resolve this issue.
  • With Windows 2008 R2 templates the above fix has been seen not to work, in which case:
    1. Deploy a clone (with no guest customisation)
    2. Perform a Sysprep

Can't Start VM

HA Admission Control

  • The VM can't be started because doing so wouldn't leave enough failover capacity to restart failed VMs should an ESX fail. Options are to:
    • Reduce resource usage of VMs that are already running
    • Increase cluster capacity
    • Reduce the cluster's failover capacity, or allow constraints violations
  • If no VMs have recently been added to the cluster, it's likely that the HA agent on one of the ESXs has stopped functioning, in which case one of the ESXs in the cluster will show a red warning/exclamation triangle. If so, you can restart HA on that ESX:
    1. Highlight this ESX; on the Summary tab you should see a notice regarding HA problems
    2. Run the Reconfigure for HA command, which re-installs the HA agent on the ESX

Failed to relocate virtual machine

  • DRS is attempting to relocate a VM at power up, and this relocation is failing
    • Reattempt to power on machine
    • Manually migrate to a less loaded ESX and reattempt power on

Access to VMFS storage

  • ESX may have lost connectivity to VMFS partition on which VM resides

VMFS full

  • If VMFS is full, the ESX won't be able to write to the VM's logs when it starts it up, causing VM start-up to fail

ESX licensing

  • Either ESX isn't licensed, or has lost contact with the license server (VI3) for a long period of time

Waiting for question to be answered

  • Generally after changes (such as cold migrations or new deployments), a VM may need to have a question answered before it can continue to power on

Could not power on VM: No swap file. Failed to power on VM

  • The ESX you're starting the VM on can't get proper access to the VM's files, either because
    • The VM is already powered up on another ESX
    • The VM is already powered up (but shows as down on the VI Client)
    • The VM's files have been corrupted / locked

  1. Is the VM actually powered off?
  2. Has an ESX recently failed?
    • If the ESX the VM is/was on has recently failed and HA's isolation response is set to leave powered-on, then it's possible that only the ESX's network connections have failed, and the VMs are still running on the ESX but are isolated from the network.
      • To cause a full HA failover, pull the power cables out of the ESX to kill it completely
      • Alternatively, attempt to restore network connectivity to allow the VMs to be reachable again
    • If the ESX the VM is/was on has recently failed, it's possible that the file locks have not yet expired (or are still being updated).
      • If you're able to get a console onto the failed ESX, ensure it has fully failed (powered off or PSOD). If not, power it off, in case it has failed badly enough to stop the VMs running but not badly enough to stop it updating the file locks. HA will restart the VM if the failure is still very recent; otherwise restart the VM manually.

If there have been no ESX failures, then the VM's files may be corrupted. The VM can be re-registered by removing and re-adding it to the inventory, but the re-add may fail if the wrong files are corrupted. To investigate corruption further...

  • To test whether the ESX should be able to lock the VM's files, use touch. Within the VM's directory, run touch *.vswp
    • If success, retry power on
    • If device or resource busy then the VM is probably owned by another ESX - find that ESX!
    • If Invalid argument then the file can't be accessed (eg corrupt or other storage problem)
  • It's also worth doing a touch on the following files; if they are accessible then the VM may be recoverable. To work around the .vswp issue, remove the reference to the file in the .vmx config file
    • touch *.vmx
    • touch *flat.vmdk
    • touch *delta.vmdk
    • touch vmware.log
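The decision table behind the touch tests can be written down as a short Python sketch. It assumes a machine with Python 3 and access to the VM's directory, which may not be true of an ESX service console; `probe_file` and `DIAGNOSIS` are illustrative names, and the wording of the diagnoses simply mirrors the list above.

```python
import errno
import os

# Interpretations taken from the touch results described above
DIAGNOSIS = {
    errno.EBUSY: "locked: probably owned by another ESX",
    errno.EINVAL: "inaccessible: corrupt file or other storage problem",
}

def probe_file(path):
    """Update a file's timestamps (as touch does) and explain the outcome."""
    try:
        os.utime(path, None)  # set access/modify times to now
        return "ok"
    except OSError as exc:
        # Anything other than the two known cases is reported verbatim
        return DIAGNOSIS.get(exc.errno, f"unexpected error: {exc.strerror}")
```

Running `probe_file` over the .vswp, .vmx, flat, delta and log files in turn gives a quick picture of which files are held elsewhere and which are simply unreadable.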

For further info see - VMware KB10051 - Virtual machine does not power on because of missing or locked files

Cannot open the disk '/vmfs/volumes/.../MyVM-000001.vmdk' or one of the snapshot disks it depends on...

Cannot open the disk '/vmfs/volumes/.../MyVM-000001.vmdk' or one of the snapshot disks it depends on. Reason: The parent virtual disk has been modified since the child was created

  • The ESX can't work out the chain of vmdk's that make up the VM's disks, most likely because
    • Snapshot CID chain is corrupted
  1. You need to establish the chain of files, start by looking at the vmx file to work out the top vmdk, then track back through them until you get to the base disk.
    • Any vmdk files not referenced in this chain are erroneous and can be deleted (or better, moved to a temporary sub-folder)
    • Any delta file <= 16MB is effectively empty and can be skipped
  2. Now display the CID's stored and then work out their correct order
    • grep CID My-VM.vmdk My-VM-00000[1-9].vmdk
  3. You then need to edit the vmdk files to correct the CID chain
  4. Start the VM and confirm it's working as expected
  5. Create a new temporary snapshot, then remove it to clear them up
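Checking the chain by eye is error prone, so the comparison can be sketched in Python. This is a rough sketch that assumes the descriptor files contain the usual `CID=`, `parentCID=` and `parentFileNameHint=` lines and that every file in the chain lives in the same directory. The function names, and the CID values in the usage note, are made up for illustration.

```python
import re

def parse_descriptor(text):
    """Extract CID, parentCID and parentFileNameHint from vmdk descriptor text."""
    fields = {}
    for key in ("CID", "parentCID"):
        # ^ anchors at line start, so "CID=" does not match inside "parentCID="
        m = re.search(rf"^{key}=([0-9a-fA-F]{{8}})\s*$", text, re.MULTILINE)
        fields[key] = m.group(1).lower() if m else None
    m = re.search(r'^parentFileNameHint="([^"]*)"', text, re.MULTILINE)
    fields["parent"] = m.group(1) if m else None
    return fields

def check_chain(descriptors, top):
    """Walk from the top vmdk down to the base disk and list CID mismatches.

    descriptors maps file name -> descriptor text; top is the vmdk named in the vmx.
    Each problem is (child, child's parentCID, parent, parent's CID).
    """
    problems = []
    name, seen = top, set()
    while name not in seen:
        seen.add(name)
        child = parse_descriptor(descriptors[name])
        parent_name = child["parent"]
        if parent_name is None:
            break  # no parent hint: this is the base disk
        if parent_name not in descriptors:
            problems.append((name, child["parentCID"], parent_name, None))
            break
        parent_cid = parse_descriptor(descriptors[parent_name])["CID"]
        if child["parentCID"] != parent_cid:
            problems.append((name, child["parentCID"], parent_name, parent_cid))
        name = parent_name
    return problems
```

For example, if MyVM-000002.vmdk records a parentCID of dddddddd while MyVM-000001.vmdk has a CID of bbbbbbbb, `check_chain` reports that pair, which is the link to correct when editing the descriptors.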

General system error occurred...

A general system error occurred: The system returned an error. Communication with the virtual machine might have been interrupted.

  • This error generally seems to occur when the ESX is having trouble launching the VM's processes, sometimes because it's having trouble reading the VM's VMX file.
    • If the problem is intermittently affecting one or more VMs, it's likely that the ESX's hostd process is struggling, in which case restart the ESX management agents
    • If the problem is continually affecting one (or possibly more) VMs, the VM's config file may be corrupted, or storage may be experiencing problems.

Can't Stop / Power-Off a VM

This normally occurs because you've lost management (VI Client) access to the ESX, or because the ESX doesn't appear to be aware that it's running the VM when it is (so the VM appears Inaccessible via the VI Client). If you have access to the VM via the VI Client but can't power it off, it'll probably be a permissions issue. There is no way to gracefully shut down a VM without access via the VI Client (or direct access to the VM via RDP, VNC, etc).

  1. SSH to the ESX you believe the VM is running on
  2. Find the path to the VM's config file
    • EG vmware-cmd -l | grep VM_Name
    • If the VM is not listed, the VM isn't registered to that ESX
  3. Instruct the ESX to power off the VM using the VMX path already found
    • EG vmware-cmd /path/to/VM_Name.vmx stop

If the above fails, you'll need to get a bit more forceful...

  1. Find the PID of the VM
    • EG ps -auxwww | grep VM_Name
  2. Kill the VM using the PID found (make sure you've got the right PID, you could kill the ESX by mistake!)
    • EG kill -9 1234
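Because killing the wrong PID is costly, it can help to pick the PID out of the ps listing mechanically and refuse to proceed unless exactly one process matches. The sketch below assumes the usual `ps aux` column layout (USER, PID, ..., COMMAND) and that the VM's process has the path to its .vmx file on its command line. `find_vm_pid` is an illustrative name.

```python
def find_vm_pid(ps_output, vm_name):
    """Return the one PID whose command line references <vm_name>.vmx.

    Raises ValueError when zero or several processes match, so that an
    ambiguous listing never leads to a kill.
    """
    matches = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND...
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something unexpected
        command = fields[10]
        if command.startswith("grep"):
            continue  # ignore the grep used to filter the listing
        if f"/{vm_name}.vmx" in command:
            matches.append(int(fields[1]))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one process for {vm_name}, found {len(matches)}")
    return matches[0]
```

Only pass the returned PID to kill once you have checked the matching line yourself.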

VM is Powered On, but appears Powered Off

The VM responds to ping and RDP/VNC/SSH etc (as appropriate) but is showing as down in the VI Client. Also see Confirm VM's Status on ESX

  1. Restart the management agents on the ESX and recheck

If that doesn't improve matters...

  1. Find the location of the vmx file for the VM (so it can be re-added to the inventory)
  2. Connect a VI Client to the ESX and unregister the VM (remove from inventory)
  3. Restart the management agents on the ESX
  4. Re-add the VM to the inventory

If running ESX4i see VMware KB 1033591 - Virtual machine appears powered off after restarting the management services on the host, but note that...

  • vMotion all powered-on VM's off the affected ESX first
  • Recover 1 VM at a time, and vMotion it off as soon as it is recovered (it may disappear when recovering the next VM)
  • Recovered VM's may end up with a state of Unknown on vCentre and ESX, in which case, remove from ESX inventory and re-add
  • Restart the ESX once all recovered

Can't VMotion a VM

VM network doesn't exist at destination

  • VM is using a particular port group which doesn’t exist on the destination ESX

ESX / network too busy

  • VMotion can’t copy across the VM’s memory contents/changes quickly enough. An alternative is to use a Low Priority VMotion, which is more likely to succeed, but may result in the VM experiencing temporary freezes (avoids full OS downtime, but not without impact to hosted applications)

ESXs can't communicate

  • ESXs need to be able to communicate via VMotion network. DNS problems and FQDN inaccuracies can also cause problems

VM is connected to CD-ROM/ISO

  • The VM’s CD-ROM is connected to an ISO file via the host ESX, tying it to that ESX

Can't Increase a VM's Disk

A general system error occurred: Internal error

  • Can be caused by existing snapshots running on a VM
  • Check the ESX logs / available disk space etc


Can't Create Snapshot

Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine

Can't Delete/Commit Snapshot

If snapshot files are large then patience is of the essence. If possible, shut the VM down first, or at the very least limit activity on the VM. To commit a snapshot in a running VM, first a new snapshot is started, then the original redo files are merged with the base disk(s), then the extra redo file is merged.

Operation timed-out

  • Not unusual for large (>10GB) redo files; the process continues and it's just vCentre reporting it as a time-out
    • Check the VM's files for any activity (changes in disk sizes/timestamps). Speed is dependent on redo size, storage speed, ESX load and VM activity (if possible shut the VM down before removing the snapshot)
    • Also see Snapshot Still Active?
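Watching for file activity can be sketched as two directory listings taken a few minutes apart and then compared. The sketch assumes Python 3 on a machine that can see the VM's folder; the function names are illustrative.

```python
import os

def snapshot_state(directory):
    """Return {file name: (size, mtime)} for every vmdk file in a VM's folder."""
    state = {}
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith(".vmdk"):
            st = entry.stat()
            state[entry.name] = (st.st_size, st.st_mtime)
    return state

def changed_files(before, after):
    """Names of files that appeared, disappeared, or changed size/timestamp."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

If `changed_files` keeps returning the redo files between successive listings, the commit is still progressing despite the reported time-out.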

No Snapshots Exist in Snapshot Manager (but still exist)

  • Can happen if a snapshot Delete (All) fails to complete properly (eg ESX pseudo-hangs and you restart the management agents)
    1. Backup and then delete the VM's VMSD file
    2. Start a new snapshot
    3. In snapshot manager use Delete All (not Delete!)
  • If this fails, check the ESX log to see what went wrong

Is Snapshot Still Active?

  1. Check Snapshot Manager; if there are snapshots listed then there are still active snapshots
  2. Open up Datastore Browser to the VM's folder, and see if any snapshot files exist; if not then there are no active snapshots
  3. Check the VM's VMX file; the VMDK filename(s) will be either a snapshot or normal flat base disk file
    • EG scsi0:0.fileName = "MyVM-000001.vmdk" ← Snapshot file (snapshot running)
    • EG scsi0:0.fileName = "MyVM-000001-delta.vmdk" ← Snapshot file (snapshot running)
    • EG scsi0:0.fileName = "MyVM.vmdk" ← Base disk file (no snapshot running)
    • EG scsi0:0.fileName = "MyVM-flat.vmdk" ← Base disk file (no snapshot running)
  4. If there are no snapshots running, but snapshot files exist, then the files can be deleted (if you're sure!)
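The VMX check in step 3 can be sketched as a small Python function that reads the VMX text and flags disks pointing at snapshot files. It assumes snapshot files follow the -00000N naming shown in the examples, so a base disk whose own name happens to end in six digits would be misreported. `disks_on_snapshot` is an illustrative name.

```python
import re

# Matches lines such as: scsi0:0.fileName = "MyVM-000001.vmdk"
DISK_LINE = re.compile(
    r'^\s*((?:scsi|ide)\d+:\d+)\.fileName\s*=\s*"([^"]+\.vmdk)"',
    re.MULTILINE | re.IGNORECASE,
)
# Snapshot naming convention assumed from the examples: -NNNNNN or -NNNNNN-delta
SNAPSHOT_NAME = re.compile(r"-\d{6}(-delta)?\.vmdk$")

def disks_on_snapshot(vmx_text):
    """Return {device: True/False}, True when the disk points at a snapshot file."""
    return {
        device: bool(SNAPSHOT_NAME.search(filename))
        for device, filename in DISK_LINE.findall(vmx_text)
    }
```

Any True value means that disk is still running on a snapshot, whatever Snapshot Manager shows.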

Revert to Snapshot Causes Trust Relationship Failure

When reverting a VM that is a member of a Windows domain to a snapshot you can get the following errors at boot up, or when trying to logon

  • The trust relationship between this workstation and the primary domain failed
  • Windows cannot connect to the domain, either because the domain controller is down or otherwise unavailable, or because your computer account was not found. Please try again later. If this message continues to appear, contact your system administrator for assistance.

The problem is caused by the VM's computer account, which the domain member (the snapshotted machine) uses to authenticate to the domain controller, having an invalid password. Domain members periodically change the password they use to connect to the domain (by default every 30 days). So if a VM is snapshotted and then updates its computer account password, a revert to snapshot restores the old, now invalid, password and the machine is unable to log in to the domain.

  • To resolve:
    1. The machine needs to be taken off the domain, and put back on (you'll need a domain account with rights to do this)
  • To prevent (see note below):
    • Disable machine account password changes
      1. On the domain member machine update the registry
      2. HKLM\SYSTEM\CurrentControlSet\Services\NetLogon\Parameters\DisablePasswordChange to 1
    • Reduce machine account password change frequency
      1. On the domain member machine update the registry
      2. HKLM\SYSTEM\CurrentControlSet\Services\NetLogon\Parameters\MaximumPasswordAge to a higher value (in days), eg 60

The prevention options reduce domain security!
They should only be actioned if you understand the risks and are not breaching any security policies that may be in force at your organisation.

If it's not a regular occurrence, it's probably best to just live with the problem, and resolve it when required. Snapshots should not be allowed to run for many days in normal operation, which means that the problem should not occur frequently in a well-run environment.

Can't Customise

Windows setup could not configure Windows to run on this computer's hardware
Windows could not complete the installation. To install Windows on this computer, restart the installation.

  • The guest customisation is failing because either
    • The virtual hardware has changed (especially disk type) since the original machine was created
    • Sysprep can't customise the machine because it doesn't have administrator rights, this can occur where a DC's users have been offloaded to LDS

VMTools Automatic Cursor Release Not Working

Sometimes the console automatic cursor release (which allows you to seamlessly switch focus from a VM console to your desktop by moving your mouse, avoiding having to use CTRL+ALT) doesn't work. This seems to be more common with VMs deployed from templates or cloned from other VMs.

To resolve...

  1. Uninstall VM Tools
  2. Reboot
  3. Install VM Tools
  4. Reboot

Confirm VM's Status on ESX

The following commands take you through confirming the status of a VM, as determined by the ESX

  1. Get the list of VMs registered to the ESX, to check the ESX believes it's hosting the VM
    • vm-support -x
  2. Get the VM's ID (vmid)
    • vim-cmd vmsvc/getallvms | grep <VM name>
  3. Get the state of the VM (as the ESX believes)
    • vim-cmd vmsvc/power.getstate <vmid>
  4. Check if the ESX has any running processes for the VM (in which case it's powered on, regardless of the above)
    • ps | grep <VM name>
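Picking the vmid out of the getallvms listing can be scripted. The sketch below assumes each row starts with the numeric Vmid, followed by the VM name, followed by the datastore in square brackets. The column layout is an assumption about the command's output, and `find_vmid` is an illustrative name.

```python
import re

# Assumed row layout: <Vmid> <Name> [<datastore>] <path to vmx> ...
ROW = re.compile(r"^(\d+)\s+(.+?)\s+\[")

def find_vmid(getallvms_output, vm_name):
    """Return the Vmid for an exact VM name match, or None if not listed."""
    for line in getallvms_output.splitlines():
        m = ROW.match(line)
        if m and m.group(2) == vm_name:
            return int(m.group(1))
    return None
```

Matching on the exact name avoids the false hits that a plain grep gives when one VM's name is a prefix of another's.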

To check that a VM is being locked by the ESX you're on

  1. Get the lock info for the VM's disk (use the first if there are several)
    • vmkfstools -D <VM-name>-flat.vmdk
  2. Pick out the MAC address from the end of the owner field in the lock info (78e7d192a548 in the example below)
  3. List the NIC info for the ESX
    • esxcfg-vmknic -l
  4. If the MAC address from the lock info matches one of the ESX's NICs, the lock is held by the ESX you're on

Example lock info:

Lock [type 10c00001 offset 72968192 v 470, hb offset 3985408
gen 583, mode 1, owner 4d2dcc7b-20fb6d90-2b80-78e7d192a548 mtime 25711553]
Addr <4, 151, 197>, gen 299, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 37580963840, nb 17688 tbz 0, cow 0, zla 3, bs 2097152
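Extracting the MAC address from the owner field and comparing it with the NIC listing can be sketched in Python. It assumes the owner field always ends in a 12-digit hex MAC, as in the lock info above, and that the NIC listing prints MAC addresses in colon-separated form. Both function names are illustrative.

```python
import re

OWNER = re.compile(
    r"owner\s+[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
    re.IGNORECASE,
)

def lock_owner_mac(lock_info):
    """Return the lock owner's MAC as aa:bb:cc:dd:ee:ff, or None if not found."""
    m = OWNER.search(lock_info)
    if not m:
        return None
    raw = m.group(1).lower()
    # Split the 12 hex digits into six colon-separated octets
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))

def host_owns_lock(lock_info, nic_listing):
    """True when the owner MAC appears in the host's NIC listing text."""
    mac = lock_owner_mac(lock_info)
    return mac is not None and mac in nic_listing.lower()
```

With the lock info above, `lock_owner_mac` returns 78:e7:d1:92:a5:48, which is the value to look for in the NIC listing.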