Virtualization The Future: June 2014

Sunday, 29 June 2014

ESXi/ESX host disconnects from vCenter Server 60 seconds after connecting (1029919)

Symptoms

An ESXi/ESX host is successfully added to the vCenter Server inventory but enters a Not Responding or Disconnected state after one minute.
You can use the vSphere Client to successfully connect to the ESXi/ESX host directly.
In the vpxd.log file, you see entries similar to:

2012-04-02T13:07:49.438+02:00 [02248 info 'Default' opID=66183d64] [VpxLRO] -- BEGIN task-internal-252 -- -- vim.SessionManager.acquireSessionTicket -- 52fa8682-47e0-2566-fb05-6192cb2c22f9(5298e245-ffb6-f7f8-e8a0-dedfbe369255)
2012-04-02T13:07:49.579+02:00 [02068 info 'Default'] [VpxLRO] -- BEGIN task-internal-253 -- host-94 -- VpxdInvtHostSyncHostLRO.Synchronize --
2012-04-02T13:07:49.579+02:00 [02068 warning 'Default'] [VpxdInvtHostSyncHostLRO] Connection not alive for host host-94
2012-04-02T13:07:49.579+02:00 [02068 warning 'Default'] [VpxdInvtHost::FixNotRespondingHost] Returning false since host is already fixed!
2012-04-02T13:07:49.579+02:00 [02068 warning 'Default'] [VpxdInvtHostSyncHostLRO] Failed to fix not responding host host-94
2012-04-02T13:07:49.579+02:00 [02068 warning 'Default'] [VpxdInvtHostSyncHostLRO] Connection not alive for host host-94
2012-04-02T13:07:49.579+02:00 [02068 error 'Default'] [VpxdInvtHostSyncHostLRO] FixNotRespondingHost failed for host host-94, marking host as notResponding
2012-04-02T13:07:49.579+02:00 [02068 warning 'Default'] [VpxdMoHost] host connection state changed to [NO_RESPONSE] for host-94
2012-04-02T13:07:49.610+02:00 [02248 info 'Default' opID=66183d64] [VpxLRO] -- FINISH task-internal-252 -- -- vim.SessionManager.acquireSessionTicket -- 52fa8682-47e0-2566-fb05-6192cb2c22f9(5298e245-ffb6-f7f8-e8a0-dedfbe369255)
2012-04-02T13:07:49.719+02:00 [02068 info 'Default'] [VpxdMoHost::SetComputeCompatibilityDirty] Marked host-94 as dirty.
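
When vpxd.log is large, it can help to pull out only the state-change entries. The following is a minimal sketch, not a VMware tool: the function name find_state_changes and the regular expression are assumptions based solely on the line format shown in the excerpt above.

```python
import re

# Pattern derived from the "[VpxdMoHost] host connection state changed" line
# in the excerpt above; other vpxd.log versions may format this differently.
STATE_CHANGE = re.compile(
    r"^(?P<timestamp>\S+) .*\[VpxdMoHost\] host connection state changed to "
    r"\[(?P<state>\w+)\] for (?P<host>host-\d+)"
)


def find_state_changes(lines, state="NO_RESPONSE"):
    """Return (timestamp, host) pairs for lines reporting the given state."""
    hits = []
    for line in lines:
        match = STATE_CHANGE.match(line)
        if match and match.group("state") == state:
            hits.append((match.group("timestamp"), match.group("host")))
    return hits
```

Passing the lines of vpxd.log to find_state_changes lists each host that was marked NO_RESPONSE together with the time of the change, which can then be compared against the time the host was connected.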

Cause

This issue may occur if heartbeat packets are not received from the host before the one minute timeout period expires. These heartbeat packets are UDP packets sent over port 902.

This issue may also occur when the Windows firewall is enabled and the ports are not configured.

Resolution

To resolve this issue, check the Windows Firewall on the vCenter Server machine. If ports are not configured, disable the Windows Firewall.

If ports are configured, verify that network traffic is allowed to pass from the ESXi/ESX host to the vCenter Server system and that UDP port 902 is not blocked.

To perform a basic verification from the guest operating system perspective:

Click Start > Run, type wf.msc, and click OK. The Windows Firewall with Advanced Security Management console appears.
In the left pane, click Inbound Rules.
Right-click the VMware vCenter Server -host heartbeat rule and click Properties.
In the Properties dialog, click the Advanced tab.
Under Profiles, ensure that the Domain option is selected.

You can use Wireshark to verify if the network allows bi-directional traffic.

Note: VMware does not endorse or recommend any particular third-party utility.

To verify if bi-directional traffic is allowed:

Download Wireshark from http://www.wireshark.org/ and install it on the vCenter Server system.
On ESXi, enable Tech Support Mode. For more information on enabling Tech Support Mode, see:
- For ESXi 4.1 and 5.x: Using Tech Support Mode in ESXi 4.1 and ESXi 5.x (1017910)
- For ESXi 4.0: Tech Support Mode for Emergency Support (1003677)
Download the Python script attached to this article (udp_client.py) to the ESXi/ESX system in question.
Edit the udp_client.py script on the ESXi/ESX host with a text editor. Modify the line "host = '192.168.1.1'", replacing 192.168.1.1 with the IP address of the vCenter Server system.
Start Wireshark on the vCenter Server system.
1. In the Filter field, enter ip.src==IP_of_host and udp.port==902. Replace IP_of_host with the IP address of the ESXi/ESX host in question.
2. Click Apply.
3. From the Capture menu, select Interfaces and click Start next to the NIC used for vCenter Server IP traffic.
From the ESXi/ESX host, run this command:

python udp_client.py

The total number of packets sent, the port, and the destination address are displayed.
On the vCenter Server system, watch the Wireshark screen for any packets showing up that match the filter applied.
If no packets are received, this indicates that something is blocking UDP traffic over port 902 from the ESXi/ESX host to the vCenter Server system. Inspect the physical networking environment and any software-based firewall on the vCenter Server system.
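
The udp_client.py script attached to the original article is not reproduced here. As a rough illustration of what such a test sender might look like, the sketch below sends a handful of UDP datagrams to a target address and port. The function name send_udp_probes, the payload, and the packet count are assumptions made for this example and are not taken from the attached script.

```python
import socket


def send_udp_probes(host, port=902, count=10, payload=b"heartbeat-test"):
    """Send `count` UDP datagrams to host:port and return how many were sent.

    The payload is arbitrary test data, not a real vCenter heartbeat.
    """
    sent = 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for _ in range(count):
            sock.sendto(payload, (host, port))
            sent += 1
    finally:
        sock.close()
    return sent
```

Calling send_udp_probes with the vCenter Server IP address in place of the 192.168.1.1 placeholder should produce packets that match the Wireshark filter described above. UDP is connectionless, so a successful send only shows that the packets left the host; whether they arrive still has to be confirmed in the capture.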

Ensure that these ports are open in the firewall between vCenter Server and the ESXi/ESX hosts:

902 - UDP & TCP
443 - TCP

For more information, see TCP and UDP Ports required to access vCenter Server, ESX hosts, and other network components (1012382).
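
The TCP side of the port list can be checked with a simple connection attempt. This is a minimal sketch under the assumption that a successful TCP connect is enough to show the port is reachable; the function names tcp_port_open and check_required_tcp_ports are hypothetical. It does not test UDP 902, which needs a packet capture as described above.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
    sock.close()
    return True


def check_required_tcp_ports(host, ports=(443, 902)):
    """Return a dict mapping each TCP port to True (reachable) or False."""
    return {port: tcp_port_open(host, port) for port in ports}
```

A False result for 443 or 902 suggests a firewall or routing problem on the TCP path between the two systems.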

Additional Information

This article does not assist you in troubleshooting ESXi 3.5 hosts that are disconnecting from vCenter Server after one minute as a Python interpreter is not installed on ESXi 3.5.

For more information on a similar issue, see ESXi/ESX host disconnects from vCenter Server after adding or connecting it to the inventory (2040630).

Source:-

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029919

ESXi 5.0 hosts are marked as Not Responding 60 seconds after being added to vCenter Server (2020100)

Symptoms

When an ESXi 5.0 host is added to vCenter Server 5.0, the host is added correctly but approximately 60 seconds later it is marked as Not Responding.
Disabling the ESXi 5.0 firewall allows the host to connect.
When ESXi 3.x and 4.x hosts are added, they do not exhibit the same behavior.

Cause

The ESXi host sends UDP heartbeats to the vCenter Server, and by default this traffic is sent over port 902. There is also a rule in the ESXi firewall to allow for vCenter Server heartbeat traffic. If the vCenter Server has been configured to send traffic over an alternate port, that traffic will be blocked.

Resolution

There are two methods to resolve this issue:

Note: Before beginning the resolution steps, determine which port vCenter Server is currently using to send traffic.

To determine the traffic port in use:

Connect to the ESXi 5.0 host using SSH. For more information, see Using ESXi Shell in ESXi 5.0 and 5.1 (2004746).
Use the less or grep command to determine the port in use:

less /etc/vmware/vpxa/vpxa.cfg

or:

grep serverPort /etc/vmware/vpxa/vpxa.cfg

The port number in use is contained in the serverPort tags. For example:

<vpxa>
  <bundleVersion>1000000</bundleVersion>
  <datastorePrincipal>root</datastorePrincipal>
  <hostIp>xxx.xxx.xxx.xxx</hostIp>
  <hostKey>52db3386-b766-889a-d778-da0c8851c81e</hostKey>
  <hostPort>443</hostPort>
  <licenseExpiryNotificationThreshold>15</licenseExpiryNotificationThreshold>
  <memoryCheckerTimeInSecs>30</memoryCheckerTimeInSecs>
  <serverIp>xxx.xxx.xxx.xxx</serverIp>
  <serverPort>9020</serverPort>
</vpxa>

In this example, serverPort is set to 9020, instead of the default port number 902.
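
The same lookup can be scripted. The sketch below assumes the vpxa.cfg content is well-formed XML containing a serverPort element, as in the example above; the function name read_server_port is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_server_port(cfg_text):
    """Return the serverPort value from vpxa.cfg text as an int, or None."""
    root = ET.fromstring(cfg_text)
    node = root.find(".//serverPort")
    if node is None or node.text is None:
        return None
    # Strip surrounding whitespace before converting the port to an int.
    return int(node.text.strip())
```

Reading /etc/vmware/vpxa/vpxa.cfg into a string and passing it to read_server_port returns the heartbeat port as an integer, which can then be compared with the default of 902.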

Method 1: Add a firewall rule to the ESXi host to allow traffic on the alternate heartbeat port

To add a firewall rule to the ESXi host:

Connect to the ESXi 5.0 host using SSH. For more information, see Using ESXi Shell in ESXi 5.0 and 5.1 (2004746).
Navigate to the /etc/vmware/firewall/ directory:

cd /etc/vmware/firewall/
Create and edit a new file named heartbeat.xml using the vi command:

vi heartbeat.xml
Enter the configuration info into the file as shown in this example:

<ConfigRoot>
  <service>
    <id>nondefheartbeat</id>
    <rule id='0000'>
      <direction>inbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <rule id='0001'>
      <direction>outbound</direction>
      <protocol>udp</protocol>
      <porttype>dst</porttype>
      <port>9020</port>
    </rule>
    <enabled>false</enabled>
    <required>false</required>
  </service>
</ConfigRoot>

Note: The alternate port number used in this example is 9020. Be sure to use the port number you determined earlier for your configuration.
Save and close the file.
Enable the new firewall rule by running these commands:

esxcli network firewall unload
esxcli network firewall load
esxcli network firewall refresh
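
To avoid typing the rule file by hand for a different port, the XML can be generated. This sketch mirrors the element names and values of the heartbeat.xml example above, including the enabled and required values shown there; the function name build_heartbeat_rule is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def build_heartbeat_rule(port, service_id="nondefheartbeat"):
    """Return firewall rule XML mirroring the example, for the given UDP port."""
    root = ET.Element("ConfigRoot")
    service = ET.SubElement(root, "service")
    ET.SubElement(service, "id").text = service_id
    # One inbound and one outbound rule, both matching the destination port.
    for rule_id, direction in (("0000", "inbound"), ("0001", "outbound")):
        rule = ET.SubElement(service, "rule", {"id": rule_id})
        ET.SubElement(rule, "direction").text = direction
        ET.SubElement(rule, "protocol").text = "udp"
        ET.SubElement(rule, "porttype").text = "dst"
        ET.SubElement(rule, "port").text = str(port)
    ET.SubElement(service, "enabled").text = "false"
    ET.SubElement(service, "required").text = "false"
    return ET.tostring(root, encoding="unicode")
```

The returned string can be written to heartbeat.xml, using the port number determined earlier from vpxa.cfg.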

Method 2: Configure vCenter Server to use the default port number 902

Notes:

Ensure that the heartbeat firewall rule is also set to default port 902 prior to changing the port for vCenter Server.
Before changing the port number back to the default 902, ensure that no other application installed on vCenter Server is using this port.
This procedure modifies the Windows registry. Before making any registry modifications, ensure that you have a current and valid backup of the registry and the virtual machine. For more information on backing up and restoring the registry, see the Microsoft Knowledge Base article 136393.

The preceding link was correct as of April 16, 2013. If you find the link is broken, provide feedback and a VMware employee will update the link.

To configure vCenter Server to use port number 902:

Stop the VMware VirtualCenter Server service. For more information, see Stopping, starting, or restarting vCenter services (1003895).
Click Start > Run, type regedit, and click OK. The Registry Editor window opens.
Navigate to:

HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\VMware VirtualCenter
Modify the registry key heartbeatport and change the value to 902.
Configure Windows Firewall to accept port 902:
1. Navigate to Windows Firewall > Allow a program or feature through Windows Firewall.
2. Select VMware vCenter Server > Host Heartbeat.
3. Click Details and change the port to 902.
Start the VMware VirtualCenter Server service.

Note: The ESXi host may show as disconnected. Reconnect the host so that the new configuration info is saved to the vpxa.cfg file.

Source:-
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2020100


Heartbeat Communication between vCenter and ESXi

Relationship in between Org Networks and Org VDC's

Saturday, 28 June 2014

Troubleshooting a virtual machine that has stopped responding (1007819)

Symptoms

A virtual machine running on VMware ESX/ESXi does not respond to any external input or exhibit any activity. Specifically:

Guest OS does not respond to keyboard or mouse activity at the console.
Guest OS does not respond to network communication, including ping, RDP, SSH, etc.
Virtual machine console screen is static, and does not change or refresh.
Tasks performed on the virtual machine fail, timeout, or do not start.
Virtual machine does not produce network or disk traffic.

Purpose

This article provides steps to isolate possible causes of a vSphere virtual machine becoming unresponsive.

An unresponsive virtual machine does not respond to any connection attempts and may be unable to respond to any attempts to power cycle it. There are a variety of reasons a virtual machine can end up in an unresponsive state. This article enables you to identify and resolve these common causes, and, when resolved, return the virtual machine to an operational state.

It is possible to hard power off a virtual machine without troubleshooting the cause, but this will prevent collection and analysis of information which could assist with determining the root cause of the outage. For more information about shutting down the virtual machine, see Powering off a virtual machine on an ESXi host (1014165) and Powering off an unresponsive virtual machine on an ESX host (1004340).

This article assumes that the issue is currently occurring. If troubleshooting an issue that occurred in the past, some required information may be unavailable.

Resolution

The services a virtual machine provides may become unresponsive or unreachable due to a number of causes, including problems with the applications or guest OS within the virtual machine, problems with the virtual machine monitor or virtual devices, resource contention on the host, or issues with underlying storage or networking infrastructure.

If the guest OS is producing any activity, it is successfully running. In this case, unresponsiveness is likely due to a connectivity problem or resource contention, or is specific to a higher-level component such as an application or service running within the guest OS.

Validate the scope

It is important to have accurate symptoms and an understanding of the scope of a problem. To confirm the scope of the problem, work through these checks:

Confirm that the virtual machine is actually unresponsive. It is possible that the virtual machine is not responding via one interface, but is functioning correctly on others. For more information on testing whether a virtual machine is genuinely unresponsive, see Confirming whether virtual machine is unresponsive (1007802).

If a virtual machine is responsive, but performing poorly, see Troubleshooting ESX virtual machine performance issues (2001003).
Verify that the virtual machine is powered on. If the virtual machine has been powered off unexpectedly, power it back on and then troubleshoot the cause of the unexpected shutdown. For more information, see:
- Powering on an ESX/ESXi host's virtual machine (1003738)
- Determining why a virtual machine was powered off or restarted (1019064).
Note: If a virtual machine is powered off and cannot be powered back on, see Troubleshooting a virtual machine that is unable to power on (2001005).
Determine whether this issue is affecting multiple virtual machines or just one. If multiple virtual machines are affected, consider the similarities between the affected virtual machines when attempting to narrow the potential scope. In particular, focus on shared infrastructure which the group of affected virtual machines depend on, and whether all virtual machines depending on that common infrastructure are affected. For more information, see Assessing commonalities of an outage affecting multiple virtual machines (1019000).
Determine whether the guest OS is responsive to interaction at the virtual machine console. If an issue has been isolated to the guest OS or applications within the virtual machine, and the guest OS is responsive at the console, interact with the guest OS at the console to address the problem. For more information, see Troubleshooting virtual machine network connection issues (1003893).
Determine whether the guest OS or its application services are responsive to interaction via the network. If the guest OS or services respond to network communication but the console is unresponsive or non-functional, see Cannot open the virtual machine console (749640) or Ensuring that a virtual machine is not inaccessible due to a VMware vCenter or VirtualCenter issue (1007808).
Determine whether the guest OS has reported any critical errors to the console, and is sitting in a halted state. For more information, see Identifying critical Guest OS failures within virtual machines (1003999).
Determine whether the ESX/ESXi host is unresponsive too. If the host is unresponsive as well, the scope is larger than initially assumed. For more information, see Determining why an ESX/ESXi host does not respond to user interaction at the console (1017135).

Identify the cause

At this point, you have established that one or more virtual machines are unresponsive at both the virtual console and via the network. The host itself is responsive. A problem may exist with resource accessibility or contention, or with underlying storage or networking infrastructure.

To identify the cause:

Determine whether the problem is triggered by an operation or task being performed on the virtual machine. For example, snapshot and vMotion operations both stun a virtual machine for brief periods of time while memory state is copied across the network or to disk. For more information, see Taking a snapshot with virtual machine memory stuns the virtual machine while the memory is written to disk (1013163).
Some common configuration errors can lead to a virtual machine becoming unresponsive, such as while waiting for a resource. Review the virtual machine and host configuration. For more information, see:
- Common ESX/ESXi host configuration issues which can cause virtual machines to become unresponsive (1007813)
- Common ESX/ESXi virtual machine configuration issues which can cause virtual machines to become unresponsive (1007814)
Virtual machines depend on functional backing infrastructure. If there is an issue with the backing storage or networking infrastructure which the virtual machine depends on, the virtual hardware which a virtual machine presents to the guest OS may be impacted. Address the underlying storage or networking issue. For more information, see:
Virtual machines depend on available host resources (CPU, Memory), and the guest OS consumes those resources. A problem with resource availability or scheduling inside or outside the virtual machine may cause it to become unresponsive. The virtual machine may also be blocking on unavailable resources or spinning at 100% vCPU utilization. For more information, see Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison (1017926).

Action Plan

At this point, you have established that the host running the virtual machine(s) is both responsive and not encountering any shared storage or networking infrastructure issues. The guest OS has not failed with a critical error, but remains unresponsive at the virtual machine console and via the network.

Take action to recover or collect information about the unresponsive virtual machine based on the architectural layer which is suspect:

If an issue has been isolated to the guest OS, or the %RUN is relatively high, but the virtual machine monitor is functioning correctly, move investigation to within the virtual machine's guest OS or applications. A guest OS can become unresponsive inside a virtual machine in the same way it can on physical hardware. For more information, see Troubleshooting unresponsive guest operating system issues (1007818).
1. Collect performance data while the problem is happening. For more information, see Using performance collection tools to gather data for fault analysis (1006797).
2. Attempt to manually induce a panic of the kernel inside the guest OS to collect additional information about its internal state. For more information, see:
  - Using virtual NMI facilities to troubleshoot unresponsive virtual machines on ESX/ESXi (1009187)
  - Microsoft article 927069: How to generate a complete crash dump file or a kernel crash dump by using an NMI on a Windows-based system
  - Microsoft article 303021: How to generate a memory dump file when a server stops responding
  - Linux Documentation Project article: Magic SysRq key
    
    Note: The preceding links were correct as of August 31, 2011. If you find a link is broken, provide feedback and a VMware employee will update the link.
  If useful diagnostic information is produced by the guest OS in response to one of these events, engage the guest OS vendor to investigate further.
3. If step 2 does not produce useful information, suspend the virtual machine to collect information about its internal state and open a case with VMware Support:
  1. Suspend the virtual machine and collect the .vmss suspend state file. For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information (2005831).
  2. Collect logs from the host running the virtual machine. For more information see Collecting diagnostic information for VMware products (1008524).
  3. Power the virtual machine back on, then reset it.
  4. Engage VMware Support, providing the information collected in step 1 and in sub-steps 1 and 2 of step 3. For more information, see How to File a Support Request.
  Note: If the virtual machine cannot be suspended because another management task is in progress, see Collecting information about tasks in VMware ESX and ESXi (1013003) and Restarting the Management agents on an ESX or ESXi Server (1003490). If attempts to suspend the virtual machine fail and no management task appears to be present, skip to the next section and attempt to crash the virtual machine.
If an issue has been isolated to the virtual machine monitor, or the %WAIT is relatively high, or attempts to suspend the virtual machine have failed, collect performance data and forcefully crash the virtual machine to collect additional information about its internal state.
1. Collect performance data while the problem is happening. For more information, see Using performance collection tools to gather data for fault analysis (1006797).
2. Crash the virtual machine to collect information about its internal state. For more information, see Crashing a virtual machine on ESX/ESXi to collect diagnostic information (2005715).
  
  Note: If attempts to crash the virtual machine fail, skip to the next section and attempt to crash the host.
3. Engage VMware Support, providing the information collected in steps 1 and 2. For more information, see How to File a Support Request.
If an issue has been isolated to the virtual machine monitor, but attempts to suspend or crash the virtual machine fail, this reflects a problem with the VMkernel. Collect a log bundle from the host, evacuate all unaffected virtual machines from the host, and use an NMI to intentionally generate a purple diagnostic screen.
1. Collect performance data while the problem is happening. For more information, see Using performance collection tools to gather data for fault analysis (1006797).
2. Move all unaffected virtual machines off of the host using vMotion. If possible, use Maintenance Mode to prevent additional virtual machines from being started on the host.
3. Configure the host to panic on receiving a non-maskable interrupt, and then issue an NMI to trigger a panic. For more information, see Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767).
4. After the host has generated a purple diagnostic screen and completed dump of diagnostic information, take a screenshot or photograph of the console and restart the host.
5. Collect diagnostic information from the host. For more information, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen (1004128).
6. Engage VMware Support, providing the information collected in steps 1, 4 and 5. For more information, see How to File a Support Request.
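
The escalation order in the action plan above can be summarized as a small decision function. This is only a sketch of the flow as described in this article: the labels "guest", "vmm", and "host" and the function name next_step are hypothetical, and the judgment of which layer is suspect (for example, from %RUN or %WAIT) is left to the operator.

```python
def next_step(suspect_layer, suspend_failed=False, crash_failed=False):
    """Return which branch of the action plan applies: guest, vmm, or host.

    suspect_layer: "guest" or "vmm" (labels invented for this sketch).
    """
    if suspect_layer not in ("guest", "vmm"):
        raise ValueError("suspect_layer must be 'guest' or 'vmm'")
    if crash_failed:
        # Crashing the virtual machine failed: treat as a VMkernel problem.
        return "host"
    if suspect_layer == "vmm" or suspend_failed:
        # Virtual machine monitor suspected, or suspend did not work.
        return "vmm"
    return "guest"
```

For example, a guest-level investigation where the suspend attempt fails moves to the "vmm" branch (crash the virtual machine), and a failed crash attempt moves to the "host" branch (evacuate the host and trigger an NMI).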
