Friday, March 15, 2013

Troubleshooting vCenter disconnects

Recently we saw a case at a customer site where the ESXi host's connection to vCenter dropped during
backups or during vMotion. Possible causes and things to check:


1. VMware might not support more than 100 connections per vmk port.
2. It is good to separate vMotion onto its own vmk NIC and VLAN rather than sharing the management port (make sure the vmks are in different subnets).
3. The firewall in ESXi might not be able to handle the traffic.
4. Check ethtool for any network drops (checking for duplicate IPs always helps).
5. An NFS datastore has more than 20,000 files (ls -lR | wc -l).
6. ESXi resources (CPU/memory; there might be other performance issues).
7. Disable HA/DRS to see if the disconnects disappear (related to the datastores with
a lot of files).
8. vpxa/vpxd crashes.


1. How to check the number of connections per vmk?

Quick script:

~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c
 1 ------------------
394 10.X.Y.150   <<<< all were stuck at FIN_WAIT_2
1 Send
~ # esxcli network ip connection list | sort -k 4
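
The quick script above counts every connection per local address. A minimal sketch that counts only the sockets stuck in FIN_WAIT_2 is shown here. The function name count_fin_wait2 is made up for illustration, and it assumes the state is the 6th whitespace-separated column of the connection list, which may differ between ESXi builds.

```shell
# Hypothetical helper: reads "esxcli network ip connection list" output on stdin
# and prints "<count> <local ip>" for sockets in FIN_WAIT_2.
# Assumption: column 4 is the local address:port and column 6 is the state.
count_fin_wait2() {
  awk '$6 == "FIN_WAIT_2" {
         sub(/:[0-9]+$/, "", $4)   # strip the port, keep the IP
         n[$4]++
       }
       END { for (ip in n) print n[ip], ip }'
}

# Usage on the host:
#   esxcli network ip connection list | count_fin_wait2
```

A large count against the management IP points at the FIN_WAIT_2 buildup described in this post.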

2. Separate the vMotion and management vmknic/VLAN.

3. Firewall settings:
~ # esxcli network firewall get
   Default Action: PASS
   Enabled: false 
   Loaded: true ( it should be loaded for HA to work)

4. These commands show any errors or packet drops caused by the switch, cable, or port:

     "ethtool -S vmnic0 |grep error| grep -v :\ 0"
     "ethtool -S vmnic0 |grep rx_no_buffer_count | grep -v :\ 0"
     "ethtool -S vmnic0 |grep drop"
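
The three greps above can be folded into one filter. This is only a sketch: nonzero_counters is a made-up name, and it assumes ethtool -S prints one "name: value" pair per line.

```shell
# Hypothetical helper: reads "ethtool -S <vmnic>" output on stdin and prints
# only the error/drop/rx_no_buffer_count counters whose value is not zero.
nonzero_counters() {
  awk -F': *' '/error|drop|rx_no_buffer_count/ && $2 + 0 != 0 {
         gsub(/^ +/, "", $1)       # trim the leading indentation
         print $1 ": " $2
       }'
}

# Usage on the host:
#   ethtool -S vmnic0 | nonzero_counters
```

No output means none of those counters has moved on that vmnic.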

5. Duplicate IP address - run arp commands from a different node while pinging ESXi,
or ssh to ESXi and see if the session keeps disconnecting.

 esxcli network ip neighbor list   (see if any MAC entries are flapping)
 arp -a   (linux)

6. Check esxcfg-vmknic -l and make sure there are no duplicate or conflicting IPs and no multiple vmks in the same subnet.
7. Finally, review vpxa.log, vmkernel.log (for APD) and hostd.log on ESXi, as well as vpxd.log on vCenter. Check the vmkernel logs for vpxa crashes.

8. When it is hung, the fix is to restart the services (restart).

PR 831801: The default value of the FIN_WAIT_2 timer was erroneously set to TCPTV_KEEPINTVL * TCP_KEEPCNT = 75 * 0x400. This discrepancy causes sockets in the FIN_WAIT_2 state to persist much longer than intended, and if many such sockets accumulate, they might impact new socket creation.
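
For scale, the product in the PR text can be worked out in the shell. The variable name is made up for illustration, and the PR text quoted here does not say which unit the timer uses.

```shell
# 75 * 0x400 from the PR text; 0x400 is 1024 in decimal.
fin_wait_2_default=$((75 * 0x400))
echo "$fin_wait_2_default"   # prints 76800
```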

10. Disable the firewall on ESXi (use these settings exactly, so HA will work):
 # esxcli network firewall get
   Default Action: DROP or PASS
   Enabled: false
   Loaded: true
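
The Enabled/Loaded combination can be checked with a small filter instead of by eye. This is a sketch only: firewall_ok_for_ha is a made-up name, and it assumes the "Enabled:" and "Loaded:" lines look like the output shown in this post.

```shell
# Hypothetical helper: reads "esxcli network firewall get" output on stdin.
# Returns 0 only when Enabled is false and Loaded is true.
firewall_ok_for_ha() {
  awk -F': *' '/Enabled/ { e = $2 }
               /Loaded/  { l = $2 }
               END { exit !(e ~ /^false/ && l ~ /^true/) }'
}

# Usage on the host:
#   esxcli network firewall get | firewall_ok_for_ha && echo "firewall settings OK"
```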

12. Increase the vpxa thread count.

13. If an NFS datastore has more than 20,000 files, vpxa could hang.

14. If there is memory contention, make sure ESXi has 2 GB of memory reserved.