Friday, March 15, 2013

vCenter disconnect troubleshooting

Recently we had a customer whose ESXi hosts' connection to vCenter dropped during backups or
during vMotion.

Synopsis:

1. VMware might not support more than 100 connections per vmk port.
2. It is better to put vMotion on its own vmk NIC and VLAN than to share the management port (make sure the vmks are in different subnets).
3. The ESXi firewall might not be able to handle the traffic.
4. Check ethtool for any network drops (checking for duplicate IPs always helps).
5. The NFS datastore has more than 20,000 files (ls -lR | wc -l).
6. ESXi resources (CPU/memory; there might be other performance issues).
7. Disable HA/DRS to see if the disconnects disappear (related to datastores with
a lot of files).
8. vpxa/vpxd crashes.

Details:

1. How to check the number of connections per vmk?

Quick script:


~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c
 1 ------------------
14 0.0.0.0
394 10.X.Y.150   <<<< all were stuck at FIN_WAIT_2
40 127.0.0.1
7 192.168.5.1
1 Send

To see the full connection list, sorted by local address:
~ # esxcli network ip connection list | sort -k 4
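
Since the problem connections above were all stuck in FIN_WAIT_2, it can help to break the count down by TCP state as well as by IP. The following is only a minimal sketch: count_conn_states is a hypothetical helper name, and it assumes the local address is the 4th whitespace-separated field (as in the one-liner above) and the TCP state is the 6th. Check the header of the output on your build before relying on it.

```shell
# Count TCP connections per local IP and state, busiest first.
# Assumption: field 4 = local address (ip:port), field 6 = TCP state.
count_conn_states() {
    awk '$1 ~ /^tcp/ {
        ip = $4
        sub(/:[0-9]+$/, "", ip)        # strip the trailing :port
        count[ip " " $6]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# On the ESXi host (not run here):
#   esxcli network ip connection list | count_conn_states
```

A single IP with hundreds of FIN_WAIT_2 entries at the top of the output would match the symptom described here.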

2. Separate the vMotion and management vmknics/VLANs.

3. Firewall settings:
~ # esxcli network firewall get
   Default Action: PASS
   Enabled: false
   Loaded: true   <<<< it should be loaded for HA to work

4. These commands show any errors or packet drops caused by the switch, cable, or port:

~ # ethtool -S vmnic0 | grep error | grep -v :\ 0
~ # ethtool -S vmnic0 | grep rx_no_buffer_count | grep -v :\ 0
~ # ethtool -S vmnic0 | grep drop

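The ethtool checks above can be folded into one filter and run across several uplinks. This is a rough sketch, assuming the driver prints simple "name: value" counter lines; nonzero_counters is a hypothetical helper name, and the NIC names in the usage comment are only examples.

```shell
# Keep only error/drop counters whose value is greater than zero.
# Reads "ethtool -S <nic>" output ("name: value" lines) on stdin.
nonzero_counters() {
    awk -F: '/error|drop|rx_no_buffer_count/ && $2 + 0 > 0'
}

# On the ESXi host (not run here; the NIC names are examples):
#   for nic in vmnic0 vmnic1; do
#       echo "== $nic"; ethtool -S "$nic" | nonzero_counters
#   done
```

Running it twice a few minutes apart shows whether a counter is still increasing or is just left over from an old event.
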
5. Duplicate IP address: while pinging the ESXi host, run the arp commands from a different node,
or ssh to the ESXi host and see if the session keeps disconnecting.

~ # esxcli network ip neighbor list   <<<< see if any MAC entries are flapping
arp -a (Linux)
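
Spotting a flapping MAC by eye is easy to miss, so one option is to sample the neighbor table a few times and look for any IP that shows up with more than one MAC. The sketch below assumes the first two columns of each data line are the IP and the MAC address; flapping_macs and the /tmp/neigh.txt path are hypothetical names chosen for illustration.

```shell
# Report IPs that were seen with more than one MAC address.
# Reads lines whose first two fields are "IP MAC" on stdin;
# other lines (headers, separators) are skipped.
flapping_macs() {
    awk 'length($2) == 17 && $2 ~ /^[0-9A-Fa-f:]+$/ {
        mac = tolower($2)
        key = $1 " " mac
        if (!(key in seen)) {
            seen[key] = 1
            n[$1]++
            macs[$1] = macs[$1] " " mac
        }
    }
    END { for (ip in n) if (n[ip] > 1) print ip ":" macs[ip] }'
}

# On the ESXi host (not run here): sample a few times, then check.
#   for i in 1 2 3 4 5; do esxcli network ip neighbor list; sleep 10; done > /tmp/neigh.txt
#   flapping_macs < /tmp/neigh.txt
```

Any line it prints is an IP that was answered by two different MAC addresses during the sampling window, which is a strong hint of a duplicate IP.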



6. Run esxcfg-vmknic -l and make sure there are no duplicate or conflicting IPs and no multiple vmks in the same subnet.
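
Checking for vmks that share a subnet comes down to masking each IP with its netmask and comparing the results. Because the esxcfg-vmknic -l output is wide and can be awkward to parse, this sketch instead reads hand-copied "vmk ip netmask" lines; network_of and shared_subnets are hypothetical helper names, and only dotted IPv4 netmasks are handled.

```shell
# Print the network address for an IPv4 address and dotted netmask.
# The body runs in a subshell so the IFS change stays local.
network_of() (
    IFS=.
    set -- $1 $2            # split both dotted quads into $1..$8
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# Read "vmk ip netmask" lines and report subnets used by more than one vmk.
shared_subnets() {
    while read -r vmk ip mask; do
        echo "$(network_of "$ip" "$mask") $vmk"
    done | sort | awk '{ n[$1]++; who[$1] = who[$1] " " $2 }
        END { for (s in n) if (n[s] > 1) print s ":" who[s] }'
}

# Example (addresses are made up):
#   printf 'vmk0 10.1.2.150 255.255.255.0\nvmk1 10.1.2.151 255.255.255.0\n' | shared_subnets
```

No output means every vmk listed sits in its own subnet.
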
7. Finally, review vpxa.log, vmkernel.log (for APD events and vpxa crashes), and hostd.log on the ESXi host, as well as vpxd.log on vCenter.

8. If the host is hung, the fix is to restart the management services (services.sh restart).

PR 831801: The default value of the FIN_WAIT_2 timer was erroneously set to TCPTV_KEEPINTVL * TCP_KEEPCNT = 75 * 0x400. This discrepancy causes sockets in the FIN_WAIT_2 state to exist for a much longer time, and if many such sockets accumulate, they might impact new socket creation.

10. Disable the firewall on ESXi (use exactly these settings so HA will work):
 # esxcli network firewall get
   Default Action: DROP or PASS
   Enabled: false
   Loaded: true
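
When checking many hosts, the two settings that matter here (Enabled: false, Loaded: true) can be verified with a small filter instead of by eye. This is a sketch under the assumption that the output keeps the "Key: value" layout shown above; firewall_ok_for_ha is a hypothetical helper name, and Default Action is ignored because either value is listed as acceptable.

```shell
# Succeed (exit 0) only if the firewall is disabled but still loaded.
# Reads "esxcli network firewall get" output on stdin.
firewall_ok_for_ha() {
    awk '
        $1 == "Enabled:" { enabled = $2 }
        $1 == "Loaded:"  { loaded = $2 }
        END { exit !(enabled == "false" && loaded == "true") }'
}

# On the ESXi host (not run here):
#   esxcli network firewall get | firewall_ok_for_ha && echo "firewall settings OK"
```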

12. Increase the vpxa thread count:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009217

13. If an NFS datastore has more than 20,000 files, vpxa could hang.
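
Note that ls -lR | wc -l also counts directory headers, "total" lines, and blank lines, so it overstates the file count somewhat. A sketch that counts only regular files and compares against the 20,000 figure is shown here; count_files and check_datastore are hypothetical helper names, and the datastore path in the usage comment is a placeholder.

```shell
# Count regular files under a directory (recursively).
count_files() {
    echo $(( $(find "$1" -type f | wc -l) ))
}

# Warn when a datastore holds more files than the limit (default 20000).
check_datastore() {
    n=$(count_files "$1")
    limit=${2:-20000}
    if [ "$n" -gt "$limit" ]; then
        echo "WARN: $1 has $n files (limit $limit)"
    else
        echo "OK: $1 has $n files"
    fi
}

# On the ESXi host (not run here; the datastore name is a placeholder):
#   check_datastore /vmfs/volumes/my_nfs_datastore
```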

14. If there is memory contention, make sure ESXi has 2 GB of memory reserved.