nutanix: vCenter disconnects troubleshooting.

We recently worked with a customer whose ESXi hosts dropped their connection to vCenter during backups or during vMotion.

Synopsis:

1. VMware might not support more than 100 connections per vmk port.

2. It is good practice to move vMotion onto its own vmk NIC and VLAN instead of sharing the management port (make sure the vmks are in different subnets).

3. Firewall in ESXi might not be able to handle the traffic

4. Check ethtool statistics for network drops (checking for duplicate IPs always helps).

5. NFS datastore has more than 20,000 files. (ls -lR |wc -l)

6. ESXi resources (cpu/memory - there might be other performance issues)

7. Disable HA/DRS to see if the disconnects disappear (related to datastores with a lot of files).

8. vpxa/vpxd crashes.

Details:

1. How to check the number of connections per vmk ?

(http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2021629)

Quick script:

~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c

1 ------------------
14 0.0.0.0
394 10.X.Y.150 <<<< all were stuck at FIN_WAIT_2
40 127.0.0.1
7 192.168.5.1
1 Send

~ # esxcli network ip connection list | sort -k 4
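The one-liner above groups connections by local IP only. As a minimal sketch, the same count can be broken down by TCP state from a saved copy of the `esxcli network ip connection list` output, which makes a pile-up in FIN_WAIT_2 easy to spot. The function name, the sample addresses, and the assumed column order (local address in column 4, state in column 6 for TCP rows) are illustrative assumptions, not something taken from the KB article.

```python
from collections import Counter

# Hypothetical sample of saved `esxcli network ip connection list` output.
# The column layout is an assumption; verify it against your ESXi build.
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address     Foreign Address    State        World ID
-----  ------  ------  ----------------  -----------------  -----------  --------
tcp         0       0  10.0.0.150:443    10.0.0.20:51234    FIN_WAIT_2   0
tcp         0       0  10.0.0.150:443    10.0.0.20:51235    FIN_WAIT_2   0
tcp         0       0  10.0.0.150:902    10.0.0.21:40000    ESTABLISHED  1234
udp         0       0  0.0.0.0:123       0.0.0.0:0                       5678
"""


def count_connections(listing: str) -> Counter:
    """Count connections per (local IP, state) in a saved connection listing."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator.
        if len(fields) < 5 or fields[0].lower() not in ("tcp", "udp"):
            continue
        # Strip the port; rsplit keeps addresses that contain colons intact.
        local_ip = fields[3].rsplit(":", 1)[0]
        # UDP rows have no state column, so only read it for TCP rows.
        is_tcp = fields[0].lower() == "tcp"
        state = fields[5] if is_tcp and len(fields) > 5 else "-"
        counts[(local_ip, state)] += 1
    return counts


if __name__ == "__main__":
    for (ip, state), n in count_connections(SAMPLE).most_common():
        print(n, ip, state)
```

A large count for one vmk IP in a single closing state, as in the customer case above, points at stuck sockets rather than genuine load.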

2. Separate the vMotion and management vmknic/VLAN.

3. Firewall settings:

~ # esxcli network firewall get

Default Action: PASS

Enabled: false

Loaded: true ( it should be loaded for HA to work)

4. These commands show any errors or packet drops caused by the switch, cable, or port:

"ethtool -S vmnic0 |grep error| grep -v :\ 0"

"ethtool -S vmnic0 |grep rx_no_buffer_count | grep -v :\ 0"

"ethtool -S vmnic0 |grep drop"
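When several vmnics need checking, it can be easier to save the `ethtool -S` output and filter it in one pass. The sketch below is a rough equivalent of the three greps above: it keeps only counters whose name contains "error", "drop", or "rx_no_buffer_count" and whose value is non-zero. The function name and the sample counters are hypothetical, and the `name: value` line format is assumed from typical ethtool output.

```python
# Hypothetical sample of saved `ethtool -S vmnic0` output.
ETHTOOL_SAMPLE = """\
NIC statistics:
     rx_packets: 1000
     rx_errors: 0
     tx_errors: 3
     rx_no_buffer_count: 12
     tx_dropped: 0
"""


def nonzero_counters(stats, keywords=("error", "drop", "rx_no_buffer_count")):
    """Return {counter_name: value} for non-zero counters matching a keyword."""
    result = {}
    for line in stats.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # not a "name: value" line
        try:
            count = int(value)
        except ValueError:
            continue  # e.g. the "NIC statistics:" heading has no number
        name = name.strip()
        if count != 0 and any(key in name for key in keywords):
            result[name] = count
    return result


if __name__ == "__main__":
    for name, count in nonzero_counters(ETHTOOL_SAMPLE).items():
        print(name, count)
```

Counters that keep climbing between two captures are more telling than a single non-zero value, so it is worth comparing two snapshots taken a few minutes apart.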

5. Duplicate IP address: ping the ESXi host and run arp from a different node, or ssh to the ESXi host and see if the session keeps disconnecting.

esxcli network ip neighbor list (on ESXi; see if any MAC entries are flapping)
arp -a (on Linux)

6. Run esxcfg-vmknic -l and make sure there are no duplicate or conflicting IPs and no multiple vmks in the same subnet.
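The vmk check in item 6 can be sketched with Python's standard `ipaddress` module. The example assumes you have already copied each vmk's name, IP, and netmask out of the `esxcfg-vmknic -l` output by hand; the function name and the addresses in the usage note are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_vmk_conflicts(vmks):
    """Report vmk pairs that share an IP or sit in overlapping subnets.

    vmks is a list of (name, ip, netmask) tuples, for example values
    copied from `esxcfg-vmknic -l`.
    """
    conflicts = []
    for (name_a, ip_a, mask_a), (name_b, ip_b, mask_b) in combinations(vmks, 2):
        if_a = ipaddress.ip_interface(f"{ip_a}/{mask_a}")
        if_b = ipaddress.ip_interface(f"{ip_b}/{mask_b}")
        if if_a.ip == if_b.ip:
            conflicts.append((name_a, name_b, "duplicate IP"))
        elif if_a.network.overlaps(if_b.network):
            conflicts.append((name_a, name_b, "same subnet"))
    return conflicts
```

For example, `find_vmk_conflicts([("vmk0", "10.0.0.5", "255.255.255.0"), ("vmk1", "10.0.0.6", "255.255.255.0")])` flags the pair as sharing a subnet, which is the situation item 2 of the synopsis warns about for management and vMotion.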

7. Finally, review vpxa.log, vmkernel.log (for APD events) and hostd.log on ESXi, as well as vpxd.log on vCenter. Check the vmkernel logs for signs of a vpxa crash.

8. If the host is hung, the fix is to restart the services (services.sh restart).

9. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2019108

PR 831801: The default value of the FIN_WAIT_2 timer was erroneously set to TCPTV_KEEPINTVL * TCP_KEEPCNT = 75 * 0x400. This discrepancy causes sockets in the FIN_WAIT_2 state to exist for a much longer time, and if multiple such sockets accumulate, they might impact new socket creation.

10. Disable the firewall on ESXi (use exactly these settings, so HA will work):

# esxcli network firewall get
   Default Action: DROP or PASS
   Enabled: false
   Loaded: true

12. Increase VPXA thread.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009217

13. NFS datastore has more than 20,000 files - vpxa could hang.

14. If there is memory contention, make sure ESXi has 2G of memory reserved.
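For the file-count check in item 13, `ls -lR | wc -l` also counts directory headers, "total" lines, and blank lines, so it overestimates somewhat. A minimal sketch of a stricter count, assuming Python is available wherever the datastore is mounted, walks the tree and counts only file entries. The function names and the constant name are illustrative; the 20,000 threshold is the one quoted in this post.

```python
import os

# Threshold quoted in this post; not an official VMware limit as far as
# this sketch is concerned.
FILE_LIMIT = 20_000


def count_files(root: str) -> int:
    """Count the files (not directories) under root, recursively."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def exceeds_limit(root: str, limit: int = FILE_LIMIT) -> bool:
    """Return True when the tree under root holds more files than limit."""
    return count_files(root) > limit
```

Point `count_files` at the datastore mount point, for example a path under /vmfs/volumes/, and compare the result with the threshold before digging into vpxa hangs.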

nutanix

Friday, March 15, 2013

