Recently we came across a customer case where the ESXi connection to vCenter drops during backup or during vMotion.
Synopsis:
1. VMware might not support more than 100 connections per vmk port.
2. It is good to separate vMotion onto its own vmk NIC and VLAN rather than sharing the management port (make sure the vmks are in different subnets).
3. The firewall in ESXi might not be able to handle the traffic.
4. Check ethtool for any network drops (checking for duplicate IPs always helps).
5. NFS datastore has more than 20,000 files. (ls -lR |wc -l)
6. ESXi resources (cpu/memory - there might be other performance issues)
7. Disable HA/DRS to see if disconnects disappear. (related to the datastores with a lot of files)
8. vpxa/vpxd crashes.
Details:
1. How to check the number of connections per vmk ?
Quick script:
~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c
1 ------------------
14 0.0.0.0
394 10.X.Y.150 <<<< all were stuck at FIN_WAIT_2
40 127.0.0.1
7 192.168.5.1
1 Send
~ # esxcli network ip connection list | sort -k 4
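The one-liner above counts connections per local IP but does not show their TCP state. A minimal sketch that groups by both, assuming the column layout seen in the output above (column 4 is the local address) and assuming column 6 holds the TCP state. The function name count_by_ip_state is made up for illustration, and only tcp rows are counted:

```shell
# Summarize TCP connections per local IP and state.
# Reads `esxcli network ip connection list` output on stdin.
# Assumption: field 4 = local address (ip:port), field 6 = TCP state.
count_by_ip_state() {
    awk '$1 == "tcp" {
             split($4, addr, ":")        # keep the IP, drop the port
             n[addr[1] " " $6]++
         }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Usage on the host:
#   esxcli network ip connection list | count_by_ip_state
```

A large count next to FIN_WAIT_2 for one vmk address would match the stuck connections shown above.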
2. Separate the vMotion and management vmknic/VLAN.
3. Firewall settings:
~ # esxcli network firewall get
Default Action: PASS
Enabled: false
Loaded: true ( it should be loaded for HA to work)
4. These commands show any errors or packet drops caused by the switch, cable, or port:
"ethtool -S vmnic0 |grep error| grep -v :\ 0"
"ethtool -S vmnic0 |grep rx_no_buffer_count | grep -v :\ 0"
"ethtool -S vmnic0 |grep drop"
5. Duplicate IP address - run arp commands from a different node while pinging ESXi,
or ssh to ESXi and see if the session keeps disconnecting.
esxcli network ip neighbor list -- see if any MAC entries are flapping.
arp -a (linux)
6. Check esxcfg-vmknic -l to make sure none of the IPs are duplicated or in conflict, and that no two vmks are in the same subnet.
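The same-subnet check in step 6 can be sketched as a small helper. It does not parse `esxcfg-vmknic -l` directly, since port group names may contain spaces. Instead it assumes you feed it one "vmk ip netmask" line per interface, copied from that output. The names network_of and shared_subnets are hypothetical:

```shell
# Print the IPv4 network address for an address and dotted netmask.
network_of() {
    ip=$1 mask=$2
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask      # split into octets: $1..$4 (ip) and $5..$8 (mask)
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# Read "vmk ip netmask" lines on stdin and print every network
# that is used by more than one vmk.
shared_subnets() {
    while read -r vmk ip mask; do
        echo "$(network_of "$ip" "$mask") $vmk"
    done | sort | awk '{ c[$1]++; names[$1] = names[$1] " " $2 }
                       END { for (n in c) if (c[n] > 1) print n ":" names[n] }'
}

# Usage (values are examples only):
#   printf 'vmk0 10.1.1.5 255.255.255.0\nvmk1 10.1.1.9 255.255.255.0\n' | shared_subnets
```

Any line it prints names vmks that share a subnet and are candidates for moving to a separate VLAN.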
7. Finally, review vpxa.log, vmkernel.log (APD), and hostd.log on ESXi, as well as vpxd.log on vCenter. Check the vmkernel logs for vpxa crashes.
8. The fix when it is hung is to restart services (services.sh restart).
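For the log review in step 7, a quick first pass is to count how often a keyword shows up in each log before reading them in full. This is a sketch only. The function name count_log_hits is hypothetical, and the /var/log paths in the usage comment are assumptions about where the host keeps these logs:

```shell
# Count lines containing a pattern in each readable log file.
# Prints "<file> <count>" per file and skips files that do not exist.
count_log_hits() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        echo "$f $(grep -c -- "$pattern" "$f")"
    done
}

# Usage on the host (paths assumed):
#   count_log_hits 'APD' /var/log/vmkernel.log /var/log/vpxa.log /var/log/hostd.log
```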
PR 831801: The default value of the FIN_WAIT_2 timer was erroneously set to TCPTV_KEEPINTVL * TCP_KEEPCNT = 75 * 0x400. This discrepancy causes sockets in the FIN_WAIT_2 state to exist for a much longer time, and if multiple such sockets accumulate, they might impact new socket creation.
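To get a feel for the size of that value, the product quoted in the PR text can be evaluated directly. The PR text does not state the unit, so this sketch only computes the number. The function name is made up:

```shell
# Evaluate the product quoted in the PR text: 75 * 0x400 (0x400 = 1024).
fin_wait2_timer_value() {
    echo $(( 75 * 0x400 ))
}

# fin_wait2_timer_value   # prints 76800
```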
10. Disable the firewall on ESXi (use these settings exactly, so HA will work)
# esxcli network firewall get
Default Action: DROP or PASS
Enabled: false
Loaded: true
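A sketch of how to reach and verify the state shown above (Enabled: false, Loaded: true). The checker reads the `esxcli network firewall get` output on stdin and its name, firewall_state_ok, is hypothetical. The `set` and `load` commands in the usage comment should be checked against your ESXi version before use:

```shell
# Succeed only if the firewall is disabled but still loaded.
# Reads `esxcli network firewall get` output on stdin.
firewall_state_ok() {
    awk '$1 == "Enabled:" { e = $2 }
         $1 == "Loaded:"  { l = $2 }
         END { if (e == "false" && l == "true") exit 0; exit 1 }'
}

# Usage on the host:
#   esxcli network firewall set --enabled false
#   esxcli network firewall load
#   esxcli network firewall get | firewall_state_ok && echo "firewall disabled, still loaded"
```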
12. Increase VPXA thread.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009217
13. NFS datastore has more than 20,000 files - vpxa could hang.
14. If there is memory contention, make sure ESXi has 2G of memory reserved.
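For the NFS file-count check in step 13, `ls -lR | wc -l` also counts directory headers, "total" lines, and blank lines, so it overstates the number. A sketch using find instead, with the 20,000 figure from this note as the default threshold. The function names are hypothetical and the datastore path in the usage comment is a placeholder:

```shell
# Count regular files under a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeed if the directory holds more files than the limit (default 20000).
over_file_limit() {
    [ "$(count_files "$1")" -gt "${2:-20000}" ]
}

# Usage on the host (placeholder path):
#   over_file_limit /vmfs/volumes/<nfs-datastore> && echo "more than 20000 files"
```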