Friday, June 1, 2012

Vcenter Disconnects due to APD or Firewall

When ESXi5.0 disconnects temporarily from the Vcenter, more often or not, it is related to the Storage issues and other times it could be ESXi5.0 firewall.

Storage Issues: (APD)

Symptoms:

Vcenter Disconnects and Errors in VMKernel Log:

2011-07-30T14:47:41.361Z cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60a98000572d54724a34642d71325763" - failed to issue command due to Not found (APD), try again...

To prevent APD during Storage Array maintenance:

1. If you have maintenance window, on Vcenter -> ESXi host -> Configuration -> Storage - unmount the datastore  (datastore tab) and then detach the device( devices tab). ( u need VMs to be powered off, no heartbeat datastore and there are quite a few prerequistes)

2. The following commands also can be executed to prevent APD, if you have too many devices to do the unmount.( depending on SCSI timeout values, it is good to follow  step 1)

To enable w/o requiring downtime during storage Array maintenance, execute:  ( this might help in not having to unmount/detach datastores/devices)
# esxcfg-advcfg -s 1/VMFS3/FailVolumeOpenIfAPD

Note: This command might prevent new storage devices to be discovered. So use only during maintenance.
Revert the changes back via the following command.
To check the value of this option, execute:
# esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD
Revert it back after the maintenance:
# esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD

Firewall  Issues Causing Vcenter Disconnects:
Symptoms : Other than vcenter disconnects
 Vpxa.log (on ESXi host):
Stolen/full sync required message:
"2012-02-02T18:32:49.941Z [6101DB90 info 'Default'
opID=HB-host-56@2627-b61e8cd4-e4] [VpxaMoService::GetChangesInt]
Forcing a full host synclastSentMasterGen = 2627 MasterGenNo from vpxd
= 2618
2012-02-02T18:32:49.941Z [6101DB90 verbose 'Default'
opID=HB-host-56@2627-b61e8cd4-e4] [VpxaMoService::GetChangesInt] Vpxa
restarted or stolen by other server. Start a full sync"
Difficulty translating between host and vpxd:
2012-01-24T19:26:05.705Z [FFDACB90 warning 'Default']
[FetchQuickStats] GetTranslators -- host to vpxd translation is empty.
Dropping results
2012-01-24T19:26:05.706Z [FFDACB90 warning 'Default']
[AddEntityMetric] GetTranslators -- host to vpxd translation is empty.
Dropping results
2012-01-24T19:26:05.706Z [FFDACB90 warning 'Default']
[AddEntityMetric] GetTranslators -- host to vpxd translation is empty.
Dropping results
2012-01-24T19:26:05.706Z [FFDACB90 warning 'Default']
[AddEntityMetric] GetTranslators -- host to vpxd translation is empty.
Dropping results
2012-01-24T19:26:05.706Z [FFDACB90 warning 'Default']
[AddEntityMetric] GetTranslators -- host to vpxd translation is empty. Difficulty translating between host and vpxd:
 Vpxd.log (on vCenter Server):
 
Timeouts/Failed to respond and Host Sync failures:
2012-01-24T18:50:15.015-08:00 [00808 error 'HttpConnectionPool']
[ConnectComplete] Connect error A connection attempt failed because
the connected party did not properly respond after a period of time,
or established connection failed be
cause connected host has failed to respond.
After trying few options , our team successfully able to avoid vcenter disconnects with:
esxcli network firewall load  - if it is unloaded , then you can't enable HA, so we have to load it
esxcli network firewall set --enabled false
esxcli network firewall set --default-action true
esxcli network firewall get
 rc.local: (to persist between reboots)

1) echo -e "# disable firewall service\nesxcli network firewall load\nesxcli network firewall set --enabled false\nesxcli network firewall set --default-action true\n# disable firewall service" >> /etc/rc.local
2) auto-backup.sh

Addendum:
We had a useful simple and dirty scripts to monitor dropping, as Vcenter disconnects were intermittent and will recover after couple of minutes and we can't sit infront of the computer all the day

a. cat check_vcenter_disconnect_hep.sh  ( ./check_venter_disconnect_hep.sh >> public_html/disconnect.txt )
#!/bin/sh
while true
do
dddd=`date -u +%Y-%m-%dT%H`
echo $dddd
for seq in `seq 1 4`
do
echo "hep$seq"
~/scripts/hep$seq |grep $dddd
done
sleep 1800
done
b. cat hep1 ( we had hep1-hep4)
#!/usr/bin/expect
spawn ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.x.y.z
expect "word:"
send "esxi12345"
expect "#"
#send "rm  /var/run/log/vpxa.\[1-9\]\r"
#expect "#"
send "gunzip  /var/run/log/vpxa*.gz\r"
expect "#"
send "egrep stolen /var/run/log/vpxa*\r"
expect "#"
send "egrep -i dropping /var/run/log/vpx*\r"
expect "#"
send "egrep -i performance /var/log/vmkernel.log\r"
expect "#"
send "exit\r"


Useful Webpages:
 http://www.virtualizationpractice.com/all-paths-down-16250/
PDLs/APDs http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2004684