Monday, August 5, 2013

Unable to enable HA on one ESXi host

Problem Description:
Host ABCD (x.y.z.150) is unable to start vSphere HA.  The current state is "vSphere HA Agent Unreachable".  I have tried to start HA twice, but this did not resolve the issue.
KBs to review:

Logs to check on the ESXi host: /var/log/vpxa.log and /var/log/fdm.log (also found under /var/run/log)

fdm.log snippet:

2013-08-04T21:01:18.968Z [FFDD3B90 error 'Cluster' opID=SWI-79b9207c] [ClusterDatastore::DoAcquireDatastoreWork] open(/vmfs/volumes/9e9989cf-f687e31c/.vSphere-HA/FDM-F78AC28A-8862-48C5-BC1C-F369CCABE58E-1480-9c9b8fc-ANTHMSASVC5/protectedlist) failed: Device or resource busy
2013-08-04T21:01:44.224Z [38498B90 error 'Default' opID=SWI-4593d696] SSLStreamImpl::BIOWrite (0d3fe098) Write failed: Broken pipe
2013-08-04T21:01:44.224Z [38498B90 error 'Default' opID=SWI-4593d696] SSLStreamImpl::DoClientHandshake (0d3fe098) SSL_connect failed with BIO Error
2013-08-04T21:01:44.224Z [38498B90 error 'Message' opID=SWI-4593d696] [MsgConnectionImpl::FinishSSLConnect] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: BIO Error) on handshake
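
When fdm.log is long, it can help to pull out only the error-level entries. The following is a minimal Python sketch, not a VMware tool. The line layout (timestamp, then [thread level 'component' opID=...], then the message) is inferred only from the lines quoted above, and the names FdmEntry, parse_line and errors are my own, so treat the pattern as an assumption that may need adjusting for other builds.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Layout inferred from the fdm.log lines quoted above (an assumption):
#   <timestamp> [<thread> <level> '<component>' opID=<id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+"
    r"\[(?P<thread>[0-9A-Fa-f]+)\s+(?P<level>\w+)\s+'(?P<component>[^']*)'"
    r"(?:\s+opID=(?P<opid>[^\]]+))?\]\s+"
    r"(?P<message>.*)$"
)


class FdmEntry(NamedTuple):
    """One parsed log line (hypothetical helper type)."""
    timestamp: str
    thread: str
    level: str
    component: str
    opid: Optional[str]
    message: str


def parse_line(line: str) -> Optional[FdmEntry]:
    """Return the parsed entry, or None if the line does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return FdmEntry(m["ts"], m["thread"], m["level"],
                    m["component"], m["opid"], m["message"])


def errors(lines: Iterable[str]) -> Iterator[FdmEntry]:
    """Yield only the entries logged at level 'error'."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "error":
            yield entry


if __name__ == "__main__":
    import sys
    # Usage: python fdm_errors.py /path/to/a/copy/of/fdm.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for e in errors(fh):
                print(e.timestamp, e.component, e.message)
```

Run it against a copy of the log pulled off the host; lines that do not match the assumed layout are skipped rather than reported.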

Workaround: 

- Check for high latency on the storage (the "Device or resource busy" error above was raised while opening a file on the HA datastore).
- Run services.sh restart on the host.
- Enable/refresh HA again in vCenter.

Errors in fdm.log:

2013-08-04T23:23:33.577Z [FFF18B90 verbose 'Cluster' opID=SWI-5a8f10c4] [ClusterManagerImpl::IsBadIP] x.y.z.199 is bad ip
2013-08-04T23:23:34.578Z [FFF18B90 verbose 'Cluster' opID=SWI-5a8f10c4] [ClusterManagerImpl::IsBadIP] x.y.z.199 is bad ip
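
The "is bad ip" message repeats about once a second, so a quick count per address shows which hosts the agent is complaining about. This is a small Python sketch under the assumption that the message always has the form seen above ("[ClusterManagerImpl::IsBadIP] <address> is bad ip"); count_bad_ips is a name I made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Message form taken from the two log lines above; treat it as an assumption.
BAD_IP_RE = re.compile(
    r"\[ClusterManagerImpl::IsBadIP\]\s+(?P<addr>\S+)\s+is bad ip"
)


def count_bad_ips(lines: Iterable[str]) -> Counter:
    """Count how often each address is reported as a bad IP."""
    counts: Counter = Counter()
    for line in lines:
        m = BAD_IP_RE.search(line)
        if m:
            counts[m["addr"]] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python bad_ips.py /path/to/a/copy/of/fdm.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for addr, n in count_bad_ips(fh).most_common():
                print(addr, n)
```

Each address it prints is a host worth checking with the workaround below.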

Workaround:

On x.y.z.199, review fdm.log and run services.sh restart. (You could instead disconnect and reconnect the host, which also restarts services, but I find that services.sh restart fixes more issues.)