Thursday, May 16, 2013

Create NutanixVswitch from scratch


In a few instances, migrating to or from a Distributed vSwitch has deleted the Nutanix vSwitch or left it with the wrong configuration. When that happens, Genesis crashes and does not start.

From the CVM, run ping -I 192.168.5.254 192.168.5.1 to verify that the internal vSwitch is working.

Error Message in Genesis.out:
2012-11-06 13:57:42 ERROR node_manager.py:2378 Could not load the local ESX configuration

Sample of the correct vSwitchNutanix configuration:

~ # esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitchNutanix                   128         3           128               1500
  PortGroup Name        VLAN ID  Used Ports  Uplinks
  svm-iscsi-pg                         0        1
  vmk-svm-iscsi-pg                 0        1

Make sure the following match:
- No uplink interfaces
- Used ports 3
- Portgroup and # of used ports
- Name of the vswitch
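
The checklist above can be scripted. The following is a minimal sketch, not a Nutanix or VMware tool: the function name check_nutanix_vswitch is made up for illustration, and it assumes the column layout of the sample output above (uplinks appear as a sixth column on the switch row).

```shell
#!/bin/sh
# check_nutanix_vswitch: read `esxcfg-vswitch -l` output on stdin and compare
# it with the expected vSwitchNutanix layout. Hypothetical helper; it assumes
# the column layout shown in the sample output.
check_nutanix_vswitch() {
    awk '
        $1 == "vSwitchNutanix" {
            sw = 1
            if ($3 != 3) { badports = 1; used = $3 }  # 3rd column: used ports
            if (NF > 5)  { uplink = 1 }               # 6th column: uplinks
        }
        $1 == "svm-iscsi-pg"     { pg1 = 1 }
        $1 == "vmk-svm-iscsi-pg" { pg2 = 1 }
        END {
            bad = 0
            if (!sw)      { print "vSwitchNutanix not found";            bad = 1 }
            if (uplink)   { print "vSwitchNutanix has an uplink";        bad = 1 }
            if (badports) { print "used ports is " used ", expected 3";  bad = 1 }
            if (!pg1)     { print "port group svm-iscsi-pg missing";     bad = 1 }
            if (!pg2)     { print "port group vmk-svm-iscsi-pg missing"; bad = 1 }
            exit bad
        }
    '
}
```

Usage on the ESXi host: esxcfg-vswitch -l | check_nutanix_vswitch && echo "vSwitchNutanix looks OK"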

Sample of vmknic (make sure the IP address, subnet mask, and Enabled=true match):

~ # esxcfg-vmknic -l
Interface  Port Group/DVPort   IP Family IP Address    Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type               
vmk1       vmk-svm-iscsi-pg    IPv4      192.168.5.1   255.255.255.0   192.168.5.255   00:50:56:62:5a:e4 1500    65535     true    STATIC        
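
The same comparison can be done for the vmknic. This is a sketch under the assumption that the columns are laid out as in the sample above (IP address in column 4, netmask in column 5, Enabled in column 10); check_iscsi_vmknic is a hypothetical name.

```shell
#!/bin/sh
# check_iscsi_vmknic: read `esxcfg-vmknic -l` output on stdin and verify the
# vmknic on vmk-svm-iscsi-pg. Hypothetical helper; column positions are taken
# from the sample output and may differ on other ESXi builds.
check_iscsi_vmknic() {
    awk '
        $2 == "vmk-svm-iscsi-pg" {
            found = 1
            if ($4 != "192.168.5.1")   { print $1 ": unexpected IP " $4;      bad = 1 }
            if ($5 != "255.255.255.0") { print $1 ": unexpected netmask " $5; bad = 1 }
            if ($10 != "true")         { print $1 ": not enabled";            bad = 1 }
        }
        END {
            if (!found) { print "no vmknic on vmk-svm-iscsi-pg"; bad = 1 }
            exit bad + 0
        }
    '
}
```

Usage: esxcfg-vmknic -l | check_iscsi_vmknic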

Commands to recreate the nutanix Vswitch:


1. esxcfg-vswitch -a vSwitchNutanix ( this needs to be done via CLI)

2. esxcfg-vswitch -A "svm-iscsi-pg" vSwitchNutanix

3. esxcfg-vswitch -A "vmk-svm-iscsi-pg" vSwitchNutanix

4. If vmknic does not exist,
esxcfg-vmknic -a -i 192.168.5.1 -n 255.255.255.0 -p vmk-svm-iscsi-pg

5. esxcfg-vmknic -l

6. enable the 192.168.5.1 vmknic
esxcfg-vmknic -e vmk1

7. On the CVM, verify that eth1 is connected to svm-iscsi-pg (Edit Settings of the CVM in vCenter)

8. Run ifconfig eth1 and ifconfig eth1:1 on the CVM to verify that the interfaces are up and the IP addresses are
configured correctly
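
Steps 1 to 5 can be wrapped in a small script so they run in the same order every time. This is only a sketch: the run wrapper and the DRY_RUN and SKIP_VMKNIC variables are invented here for illustration. The esxcfg commands are the ones listed above. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical wrapper, not part of ESXi.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recreate_nutanix_vswitch: steps 1 to 5 from the list above.
# Set SKIP_VMKNIC=1 when the 192.168.5.1 vmknic already exists (step 4).
recreate_nutanix_vswitch() {
    run esxcfg-vswitch -a vSwitchNutanix || return 1
    run esxcfg-vswitch -A svm-iscsi-pg vSwitchNutanix || return 1
    run esxcfg-vswitch -A vmk-svm-iscsi-pg vSwitchNutanix || return 1
    if [ "${SKIP_VMKNIC:-0}" != "1" ]; then
        run esxcfg-vmknic -a -i 192.168.5.1 -n 255.255.255.0 -p vmk-svm-iscsi-pg || return 1
    fi
    run esxcfg-vmknic -l
}
```

Try DRY_RUN=1 recreate_nutanix_vswitch first to review what would be executed. Steps 6 to 8 (enabling the vmknic and checking the CVM side) are still done by hand.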

Monday, May 13, 2013

Guest VM crash due to vcpu-0:EPT misconfiguration


Guest VM powered off on ESXi 5.1 or 5.0 due to an EPT misconfiguration (see vmware.log)

Snippet of the log (vmware.log, in the same directory as the .vmx file):
2013-05-03T17:27:43.262Z| vcpu-1| MONITOR PANIC: vcpu-0:EPT misconfiguration: PA b49b405b0
2013-05-03T17:27:43.262Z| vcpu-1| Core dump with build build-623860
2013-05-03T17:27:43.262Z| vcpu-1| Writing monitor corefile "/vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmmcores.gz"
2013-05-03T17:27:43.262Z| vcpu-6| Exiting vcpu-6
2013-05-03T17:27:43.262Z| vcpu-3| Exiting vcpu-3
2013-05-03T17:27:43.262Z| vcpu-7| Exiting vcpu-7
2013-05-03T17:27:43.262Z| vcpu-2| Exiting vcpu-2
2013-05-03T17:27:43.262Z| vcpu-0| Exiting vcpu-0
2013-05-03T17:27:43.262Z| vcpu-4| Exiting vcpu-4
2013-05-03T17:27:43.262Z| vcpu-5| Exiting vcpu-5
2013-05-03T17:27:43.268Z| vcpu-1| Saving anonymous memory
2013-05-03T17:27:43.280Z| vcpu-1| Dumping core for vcpu-0
2013-05-03T17:27:43.280Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:43.280Z| vcpu-1| VMK Stack for vcpu 0 is at 0x41223b241000
2013-05-03T17:27:43.280Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:44.181Z| vcpu-1| End monitor coredump
2013-05-03T17:27:44.182Z| vcpu-1| Dumping core for vcpu-1
2013-05-03T17:27:44.182Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:44.182Z| vcpu-1| VMK Stack for vcpu 1 is at 0x41223b2c1000
2013-05-03T17:27:44.182Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:45.062Z| vcpu-1| End monitor coredump
2013-05-03T17:27:45.063Z| vcpu-1| Dumping core for vcpu-2
2013-05-03T17:27:45.063Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:45.063Z| vcpu-1| VMK Stack for vcpu 2 is at 0x41223b301000
2013-05-03T17:27:45.063Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:45.940Z| vcpu-1| End monitor coredump
2013-05-03T17:27:45.941Z| vcpu-1| Dumping core for vcpu-3
2013-05-03T17:27:45.941Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:45.941Z| vcpu-1| VMK Stack for vcpu 3 is at 0x41223b341000
2013-05-03T17:27:45.941Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:46.813Z| vcpu-1| End monitor coredump
2013-05-03T17:27:46.814Z| vcpu-1| Dumping core for vcpu-4
2013-05-03T17:27:46.814Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:46.814Z| vcpu-1| VMK Stack for vcpu 4 is at 0x41223b381000
2013-05-03T17:27:46.814Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:47.685Z| vcpu-1| End monitor coredump
2013-05-03T17:27:47.685Z| vcpu-1| Dumping core for vcpu-5
2013-05-03T17:27:47.685Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:47.685Z| vcpu-1| VMK Stack for vcpu 5 is at 0x41223b3c1000
2013-05-03T17:27:47.686Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:48.559Z| vcpu-1| End monitor coredump
2013-05-03T17:27:48.559Z| vcpu-1| Dumping core for vcpu-6
2013-05-03T17:27:48.559Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:48.560Z| vcpu-1| VMK Stack for vcpu 6 is at 0x41223b401000
2013-05-03T17:27:48.560Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:49.429Z| vcpu-1| End monitor coredump
2013-05-03T17:27:49.430Z| vcpu-1| Dumping core for vcpu-7
2013-05-03T17:27:49.430Z| vcpu-1| CoreDump: dumping core with superuser privileges
2013-05-03T17:27:49.430Z| vcpu-1| VMK Stack for vcpu 7 is at 0x41223b441000
2013-05-03T17:27:49.430Z| vcpu-1| Beginning monitor coredump
2013-05-03T17:27:50.305Z| vcpu-1| End monitor coredump
2013-05-03T17:27:50.306Z| vcpu-1| Dumping extended monitor data
2013-05-03T17:27:56.966Z| vcpu-1| Msg_Post: Error
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic] *** VMware ESX internal monitor error ***
2013-05-03T17:27:56.966Z| vcpu-1| --> vcpu-0:EPT misconfiguration: PA b49b405b0
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic.report] You can report this problem by selecting menu item Help > VMware on the Web > Request Support, or by going t
9%2d3efd569e%2dd4d8%2d002590840e37%2fServiceVM%2fvmmcores%2egz". Provide the log file (/vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmware.log) and the c
ore file(s) (/vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmmcores.gz, /vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmx-zdump.000).
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic.serverdebug] If the problem is repeatable, set 'Use Debug Monitor' to 'Yes' in the 'Misc' section of the Configure V
irtual Machine Web page. Then reproduce the incident and file it according to the instructions.
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic.vmSupport.vmx86] To collect data to submit to VMware support, run "vm-support".
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic.entitlement] We will respond on the basis of your support entitlement.
2013-05-03T17:27:56.966Z| vcpu-1| [msg.log.monpanic.finish] We appreciate your feedback,
2013-05-03T17:27:56.966Z| vcpu-1| -->   -- the VMware ESX team.
2013-05-03T17:27:56.966Z| vcpu-1| ----------------------------------------
2013-05-03T17:27:57.894Z| vmx| GuestRpcSendTimedOut: message to toolbox timed out.



vmkwarning.log:2013-05-03T17:27:59.813Z cpu5:4151)WARNING: PFrame: vm 1543881: 1514: Deallocating pinned pgNum 0x0, pinCount 1 throttle 0.
vobd.log:2013-05-03T17:27:56.966Z: [UserWorldCorrelator] 1984725641586us: [vob.uw.core.dumped] /bin/vmx(1543893) /vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmx-zdump.000
vobd.log:2013-05-03T17:27:56.966Z: [UserWorldCorrelator] 1984705573722us: [esx.problem.application.core.dumped] An application (/bin/vmx) running on ESXi host has crashed (1 time(s) so far). A core file may have been created at /vmfs/volumes/51548019-3efd569e-d4d8-002590840e37/ServiceVM/vmx-zdump.000.


This appears to be a CVM crash. Files left in the VM directory:

-rw-r--r--    1 root     root           16276242 May  3 17:27 vmmcores-1.gz
-r--------    1 root     root            5394432 May  3 17:27 vmx-zdump.000
-rw-r--r--    1 root     root             240217 May  3 17:28 vmware-5.log
-rw-------    1 root     root           54525952 May  4 02:34 vmx-ServiceVM-830969


Here is the workaround from VMware support:
1. Can you let me know what .vmx changes need to be made to reduce the
occurrence of the bug?
- change the MMU setting to software virtualization as shown in KB Article: 1036775
- another possible workaround is to change (or add) the following line in the .vmx file:
    monitor_control.disable_mmu_largepages=TRUE

2. Can you give us the instructions to install the patch and also the
debug patch? We will test it internally before giving it to the customer.
- we have submitted your request to engineering
- engineering will investigate whether the issue in your case exactly matches the particular issue being investigated
- they will let us know if your case qualifies for a debug patch

3. Is the fix in ESXi 5.0 U3 or 5.1 U3?
- actually, there is no fix at this time
- the issue is still under investigation

4. Does increasing the VM memory help reduce the occurrence? From
/proc/meminfo (grep AS) I see about 9G of 12G in use in the guest VM.
Maybe increasing the VM memory to 16G would reduce the occurrence.
- No. Decreasing the memory for this (and other) virtual machines might reduce the occurrence of the issue.
- see KB article 1021896 for details

5. The information requested is published in KB article 2040519.
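
For the .vmx workaround in answer 1, a helper like the one below can change or add the line. It is a sketch with assumptions: set_vmx_option is a hypothetical function, it writes the line in the usual key = "value" form found in .vmx files, and it assumes the VM is powered off while the file is edited. A backup copy is kept next to the original.

```shell
#!/bin/sh
# set_vmx_option FILE KEY VALUE
# Replace any existing KEY line in FILE, or append one if none exists.
# Hypothetical helper; keeps FILE.bak as a backup.
set_vmx_option() {
    vmx=$1
    key=$2
    value=$3
    tmp="$vmx.tmp.$$"
    cp "$vmx" "$vmx.bak" || return 1
    # Copy every line whose option name differs from KEY.
    awk -v k="$key" '
        {
            name = $0
            sub(/[ \t]*=.*/, "", name)   # keep only the text before "="
            sub(/^[ \t]+/, "", name)
            if (name != k) print
        }
    ' "$vmx" > "$tmp" || return 1
    printf '%s = "%s"\n' "$key" "$value" >> "$tmp" || return 1
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$vmx" && rm -f "$tmp"
}
```

Example (the path is a placeholder): set_vmx_option /vmfs/volumes/DATASTORE/ServiceVM/ServiceVM.vmx monitor_control.disable_mmu_largepages TRUE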

Thursday, May 2, 2013

Troubleshooting bad hard disk on Nutanix

Stargate marks a disk offline if any command to it takes more than 20 seconds.

Reasons that a disk could take more than 20 seconds to respond

1. bad sectors - sudo smartctl -a /dev/sdX1 -T permissive
2. bad adapter - sudo smartctl  -l sataphy /dev/sdc1
3. bad chassis
4. disk slow ( iostat from sysstats - data/logs/sysstat)
5. stargate.INFO shows the disk being marked offline; gdb on the stargate core file shows threads stuck in fdatasync and pwrite
6. Mark the disk online again to see if it goes offline again.
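
To spot a slow disk (reason 4) in saved iostat output, a filter on the await column is usually enough. The sketch below is illustrative only: slow_disks is a hypothetical name, the default threshold of 1000 ms is arbitrary, and it assumes extended iostat output with a column literally named await, as in the sysstat version on the CVM.

```shell
#!/bin/sh
# slow_disks [AWAIT_MS]
# Read extended iostat output on stdin and print "device await" for every
# device row whose await exceeds AWAIT_MS (default 1000 ms).
# Hypothetical helper; the await column is located from the "Device:" header.
slow_disks() {
    awk -v limit="${1:-1000}" '
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "await") col = i
            next
        }
        col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
    '
}
```

Usage: iostat -x -m 5 3 | slow_disks 1000, or feed it a saved iostat file from the CVM log directory.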

The solution below walks through the commands.
Solution
Here is an example of the outputs

1. iostat (you can graph it using Nagios at zk_leader:7777, or use the sysstats directory in /home/nutanix/data/logs):

IO on this disk was stuck for more than 20s, which caused stargate to mark it offline. Several write ops were in progress, all stuck in the pwrite/fdatasync system calls on this disk.

#TIMESTAMP 1366215700 : 04/17/2013 09:21:40 AM
Linux 2.6.35-30-server (NTNX-Ctrl-VM-3-NTNX)    04/17/2013      _x86_64_        (8 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.10    0.95    6.14   45.96    0.00   38.85
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     9.00    0.00    8.40     0.00     0.07    16.57     0.00    0.24   0.24   0.20
sdb               0.00   100.80    4.60   10.80     0.19     0.43    82.29     0.04    2.60   2.21   3.40
sdc               0.00     0.00    0.00    1.40     0.00     0.70  1024.00    13.82 10674.29 714.29 100.00
sdd               0.00  1618.60    4.60   15.20     0.16     6.38   676.77     3.04  153.43  10.91  21.60
2. smartctl -l sataphy output: check whether the counters increment.

ATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2           83  Device-to-host register FISes sent due to a COMRESET
0x0001  2           38  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2           38  R_ERR response for host-to-device data FIS
0x0007  2            2  R_ERR response for host-to-device non-data FIS
3. lspci |grep -i sata: to find the type of controller
00:00:1f.2 Mass storage controller: Intel Corporation ICH10 6 port SATA AHCI Controller [vmhba0]

4. ncli disk ls: check for offline disks (Online: false)
Disk ID                   : 5946892
Storage Tier              : DAS-SATA
Host Name                 : 10.30.1.212
Mount Path                : /home/nutanix/data/stargate-storage/disks/9XG2WKG1
Online                    : false
Location                  : 6
5. stargate.ERROR:
E0417 09:21:49.258476  5468 extent_store.cc:546] notification=PathOffline mount_path=/home/nutanix/data/stargate-storage/disks/9XG2WKG1 ip_address=10.30.1.217 service_vm_id=17
F0417 09:21:49.263244  5468 extent_store.cc:667] Mount path /home/nutanix/data/stargate-storage/disks/9XG2WKG1 with disk id 5946892 marked offline
6. dmesg or sudo cat /var/log/messages on CVM
7. zeus_config_printer:
{
  "cluster_id": 1081,
  "disk_id": 5946892,
  "disk_size": 925779999948,
  "storage_tier": "DAS-SATA"
}


8. smartctl on disks
root@NTGTDC-Ctrl-VM-3:10.30.1.217:/var/log# smartctl  -a /dev/sdc1 -T permissive
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.35-30-server] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Serial Number:    9XG2WKG1
LU WWN Device Id: 5 000c50 04e7f347e
Firmware Version: SN02  <<<< SN03 is better.
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
=== START OF READ SMART DATA SECTION ===
SMART STATUS RETURN: incomplete response, ATA output registers missing
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  642) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 201) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   069   044    Pre-fail  Always       -       98874921
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       420367
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       516
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       23
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   098   076   000    Old_age   Always       -       26
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   060   045    Old_age   Always       -       24 (Min/Max 20/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       600
194 Temperature_Celsius     0x0022   024   040   000    Old_age   Always       -       24 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   115   100   000    Old_age   Always       -       98874921
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       38
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

9. gdb on core file: (stack trace output)
vim stargate.core.5425.6.20130417-092157.stack_trace.txt.gz - stuck on fdatasync or pwrite.

Thread 78 (Thread 5536):
#0  0x00007f8fd985a027 in fdatasync () from /home/nutanix/toolchain/x86_64-unknown-linux-gnu/1.3/lib/libc.so.6
Thread 77 (Thread 5558):
#0  0x00007f8fd9859868 in pwritev (fd=72, vector=0x3489f460, count=7, offset=15323) at ../sysdeps/unix/sysv/linux/pwritev.c:67

 10. ncli disk mark-online id=5946892
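
For step 2 (checking whether the SATA phy counters increment), save the output of smartctl -l sataphy twice, some time apart, and compare the two files. The sketch below assumes the four-column layout shown in step 2 (ID, size, value, description); phy_counter_delta is a hypothetical name.

```shell
#!/bin/sh
# phy_counter_delta BEFORE_FILE AFTER_FILE
# Print "id old -> new" for every phy event counter whose value grew between
# two saved `smartctl -l sataphy` outputs. Hypothetical helper.
phy_counter_delta() {
    awk '
        $1 !~ /^0x/ { next }                  # skip headers and blank lines
        NR == FNR   { before[$1] = $3; next } # first file: remember values
        ($3 + 0) > (before[$1] + 0) { print $1, (before[$1] + 0), "->", $3 }
    ' "$1" "$2"
}
```

Usage: sudo smartctl -l sataphy /dev/sdc1 > phy.1; sleep 600; sudo smartctl -l sataphy /dev/sdc1 > phy.2; phy_counter_delta phy.1 phy.2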

IPMItool

to get the current IPaddress:

ESXi#/ipmitool  lan print 1
~ # /ipmitool lan print 1| egrep "IP Address|Subnet|Gateway IP|VLAN"
IP Address Source       : Static Address
IP Address              : 112.18.12.167
Subnet Mask             : 255.255.254.0
Default Gateway IP      : 112.18.12.1
Backup Gateway IP       : 0.0.0.0
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
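
When scripting against several hosts it helps to pull a single value out of the lan print output instead of reading the whole block. A minimal sketch, assuming the "Name : value" layout shown above; ipmi_lan_field is a hypothetical helper, not an ipmitool subcommand.

```shell
#!/bin/sh
# ipmi_lan_field NAME
# Read `ipmitool lan print` output on stdin and print the value of the first
# field whose name matches NAME exactly (for example "IP Address").
# Hypothetical helper.
ipmi_lan_field() {
    awk -v want="$1" '
        index($0, ":") > 0 {
            name = substr($0, 1, index($0, ":") - 1)
            sub(/[ \t]+$/, "", name)
            if (name == want && !seen++) {
                value = substr($0, index($0, ":") + 1)
                sub(/^[ \t]+/, "", value)
                print value
            }
        }
    '
}
```

Usage: /ipmitool lan print 1 | ipmi_lan_field "IP Address". Because the name must match exactly, "IP Address" does not pick up the "IP Address Source" line.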


~ # /ipmitool  lan set 1 ipsrc static
~ # /ipmitool lan set 1 netmask 255.255.248.0
Setting LAN Subnet Mask to 255.255.248.0
~ # /ipmitool lan set 1 ipaddr 101.23.40.47 
~ # /ipmitool lan set 1 defgw ipaddr 101.23.40.1


 Locator LED :
/ipmitool chassis identify force

Set the locator LED off

/ipmitool chassis identify 0

Turn on the locator LED for up to 255 seconds

 /ipmitool chassis identify 255


 SEL events:

/ipmitool sel list
/ipmitool -v sel list
 ( more verbose)

sensor readings

/ipmitool sensor list

Reset the BMC:

/ipmitool mc reset cold
/ipmitool  raw 0x06 0x02

To find the serial number of the system:

/ipmitool fru list

 
To reset the server

/ipmitool chassis power on/off/cycle/reset
/ipmitool chassis power diag   (sends an NMI to the ESXi server)


IPMI policy when the power is restored:


~ # /ipmitool chassis policy
chassis policy <state>
   list        : return supported policies
   always-on   : turn on when power is restored
   previous    : return to previous state when power is restored
   always-off  : stay off after power is restored

Monday, April 1, 2013

Reclaim the space on storage from a VM

When files are deleted inside a guest VM running on NFS storage mounted on ESXi, only the inode entries are updated to mark the blocks as reusable. The NFS storage itself does not free the space.


The blog above talks about reclaiming space via VMware Tools and sdelete.

On Linux, you can do this from inside the user VM. First check how much space is free:

df -h 

If /mnt/local_disk1 has 32G free, you can zero out about 32G by writing a file of zeros and then deleting it:

dd if=/dev/zero of=/mnt/local_disk1/zero.out bs=$((4*1024*1024)) count=8000 && rm /mnt/local_disk1/zero.out
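
Rather than hard-coding count=8000, the block count can be derived from the free space that df reports. The sketch below is an illustration: dd_block_count and zero_free_space are hypothetical helpers, and the 1024 MiB left unwritten is an arbitrary safety margin so the filesystem never fills completely.

```shell
#!/bin/sh
# dd_block_count FREE_KIB BLOCK_MIB RESERVE_MIB
# Number of BLOCK_MIB-sized blocks that fit in FREE_KIB of free space while
# leaving RESERVE_MIB unwritten. Prints 0 if nothing fits.
dd_block_count() {
    free_mib=$(( $1 / 1024 ))
    usable=$(( free_mib - $3 ))
    if [ "$usable" -le 0 ]; then
        echo 0
    else
        echo $(( usable / $2 ))
    fi
}

# zero_free_space MOUNTPOINT [RESERVE_MIB]
# Write a file of zeros over most of the free space, then delete it.
zero_free_space() {
    mnt=$1
    reserve=${2:-1024}
    free_kib=$(df -Pk "$mnt" | awk 'NR == 2 { print $4 }')
    count=$(dd_block_count "$free_kib" 4 "$reserve")
    if [ "$count" -le 0 ]; then
        echo "not enough free space on $mnt" >&2
        return 1
    fi
    dd if=/dev/zero of="$mnt/zero.out" bs=$((4*1024*1024)) count="$count"
    rc=$?
    sync
    rm -f "$mnt/zero.out"    # remove the file even if dd stopped early
    return $rc
}
```

Usage: zero_free_space /mnt/local_disk1. With 32 GiB free and the default reserve, it writes 7936 blocks of 4 MiB.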

Friday, March 15, 2013

Netstat for ESXi to troubleshoot Vcenter disconnects.

We recently saw a customer case where the ESXi host's connection to vCenter dropped during backups or
during vMotion.

Synopsis:

1. VMware might not support more than 100 connections per vmk port.
2. It is good practice to put vMotion on its own vmknic and VLAN, separate from the management port.
3. The ESXi firewall might not be able to handle the traffic.
4. Check ethtool for any network drops.

How to check the number of connections per vmk ?
(http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2021629)

Quick script:


~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c
1 ------------------
13 0.0.0.0
9 10.202.1.154
55 127.0.0.1
3 169.254.1.1
6 192.168.5.1
1 Send
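
The quick script above also counts the header and separator rows (the "------------------" and "Send" lines). A variant that only counts real tcp/udp rows could look like the sketch below; count_conns_by_local_ip is a hypothetical name, and it assumes the local address is the fourth column, as in the listing that follows.

```shell
#!/bin/sh
# count_conns_by_local_ip
# Read `esxcli network ip connection list` output on stdin and print
# "count ip" per local IP address, ignoring header and separator rows.
# Hypothetical helper.
count_conns_by_local_ip() {
    awk '
        $1 == "tcp" || $1 == "udp" {
            ip = $4
            sub(/:[0-9]+$/, "", ip)   # strip the trailing :port
            n[ip]++
        }
        END { for (ip in n) print n[ip], ip }
    ' | sort -k2
}
```

Usage: esxcli network ip connection list | count_conns_by_local_ip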





~ # esxcli network ip connection list | sort -k 4
-----  ------  ------  ------------------  ------------------  -----------  --------  --------------------
tcp         0       0  0.0.0.0:22          0.0.0.0:0           LISTEN              0
udp         0       0  0.0.0.0:427         0.0.0.0:0                               0
tcp         0       0  0.0.0.0:443         0.0.0.0:0           LISTEN           4936  hostd-worker
tcp         0       0  0.0.0.0:5989        0.0.0.0:0           LISTEN              0
tcp         0       0  0.0.0.0:80          0.0.0.0:0           LISTEN           4936  hostd-worker
tcp         0       0  0.0.0.0:8000        0.0.0.0:0           LISTEN           4739
tcp         0       0  0.0.0.0:8100        0.0.0.0:0           LISTEN           4739
udp         0       0  0.0.0.0:8182        0.0.0.0:0                            5972  fdm
tcp         0       0  0.0.0.0:8182        0.0.0.0:0           LISTEN           5963  fdm
udp         0       0  0.0.0.0:8200        0.0.0.0:0                            4739
udp         0       0  0.0.0.0:8301        0.0.0.0:0                            4739
udp         0       0  0.0.0.0:8302        0.0.0.0:0                            4739
tcp         0       0  0.0.0.0:902         0.0.0.0:0           LISTEN              0
udp         0       0  10.202.1.154:427    0.0.0.0:0                               0
tcp         0       0  10.202.1.154:427    0.0.0.0:0           LISTEN              0
tcp         0       0  10.202.1.154:443    10.202.1.101:57247  ESTABLISHED      5225  hostd-worker
tcp         0       0  10.202.1.154:443    10.202.1.101:60137  ESTABLISHED      6356  hostd-worker
tcp         0       0  10.202.1.154:443    10.202.1.101:60142  ESTABLISHED      8571  hostd-worker
tcp         0       0  10.202.1.154:443    10.202.1.101:62014  ESTABLISHED      6351  hostd-worker
tcp         0       0  10.202.1.154:51421  10.202.1.170:443    ESTABLISHED      5589  vShield-Endpoint-Mux
tcp         0       0  10.202.1.154:56833  10.202.1.170:443    ESTABLISHED      5589  vShield-Endpoint-Mux
tcp         0       0  10.202.1.154:5989   10.202.1.101:58067  FIN_WAIT_2          0
tcp         0       0  10.202.1.154:60608  10.202.1.153:8182   ESTABLISHED      5977  fdm
tcp         0       0  10.202.1.154:897    10.202.1.250:2049   ESTABLISHED      4811  nfsLockFileUpdate
tcp         0       0  127.0.0.1:12000     0.0.0.0:0           LISTEN         225516  vpxa-worker
tcp         0       0  127.0.0.1:12001     0.0.0.0:0           LISTEN           4936  hostd-worker
tcp         0       0  127.0.0.1:427       0.0.0.0:0           LISTEN              0
tcp         0       0  127.0.0.1:443       127.0.0.1:50393     ESTABLISHED      6352  hostd-worker
tcp         0       0  127.0.0.1:49422     127.0.0.1:8307      ESTABLISHED    225530  vpxa-worker
tcp         0       0  127.0.0.1:49572     127.0.0.1:5988      CLOSE_WAIT       6355  hostd-worker
tcp         0       0  127.0.0.1:50058     127.0.0.1:80        ESTABLISHED      5873  sfcb-vmware_bas
tcp         0       0  127.0.0.1:50163     127.0.0.1:8307      ESTABLISHED    263472  vpxa-worker
tcp         0       0  127.0.0.1:50393     127.0.0.1:443       ESTABLISHED    984559  python
tcp         0       0  127.0.0.1:51125     127.0.0.1:8307      TIME_WAIT           0
tcp         0       0  127.0.0.1:52165     127.0.0.1:8089      ESTABLISHED     25860  hostd-worker
tcp         0       0  127.0.0.1:52240     127.0.0.1:8307      TIME_WAIT           0
tcp         0       0  127.0.0.1:52398     127.0.0.1:8307      ESTABLISHED    225532  vpxa-worker
tcp         0       0  127.0.0.1:53593     127.0.0.1:8307      ESTABLISHED      6352  hostd-worker
tcp         0       0  127.0.0.1:53696     127.0.0.1:8307      TIME_WAIT           0
tcp         0       0  127.0.0.1:54018     127.0.0.1:8307      FIN_WAIT_2       4936  hostd-worker
tcp         0       0  127.0.0.1:54093     127.0.0.1:80        CLOSE_WAIT       5867  sfcb-vmware_bas
tcp         0       0  127.0.0.1:54103     127.0.0.1:8307      ESTABLISHED      5975  fdm
tcp         0       0  127.0.0.1:54545     127.0.0.1:8089      ESTABLISHED    225466  hostd-worker
tcp         0       0  127.0.0.1:56006     127.0.0.1:8307      TIME_WAIT           0
tcp         0       0  127.0.0.1:56719     127.0.0.1:8307      ESTABLISHED    225533  vpxa-worker
tcp         0       0  127.0.0.1:57105     127.0.0.1:443       TIME_WAIT           0
tcp         0       0  127.0.0.1:57181     127.0.0.1:8089      ESTABLISHED      6352  hostd-worker
tcp         0       0  127.0.0.1:57196     127.0.0.1:8089      ESTABLISHED    225479  hostd-worker
tcp         0       0  127.0.0.1:57311     127.0.0.1:8089      TIME_WAIT           0
tcp         0       0  127.0.0.1:57541     127.0.0.1:8089      TIME_WAIT           0
tcp         0       0  127.0.0.1:57720     127.0.0.1:443       TIME_WAIT           0
tcp         0       0  127.0.0.1:57807     127.0.0.1:8307      ESTABLISHED      5977  fdm
tcp         0       0  127.0.0.1:57981     127.0.0.1:8307      ESTABLISHED    225529  vpxa-worker
tcp         0       0  127.0.0.1:59574     127.0.0.1:8307      ESTABLISHED    225528  vpxa-worker
tcp         0       0  127.0.0.1:5988      0.0.0.0:0           LISTEN              0
tcp         0       0  127.0.0.1:5988      127.0.0.1:49572     FIN_WAIT_2          0
tcp         0       0  127.0.0.1:5988      127.0.0.1:56961     TIME_WAIT           0
tcp         0       0  127.0.0.1:60625     127.0.0.1:8307      ESTABLISHED    263472  vpxa-worker
tcp         0       0  127.0.0.1:61171     127.0.0.1:8307      ESTABLISHED      5963  fdm
tcp         0       0  127.0.0.1:62139     127.0.0.1:8307      ESTABLISHED    225479  hostd-worker
tcp         0       0  127.0.0.1:80        127.0.0.1:50058     ESTABLISHED      6588  hostd-worker
tcp         0       0  127.0.0.1:80        127.0.0.1:54093     FIN_WAIT_2       4936  hostd-worker
tcp         0       0  127.0.0.1:8089      0.0.0.0:0           LISTEN         225516  vpxa-worker
tcp         0       0  127.0.0.1:8089      127.0.0.1:52165     ESTABLISHED    225532  vpxa-worker
tcp         0       0  127.0.0.1:8089      127.0.0.1:54545     ESTABLISHED    225525  vpxa-worker
tcp         0       0  127.0.0.1:8089      127.0.0.1:57181     ESTABLISHED    263531  vpxa-worker
tcp         0       0  127.0.0.1:8089      127.0.0.1:57196     ESTABLISHED    225522  vpxa-worker
tcp         0       0  127.0.0.1:8307      0.0.0.0:0           LISTEN           4936  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:49422     ESTABLISHED    225477  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:50163     ESTABLISHED      6355  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:52398     ESTABLISHED    225466  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:53593     ESTABLISHED      5225  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:54018     CLOSE_WAIT       4965  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:54103     ESTABLISHED      4966  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:56719     ESTABLISHED     22169  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:57807     ESTABLISHED    225482  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:57981     ESTABLISHED      4936  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:59574     ESTABLISHED      4962  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:60625     ESTABLISHED      6557  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:61171     ESTABLISHED    225480  hostd-worker
tcp         0       0  127.0.0.1:8307      127.0.0.1:62139     ESTABLISHED      6553  hostd-worker
tcp         0       0  127.0.0.1:8309      0.0.0.0:0           LISTEN           4936  hostd-worker
tcp         0       0  127.0.0.1:8889      0.0.0.0:0           LISTEN           5387  openwsmand
tcp         0       0  127.0.0.1:9089      0.0.0.0:0           LISTEN           5963  fdm
tcp         0       0  127.0.0.1:9090      0.0.0.0:0           LISTEN           5963  fdm
tcp         0       0  169.254.1.1:2222    0.0.0.0:0           LISTEN           4739
udp         0       0  169.254.1.1:427     0.0.0.0:0                               0
tcp         0       0  169.254.1.1:427     0.0.0.0:0           LISTEN              0
tcp         0       0  192.168.5.1:22      192.168.5.2:49718   ESTABLISHED         0
udp         0       0  192.168.5.1:427     0.0.0.0:0                               0
tcp         0       0  192.168.5.1:427     0.0.0.0:0           LISTEN              0
tcp         0       0  192.168.5.1:825     192.168.5.2:2049    ESTABLISHED      4811  nfsLockFileUpdate
tcp         0       0  192.168.5.1:826     192.168.5.2:2049    ESTABLISHED      6496  vmm0:D39
tcp         0       0  192.168.5.1:828     192.168.5.2:2049    ESTABLISHED         0
Proto  Recv Q  Send Q  Local Address       Foreign Address     State        World ID  World Name


2. Separate the vMotion and management vmknic/VLAN

3. Firewall settings:
~ # esxcli network firewall get
   Default Action: PASS
   Enabled: false 
   Loaded: true ( it should be loaded for HA to work)

4.  "ethtool -S vmnic0 |grep error| grep -v :\ 0"
     "ethtool -S vmnic0 |grep rx_no_buffer_count | grep -v :\ 0"

5. Finally review vpxa.log and hostd.log on ESXi as well as vpxd.log in vcenter.

6. When the connection is hung, the fix is to restart the services (services.sh restart)
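
The two ethtool checks in step 4 can be combined into one filter that prints only the counters that are not zero. A small sketch, assuming the usual "name: value" layout of ethtool -S output; nonzero_nic_errors is a hypothetical name.

```shell
#!/bin/sh
# nonzero_nic_errors
# Read `ethtool -S <nic>` output on stdin and print "counter value" for every
# error counter (or rx_no_buffer_count) that is not zero.
# Hypothetical helper.
nonzero_nic_errors() {
    awk -F ':' '
        /error|rx_no_buffer_count/ {
            value = $NF + 0
            if (value != 0) {
                name = $1
                sub(/^[ \t]+/, "", name)
                print name, value
            }
        }
    '
}
```

Usage: ethtool -S vmnic0 | nonzero_nic_errors. No output means no error counters have incremented.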