Friday, March 15, 2013

vCenter disconnects troubleshooting.

Recently we worked with a customer whose ESXi hosts dropped their connection to vCenter during
backups or during vMotion.

Synopsis:

1. VMware might not support more than 100 connections per vmk port.
2. It is good to separate vMotion onto its own vmk NIC and VLAN rather than sharing the management port (make sure the vmks are in different subnets).
3. The firewall in ESXi might not be able to handle the traffic.
4. Check ethtool for any network drops (checking for duplicate IPs always helps).
5. The NFS datastore has more than 20,000 files (ls -lR | wc -l).
6. ESXi resources (CPU/memory; there might be other performance issues).
7. Disable HA/DRS to see if the disconnects disappear (related to datastores with a lot of files).
8. vpxa/vpxd crashes.

Details:

1. How to check the number of connections per vmk?

Quick script:


~ # esxcli network ip connection list | sort -k 4| awk '{print $4}'|cut -d ":" -f 1| uniq -c
 1 ------------------
14 0.0.0.0
394 10.X.Y.150   <<<< all were stuck at FIN_WAIT_2
40 127.0.0.1
7 192.168.5.1
1 Send
~ # esxcli network ip connection list | sort -k 4
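
The same per-IP count can be done offline on a saved copy of the listing. Below is a minimal Python sketch, assuming the listing is whitespace-separated with the local address in the 4th field and the TCP state in the 6th. The SAMPLE text is made up for illustration (it is not a real capture), and the function names are hypothetical. The default limit of 100 comes from the synopsis above.

```python
from collections import Counter

# Illustrative listing only (not a real capture); the column layout is an assumption.
SAMPLE = """\
Proto  Recv Q  Send Q  Local Address     Foreign Address    State        World ID  World Name
-----  ------  ------  ----------------  -----------------  -----------  --------  ----------
tcp         0       0  10.0.0.150:443    10.0.0.20:51012    FIN_WAIT_2          0
tcp         0       0  10.0.0.150:443    10.0.0.20:51013    FIN_WAIT_2          0
tcp         0       0  10.0.0.150:902    10.0.0.30:40100    ESTABLISHED      2786  vpxa
tcp         0       0  127.0.0.1:8307    127.0.0.1:61848    ESTABLISHED      2790  hostd-worker
udp         0       0  0.0.0.0:123       0.0.0.0:0                           2801  ntpd
"""


def count_connections(listing):
    """Return (connections per local IP, FIN_WAIT_2 connections per local IP)."""
    per_ip = Counter()
    fin_wait_2 = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines.
        if len(fields) < 5 or fields[0] not in ("tcp", "udp"):
            continue
        local_ip = fields[3].rsplit(":", 1)[0]  # drop the port
        per_ip[local_ip] += 1
        if fields[0] == "tcp" and len(fields) > 5 and fields[5] == "FIN_WAIT_2":
            fin_wait_2[local_ip] += 1
    return per_ip, fin_wait_2


def over_limit(per_ip, limit=100):
    """Return the local IPs whose connection count is above the limit."""
    return sorted(ip for ip, count in per_ip.items() if count > limit)


if __name__ == "__main__":
    per_ip, stuck = count_connections(SAMPLE)
    for ip, count in per_ip.most_common():
        print(count, ip, "FIN_WAIT_2:", stuck[ip])
```

Unlike the shell pipeline, this skips the header rows, so there are no "Send" or "------" entries in the result.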

2. Separate the vMotion and management vmknic/VLAN.

3. Firewall settings:
~ # esxcli network firewall get
   Default Action: PASS
   Enabled: false 
   Loaded: true   <<<< it should be loaded for HA to work

4. Check each vmnic for errors or packet drops caused by the switch, cable, or port:

     ethtool -S vmnic0 |grep error| grep -v :\ 0
     ethtool -S vmnic0 |grep rx_no_buffer_count | grep -v :\ 0
     ethtool -S vmnic0 |grep drop

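If the ethtool output is saved to a file, the three greps in item 4 can be folded into one pass. This is only a sketch: it assumes the usual "name: value" layout of ethtool -S, and the SAMPLE counters below are invented for illustration.

```python
# Invented counters for illustration; real names vary by NIC driver.
SAMPLE = """\
NIC statistics:
     rx_packets: 1200345
     tx_packets: 998877
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 12
     rx_no_buffer_count: 3
     rx_crc_errors: 0
"""


def nonzero_problem_counters(stats_text, keywords=("error", "drop", "no_buffer")):
    """Return {counter: value} for non-zero counters whose name contains a keyword."""
    problems = {}
    for line in stats_text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" heading and anything that is not "name: number".
        if not sep or not value.isdigit():
            continue
        if int(value) > 0 and any(word in name for word in keywords):
            problems[name] = int(value)
    return problems


if __name__ == "__main__":
    for name, value in nonzero_problem_counters(SAMPLE).items():
        print(name, value)
```

A counter that is non-zero once is less interesting than one that keeps growing, so it helps to run this twice a few minutes apart and compare.
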
5. Duplicate IP address: while pinging the ESXi host, run arp from a different node and check whether the MAC address for that IP changes,
or ssh to ESXi and see if the session keeps disconnecting.

 esxcli network ip neighbor list    (on ESXi: see if any MAC entries are flapping)
 arp -a                             (on Linux)



6. Run esxcfg-vmknic -l and make sure none of the IPs are duplicated or in conflict, and that there are no multiple vmks in the same subnet.
7. Finally, review vpxa.log, vmkernel.log (APD) and hostd.log on ESXi, as well as vpxd.log on vCenter. Check the vmkernel logs for vpxa crashes.
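
For the subnet check in item 6, here is a rough Python sketch. It assumes the name, IP and netmask of each vmk have already been copied out of the esxcfg-vmknic -l output by hand. The function name and the example addresses are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_vmk_conflicts(vmks):
    """vmks is a list of (name, ip, netmask); return a list of (kind, name_a, name_b)."""
    parsed = [
        (name, ipaddress.ip_interface(f"{ip}/{mask}")) for name, ip, mask in vmks
    ]
    conflicts = []
    for (name_a, if_a), (name_b, if_b) in combinations(parsed, 2):
        if if_a.ip == if_b.ip:
            conflicts.append(("duplicate-ip", name_a, name_b))
        elif if_a.network.overlaps(if_b.network):
            conflicts.append(("same-subnet", name_a, name_b))
    return conflicts


if __name__ == "__main__":
    # Example addresses only.
    example = [
        ("vmk0", "10.0.0.5", "255.255.255.0"),  # management
        ("vmk1", "10.0.0.6", "255.255.255.0"),  # vMotion, wrongly in the same subnet
        ("vmk2", "10.0.1.5", "255.255.255.0"),
    ]
    print(find_vmk_conflicts(example))
```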

8. If the host is hung, the fix is to restart the management services (services.sh restart).

PR 831801: The default value of the FIN_WAIT_2 timer was erroneously set to TCPTV_KEEPINTVL * TCP_KEEPCNT = 75 * 0x400. This discrepancy causes sockets in the FIN_WAIT_2 state to exist for a much longer time, and if many such sockets accumulate, they might impact new socket creation.

10. Disable the firewall on ESXi (use these settings exactly, so HA will work):
 # esxcli network firewall get
   Default Action: DROP or PASS
   Enabled: false
   Loaded: true
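
When checking many hosts, the comparison against these settings can be scripted. This is a small sketch that parses the "key: value" lines shown above. The function names are hypothetical, and the sample text is the output from item 3.

```python
# Output of "esxcli network firewall get" as shown in item 3 above.
SAMPLE = """\
   Default Action: PASS
   Enabled: false
   Loaded: true
"""


def parse_firewall_get(text):
    """Turn 'key: value' lines into a dict of stripped strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def matches_recommended(settings):
    """True when the firewall is disabled but still loaded (needed for HA)."""
    return settings.get("Enabled") == "false" and settings.get("Loaded") == "true"


if __name__ == "__main__":
    print(matches_recommended(parse_firewall_get(SAMPLE)))
```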

12. Increase the vpxa thread count:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009217

13. NFS datastore has more than 20,000 files - vpxa could hang.

14. If there is memory contention, make sure ESXi has 2 GB of memory reserved.
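
For the NFS file count in item 13: ls -lR | wc -l also counts the directory headers, the "total" lines and the blank lines, so it overstates the number of files. Here is a minimal Python sketch that counts only files, assuming the datastore is mounted somewhere Python can walk. The demo builds a throwaway directory as a stand-in for a real datastore path.

```python
import os
import tempfile


def count_files(root):
    """Count the entries that os.walk reports as files under root."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


def exceeds_limit(root, limit=20000):
    """True when the tree holds more files than the limit from the notes above."""
    return count_files(root) > limit


def demo():
    """Build a small temporary tree and count it (stand-in for a datastore)."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "vm1"))
        for name in ("vm1/vm1.vmx", "vm1/vm1-flat.vmdk", "notes.txt"):
            with open(os.path.join(root, name), "w") as handle:
                handle.write("x")
        return count_files(root), exceeds_limit(root, limit=2)


if __name__ == "__main__":
    print(demo())
```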
 
 



Thursday, March 7, 2013

Heat Map Analysis

Versions Affected: 2.6.4; 3.0.2
Description
The heat_map_printer command is available in 3.0.2 and can be downloaded for
2.6.4 clusters (ithaca: ~jerome/heat_map_printer.2.6.4).

heat_map_printer should be run on a running, active cluster.

heat_map_printer calculates how many extent groups were accessed in each
tier during the last six hours on all the nodes.

On a 400 GB SSD-PCIe card, 200 GB is used for extent groups (stargate-data);
with replication factor 2, only 100 GB is available per node.
Solution
Output:

Column 1: time bucket (accessed during the last X seconds).
Column 2: number of extent groups accessed within that time frame
(for example, between 300 sec and 600 sec, 3930 extent groups were accessed).
Column 3: cumulative access
(for example, from 0 to 600 sec, 15907 = 11977 + 3930 extent groups were accessed).
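
To make the relationship between columns 2 and 3 concrete, here is a short Python sketch that parses histogram rows and checks that column 3 is the running sum of column 2. The rows are the first three SSD-PCIe rows from the output further down. The function names are hypothetical.

```python
from itertools import accumulate

# First three rows of the SSD-PCIe histogram shown in this post.
ROWS = """\
 <300      : 11977      11977
 <600      : 3930       15907
 <900      : 2457       18364
"""


def parse_histogram(text):
    """Return a list of (bucket_seconds, accessed, cumulative) tuples."""
    rows = []
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        parts = rest.split()
        if sep and label.startswith("<") and len(parts) == 2:
            rows.append((int(label[1:]), int(parts[0]), int(parts[1])))
    return rows


def cumulative_is_consistent(rows):
    """True when column 3 equals the running sum of column 2."""
    running = list(accumulate(accessed for _, accessed, _ in rows))
    return running == [cumulative for _, _, cumulative in rows]


if __name__ == "__main__":
    rows = parse_histogram(ROWS)
    print(rows)
    print(cumulative_is_consistent(rows))
```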

An extent group is 16 MB in size. Within the last 5 minutes, 11977 extent groups
were accessed, so the total is (11977 * 16)/1024 = 187.14 GB.
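
The same conversion as a tiny Python helper, in case it needs to be applied to every row. It assumes, as above, that every accessed extent group counts as a full 16 MB, so the result is best read as an upper bound. The function name is hypothetical.

```python
EXTENT_GROUP_MB = 16  # extent group size stated above


def egroups_to_gb(count, egroup_mb=EXTENT_GROUP_MB):
    """Convert a number of extent groups to GB (1 GB = 1024 MB)."""
    return count * egroup_mb / 1024


if __name__ == "__main__":
    print(round(egroups_to_gb(11977), 2))  # 187.14
```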

The output is divided into SSD, HDD and total histograms. The access data is
more accurate within the last 15 minutes for SSD and after 30 minutes for HDD, because
there is less ILM migration activity in that tier over those periods.

The last section lists the name of each VMDK and how busy it was, in
descending order.
Histogram for storage tier SSD-PCIe
 <300      : 11977      11977
 <600      : 3930       15907
 <900      : 2457       18364
 <1200     : 747        19111
 <1500     : 1134       20245
 <1800     : 1245       21490
 <2100     : 1445       22935
 <2400     : 935        23870
 <2700     : 988        24858
 <3000     : 1076       25934
 <3300     : 811        26745
 <3600     : 618        27363
Histogram for storage tier DAS-SATA
 <300      : 1082       1082
 <600      : 743        1825
 <900      : 663        2488
 <1200     : 433        2921
 <1500     : 374        3295
 <1800     : 854        4149
 <2100     : 724        4873
 <2400     : 660        5533
 <2700     : 969        6502
 <3000     : 729        7231
 <3300     : 806        8037
 <3600     : 481        8518
Total histogram
 <300      : 13059      13059
 <600      : 4673       17732
 <900      : 3120       20852
 <1200     : 1180       22032
 <1500     : 1508       23540
 <1800     : 2099       25639
 <2100     : 2169       27808
 <2400     : 1595       29403
 <2700     : 1957       31360
 <3000     : 1805       33165
 <3300     : 1617       34782
 <3600     : 1099       35881
VDisk id per egroup access: total egroups(35889)
(the different VM names in the fname field were replaced with "Username")
     47854:       1292 (03 pct) fname:Username-flat.vmdk
     22204:       1180 (03 pct) fname:Username-flat.vmdk
     22205:       1130 (03 pct) fname:Username-flat.vmdk
    132938:       1125 (03 pct) fname:Username-flat.vmdk
     78361:       1110 (03 pct) fname:Username-flat.vmdk
    139761:       1089 (03 pct) fname:Username-flat.vmdk
     81750:       1045 (02 pct) fname:Username-flat.vmdk
     80434:       1007 (02 pct) fname:Username-flat.vmdk
     22240:        967 (02 pct) fname:Username-flat.vmdk
    161460:        951 (02 pct) fname:Username-flat.vmdk
     61030:        949 (02 pct) fname:Username-flat.vmdk
    130895:        948 (02 pct) fname:Username-flat.vmdk
     41526:        842 (02 pct) fname:Username-flat.vmdk
    104769:        833 (02 pct) fname:Username-flat.vmdk
    128032:        796 (02 pct) fname:Username-flat.vmdk
     22282:        758 (02 pct) fname:Username-flat.vmdk
     63566:        752 (02 pct) fname:Username-flat.vmdk
    140027:        743 (02 pct) fname:Username-flat.vmdk
    103007:        670 (01 pct) fname:Username-flat.vmdk
     75787:        646 (01 pct) fname:Username-flat.vmdk
     22246:        642 (01 pct) fname:Username-flat.vmdk
    135960:        640 (01 pct) fname:Username-flat.vmdk
    155611:        565 (01 pct) fname:Username-flat.vmdk
    127394:        564 (01 pct) fname:Username-flat.vmdk
     59341:        540 (01 pct) fname:Username-flat.vmdk
      2535:        518 (01 pct) fname:Username-flat.vmdk
     30385:        509 (01 pct) fname:Username-flat.vmdk
     82923:        498 (01 pct) fname:Username-flat.vmdk
    161143:        443 (01 pct) fname:Username-flat.vmdk
     50827:        424 (01 pct) fname:Username-flat.vmdk
    162668:        413 (01 pct) fname:Username-flat.vmdk
     39376:        411 (01 pct) fname:Username-flat.vmdk
     33416:        396 (01 pct) fname:Username-flat.vmdk
     29321:        393 (01 pct) fname:Username-flat.vmdk
     31597:        355 (00 pct) fname:Username-flat.vmdk
     41560:        355 (00 pct) fname:Username-flat.vmdk
     58304:        302 (00 pct) fname:Username-flat.vmdk
    138675:        300 (00 pct) fname:Username-flat.vmdk
   1138125:        284 (00 pct) fname:Username-flat.vmdk
     38653:        282 (00 pct) fname:Username-flat.vmdk
    106381:        273 (00 pct) fname:Username-flat.vmdk
    131231:        267 (00 pct) fname:Username-flat.vmdk
     54950:        261 (00 pct) fname:Username-flat.vmdk
     46777:        261 (00 pct) fname:Username-flat.vmdk
    139497:        246 (00 pct) fname:Username-flat.vmdk
    124993:        237 (00 pct) fname:Username-flat.vmdk
     79091:        236 (00 pct) fname:Username-flat.vmdk
     49942:        225 (00 pct) fname:Username-flat.vmdk
    151952:        225 (00 pct) fname:Username-flat.vmdk
   1222704:        223 (00 pct) fname:Username-flat.vmdk
     22018:        220 (00 pct) fname:Username-flat.vmdk
     65750:        219 (00 pct) fname:Username-flat.vmdk
    108415:        211 (00 pct) fname:Username-flat.vmdk
    112976:        208 (00 pct) fname:Username-000001-delta.vmdk
    137102:        208 (00 pct) fname:Username-flat.vmdk
    131914:        206 (00 pct) fname:Username-flat.vmdk
     46847:        205 (00 pct) fname:Username-flat.vmdk
     22284:        200 (00 pct) fname:Username-flat.vmdk
    103593:        199 (00 pct) fname:Username-flat.vmdk
     22251:        199 (00 pct) fname:Username-flat.vmdk
    105040:        190 (00 pct) fname:Username-flat.vmdk
     48304:        187 (00 pct) fname:Username-flat.vmdk
    114661:        186 (00 pct) fname:Username-flat.vmdk
     49179:        186 (00 pct) fname:Username-flat.vmdk
    125596:        184 (00 pct) fname:Username-flat.vmdk
     81201:        183 (00 pct) fname:Username-flat.vmdk
    108229:        181 (00 pct) fname:Username-flat.vmdk
     83459:        181 (00 pct) fname:Username-flat.vmdk
     74649:        169 (00 pct) fname:Username-flat.vmdk
    112339:        165 (00 pct) fname:Username-flat.vmdk
     32009:        156 (00 pct) fname:Username-flat.vmdk
     38308:        155 (00 pct) fname:Username-flat.vmdk
     37874:        154 (00 pct) fname:Username-flat.vmdk
    113781:        152 (00 pct) fname:Username-flat.vmdk
     73465:        150 (00 pct) fname:Username-flat.vmdk
     60248:        150 (00 pct) fname:Username-flat.vmdk
     73301:        148 (00 pct) fname:Username-flat.vmdk
    345401:        145 (00 pct) fname:Username-flat.vmdk
     50071:        136 (00 pct) fname:Username-flat.vmdk
     37870:        133 (00 pct) fname:Username-flat.vmdk
     58256:        125 (00 pct) fname:Username-flat.vmdk
     61625:        123 (00 pct) fname:Username-flat.vmdk
    570923:         93 (00 pct) fname:Username-flat.vmdk
     14826:         53 (00 pct) fname:Username-flat.vmdk
    930800:          4 (00 pct) fname:vmware.log
    930512:          2 (00 pct) fname:vmware.log
      4221:          1 (00 pct) fname:protectedlist_
    133299:          1 (00 pct) fname:vmware.log
To anonymize the VM names in a heat map file:
 perl -pe 's/(fname:).*?-/${1}Username-/' heatmap.20130307095821
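
The per-VDisk lines can also be parsed for further sorting or filtering. This is a sketch in Python, assuming the line layout shown above (vdisk id, egroup count, percent, file name). The percent column looks like the count divided by the total egroups, truncated to a whole number; that is inferred from the numbers in this output, not from the tool's documentation.

```python
import re

# A few lines copied from the output above.
SAMPLE = """\
     47854:       1292 (03 pct) fname:Username-flat.vmdk
     81750:       1045 (02 pct) fname:Username-flat.vmdk
     31597:        355 (00 pct) fname:Username-flat.vmdk
    930800:          4 (00 pct) fname:vmware.log
"""

LINE_RE = re.compile(r"^\s*(\d+):\s+(\d+)\s+\((\d+) pct\)\s+fname:(\S+)")


def parse_vdisk_lines(text):
    """Return a list of (vdisk_id, egroups, pct, fname) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            vdisk_id, egroups, pct, fname = match.groups()
            rows.append((int(vdisk_id), int(egroups), int(pct), fname))
    return rows


def truncated_pct(egroups, total):
    """Whole-number percentage, rounded down (inferred from the sample output)."""
    return egroups * 100 // total


if __name__ == "__main__":
    for vdisk_id, egroups, pct, fname in parse_vdisk_lines(SAMPLE):
        print(vdisk_id, egroups, pct, truncated_pct(egroups, 35889), fname)
```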