Tuesday, June 18, 2013

Curator does the thankless job of keeping a Nutanix cluster Clean and Lean.

Curator uses MapReduce logic to clean up deleted vdisks and containers and to update reference counts.
It monitors under-replicated and over-replicated extent groups, and redistributes extent groups for node and block awareness. Based on upper and lower thresholds, it migrates extent groups between tiers according to the "hotness" of the data. A partial scan is initiated every 30 minutes if there is work to do ("to remove", "ILM needed", or disk-space utilization above threshold). A full scan, initiated every 6 hours, does the additional work of updating reference counts. For ILM and extent-group cleanup, Curator finds the relevant extent groups and tells Stargate to do the actual work; Chronos acts as admission control on how many of these jobs are forwarded to Stargate at a time.

BTW, Curator does all of this in the background, and a Nutanix cluster ships with sensible gflags and
settings for it. With every release of the Nutanix software, more of these configs
are tuned automatically based on workload.

Gflags for configuring Curator (if you need to change gflags, please contact the Nutanix
support or sales team):
1. Lower ILM threshold: ncli sp edit ilm-thresh (default 70)
2. Upper ILM threshold: --curator_tier_usage_ilm_threshold_percent
3. How much to migrate between tiers, down toward the lower threshold: --curator_tier_free_up_percent_by_ilm
4. How often Chronos asks Stargate to work on Curator jobs: --chronos_master_handshake_period_msecs
5. --curator_next_tier_usage_ilm_threshold_percent=95 (default 90): migrate to the next tier only if the next tier's usage is below this percentage.
6. --curator_full_scan_period_secs: how often a full scan is run.
7. --chronos_master_node_max_active_requests: number of requests sent to Stargate at every handshake.
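
To see which values are currently in effect, you can read them off each node's Curator page. A quick sketch, assuming the standard /h/gflags handler that Nutanix services expose on their status port (2010 for Curator):

for svm in `svmips`; do echo "SVM: $svm"; wget -qO- "http://$svm:2010/h/gflags" | \
egrep "curator_tier_usage_ilm_threshold_percent|curator_tier_free_up_percent_by_ilm|curator_full_scan_period_secs"; done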


How to manually run a full scan:

for svm in `svmips`; do wget -O - "http://$svm:2010/master/api/client/StartCuratorTasks?task_type=2"; done
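
To confirm the scan actually kicked off, one option is to grep the Curator logs on each CVM. The log path follows the standard CVM layout, but the exact message text here is an assumption, so adjust the pattern to whatever your curator.INFO actually logs:

for svm in `svmips`; do echo "SVM: $svm"; ssh $svm "grep -i 'full scan' data/logs/curator.INFO | tail -2"; done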


Here is an example of hot-tier usage and ILM migration activity when these params are set aggressively; this
can cause unnecessary network traffic and I/O activity. As the figure showed, before Apr 17th
there were a lot of migrations out of the SSD tier.

[Figure: SSD-tier usage over time, showing heavy ILM migration activity before Apr 17th]


How to check how much of your data was accessed in the last 30 minutes in any tier?
See the heat-map-analysis post.

 
How to find when Curator full scans and partial scans were run?

Curator Jobs

Job id | Execution id | Job name | Status | Reasons | Zeus config valid | Start time | End time | Total time (secs)
1 | 65656 | Partial Scan | Succeeded | ILM | Yes | Tue Jun 18 10:34:18 2013 | Tue Jun 18 10:39:51 2013 | 333
1 | 65654 | Partial Scan | Succeeded | ILM | Yes | Tue Jun 18 10:03:48 2013 | Tue Jun 18 10:09:08 2013 | 320
1 | 65652 | Partial Scan | Succeeded | ILM | Yes | Tue Jun 18 09:33:18 2013 | Tue Jun 18 09:38:51 2013 | 333
1 | 65650 | Partial Scan | Succeeded | ILM | Yes | Tue Jun 18 09:03:17 2013 | Tue Jun 18 09:08:50 2013 | 333
1 | 65647 | Partial Scan | Succeeded | ILM | Yes | Tue Jun 18 08:32:47 2013 | Tue Jun 18 08:38:07 2013 | 320
0 | 65642 | Full Scan | Succeeded | ILM ToRemove | Yes | Tue Jun 18 08:02:17 2013 | Tue Jun 18 08:17:50 2013 | 933
1 | 65640 | Partial Scan | Succeeded | Periodic | Yes | Tue Jun 18 07:50:28 2013 | Tue Jun 18 07:55:49 2013 | 321
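
Note how this history matches the periods described above: partial scans roughly every 30 minutes (reason ILM), and a full scan (reasons ILM and ToRemove) on the 6-hour cycle that takes about three times as long as a partial scan.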

Tier Usage:

Storage Pool: NTNX-SP1 (ILM Down Migrate threshold: 85)

Tier Name | Tier Usage | Tier Size | Tier Usage Pct
SSD-PCIe | 1355.50 GB | 1481.57 GB | 91%
SSD-SATA | N/A | N/A | N/A
DAS-SATA | 15362.15 GB | 51371.99 GB | 29%
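
Reading this table against the down-migrate threshold: 85% of the 1481.57 GB SSD-PCIe tier is about 1259 GB, and current usage is 1355.50 GB (91%), so the next scan should queue ILM down-migrations of roughly 96 GB, plus whatever extra headroom curator_tier_free_up_percent_by_ilm asks for.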

Is disk usage balanced across all nodes?

Storage Pool: NTNX-SP1 Tier: SSD-PCIe

Mean Usage Pct: 92%
Zone of Balance: 85% - 99%
Usage Spread Pct: 8%
Status: Balanced
Rack Id | Service VM | Disk Id | Disk Usage | Disk Size | Disk Usage Pct | Inside Zone of Balance
453898548 | 849049320 | 9 | 88.31 GB | 93.09 GB | 94% | Yes
453898548 | | 2235 | 88.48 GB | 93.09 GB | 95% | Yes
453898548 | | 2348 | 88.02 GB | 93.09 GB | 94% | Yes
453898548 | | 2859 | 87.77 GB | 93.09 GB | 94% | Yes
453898552 | 673956823 | 36979195 | 88.48 GB | 93.09 GB | 95% | Yes
453898552 | 673956843 | 36979174 | 88.61 GB | 93.09 GB | 95% | Yes
453898552 | 673956863 | 36979211 | 88.58 GB | 93.09 GB | 95% | Yes
453898552 | 673956883 | 36979184 | 88.41 GB | 93.09 GB | 94% | Yes
490725470 | 490725463 | 490725475 | 163.83 GB | 184.21 GB | 88% | Yes
490725470 | 490725465 | 490725476 | 160.78 GB | 184.21 GB | 87% | Yes
490725470 | 490725467 | 490725477 | 163.83 GB | 184.21 GB | 88% | Yes
490725470 | 490725471 | 490725478 | 160.41 GB | 184.21 GB | 87% | Yes

What activities were done during the last partial scan?
MapReduce job 65657
Job id: 65657
Job name: PartialScan MapReduce
Status: Succeeded
Map tasks done: 36/36
Reduce tasks done: 24/24
Start time: Tue Jun 18 10:35:18 2013
End time: Tue Jun 18 10:39:48 2013
Total time (secs): 270

Map Tasks

Task id | Task Type | Desired Status | Status | Node id | Start time | End time | Total time (secs)
0 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:37:31 2013 | 133
1 | ExtentGroupIdMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:37:37 2013 | 139
2 | ExtentGroupIdMapTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:59 2013 | 101
3 | ExtentGroupIdMapTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:51 2013 | 93
4 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:17 2013 | 59
5 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:37:06 2013 | 108
6 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:37:48 2013 | 150
7 | ExtentGroupIdMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:14 2013 | 56
8 | ExtentGroupIdMapTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:34 2013 | 76
9 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:53 2013 | 95
10 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:57 2013 | 99
11 | ExtentGroupIdMapTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:34 2013 | 76
12 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:31 2013 | 13
13 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:14 2013 | 56
14 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:59 2013 | 41
15 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:51 2013 | 33
16 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:17 2013 | 59
17 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:36:06 2013 | 48
18 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:48 2013 | 30
19 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 10:35:32 2013 | Tue Jun 18 10:36:31 2013 | 59
20 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:34 2013 | 16
21 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:53 2013 | 35
22 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:37 2013 | 19
23 | ExtentGroupAccessDataMapTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:34 2013 | 16
24 | VDiskOplogMapTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:35:57 2013 | 39
25 | VDiskOplogMapTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 10:35:34 2013 | Tue Jun 18 10:36:34 2013 | 60
26 | VDiskOplogMapTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 10:35:34 2013 | Tue Jun 18 10:36:34 2013 | 60
27 | VDiskOplogMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 10:35:37 2013 | Tue Jun 18 10:36:37 2013 | 60
28 | VDiskOplogMapTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 10:35:48 2013 | Tue Jun 18 10:36:48 2013 | 60
29 | VDiskOplogMapTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 10:35:51 2013 | Tue Jun 18 10:36:51 2013 | 60
30 | VDiskOplogMapTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 10:35:53 2013 | Tue Jun 18 10:36:53 2013 | 60
31 | VDiskOplogMapTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 10:35:57 2013 | Tue Jun 18 10:36:57 2013 | 60
32 | VDiskOplogMapTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 10:35:59 2013 | Tue Jun 18 10:36:59 2013 | 60
33 | VDiskOplogMapTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 10:36:06 2013 | Tue Jun 18 10:37:06 2013 | 60
34 | VDiskOplogMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:36:14 2013 | Tue Jun 18 10:37:14 2013 | 60
35 | VDiskOplogMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:36:14 2013 | Tue Jun 18 10:37:14 2013 | 60

Reduce Tasks

Task id | Task Type | Desired Status | Status | Node id | Start time | End time | Total time (secs)
0 | DiskIdReduceTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:31 2013 | 253
1 | DiskIdReduceTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:31 2013 | 253
2 | DiskIdReduceTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:17 2013 | 239
3 | DiskIdReduceTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:17 2013 | 239
4 | DiskIdReduceTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:37 2013 | 259
5 | DiskIdReduceTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:37 2013 | 259
6 | DiskIdReduceTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:59 2013 | 221
7 | DiskIdReduceTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:59 2013 | 221
8 | DiskIdReduceTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:34 2013 | 256
9 | DiskIdReduceTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:34 2013 | 256
10 | DiskIdReduceTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:51 2013 | 213
11 | DiskIdReduceTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:51 2013 | 213
12 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:06 2013 | 228
13 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:06 2013 | 228
14 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:14 2013 | 236
15 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:14 2013 | 236
16 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:53 2013 | 215
17 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:53 2013 | 215
18 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:34 2013 | 256
19 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:34 2013 | 256
20 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:57 2013 | 219
21 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:38:57 2013 | 219
22 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:48 2013 | 270
23 | ExtentGroupIdReduceTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 10:35:18 2013 | Tue Jun 18 10:39:48 2013 | 270

Job Counters

Name | Value
MapExtentGroupIdMap | 535252
ReduceDiskIdExtentGroupId | 1070510
MapExtentGroupAccessDataMap | 535213
NumExtentGroupsToMigrateForILM | 4740
NumExtentGroupsToMigrateForDiskBalancing | 0
MapVDiskOplogMap | 764
NumHostVDiskTasks | 3
FgHostVDiskTaskCount | 3
FgDeleteToRemoveOplogMapEntryTaskCount | 0
FgDeleteVDiskBlocksTaskCount | 0
MapVDiskBlockMap | 0
NumExtentGroupsWithReplicaOnSameNode | 0
NumExtentGroupsWithReplicaOnSameRack | 3275
NumFixExtentGroupsTasksReplicaOnSameRack | 23
FgDeleteExtentGroupsWithNonEidExtentsTaskCount | 0
NumExtentGroupsWithNonEidExtentsToDelete | 0
NumInvalidExtentGroupAccessDataMapEntries | 0
BgFixExtentGroupTaskCount | 5779
BgMergeExtentGroupsTaskCount | 0
BgCompressExtentsTaskCount | 0
BgDeduplicateExtentTaskCount | 0
BgMigrateExtentsTaskCount | 0
BgCopyBlockmapMetadataTaskCount | 0
BgUpdateRefcountsTaskCount | 0
InternalError | 0

What activities were done during the full scan?

MapReduce job 65643
Job id: 65643
Job name: FullScan MapReduce #1
Status: Succeeded
Map tasks done: 25/25
Reduce tasks done: 24/24
Start time: Tue Jun 18 08:03:32 2013
End time: Tue Jun 18 08:06:33 2013
Total time (secs): 181

Map Tasks

Task id | Task Type | Desired Status | Status | Node id | Start time | End time | Total time (secs)
0 | VDiskOplogMapTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:50 2013 | 17
1 | VDiskOplogMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:13 2013 | 40
2 | VDiskOplogMapTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:13 2013 | 40
3 | VDiskOplogMapTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:52 2013 | 19
4 | VDiskOplogMapTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:52 2013 | 19
5 | VDiskOplogMapTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:33 2013 | 60
6 | VDiskOplogMapTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:33 2013 | 60
7 | VDiskOplogMapTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:56 2013 | 23
8 | VDiskOplogMapTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:56 2013 | 23
9 | VDiskOplogMapTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:47 2013 | 14
10 | VDiskOplogMapTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:47 2013 | 14
11 | VDiskOplogMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 08:03:36 2013 | Tue Jun 18 08:04:36 2013 | 60
12 | NfsInodeMapTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:30 2013 | 57
13 | NfsInodeMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:36 2013 | 3
14 | NfsInodeMapTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:30 2013 | 57
15 | NfsInodeMapTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:58 2013 | 25
16 | NfsInodeMapTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:16 2013 | 43
17 | NfsInodeMapTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:58 2013 | 25
18 | NfsInodeMapTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:33 2013 | 60
19 | NfsInodeMapTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:04 2013 | 31
20 | NfsInodeMapTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:16 2013 | 43
21 | NfsInodeMapTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:50 2013 | 17
22 | NfsInodeMapTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:03:36 2013 | 3
23 | NfsInodeMapTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:33 2013 | 60
24 | NfsVDiskMapTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:04:04 2013 | 31

Reduce Tasks

Task id | Task Type | Desired Status | Status | Node id | Start time | End time | Total time (secs)
0 | NfsInodeReduceTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:30 2013 | 177
1 | NfsInodeReduceTask | Succeeded | Succeeded | 472452227 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:30 2013 | 177
2 | NfsInodeReduceTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:16 2013 | 163
3 | NfsInodeReduceTask | Succeeded | Succeeded | 472452000 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:16 2013 | 163
4 | NfsInodeReduceTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:36 2013 | 123
5 | NfsInodeReduceTask | Succeeded | Succeeded | 490493246 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:36 2013 | 123
6 | NfsInodeReduceTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:58 2013 | 145
7 | NfsInodeReduceTask | Succeeded | Succeeded | 490725549 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:58 2013 | 145
8 | NfsInodeReduceTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:33 2013 | 180
9 | NfsInodeReduceTask | Succeeded | Succeeded | 472337394 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:33 2013 | 180
10 | NfsInodeReduceTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:50 2013 | 137
11 | NfsInodeReduceTask | Succeeded | Succeeded | 490725550 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:50 2013 | 137
12 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:04 2013 | 151
13 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472581426 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:04 2013 | 151
14 | NfsDirectoryReduceTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:13 2013 | 160
15 | NfsDirectoryReduceTask | Succeeded | Succeeded | 490725552 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:13 2013 | 160
16 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:52 2013 | 139
17 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451018 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:52 2013 | 139
18 | NfsDirectoryReduceTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:33 2013 | 180
19 | NfsDirectoryReduceTask | Succeeded | Succeeded | 490725511 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:06:33 2013 | 180
20 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:56 2013 | 143
21 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451324 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:56 2013 | 143
22 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:47 2013 | 134
23 | NfsDirectoryReduceTask | Succeeded | Succeeded | 472451186 | Tue Jun 18 08:03:33 2013 | Tue Jun 18 08:05:47 2013 | 134

Job Counters

Name | Value
MapVDiskOplogMap | 764
FgHostVDiskTaskCount | 1
FgDeleteToRemoveOplogMapEntryTaskCount | 0
NumHostVDiskTasks | 1
FgAddNfsInodeContainerIdTaskCount | 0
NumNfsInodesUpdatedWithContainerId | 0
FgDeleteNfsInodesTaskCount | 0
NumNfsInodesDeleted | 0
NumNfsVDisksProcessed | 879
NfsReduceChildLinkCount | 5853
NfsReduceParentLinkCount | 5859
NfsReduceAttributeCount | 5859
NfsReduceVDiskCount | 879
FgFixNfsInodeLinksTaskCount | 0
FgFixNfsLinkAcrossContainersTaskCount | 0
FgFixNfsVDiskTaskCount | 0
FgDeleteNfsDirectoryCount | 0
BgFixExtentGroupTaskCount | 0
BgMergeExtentGroupsTaskCount | 0
BgCompressExtentsTaskCount | 0
BgDeduplicateExtentTaskCount | 0
BgMigrateExtentsTaskCount | 0
BgCopyBlockmapMetadataTaskCount | 0
BgUpdateRefcountsTaskCount | 0
InternalError | 0
Name | Id | Value
NumNfsDirectoryInodes | 1993 | 0
NumNfsDirectoryInodes | 1994 | 0
NumNfsDirectoryInodes | 2876 | 983
NumNfsDirectoryInodes | 3369989 | 193
NumNfsDirectoryInodes | 1995 | 0
NumNfsDirectoryInodes | 2877 | 211
NumNfsDirectoryInodes | 1996 | 0
NumNfsDirectoryInodes | 1933058 | 44
NumNfsDirectoryInodes | 1933059 | 261
NumNfsDirectoryInodes | 413365 | 0
NumNfsDirectoryInodes | 323487 | 3

Scripts to check Network Stats in a Nutanix Cluster.

A Nutanix cluster captures sysstats periodically, so you can graph them (for example with our Nagios tooling) and run scripts against them.
To check for network latency or unreachable hosts, use the following script, which scans ping_hosts.INFO:


for i in `svmips` ; do (echo ; echo "SVM: $i" ; ssh $i cat data/logs/sysstats/ping_hosts.INFO | egrep -v "IP : time" | \
awk '/^#TIMESTAMP/ || $3 > 13.00 || $3 == "unreachable"' | egrep -B1 " ms|unreachable" | egrep -v "\-\-" ); done

This prints any host that was unreachable or whose ping response took longer than 13 ms.


Here is another script that prints network utilization above roughly 1 Gbps (you can graph utilization with Nagios, but
Nagios does not combine the Rx and Tx rates):


for i in `svmips`; do (echo CVM:$i; ssh $i cat data/logs/sysstats/sar.INFO | egrep "eth0" | awk '/^#TIMESTAMP/ || \
$6 > 30000 || $7 > 30000' | egrep -B1 " eth0" | awk '{print $1,$2,$6,$7,($6+$7)/1024}' | awk '$5 > 120'); done
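
For reference on the awk fields: in sar's device report, $6 and $7 are rxkB/s and txkB/s, so ($6+$7)/1024 is the combined throughput in MB/s; the final $5 > 120 filter keeps samples above 120 MB/s, roughly 1 Gbps of combined traffic.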

Here is a modification of the above script to check the average bandwidth during a certain window (6 PM to midnight):


for i in `svmips`; do (echo CVM:$i; ssh $i cat data/logs/sysstats/sar.INFO |egrep "eth0"| awk '/^#TIMESTAMP/ || \
$6 > 30000 || $7 > 30000' | egrep -B1 " eth0" | awk '{print $1,$2,$6,$7,($6+$7)/1024}');done |\
 egrep "^06|^07|^08|^09|^10|^11"|grep PM|awk '{sum+=$5} END { print "Average = ",sum/NR}'
Or find the total number of times network utilization crossed the 2G mark during a certain window:
for i in `svmips`; do (echo CVM:$i; ssh $i cat data/logs/sysstats/sar.INFO | egrep "eth0" | awk '/^#TIMESTAMP/ || \
$6 > 30000 || $7 > 30000' | egrep -B1 " eth0" | awk '{print $1,$2,$6,$7,($6+$7)/1024}' | awk '$5 > 200'); done | \
 grep -v CVM | wc -l


This script was used to verify whether a customer's network usage dropped to 1G between 2 PM and 3 PM (the trailing egrep "^02" | grep PM keeps only that hour's samples):

for i in `svmips`; do (echo CVM:$i; ssh $i cat data/logs/sysstats/sar.INFO |egrep "eth0"| awk '/^#TIMESTAMP/ || \
$6 > 50000 || $7 > 50000' | egrep -B1 " eth0" | awk '{print $1,$2,$6,$7,($6+$7)/1024}');done | egrep "^02"|grep PM

Tuesday, June 11, 2013

Standby or unused uplink is used after rebooting an ESXi 5.0 Update 1 host

Versions Affected: ESXi 5.0; ESXi 5.0 Update 1
Description
Symptom:
diagnostics.py sequential write performance is poor, and esxtop (with the n switch) shows that the 1 Gbps network is being used instead of the 10 Gbps one.
Solution
This is due to VMware issues on ESXi 5.0 Update 1, explained in these KBs:
kb2008144
kb2030006
Workaround I: Remove the 1 Gbps NICs from the vSwitch configuration (validated)

esxcfg-nics -l     # find the 1 Gbps link ids (e.g. vmnic2 and vmnic3)
esxcfg-vswitch -l  # find the vSwitch and portgroups that use these links
esxcfg-vswitch -U vmnic2 vSwitch0


Workaround II:
To work around this issue, try setting the NIC Failback option to Yes at both the vSwitch and the port-group level.
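
If you prefer the command line to the vSphere client, here is a sketch using the esxcli failover-policy namespace available in ESXi 5.x (verify the exact options against your build):

# set failback on the vSwitch, then confirm the policy
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --failback=true
esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0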

Tags: Networking; VMware; Troubleshooting

Access Nutanix NFS from a different NFS client

A Nutanix NFS container can be exported to a non-Nutanix NFS client on a different subnet.

1. Whitelist the NFS client on the Nutanix cluster

ncli> cluster add-to-nfs-whitelist  ip-subnet-masks=10.1.59.210/255.255.255.255

where 10.1.59.210 is the non-Nutanix NFS client.

2. Verify that the NFS datastore is exported correctly; run this command on a Nutanix Controller VM:

showmount -e
Export list for TEST-13SM35190018-1-CVM:
/TEST-CTR1 10.3.177.28,10.3.177.27,10.3.177.26,10.3.177.25,10.1.59.210/255.255.255.255,192.168.5.0/255.255.255.128


3. The Nutanix CentOS image is STIG-compliant: iptables rules prevent access to the Nutanix CVM from another subnet. The following iptables rules allow NFS access; run these commands on a Controller VM (this is needed only if the Nutanix CVM and the NFS client are on different subnets).
Open the portmapper port:
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 111  -j ACCEPT"; done
Open the NFS/mountd port:
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2049 -j ACCEPT"; done
Save the rules so they persist, on each CVM:

for i in `svmips`; do ssh $i "sudo /etc/init.d/iptables save"; done
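
From the NFS client you can now confirm that the export and the newly opened ports are reachable (assuming showmount is installed on the client):

showmount -e 10.3.177.29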

4. Mount it on the remote .210 client (NFS client):

10.1.59.210:~$ sudo mount 10.3.177.29:/TEST-CTR1 /mnt
or, on an ESXi host: esxcfg-nas -a -o 10.3.177.29 -s /TEST-CTR1 NTNX-Datastore

5. This KB might be useful as well

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007352

CentOS Guest VM Hangs at eth0 on Every Alternate Boot on ESXi 5.0

Description
Symptom:
Every alternate reboot of the CentOS VM hangs at eth0.

Troubleshooting:
- Add set -x to /etc/sysconfig/network-scripts/ifup-eth to find exactly where it hangs.
- In this case it hung at arping, which was probing for a duplicate IP:
 if ! /sbin/arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${ipaddr[$idx]}


 
Solution
Root Cause:
arping uses real (wall-clock) time instead of relative time for its 3-second wait,
so if the wall clock is stepped back by an hour during those 3 seconds,
it waits for 1 hour and 3 seconds instead of 3 seconds. The
root cause here was the time difference between the CentOS VM and the ESXi host.

Workaround:

- Add a couple of seconds to arping's wait so there is no race condition with clock changes (see the sketch after this list), or
- Make sure the ESXi host and the CentOS VM have the correct time. (In one
customer case, the CentOS VM's clock was off by 2 hours; even though NTP was configured in the VM,
the time difference was too large for NTP to correct.) This is the preferred fix. Or
- If the CentOS VM has to keep a different time than ESXi, remove time sync
via VMware Tools (see the VMware KB).
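
A minimal sketch of the first workaround, as a hypothetical edit to /etc/sysconfig/network-scripts/ifup-eth that widens arping's duplicate-IP probe window from 3 to 5 seconds:

# was: -w 3; a couple of extra seconds rides out small backward clock steps
if ! /sbin/arping -q -c 2 -w 5 -D -I ${REALDEVICE} ${ipaddr[$idx]}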
Tags: Troubleshooting

Nutanix Disaster Recovery for Files

Description
This document augments the setup guide for creating file-level protection; the Nutanix
document for VM protection is very precise.
For a gold image: create a protection domain and a remote site,
add the gold-image VM to that PD, and once the data has migrated, run pd
activate on the remote site. Nutanix solution engineering is putting together
a white paper for gold-image DR.

Solution
Topology


/agave_all_container/Ubuntu11.goldimage-flat.vmdk and test.txt
(adonis cluster) --- LAN/WAN --- (haley cluster) --- ctr /agave_all_container


Commands:

1. Create Protection domain (adonis)

ncli pd create name=file_pd_test

2. Create a remote site (on adonis); vstore-map is the same as container-map:

ncli> remote-site create name=haley address-list="10.3.176.158,10.3.176.170,10.3.176.182" vstore-map="agave_all_container:agave_all_container" enable-proxy=true

3. Create the remote site on haley:

ncli> remote-site create name=adonis address-list="10.3.100.145,10.3.100.146,10.3.100.147,10.3.100.148" vstore-map="agave_all_container:agave_all_container" enable-proxy=true


4. Configure the files to be protected (on adonis):

ncli> pd protect name=file_pd_test files="/agave_all_container/Ubuntu11.goldimage-flat.vmdk"
    Protection Domain         : file_pd_test
    Active                    : true
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 1
        NFS File Name             : /agave_all_container/Ubuntu11.goldimage-flat.vmdk
        Consistency Group         : Ubuntu11.goldimage-flat
ncli>  pd protect name=file_pd_test files="/agave_all_container/test.txt"
    Protection Domain         : file_pd_test
    Active                    : true
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 2
        NFS File Name             : /agave_all_container/Ubuntu11.goldimage-flat.vmdk
        Consistency Group         : Ubuntu11.goldimage-flat
        NFS File Name             : /agave_all_container/test.txt
        Consistency Group         : test
5. To take an immediate snapshot (on adonis):

 ncli> pd add-one-time-snapshot name="file_pd_test" remote-sites=haley
    Action Id                 : 14649
    Start Time                : 06/10/2013 14:06:45 PDT
    Remote Sites              : haley
    Snapshot retention (secs) : Forever
6. ncli>  pd list-snapshots
    ID                        : 14653
    Protection Domain         : file_pd_test
    Create Time               : 06/10/2013 14:06:45 PDT
    Expiry Time               : 06/28/2081 17:20:52 PDT
    Virtual Machine(s)        : 0
    Replicated To Site(s)     : haley


Verifying on Haley:

root@NTNX-QTFCE521601498-1-CVM:10.3.100.145:/tmp/mnt/.snapshot/89/401-1370887974745103-14689# ls
Ubuntu11.goldimage-flat.vmdk

After the copy completes, verify on the Cerebro pages: http://haley-c1:2020 and http://adonis-c1:2020

On haley:
nutanix@Haley-0-A-CVM:10.3.176.158:~$ ncli pd ls
    Protection Domain         : pd_test
    Active                    : false
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 0
    Protection Domain         : partha
    Active                    : true
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 0
    Protection Domain         : file_pd_test
    Active                    : false
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 0
 ncli> pd activate  name=file_pd_test
Request to activate the protection domain file_pd_test is successful

ncli> pd ls
    Protection Domain         : pd_test
    Active                    : false
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 0
    Protection Domain         : partha
    Active                    : true
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 0
    Protection Domain         : file_pd_test
    Active                    : true
    Marked For Removal        : false
    Remote Sites              :
    Next Snapshot Time        : -
    Virtual Machine(s)        : 0
    NFS Files                 : 2
        NFS File Name             : /agave_all_container/Ubuntu11.goldimage-flat.vmdk
        Consistency Group         : Ubuntu11.goldimage-flat
        NFS File Name             : /agave_all_container/test.txt
        Consistency Group         : test

[root@Haley-0-A-CVM test]# ls
test.txt  Ubuntu11.goldimage-flat.vmdk