Tuesday, January 27, 2015

NCC - The Swiss Army Knife of Nutanix Troubleshooting Tools.



The Swiss Army knife is a pocket-sized multi-tool that equips you for everyday challenges. In the same way, NCC (Nutanix Cluster Check) packages multiple Nutanix troubleshooting tools into one utility.

NCC provides multiple utilities (plugins) that allow the Nutanix infrastructure administrator to
  • check the health of the hypervisor, Nutanix cluster components, network, and hardware
  • identify misconfigurations that can cause performance issues
  • collect logs for a specific time period and for specific components
  • run NCC automatically at a configurable interval and email the results, if needed
NCC is developed by Nutanix Engineering based on input from support engineers, customers, on-call engineers, and solution architects. It helps Nutanix customers identify a problem and either fix it themselves or report it to Nutanix Support, and it enables faster resolution by reducing the time needed to triage an issue.

When should we run NCC?
  • After a new install.
  • Before and after any cluster activity - add node, remove node, reconfiguration, or an upgrade (a quick before/after sketch follows this list).
  • Any time you are troubleshooting an issue.
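For example, a simple way to compare cluster health before and after an upgrade is to capture the NCC output each time and diff the two runs. This is only a minimal sketch; the file names are examples:

# capture a baseline before the change
ncc health_checks run_all | tee ~/ncc_before_upgrade.txt
# ... perform the upgrade or other cluster activity ...
# capture the state afterwards and compare the two runs
ncc health_checks run_all | tee ~/ncc_after_upgrade.txt
diff ~/ncc_before_upgrade.txt ~/ncc_after_upgrade.txt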


As mentioned in the cluster health blog, NCC is the collector agent for cluster health.

Roadmap for NCC 1.4 (Feb 2015) and beyond:
  • Alerting infrastructure (ability to configure email and the frequency of the checks) - (Update: NOS 4.1.3/NCC 2.0 - Jun 2015)
  • Ability to dynamically add new checks to the cluster health or alert framework after an NCC upgrade.
  • Run NCC from the Prism UI and upload the log collector output to the Nutanix FTP site.
  • Link to the KB that provides details for a specific failure (fix it yourself) - (Update: NOS 4.1.x/NCC 1.4.1)
  • Re-run only the failed tests.
  • Link pre-upgrade checks to the latest NCC checks.


a. Download and upgrade NCC to 1.3.1 
http://portal.nutanix.com -> Downloads -> Tools and Firmware

NCC 1.3.1 (the latest version as of Jan 27, 2015) is the default version bundled with NOS 4.1.1.
1. Upgrading NCC from the CLI:
  • Copy the installer to a CVM: Jeromes-Macbook:~ jerome$ scp Downloads/nutanix-ncc-1.3.1-latest-installer.sh nutanix@10.1.65.100:
  • Log in to the CVM and run "chmod u+x nutanix-ncc-1.3.1-latest-installer.sh; ./nutanix-ncc-1.3.1-latest-installer.sh"
2. NCC upgrade UI screenshot: (NCC 1.4 can be upgraded via the UI starting with NOS 4.1.1.)

b. Executing NCC health checks:

1. ncc health_checks - shows the list of health checks



2. Execute "ncc health_checks run_all" and monitor for messages other than PASS.

3. List of NCC check statuses


4. Results of an NCC check run on a lab cluster


5. Displaying and analyzing the failed tests.



The FAILUREs here are due to sub-optimal CVM memory and network errors. To fix them:
- Increase the CVM memory to 16 GB or more (KB 1513 - https://portal.nutanix.com/#/page/kbs/details?targetId=kA0600000008djKCAQ).
- Check the network for rx_missed_errors - look for network port flaps and network driver issues (KB 1679 and KB 1381); see the sketch below.
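To see whether rx_missed_errors is still increasing, you can look at the NIC statistics on the host. A minimal sketch for a Linux-based host follows (assumptions: the interface is eth0 and ethtool is available; on ESXi hosts the equivalent counters are read with the hypervisor's own tools - see the KBs above):

# assumption: eth0 is the physical NIC carrying CVM traffic
ethtool -S eth0 | grep rx_missed_errors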

c. Log Collector feature of NCC: (similar to Cisco's show tech-support or VMware's vm-support)

The NCC log collector collects logs from all the CVMs in parallel.

1. Execute "ncc log_collector" to see the list of logs that will be collected.



2. To collect all the logs for the last 4 hours: ncc log_collector run_all
For example, stargate.INFO will show the time period for which logs were collected:
****Log Collector Start time = 2015/01/27-09:19:15 End time = 2015/01/27-13:19:15 ****
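After extracting the tar bundle, you can confirm the window a particular log covers by grepping for that marker. This is a trivial sketch; the path below assumes the per-CVM directory layout shown in the next step:

# confirm the collection window recorded inside a collected log
grep "Log Collector Start time" xx.yy.65.100-logs/stargate.INFO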

3. To anonymize the logs (IP addresses, cluster name, usernames):
"ncc log_collector --anonymize_output=true run_all"
The directory listing within the tar bundle:
nutanix@NTNX-13SM15010001-C-CVM::~/data/log_collector/NCC-logs-2015-01-27-640-1422393524$ ls
cluster_config  xx.yy.64.165-logs  xx.yy.65.100-logs  xx.yy.65.98-logs
command.txt     xx.yy.64.166-logs  xx.yy.65.101-logs  xx.yy.65.99-logs

4. To collect logs only for the stargate component: "ncc log_collector cvm_logs --component_list=stargate"

Additional Options:




5. To collect logs from only one CVM (here, only the alert logs): "ncc log_collector --cvm_list=10.1.65.100 alerts"
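These options can be combined. For example, to collect only the stargate logs from a single CVM and anonymize the output, something like the following should work (a sketch that simply combines the flags shown above; verify the exact syntax in your NCC version):

ncc log_collector --cvm_list=10.1.65.100 --anonymize_output=true cvm_logs --component_list=stargate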

6. More options and filters:




d. Auto-run of NCC health checks and emailing the results

Verify that email alerting is enabled:
nutanix@NTNX-13SM15010001-C-CVM:~$ ncli alert get-alert-config
    Alert Email Status        : Enabled
    Send Email Digest         : true
    Enable Default Nutanix... : true
    Default Nutanix Email     : al@nutanix.com
    Email Contacts            : j@nutanix.com
    SMTP Tunnel Status        : success
    Service Center            : n.nutanix.net
    Tunnel Connected Since    : Wed Jan 21 13:58:03 PST 2015
Enable auto-run of NCC (here, every 24 hours):
ncc --set_email_frequency=24
Verify the config:

ncc --show_email_config     
[ info ] NCC is set to send email every 24 hrs.
[ info ] NCC email has not been sent after last configuration.
[ info ] NCC email configuration was last set at 2015-01-27 13:47:32.956196.


Sample Email:


Monday, January 12, 2015

Nutanix Shadow Clones



Introduction:

Nutanix shadow clones were introduced in NOS 3.2 to provide quicker reads for read-only, multi-reader vmdks that are accessed from multiple hosts. From NOS 3.5.5 and 4.0.2 onwards, the shadow clones feature is enabled by default.
Examples of multi-reader vmdks are linked-clone replicas in VMware View and the MCS master image in Citrix XenDesktop.


Design Info:

Each VM is made up of vdisks/vmdks. Based on where a vdisk is powered on or accessed for the first time, the Controller VM (vdisk controller) of that host manages the vdisk. This provides data locality, reduced network traffic, and many other benefits.

With multi-reader vmdks, multiple hosts can access (read) a single vdisk, so the read requests from multiple nodes have to be managed by that one Controller VM. The NFS server/vdisk controller process (stargate) is capable of handling multiple read requests from different nodes.

However, it is preferable for stargate to detect multi-reader vmdks automatically and create "shadow" clones on demand, so that each node can serve reads locally.

Before the shadow clone feature is enabled:

  • The master replica is owned by one CVM.
  • All reads are redirected to that CVM.
  • The extent cache of that CVM is populated.
  • There is no data localization on the remote CVM from which the reads came.

After the shadow clone feature is enabled:

  • The master replica is still owned by one CVM.
  • When a remote linked-clone VM accesses the replica, a shadow clone of that replica vdisk is created on demand via a zero-copy snapshot and is owned by the remote CVM.
  • New reads populate the extent cache, and this extent cache is shared by all vdisks on that CVM. Least recently used (LRU) data is evicted from the extent cache, if needed.
  • If there is a write to the master replica, the shadow clones are destroyed.
  • If a VM migrates to another node, either a new shadow clone is created or it uses the existing shadow clone owned by the vdisk controller (CVM) of that node.
  • Logic: if a vdisk has reads from 2 or more nodes AND
    "no writes for 300 seconds OR the number of remote reads exceeds 100",
    then a shadow clone is created.
    These thresholds can be changed via gflags (Note: do not change gflags without consulting Nutanix Support):
--stargate_nfs_adapter_read_shadow_remote_read_threshold=100
--stargate_nfs_adapter_read_shadow_threshold_secs=300
 --stargate_nfs_adapter_read_shadow_min_remote_nodes=2

 
Configuration:
  1. To verify that the shadow clone feature is enabled:

ncli cluster get-params|grep -i shadow

Shadow Clones Status      : Enabled
Another way to verify:
zeus_config_printer |grep shad
shadow_clones_enabled: true

  2. If it is disabled, enable it with "ncli cluster edit-params enable-shadow-clones=true".
  3. If you have a linked-clone environment, find the master replica disk's vdisk_id using the following command:

vdisk_config_printer (for example: vdisk_config_printer | grep -B 12 shadow | grep -B 10 -A 3 replica)

          vdisk_id: 58810154  <<<<<<
          vdisk_name: "NFS:58810154"
          vdisk_size: 4398046511104
          container_id: 1519
          creation_time_usecs: 1406123698498177
           vdisk_creator_loc: 4
           vdisk_creator_loc: 58557595
           vdisk_creator_loc: 34032844
           nfs_file_name: "replica-4e5bf2ad-c5d2-4e9d-8a91-f7b17389872e_7-flat.vmdk"  <<<<<<
           may_be_parent: true
           shadow_read_requests: true  <<<<
  4. Verify that shadow clones were created for vdisk id 58810154. As you can see, the parent vdisk is the replica vdisk (58810154), and the shadow clone vdisks are created on different CVMs (@10 and @6 are CVM ids); see the count sketch after this output.
vdisk_config_printer |grep 58810154
vdisk_name: "NFS:58810154#58810154@10"
parent_vdisk_id: 58810154
vdisk_name: "NFS:58810154#58810154@6"
parent_vdisk_id: 58810154
vdisk_name: "NFS:58810154#58810154@25525172"
parent_vdisk_id: 58810154
vdisk_name: "NFS:58810154#58810154@25525291"
parent_vdisk_id: 58810154
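A quick way to see how many shadow clones exist for the replica is to count the vdisks that list it as their parent. This only reuses the command shown above; 58810154 is the replica vdisk id from this lab:

# count vdisks whose parent is the replica vdisk
vdisk_config_printer | grep -c "parent_vdisk_id: 58810154"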
  5. The properties of the shadow clone (note the mutability_state of kImmutableShadow). Here I selected the one on CVM id 10.
vdisk_config_printer |grep NFS:58810154#58810154@10 -B 2 -A 12
vdisk_id: 58825863
vdisk_name: "NFS:58810154#58810154@10"
parent_vdisk_id: 58810154
vdisk_size: 4398046511104
container_id: 1519
creation_time_usecs: 1406125098160169
mutability_state: kImmutableShadow  <<<<<<<
vdisk_creator_loc: 4
vdisk_creator_loc: 58557595
vdisk_creator_loc: 34032844
nfs_file_name: "replica-4e5bf2ad-c5d2-4e9d-8a91-f7b17389872e_7-flat.vmdk"
generate_vblock_copy: true
parent_nfs_file_name_hint: "replica-4e5bf2ad-c5d2-4e9d-8a91-f7b17389872e_7-flat.vmdk"

Performance Gains:
Shadow clones improve boot storms as well as reads from linked-clone VMs.
Additional info on performance gains:
Andre’s Blog - http://myvirtualcloud.net/?p=5979
Kees Baggerman's Blog - AppVolume Performance improvements