Friday, July 11, 2014

4.0 Feature: Cluster Health Framework

Nutanix cluster health is designed to detect and analyze Nutanix cluster-related failures and to give customers more visibility into issues by providing the cause, impact, and resolution for each one. Nutanix cluster health uses plugins from the NCC (Nutanix Cluster Check) utility.

NCC is developed by Nutanix Engineering from input provided by support engineers, customers, on-call engineers, and solution architects. Nutanix Engineering productized its troubleshooting scripts into NCC. Nutanix cluster health runs NCC plugins at various intervals and provides easy access to the results through the UI. Nutanix customers will be able to identify and troubleshoot cluster issues themselves, which should result in faster resolution. It also provides uniform troubleshooting tools across different hypervisors.

Nutanix cluster health can be accessed from Prism Element (URL: the CVM's IP address).

Prism Element First Page - Cluster Health Access


Nutanix Prism Element has a cluster health walk-through that shows an example of disk health troubleshooting.
 

Here is the list of cluster health checks:

CVM:
  1. CPU utilization / load average
  2. Disk - metadata usage (inode) / HDD disk usage (df) / HDD latency (sar/iostat) / smartctl status / SSD latency
  3. Memory committed (/proc/meminfo)
  4. Network - CVM to CVM connectivity (external vSwitch), CVM to host connectivity (nutanixVswitch), gateway config, subnet config (verifies the Nutanix HA network config)
  5. Time drift - between CVM and host
Host/Node:
  1. CPU utilization
  2. Memory swap rate
  3. Network - 10 GbE connectivity (vSwitch and vmknic) / NIC error rate / receive and transmit packet loss (ethtool)
VM: (Nutanix provides VM-centric stats - http://nutanix.blogspot.com/2013/08/how-much-of-detail-can-you-get-about-vm.html)
  1. CPU utilization
  2. I/O latency (vdisk)
  3. Memory (swap rate/usage)
  4. Network Rx/Tx packet loss
CLI options:


nutanix@NTNX-13SM35300008-B-CVM:10.1.60.110:~$ ncli health-check ls |grep Name
    Name                      : I/O Latency
    Name                      : CPU Utilization
    Name                      : Disk Metadata Usage
    Name                      : Transmit Packet Loss
    Name                      : CVM to CVM Connectivity
    Name                      : Receive Packet Loss
    Name                      : CPU Utilization
    Name                      : CPU Utilization
    Name                      : Transmit Packet Loss
    Name                      : Memory Usage
    Name                      : Receive Packet Loss
    Name                      : HDD I/O Latency
    Name                      : Gateway Configuration
    Name                      : HDD S.M.A.R.T Health Status
    Name                      : Load Level
    Name                      : Memory Usage
    Name                      : CVM to Host Connectivity
    Name                      : 10 GbE Compliance
    Name                      : Time Drift
    Name                      : HDD Disk Usage
    Name                      : SSD I/O Latency
    Name                      : Memory Pressure
    Name                      : Memory Swap Rate
    Name                      : Subnet Configuration
    Name                      : Memory Swap Rate
    Name                      : Node Nic Error Rate High
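
Several names in the listing above appear more than once (CPU Utilization, for example), presumably because the same check exists for CVMs, hosts, and VMs. Here is a minimal Python sketch, not a Nutanix tool, that tallies the names from saved output of the command. The helper names check_names and count_checks are made up for illustration, and SAMPLE is just the first eight lines of the listing above.

```python
from collections import Counter
from typing import List

# First eight lines of the `ncli health-check ls | grep Name` output shown above.
SAMPLE = """\
    Name                      : I/O Latency
    Name                      : CPU Utilization
    Name                      : Disk Metadata Usage
    Name                      : Transmit Packet Loss
    Name                      : CVM to CVM Connectivity
    Name                      : Receive Packet Loss
    Name                      : CPU Utilization
    Name                      : CPU Utilization
"""


def check_names(ncli_output: str) -> List[str]:
    """Return the value of every 'Name : <value>' line, in order."""
    names = []
    for line in ncli_output.splitlines():
        # Split on the first colon only; check names themselves contain no colon here.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name":
            names.append(value.strip())
    return names


def count_checks(ncli_output: str) -> Counter:
    """Count how many times each check name appears."""
    return Counter(check_names(ncli_output))


if __name__ == "__main__":
    for name, count in sorted(count_checks(SAMPLE).items()):
        print(f"{count} x {name}")
```

Save the real output to a file and pass its contents to count_checks to tally the full list.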

Configuration of Health Checks:


  1. Turn Check off
  2. Parameters for Critical/Warn Threshold (if applicable)
  3. Change the schedule.
  4. The edit option is available from the CLI as well: ncli health-check edit (interval/enable/parameter-thresholds)
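
To make the configuration options above concrete, here is a small Python sketch of how an on/off switch, a schedule interval, and a warn/critical threshold pair could fit together. This is only an illustration and not how cluster health is implemented: HealthCheckConfig, evaluate, interval_secs, and the threshold numbers are all hypothetical, and the sketch assumes a higher measured value is worse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthCheckConfig:
    """Hypothetical per-check settings mirroring the options listed above."""
    name: str
    enabled: bool = True                         # option 1: turn the check off
    warn_threshold: Optional[float] = None       # option 2: only if applicable
    critical_threshold: Optional[float] = None   # option 2: only if applicable
    interval_secs: int = 60                      # option 3: schedule (made-up default)


def evaluate(config: HealthCheckConfig, value: float) -> str:
    """Classify a measured value, assuming higher values are worse."""
    if not config.enabled:
        return "DISABLED"
    if config.critical_threshold is not None and value >= config.critical_threshold:
        return "CRITICAL"
    if config.warn_threshold is not None and value >= config.warn_threshold:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Threshold values are invented for the example.
    disk_usage = HealthCheckConfig("HDD Disk Usage", warn_threshold=75.0, critical_threshold=90.0)
    for percent in (50.0, 80.0, 95.0):
        print(percent, evaluate(disk_usage, percent))
```

Checking the critical threshold before the warn threshold means a value past both is reported at the more severe level.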
Cause/Impact and Resolution:

Each health check provides:
  1. Cause of the failure. Example: disk running out of space.
  2. Resolution: how to fix the issue. Example: add storage capacity or delete data.
  3. Impact: what will be the impact on the user and the cluster?
  4. History of the check and the list of entities checked (different disks / list of CVMs).
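
One way to picture what each check reports is a small record type. The Python sketch below is purely illustrative: CheckResult and summarize are hypothetical names, not part of NCC or ncli, the impact string is a placeholder because the post gives no impact example, and the entity names are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CheckResult:
    """Hypothetical container for the details a health check provides."""
    check_name: str
    cause: str
    resolution: str
    impact: str
    entities: List[str] = field(default_factory=list)  # e.g. disks or CVMs checked


def summarize(result: CheckResult) -> str:
    """Render the result as a short plain-text report."""
    lines = [
        f"Check: {result.check_name}",
        f"Cause: {result.cause}",
        f"Resolution: {result.resolution}",
        f"Impact: {result.impact}",
        "Entities checked: " + (", ".join(result.entities) or "none"),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = CheckResult(
        check_name="HDD Disk Usage",
        cause="disk running out of space",
        resolution="add storage capacity or delete data",
        impact="(placeholder: impact text supplied by the check)",
        entities=["disk-1", "disk-2"],  # made-up entity names
    )
    print(summarize(example))
```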

Running the plugin manually:

NCC is a superset of the health checks and has more plugins that can be run. NCC can be updated independently of NOS versions (note that certain newer NCC plugins may be applicable only to certain NOS versions). Here is a sample of a few NCC options.
ncc health check

List of network checks that can be run (intrusive checks will affect the performance of the cluster):
ncc network_checks
Sample of running a network_check:

Future Developments:
  1. Unification of Cluster health, Alerts and Events. 
  2. Impact, Cause and Resolution updated via NCC updates independent of NOS upgrade.
  3. Provide finer Root Cause analysis and more insights into cluster health.

Tuesday, June 24, 2014

Hyper-Converge DELL server into server-SAN with Nutanix




 
Nutanix has always included software capable of running on generic off-the-shelf hardware (x86 hardware, as in the figure above). Even though the software is abstracted to run on any hardware, Nutanix recommends choosing from a shorter list of hardware possibilities rather than providing a compatibility matrix. Choosing from this shorter list lets the customer purchase a solution that has been thoroughly tested by Nutanix QA and the hardware experts on the Nutanix Engineering team (quality over quantity). It can be compared to the synergy between a horse (HW) and a jockey (SW): the horse runs faster under the guidance of the jockey.



Problems with providing a software-only solution with a compatibility matrix:

   
It is impossible for any software vendor to test all hardware, even if individual components are certified, so it is very difficult to draft a reliable compatibility matrix. There are even knowledge base articles explaining how to use a compatibility matrix. Furthermore, no software vendor should put the burden of validation on the customer. Many vendors use this matrix to avoid responsibility for failures in their products.
 

Just because particular software is hardware agnostic does not guarantee that the hardware as a whole is capable of running the software at an optimal level. Something as small as a low queue depth in a controller can cause performance issues, and it can take days to pinpoint the problem even with all the KBs and blog posts.

Nutanix Appliance Approach:




Nutanix has taken a different approach by owning the hardware qualification process, even though Nutanix's IP is entirely software related. Nutanix ships thoroughly tested, qualified, and validated off-the-shelf hardware appliances so customers can run their production workloads with ease and have peace of mind about their data. Due to the ease of setup and configuration, most Nutanix customers are able to put their new hardware into production within a day.
 

In taking this approach, Nutanix has built a world-class hardware engineering team with industry experts, allowing them to take on the additional responsibilities of testing, qualifying, and certifying additional hardware. Nutanix is excited to announce a new partnership with Dell to deliver another exciting hardware option for its customers.  (Refer: http://www.nutanix.com/the-nutanix-solution/tech-specs/#nav). Nutanix will continue to provide the appliances and add additional models as new innovations occur (HDD/SSD/NVRAM-e/CPU/Memory/GPU/etc).



DELL+Nutanix Partnership:
Official Blogs:
Refer to the above blogs for more detailed information on the DELL and Nutanix alliance.

DELL is a valuable IT partner to many big enterprises. Many of these enterprises have tested and purchased Nutanix appliances by getting a one-time exemption from their purchasing department. The DELL and Nutanix OEM agreement removes this exemption requirement.

Nutanix software will convert powerful standalone DELL R-720 servers into hyper-converged compute and software-defined distributed storage servers.

Quote from DELL (eXCellent):
"Dell XC Series of Web-scale Converged Appliances: Dell today also announced plans to deliver the Dell XC Series of Web-scale Converged Appliances powered by Nutanix. With the announcement, Dell extends its software defined storage portfolio with plans to offer customers a new series of appliances, which combine compute, storage and networking into a single offering. These new appliances will benefit customers seeking an integrated IT approach, offering simple deployment, management and scale as needed."

 