Tuesday, February 16, 2016

Nutanix Resiliency: Invisible Failure (Part I: Introduction and Disk Resiliency Details)


Resiliency is the ability of a server, network, or storage system to recover quickly and continue operating even when there has been an equipment failure or other disruption.

Invisible infrastructure should be able to recover from a failure without any disruption to the end user or the application.

In Nutanix, every component is designed to be resilient to failures. As we all know, hardware will eventually fail and certain parts of software will have bugs, so it is the job of the software to provide resiliency when a failure occurs.

The Nutanix XCP platform is inherently resilient to failures due to the intelligence built into the distributed software. A Nutanix XCP administrator can configure additional resiliency at the container/datastore level if resiliency needs to cover multiple simultaneous failures. However, the self-healing capability of the Nutanix XCP platform reduces the need to configure a higher level of resiliency than the default.

In this blog, let me introduce the different types of resiliency available in our system.

1. Hardware Resiliency:

a. Disk Resiliency
b. Node Resiliency
c. Block/Rack Resiliency
d. Cluster Resiliency/Data Center Resiliency
2. Software Resiliency
a. Quarantining the non-optimal node
b. Auto-migration of software services to a different node (Cassandra forwarding/Zookeeper migration)
c. Fail-fast concept
d. No strict leadership
e. Shared-nothing architecture
f. Fault tolerance - FT-1 and FT-2

a. Disk Resiliency:
Nutanix Architecture
If the Nutanix cluster is configured to tolerate one failure at a single point in time (FT=1*), then when a block of data is written, the second copy of the data is stored on a disk in another node.

If any disk fails, the data on that disk becomes under-replicated, but all blocks of data from the failed disk can still be accessed from the other nodes in the cluster. The Nutanix self-healing process running on all nodes in the cluster re-replicates the data in the background.
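The write-placement and self-healing behavior described above can be sketched as a toy model (illustrative Python only, not actual Nutanix code; all names here are hypothetical):

```python
import random

# Toy model of FT=1 (two-copy) placement: each extent keeps two replicas,
# always on disks belonging to two different nodes.
class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes          # {node_name: [disk_name, ...]}
        self.extents = {}           # extent_id -> [(node, disk), (node, disk)]

    def write(self, extent_id):
        # Pick two distinct nodes, one disk on each.
        node_a, node_b = random.sample(list(self.nodes), 2)
        self.extents[extent_id] = [
            (node_a, random.choice(self.nodes[node_a])),
            (node_b, random.choice(self.nodes[node_b])),
        ]

    def fail_disk(self, failed_node, failed_disk):
        # Take the failed disk out of service.
        self.nodes[failed_node].remove(failed_disk)
        # Self-healing: re-replicate each under-replicated extent onto a
        # healthy disk of a node that does not already hold a copy.
        for extent_id, replicas in self.extents.items():
            survivors = [r for r in replicas if r != (failed_node, failed_disk)]
            if len(survivors) < 2:
                candidates = [n for n, disks in self.nodes.items()
                              if disks and n not in {n_ for n_, _ in survivors}]
                new_node = random.choice(candidates)
                survivors.append((new_node, random.choice(self.nodes[new_node])))
            self.extents[extent_id] = survivors

cluster = Cluster({"A": ["d1", "d2"], "B": ["d1"], "C": ["d1"]})
for i in range(100):
    cluster.write(i)
cluster.fail_disk("A", "d1")
# After healing, every extent again has two replicas on distinct nodes.
assert all(len({n for n, _ in r}) == 2 for r in cluster.extents.values())
```

In the real system the healing work is distributed across all nodes and runs in the background; the sketch only shows the placement invariant that is being restored.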

Data Resiliency Status

Nutanix Controller VM data is on the SSD and is mirrored to a second SSD on the same node. In the event of an SSD failure, the Nutanix CVM will continue to run without any disruption.

The Nutanix cluster monitors disks proactively through the S.M.A.R.T. utility; if the value of any attribute indicates a potential failure, the cluster copies the data from the disk to other nodes/disks in the cluster and marks the disk offline.
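Conceptually, that proactive monitoring works like this (a minimal sketch with made-up attribute names and thresholds, not the actual Hades logic):

```python
# Hypothetical S.M.A.R.T.-style thresholds: attribute -> max healthy value.
SMART_THRESHOLDS = {
    "reallocated_sectors": 10,
    "pending_sectors": 1,
    "uncorrectable_errors": 1,
}

def disk_is_failing(smart_values):
    """Return True if any monitored attribute has crossed its threshold."""
    return any(smart_values.get(attr, 0) >= limit
               for attr, limit in SMART_THRESHOLDS.items())

def monitor_disk(disk_id, smart_values, evacuate, mark_offline):
    # Evacuate the data first, then take the disk out of service.
    if disk_is_failing(smart_values):
        evacuate(disk_id)
        mark_offline(disk_id)
        return True
    return False

# Example: a disk with growing reallocated sectors is proactively retired.
actions = []
monitor_disk("sda",
             {"reallocated_sectors": 42},
             evacuate=lambda d: actions.append(("evacuate", d)),
             mark_offline=lambda d: actions.append(("offline", d)))
assert actions == [("evacuate", "sda"), ("offline", "sda")]
```

The key design point is the ordering: data is copied off the suspect disk while it is still readable, so the eventual offline event is a non-event for the cluster.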

After the disk is physically replaced, the Nutanix UI guides the customer through re-adding it to the cluster.
UI: Adding the replaced disk to the cluster

*FT=1 or FT=2 is configurable. More details at http://nutanix.blogspot.com/2014/06/nutanix-40-feature-increased-resiliency.html

Disk offline logic (Hades process):

On a lighter note, who has the better self-healing powers: Wolverine or Deadpool?

Friday, June 12, 2015

Foundation: Then, Now and Beyond

In creating a world-class product, the challenge is always to develop something that is flexible, useful, and simple. Too often, simplicity is lost in pursuit of greater flexibility, or flexibility is lost when a product becomes overly simple.

During the development process at Nutanix, a few questions are always asked: “How do we build a solution that is simple yet flexible at the same time? How do we make the datacenter invisible? How do we help our customers spend more time with their loved ones instead of in the datacenter?”
The Nutanix datacenter solution is easy to operate, install, configure, and troubleshoot thanks to features such as Easy and Non-Disruptive upgrades, Foundation, and Cluster Health.
Highlighted below is the evolution of one of our most exciting features, one that all of our customers will see and use: the Nutanix Foundation installer tool.

Before Foundation, the Nutanix factory had a repository of scripts that allowed it to install, test, and ship blocks to customers. The scripts had limited options for network configurations, hypervisors, and Nutanix OS versions.

Our customers told us they needed increased flexibility to install the hypervisor of their choice, hypervisor version of their preference, and the Nutanix OS with the networking configuration that best suits their datacenter at the time of installation, not at the time of ordering.

While we were developing this tool, one of the hypervisor vendors decided to throw us a curve ball and forbade us from shipping their hypervisor from the Nutanix factory, even if the customer wanted it. They considered us a competitor, even though we were enabling their bare hypervisor to move up a few layers and provide storage as well.

This made us even more determined to give our field, partners, and customers the flexibility to change the hypervisor at the time of installation. This tool will eventually help our customers non-disruptively change hypervisors at any time on a running cluster.

To meet these customer needs, the Nutanix Foundation installer tool was created. It was developed with the aim of providing an uncompromisingly simple factory installer that is both flexible and can be used reliably in the customer datacenter.

Foundation 2.1 allows the customer/partner to configure the network parameters, install any hypervisor and NOS version of their choice, create the cluster, and run their production workload within a few hours of receiving the Nutanix block.
So far, the feedback from our customers and partners has been fantastic.


In the short time that Foundation has been available, the tool has quickly evolved to meet new customer needs.



The inception of Foundation 3.0 started from the desire to make adding a node as simple as pairing an Apple Watch with an iPhone.

We strive to keep Foundation uncompromisingly simple while still extending its features, avoiding “featuritis” by constantly reviewing it with our UX team and staying in touch with our customers.

Very soon, we will be shipping Foundation within NOS 4.5; it will facilitate these lifecycle management features:
  • Faster imaging of the nodes by our customers
  • Seamless expansion of clusters (AOS 4.5)
  • Efficient replacement of boot drives (AOS 4.5.2)
  • Network-aware cluster installs and cluster expansion
  • BMC/BIOS upgrades (AOS 4.5.2)
  • Resetting the cluster back to factory defaults (AOS 4.5)
  • Easy onboarding of any node to the cluster

I am excited to see how the new Foundation framework will enable future lifecycle management innovations.

We will release more details on each of the above features soon.

Note: The first version of this blog appeared in the Nutanix Community Blog.

Vision: Multi-Hypervisor Cluster

Tuesday, January 27, 2015

NCC - The Swiss Army Knife of Nutanix Troubleshooting Tools.

The Swiss Army knife is a pocket-size multi-tool that equips you for everyday challenges. NCC equips you with multiple Nutanix troubleshooting tools in one package.

NCC provides multiple utilities (plugins) that allow the Nutanix infrastructure administrator to
  • check the health of the hypervisor, Nutanix cluster components, network, and hardware
  • identify misconfigurations that can cause performance issues
  • collect logs for a specific time period and set of components
  • if needed, execute NCC automatically and email the results at a configurable interval
NCC is developed by Nutanix Engineering based on input from support engineers, customers, on-call engineers, and solution architects. NCC helps Nutanix customers identify a problem and either fix it or report it to Nutanix Support. NCC enables faster problem resolution by reducing the time taken to triage an issue.

When should we run NCC?
  • After a new install
  • Before and after any cluster activity - add node, remove node, reconfiguration, and upgrades
  • Anytime you are troubleshooting an issue

As mentioned in the cluster health blog, NCC is the collector agent for cluster health.

Roadmap for NCC 1.4 (Feb 2015) and beyond:
  • Infrastructure for alerts (ability to configure email and the frequency of the checks) - (Update: NOS 4.1.3/NCC 2.0 - Jun 2015)
  • Ability to dynamically add new checks to the cluster health or alert framework after an NCC upgrade
  • Run NCC from the Prism UI and upload the log collector output to the Nutanix FTP site
  • Link to the KB that provides details for specific failures (fix-it-yourself) - (Update: NOS 4.1.x/NCC 1.4.1)
  • Re-run only the failed tests
  • Link pre-upgrade checks to the latest NCC checks

a. Download and upgrade NCC to 1.3.1 
http://portal.nutanix.com -> Downloads -> Tools and Firmware

NCC 1.3.1 (the latest version as of Jan 27, 2015) will be the default version bundled with NOS 4.1.1.
1. Upgrading NCC from CLI:
  • Jeromes-Macbook:~ jerome$ scp Downloads/nutanix-ncc-1.3.1-latest-installer.sh nutanix@
  • Log in to the CVM and run "chmod u+x nutanix-ncc-1.3.1-latest-installer.sh; ./nutanix-ncc-1.3.1-latest-installer.sh"
2. NCC Upgrade UI screenshot: (NCC 1.4 can be upgraded via the UI from NOS 4.1.1)

b. Executing NCC health checks:

1. ncc health_checks - shows the list of health checks

2. Execute "ncc health_checks run_all" and monitor for messages other than PASS.

3. List of NCC Status

4. Results of a NCC check run on a lab cluster

5. Displaying and analyzing the failed tests.

The FAILUREs are due to sub-optimal CVM memory and network errors. To fix the issues:
- increase CVM memory to 16 GB or more (KB 1513 - https://portal.nutanix.com/#/page/kbs/details?targetId=kA0600000008djKCAQ)
- check the network (rx_missed_errors - check for network port flaps and network driver issues; KB 1679 and KB 1381)
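Since the advice is to watch for any message other than PASS, a small helper along these lines can summarize a saved NCC run (illustrative only; the exact output format and status keywords are assumptions):

```python
from collections import Counter

# Hypothetical parser for NCC-style result lines: count statuses and
# surface anything that is not PASS so it can be investigated.
STATUSES = ("PASS", "FAIL", "WARN", "INFO", "ERR")

def summarize_ncc_output(text):
    counts = Counter()
    attention = []
    for line in text.splitlines():
        for status in STATUSES:
            if status in line.split():
                counts[status] += 1
                if status != "PASS":
                    attention.append(line.strip())
                break
    return counts, attention

sample = """\
Running /health_checks/system_checks/cvm_memory_check [ FAIL ]
Running /health_checks/network_checks/rx_error_check  [ PASS ]
"""
counts, attention = summarize_ncc_output(sample)
assert counts == Counter({"FAIL": 1, "PASS": 1})
assert "cvm_memory_check" in attention[0]
```

In practice you would feed it captured output, e.g. the contents of a file produced by redirecting an `ncc health_checks run_all` run.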

c. Log Collector feature of NCC: (similar to Cisco's show tech-support or VMware's vm-support)

NCC Log collector collects the logs from all the CVMs in parallel.

1. Execute "ncc log_collector" to see the list of logs that will be collected.

2. To collect all the logs for the last 4 hours: ncc log_collector run_all
For example, stargate.INFO will show the time period for which logs were collected:
****Log Collector Start time = 2015/01/27-09:19:15 End time = 2015/01/27-13:19:15 ****

3. To anonymize the logs (IP address/cluster name/username):
"ncc log_collector --anonymize_output=true run_all"
The directory listing within the tar bundle:
nutanix@NTNX-13SM15010001-C-CVM::~/data/log_collector/NCC-logs-2015-01-27-640-1422393524$ ls
cluster_config  xx.yy.64.165-logs  xx.yy.65.100-logs  xx.yy.65.98-logs
command.txt     xx.yy.64.166-logs  xx.yy.65.101-logs  xx.yy.65.99-logs
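The anonymization masks the leading octets of each IP address, as the xx.yy names in the listing above show. Conceptually, it resembles this sketch (an illustrative regex, not the actual NCC implementation):

```python
import re

# Replace the first two octets of every IPv4 address with placeholders,
# mirroring the xx.yy.64.165-logs style seen in the collected bundle.
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")

def anonymize(line):
    return IPV4.sub(lambda m: "xx.yy.{}.{}".format(m.group(3), m.group(4)), line)

assert anonymize("CVM 10.4.64.165 unreachable") == "CVM xx.yy.64.165 unreachable"
```

Real anonymization also scrubs cluster names and usernames; only the IP-masking part is sketched here.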

4. To collect logs only from the stargate component: "ncc log_collector cvm_logs --component_list=stargate"

Additional Options:

5. To collect logs only from one CVM "ncc log_collector --cvm_list= alerts"

6. More options and filters:

d. Auto-run of NCC health checks and emailing of the results

Verify if Email alert is enabled:
nutanix@NTNX-13SM15010001-C-CVM:~$ ncli alert get-alert-config
    Alert Email Status        : Enabled
    Send Email Digest         : true
    Enable Default Nutanix... : true
    Default Nutanix Email     : al@nutanix.com
    Email Contacts            : j@nutanix.com
    SMTP Tunnel Status        : success
    Service Center            : n.nutanix.net
    Tunnel Connected Since    : Wed Jan 21 13:58:03 PST 2015
Enable auto-run of NCC
ncc --set_email_frequency=24
Verify the config:

ncc --show_email_config     
[ info ] NCC is set to send email every 24 hrs.
[ info ] NCC email has not been sent after last configuration.
[ info ] NCC email configuration was last set at 2015-01-27 13:47:32.956196.

Sample Email: