Tuesday, February 16, 2016

Nutanix Resiliency: Invisible Failure (Part I: Introduction and Disk Resiliency Details)


Resiliency is the ability of a server, network, or storage system to recover quickly and continue operating even when there has been an equipment failure or other disruption.

Invisible Infrastructure should be able to fix a failure without any disruption to the end user or the application.

In Nutanix, every component is designed to be resilient to failures. As we all know, hardware will eventually fail and some part of the software will have bugs, so it is the job of the software to provide resiliency when a failure occurs.

The Nutanix XCP platform is inherently resilient to failures because of the intelligence built into its distributed software. A Nutanix XCP administrator can configure additional resiliency at the container/datastore level if the resiliency needs to cover multiple points of failure. However, the self-healing capability of the Nutanix XCP platform reduces the need to configure a higher level of resiliency than the default.

In this blog post, let me introduce the different types of resiliency available in our system.

1. Hardware Resiliency:
a. Disk Resiliency
b. Node Resiliency
c. Block/Rack Resiliency
d. Cluster Resiliency/Data Center Resiliency

2. Software Resiliency:
a. Quarantining the non-optimal node
b. Auto-migration of software services to a different node (Cassandra Forwarding/Zookeeper migration)
c. Fail Fast Concept
d. No strict Leadership
e. Shared Nothing Architecture
f. Fault tolerance - FT-1 and FT-2

a. Disk Resiliency:
Figure: Nutanix Architecture
If the Nutanix cluster is configured to tolerate one failure at a time (FT=1*), then whenever a block of data is written, a second copy of the data is stored on a disk in another node.
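To make the placement rule concrete, here is a minimal Python sketch. It is not the actual Nutanix placement logic: the `Disk` class, the `place_copies` function, and the "least-used disk" tie-breaker are all assumptions made for illustration, as is keeping the first copy on the writer's own node. The only property taken from the text is that with FT=1 the two copies end up on different nodes.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical disk record used only for this illustration."""
    disk_id: str
    node_id: str
    used_blocks: int = 0


def place_copies(local_node_id, disks, copies=2):
    """Pick one disk per copy so that no two copies share a node.

    Assumption: the first copy stays on the writer's node, and every
    further copy goes to the least-used disk on a node that does not
    hold a copy yet. copies=2 corresponds to FT=1.
    """
    def least_used(candidates):
        return min(candidates, key=lambda d: (d.used_blocks, d.disk_id))

    local = [d for d in disks if d.node_id == local_node_id]
    if not local:
        raise ValueError("writer node has no disks")

    chosen = [least_used(local)]
    while len(chosen) < copies:
        used_nodes = {d.node_id for d in chosen}
        remote = [d for d in disks if d.node_id not in used_nodes]
        if not remote:
            raise RuntimeError("not enough nodes to place all copies")
        chosen.append(least_used(remote))

    # Account for the newly written block on every chosen disk.
    for d in chosen:
        d.used_blocks += 1
    return chosen
```

For example, with two disks on node A and one disk each on nodes B and C, a write issued from node A gets one copy on A and one on either B or C, never two copies on A.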

If any disk fails, the data on that disk becomes under-replicated, but every block that was on the failed disk can still be accessed from the other nodes in the cluster. The Nutanix self-healing process, which runs on all the nodes in the cluster, re-replicates the data in the background.
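The self-healing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the real Nutanix process: `heal_after_disk_failure`, the `block_map` dictionary, and the "first eligible disk in sorted order" choice are hypothetical. The sketch finds the blocks that lost a copy and restores the copy count on a disk in a node that does not already hold that block.

```python
def heal_after_disk_failure(block_map, disk_to_node, failed_disk, copies=2):
    """Re-replicate blocks that lost a copy on failed_disk.

    block_map:    block id -> set of disk ids holding a copy (mutated in place)
    disk_to_node: disk id -> node id
    Returns the list of (block id, target disk id) repairs performed.
    """
    healthy = {d: n for d, n in disk_to_node.items() if d != failed_disk}
    repairs = []

    for block in sorted(block_map):
        holders = block_map[block]
        holders.discard(failed_disk)
        if not holders:
            # With FT=1 and a single failure this should not happen.
            raise RuntimeError(f"block {block} has no surviving copy")

        while len(holders) < copies:
            holder_nodes = {healthy[d] for d in holders}
            candidates = sorted(
                d for d, n in healthy.items() if n not in holder_nodes
            )
            if not candidates:
                break  # block stays under-replicated until capacity returns
            target = candidates[0]
            holders.add(target)  # stands in for copying from a surviving replica
            repairs.append((block, target))

    return repairs
```

In a real cluster the repair work would be spread across many nodes and throttled so it stays in the background; the sketch only shows which blocks need a new copy and where that copy may go.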

Figure: Data Resiliency Status

The Nutanix Controller VM (CVM) data lives on an SSD and is mirrored to a second SSD on the same node. In the event of an SSD failure, the Nutanix CVM continues to run without any disruption.

The Nutanix cluster proactively monitors the disks through the S.M.A.R.T. utility. If the value of any attribute indicates a potential failure, the cluster copies the data from that disk to other nodes/disks in the cluster and marks the disk offline.
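As a rough illustration of this proactive check, the Python sketch below evaluates a few S.M.A.R.T. raw counters against limits and returns the actions described above. The attribute names are ones commonly reported for disks, but the limits in `RAW_LIMITS`, the function names, and the action labels are assumptions for this example and do not reflect the thresholds or logic that Nutanix actually uses.

```python
# Hypothetical limits on raw S.M.A.R.T. counters (illustrative values only).
RAW_LIMITS = {
    "Reallocated_Sector_Ct": 10,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
}


def failing_attributes(smart_raw, limits=RAW_LIMITS):
    """Return the names of attributes whose raw value exceeds its limit."""
    return sorted(
        name for name, limit in limits.items()
        if smart_raw.get(name, 0) > limit
    )


def plan_disk_actions(disk_id, smart_raw):
    """Decide what to do with a disk based on its S.M.A.R.T. readings.

    A healthy disk yields no actions. A suspect disk is first drained
    (its data copied elsewhere) and only then marked offline.
    """
    if not failing_attributes(smart_raw):
        return []
    return [("migrate_data", disk_id), ("mark_offline", disk_id)]
```

The ordering matters in this sketch: the data is moved off the suspect disk while it is still readable, and the disk is taken offline afterwards.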

After the disk has been physically replaced, the Nutanix UI guides the customer through adding the new disk to the cluster.
UI: Adding the replaced disk to the cluster

*FT=1 or FT=2 is configurable. More details at http://nutanix.blogspot.com/2014/06/nutanix-40-feature-increased-resiliency.html

Figure: Disk offline logic (hades process)

On a serious note, let us see whether The Wolverine or DP has the better self-healing powers.