Resiliency is the ability of a server, network, or storage system to recover quickly and
continue operating after an equipment failure or other disruption.
Invisible Infrastructure should be able to fix a failure without any
disruption to the end user or the application.
In Nutanix, every component is designed to be resilient to failures. As we all
know, hardware will eventually fail and some part of the software will have
bugs, so it is the job of the software to provide resiliency when a failure
occurs.
The Nutanix XCP platform is inherently resilient to failures thanks to the
intelligence built into its distributed software. A Nutanix XCP administrator
can configure additional resiliency at the container/datastore level if
resiliency needs to cover multiple simultaneous failures. However, the
self-healing capability of the Nutanix XCP platform reduces the need to
configure a higher level of resiliency than the default.
In this blog, let me introduce the different types of resiliency available in
our system.
1. Hardware Resiliency:
   a. Disk Resiliency
   b. Node Resiliency
   c. Block/Rack Resiliency
   d. Cluster Resiliency/Data Center Resiliency
2. Software Resiliency:
   a. Quarantining the non-optimal node
   b. Auto-migration of software services to a different node (Cassandra forwarding/Zookeeper migration)
   c. Fail Fast Concept
   d. No strict Leadership
   e. Shared Nothing Architecture
   f. Fault tolerance - FT-1 and FT-2
a. Disk Resiliency:
[Figure: Nutanix Architecture]
If any disk fails, the data on that disk becomes under-replicated, but every
block of that data can still be accessed from the other nodes in the cluster.
The Nutanix self-healing process, which runs on all nodes in the cluster,
re-replicates the data in the background.
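To make the idea concrete, here is a minimal toy sketch of background re-replication after a disk failure. It is not Nutanix code: the names `placement`, `disk_nodes`, `under_replicated` and `heal` are hypothetical, and the replication factor of 2 and the "least-loaded disk on a different node" placement rule are assumptions made for illustration only.

```python
from collections import defaultdict

REPLICATION_FACTOR = 2  # assumption for this toy model, not a documented default


def under_replicated(placement, failed_disk):
    """Return the ids of blocks that lose a replica when failed_disk fails."""
    return sorted(b for b, disks in placement.items() if failed_disk in disks)


def heal(placement, disk_nodes, failed_disk):
    """Drop the failed disk and restore the replica count of affected blocks.

    placement:  block id -> set of disk ids holding a replica (hypothetical)
    disk_nodes: disk id  -> node id that owns the disk (hypothetical)
    """
    healthy = [d for d in disk_nodes if d != failed_disk]

    # Count how many replicas each disk currently holds.
    load = defaultdict(int)
    for disks in placement.values():
        for d in disks:
            load[d] += 1

    for block in under_replicated(placement, failed_disk):
        placement[block].discard(failed_disk)
        if not placement[block]:
            continue  # no surviving replica to copy from in this toy model
        while len(placement[block]) < REPLICATION_FACTOR:
            used_nodes = {disk_nodes[d] for d in placement[block]}
            # Assumed rule: keep replicas of one block on different nodes.
            candidates = [d for d in healthy if disk_nodes[d] not in used_nodes]
            if not candidates:
                break  # not enough nodes left to restore full redundancy
            target = min(candidates, key=lambda d: (load[d], d))
            placement[block].add(target)
            load[target] += 1
    return placement
```

For example, with disks `d1` and `d2` on node `n1`, `d3` on `n2` and `d4` on `n3`, failing `d1` leaves any block stored on `d1` with a single replica until `heal` copies it to a healthy disk on another node. The block stays readable from its surviving replica the whole time, which is the property the paragraph above describes.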
Nutanix Controller VM (CVM) data resides on an SSD and is mirrored to the
second SSD on the same node. In the event of an SSD failure, the Nutanix CVM
continues to run without any disruption.
The Nutanix cluster proactively monitors the disks through the S.M.A.R.T.
utility. If the value of any attribute indicates a potential failure, the
cluster copies the data from that disk to other nodes/disks in the cluster and
marks the disk offline.
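The proactive check can be sketched roughly as follows. This is an illustrative model only: how the cluster actually reads and evaluates S.M.A.R.T. data is not described here, so `SmartAttribute`, `failing_attributes`, `check_disk` and the `evacuate`/`mark_offline` callbacks are hypothetical names. The sketch assumes the common S.M.A.R.T. convention that an attribute signals trouble when its normalized value drops to or below its threshold, and that a threshold of 0 marks an informational attribute.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized current value reported by the drive
    threshold: int  # failure threshold; 0 is treated as informational


def failing_attributes(attrs):
    """Return the names of attributes at or below their failure threshold."""
    return [a.name for a in attrs if a.threshold > 0 and a.value <= a.threshold]


def check_disk(disk_id, attrs, evacuate, mark_offline):
    """Evacuate the disk and take it offline if any attribute looks bad.

    evacuate and mark_offline are caller-supplied callbacks (hypothetical)
    standing in for "copy the data elsewhere" and "mark the disk offline".
    """
    bad = failing_attributes(attrs)
    if bad:
        evacuate(disk_id)      # copy data off first, while the disk still works
        mark_offline(disk_id)  # then stop using the disk
    return bad
```

The ordering is the point of the sketch: data is copied off the suspect disk before it is taken offline, so redundancy is restored while the disk is still readable rather than after it has failed.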
After the disk is physically replaced, the Nutanix UI guides the customer through re-adding the disk to the cluster.