Wednesday, June 4, 2014

Nutanix 4.0 Feature - Protection Against Two Simultaneous Points of Failure

Nutanix has provided a replication factor (resiliency factor) of two from day one. In response to customers who wanted a higher level of resiliency, Nutanix supports RF=3 starting with release 4.0.

Pre-4.0 feature: Resiliency Factor of two (RF=2) - Network Mirroring
  • Provides protection against a single point of failure.
  • Note that RF=2 also protects against certain related double failures, such as two disk failures on the same node or, with the block-awareness feature in 4.0, two node failures within the same block when the cluster spans multiple blocks.
  • RF=2 will also protect against two non-simultaneous failures, provided the second failure happens after replication has been restored from the first (thanks to Curator and automatic metadata node detach) and at least two configuration (Zookeeper) nodes remain active. In 4.0, Zookeeper can be migrated manually without disruption: if the first failed node hosts Zookeeper, you can migrate Zookeeper to another active node, so a second failure can be tolerated even if it hits a Zookeeper node. In 4.1, Nutanix Engineering is implementing automatic Zookeeper migration.
  • (Ref: Figure 1) The first copy of the data, in 4 MB chunks (extent groups), is stored on node 1, where the VM is running (data locality).
  • The second copy of the data is distributed across the other nodes, and across different disks on those nodes.
  • So if one disk or node fails, Curator repairs the under-replication quickly, because all remaining nodes and disks participate in the rebuild. The VM can be migrated to another node, and Curator will eventually bring the data local to that node and rebuild the replication.
  • When placing the second copy, the software is intelligent enough to keep placement block-aware and to keep disk and node utilization balanced.
  • So data remains available through a disk failure, multiple disk failures within the same node, or a single node failure.
  • In 4.0, the Nutanix UI provides a cluster health view that reports replication status and a progress monitor for replication.
  • RF=2 is available even on a three-node cluster.
  • Space utilization consideration: to be able to rebuild RF=2 after a node failure on a 5-node cluster, use no more than 80% of the available datastore space.
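The 80% figure above follows from simple arithmetic: after losing one of N nodes, the N-1 survivors must be able to hold all of the data to restore replication. A minimal sketch of that rule of thumb, assuming it generalizes as (N-1)/N (my own generalization from the 5-node example, not an official sizing formula):

```shell
# Rough capacity guideline for surviving a single-node rebuild:
# the N-1 surviving nodes must absorb the failed node's data,
# so usable space is roughly (N-1)/N of the total datastore.
nodes=5
max_pct=$(( (nodes - 1) * 100 / nodes ))
echo "With ${nodes} nodes, keep datastore usage under ${max_pct}%."
```

For a 5-node cluster this yields the 80% guideline; larger clusters can safely run at a higher utilization before a rebuild becomes impossible.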
Figure 1: RF=2 data placement (image not included).

The following output shows the container (datastore) configuration:

nutanix@NTNX-13SM35300010-B-CVM:10.1.60.98:~$ ncli ctr ls

    ID                        : 1504
    Replication Factor        : 2
    Oplog Replication Factor  : 2


To verify that the system can tolerate one node failure:
nutanix@NTNX-13SM35300010-B-CVM:10.1.60.98:~$ ncli cluster get-domain-fault-tolerance-status type=node
(A Current Fault Tolerance of 1 means the cluster can tolerate a single point of failure.)
    Component Type            : EXTENT_GROUPS
    Current Fault Tolerance   : 1
   
    Component Type            : OPLOG
    Current Fault Tolerance   : 1
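If you want to check this from a script, one approach is to parse the command's output and take the minimum fault-tolerance value across all components, since the cluster can only tolerate as many failures as its weakest component. A sketch, assuming the output format shown above (the awk parsing is illustrative, not an official tool):

```shell
# Report the minimum "Current Fault Tolerance" across all components.
ncli cluster get-domain-fault-tolerance-status type=node \
  | awk -F: '/Current Fault Tolerance/ {
        gsub(/ /, "", $2)                       # strip padding around the value
        if (min == "" || $2+0 < min) min = $2+0 # track the smallest value seen
      } END { print min }'
```

For the RF=2 output above this prints 1; anything less than 1 means the cluster can no longer tolerate a node failure.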


Figure 2: The UI provides similar output (screenshot not included).

4.0 Feature: Resiliency Factor of three (RF=3):
  • A minimum of 5 nodes is required.
  • RF=3 provides resiliency against two unrelated simultaneous failures, such as two disk failures on different nodes or two node failures in different blocks.
  • The resiliency factor must be set during the cluster init process, which configures five Zookeeper nodes and five copies of metadata. (Nutanix Engineering is working on allowing this to be changed dynamically.)
  • Within an RF=3 cluster, containers can be RF=2 or RF=3, and VMs that need higher resiliency can be created in an RF=3 container/datastore. The RF of a container can be changed dynamically.
  • Nutanix keeps three copies of each extent group and provides block awareness where possible.
  • RF=3 and block awareness can be configured in the same cluster (requires 5 blocks in 4.0 and 3 blocks in 4.0.1).
  • With RF=3 and dedup, only three copies of each deduplicated extent group are stored in the cluster, so in some cases RF=2 without dedup may consume more space than RF=3 with dedup.
  • RF=3 has only a 10-15% performance overhead.
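In raw-capacity terms, each logical byte consumes RF physical bytes, so moving a container from RF=2 to RF=3 reduces effective capacity by a third before dedup or compression is considered. A back-of-the-envelope sketch (the 40 TB raw figure is purely illustrative):

```shell
# Effective (logical) capacity for a given raw capacity and replication factor:
# each logical byte is stored RF times, so logical capacity = raw / RF.
raw_tb=40    # illustrative raw cluster capacity in TB
for rf in 2 3; do
  echo "RF=${rf}: ~$(( raw_tb / rf )) TB logical from ${raw_tb} TB raw"
done
```

This is why dedup matters more at RF=3: any logical byte that dedup removes saves three physical copies instead of two.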


  • Steps (the cluster init web page may refer to this feature as an FT=2 cluster):
  • CLI commands:
        cluster -f -s <svmips> --redundancy_factor=3 create
        ncli sp create name=sp add-all-free-disks=true
        ncli ctr create name=red rf=3 sp-name=sp
  • Verify: ncli cluster get-domain-fault-tolerance-status type=node

        Domain Type               : NODE
        Component Type            : STATIC_CONFIGURATION
        Current Fault Tolerance   : 2
        Fault Tolerance Details   :
        Last Update Time          : Wed Jun 04 11:52:11 PDT 2014

        Domain Type               : NODE
        Component Type            : ZOOKEEPER
        Current Fault Tolerance   : 2
        Fault Tolerance Details   :
        Last Update Time          : Wed Jun 04 11:49:55 PDT 2014

        Domain Type               : NODE
        Component Type            : METADATA
        Current Fault Tolerance   : 2
        Fault Tolerance Details   :
        Last Update Time          : Wed Jun 04 11:46:55 PDT 2014

        Domain Type               : NODE
        Component Type            : OPLOG
        Current Fault Tolerance   : 2
        Fault Tolerance Details   :
        Last Update Time          : Wed Jun 04 11:51:18 PDT 2014

        Domain Type               : NODE
        Component Type            : EXTENT_GROUPS
        Current Fault Tolerance   : 2
        Fault Tolerance Details   :
        Last Update Time          : Wed Jun 04 11:51:18 PDT 2014