Thursday, September 12, 2013

Nutanix DR troubleshooting

This is my scratch pad that I use for DR troubleshooting.
 

Basics: 
Cerebro Master (CM): maintains persistent metadata (PDs/VMs/snapshots), acts on the active PD schedule, replicates snapshots, handshakes with the remote CM, and receives remote snapshots. One per cluster, elected by Zookeeper. It registers/starts/stops/unregisters VMs by communicating with hyperint, and offloads replication of individual files to the Cerebro slaves.

Cerebro Slaves: run on every node and receive requests from the master to replicate an NFS file. A request may contain a reference NFS file that was replicated earlier; the slave computes and ships the diffs from that reference file.


Information about the remote cluster is maintained in Zookeeper: the IP and port of the remote Cerebro (and whether a proxy is needed), the mapping between container (datastore) names, and the transfer rate (within the maximum allowed).

A Cerebro slave asks stargate to replicate the file, so it leverages stargate for the actual data transfer.


It is true DDR (Distributed DR): all Cerebro slaves and stargates work together to replicate the data.

Log files to look for:

- data/logs/cerebro.[INFO,ERROR], data/logs/cerebro.out
- data/logs/stargate.[INFO,ERROR] - grep -i cerebro
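Failure lines in cerebro.WARNING carry key=value fields that are easy to pull apart with sed. A minimal sketch, run against a trimmed copy of the replication failure shown in the Sample Error section later in these notes (the exact field layout is an assumption):

```shell
# Sample cerebro.WARNING line (trimmed from the real error later in these
# notes); the key=value layout is assumed to be stable.
line='E0913 11:00:59.830462  3361 protection_domain.cc:1601] notification=ProtectionDomainReplicationFailure protection_domain_name=Silver-24h remote_name=backup reason=NFS error 2'

# Pull out the notification type and the failing protection domain name.
notification=$(printf '%s\n' "$line" | sed -n 's/.*notification=\([^ ]*\).*/\1/p')
pd=$(printf '%s\n' "$line" | sed -n 's/.*protection_domain_name=\([^ ]*\).*/\1/p')
echo "$notification pd=$pd"
```

On a real cluster the line would come from something like grep ProtectionDomainReplicationFailure data/logs/cerebro.WARNING on the Cerebro master.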

Three components are mostly used for DR: prism_gateway, cerebro (port 2020), and stargate (port 2009). Cerebro interacts with ESXi (through hyperint) to register VMs when rolling back.

Quick topology

 ProtectionDomain(VM:xclone) -- Primary Container:LocalCTR --- <primary SVM IPs> -- ports 2020/2009
 ====== ports 2009/2020 -- <remote DR SVM IPs> --- RemoteCTR_noSSD (.snapshot directory)



   
Troubleshooting steps:

- links http://remote-site-cvm:2009 and links http://remote-site-cvm:2020 (from all local CVMs to all remote CVMs) -- do this check on both sides.

Traces:

http://cvm:2020
http://cvm:2020/h/traces
http://cvm:2009/h/traces



Verify routes, so that replication traffic goes through the non-production path.
Quick steps:
1. Note the container that has the VMs that have to be replicated.
   
2. Create a container on the remote site (no SSD).

3. Create remote sites in both locations:
ncli> rs ls
    Name                      : backup
    Replication Bandwidth     :
    Remote Address(es)        : remote_cvm1_ip, remote_cvm2_ip, etc.
    Container Map             : LocalCTR:Remote_DR_CTR
    Proxy Enabled             :
4. Create the protection domain on the primary site:
pd create name="Platinum-1h"

5. Add the VMs in the local CTR (the one mapped in ncli rs ls) to the protection domain:
pd protect name="Platinum-1h" vm-names="xclone"
6. Create a one-time snapshot to verify replication is working:
pd create-one-time-snapshot name="Platinum-1h" remote-sites="backup" retention-time=36000

(If you see an OOB error repeatedly, pkill cerebro and start cerebro again, as well as restarting prism and bootstrapdaemon - or call Nutanix support.)

6b. pd set-schedule name="Platinum-1h" interval="3600" remote-sites="backup" min-snap-retention-count=10 retention-policy=1:30 - take a snapshot every hour and retain at least 10 snapshots.

7. ncli> pd list-replication-status

  ID                        : 16204522
    Protection Domain         : Platinum-1h
    Replication Operation     : Sending
    Start Time                : 09/12/2013 09:51:33 PDT
    Remote Site               : backup
    Snapshot Id               : 16204518
    Aborted                   : false
    Paused                    : false
    Bytes Completed           : 17.57 GB (18,861,841,805 bytes)
    Complete Percent          : 6.0113173
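Bytes Completed and Complete Percent together give a rough estimate of the total snapshot size and the data still to send. A small awk sketch using the numbers from the sample status above (a simple proportion, so treat it as an estimate only):

```shell
# Estimate total and remaining replication size from Bytes Completed and
# Complete Percent (values taken from the sample status output above).
awk -v bytes=18861841805 -v pct=6.0113173 'BEGIN {
    total = bytes * 100 / pct                 # estimated total bytes
    gib   = 1024 * 1024 * 1024
    printf "estimated total:     %.1f GiB\n", total / gib
    printf "estimated remaining: %.1f GiB\n", (total - bytes) / gib
}'
```

With the numbers above this works out to roughly 292 GiB total for the snapshot.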


8. List previous snapshots:


ncli> pd ls-snaps name=Platinum-1h

    ID                        : 16883844
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 13:52:42 PDT
    Expiry Time               : 09/13/2013 19:52:42 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On

    Replicated To Site(s)     : backup

    ID                        : 17258039
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 15:52:43 PDT
    Expiry Time               : 09/13/2013 21:52:43 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On

        Replicated To Site(s)     : backup

    ID                        : 16967424
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 14:52:42 PDT
    Expiry Time               : 09/13/2013 20:52:42 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On

       
  
 

Additional notes:
1. If the remote site is in a different subnet, try links http://remote_host:2009 and links http://remote_host:2020 (to all remote CVMs). If that does not work, enable the iptables rules:

for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save;sudo /etc/init.d/iptables save"; done
This opens up port 2009 and needs to be done on both sides. Please repeat for port 2020 as well.

Run this command on one local CVM as well as one remote CVM to verify the connectivity:

for cvm in `svmips`; do (echo "From the CVM $cvm:"; ssh -q -t $cvm 'for i in ` source /etc/profile;ncli rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do echo Checking stargate and cerebro port connection to $i ...; nc -v -w 2 $i -z 2009; nc -v -w 2 $i -z 2020; done');done

From the CVM 192.168.3.207:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.208:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.209:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!



If it does not work, enable the iptables rules.

Enable 2020 and 2009 on remote site:
 for i in ` source /etc/profile;ncli  rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do ssh $i sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save; done
 for i in ` source /etc/profile;ncli  rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do ssh $i sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT;sudo service iptables save; done

Enable it locally:
  for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save"; done
 for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT;sudo service iptables save"; done
Quick verification:

Verify the directory listing on the remote container:
sudo mount localhost:/Rep_container /mnt
cd /mnt/.snapshot
You can see the replicated files there.
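The layout under .snapshot is bucket/snapshot-handle/vm-files (you can see the shape in the NFS path of the sample error further down). A small helper, assuming that layout, to list which snapshot handles have landed on the remote container:

```shell
# List the snapshot handle directories under a mounted container.
# Assumes the .snapshot/<bucket>/<handle>/ layout seen in the NFS paths
# elsewhere in these notes.
list_snapshot_handles() {
    # $1 is the container mount point, e.g. /mnt
    find "$1/.snapshot" -mindepth 2 -maxdepth 2 -type d | sort
}

# Usage on the remote CVM after the mount above:
#   list_snapshot_handles /mnt
```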


On the primary site's Cerebro master 2020 page (start at http://cvmip:2020 - it will point to the master):

Cerebro Master
Protection Domains (1)

Name Is Active Total Consistency Groups Total Entities Total Snapshots Total Remotes Executing Meta Ops Tx or Rx KB/s
Platinum-1h yes 4 4 3 1 1 46561.28

Remote Bandwidth

Remote KB/s
Tx Rx
backup 46561.28

On the PD webpage ( ?pd=Platinum-1h ):

Start time 20130910-21:57:41-GMT-0700
Build Version release-congo-3.1.2-stable-101e3a8c77c9b8a2b3173a9e040bc41b339cdb5c
Build Last Commit Date 2013-08-27 23:43:31 -0700
Component ID 105
Incarnation ID 4475694
Local stargate incarnation ID 4623290
Highest allocated opid 18095
Highest contiguous completed opid 18094
Master

Protection Domain 'Platinum-1h'
Is Active yes
Total Consistency Groups 4
Total Entities 4
Total Snapshots 3
Total Remotes 1
Executing Meta Ops 1
Tx KB/s 54947.84

Consistency Groups (4)

Name Total VMs Total NFS Files To Remove
x-clone 1 0 no

Snapshots (3)

Handle Start Date Finish Date Expiry Date Total Consistency Groups Total Files Total GB Aborted Replicated To Remotes
(3103, 1361475834422355, 16204518) 20130912-09:51:31-GMT-0700 20130912-09:51:32-GMT-0700 20130916-13:51:32-GMT-0700 4 36 1871 no -
(3103, 1361475834422355, 16205073) 20130912-09:51:39-GMT-0700 20130912-09:51:40-GMT-0700 20130912-10:51:40-GMT-0700 4 36 1871 no -
(3103, 1361475834422355, 16507948) 20130912-11:00:14-GMT-0700 20130912-11:00:15-GMT-0700 20130912-12:00:15-GMT-0700 4 36 1871 no -

Meta Ops (1)

Meta Op Id Creation Date Is Paused To Abort
16219199 20130912-09:53:45-GMT-0700 no no

Snapshot Schedule

Periodic Schedule

Out Of Band Schedules
[no out of band schedules]

Pending Actions

replication {
  remote_name: "backup"
  snapshot_handle_vec {
    cluster_ id: 3103
    cluster_incarnation_id: 1361475834422355
    entity_id: 16507948
  }
}
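The snapshot handles shown in the tables above, like (3103, 1361475834422355, 16507948), are the same three fields the pending replication proto spells out: cluster_id, cluster_incarnation_id, and entity_id. A quick sketch splitting one apart:

```shell
# Split a snapshot handle like "(3103, 1361475834422355, 16507948)" into the
# cluster_id / cluster_incarnation_id / entity_id fields shown in the
# replication proto above.
handle='(3103, 1361475834422355, 16507948)'
set -- $(printf '%s\n' "$handle" | tr -d '(),')
echo "cluster_id=$1 incarnation=$2 entity_id=$3"
```

This is handy when matching a handle from the 2020 page against the directory names under .snapshot on the remote container.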

Latest completed top-level meta ops

Meta opid Opcode Creation time Duration (secs) Attributes Aborted Abort detail
15421792 SnapshotProtectionDomain 20130911-15:07:48-GMT-0700 1 snapshot=(3103, 1361475834422355, 15421793) No
15421797 Replicate 20130911-15:07:49-GMT-0700 36555 remote=backup snapshot=(3103, 1361475834422355, 15421793) replicated_bytes=2009518144619 tx_bytes=1109779340685 No
16204509 SnapshotGarbageCollector 20130912-01:17:05-GMT-0700 1 snapshot=(3103, 1361475834422355, 15421793) No
16204517 SnapshotProtectionDomain 20130912-09:51:31-GMT-0700 1 snapshot=(3103, 1361475834422355, 16204518) No
16205072 SnapshotProtectionDomain 20130912-09:51:39-GMT-0700 1 snapshot=(3103, 1361475834422355, 16205073) No
16204522 Replicate 20130912-09:51:33-GMT-0700 132 remote=backup snapshot=(3103, 1361475834422355, 16204518) replicated_bytes=145570717069 tx_bytes=21713968525 Yes Received RPC to abort
16507947 SnapshotProtectionDomain 20130912-11:00:14-GMT-0700 1 snapshot=(3103, 1361475834422355, 16507948) No


Advanced Cerebro Commands:
 cerebro_cli query_protection_domain pd_name list_consistency_groups=true

Remove a VM from PD
 cerebro_cli remove_consistency_group   pd_name vm_name

Sample Error
On cerebro Master:
grep Silver-24h data/logs/cerebro.*


cerebro.WARNING:E0913 11:00:59.830462  3361 protection_domain.cc:1601] notification=ProtectionDomainReplicationFailure protection_domain_name=Silver-24h remote_name=backup timestamp_usecs=1379095259228275 reason=Attempt to lookup attributes of path /AML_NFS01/.snapshot/48/3103-1361475834422355-18439248/AML-CTX-WIN7-A failed with NFS error 2

We found this error was caused by multiple VMs in the PD accessing the same vmdk, due to an independent non-persistent disk configuration:
./AML-CTX-WIN7-A/AML-CTX-WIN7-A.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-B/AML-CTX-WIN7-B.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-C/AML-CTX-WIN7-C.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-D/AML-CTX-WIN7-D.vmx
scsi0:0.mode = "independent-nonpersistent"
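A sketch of how to spot this condition yourself: scan every .vmx under a datastore mount for independent-nonpersistent disk modes (the helper name is illustrative; pass your own container mount point):

```shell
# List .vmx files that configure any disk as independent-nonpersistent.
# $1 is the datastore mount point (e.g. an NFS mount of the container).
find_nonpersistent_vmx() {
    find "$1" -name '*.vmx' -exec grep -l 'independent-nonpersistent' {} \;
}

# Usage (mount point is an example):
#   find_nonpersistent_vmx /mnt
```

If several of the listed VMs point at the same vmdk, that shared non-persistent disk is the likely cause of the NFS lookup failure above.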
(The underlying NFS error was visible from the NFS master page: /h/traces -> nfs adapter -> error page.)



The following screenshot shows the Cerebro master webpage:






4 comments:

  1. Hi,
    I can't understand how to troubleshoot DR in my situation: I activated a PD from the target site (the remote site went down), and the target cluster said "Protection Domain update has been initiated successfully. This process can take a while." But after two hours nothing had changed, and pd list-replication-status still showed "Complete Percent : 76.59703" with no progress.

    Replies
    1. Hi, sorry for the delay. We may have to look at the stargate and cerebro logs to find why this is happening. There could be a stuck metaop. Can you please open a Nutanix support ticket to fix the issue?

  2. You did a very complex job, thanks for sharing the information