nutanix: Nutanix DR troubleshooting

This is my scratch pad that I use for DR troubleshooting.

Basics:
Cerebro Master (CM): one per cluster, elected by ZooKeeper. It maintains persistent metadata (PDs/VMs/snapshots), acts on the active PD schedule, replicates snapshots, handshakes with the remote CM, and receives remote snapshots.
It registers/starts/stops/unregisters VMs by communicating with hyperint.
It offloads the replication of individual files to the Cerebro slaves.

Cerebro Slaves: run on every node and receive requests from the master to replicate an NFS file. A request
might contain a reference NFS file that was replicated earlier; the slave then computes and ships only the diffs from that reference file.

Information about the remote cluster is maintained in ZooKeeper: the IP and port info of the remote Cerebro,
whether a proxy is needed, the mapping between container (datastore) names, and the transfer rate
(within the max allowed).

The Cerebro slave asks Stargate to replicate the file, so it leverages Stargate for the actual data transfer.

It is true DDR (Distributed DR): all Cerebro slaves and Stargate instances work together to replicate the data.

Log files to look for:

- data/logs/cerebro.[INFO,ERROR], data/logs/cerebro.out

- data/logs/stargate.[INFO,ERROR] (grep -i cerebro)

Three components are mostly used for DR: prism_gateway, cerebro (port 2020), and stargate (port 2009).
DR also interacts with ESXi to register the VM when rolling back.
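
The log check above can be sketched as a small helper. This is only an illustration: the function name dr_log_grep and its optional log directory argument are made up for this note and are not Nutanix tooling. It defaults to the data/logs path listed above.

```shell
# dr_log_grep PD_NAME [LOG_DIR]
# Hypothetical helper: prints matching log lines prefixed with the file name.
dr_log_grep() {
    pd_name=$1
    log_dir=${2:-data/logs}
    # Cerebro logs: lines that mention the protection domain.
    grep -H -- "$pd_name" "$log_dir"/cerebro.* 2>/dev/null
    # Stargate logs: lines that mention Cerebro (case-insensitive).
    grep -H -i -- cerebro "$log_dir"/stargate.* 2>/dev/null
    # A missing match is not an error for this helper.
    return 0
}
```

Example: dr_log_grep Platinum-1h, run from the CVM home directory.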

Quick topology

ProtectionDomain(VM:xclone) -- Primary Container:LocalCTR---<primary SVM ips> - ports 2020/2009

====== ports 2009/2020 --<remoteDR SVMips> ---RemoteCTR_noSSD (.snapshot directory)

Troubleshooting steps:

- Run links http://remote-site-cvm:2009 and links http://remote-site-cvm:2020 (to all remote CVMs, from all local CVMs). Do it on both sides.

Traces:

http://cvm:2020

http://cvm:2020/h/traces

http://cvm:2009/h/traces
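
To avoid typing these URLs per CVM, a short helper can print them for a list of CVM IPs. This is a sketch; trace_urls is a made-up name, and it only prints the three URL patterns listed above.

```shell
# trace_urls CVM_IP...
# Hypothetical helper: prints the Cerebro and Stargate pages for each CVM.
trace_urls() {
    for cvm in "$@"; do
        printf '%s\n' "http://$cvm:2020"            # Cerebro main page
        printf '%s\n' "http://$cvm:2020/h/traces"   # Cerebro traces
        printf '%s\n' "http://$cvm:2009/h/traces"   # Stargate traces
    done
}
```

Example: trace_urls `svmips` prints the pages for every local CVM.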

Verify the routes, so that replication traffic goes through the non-production path.

Quick steps:

1. Note the container that has the VMs that have to be replicated.

2. Create a container on the remote site (no SSD).

3. Create remote sites in both locations.

ncli> rs ls
    Name                      : backup
    Replication Bandwidth     :
    Remote Address(es)        : remote_cvm1_ip,remote_cvm2_ip,etc
    Container Map             : LocalCTR:Remote_DR_CTR
    Proxy Enabled

4. Create the Protection Domain on the primary site:

ncli pd create name="Platinum-1h"

5. Add the VMs in the local container (the one in the Container Map shown by ncli rs ls) to the protection domain:

pd protect name="Platinum-1h" vm-names="xclone"

6. Create one time snapshot to verify replication is working

pd create-one-time-snapshot name="Platinum-1h" remote-sites="backup" retention-time=36000

(If you see the oob error repeatedly, pkill cerebro and start cerebro again, and also restart the prism bootstrapdaemon. Call Nutanix support.)

6b. Set the schedule:

pd set-schedule name="Platinum-1h" interval="3600" remote-sites="backup" min-snap-retention-count=10 retention-policy=1:30

This takes a snapshot every hour (interval is 3600 seconds) and retains at least 10 snapshots.
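
The interval arithmetic can be sketched as a helper that builds the schedule command from a number of hours. It assumes the interval is given in seconds (inferred from 3600 meaning one hour here). pd_schedule_cmd is a made-up name, and it only prints the command; it leaves out retention-policy because these notes do not explain that value.

```shell
# pd_schedule_cmd PD_NAME HOURS REMOTE_SITE MIN_COUNT
# Hypothetical helper: prints a pd set-schedule command, does not run it.
pd_schedule_cmd() {
    # Assumption: interval is in seconds, so hours * 3600.
    interval=$(( $2 * 3600 ))
    printf 'pd set-schedule name="%s" interval="%s" remote-sites="%s" min-snap-retention-count=%s\n' \
        "$1" "$interval" "$3" "$4"
}
```

Example: pd_schedule_cmd Platinum-1h 1 backup 10 prints the command from step 6b without the retention-policy argument.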

7. ncli> pd list-replication-status

    ID                        : 16204522
    Protection Domain         : Platinum-1h
    Replication Operation     : Sending
    Start Time                : 09/12/2013 09:51:33 PDT
    Remote Site               : backup
    Snapshot Id               : 16204518
    Aborted                   : false
    Paused                    : false
    Bytes Completed           : 17.57 GB (18,861,841,805 bytes)
    Complete Percent          : 6.0113173
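
To get a rough idea of how much is left, the two progress fields can be combined. The sketch below assumes that Complete Percent is bytes completed divided by total bytes, which these notes do not confirm, so treat the result as an estimate only. repl_estimate is a made-up name; it reads the status text on stdin.

```shell
# repl_estimate: reads `pd list-replication-status` text on stdin.
# Hypothetical helper; assumes Complete Percent = completed bytes / total bytes.
repl_estimate() {
    awk -F' *: *' '
        /Bytes Completed/ {
            # Use the exact byte count inside the parentheses.
            s = $2
            sub(/^.*\(/, "", s); sub(/ bytes.*$/, "", s); gsub(/,/, "", s)
            sent = s + 0
        }
        /Complete Percent/ { pct = $2 + 0 }
        END {
            if (pct > 0) {
                total = sent * 100 / pct
                printf "sent_bytes=%.0f est_total_bytes=%.0f est_remaining_bytes=%.0f\n",
                       sent, total, total - sent
            }
        }'
}
```

Example: ncli pd list-replication-status | repl_estimate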

8. List the previous snapshots

ncli> pd ls-snaps name=Platinum-1h

    ID                        : 16883844
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 13:52:42 PDT
    Expiry Time               : 09/13/2013 19:52:42 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On

    Replicated To Site(s)     : backup

    ID                        : 17258039
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 15:52:43 PDT
    Expiry Time               : 09/13/2013 21:52:43 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On

    Replicated To Site(s)     : backup

    ID                        : 16967424
    Protection Domain         : Platinum-1h
    Create Time               : 09/12/2013 14:52:42 PDT
    Expiry Time               : 09/13/2013 20:52:42 PDT
    Virtual Machine(s)        : 4

        VM Name                   : xclone
        Consistency Group         : xclone
        Power state on recovery   : Powered On
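
With many snapshots, it is easier to see which ones have reached the remote site in a one-line-per-snapshot summary. The sketch below parses the listing format shown above; snap_summary is a made-up name, and it treats a snapshot with no Replicated To Site(s) line as not yet replicated, which is an assumption.

```shell
# snap_summary: reads `pd ls-snaps` text on stdin and prints
# "<snapshot id> <replicated-to site, or - if none listed>" per snapshot.
snap_summary() {
    awk -F' *: *' '
        /^[ \t]*ID[ \t]/ {
            # A new snapshot starts: flush the previous one.
            if (id != "") print id, (site == "" ? "-" : site)
            id = $2; site = ""
        }
        /Replicated To Site/ { site = $2 }
        END { if (id != "") print id, (site == "" ? "-" : site) }'
}
```

Example: ncli pd ls-snaps name=Platinum-1h | snap_summary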

Additional notes:
1. If the remote site is in a different subnet, try links http://remote_host:2009 and links http://remote_host:2020 against all remote CVMs. If that does not work,
add iptables rules:

for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save;sudo /etc/init.d/iptables save"; done

This opens up port 2009 and needs to be done on both sides. Repeat for port 2020 as well.

Run this command on one CVM at the local site and one CVM at the remote site to verify connectivity:

for cvm in `svmips`; do (echo "From the CVM $cvm:"; ssh -q -t $cvm 'for i in `source /etc/profile; ncli rs ls | grep Remote | cut -d : -f2 | sed "s/,//g"`; do echo Checking stargate and cerebro port connection to $i ...; nc -v -w 2 $i -z 2009; nc -v -w 2 $i -z 2020; done'); done

From the CVM 192.168.3.207:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.208:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.209:

Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
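
With more CVMs the output gets long, so a per-CVM count of successful connections is quicker to read. The sketch below only counts the "succeeded!" lines in the format shown above; conn_summary is a made-up name. Each local CVM should report remote CVMs x 2 ports (6 in this example); anything lower means a port is blocked.

```shell
# conn_summary: reads the connectivity check output on stdin and prints
# "<source CVM> <number of successful connections>" per CVM.
conn_summary() {
    awk '
        /^From the CVM/ {
            if (cvm != "") print cvm, ok
            cvm = $4; sub(/:$/, "", cvm)   # drop the trailing colon
            ok = 0
        }
        /succeeded!/ { ok++ }
        END { if (cvm != "") print cvm, ok }'
}
```

Since nc -v may write to stderr, append 2>&1 to the check command before piping it into conn_summary.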

If it does not work, enable the iptables rules.

Enable 2020 and 2009 on the remote site (the remote command is quoted so that the save also runs on the remote CVM):
for i in `source /etc/profile; ncli rs ls | grep Remote | cut -d : -f2 | sed 's/,//g'`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT; sudo service iptables save"; done
for i in `source /etc/profile; ncli rs ls | grep Remote | cut -d : -f2 | sed 's/,//g'`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT; sudo service iptables save"; done

Enable it locally:
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save"; done
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT;sudo service iptables save"; done
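
Before changing firewall rules on every CVM, it can help to print the commands first and review them. The sketch below is a dry run only: print_dr_iptables_cmds is a made-up name, and it emits the same iptables command used above for both ports, without executing anything.

```shell
# print_dr_iptables_cmds HOST...
# Hypothetical dry-run helper: prints one ssh command per host and port.
print_dr_iptables_cmds() {
    for host in "$@"; do
        for port in 2009 2020; do   # Stargate and Cerebro ports
            printf 'ssh %s "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport %s -j ACCEPT; sudo service iptables save"\n' \
                "$host" "$port"
        done
    done
}
```

Example: print_dr_iptables_cmds `svmips` lists the commands for all local CVMs.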

Quick Verification:

- Verify the directory listing on the remote container:

sudo mount localhost:/Rep_container /mnt

cd /mnt/.snapshot

You should see the replicated files there.

On the primary site, open the Cerebro master page on port 2020 (you can find it from http://cvmip:2020, which points to the master):

Cerebro Master

Protection Domains (1)

Name	Is Active	Total Consistency Groups	Total Entities	Total Snapshots	Total Remotes	Executing Meta Ops	Tx or Rx KB/s
Platinum-1h	yes	4	4	3	1	1	46561.28

Remote Bandwidth

Remote	Tx KB/s	Rx KB/s
backup	46561.28	

On the PD webpage (?pd=Platinum-1h):

Start time	20130910-21:57:41-GMT-0700
Build Version	release-congo-3.1.2-stable-101e3a8c77c9b8a2b3173a9e040bc41b339cdb5c
Build Last Commit Date	2013-08-27 23:43:31 -0700
Component ID	105
Incarnation ID	4475694
Local stargate incarnation ID	4623290
Highest allocated opid	18095
Highest contiguous completed opid	18094

Master

Protection Domain 'Platinum-1h'
Is Active	yes
Total Consistency Groups	4
Total Entities	4
Total Snapshots	3
Total Remotes	1
Executing Meta Ops	1
Tx KB/s	54947.84

Consistency Groups (4)

Name	Total VMs	Total NFS Files	To Remove
x-clone	1	0	no

Snapshots (3)

Handle	Start Date	Finish Date	Expiry Date	Total Consistency Groups	Total Files	Total GB	Aborted	Replicated To Remotes
(3103, 1361475834422355, 16204518)	20130912-09:51:31-GMT-0700	20130912-09:51:32-GMT-0700	20130916-13:51:32-GMT-0700	4	36	1871	no	-
(3103, 1361475834422355, 16205073)	20130912-09:51:39-GMT-0700	20130912-09:51:40-GMT-0700	20130912-10:51:40-GMT-0700	4	36	1871	no	-
(3103, 1361475834422355, 16507948)	20130912-11:00:14-GMT-0700	20130912-11:00:15-GMT-0700	20130912-12:00:15-GMT-0700	4	36	1871	no	-

Meta Ops (1)

Meta Op Id	Creation Date	Is Paused	To Abort
16219199	20130912-09:53:45-GMT-0700	no	no

Snapshot Schedule

Periodic Schedule

Out Of Band Schedules
[no out of band schedules]

Pending Actions

replication {
  remote_name: "backup"
  snapshot_handle_vec {
    cluster_id: 3103
    cluster_incarnation_id: 1361475834422355
    entity_id: 16507948
  }
}

Latest completed top-level meta ops

Meta opid	Opcode	Creation time	Duration (secs)	Attributes	Aborted	Abort detail
15421792	SnapshotProtectionDomain	20130911-15:07:48-GMT-0700	1	snapshot=(3103, 1361475834422355, 15421793)	No
15421797	Replicate	20130911-15:07:49-GMT-0700	36555	remote=backup snapshot=(3103, 1361475834422355, 15421793) replicated_bytes=2009518144619 tx_bytes=1109779340685	No
16204509	SnapshotGarbageCollector	20130912-01:17:05-GMT-0700	1	snapshot=(3103, 1361475834422355, 15421793)	No
16204517	SnapshotProtectionDomain	20130912-09:51:31-GMT-0700	1	snapshot=(3103, 1361475834422355, 16204518)	No
16205072	SnapshotProtectionDomain	20130912-09:51:39-GMT-0700	1	snapshot=(3103, 1361475834422355, 16205073)	No
16204522	Replicate	20130912-09:51:33-GMT-0700	132	remote=backup snapshot=(3103, 1361475834422355, 16204518) replicated_bytes=145570717069 tx_bytes=21713968525	Yes	Received RPC to abort
16507947	SnapshotProtectionDomain	20130912-11:00:14-GMT-0700	1	snapshot=(3103, 1361475834422355, 16507948)	No
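
The Replicate rows carry enough data to work out the average transfer rate of a finished replication: tx_bytes divided by the duration in seconds. The sketch below assumes the table is pasted tab-separated as above and that KB means 1024 bytes; replicate_rate is a made-up name.

```shell
# replicate_rate: reads tab-separated meta op rows on stdin and prints
# "<meta opid> <average KB/s>" for each Replicate row.
replicate_rate() {
    awk -F'\t' '
        $2 == "Replicate" && $4 > 0 {
            n = split($5, attrs, " ")
            tx = 0
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^tx_bytes=/)
                    tx = substr(attrs[i], 10) + 0   # text after "tx_bytes="
            # Assumption: 1 KB = 1024 bytes.
            printf "%s %.0f KB/s\n", $1, tx / $4 / 1024
        }'
}
```

Aborted replications are included too, so check the Aborted column before comparing rates.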

Advanced Cerebro Commands:
cerebro_cli query_protection_domain pd_name list_consistency_groups=true

Remove a VM from PD
cerebro_cli remove_consistency_group pd_name vm_name

Sample Error
On cerebro Master:
grep Silver-24h data/logs/cerebro.*

cerebro.WARNING:E0913 11:00:59.830462 3361 protection_domain.cc:1601] notification=ProtectionDomainReplicationFailure protection_domain_name=Silver-24h remote_name=backup timestamp_usecs=1379095259228275 reason=Attempt to lookup attributes of path /AML_NFS01/.snapshot/48/3103-1361475834422355-18439248/AML-CTX-WIN7-A failed with NFS error 2

We found this error was caused by multiple VMs in the PD accessing the same vmdk because of their
independent-nonpersistent disk config:
./AML-CTX-WIN7-A/AML-CTX-WIN7-A.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-B/AML-CTX-WIN7-B.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-C/AML-CTX-WIN7-C.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-D/AML-CTX-WIN7-D.vmx
scsi0:0.mode = "independent-nonpersistent"
From NFS master page (/h/traces/ -- nfs adapter --> error page)
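
To check a container for this config before adding VMs to a PD, the .vmx files can be scanned for the disk mode shown above. A minimal sketch, assuming the datastore is mounted somewhere readable; find_nonpersistent_vmx is a made-up name.

```shell
# find_nonpersistent_vmx [DIR]
# Hypothetical helper: lists .vmx files that contain a disk in
# independent-nonpersistent mode, searching DIR (default: current directory).
find_nonpersistent_vmx() {
    dir=${1:-.}
    find "$dir" -name '*.vmx' \
        -exec grep -l 'mode = "independent-nonpersistent"' {} +
}
```

Example: find_nonpersistent_vmx /mnt, after mounting the container as in the Quick Verification step.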

The following snapshot is Cerebro master webpage:

4 comments:

Антон Владимирович, December 27, 2013 at 12:12 AM
Hi,
so, and i can't understand how to troubleshoot dr?
I have a situation, with Activate PD from target site (remote site has down), target cluster say me "Protection Domain update has been initiated successfully. This process can take a while.". But after two hour change nothing to do. And PD list-replication-status " Complete Percent : 76.59703" nothing change.

Thursday, September 12, 2013
