This is my scratch pad that I use for DR troubleshooting.
Basics:
Cerebro Master (CM): maintains persistent metadata (PDs/VMs/snapshots), acts on the active PD schedule, replicates snapshots, handshakes with the remote CM, and receives remote snapshots. One per cluster.
Elected via Zookeeper; registers/starts/stops/unregisters VMs by communicating with hyperint.
Offloads replication of individual files to the Cerebro slaves.
Cerebro Slaves: run on every node and receive requests from the master to replicate an NFS file. The request may contain a reference NFS file that was replicated earlier, in which case the slave computes and ships only the diffs from that reference file.
Information about the remote cluster is maintained in Zookeeper: IP and port of the remote Cerebro, whether a proxy is needed, the mapping between container (datastore) names, and the transfer rate (kept within the maximum allowed).
The Cerebro slave asks Stargate to replicate the file, so it leverages Stargate for the actual data transfer.
It is true DDR (Distributed DR): all Cerebro slaves and Stargates work together to replicate the data.
Log files to look for:
- data/logs/cerebro.[INFO,ERROR], data/logs/cerebro.out
- data/logs/stargate.[INFO,ERROR] - grep -i cerebro
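To sweep recent errors across the whole cluster in one pass, a loop like the one below can help (a minimal sketch; it assumes the logs live under /home/nutanix/data/logs and that passwordless SSH between CVMs works, which is the default):
for i in `svmips`; do echo "=== CVM $i ==="; ssh -q $i "grep -i error /home/nutanix/data/logs/cerebro.INFO | tail -5; grep -i cerebro /home/nutanix/data/logs/stargate.ERROR 2>/dev/null | tail -5"; done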
Three components are mostly used for DR: prism_gateway, cerebro (port 2020), and stargate (port 2009).
The Cerebro master interacts with ESXi (via hyperint) to register VMs when rolling back.
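A quick way to confirm that the Cerebro (2020) and Stargate (2009) ports are listening on every CVM, using the same nc check as the connectivity test further down (a sketch; run it from any CVM):
for i in `svmips`; do echo "CVM $i:"; nc -v -w 2 $i -z 2009; nc -v -w 2 $i -z 2020; done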
Quick topology:
ProtectionDomain(VM:xclone) -- Primary Container:LocalCTR --- <primary SVM IPs> - ports 2020/2009
====== ports 2009/2020 -- <remote DR SVM IPs> --- RemoteCTR_noSSD (.snapshot directory)
Troubleshooting steps:
- Verify reachability with links http://remote-site-cvm:2009 and :2020 from all local CVMs to all remote CVMs; do it on both sides.
Traces:
http://cvm:2020
http://cvm:2020/h/traces
http://cvm:2009/h/traces
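The same trace pages can be pulled with curl from any CVM if no browser is handy (a sketch; replace cvm with an actual CVM IP):
curl -s http://cvm:2020/h/traces | head -20
curl -s http://cvm:2009/h/traces | head -20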
Verify routes, so that replication traffic goes through the non-production path.
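To check which path the replication traffic would take from each CVM, the kernel's route lookup can be queried directly (a sketch; remote_cvm_ip is a placeholder for one of the remote CVM IPs):
for i in `svmips`; do echo "CVM $i:"; ssh -q $i "ip route get remote_cvm_ip; ip route"; done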
Quick steps:
1. Note the container holding the VMs that have to be replicated.
2. Create a container on the remote site (no SSD).
3. Create remote sites in both locations.
ncli> rs ls
Name : backup
Replication Bandwidth :
Remote Address(es) : remote_cvm1_ip,
remote_cvm2_ip,etc
Container Map : LocalCTR:Remote_DR_CTR
Proxy Enabled
4. Create the Protection Domain
On the primary site: ncli pd create name="Platinum-1h"
5. Add VMs from the local container (the one mapped in ncli rs ls) to the protection domain.
pd protect name="Platinum-1h" vm-names="xclone"
6. Create a one-time snapshot to verify replication is working.
pd create-one-time-snapshot name="Platinum-1h" remote-sites="backup" retention-time=36000
(If you see the out-of-band (OOB) snapshot error repeatedly, pkill cerebro and let it restart, and restart prism / the bootstrap daemon as well; contact Nutanix support.)
6b. pd set-schedule name="Platinum-1h" interval="3600" remote-sites="backup" min-snap-retention-count=10 retention-policy=1:30 - take a snapshot every hour and retain at least 10 snapshots.
7. ncli> pd list-replication-status
ID : 16204522
Protection Domain : Platinum-1h
Replication Operation : Sending
Start Time : 09/12/2013 09:51:33 PDT
Remote Site : backup
Snapshot Id : 16204518
Aborted : false
Paused : false
Bytes Completed : 17.57 GB (18,861,841,805 bytes)
Complete Percent : 6.0113173
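To watch progress without re-running the command by hand, the same status can be polled in a loop (a sketch using the field names shown above):
while true; do date; ncli pd list-replication-status | egrep 'Protection Domain|Complete Percent|Bytes Completed'; sleep 60; done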
8. List previous snapshots:
ncli> pd ls-snaps name=Platinum-1h
ID : 16883844
Protection Domain : Platinum-1h
Create Time : 09/12/2013 13:52:42 PDT
Expiry Time : 09/13/2013 19:52:42 PDT
Virtual Machine(s) : 4
VM Name : xclone
Consistency Group :xclone
Power state on recovery : Powered On
Replicated To Site(s) : backup
ID : 17258039
Protection Domain : Platinum-1h
Create Time : 09/12/2013 15:52:43 PDT
Expiry Time : 09/13/2013 21:52:43 PDT
Virtual Machine(s) : 4
VM Name : xclone
Consistency Group : xclone
Power state on recovery : Powered On
Replicated To Site(s) : backup
ID : 16967424
Protection Domain : Platinum-1h
Create Time : 09/12/2013 14:52:42 PDT
Expiry Time : 09/13/2013 20:52:42 PDT
Virtual Machine(s) : 4
VM Name : xclone
Consistency Group : xclone
Power state on recovery : Powered On
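For a compact view of snapshots and their retention, the verbose output can be filtered down to the interesting fields (a sketch based on the field names above):
ncli pd ls-snaps name=Platinum-1h | egrep 'ID|Create Time|Expiry Time'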
Additional notes:
1. If the remote site is in a different subnet, try links http://remote_host:2009 and :2020 (to all remote CVMs). If it does not work,
enable the iptables rules:
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT; sudo service iptables save"; done
This opens up 2009 and needs to be done on both sides. Repeat for port 2020 as well.
Run this command on one local CVM as well as on one remote CVM to verify connectivity:
for cvm in `svmips`; do (echo "From the CVM $cvm:"; ssh -q -t $cvm 'for i in ` source /etc/profile;ncli rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do echo Checking stargate and cerebro port connection to $i ...; nc -v -w 2 $i -z 2009; nc -v -w 2 $i -z 2020; done');done
From the CVM 192.168.3.207:
Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.208:
Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
From the CVM 192.168.3.209:
Checking stargate and cerebro port connection to 192.168.3.191 ...
Connection to 192.168.3.191 2009 port [tcp/news] succeeded!
Connection to 192.168.3.191 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.192 ...
Connection to 192.168.3.192 2009 port [tcp/news] succeeded!
Connection to 192.168.3.192 2020 port [tcp/xinupageserver] succeeded!
Checking stargate and cerebro port connection to 192.168.3.193 ...
Connection to 192.168.3.193 2009 port [tcp/news] succeeded!
Connection to 192.168.3.193 2020 port [tcp/xinupageserver] succeeded!
If it does not work, enable the iptables rules.
Enable 2020 and 2009 on the remote site:
for i in ` source /etc/profile;ncli rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT; sudo service iptables save"; done
for i in ` source /etc/profile;ncli rs ls |grep Remote|cut -d : -f2|sed 's/,//g'`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT; sudo service iptables save"; done
Enable it locally:
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT;sudo service iptables save"; done
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2020 -j ACCEPT;sudo service iptables save"; done
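To confirm the rules actually landed (and were saved) on every CVM, list the WORLDLIST chain and look for 2009/2020 (a sketch; the chain name comes from the rules above):
for i in `svmips`; do echo "CVM $i:"; ssh -q $i "sudo iptables -L WORLDLIST -n | egrep '2009|2020'"; done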
Quick Verification:
- Verify the directory listing on the remote container:
sudo mount localhost:/Rep_container /mnt
cd /mnt/.snapshot
You should see the replicated files there.
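If you prefer not to leave the container mounted, the same verification can be done in one shot and then cleaned up (a minimal sketch; Rep_container is the remote container used above):
sudo mount localhost:/Rep_container /mnt && ls -l /mnt/.snapshot | head && sudo umount /mnt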
Cerebro master page (http://cvm:2020) sample output:
Master |
Protection Domains (1) |
Name | Is Active | Total Consistency Groups | Total Entities | Total Snapshots | Total Remotes | Executing Meta Ops | Tx or Rx KB/s |
---|---|---|---|---|---|---|---|
Platinum-1h | yes | 4 | 4 | 3 | 1 | 1 | 46561.28 |
Remote Bandwidth |
Remote | Tx KB/s | Rx KB/s |
---|---|---|
backup | 46561.28 | |
Master |
Protection Domain 'Platinum-1h' | |
Is Active | yes |
Total Consistency Groups | 4 |
Total Entities | 4 |
Total Snapshots | 3 |
Total Remotes | 1 |
Executing Meta Ops | 1 |
Tx KB/s | 54947.84 |
Consistency Groups (4) |
Name | Total VMs | Total NFS Files | To Remove |
---|---|---|---|
x-clone | 1 | 0 | no |
Snapshots (3) |
Handle | Start Date | Finish Date | Expiry Date | Total Consistency Groups | Total Files | Total GB | Aborted | Replicated To Remotes |
---|---|---|---|---|---|---|---|---|
(3103, 1361475834422355, 16204518) | 20130912-09:51:31-GMT-0700 | 20130912-09:51:32-GMT-0700 | 20130916-13:51:32-GMT-0700 | 4 | 36 | 1871 | no | - |
(3103, 1361475834422355, 16205073) | 20130912-09:51:39-GMT-0700 | 20130912-09:51:40-GMT-0700 | 20130912-10:51:40-GMT-0700 | 4 | 36 | 1871 | no | - |
(3103, 1361475834422355, 16507948) | 20130912-11:00:14-GMT-0700 | 20130912-11:00:15-GMT-0700 | 20130912-12:00:15-GMT-0700 | 4 | 36 | 1871 | no | - |
Meta Ops (1) |
Meta Op Id | Creation Date | Is Paused | To Abort |
---|---|---|---|
16219199 | 20130912-09:53:45-GMT-0700 | no | no |
Snapshot Schedule |
Periodic Schedule |
---|
Out Of Band Schedules |
[no out of band schedules] |
Pending Actions |
replication { remote_name: "backup" snapshot_handle_vec { cluster_id: 3103 cluster_incarnation_id: 1361475834422355 entity_id: 16507948 } } |
Latest completed top-level meta ops |
Meta opid | Opcode | Creation time | Duration (secs) | Attributes | Aborted | Abort detail |
---|---|---|---|---|---|---|
15421792 | SnapshotProtectionDomain | 20130911-15:07:48-GMT-0700 | 1 | snapshot=(3103, 1361475834422355, 15421793) | No | |
15421797 | Replicate | 20130911-15:07:49-GMT-0700 | 36555 | remote=backup snapshot=(3103, 1361475834422355, 15421793) replicated_bytes=2009518144619 tx_bytes=1109779340685 | No | |
16204509 | SnapshotGarbageCollector | 20130912-01:17:05-GMT-0700 | 1 | snapshot=(3103, 1361475834422355, 15421793) | No | |
16204517 | SnapshotProtectionDomain | 20130912-09:51:31-GMT-0700 | 1 | snapshot=(3103, 1361475834422355, 16204518) | No | |
16205072 | SnapshotProtectionDomain | 20130912-09:51:39-GMT-0700 | 1 | snapshot=(3103, 1361475834422355, 16205073) | No | |
16204522 | Replicate | 20130912-09:51:33-GMT-0700 | 132 | remote=backup snapshot=(3103, 1361475834422355, 16204518) replicated_bytes=145570717069 tx_bytes=21713968525 | Yes | Received RPC to abort |
16507947 | SnapshotProtectionDomain | 20130912-11:00:14-GMT-0700 | 1 | snapshot=(3103, 1361475834422355, 16507948) | No |
Advanced Cerebro Commands:
cerebro_cli query_protection_domain pd_name list_consistency_groups=true
Remove a VM from PD
cerebro_cli remove_consistency_group pd_name vm_name
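To dump consistency-group details for every protection domain at once, the query can be wrapped in a loop over the PD names (a sketch; it assumes ncli pd ls lists protection domains with a Name field - adjust the grep/awk to your ncli output):
for pd in `ncli pd ls | grep 'Name' | awk -F: '{print $2}'`; do echo "=== $pd ==="; cerebro_cli query_protection_domain $pd list_consistency_groups=true; done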
Sample Error
On cerebro Master:
grep Silver-24h data/logs/cerebro.*
cerebro.WARNING:E0913 11:00:59.830462 3361 protection_domain.cc:1601] notification=ProtectionDomainReplicationFailure protection_domain_name=Silver-24h remote_name=backup timestamp_usecs=1379095259228275 reason=Attempt to lookup attributes of path /AML_NFS01/.snapshot/48/3103-1361475834422355-18439248/AML-CTX-WIN7-A failed with NFS error 2
We found this error was caused by multiple VMs in the PD accessing the same vmdk, due to the independent non-persistent disk configuration:
./AML-CTX-WIN7-A/AML-CTX-WIN7-A.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-B/AML-CTX-WIN7-B.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-C/AML-CTX-WIN7-C.vmx
scsi0:0.mode = "independent-nonpersistent"
./AML-CTX-WIN7-D/AML-CTX-WIN7-D.vmx
scsi0:0.mode = "independent-nonpersistent"
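To find every VM on the container that still has this disk mode before protecting it, the .vmx files can be grepped over an NFS mount of the container from a CVM (a sketch; AML_NFS01 is the container name from the error above):
sudo mount localhost:/AML_NFS01 /mnt
grep -l 'independent-nonpersistent' /mnt/*/*.vmx
sudo umount /mnt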
The same error can also be seen on the NFS master page (http://cvm:2009/h/traces -> nfs adapter -> error page).
[Screenshot of the Cerebro master web page (http://cvm:2020); sample output is shown above.]
Comments:
Q: How do I troubleshoot DR in this situation? I activated the PD from the target site (the remote site went down), and the target cluster told me "Protection Domain update has been initiated successfully. This process can take a while." But after two hours nothing has changed, and pd list-replication-status still shows "Complete Percent : 76.59703".
A: Sorry for the delay. We would have to look at the stargate and cerebro logs to find out why this is happening; there could be a stuck meta op. Can you please open a Nutanix support ticket to fix the issue?