nutanix: What the node removal process does in the background

Note: These two commands should work and have been tested in QA.
ncli host start-remove id=X
ncli host get-remove-status
ncli host remove-finish id=X reports that the host is MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE

If the removal does not complete, look at dynamic_ring_changer.out, dynamic_ring_changer.INFO, and the Cassandra logs for any crashes.

The following explains how a node removal works. (Please note that running this procedure without Nutanix Support/Engineering can cause data corruption.)

In older versions of NOS, if you have to remove a node manually, follow this procedure with help from Nutanix support:

1. Find the token of the node to be removed: nodetool -h localhost ring

2. Remove the node dynamically:

ring_changer -node_ring_token=O8yDYNgicJraBlfZrHAsORTseZq0MQ2ke6uhuzNh8Y6bMLEH046M7yQPz5q5 -do_delete -dynamic_ring_change=true

(You can skip keyspaces with -skip_keyspaces, for example -skip_keyspaces="stats".)

ring_changer -node_ring_token="KlbSuLpdEFJDIoUXp1TPVEwmcYrlo9pJlm4yHemOtpmnBowMcyYQAbTcF8Vh" -dynamic_ring_change=true -skip_keyspaces="stats,alerts_keyspace,alerts_index_keyspace,medusa_extentgroupaccessdatamap,pithos"

3. If it does not complete, look at dynamic_ring_changer.out, dynamic_ring_changer.INFO, and the Cassandra logs for any crashes.

4. If there are Cassandra crashes, remove the stats rows and compact:

a. Connect to cassandra using 'cassandra-cli'

cassandra-cli -h localhost -p 9160 -k stats

b. Remove the desired rows:

del stats_realtime['PWhm:vdisk_usage'];

del stats_realtime['nYeQ:vdisk_derived_usage'];

del stats_realtime['ha4h:vdisk_perf'];
del stats_realtime['E0Tl:vdisk_frontend_adapter_perf'];

c. Exit

quit;

d. Run a compaction on every Controller VM:

for i in `svmips`; do echo $i; ssh $i "source /etc/profile; nodetool -h localhost compact"; done
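
The compaction loop above keeps going even if one host fails. A minimal sketch of a variant that stops at the first failure, assuming the same ssh access and remote command as the original loop; the function name compact_all is hypothetical and not part of any Nutanix tooling:

```shell
# Hypothetical wrapper around the compaction loop: runs the compaction on each
# IP passed as an argument and stops at the first host where it fails.
compact_all() {
  for ip in "$@"; do
    echo "compacting on $ip"
    if ! ssh "$ip" "source /etc/profile; nodetool -h localhost compact"; then
      echo "compaction failed on $ip" >&2
      return 1
    fi
  done
}

# Usage on a Controller VM (svmips prints the IPs, as in the original loop):
#   compact_all $(svmips)
```

Stopping early is a design choice for this sketch: it makes a failed host obvious before ring_changer is run again, at the cost of leaving later hosts uncompacted until the loop is rerun.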

5. Run ring_changer again.

6. Verify that all data has a replication factor of 2 (even though the node being removed is powered off):

for i in `svmips`; do ssh $i "cd data/stargate-storage/disks; find . -name \*.egroup -print|cut -d/ -f6"; done|sort | uniq -c|grep -v " 2 "

This prints the extent groups whose replica count is anything other than 2.
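
The counting logic of this check can be tried on sample data without touching a cluster. A minimal sketch, assuming each input line is one extent group file name as found on one disk; the function name egroups_not_rf2 and the sample file names are hypothetical, and awk is swapped in for grep -v " 2 " so that the count is compared as a number:

```shell
# Hypothetical helper: reads extent group file names on stdin (one line per
# replica found) and prints "<count> <name>" for every name that does not
# appear exactly twice.
egroups_not_rf2() {
  sort | uniq -c | awk '$1 != 2 { print $1, $2 }'
}

# Sample input: 10.egroup has two replicas, 11.egroup one, 12.egroup three.
printf '%s\n' 10.egroup 11.egroup 10.egroup 12.egroup 12.egroup 12.egroup |
  egroups_not_rf2
```

On the sample input this lists 11.egroup and 12.egroup and stays silent about 10.egroup. Empty output means every extent group was seen exactly twice.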

7. Now verify the Zeus configuration (zeus_config_printer):

- Make sure all of the node's disks are marked for removal and have their data migrated:

to_remove: true
data_migrated: true

- Make sure the node has kOkToBeRemoved and node_removal_ack: 273 (0x111), where each bit is one acknowledgement:

0x100 - zookeeper ok to be removed
0x10 - cassandra ok to be removed
0x1 - curator ok to be removed
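
The node_removal_ack value is the bitwise OR of the three flags listed above, so 0x100 | 0x10 | 0x1 = 0x111 = 273. A minimal sketch that decodes a value read from the zeus_config_printer output; the function name decode_removal_ack is hypothetical, and the value has to be copied in by hand:

```shell
# Hypothetical helper: decodes a node_removal_ack value into the three flags
# listed above. Accepts decimal (273) or hex (0x111) input.
decode_removal_ack() {
  ack=$(( $1 ))
  for pair in 256:zookeeper 16:cassandra 1:curator; do
    bit=${pair%%:*}    # numeric flag value (0x100, 0x10, 0x1 in decimal)
    name=${pair#*:}    # component that sets this flag
    if [ $(( ack & bit )) -ne 0 ]; then
      echo "$name: ok to be removed"
    else
      echo "$name: NOT yet ok to be removed"
    fi
  done
}

decode_removal_ack 273
```

A value such as 272 (0x110) would show curator as not yet ok, which points at the component still holding up the removal.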

8. Now run:
ncli host remove-finish id=X

9. Remove the token from the ring:

nodetool -h localhost removetoken O8yDYNgicJraBlfZrHAsORTseZq0MQ2ke6uhuzNh8Y6bMLEH046M7yQPz5q5


Wednesday, September 4, 2013
