Tuesday, September 10, 2013

Nutanix cluster start - Troubleshooting

1. Cluster create
Run cluster discover-nodes. If nodes are not discovered, check:
- Network connectivity: all the nodes must be in the same broadcast domain
- /etc/nutanix/factory_config.json has the right config (node location, serial number)
- avahi-browser (/var/log/messages)
- genesis is running

2. Cluster start - three main processes, plus svm_boot, need to work. The rest of the processes (prism, pithos,
even stargate) will start up if these work.

a. Genesis - data/logs/genesis.out
b. Zookeeper - data/logs/zookeeper.out
c. Cassandra - data/logs/cassandra/system.log
d. svm_boot - /usr/local/nutanix/bootstrap.log
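
The four log locations above can be collected into a small lookup so they are checked in one pass. This is only a sketch: it assumes the relative data/logs paths are relative to /home/nutanix (the notes use /home/nutanix/data/locks elsewhere, but do not say so for logs), and the names STARTUP_LOGS, resolve_logs and missing_logs are made up for illustration.

```python
from pathlib import Path

# Assumption: relative paths in the notes are relative to the nutanix
# user's home directory on the CVM.
NUTANIX_HOME = Path("/home/nutanix")

# Component -> log path, exactly as listed in the notes.
STARTUP_LOGS = {
    "genesis": "data/logs/genesis.out",
    "zookeeper": "data/logs/zookeeper.out",
    "cassandra": "data/logs/cassandra/system.log",
    "svm_boot": "/usr/local/nutanix/bootstrap.log",
}


def resolve_logs(home=NUTANIX_HOME):
    """Return {component: Path}. Joining with an absolute path keeps it as-is."""
    return {name: Path(home) / rel for name, rel in STARTUP_LOGS.items()}


def missing_logs(home=NUTANIX_HOME):
    """List components whose log file does not exist."""
    return [name for name, path in resolve_logs(home).items() if not path.exists()]


if __name__ == "__main__":
    for name, path in resolve_logs().items():
        print(f"{name:10s} {path}")
    print("missing:", missing_logs())
```

A component that shows up in the missing list has probably never started, which is a hint to look at the process before it in the startup order.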

3. Genesis issues:
a. ESXi password
b. ESXi network / internal vswitch
c. Genesis was started as root
Fix: pkill genesis, rm /home/nutanix/data/locks/genesis, then chown genesis.out to nutanix
d. Wrong zookeeper config
E0725 13:15:55.644642 14286 configuration_validator.cc:532] Zeus config check failed: invalid management_server_name
F0725 13:15:55.646738 14286 zeus.cc:698] Check failed: validator->config_valid() logical_timestamp: 44
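
The two lines above follow the glog prefix layout (severity letter, month and day, time, thread id, source file and line). A minimal sketch for pulling the E (error) and F (fatal) lines out of a log such as genesis.out could look like this; the function and field names are illustrative, not part of any Nutanix tool.

```python
import re
from typing import NamedTuple, Optional

# glog prefix: <level><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread_id>\d+)\s+"
    r"(?P<source>[^:\s]+):(?P<lineno>\d+)\]\s?(?P<message>.*)$"
)


class GlogEntry(NamedTuple):
    level: str
    month: int
    day: int
    time: str
    thread_id: int
    source: str
    lineno: int
    message: str


def parse_glog(line: str) -> Optional[GlogEntry]:
    """Parse one glog-style line; return None if it does not match."""
    match = GLOG_RE.match(line.rstrip())
    if match is None:
        return None
    g = match.groupdict()
    return GlogEntry(
        g["level"], int(g["month"]), int(g["day"]), g["time"],
        int(g["thread_id"]), g["source"], int(g["lineno"]), g["message"],
    )


def errors_and_fatals(lines):
    """Keep only the entries logged at error (E) or fatal (F) severity."""
    entries = (parse_glog(line) for line in lines)
    return [e for e in entries if e is not None and e.level in ("E", "F")]
```

Feeding it the lines of a log file (for example `errors_and_fatals(open(path))`) gives the source file and line number of each failure, which is what identifies the Zeus config validator as the culprit in the example above.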

4. Zookeeper fails to start
a. /etc/hosts does not list the same zookeeper nodes (zk1, zk2 and zk3) on all the hosts
b. Snapshot corruption
c. edit_zeus was used, and there was a misconfiguration or a zeus_config validation bug
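
The /etc/hosts check in (a) is easy to get wrong by eye. The following is a rough sketch that compares the zk entries across copies of /etc/hosts that have already been collected from each host (how they are collected is left out). It assumes the zookeeper names follow the zk1/zk2/zk3 pattern from the notes; the function names are hypothetical.

```python
import re
from collections import Counter


def zk_entries(hosts_text):
    """Map zk hostnames (zk1, zk2, ...) to the IP on their /etc/hosts line."""
    entries = {}
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            if re.fullmatch(r"zk\d+", name):
                entries[name] = ip
    return entries


def mismatched_hosts(hosts_by_node):
    """Return {node: zk entries} for nodes that differ from the majority mapping."""
    parsed = {node: zk_entries(text) for node, text in hosts_by_node.items()}
    if not parsed:
        return {}
    counts = Counter(tuple(sorted(e.items())) for e in parsed.values())
    reference = dict(counts.most_common(1)[0][0])
    return {node: e for node, e in parsed.items() if e != reference}
```

Using the majority mapping as the reference is a design shortcut: it flags the odd node out, but it cannot tell which mapping is actually correct if the hosts are evenly split.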

5. Cassandra failures
- SSD tier full / SSD tier not accessible
- Cassandra configs were not reset during cluster destroy:
ERROR [main] 2012-06-19 14:21:03,868 AbstractCassandraDaemon.java (line 154) Fatal exception during initialization
org.apache.cassandra.config.ConfigurationException: enable_cluster_name_change is true but saved cluster name '1131' does not match the allowed prior name 'Test Cluster' and the configured name '1132'
- Timestamp in the future
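
The notes do not say where the future timestamp shows up. One place that is cheap to check is system.log itself: the sketch below parses the timestamp from lines shaped like the ERROR line above and reports any that are later than a reference time. It assumes log timestamps and the reference clock are in the same timezone, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# ERROR [main] 2012-06-19 14:21:03,868 AbstractCassandraDaemon.java (line 154) ...
CASSANDRA_RE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(?P<rest>.*)$"
)


def parse_cassandra_line(line):
    """Return (level, timestamp, rest) or None for continuation lines."""
    match = CASSANDRA_RE.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return match.group("level"), ts, match.group("rest")


def future_entries(lines, now=None):
    """Return parsed entries whose timestamp is later than `now`."""
    now = now or datetime.now()
    parsed = (parse_cassandra_line(line) for line in lines)
    return [entry for entry in parsed if entry is not None and entry[1] > now]
```

Lines that do not start with a level and timestamp, such as the exception text on the second line of the example, are skipped rather than treated as errors.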

6. CVM boot failures
svm_boot handles disk inventory, formatting and mounting disks, and vmx configs (repopulating vmx disk entries).
It also copies ipmitool to ESXi.
Marker: /tmp/svm_boot_succeeded
/usr/local/nutanix/bootstrap/bin/svm_boot_reboot.dat (this prevents svm_boot from being run)
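
The two files above are enough for a quick state check on the CVM. A minimal sketch, using only the paths given in the notes; the function name and the dictionary keys are hypothetical.

```python
import os

# Paths taken from the notes above.
BOOT_SUCCEEDED = "/tmp/svm_boot_succeeded"
REBOOT_MARKER = "/usr/local/nutanix/bootstrap/bin/svm_boot_reboot.dat"


def svm_boot_state(succeeded=BOOT_SUCCEEDED, reboot_marker=REBOOT_MARKER):
    """Report whether svm_boot finished and whether it is blocked from running."""
    return {
        "boot_succeeded": os.path.exists(succeeded),
        "svm_boot_blocked": os.path.exists(reboot_marker),
    }


if __name__ == "__main__":
    for key, value in svm_boot_state().items():
        print(f"{key}: {value}")
```

If the success marker is absent, /usr/local/nutanix/bootstrap.log is the next place to look.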

7. Disk failures: smartctl tools and how Nutanix marks the disk offline
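
For the smartctl side only, a hedged sketch: it runs smartctl -H against a device and extracts the overall health verdict from the output. It does not show how Nutanix marks a disk offline, which these notes do not describe. The device path in the usage comment is a placeholder and the function names are illustrative.

```python
import subprocess


def parse_health(output):
    """Extract the health verdict from `smartctl -H` output, or None if absent."""
    for line in output.splitlines():
        # ATA disks report "SMART overall-health self-assessment test result: ..."
        # SCSI/SAS disks report "SMART Health Status: ..."
        if ("overall-health self-assessment test result:" in line
                or "SMART Health Status:" in line):
            return line.split(":", 1)[1].strip()
    return None


def smartctl_health(device):
    """Run `smartctl -H <device>` (usually needs root) and return the verdict."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # treated as an exception here.
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_health(result.stdout)


# Example (placeholder device): smartctl_health("/dev/sda")
```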