Stargate marks a disk offline if any command to it takes more than 20 seconds.
Reasons that a disk could take more than 20 seconds to respond, and how to check for each:
1. Bad sectors: sudo smartctl -a /dev/sdX1 -T permissive
2. Bad adapter: sudo smartctl -l sataphy /dev/sdc1
3. Bad chassis
4. Slow disk (check iostat from sysstats: data/logs/sysstat)
5. stargate.INFO will show the disk going offline; running gdb on the stargate core file shows threads stuck on fdatasync and pwrite
6. Mark the disk offline to see if it goes offline again.
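The slow-disk check in item 4 can be roughly automated by filtering iostat text for devices with a high average wait. The following is a minimal sketch, not a supported tool. It assumes `iostat -x` style input with a header line beginning with `Device` and a column literally named `await` (in milliseconds); newer sysstat releases split this into `r_await` and `w_await`, which the sketch does not handle. The function name `slow_disks` and the default threshold are hypothetical.

```shell
# slow_disks [threshold_ms]
# Reads iostat -x style text on stdin and prints "<device> <await>" for
# every device row whose await value exceeds the threshold (default 1000 ms).
# Assumption: the device table has a header starting with "Device" that
# contains a column named "await", and each table ends at a blank line.
slow_disks() {
  threshold_ms="${1:-1000}"
  awk -v limit="$threshold_ms" '
    # Header row: remember which column holds "await".
    /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "await") col = i
      next
    }
    # A blank line ends the current device table.
    NF == 0 { col = 0; next }
    # Device row: compare the await value numerically against the limit.
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}
```

For example, `iostat -x 5 3 | slow_disks 1000` would list any device whose average wait exceeded one second in a sample. A high await only suggests a slow or stuck disk; confirm with the SMART output and stargate.INFO before drawing conclusions.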
The following solution details walk through the commands.
Here is an example of the outputs:
1. iostat (you can graph it using Nagios at zk_leader:7777, or from the sysstats directory in /home/nutanix/data/logs):
IO on this disk was stuck for more than 20 seconds, which caused Stargate to mark it offline. A number of write ops were in progress at the time, all stuck in the pwrite/fdatasync system calls on this disk.
#TIMESTAMP 1366215700 : 04/17/2013 09:21:40 AM
Linux 2.6.35-30-server (NTNX-Ctrl-VM-3-NTNX) 04/17/2013 _x86_64_ (8 CPU)