[Beowulf] RAID question
mathog at caltech.edu
Fri Mar 13 17:52:26 PDT 2015
A bit off topic, but some of you may have run into something similar.
Today I was called in to try to fix a server which had stopped working.
It is not my machine; the usual sysop is out sick. The
model is a Dell PowerEdge T320 with a PERC H710P RAID controller.
The symptoms reported were "it stopped working, could not find 'ls', and
wouldn't reboot past grub". (Evidently it could find 'reboot'.)
Got into the BIOS and ran a RAID consistency check, which took 3 hours.
It didn't say whether it had passed or failed, or put up any sort of
status message whatsoever, but no failure lights were lit on the disks.
On a reboot it gives:
grub error 8: kernel must be loaded before booting.
It is a CentOS 6.5 system, so I booted it with an installation disk of
that flavor and dropped down into a shell.
This is where it gets strange.
/boot is on /dev/sdb1. When mounted, that directory is empty, but
when unmounted, fsck shows 10 files in it taking up about 12 MB. Pretty
clear why it wouldn't boot with nothing in /boot. Not sure
what the 10 files fsck sees are, perhaps part of the filesystem. (ext2
I think). I had never tried running fsck on an empty file system in a
/bin is missing entirely, so that's why "ls" stopped working. /usr/bin
is still there, which is why reboot was OK.
/var/log/messages shows that the machine was logging what look like
corrected disk errors (sense errors) for /dev/sdb1 for days before it
Tried copying the contents of another machine's /boot (which is supposed
to be an exact copy of this one) into /boot, and rebooting,
but grub didn't get any farther than it had before. Probably grub needs
to be reinstalled, but with /bin missing, and who knows what else gone
besides, it seems like a full OS reinstall would be in order.
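
To get a quick picture of "what else is gone", something like the following could be run from the rescue shell. It is only a minimal sketch: the function name missing_dirs, the directory list, and the mount point /mnt/sysimage are assumptions for illustration, not something taken from this machine.

```shell
#!/bin/sh
# Sketch: report which standard top-level directories are absent under
# a mounted root filesystem. The directory list is illustrative only.
missing_dirs() {
    root=$1   # where the damaged root filesystem is mounted
    for d in bin sbin lib lib64 etc boot var usr/bin usr/sbin; do
        # Print the path (relative to the damaged root) if it is missing.
        [ -d "$root/$d" ] || echo "/$d"
    done
}

# ASSUMPTION: the damaged root is mounted at /mnt/sysimage.
missing_dirs /mnt/sysimage
```

An empty result would only mean the directories exist, not that their contents are intact.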
Off the top of my head, if it weren't for the sense errors on /dev/sdb1,
I would think that this might have been the result of an accidental (or
rm -rf /
Anybody run into a hardware/software glitch with symptoms like this on a
Is there some way on these sorts of Dells to run per-disk diagnostics
from BIOS or UEFI even if they are already grouped into a virtual disk
by the controller? I suspect that the disk which is /dev/sdb may really
be on its way out, but I couldn't get smartctl to work off the DVD or
from the copy on disk. (The smartctl commands used were tested on the
twin machine, and they worked there.) The BIOS showed that SMART was
disabled on all of the disks. Web searches for diagnostics for this
controller all referenced software that requires a running OS, nothing
built into the BIOS/UEFI. (It is set to use BIOS.)
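
For reference, the general shape of per-disk SMART queries through a MegaRAID-based controller (the PERC H710P is one) is sketched below. It uses smartctl's "-d megaraid,N" device type. The helper name perc_smart_cmds, the device /dev/sda, and the ID range 0..3 are assumptions for illustration; the real physical device IDs depend on the controller and have to be looked up. The sketch only prints the commands and does not run them.

```shell
#!/bin/sh
# Sketch: print one smartctl command per physical disk behind a
# MegaRAID-based controller, using the "-d megaraid,N" device type.
perc_smart_cmds() {
    vdev=$1    # block device of the virtual disk (assumed name)
    count=$2   # number of physical device IDs to probe, starting at 0
    id=0
    while [ "$id" -lt "$count" ]; do
        echo "smartctl -a -d megaraid,$id $vdev"
        id=$((id + 1))
    done
}

# ASSUMPTION: virtual disk at /dev/sda, physical device IDs 0 through 3.
perc_smart_cmds /dev/sda 4
```

Whether this gets any further from the installation DVD than the commands already tried is an open question.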
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech