[Beowulf] RAID question

mathog mathog at caltech.edu
Fri Mar 13 17:52:26 PDT 2015


A bit off topic, but some of you may have run into something similar.

Today I was called in to try to fix a server which had stopped working. 
It is not my machine; the usual sysop is out sick.  The model is a Dell 
PowerEdge T320 with a PERC H710P RAID controller.

The symptoms reported were "it stopped working, could not find 'ls', and 
wouldn't reboot past grub".  (Evidently it could find 'reboot'.)

Got into the BIOS and ran a RAID consistency check, which took 3 hours.  
It didn't say whether it had passed or failed, or put up any sort of 
status message whatsoever, but there were no failure lights lit on the 
disks.

On a reboot it gives:

   grub error 8: kernel must be loaded before booting.

It is a CentOS 6.5 system, so I booted it with an installation disk of 
that flavor and dropped down into a shell.

This is where it gets strange.

/boot is on /dev/sdb1.  When mounted, that directory is empty, but
when unmounted, fsck shows 10 files in it taking up about 12 MB.  Pretty 
clear why it wouldn't boot with nothing in /boot.  Not sure what the 10 
files fsck sees are, perhaps part of the filesystem itself (ext2, I 
think).  I had never tried running fsck on an empty file system in a 
partition before.

/bin is missing entirely, so that's why "ls" stopped working.  /usr/bin 
is still there, which is why reboot was OK.

/var/log/messages shows that the machine was logging what look like 
corrected disk errors (sense errors) for /dev/sdb1 for days before it 
failed.
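
For anyone wanting to see how long such errors had been accumulating, 
here is a rough Python sketch that tallies matching log lines per day.  
It assumes a syslog-style "Mon DD HH:MM:SS" prefix and simply looks for 
lines mentioning both the device name and the word "sense"; the function 
name and the matching rule are illustrative assumptions, not taken from 
the actual log on this machine.

```python
import re
from collections import Counter

# Assumed syslog-style prefix, e.g. "Mar 10 04:12:01 host kernel: ..."
_DATE = re.compile(r"^(\w{3}\s+\d{1,2})\s")

def tally_sense_errors(lines, device="sdb"):
    """Count, per day, the lines that mention both the device and 'sense'."""
    counts = Counter()
    for line in lines:
        low = line.lower()
        if device in low and "sense" in low:
            m = _DATE.match(line)
            # Normalize "Mar  3" and "Mar 3" to the same key.
            day = " ".join(m.group(1).split()) if m else "unknown"
            counts[day] += 1
    return counts

# Usage (path as mentioned above; adjust if the root is mounted elsewhere):
#   with open("/var/log/messages") as f:
#       for day, n in tally_sense_errors(f).items():
#           print(day, n)
```

A steadily rising daily count would at least be consistent with a disk 
that is degrading, though it would not prove it.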

Tried copying the contents of another machine's /boot (which is supposed 
to be an exact copy of this one) into /boot, and rebooting,
but grub didn't get any farther than it had before.  Probably grub needs 
to be reinstalled, but with /bin missing, and who knows what else gone 
besides, it seems like a full OS reinstall would be in order.
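
To get a quick idea of "what else is gone", something like the 
following Python sketch could be run against the mounted root from a 
rescue shell.  The list of expected directories and the mount point in 
the usage comment are assumptions for illustration; it only checks for 
top-level directories that are absent or empty and says nothing about 
damage deeper in the tree.

```python
import os

# Assumed set of top-level directories one would expect on this kind of system.
EXPECTED = ["bin", "boot", "dev", "etc", "lib", "lib64", "sbin", "usr", "var"]

def missing_dirs(root, expected=EXPECTED):
    """Return (missing, empty): expected names absent under root,
    and expected names that exist as directories but contain nothing."""
    missing, empty = [], []
    for name in expected:
        path = os.path.join(root, name)
        if not os.path.isdir(path):
            missing.append(name)
        elif not os.listdir(path):
            empty.append(name)
    return missing, empty

# Usage (hypothetical mount point for the damaged root filesystem):
#   missing, empty = missing_dirs("/mnt/sysimage")
#   print("missing:", missing)
#   print("empty:", empty)
```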

Off the top of my head, if it weren't for the sense errors on /dev/sdb1, 
I would think that this might have been the result of an accidental (or 
hacker's)

   rm -rf /

Has anybody run into a hardware/software glitch with symptoms like this 
on a similar system?

Is there some way on these sorts of Dells to run per-disk diagnostics 
from BIOS or UEFI even if they are already grouped into a virtual disk 
by the controller?  I suspect that the disk which is /dev/sdb may really 
be on its way out, but I couldn't get smartctl to work off the DVD or 
from the copy on disk.   (The smartctl commands used were tested on the 
twin machine, and they worked there.)  The BIOS showed that SMART was 
disabled on all of the disks.  Web searches for diagnostics for this 
controller all referenced software that requires a running OS, nothing 
built into the BIOS/UEFI.  (It is set to use BIOS.)

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

