[Beowulf] RAID question

Sat Mar 14 14:50:21 PDT 2015

On 3/13/2015 5:52 PM, mathog wrote:
> A bit off topic, but some of you may have run into something similar.
>
> Today I was called in to try to fix a server which had stopped 
> working.  Not my machine, the usual sysop is out sick.  The
> model is a Dell PowerEdge T320 with a PERC H710P RAID controller.
>
> The symptoms reported were "it stopped working, could not find 'ls', 
> and wouldn't reboot past grub".  (Evidently it could find 'reboot'.)
>
> Got into the BIOS and ran a RAID consistency check, which took 3 hours.  
> It didn't say if it had passed or failed, or put up any sort of status 
> message whatsoever, but there were no failure lights lit on the disks.
>
> On a reboot it gives:
>
>   grub error 8: kernel must be loaded before booting.
>
> It is a CentOS 6.5 system, so booted it with an installation disk of 
> that flavor, and dropped down into a shell.
>
> This is where it gets strange.
>
> /boot is on /dev/sdb1.  When mounted, that directory is empty, but
> when unmounted, fsck shows 10 files in it taking up about 12 MB. Pretty 
> clear why it wouldn't boot with nothing in /boot.  Not sure
> what the 10 files fsck sees are, perhaps part of the filesystem (ext2, 
> I think).  I had never tried running fsck on an empty file system in a 
> partition before.
>
> /bin is missing entirely, so that's why "ls" stopped working. /usr/bin 
> is still there, which is why reboot was OK.
>
> /var/log/messages shows that the machine was logging what look like 
> corrected disk errors (sense errors) for /dev/sdb1 for days before it 
> failed.
>
> Tried copying the contents of another machine's /boot (which is 
> supposed to be an exact copy of this one) into /boot, and rebooting,
> but grub didn't get any farther than it had before.  Probably grub 
> needs to be reinstalled, but with /bin missing, and who knows what 
> else gone besides, it seems like a full OS reinstall would be in order.
>
> Off the top of my head, if it weren't for the sense errors on 
> /dev/sdb1, I would think that this might have been the result of an 
> accidental (or hacker's)
>
>   rm -rf /
>
> Anybody run into a hardware/software glitch with symptoms like this on 
> a similar system?
>
> Is there some way on these sorts of Dells to run per disk diagnostics 
> from BIOS or UEFI even if they are already grouped into a virtual disk 
> by the controller?  I suspect that the disk which is /dev/sdb may 
> really be on its way out, but I couldn't get smartctl to work off the 
> DVD or from the copy on disk.   (The smartctl commands used were 
> tested on the twin machine, and they worked there.)  The BIOS showed 
> that SMART was disabled on all of the disks.  Web searches for 
> diagnostics for this controller all referenced software that requires 
> a running OS, nothing built into the BIOS/UEFI.  (It is set to use BIOS.)

I might start looking at non-RAID problems first. Maybe you have some 
bad memory or CPU? Errant rm could do it too, as you mentioned.
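
Before deciding between the disk and something else, it might help to 
see how the sense errors in /var/log/messages are spread over time. 
Here is a minimal Python sketch of that. The function name and the 
matching rule are my own assumptions: it treats any line that names the 
device (for example "[sdb]" or "sdb1") and contains the word "sense" as 
a hit, and it assumes the traditional syslog timestamp ("Mar 10 
01:02:03 ..."). The real message format on that box may differ.

```python
import re
from collections import Counter


def count_sense_errors(lines, device="sdb"):
    """Count log lines per day that mention `device` and the word 'sense'.

    Assumptions (not taken from the actual log): the device shows up as
    e.g. "[sdb]" or "sdb1", and each line begins with a traditional
    syslog timestamp whose first two fields are month and day.
    """
    # Match the bare device or one of its partitions (sdb, sdb1, ...),
    # but not a different device such as sda or sdbc.
    dev = re.compile(r"\b" + re.escape(device) + r"\d*\b")
    per_day = Counter()
    for line in lines:
        if dev.search(line) and "sense" in line.lower():
            fields = line.split()
            day = " ".join(fields[:2])  # month and day of the entry
            per_day[day] += 1
    return per_day
```

To use it, open the log from the rescue shell and pass the file object 
in, e.g. count_sense_errors(open("/var/log/messages", 
errors="replace")). A count that climbs day by day would point at the 
disk; a flat or tiny count would make memory, CPU, or an errant rm look 
more likely.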

Skylar