[Beowulf] RAID question

Gavin W. Burris bug at wharton.upenn.edu
Sat Mar 14 07:53:40 PDT 2015


Hi, David.

This might not be the best forum for Linux technical support, a bit
off-topic.  But I can't resist...

It sounds like you either had a controller glitch that corrupted the
filesystem, or have an actual failed disk.  I wouldn't rule out memory
failure or bad cables, either.  Each mount point would be its own
filesystem, and depending on how you have done RAID, the failure could
punch holes in any one of the filesystems.

That said, I recommend trying to salvage the important data immediately,
with either CentOS rescue mode, or a Fedora live CD / USB, keeping the
partitions read-only to prevent further corruption.  Dell does have a
number of diagnostics.  Usually the DSET utility is run and the logs are
sent to support for analysis.  You may want to consider purchasing
support for this incident, if you aren't under warranty.
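
A rough sketch of the read-only salvage step, to be run from the rescue
or live environment.  The mount point and the destination host are
placeholders I made up, and /dev/sdb1 is just the partition named in the
quoted message; substitute whatever holds the data you care about.  It
assumes rsync is available in the rescue environment.  By default it only
prints the commands (DRY_RUN=1); set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: MNT and DEST are hypothetical placeholders.
SRC_DEV="${SRC_DEV:-/dev/sdb1}"
MNT="${MNT:-/mnt/salvage}"
DEST="${DEST:-backuphost:/srv/salvage/}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

salvage() {
    run mkdir -p "$MNT"
    # ro: do not write to the suspect filesystem.  On ext3/ext4 you may
    # want "ro,noload" so the journal is not replayed either.
    run mount -o ro "$SRC_DEV" "$MNT"
    # Copy everything off, preserving ownership, modes and timestamps.
    run rsync -a "$MNT/" "$DEST"
    run umount "$MNT"
}

salvage
```

Copying to another machine rather than to a second partition on the same
array seems the safer choice while the controller and disks are suspect.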

After you get the important data off, you may want to attempt repair.
Hope you have backups.  Note that once the boot partition is
manipulated, grub needs to be reinstalled to the drive so that it can
find its files and boot the kernel.  This assumes you have consistent
RAID and repaired filesystems.  Something like:

boot: linux rescue
# chroot /mnt/sysimage/
# grub-install /dev/sda
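
If you boot a live CD instead of "linux rescue", nothing will be mounted
under /mnt/sysimage for you, so the chroot has to be prepared by hand.
A sketch of that, under assumptions: ROOT_DEV is a hypothetical name for
the root partition, BOOT_DEV is the /boot partition from the quoted
message, and GRUB_DISK follows the example above.  Check all three with
"fdisk -l" or "blkid" first.  It prints the commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: device names are placeholders, verify them before use.
ROOT_DEV="${ROOT_DEV:-/dev/sdb2}"      # hypothetical root partition
BOOT_DEV="${BOOT_DEV:-/dev/sdb1}"      # /boot, per the quoted message
GRUB_DISK="${GRUB_DISK:-/dev/sda}"     # disk the BIOS boots from
SYSIMAGE="${SYSIMAGE:-/mnt/sysimage}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mount the installed system plus the pseudo-filesystems that
# grub-install expects to find inside the chroot.
prepare_chroot() {
    run mkdir -p "$SYSIMAGE"
    run mount "$ROOT_DEV" "$SYSIMAGE"
    run mount "$BOOT_DEV" "$SYSIMAGE/boot"
    run mount --bind /dev "$SYSIMAGE/dev"
    run mount -t proc proc "$SYSIMAGE/proc"
    run mount -t sysfs sysfs "$SYSIMAGE/sys"
}

reinstall_grub() {
    prepare_chroot
    run chroot "$SYSIMAGE" grub-install "$GRUB_DISK"
}

reinstall_grub
```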

You may also need to go into grub.  Something like:

# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
# reboot
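
Before rebooting it may be worth checking that /boot actually holds what
grub is going to look for.  A small sketch, assuming the usual CentOS 6
layout (vmlinuz-*, initramfs-*.img and grub/grub.conf); the function name
check_boot is my own invention, not a stock tool.

```shell
#!/bin/sh
# Sketch: report files a GRUB legacy /boot on CentOS 6 normally contains
# but which are absent.  Returns 0 if all were found, 1 otherwise.
check_boot() {
    dir="$1"
    status=0
    for pattern in "vmlinuz-*" "initramfs-*.img"; do
        # $pattern is left unquoted so the shell expands the glob in $dir;
        # if nothing matches, the literal pattern remains and -e fails.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $dir/$pattern"
            status=1
        fi
    done
    if [ ! -f "$dir/grub/grub.conf" ]; then
        echo "missing: $dir/grub/grub.conf"
        status=1
    fi
    return $status
}
```

From inside the chroot this would be called as "check_boot /boot", or as
"check_boot /mnt/sysimage/boot" from the rescue shell.  An empty /boot
like the one described below would report all three as missing.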

Good luck!

On 05:52PM Fri 03/13/15 -0700, mathog wrote:
> A bit off topic, but some of you may have run into something similar.
> 
> Today I was called in to try and fix a server which had stopped working.
> Not my machine, the usual sysop is out sick.  The
> model is a Dell PowerEdge T320 with a Raid PERC H710P controller.
> 
> The symptoms reported were "it stopped working, could not find 'ls', and
> wouldn't reboot past grub".  (Evidently it could find 'reboot'.)
> 
> Got into the BIOS and ran RAID consistency check, which took 3 hours.  It
> didn't say if it had passed or failed, or put up any sort of status message
> whatsoever, but there were no failure lights lit on the disks.
> 
> On a reboot it gives:
> 
>   grub error 8: kernel must be loaded before booting.
> 
> It is a Centos 6.5 system, so booted it with an installation disk of that
> flavor, and dropped down into a shell.
> 
> This is where it gets strange.
> 
> /boot is in /dev/sdb1.  When mounted that directory is empty but
> when unmounted fsck shows 10 files in it taking up about 12Mb.  Pretty clear
> why it wouldn't boot with nothing in /boot.  Not sure
> what the 10 files fsck sees are, perhaps part of the filesystem.  (ext2 I
> think).  I had never tried running fsck on an empty file system in a
> partition before.
> 
> /bin is missing entirely, so that's why "ls" stopped working.  /usr/bin is
> still there, which is why reboot was OK.
> 
> /var/log/messages shows that the machine was logging what look like
> corrected disk errors (sense errors) for /dev/sdb1 for days before it
> failed.
> 
> Tried copying the contents of another machine's /boot (which is supposed to
> be an exact copy of this one) into /boot, and rebooting,
> but grub didn't get any farther than it had before.  Probably grub needs to
> be reinstalled, but with /bin missing, and who knows what else gone besides,
> it seems like a full OS reinstall would be in order.
> 
> Off the top of my head, if it weren't for the sense errors on /dev/sdb1, I
> would think that this might have been the result of an accidental (or
> hacker's)
> 
>   rm -rf /
> 
> Anybody run into a hardware/software glitch with symptoms like this on a
> similar system???
> 
> Is there some way on these sorts of Dell's to run per disk diagnostics from
> BIOS or UEFI even if they are already grouped into a virtual disk by the
> controller?  I suspect that the disk which is /dev/sdb may really be on its
> way out, but I couldn't get smartctl to work off the DVD or from the copy on
> disk.   (The smartctl commands used were tested on the twin machine, and
> they worked there.)  The BIOS showed that SMART was disabled on all of the
> disks.  Web searches for diagnostics for this controller all referenced
> software that requires a running OS, nothing built into the BIOS/UEFI.  (It
> is set to use BIOS.)
> 
> Thanks,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Gavin W. Burris
Senior Project Leader for Research Computing
The Wharton School
University of Pennsylvania

