[Beowulf] RAID question

mathog mathog at caltech.edu
Mon Mar 16 13:17:20 PDT 2015


Thanks for the feedback.

After copying /boot and /bin from another machine and mucking about with 
grub for far too long, the system is back online.  I had to edit 
grub.conf to change the virtual disk names: the CentOS rescue disk saw 
the boot disk as hd1, but when grub actually started it saw it as hd0.

The logs don't show a root command line that specifically took out those 
directories.  They do show a bunch of scripts being run.  My best guess 
is that one of them did something like this:

   AVAR=`command that failed and returned an empty string`
   rm -rf ${AVAR}/b*
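
If that guess is right, a guard along the following lines would have 
stopped the glob from expanding to /b*.  This is only a sketch: the 
function name safe_rm_prefix is made up for illustration, and nothing 
here is taken from the scripts that actually ran.  For a one-line 
alternative, POSIX parameter expansion ${AVAR:?message} makes a 
non-interactive shell abort when AVAR is unset or empty.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not from the real scripts):
# remove "$dir"/b* only when "$dir" is a non-empty path other than "/".
safe_rm_prefix() {
    dir=$1
    # An empty or root directory would turn the pattern into /b*,
    # which matches /bin and /boot, so refuse both cases.
    if [ -z "$dir" ] || [ "$dir" = "/" ]; then
        echo "safe_rm_prefix: refusing empty or root directory" >&2
        return 1
    fi
    # "--" ends option parsing in case the path starts with a dash.
    rm -rf -- "$dir"/b*
}
```

A script would call it as `safe_rm_prefix "$AVAR" || exit 1`, so a 
failed command substitution stops the run instead of deleting from /.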

It seems unlikely that a low-level controller failure would have snipped 
out those files/directories without leaving a file system that fsck 
saw as corrupt.

That said, there is something hardware related going on, since 
/var/log/messages has a lot of these:

Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb]  Sense Key : Recovered Error [current] [descriptor]
Mar 16 12:37:27 mandolin kernel: Descriptor sense data with sense descriptors (in hex):
Mar 16 12:37:27 mandolin kernel:        72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Mar 16 12:37:27 mandolin kernel:        00 4f 00 c2 40 50
Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb]  ASC=0x4 ASCQ=0x1d

That group has several other similar Dell servers, and this is the only 
one logging these.  sdb1 holds /boot and sdb2 is where LVM keeps its 
information.
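
For anyone reading the hex dump in the log excerpt above, here is a 
rough sketch of how the first few bytes map onto the fields the kernel 
prints.  It assumes the standard SCSI descriptor-format sense layout 
(response code 0x72, sense key in the low nibble of byte 1, ASC in 
byte 2, ASCQ in byte 3, additional length in byte 7); the function 
name decode_sense_header is made up for illustration.

```shell
#!/bin/sh
# Bytes copied from the log excerpt, in order.
sense="72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 40 50"

# Hypothetical helper: print the header fields of descriptor-format
# sense data given as a space-separated hex string.
decode_sense_header() {
    # Deliberately unquoted so the string splits into $1..$N.
    set -- $1
    response_code=$1
    sense_key=$(( 0x$2 & 0x0f ))   # low nibble of byte 1
    asc=$3                         # additional sense code
    ascq=$4                        # additional sense code qualifier
    add_len=$(( 0x$8 ))            # byte 7: bytes that follow the header
    desc_type=$9                   # first byte of the first descriptor
    printf 'response=0x%s key=%d asc=0x%s ascq=0x%s addlen=%d desc=0x%s\n' \
        "$response_code" "$sense_key" "$asc" "$ascq" "$add_len" "$desc_type"
}

decode_sense_header "$sense"
```

Sense key 1 is the "Recovered Error" the kernel reports, and bytes 2 
and 3 match the ASC=0x4 ASCQ=0x1d line.  The additional length of 14 
agrees with the 14 bytes that follow the 8-byte header.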

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

