[Beowulf] RAID question
mathog
mathog at caltech.edu
Mon Mar 16 13:17:20 PDT 2015
Thanks for the feedback.
After copying /boot and /bin from another machine and mucking about with
grub for far too long, the system is back online. I had to edit
grub.conf to change the virtual disk names, and although the CentOS
rescue disk saw the boot disk as hd1, grub itself saw it as hd0 when it
actually started.
The logs don't show a root command line that specifically took out those
directories. They do show a bunch of scripts being run. My best guess
is that one of them did something like this:
AVAR=`command that failed and returned an empty string`
rm -rf ${AVAR}/b*
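For comparison, here is a minimal sketch of how such a script could guard
against the empty-variable case. The names find_target_dir and
clean_b_entries are hypothetical and not taken from the scripts in the
logs; find_target_dir just stands in for the command that failed.

```shell
#!/bin/sh
set -u   # treat references to unset variables as errors

# Hypothetical stand-in for "command that failed and returned an empty string".
find_target_dir() {
    return 1
}

# Remove b* entries under the given directory, but refuse to run when the
# path is empty or is the root directory.
clean_b_entries() {
    target=${1-}
    if [ -z "$target" ] || [ "$target" = "/" ]; then
        echo "refusing to clean: empty or root target" >&2
        return 1
    fi
    rm -rf -- "$target"/b*
}

# Only call rm when the lookup command itself succeeded.
if AVAR=$(find_target_dir); then
    clean_b_entries "$AVAR"
else
    echo "find_target_dir failed; nothing removed" >&2
fi
```

A terser alternative is to write the expansion as "${AVAR:?}"/b*, which
makes a non-interactive shell stop with an error when AVAR is unset or
empty instead of expanding to /b*.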
It seems unlikely that a low level controller failure would have snipped
out those files/directories without resulting in a file system that was
seen as corrupt by fsck.
That said, there is something hardware related going on, since
/var/log/messages has a lot of these:
Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb] Sense Key : Recovered Error [current] [descriptor]
Mar 16 12:37:27 mandolin kernel: Descriptor sense data with sense descriptors (in hex):
Mar 16 12:37:27 mandolin kernel: 72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Mar 16 12:37:27 mandolin kernel: 00 4f 00 c2 40 50
Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb] ASC=0x4 ASCQ=0x1d
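As a cross-check on the hex dump, the fixed header of descriptor-format
sense data can be picked apart by hand: byte 0 is the response code, the
low nibble of byte 1 is the sense key, bytes 2 and 3 are the ASC and ASCQ,
and byte 7 is the length of the descriptors that follow. The sketch below
only splits out those fields and does not try to interpret them; the
function name decode_sense_header is made up for illustration.

```shell
#!/bin/sh
# Sense bytes copied from the log lines above.
sense="72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 40 50"

# Hypothetical helper: print the fixed header fields of descriptor-format
# sense data given as space-separated hex bytes.
decode_sense_header() {
    # Word-split the hex string into positional parameters $1, $2, ...
    set -- $1
    resp=$((0x$1))           # byte 0: response code
    key=$((0x$2 & 0x0f))     # byte 1, low nibble: sense key
    asc=$((0x$3))            # byte 2: additional sense code
    ascq=$((0x$4))           # byte 3: additional sense code qualifier
    extra=$((0x$8))          # byte 7: additional sense length, in bytes
    printf 'response=0x%02x key=0x%x asc=0x%02x ascq=0x%02x extra_len=%d\n' \
        "$resp" "$key" "$asc" "$ascq" "$extra"
}

decode_sense_header "$sense"
```

The fields agree with what the kernel printed: sense key 1 is the
"Recovered Error" on the first line, the ASC/ASCQ pair matches the last
line, and the 14 bytes of extra length account for everything after the
eight-byte header.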
That group has several other similar Dell servers, and this is the only
one logging these. sdb1 holds /boot and sdb2 is where the lvm keeps its
information.
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech