[Beowulf] Re: Node Boot Problem ~ No Keyboard/Harddrive/Diskdrive

Mark Hahn hahn at physics.mcmaster.ca
Fri Jul 9 11:09:10 PDT 2004


> version of Debian being loaded into Ramdisk. However, I have also
> tried using a larger version of RedHat loaded into Ramdisk and simply
> using a nfs root filesystem. Each of these setups still suffer the

why bother with a ramdisk?  my clusters boot with a bare kernel PXE'ed
(monolithic, but hardly elaborate), and an NFS root.
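
for illustration only, here is a rough sketch of what such a PXE entry could
look like, written as a small Python helper that emits a pxelinux config
stanza.  the function name, label, server address and export path are all
made up for the example; the kernel arguments (root=/dev/nfs, nfsroot=,
ip=dhcp) are the standard ones, and they assume the kernel has NFS-root and
IP autoconfiguration support built in rather than as modules.

```python
def pxelinux_nfsroot_entry(label, kernel, nfs_server, nfs_path, extra_args=()):
    """Return a pxelinux config stanza that boots `kernel` with an NFS root.

    All argument values are supplied by the caller; nothing here is taken
    from an actual cluster configuration.
    """
    # Standard kernel command-line arguments for an NFS root with DHCP.
    append = ["root=/dev/nfs", f"nfsroot={nfs_server}:{nfs_path}", "ip=dhcp"]
    append.extend(extra_args)
    lines = [
        f"DEFAULT {label}",
        f"LABEL {label}",
        f"  KERNEL {kernel}",
        f"  APPEND {' '.join(append)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical server address and export path, for illustration only.
    print(pxelinux_nfsroot_entry("node", "vmlinuz", "192.168.1.1",
                                 "/export/nodes/root", ["ro"]))
```

no ramdisk line appears anywhere in the stanza, which is the point: the
kernel mounts its root straight over NFS.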

> If none of the following devices are detected: keyboard, usb mass
> storage device, harddrive....then the node crashes within 15 minutes.

why does it fail to detect the devices?  do you mean it is sometimes 
inconsistent in whether it detects them?  by "detects", do you mean the 
bios, or the kernel?

> When it crashes, the screen simply goes blank and the light on the NIC
> card goes out.

sounds a bit like a power-management setting.

> If either a keyboard, usb mass storage device or harddrive are
> detected (or more than one are detected) then the node stays up for
> about 24 hours (but does eventually crash in the same fashion).

jeez.

> I do believe that this crash is due to some sort of hardware
> incompatibility, however I have 64 of these identical nodes, so
> replacing the hardware is not an option.

I'd suspect bios settings first.

> I'm currently puzzled as to what may be causing this problem. Possibly
> some sort of glitch in some power-save code? I've disabled all the
> power-save options in the kernel and still experience this problem.

my bet is on bios settings.

> > If the node has a keyboard/hard-drive/floppy-disk-drive plugged into
> > it, then the system boots perfectly. However, if the node has none of
> > these devices plugged in, then it crashes (screen goes blank, nic
> > light goes out). When exactly the node crashes is not consistent,
> > however it always occurs after the kernel has been transferred and
> > before the login screen appears.

that's weird.  you might try making your NFS root mount RW, and checking
to see whether init scripts are doing something weird.  is it safe to assume
you've drastically stripped the usual set of daemons (apmd, acpid, etc)?
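
a quick way to check, sketched under assumptions: walk the init-script
directories of the exported root on the NFS server and list anything whose
name mentions a power-management daemon.  the function name is invented for
the example, the suspect list is just the two daemons named above, and the
directory layout (etc/init.d, etc/rcN.d, etc/rc.d/...) is the usual SysV
one, which may not match your image exactly.

```python
import os

# Daemons named above; extend the tuple as needed (this list is an assumption).
SUSPECTS = ("apmd", "acpid")


def find_power_daemons(nfs_root, suspects=SUSPECTS):
    """Return sorted paths of init scripts/links under nfs_root/etc whose
    file name contains one of the suspect daemon names."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(os.path.join(nfs_root, "etc")):
        base = os.path.basename(dirpath)
        # Only look in SysV-style init directories: init.d, rc.d, rc0.d ... rcS.d
        is_init_dir = base == "init.d" or (base.startswith("rc") and base.endswith(".d"))
        if not is_init_dir:
            continue
        for name in filenames:
            if any(s in name for s in suspects):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical export path on the NFS server.
    for path in find_power_daemons("/export/nodes/root"):
        print(path)
```

an empty result only says that nothing with those names is installed as an
init script; it does not rule out something else in the startup files
touching power management.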

> > have stepped through the startup file and found nothing which should

startup file*S*?

> > The nodes are P4 2.5 GHz, 512 MB RAM, Intel D845GERG2 motherboards.
> > There are 64 of them and all of them are diskless. I would rather not
> > try to resolve this issue by 'buying 64 keyboards'.

recent Intel boards (and those from most other vendors) have an
ignore-kbd-error setting in the bios.



