[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK

David Mathog mathog at caltech.edu
Wed Dec 2 11:40:09 PST 2009


> What's got me and the IT guys stumped is that while the compute nodes
> boot via PXE from the head node without trouble on the NetGear, they
> barf with the SMC.  To be specific, after the initial boot with a
> minimal Linux kernel, there is a "fatal error" with "timeout waiting for
> getfile" when the compute node attempts to download the provisioning
> image from head.  However, when they were running Rocks before I
> arrived, the cluster worked fine with the SMC switch.

Use tcpdump or some equivalent packet sniffer.  Capture one boot attempt
with the dumb switch and one with the managed switch, then compare the
two traces to see where they diverge.
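A first rough comparison of the two captures could be automated along
these lines.  This is only a sketch, not a tested diagnostic: it assumes
each capture was saved as a classic pcap file of Ethernet frames (for
example with "tcpdump -w"), and the function names ethertype_counts and
compare_captures are made up for illustration.  It only counts frames
per EtherType, so it would show whether ARP, IPv4 or IPv6 traffic is
missing on one switch, not why.

```python
import struct
from collections import Counter

# Well-known EtherType values; anything else is reported as hex.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}


def ethertype_counts(pcap_bytes):
    """Count frames per EtherType in a classic (not pcapng) pcap capture."""
    if len(pcap_bytes) < 24:
        raise ValueError("too short to be a pcap file")
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:  # 1 = Ethernet
        raise ValueError("capture is not Ethernet")

    counts = Counter()
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset)
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        if len(frame) >= 14:
            # EtherType sits after the two 6-byte MAC addresses, big-endian.
            (etype,) = struct.unpack_from("!H", frame, 12)
            counts[ETHERTYPES.get(etype, hex(etype))] += 1
    return counts


def compare_captures(pcap_a, pcap_b):
    """Return {ethertype: (count_a, count_b)} for types whose counts differ."""
    a, b = ethertype_counts(pcap_a), ethertype_counts(pcap_b)
    return {k: (a[k], b[k]) for k in sorted(set(a) | set(b)) if a[k] != b[k]}


# Hypothetical usage, with file names chosen only for this example:
#   with open("netgear.pcap", "rb") as f1, open("smc.pcap", "rb") as f2:
#       print(compare_captures(f1.read(), f2.read()))
```

If the counts look the same on both switches, the traces still need to
be read by eye around the point where the "getfile" request times out.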

> I've tried resetting the SMC switch to factory defaults (with
> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it
> doesn't seem to be demanding anything exotic.  We've tried swapping out
> to another SMC switch but that didn't change anything.

Disconnect the cluster from the outside network, then turn off the
firewall on the master.  (The firewall is probably not the cause this
time, but whenever there are network problems, always rule it out before
spending time on anything else.)

IPv6 vs. IPv4?  By which I mean, once the kernel boots, perhaps it
switches to IPv6, which the NetGear handles properly, but maybe IPv6 is
turned off on the SMC?
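One way to test that guess is to look at which IPv6 addresses, if any,
the booted kernel has configured.  The sketch below assumes a Linux node
that exposes /proc/net/if_inet6 in its usual six-column layout (address
as 32 hex digits, then interface index, prefix length, scope, flags and
device name); parse_if_inet6 is a made-up helper name, and the text is
passed in as a string so it can be tried on any machine.

```python
import ipaddress


def parse_if_inet6(text):
    """Map interface name -> list of IPv6 addresses from if_inet6-style text."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        addr_hex, _ifindex, _prefix_len, _scope, _flags, name = fields
        # The address column is 32 hex digits with no colons.
        addr = ipaddress.IPv6Address(int(addr_hex, 16))
        result.setdefault(name, []).append(addr)
    return result


# Hypothetical usage on a compute node:
#   with open("/proc/net/if_inet6") as f:
#       print(parse_if_inet6(f.read()))
```

If the file is missing or no interface other than lo shows up, the node
is most likely not using IPv6 at all and this theory can be dropped.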

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


