[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK
artpoon at gmail.com
Tue Dec 1 12:45:52 PST 2009
I am in charge of managing a cluster at our research centre and am stuck with a vexing (to me) problem!
(Disclaimer: I am a biologist by training and a mostly self-taught programmer. I am still learning about networking and cluster management, so please bear with me!)
This is an asymmetric Intel Xeon cluster running 4 compute nodes on CentOS 5.4 and Scyld Clusterware 5. We managed to get it up and running using a dinky little NetGear 5-port 10/100/1000 switch. Now that I'm looking to expand the cluster, I need to get the managed switch working (an SMC 8824M, though we have several other switches available).
What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from the head node. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.
I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping in another SMC switch, but that didn't change anything.
I'd be grateful if you could weigh in with your expertise.