[Beowulf] Re: Beowulf Digest, Vol 70, Issue 4
Jeff Johnson
jeff.johnson at aeoncomputing.com
Wed Dec 2 10:34:20 PST 2009
On 12/2/09 10:21 AM, beowulf-request at beowulf.org wrote:
> ------------------------------
>
> Message: 8
> Date: Tue, 1 Dec 2009 12:45:52 -0800
> From: Art Poon<artpoon at gmail.com>
> Subject: [Beowulf] Re: cluster fails to boot with managed switch, but
> 5-port switch works OK
> To:beowulf at beowulf.org
> Message-ID:<825EEAB3-C58F-46B8-A9C4-A806C5B682D3 at gmail.com>
> Content-Type: text/plain; charset=us-ascii
>
> Dear colleagues,
>
> [snip]
>
> What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.
>
> I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything.
>
> I'm grateful if you could weigh in with your expertise.
>
I don't know if my $.02 here could be classified as 'expertise'. With
that disclaimer out of the way, I can say that SMC switches tend to
have very old firmware: they sit in warehouses and are not often
updated, and their update process is a PITA compared to other switches
out there. I have seen cases where the old firmware and STP (Spanning
Tree Protocol) cause enough delay when a port first comes up during a
PXE/DHCP operation that the process times out while the switch is
still trying to figure out whether there are network loops. The
firmware update can be obtained from www.smc.com; the current version
is v2.3.0.0, updated in March. Check your switch to see which version
it is running now.
The Netgear switches are plain layer-2 switches, too dumb to cause this
kind of problem.
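
To put rough numbers on the STP delay described above, here is a
minimal Python sketch. It assumes classic 802.1D spanning tree with the
default 15-second forward delay, so a port spends about 30 seconds in
the listening and learning states before it forwards traffic. The
function names and the 20-second client timeout are hypothetical,
chosen only for illustration; the actual getfile timeout is not stated
in this thread.

```python
# Sketch only: models the classic 802.1D port start-up delay.
# Assumption: the switch uses the default 15 s forward delay; a real
# switch may be configured differently.

DEFAULT_FORWARD_DELAY_S = 15.0  # 802.1D default forward delay (assumed unchanged)


def stp_port_startup_delay(forward_delay_s: float = DEFAULT_FORWARD_DELAY_S) -> float:
    """Seconds a port spends in listening + learning before it forwards frames."""
    # Each of the two states lasts one forward-delay interval.
    return 2 * forward_delay_s


def client_times_out(client_timeout_s: float,
                     forward_delay_s: float = DEFAULT_FORWARD_DELAY_S) -> bool:
    """True if the client gives up before the port starts forwarding."""
    return client_timeout_s < stp_port_startup_delay(forward_delay_s)


if __name__ == "__main__":
    # Hypothetical timeout, not taken from any real provisioning tool.
    hypothetical_timeout_s = 20.0
    print(stp_port_startup_delay())
    print(client_times_out(hypothetical_timeout_s))
```

If the numbers on your switch look anything like this, a node whose
link resets after the minimal kernel loads its NIC driver could give up
before the port starts forwarding again. Besides the firmware update,
it may be worth checking whether the switch offers a fast-start or
edge-port setting for the node-facing ports.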
> Thank you,
> - Art.
--
------------------------------
Jeff Johnson
Manager
Aeon Computing
jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 f: 858-412-3845
4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117