[Beowulf] Re: Beowulf Digest, Vol 70, Issue 4

Jeff Johnson jeff.johnson at aeoncomputing.com
Wed Dec 2 10:34:20 PST 2009


On 12/2/09 10:21 AM, beowulf-request at beowulf.org wrote:
> ------------------------------
>
> Message: 8
> Date: Tue, 1 Dec 2009 12:45:52 -0800
> From: Art Poon<artpoon at gmail.com>
> Subject: [Beowulf] Re: cluster fails to boot with managed switch,	but
> 	5-port switch works OK
> To:beowulf at beowulf.org
> Message-ID:<825EEAB3-C58F-46B8-A9C4-A806C5B682D3 at gmail.com>
> Content-Type: text/plain; charset=us-ascii
>
> Dear colleagues,
>
> [snip]
>
> What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC.  To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head.  However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.
>
> I've tried resetting the SMC switch to factory defaults (with auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic.  We've tried swapping out to another SMC switch but that didn't change anything.
>
> I'm grateful if you could weigh in with your expertise.
>    
I don't know if my $.02 here qualifies as 'expertise'. With that
disclaimer out of the way: SMC switches tend to sit in warehouses with
very old firmware and are rarely updated, and their update process is a
PITA compared to other switches out there. I have seen cases where the
old firmware's STP (spanning tree protocol) implementation adds enough
delay when a port first comes up during a PXE/DHCP operation that the
process times out while the switch is still checking for network
loops. The firmware update can be obtained from www.smc.com; the current
release is v2.3.0.0, updated in March. Check your switch to see which
version it is running now.
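
To put rough numbers on the timing argument, here is a minimal sketch.
The 15-second forward delay is the IEEE 802.1D default, applied once in
the listening state and once in the learning state. The client timeout
is a made-up figure for illustration only, since I don't know what the
actual getfile timeout is, and the function names are hypothetical.

```python
# Rough timing sketch: does a switch port start forwarding before the
# booting node gives up? Only FORWARD_DELAY_S is a real default (802.1D);
# every timeout passed in below is an assumed, illustrative value.

FORWARD_DELAY_S = 15  # 802.1D default, used for both listening and learning


def stp_blackout_s(forward_delay_s: int = FORWARD_DELAY_S,
                   stp_enabled: bool = True) -> int:
    """Seconds a freshly linked port drops traffic before forwarding."""
    if not stp_enabled:
        return 0
    # Classic STP: listening state, then learning state, then forwarding.
    return 2 * forward_delay_s


def transfer_survives(client_timeout_s: int, stp_enabled: bool = True) -> bool:
    """True if the client would still be waiting when the port opens up."""
    return client_timeout_s > stp_blackout_s(stp_enabled=stp_enabled)


if __name__ == "__main__":
    assumed_timeout_s = 20  # hypothetical client timeout, not a measured value
    print("blackout with STP:", stp_blackout_s(), "s")
    print("survives with STP:", transfer_survives(assumed_timeout_s))
    print("survives without STP:", transfer_survives(assumed_timeout_s, stp_enabled=False))
```

With default timers the port is dark for about 30 seconds after every
link-up, and the link typically bounces again when the booted kernel
loads its own NIC driver, so any client timeout shorter than that would
fail on a spanning-tree switch and pass on a dumb one.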

The unmanaged Netgear switches are plain layer-2 devices that don't run
spanning tree, so they are too dumb to cause this kind of problem.
> Thank you,
> - Art.
>
>
>
>
> ------------------------------
>
>    

-- 
------------------------------
Jeff Johnson
Manager
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810   f: 858-412-3845

4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117



