[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK

Greg Keller Greg at keller.net
Thu Dec 3 12:17:56 PST 2009

>>> What's got me and the IT guys stumped is that while the compute nodes
>>> boot via PXE from the head node without trouble on the NetGear, they
>>> barf with the SMC.  To be specific, after the initial boot with a
>>> minimal Linux kernel, there is a "fatal error" with "timeout waiting for
>>> getfile" when the compute node attempts to download the provisioning
>>> image from head.  However, when they were running Rocks before I
>>> arrived, the cluster worked fine with the SMC switch.

This is very common with spanning tree enabled.  Essentially, once the
port has a physical link light, it may take a while before spanning
tree allows traffic to actually flow through the port, often longer
than a typical timeout.  Loading or reloading the NIC driver seems to
drop the link momentarily, which forces a new delay cycle.
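
To put rough numbers on that delay: with classic IEEE 802.1D spanning
tree, a port that has just come up sits in the listening and then the
learning state, each for one "forward delay" (15 seconds by default),
before it forwards anything.  The sketch below just does that
arithmetic.  The client timeouts in it are made-up values for
illustration; I don't know what the actual "getfile" timeout is, and
your switch may run different timers or a rapid spanning tree variant.

```python
# Default forward delay from IEEE 802.1D; switches usually let you change it.
FORWARD_DELAY_S = 15


def stp_blocking_time(forward_delay_s=FORWARD_DELAY_S):
    """Seconds a freshly linked port spends in listening + learning."""
    return 2 * forward_delay_s


def boot_stage_survives(client_timeout_s, forward_delay_s=FORWARD_DELAY_S):
    """True if a client that starts waiting at link-up outlasts the STP delay."""
    return client_timeout_s > stp_blocking_time(forward_delay_s)


if __name__ == "__main__":
    # Hypothetical timeouts, NOT taken from any real PXE or installer setting.
    for timeout_s in (10, 60):
        verdict = "ok" if boot_stage_survives(timeout_s) else "times out"
        print(f"client timeout {timeout_s:>3}s vs STP delay "
              f"{stp_blocking_time()}s: {verdict}")
```

Every driver reload that bounces the link restarts this clock, which
would explain why a later boot stage can fail even though the first
PXE stage got through.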

With the Dell PowerConnect (SMC rebrand??) series you have to enable
portfast or disable spanning tree to avoid this delay before
traffic passes.  I generally do both.  The web-based GUI is
bad enough to make this more difficult than it needs to
be, but you can globally disable spanning tree through it.  I use the
command line, select all ports as an interface range, and then
configure my ports as:

interface range ethernet all
spanning-tree disable
spanning-tree portfast
mtu 9216
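
If you want to check whether the change actually helped, one option is
to time how long it takes after link-up before a node can reach the
head node.  This is only a minimal sketch, not something I run in
production: the function names are my own, and the probe assumes the
head node has some TCP service listening (the host name and port in
the usage comment are placeholders).

```python
import socket
import time


def tcp_probe(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def seconds_until_reachable(probe, deadline_s=60.0, interval_s=1.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns True.

    Returns the elapsed seconds at the first success, or None if
    deadline_s passes without one.  clock and sleep are injectable so
    the loop can be exercised without a real network.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)


# Example use right after bringing the interface up
# ("head" and port 80 are placeholders for your own head node and service):
#   delay = seconds_until_reachable(lambda: tcp_probe("head", 80))
#   print("unreachable" if delay is None else f"reachable after {delay:.1f}s")
```

With spanning tree holding the port back you would expect a figure in
the tens of seconds; with portfast on or spanning tree off it should
be close to zero.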

Hope this helps!


Technical Principal
R Systems NA, inc.
