[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK

Art Poon artpoon at gmail.com
Wed Dec 2 10:42:28 PST 2009


Hi all,

Thanks for your responses!  I finally fixed this yesterday afternoon but neglected to update my post; my apologies.
  
After discussing our problem with the Penguin Computing service rep, I reconfigured the switch to enable fast spanning-tree mode on the compute node ports.  That apparently fixed the problem, and thanks to your feedback I am starting to understand why.
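
My rough understanding of the timing, as a small Python sketch: with classic spanning tree (IEEE 802.1D), a port that has just come up sits in the listening and then the learning state, each lasting one forward delay (15 seconds by default), before it passes ordinary traffic.  If the link resets when the node hands over from the PXE ROM to the Linux kernel, the image download could time out during that window.  The 20-second client timeout below is a made-up number for illustration, not something I measured, and the function names are my own.

```python
# Rough timing sketch: classic spanning tree (IEEE 802.1D) vs. a booting node.
# FORWARD_DELAY_S is the 802.1D default; the client timeout is hypothetical.

FORWARD_DELAY_S = 15  # default forward delay, spent once in each of two states


def seconds_until_forwarding(fast_mode: bool) -> int:
    """Seconds after link-up before the port forwards ordinary traffic."""
    if fast_mode:
        # A port in "fast" (edge) mode skips listening and learning.
        return 0
    # Classic behaviour: listening, then learning, one forward delay each.
    return 2 * FORWARD_DELAY_S


def request_survives(client_timeout_s: int, fast_mode: bool) -> bool:
    """True if the client would still be waiting when the port opens."""
    return client_timeout_s > seconds_until_forwarding(fast_mode)


HYPOTHETICAL_GETFILE_TIMEOUT_S = 20  # assumed value, for illustration only

print(request_survives(HYPOTHETICAL_GETFILE_TIMEOUT_S, fast_mode=False))
print(request_survives(HYPOTHETICAL_GETFILE_TIMEOUT_S, fast_mode=True))
```

With those assumed numbers the first call gives False and the second gives True, which would match what we saw: the unmanaged NetGear does not run spanning tree at all, so the nodes never hit the delay there.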

Thanks again,
- Art.

On Dec 2, 2009, at 10:30 AM, Joe Landman wrote:

> Art Poon wrote:
>> Dear colleagues,
> 
> [...]
> 
>> What's got me and the IT guys stumped is that while the compute nodes
>> boot via PXE from the head node without trouble on the NetGear, they
>> barf with the SMC.  To be specific, after the initial boot with a
>> minimal Linux kernel, there is a "fatal error" with "timeout waiting
>> for getfile" when the compute node attempts to download the
>> provisioning image from head.  However, when they were running Rocks
>> before I arrived, the cluster worked fine with the SMC switch.
> 
> Is it the switch or the dhcp/bootp/tftp setup that's the problem?  Are you sure the tftp daemon is up, and that bootp is configured correctly?
> 
> Switches sometimes have broadcast storm suppression turned on, or worse, sometimes they have spanning tree turned on.  You want the switch to be as dumb as you can possibly make it for most linux clusters.  Fast, but dumb.
> 
>> I've tried resetting the SMC switch to factory defaults (with
>> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and
>> it doesn't seem to be demanding anything exotic.  We've tried
>> swapping out to another SMC switch but that didn't change anything.
> 
> This sounds more like a problem in the server software stack than in the switch.  Could you describe your setup?  Are you using Scyld/Rocks for that?
> 
> Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (it is possible to do, though non-trivial).
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
>       http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615




