[Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

Mon Apr 2 13:58:06 PDT 2007

Hey Jon,

Fun to see you here!!  I was just looking through some old Goleta pictures 
last week.

Just for kicks have a look at these figures: 
http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html

This was part of a study that we did to select edge switches for the NEMO 
cluster.  We were able to find sub-$100 switches that ran at wire speed for 
MTUs up to about 6k.
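
For anyone who wants a feel for what "wire speed" means at different MTUs, 
here is a minimal sketch of the theoretical ceiling.  It is not taken from 
our test scripts.  It assumes 1 Gbit/s links and untagged frames with the 
standard 38 bytes of per-frame overhead (preamble, header, FCS and 
inter-frame gap); the function name wire_speed is just made up for the 
illustration.

```python
# Rough upper bound on Ethernet throughput at line rate.
# Assumptions (not from the original study): 1 Gbit/s link, no VLAN tag.
# Per-frame overhead: 8 (preamble + SFD) + 14 (header) + 4 (FCS) + 12 (inter-frame gap)
ETH_OVERHEAD_BYTES = 8 + 14 + 4 + 12


def wire_speed(mtu_bytes, link_bps=1_000_000_000):
    """Return (frames per second, payload bits per second) at line rate."""
    on_wire_bits = (mtu_bytes + ETH_OVERHEAD_BYTES) * 8
    frames_per_sec = link_bps / on_wire_bits
    payload_bps = frames_per_sec * mtu_bytes * 8
    return frames_per_sec, payload_bps


if __name__ == "__main__":
    for mtu in (1500, 6000, 9000):
        fps, bps = wire_speed(mtu)
        print(f"MTU {mtu}: {fps:,.0f} frames/s, {bps / 1e6:.1f} Mbit/s payload")
```

A switch that forwards at these rates on all ports at once is running at 
wire speed for that MTU; the measured curves in the links fall off from 
this ceiling once the frames get too large for the switch.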

There was a big difference between similar-looking cheap switches from 
various companies.  And indeed, 'under the hood' they all used integrated 
chipsets from a handful of chip vendors.

Here are some more testing results from different edge switches:
http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/switching.html

(Note: our processing is embarrassingly parallel, so we are primarily 
building compute farms.  We don't need very high-bandwidth, very 
low-latency connections, e.g. InfiniBand or Myrinet performance.)

Cheers,
 	Bruce

On Sun, 1 Apr 2007, Jon Forrest wrote:

> Douglas Eadline wrote:
>
>> <Soapbox>
>> I am constantly amazed at how many people buy the
>> latest and greatest node hardware and then connect
>> them with a sub-optimal switch (or cheap cables), thus reducing
>> the effective performance of the nodes (for parallel
>> applications). Kind of "penny wise and pound foolish," as they say.
>> </Soapbox>
>
> I sincerely appreciate all the comments about my problem. I will reply
> to them in due time. However, I'd like to comment on this, which
> admittedly is off-topic from my original posting.
>
> I don't disagree with what you're saying. The problem is how
> to recognize "sub-optimal" equipment. For example, I see
> three tiers in ethernet switching hardware:
>
> 1) The low-end, e.g. Netgear, Linksys, D-link, ...
>
> 2) The mid-end, e.g. HP Procurve, Dell, SMC, ...
>
> 3) The high-end, e.g. Cisco, Foundry, ...
>
> What I, as a system manager and not an Electrical Engineer,
> have trouble understanding is what the true differences
> are between these levels, and, within one level, between
> the various vendors.
>
> These days I suspect that many of the vendors are using
> ASICs made by other chip companies, and that many vendors
> use the same ASICs. Assuming that's true, where's the
> added value that justifies the cost differences? Sometimes
> the value is in the "management" abilities of a device.
> I don't deny this can be a major selling point in a
> large enterprise environment, but in a 30-node cluster,
> or a small LAN, it's hard to justify paying for this.
>
> In terms of ethernet performance, once a device
> can handle wirespeed communication on all ports,
> where's the added value that justifies the added
> cost? I'm looking for empirical answers, which
> aren't always easy to find, and sometimes not easy to understand.
>
> In the case of my cluster, it was configured and purchased
> before I got here, so I had nothing to do with choosing
> its components, but I have to admit that I'm not
> sure what I would have done differently.
>
> Cordially,
>
> Jon Forrest
> Unix Computing Support
> College of Chemistry
> 173 Tan Hall
> University of California Berkeley
> Berkeley, CA
> 94720-1460
> 510-643-1032
> jlforrest at berkeley.edu