[Fwd: Re: 32-port gigabit switch]

Fri Mar 7 09:46:42 PST 2003

> I'm not trying to start a flame war, and I'm really curious.  I suggest
> that you're starting the flame war with your attacking tone and lack of
> any facts (or even one example) backing up your statements.  Just saying
> "it depends" doesn't help the rest of us learn.  When is Gigabit better?

   Where's RGB when you need him? :) I think enough people have
pointed out that your statement is wrong. Have you looked in the
Beowulf archives? How about googling?

> In my experience the computation portion of a Beowulf will always
> require low latencies for optimal performance.

   OK. We have 3 MPI applications. Two are internally written and
one is from NASA. We have extensively tested these 3 codes with
many varying data sets on all kinds of HPC equipment (Crays,
SGI Origins, SPs, clusters, etc.). However, I'll focus on clusters
(Beowulfs in particular).
   We have tested on equipment with Myrinet, GigE, and FastE. The
nodes were the same; only the network changed, along with
some tuning to get the best performance out of each. Here's what
we have found:

Code 1 - First internal code. Running on Myrinet compared to GigE
only gives you about 20% better wall-clock time for some cases. For
other cases, Myrinet is slower than GigE (still trying to explain that
one :). Myrinet is about twice as fast as FastE.

Observations - We think this code is more constrained by latency
than bandwidth when you compare Myrinet and GigE. We have
looked at the message sizes and they are fairly small (tiny). This
pushes this code down the bandwidth/message size curve almost
to the point where you measure latency. So latency appears to be
a driver for this code. Also, there is not much overlapping of
communication and computation in this code.
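
A rough way to picture the bandwidth/message size curve mentioned
above is the usual linear cost model: time = latency + size/bandwidth.
The sketch below is only an illustration. The two "networks" and
their latency and bandwidth figures are made-up placeholders, not
measurements from any of the hardware discussed here.

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed latency plus size divided by bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s


def effective_bandwidth(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually delivered for one message of this size."""
    return msg_bytes / transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical networks (placeholder numbers, NOT measured values).
LOW_LATENCY_NET = dict(latency_s=10e-6, bandwidth_bytes_per_s=200e6)
HIGH_LATENCY_NET = dict(latency_s=50e-6, bandwidth_bytes_per_s=100e6)

for size in (64, 1024, 65536, 4 * 1024 * 1024):
    t_low = transfer_time(size, **LOW_LATENCY_NET)
    t_high = transfer_time(size, **HIGH_LATENCY_NET)
    # For tiny messages the ratio approaches the latency ratio;
    # for huge messages it approaches the bandwidth ratio.
    print(f"{size:>8} bytes: time ratio high/low = {t_high / t_low:.2f}")
```

In this model, tiny messages see almost none of the peak bandwidth,
so the time per message is essentially the latency. That is the
regime we believe Code 1 lives in.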

Code 2 - Second internal code. Running on Myrinet compared to
GigE is only about 3% faster for just about all cases. Myrinet is
about twice as fast as FastE.

Observations - Although we should see better performance with
Myrinet compared to GigE due to better bandwidth, we think this
code is limited by bandwidth instead of latency. The message sizes
for this code are very large, pushing the code way up the bandwidth/
message size curve. We're still working on identifying all of the
bottlenecks, but from a networking standpoint, this is what we
have concluded so far. Also, not much overlapping communication/
computation in this code.

Code 3 - NASA code. This code only runs about 2-3% faster on
GigE and Myrinet compared to FastE. The code appears to be well
thought out with respect to overlapping communication/computation.

Observations - This code appears not to be constrained by either
latency or bandwidth.

Disclaimer - There are lots of things I ignored in this simple analysis
such as memory bandwidth, etc. The data to support these observations
also came from testing on other systems and on testing with other
types of networking (Quadrics, Scali, etc.). All of the numbers are
wall-clock times.

   With these general rules of thumb (we always test before we
buy) and knowing the mix between the codes, we do a price/performance
analysis to configure the best system. Right now (and this is subject
to change), GigE provides better price/performance for our code mix.
   Of course, this also depends on what GigE equipment we're talking
about. I think Mark has pointed out in the past, as well as others, that
not all GigE equipment is created equal (this is also generally true for
FastE as well). However, for the GigE equipment we have tested on
and also have in production we have found GigE is the way to go for
us for our mix of codes.
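
One possible way to set up that kind of price/performance
comparison is sketched below. This is not our actual procedure or
data: the metric (inverse of mix-weighted wall-clock time times
cost), the workload mix, the relative run times, and the relative
costs are all hypothetical placeholders that you would replace with
your own measurements and quotes.

```python
def price_performance(rel_times, mix, system_cost):
    """Throughput per unit cost for a given code mix.

    rel_times:   code name -> wall-clock time relative to a baseline
    mix:         code name -> fraction of the workload (should sum to 1)
    system_cost: total system cost relative to the baseline system
    """
    weighted_time = sum(mix[code] * rel_times[code] for code in mix)
    return 1.0 / (weighted_time * system_cost)


# Hypothetical workload mix and options (placeholders, not real data).
MIX = {"code1": 0.5, "code2": 0.3, "code3": 0.2}
OPTIONS = {
    "cheaper network": {
        "times": {"code1": 1.00, "code2": 1.00, "code3": 1.00},
        "cost": 1.0,
    },
    "faster network": {
        "times": {"code1": 0.80, "code2": 0.97, "code3": 1.00},
        "cost": 1.5,
    },
}

for name, opt in OPTIONS.items():
    score = price_performance(opt["times"], MIX, opt["cost"])
    print(f"{name}: price/performance = {score:.3f}")
```

With these placeholder numbers the cheaper network wins, but a
different mix or a different price gap can easily flip the answer,
which is why we always test before we buy.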

> On the other hand, when I have applications that need to transfer a lot
> of data as well, I find that having two networks is the way to go.  One
> for control and messaging traffic (low latency - Myrinet) and one for
> data traffic (high throughput - Gigabit).

   What kinds of applications?
   So you run control and MPI message traffic over Myrinet and
NFS over GigE? Myrinet has better bandwidth than GigE, so
it appears that if data transfer is important I would switch NFS
to Myrinet and MPI traffic to GigE (unless of course you see a
big difference in performance). If you do see a big difference in
performance, what about using two Myrinet networks (trying to
get you some sales Patrick! :)?
   If latency is that important, have you tried Quadrics? In our
experience it has lower latencies than Myrinet. What MPI
implementations have you tried? Do you run 1 ppn with single
CPUs, or 1 ppn with SMP nodes, or 2 ppn with SMP nodes,
or something else? All of these things can have a large impact on
performance.

> If you would rather take it off list, then feel free to email me
> directly, but I would really like to know because I can't think of one
> example that works.

   I hope my response answered your question. Anybody care to
present another example where bandwidth is more important than
latency? Greg? Mark? RGB? Doug? Don?

Jeff

--

Dr. Jeff Layton
Senior Engineer
Lockheed-Martin Aeronautical Company - Marietta
Aerodynamics & CFD

"Is it possible to overclock a cattle prod?" - Irv Mullins
