The FNN (Flat Neighborhood Network) paradox
Jon.Tegner at wiglaf.se
Wed Feb 28 23:56:22 PST 2001
Thanks a lot for your answer!
Parallelization of this code is based on domain decomposition, so in principle the communication time is
proportional to the number of cells at the interfaces between the domains (and the computation time is
proportional to the number of cells enclosed in each domain). If one looks at a typical two-dimensional
computational domain, one finds that the amount of data that needs to be communicated between adjacent nodes is
more or less independent of the number of nodes in the cluster. However, the computational work for each domain
is inversely proportional to the number of nodes (if the overall size of the problem is kept the same).
This also means that the comp:comm ratio "degrades" as the number of processors used in the computation
increases (again, if the overall size of the problem is kept the same). So no matter what method of
communication is used, there will eventually be a point of saturation.
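The scaling argument above can be sketched numerically. This is a minimal model, assuming a strip decomposition of the mesh (so each interface is one row of cells wide); the mesh size is taken from the 1920x200 case mentioned below, and the constants of proportionality are dropped.

```python
# Hedged sketch: strong scaling of a 2D domain decomposition.
# Assumes a fixed nx-by-ny mesh split into horizontal strips, so the
# per-node interface is ~nx cells regardless of the node count.
def comp_comm_ratio(nx, ny, nodes):
    """Proportional comp:comm ratio (constants of proportionality dropped)."""
    comp = (nx * ny) / nodes  # computation ~ cells owned by each node
    comm = nx                 # communication ~ interface cells, independent of node count
    return comp / comm

# With the problem size fixed, doubling the node count halves the ratio:
r4 = comp_comm_ratio(1920, 200, 4)
r8 = comp_comm_ratio(1920, 200, 8)
assert r4 == 2 * r8
```

This is exactly the "degradation" described: comm per node stays flat while comp per node shrinks as 1/N.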
I have also made comparisons of the same code on an SGI system, where I get super-linear speed-up (a factor of
14 on 8 nodes), but I believe this is a "cache effect", and it is hard to separate it from the higher bandwidth
and lower latency of that system.
I have not investigated the speed-up of the problem while keeping a constant number of cells on each node, but I
would suspect that I would get linear speed-up in that case, and hence I don't think the code is limited by a
serial part.
However, I will try MVIA (I know too little about this, but would it be possible to use both MVIA and channel
bonding simultaneously, in order to both reduce latency and increase bandwidth?).
"Robert G. Brown" wrote:
> On Thu, 22 Feb 2001, Jon Tegner wrote:
> > We are using an inexpensive D-link switch (DES-3225G), and are using
> > channel bonding through two VLANS on that (one) switch. Really don't
> > know what kind of improvement to expect, so I'm attaching two figures,
> > the first one shows signature graphs (throughput as a function of
> > time) using mpi and tcp with and without channel bonding (obtained by
> > netpipe). From this one it looks like channel bonding is working OK.
> > But when it comes to my application (simulation of a detonation in gas
> > phase) improvement is not very spectacular (not worth it?). The figure
> > shows the same case for three different meshes (480x50 cells to 1920x200
> > cells), and displays the speed up as a function of number of nodes.
> > Obviously I couldn't expect better scaling for the large case, but for
> > the medium and the small one, are they showing the kind of improvement
> > one could expect? Or is it likely that something is wrong with our
> > set-up?
> > Thanks in advance for any comments,
> a) Do you know if your application is CPU bound or network bound?
> From the look of it, I'd say it is heavily CPU bound, with a ratio of
> computation time to communication time for even the smallest mesh of
> around 100 to 500:1, more for the larger meshes.
> If your application is CPU bound with a comp:comm time ratio like 1000:1
> with a fairly small serial fraction, then it will get pretty much linear
> speedup out to eight nodes. If you increase it to 2000:1, then you'll
> get pretty much linear speedup out to eight nodes. If you increase it
> to 10^6:1 you'll get pretty much linear speedup out to eight nodes. All
> you're observing is that communications are irrelevant to your
> (essentially coarse grained) parallel application, at least for the
> switched 100BT that forms your basic network and the granularities of
> your problem.
> In which case yes, channel bonding is way overkill for your application,
> at least at the ranges you portray. Instead of buying multiple switches
> and trying to build fatter pipes, buy bigger switches and more nodes.
> Only when/where you start to see some sort of saturation and bending
> over of the scaling curve (like that which might be visible for 9 nodes
> in the 480x50 mesh) will fatter data pipes be useful to you.
> At that point, what increasing the network bandwidth WILL do for you is
> increase the number of nodes you can apply to the problem and still get
> (nearly) linear speedup. So on the 480x50 mesh, channel bonding and
> higher bandwidth keeps you near-linear a bit further out (from the look
> of this curve the serial fraction of your code is starting to compare to
> the parallel fraction, causing saturation -- maybe -- in the 8-9 node
> range, if this isn't an artifact of poor statistics). Even this isn't
> strongly affected by mere doubling (and doesn't much affect the slope of
> this curve < 1, which is what I'm using to guess that the serial
> fraction is becoming important) -- it is more of a log scale kind of
> thing. That is, increasing network speed by a factor of 10 might buy
> you a factor of two or three extension of the linear scaling region.
> To understand all this read the section on Amdahl's law, etc. (and look
> at the equations and figures) in the online beowulf book on brahma:
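(A minimal worked example of the Amdahl's-law point made above; the serial fractions are illustrative assumptions, not values measured from the poster's application.)

```python
# Hedged sketch of Amdahl's law: speedup on n nodes when a
# fraction s of the work is serial and (1 - s) parallelizes perfectly.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# A tiny serial fraction gives nearly linear speedup out to 8 nodes,
# regardless of whether comp:comm is 1000:1 or 10^6:1:
assert amdahl_speedup(0.01, 8) > 7.4
# A larger serial fraction saturates the curve much sooner:
assert amdahl_speedup(0.2, 8) < 3.4
```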
> b) The other possibility is that your communication pattern is
> dominated by tiny messages (still) interspersed with lots of
> computation. In this case your communications is probably latency
> dominated, not bandwidth dominated. Fattening the pipe won't help this;
> adding an extra channel will if anything INCREASE the latency (and hence
> slow the time per message). I only mention this because it is possible
> that you have lots of small messages whose aggregate time is still in
> the 1000:1 range for your larger meshes. In this case channel bonding
> would be a bad idea in general, not just not worth it for the scaling
> ranges you've achieved so far. However, if you rearrange your
> application to use bigger messages, it might once again be worthwhile.
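(The latency-versus-bandwidth trade-off described above can be sketched with a simple cost model; the latency and bandwidth figures are illustrative round numbers for switched 100BT, not measurements of this cluster.)

```python
# Hedged model: time to move a payload as k messages of b bytes each,
# with per-message latency `latency` (s) and bandwidth `bandwidth` (bytes/s).
def transfer_time(k, b, latency, bandwidth):
    return k * (latency + b / bandwidth)

total = 1_000_000            # bytes exchanged per step (assumed)
lat, bw = 100e-6, 12.5e6     # ~100 us latency, ~100 Mbit/s (illustrative)

many_small = transfer_time(10_000, total // 10_000, lat, bw)
few_large = transfer_time(10, total // 10, lat, bw)

# Aggregating small messages into big ones removes most of the
# latency cost; fattening the pipe would not have helped here:
assert few_large < many_small
```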
> Hope this helps. Your scaling is really rather good out to 8-10 nodes
> -- eyeballing it, it looks impossibly good for the largest mesh, but I
> rather hope this is because you are plotting the results for single run
> times and better statistics would make the line behave. Although there
> >>are<< nonlinear breakpoints -- breaking the bigger mesh up means more
> of it runs in cache, which can actually yield better CPU rates (which
> then decrease the comp:comm ratio and really make a priori estimation of
> the speedup curve a bitch, actually:-).
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf