[Beowulf] Performance characterising a HPC application

Mark Hahn hahn at mcmaster.ca
Tue Mar 20 21:05:14 PDT 2007


>>> 3. interconnect bound.
>> 
>> with ethernet, this is obvious, since you can just look at user/system/idle
>> times.
>
> You mean the system time will be high if nodes are busy sending/receiving?

well, if the node is compute-bound, nearly all time will be user-time.
if interconnect-bound, much of the time will be system or idle.  if system time
dominates, then cpu or memory is too slow to keep up with the network.  if there
is idle time, your bottleneck is probably latency (perhaps the network itself,
but possibly also that of whoever you're communicating with - a compute node
or a fileserver.)
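
for concreteness, here's roughly what I mean - a throwaway sketch (mine, not
anything standard) that samples /proc/stat twice and prints the split over one
second.  vmstat or top will tell you the same thing; the field layout assumed
is the usual 2.6-kernel one:

/* cpusplit.c - rough sketch: sample /proc/stat twice and print the
   user/system/idle/iowait split over a 1-second interval.
   (assumes a 2.6-style /proc/stat; vmstat reports the same numbers.) */
#include <stdio.h>
#include <unistd.h>

static void sample(unsigned long long v[7])
{
    FILE *f = fopen("/proc/stat", "r");
    /* first line: cpu user nice system idle iowait irq softirq ... */
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
    fclose(f);
}

int main(void)
{
    unsigned long long a[7], b[7], d[7], total = 0;
    int i;

    sample(a);
    sleep(1);
    sample(b);

    for (i = 0; i < 7; i++) {
        d[i] = b[i] - a[i];
        total += d[i];
    }
    if (total == 0)
        total = 1;

    printf("user %llu%%  system %llu%%  idle %llu%%  iowait %llu%%\n",
           100 * (d[0] + d[1]) / total,        /* user + nice */
           100 * (d[2] + d[5] + d[6]) / total, /* system + irq + softirq */
           100 * d[3] / total,
           100 * d[4] / total);
    return 0;
}

run it on a compute node while the job is in flight: mostly user means
compute-bound, lots of system or idle points at the interconnect.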

>>> 4. headnode bound.
>> 
>> do you mean for NFS traffic?
>
> More in terms of managing the responses from the compute nodes.

just job start/completes?  that's normally pretty trivial, though some 
queueing systems make a complete hash of it...

> What would you see in a tcpdump if the network was the bottleneck, lots of 
> resends?

if the net is a bandwidth bottleneck, then you'd see lots of back-to-back
packets, adding up to near wire-speed.  if latency is the issue, you'll see
relatively long delays between request and response (in NFS, for instance).
my real point is simply that tcpdump allows you to see the unadorned truth
about what's going on.  obviously, tcpdump will let you see the rate and 
scale of your flows, and between which nodes...
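
if you want a number rather than eyeballing the capture, a throwaway filter
like the one below (again mine, nothing standard) is enough - it expects
"tcpdump -tt -n" output on stdin so the first field of each line is an
absolute timestamp, and the 1ms threshold is arbitrary:

/* gaps.c - rough sketch: read "tcpdump -tt -n" output and report
   inter-packet gaps.  lots of packets and few big gaps suggests a
   bandwidth bottleneck (back-to-back traffic); fewer packets with long
   gaps suggests you're waiting on something (latency). */
#include <stdio.h>

int main(void)
{
    char line[4096];
    double t, prev = -1.0, max_gap = 0.0;
    long packets = 0, big_gaps = 0;

    while (fgets(line, sizeof line, stdin)) {
        if (sscanf(line, "%lf", &t) != 1)
            continue;                   /* skip non-packet lines */
        if (prev >= 0.0) {
            double gap = t - prev;
            if (gap > 0.001)            /* >1 ms between packets */
                big_gaps++;
            if (gap > max_gap)
                max_gap = gap;
        }
        prev = t;
        packets++;
    }
    printf("%ld packets, %ld gaps >1ms, max gap %.3f ms\n",
           packets, big_gaps, max_gap * 1000.0);
    return 0;
}

something like "tcpdump -tt -n -c 10000 port 2049 | ./gaps" on the fileserver
(2049 being NFS) will tell you pretty quickly which regime you're in.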

>> anything which doesn't speed up going from gigabit to IB/10G/quadrics is 
>> what I would call embarrassingly parallel...
>
> True - I guess I'm trying to do some cost/benefit analysis so the magnitude 
> of the improvement is important to me .. but maybe measuring it on a test 
> cluster is the only way to be sure of this one.

well, maybe.  it's a big jump from 1x Gb to IB or 10GE - I wish it were 
easier to advocate Myri 2G as an intermediate step, since I actually don't
see a lot of apps showing signs of dissatisfaction with a ~250 MB/s
interconnect - and IB/10GE don't have much advantage over it, if any, in latency.

> Not in a while - I did some testing early on when I was testing different 
> compilers but I don't think I did any specific MPI testing. What would you 
> recommend - pallas or hpl? Or something else? What's a good one that has other 
> good publicly available reference data?

http://www.sharcnet.ca/~hahn/m-g.C is a benchmark I'm working on.  it's 
mainly set up to just probe bw and latency for every pair of nodes in a 
cluster (obviously diagnostic).  I have some simple scripts to turn the 
results into some decent images.  it's obviously a work in progress, but 
has some nice properties.  I'm thinking of collecting at least a low-res
histogram for each measure, rather than just min/avg/max, since the 
lat/bw distributions might be quite interesting.
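
the core of it is nothing fancier than a ping-pong between every pair of
ranks - roughly the sketch below (just the idea, not m-g.C itself; message
sizes and rep counts are arbitrary).  small messages give you latency, large
ones bandwidth, and any link that stands out from the rest of the matrix
(bad cable, duplex mismatch, oversubscribed uplink) shows up immediately:

/* pairbench.c - minimal sketch of pairwise latency/bandwidth probing
   (not m-g.C, just the basic idea).  for each pair (i,j), rank i runs a
   ping-pong with rank j while everyone else waits at the barrier. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS  100
#define SMALL 8            /* bytes - latency probe */
#define LARGE (1 << 20)    /* 1 MB  - bandwidth probe */

static double pingpong(int peer, char *buf, int bytes)
{
    double t0 = MPI_Wtime();
    int r;
    for (r = 0; r < REPS; r++) {
        MPI_Send(buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    return (MPI_Wtime() - t0) / REPS;   /* round-trip time per rep */
}

static void pongping(int peer, char *buf, int bytes)
{
    int r;
    for (r = 0; r < REPS; r++) {
        MPI_Recv(buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv)
{
    int rank, size, i, j;
    char *buf = malloc(LARGE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            if (rank == i) {
                double lat = pingpong(j, buf, SMALL) / 2.0;  /* one-way */
                double bw  = LARGE / (pingpong(j, buf, LARGE) / 2.0);
                printf("%d <-> %d: %.1f us, %.1f MB/s\n",
                       i, j, lat * 1e6, bw / 1e6);
            } else if (rank == j) {
                pongping(i, buf, SMALL);
                pongping(i, buf, LARGE);
            }
            MPI_Barrier(MPI_COMM_WORLD);   /* keep pairs in lockstep */
        }
    }

    MPI_Finalize();
    free(buf);
    return 0;
}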

> Interestingly enough - I enabled this on Friday and the first model we tested 
> with showed a 2-3% performance improvement in some quick testing.  We tested 
> it with another model which uses a larger test set over the weekend and it 
> showed a 30% improvement. So that's good news, but it's still not entirely 
> obvious why we're seeing such a huge improvement when the network utilisation 
> doesn't indicate that the switch is saturated - but I guess latency could be 
> a big factor here.

I'm guessing you're simply bandwidth-limited, though it's unclear whether 
this is a simple bottleneck at the server, or affects "basal" inter-node 
communication as well.

>> I don't think you mentioned what your network looks like - all into one 
>> switch?  what kind is it?  have you verified that all the links are at 
>> 1000/fullduplex?
>
> All the nodes are Tyan s2891 boards with onboard Broadcom bcm5704 integrated 
> nics. They are all connected to a single hp procurve 3400cl 24-port switch. 
> And I've verified that all ports are running at 1000/full (the switch is

I think that's a reasonably good switch.  one interesting thing about it
is that it supports up to 2 10G ports.  if it turns out that your nodes 
are frequently waiting on your server, adding a 10G module, XFP and NIC 
might be a very nice tune-up.  that assumes that the server can _do_ 
something at much greater than 1x Gb speeds, of course!

regards, mark hahn.


