[Beowulf] Performance characterising a HPC application

stephen mulcahy smulcahy at aplpi.com
Tue Mar 20 05:59:35 PDT 2007

Mark Hahn wrote:
>> 1. processor bound.
>> 2. memory bound.
> oprofile is the only thing I know of that will give you this distinction.

In practice, I don't think it is either of those, given the usage 
characteristics I mentioned in my previous mail.

>> 3. interconnect bound.
> with ethernet, this is obvious, since you can just look at user/system/idle
> times.

You mean the system time will be high if nodes are busy sending/receiving?
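
A minimal sketch of that user/system/idle check, assuming a Linux node where the first line of /proc/stat carries the aggregate CPU counters. The function names and the sample counter values are made up for illustration; on a real node you would read the "cpu" line twice, a few seconds apart, during an MPI run.

```python
# Compare two samples of the aggregate "cpu" line from /proc/stat and
# report where the time went between them, as percentages.

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    names = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    # Any further fields (steal, guest, ...) are ignored in this sketch.
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))

def cpu_shares(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}

# Hypothetical samples, not measurements from the cluster discussed here.
sample_before = parse_cpu_line("cpu  1000 0 500 8000 100 50 350")
sample_after = parse_cpu_line("cpu  1600 0 900 8300 100 100 500")
for state, percent in cpu_shares(sample_before, sample_after).items():
    print(f"{state:8s} {percent:5.1f}%")
```

Network receive work is often accounted under irq and softirq as well as system, so it is probably worth looking at all three together rather than system time alone.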

>> 4. headnode bound.
> do you mean for NFS traffic?

More in terms of managing the responses from the compute nodes.

> it's not _that_ hard to hit full wire speed with gigabit...
> however, saturating the wire means it's entirely possible that nodes
> are being bottlenecked by this.

It seems to be the case - but the peak usage of the network is 
relatively infrequent (every 30 minutes or so) - average usage is much 
lower.

>> * Network traffic in averages at about 50 Mbit/sec but peaks to about 
>> 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks 
>> to about 200Mbit/sec. The peaks are very short (maybe a few seconds in 
>> duration, presumably at the end of an MPI "run" if that is the correct 
>> term).
> you don't think the peaks correspond to inter-node communication (_during_
> the MPI job)?

Sorry, that's what I meant - this particular model outputs to a history 
file every 30 minutes or so - and seems to do a lot of inter-node comms 
around the same time, so yes, the traffic seems to be generated by the 
MPI job.

> ouch.  the cluster is doing very badly, and clearly bottlenecked on 
> either inter-node or headnode IO.  I guess I'd be tempted to capture some
> representative trace data with tcpdump (but I'm pretty oldfashioned and 
> fundamentalist about these things.)

Phew, that sounds hardcore :)

I tried wireshark on the headnode for a few minutes but ended up with 
gigs of data and wasn't sure what I was looking for, so I'm currently 
trying to see why my model isn't generating MPE log data, which might be 
more manageable. What would you see in a tcpdump if the network was the 
bottleneck, lots of resends?
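
One cheaper check than a full capture, offered as an assumption rather than as the answer to the question above: if the MPI traffic runs over TCP, Linux keeps host-wide retransmission counters in /proc/net/snmp. The sketch below parses the two "Tcp:" lines of that file and compares two samples. The function names and sample numbers are hypothetical, and the counters cover all TCP traffic on the node, not just the MPI job.

```python
# Estimate the fraction of TCP segments that were retransmitted between
# two samples of /proc/net/snmp (one header line and one value line
# beginning with "Tcp:").

def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp text as a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    header, values = rows
    return {name: int(value) for name, value in zip(header[1:], values[1:])}

def retransmit_ratio(before, after):
    """Retransmitted segments divided by segments sent between two samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    resent = after["RetransSegs"] - before["RetransSegs"]
    return resent / sent if sent > 0 else 0.0

HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts")

# Hypothetical samples, not measurements from the cluster discussed here.
sample_before = parse_tcp_counters(
    HEADER + "\nTcp: 1 200 120000 -1 10 5 0 0 3 50000 40000 400 0 2\n")
sample_after = parse_tcp_counters(
    HEADER + "\nTcp: 1 200 120000 -1 10 5 0 0 3 62000 50000 900 0 2\n")
print(f"retransmit ratio: {retransmit_ratio(sample_before, sample_after):.3f}")
```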

>> quantify that. Do others here running MPI jobs see big improvements in 
>> using Infiniband over Gigabit for MPI jobs or does it really depend on 
>> the
> jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).
> anything which doesn't speed up going from gigabit to IB/10G/quadrics is 
> what I would call embarrassingly parallel...

True - I guess I'm trying to do some cost/benefit analysis so the 
magnitude of the improvement is important to me .. but maybe measuring 
it on a test cluster is the only way to be sure of this one.
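
As a rough starting point for that cost/benefit estimate, here is a sketch using the figures quoted above (50 us and 80 MB/s for gigabit, 4 us and 800 MB/s for the faster interconnect) in a simple latency-plus-bandwidth model, followed by Amdahl's law. The fraction of runtime spent communicating is an assumption that has to be measured per model; the 0.3 below is only a placeholder.

```python
# Simple model: time to send one message = latency + size / bandwidth.
# Overall gain then follows Amdahl's law for the communication fraction.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time in seconds to send a single message."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

def overall_speedup(comm_fraction, comm_speedup):
    """Amdahl's law: speedup of the whole run if only communication gets faster."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / comm_speedup)

GIGE = (50e-6, 80e6)   # latency (s), bandwidth (bytes/s), figures quoted above
FAST = (4e-6, 800e6)   # the 4 us, >800 MB/s class of interconnect

for size in (1_000, 100_000, 1_000_000):
    ratio = transfer_time(size, *GIGE) / transfer_time(size, *FAST)
    print(f"{size:>9d} bytes: communication about {ratio:.1f}x faster")

# Placeholder assumption: 30% of wall-clock time is spent communicating.
assumed_comm_fraction = 0.3
print(f"whole-run speedup: {overall_speedup(assumed_comm_fraction, 10.0):.2f}x")
```

The model ignores contention, overlap of computation with communication and collective operations, so it only bounds what to expect before measuring on a test cluster.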

>> characteristics of the MPI job? What characteristics should I be 
>> looking for?
> well, have you run a simple MPI benchmark, to make sure you're seeing
> reasonable performance?  single-pair latency, bandwidth and some form of 
> group communication are always good to know.

Not in a while - I did some testing early on when I was testing 
different compilers but I don't think I did any specific MPI testing. 
What would you recommend - Pallas or HPL? Or something else? What's a 
good one that has good publicly available reference data?

>> a) to identify what parts of the system any tuning exercises should 
>> focus on.
>> - some possible low hanging fruit includes enabling jumbo frames [some 
>> rough
> jumbo frames are mainly a way to recover some CPU overhead - most systems,
> especially those which are only 20%, can handle back-to-back 1500B frames.
> it's easy enough to measure (with ttcp, netperf, etc).

Interestingly enough - I enabled this on Friday and the first model we 
tested with showed a 2-3% performance improvement in some quick testing. 
We tested it over the weekend with another model which uses a larger 
test set and it showed a 30% improvement. So that's good news, but 
it's still not entirely obvious why we're seeing such a huge improvement 
when the network utilisation doesn't indicate that the switch is 
saturated - but I guess latency could be a big factor here.
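
To put a number on what jumbo frames change, here is a small sketch that counts the frames needed for a given payload at two MTUs. It assumes TCP over IPv4 with no options (40 header bytes inside the MTU), 38 bytes of Ethernet framing per frame (preamble, header, FCS and inter-frame gap), and a 9000-byte jumbo MTU; the jumbo size actually configured is not stated above.

```python
# Frames needed, and share of wire bytes carrying payload, for a given MTU.

IP_TCP_HEADERS = 40     # IPv4 (20) + TCP (20), assuming no options
ETHERNET_OVERHEAD = 38  # preamble 8 + header 14 + FCS 4 + inter-frame gap 12

def frames_needed(payload_bytes, mtu, header_bytes=IP_TCP_HEADERS):
    """Number of full-size frames needed to carry payload_bytes."""
    per_frame = mtu - header_bytes
    return (payload_bytes + per_frame - 1) // per_frame  # ceiling division

def wire_efficiency(mtu, header_bytes=IP_TCP_HEADERS):
    """Fraction of bytes on the wire that are payload, for full-size frames."""
    return (mtu - header_bytes) / (mtu + ETHERNET_OVERHEAD)

message = 1_000_000  # bytes, an arbitrary example size
for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_needed(message, mtu)} frames, "
          f"{100 * wire_efficiency(mtu):.1f}% payload on the wire")
```

The gain in wire efficiency is only a few percent, so on its own it would not account for a 30% improvement; the roughly sixfold drop in frames per message, and hence in per-frame CPU and interrupt work, looks like the more plausible contributor, though that is a guess.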

> I don't think you mentioned what your network looks like - all into one 
> switch?  what kind is it?  have you verified that all the links are at 
> 1000/fullduplex?

All the nodes are Tyan S2891 boards with onboard Broadcom BCM5704 
integrated NICs. They are all connected to a single HP ProCurve 3400cl 
24-port switch. I've verified that all ports are running at 
1000/full (the switch is reporting some ports as using MDI and some 
using MDIX but I'm not sure that's a cause for concern, although it is 
mildly surprising since they all use a standard cable and mobo).

>> I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters 
>> available through their developer program for testing. Has anyone 
>> actually used these?
> I haven't.  but if you'd like to try on our systems, we have quite a range.
> (no IB, but our quadrics systems are roughly comparable.)

Thanks for the offer (if it was one :) - I need to have a think about the 
effort required to set this up and see how much assistance the 
AMD/Mellanox/Pathscale cluster folks give for this kind of testing - if 
they don't make it too difficult I'm inclined to avail of their kit 
rather than hassle you.

Thanks again,


Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
    GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com
