[Beowulf] Performance characterising a HPC application

Tue Mar 20 05:59:35 PDT 2007

Mark Hahn wrote:
>> 1. processor bound.
>> 2. memory bound.
> 
> oprofile is the only thing I know of that will give you this distinction.

In practice, I don't think it is, given the usage characteristics I 
mentioned in my previous mail.

>> 3. interconnect bound.
> 
> with ethernet, this is obvious, since you can just look at user/system/idle
> times.

You mean the system time will be high if nodes are busy sending/receiving?
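
As a rough way to put numbers on the user/system/idle split, here is a minimal Python sketch that compares two samples of the aggregate `cpu` line in `/proc/stat` on a Linux node. The function names are made up for illustration, and only the first seven counters are used. On a node busy with network traffic I would expect the system, irq and softirq shares to rise, but that is an assumption to check rather than a rule.

```python
# Sketch only: function names are hypothetical, not from any existing tool.
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]


def read_cpu_line(stat_text):
    """Return the first seven counters of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:1 + len(CPU_FIELDS)]]
    raise ValueError("no aggregate cpu line found")


def cpu_shares(before, after):
    """Return each counter's share of the CPU time elapsed between two samples."""
    deltas = [later - earlier for earlier, later in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: delta / total for name, delta in zip(CPU_FIELDS, deltas)}
```

To use it on a compute node, read `/proc/stat` once, sleep for a few seconds while the model is running, read it again, and pass both results of `read_cpu_line` to `cpu_shares`.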

> 
>> 4. headnode bound.
> 
> do you mean for NFS traffic?

More in terms of managing the responses from the compute nodes.

> it's not _that_ hard to hit full wire speed with gigabit...
> however, saturating the wire means it's entirely possible that nodes
> are being bottlenecked by this.

It seems to be the case - but the peak usage of the network is 
relatively infrequent (every 30 minutes or so) - average usage is much 
lower.

>> * Network traffic in averages at about 50 Mbit/sec but peaks to about 
>> 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks 
>> to about 200Mbit/sec. The peaks are very short (maybe a few seconds in 
>> duration, presumably at the end of an MPI "run" if that is the correct 
>> term).
> 
> you don't think the peaks correspond to inter-node communication (_during_
> the MPI job)?

Sorry, that's what I meant - this particular model outputs to a history 
file every 30 minutes or so - and seems to do a lot of inter-node comms 
around the same time .. so yes, the traffic seems to be generated by the 
MPI job.

> ouch.  the cluster is doing very badly, and clearly bottlenecked on 
> either inter-node or headnode IO.  I guess I'd be tempted to capture some
> representative trace data with tcpdump (but I'm pretty oldfashioned and 
> fundamentalist about these things.)

Phew, that sounds hardcore :)

I tried wireshark on the headnode for a few minutes but ended up with 
gigs of data and wasn't sure what I was looking for, so I'm currently 
trying to see why my model isn't generating MPE log data, which might be 
more manageable. What would you see in a tcpdump if the network was the 
bottleneck, lots of resends?
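
A lighter-weight check than a full capture might be to watch the kernel's TCP counters. The sketch below assumes the MPI traffic runs over TCP on Linux nodes, and the function names are hypothetical. It parses the `Tcp:` lines of `/proc/net/snmp` and reports the fraction of segments retransmitted between two samples.

```python
# Sketch only: assumes Linux and MPI-over-TCP; function names are hypothetical.
def tcp_counters(snmp_text):
    """Parse the 'Tcp:' header and value lines of /proc/net/snmp into a dict."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before, after):
    """Return retransmitted segments as a fraction of segments sent between samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    resent = after["RetransSegs"] - before["RetransSegs"]
    return resent / sent if sent > 0 else 0.0
```

Sampling just before and just after one of the 30-minute peaks would show whether resends cluster around the history-file output.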

>> quantify that. Do others here running MPI jobs see big improvements in 
>> using Infiniband over Gigabit for MPI jobs or does it really depend on 
>> the
> 
> jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).
> 
> anything which doesn't speed up going from gigabit to IB/10G/quadrics is 
> what I would call embarrassingly parallel...

True - I guess I'm trying to do some cost/benefit analysis so the 
magnitude of the improvement is important to me .. but maybe measuring 
it on a test cluster is the only way to be sure of this one.
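
As a back-of-envelope starting point for that cost/benefit estimate, the sketch below uses the simple model time = latency + size / bandwidth with the figures quoted above (50 us and 80 MB/s for gigabit, 4 us and 800 MB/s for the faster interconnect). It reads MB/s as 10^6 bytes per second. The fraction of run time spent communicating is a hypothetical input that would have to come from profiling. It ignores contention, overlap of computation with communication and collective operations, so it is only a rough guide.

```python
# Sketch only: a simple latency/bandwidth model, not a measurement.
GIGABIT = (50e-6, 80e6)   # (latency in seconds, bandwidth in bytes/s) as quoted
FASTER = (4e-6, 800e6)    # the IB/10G/quadrics class figures as quoted


def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to deliver one message under the latency + size/bandwidth model."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def comm_speedup(message_bytes):
    """Ratio of gigabit transfer time to faster-interconnect transfer time."""
    return transfer_time(message_bytes, *GIGABIT) / transfer_time(message_bytes, *FASTER)


def overall_speedup(comm_fraction, speedup_of_comm):
    """Amdahl-style whole-job speedup when only the communication part gets faster."""
    return 1.0 / ((1.0 - comm_fraction) + comm_fraction / speedup_of_comm)
```

Under this model the communication itself speeds up by between 10x and 12.5x depending on message size, so the whole-job gain is dominated by how much of the run time is communication. For example, `overall_speedup(0.5, 10.0)` gives about 1.8.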

>> characteristics of the MPI job? What characteristics should I be 
>> looking for?
> 
> well, have you run a simple MPI benchmark, to make sure you're seeing
> reasonable performance?  single-pair latency, bandwidth and some form of 
> group communication are always good to know.

Not in a while - I did some testing early on when I was testing 
different compilers but I don't think I did any specific MPI testing. 
What would you recommend - pallas or hpl? Or something else? What's a 
good one that has good publicly available reference data?

>> a) to identify what parts of the system any tuning exercises should 
>> focus on.
>> - some possible low hanging fruit includes enabling jumbo frames [some 
>> rough
> 
> jumbo frames are mainly a way to recover some CPU overhead - most systems,
> especially those which are only 20%, can handle back-to-back 1500B frames.
> it's easy enough to measure (with ttcp, netperf, etc).

Interestingly enough - I enabled this on Friday and the first model we 
tested with showed a 2-3% performance improvement in some quick testing. 
We tested it over the weekend with another model which uses a larger test 
set and it showed a 30% improvement. So that's good news, but 
it's still not entirely obvious why we're seeing such a huge improvement 
when the network utilisation doesn't indicate that the switch is 
saturated - but I guess latency could be a big factor here.
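
To get a feel for what jumbo frames change on the wire, here is a small arithmetic sketch. It assumes a 9000-byte jumbo MTU (the size actually configured is not stated above), IPv4 and TCP headers without options, and the standard 38 bytes of per-frame Ethernet overhead. It only counts frames and bytes, so it says nothing about latency or CPU load directly.

```python
# Sketch only: the 9000-byte MTU and header sizes are assumptions.
ETHERNET_OVERHEAD = 38   # preamble+SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40      # IPv4 20 + TCP 20, no options


def frames_needed(message_bytes, mtu):
    """Number of full-size frames needed to carry a message of the given size."""
    payload = mtu - IP_TCP_HEADERS
    return -(-message_bytes // payload)  # ceiling division


def wire_efficiency(mtu):
    """Fraction of wire time spent on application payload for full-size frames."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


def frames_per_second(mtu, link_bits_per_s=1e9):
    """Full-size frames per second needed to saturate the link."""
    return link_bits_per_s / 8 / (mtu + ETHERNET_OVERHEAD)
```

With these assumptions a 1 MB message needs 685 frames at MTU 1500 but 112 at MTU 9000, while payload efficiency only moves from about 95% to about 99%. That would be consistent with the gain coming from fewer frames to process per message rather than from extra raw bandwidth, though that is a guess.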

> I don't think you mentioned what your network looks like - all into one 
> switch?  what kind is it?  have you verified that all the links are at 
> 1000/fullduplex?

All the nodes are Tyan s2891 boards with onboard Broadcom bcm5704 
integrated nics. They are all connected to a single hp procurve 3400cl 
24-port switch. And I've verified that all ports are running at 
1000/full (the switch is reporting some ports as using MDI and some 
using MDIX but I'm not sure that's a cause for concern, although it is 
mildly surprising since they all use a standard cable and mobo).

>> I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters 
>> available through their developer program for testing. Has anyone 
>> actually used these?
> 
> I haven't.  but if you'd like to try on our systems, we have quite a range.
> (no IB, but our quadrics systems are roughly comparable.)

Thanks for the offer (if it was :) - I need to have a think about the 
effort required to set this up and see how much assistance the 
AMD/Mellanox/Pathscale cluster folks give for this kind of testing - if 
they don't make it too difficult I'm inclined to avail of their kit 
rather than hassle you.

Thanks again,

-stephen

-- 
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
    GMIT, Dublin Rd, Galway, Ireland.      http://www.aplpi.com