[Beowulf] Performance characterising a HPC application
stephen mulcahy
smulcahy at aplpi.com
Tue Mar 20 05:59:35 PDT 2007
Mark Hahn wrote:
>> 1. processor bound.
>> 2. memory bound.
>
> oprofile is the only thing I know of that will give you this distinction.
In practice, I don't think it is, given the usage characteristics I
mentioned in my previous mail.
>> 3. interconnect bound.
>
> with ethernet, this is obvious, since you can just look at user/system/idle
> times.
You mean the system time will be high if nodes are busy sending/receiving?
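If so, something like the rough sketch below is what I'd use to watch
that split on a compute node (assuming the usual Linux /proc/stat
first-line layout - user nice system idle iowait irq softirq - rather
than any particular monitoring tool):

/* cpusplit.c - print user/system/idle/iowait percentages over a 1s window.
 * Reads the aggregate "cpu " line of /proc/stat twice and diffs the counters.
 * Build: gcc -O2 -o cpusplit cpusplit.c
 */
#include <stdio.h>
#include <unistd.h>

static int read_cpu(unsigned long long v[7])
{
    FILE *f = fopen("/proc/stat", "r");
    int n;

    if (!f)
        return -1;
    /* first line: cpu user nice system idle iowait irq softirq ... */
    n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
    fclose(f);
    return (n == 7) ? 0 : -1;
}

int main(void)
{
    unsigned long long a[7], b[7], d[7], total = 0;
    int i;

    if (read_cpu(a))
        return 1;
    sleep(1);
    if (read_cpu(b))
        return 1;
    for (i = 0; i < 7; i++) {
        d[i] = b[i] - a[i];
        total += d[i];
    }
    /* lump user+nice together as "user" and system+irq+softirq as "system" */
    printf("user %llu%%  system %llu%%  idle %llu%%  iowait %llu%%\n",
           100 * (d[0] + d[1]) / total, 100 * (d[2] + d[5] + d[6]) / total,
           100 * d[3] / total, 100 * d[4] / total);
    return 0;
}

If the "system" share climbs in step with the traffic peaks, that would
back up the interconnect theory.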
>
>> 4. headnode bound.
>
> do you mean for NFS traffic?
More in terms of managing the responses from the compute nodes.
> it's not _that_ hard to hit full wire speed with gigabit...
> however, saturating the wire means it's entirely possible that nodes
> are being bottlenecked by this.
It seems to be the case - but the peak usage of the network is
relatively infrequent (every 30 minutes or so) - average usage is much
lower.
>> * Inbound network traffic averages about 50 Mbit/sec but peaks at
>> about 200 Mbit/sec. Outbound network traffic averages about 50
>> Mbit/sec but peaks at about 200 Mbit/sec. The peaks are very short
>> (maybe a few seconds in duration, presumably at the end of an MPI
>> "run" if that is the correct term).
>
> you don't think the peaks correspond to inter-node communication (_during_
> the MPI job)?
Sorry, that's what I meant - this particular model outputs to a history
file every 30 minutes or so - and seems to do a lot of inter-node comms
around the same time ... so yes, the traffic seems to be generated by
the MPI job.
> ouch. the cluster is doing very badly, and clearly bottlenecked on
> either inter-node or headnode IO. I guess I'd be tempted to capture some
> representative trace data with tcpdump (but I'm pretty oldfashioned and
> fundamentalist about these things.)
Phew, that sounds hardcore :)
I tried wireshark on the headnode for a few minutes but ended up with
gigs of data and wasn't sure what I was looking for, so I'm currently
trying to see whether I can get my model to generate MPE log data, which
might be more manageable. What would you see in a tcpdump if the network
was the bottleneck - lots of resends?
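In the meantime I might just watch the kernel's TCP retransmission
counter while a job runs rather than wading through a full trace - a
rough sketch, assuming the standard /proc/net/snmp layout where a
"Tcp:" header line is followed by a "Tcp:" value line:

/* retrans.c - print the TCP RetransSegs counter from /proc/net/snmp.
 * Run it before and after a job (or in a loop) and diff the values;
 * a count that climbs while the MPI traffic peaks would point at
 * packet loss / resends on the wire.
 * Build: gcc -O2 -o retrans retrans.c
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char hdr[1024], val[1024], *tok;
    int col = -1, i;
    FILE *f = fopen("/proc/net/snmp", "r");

    if (!f)
        return 1;
    /* /proc/net/snmp pairs a "Tcp: <names>" line with a "Tcp: <values>" line */
    while (fgets(hdr, sizeof(hdr), f)) {
        if (strncmp(hdr, "Tcp:", 4) != 0)
            continue;
        if (!fgets(val, sizeof(val), f))
            break;
        /* find which column RetransSegs sits in */
        for (tok = strtok(hdr, " \n"), i = 0; tok; tok = strtok(NULL, " \n"), i++)
            if (strcmp(tok, "RetransSegs") == 0)
                col = i;
        if (col < 0)
            break;
        /* print the matching field from the value line */
        for (tok = strtok(val, " \n"), i = 0; tok; tok = strtok(NULL, " \n"), i++)
            if (i == col)
                printf("RetransSegs %s\n", tok);
        break;
    }
    fclose(f);
    return 0;
}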
>> quantify that. Do others here running MPI jobs see big improvements in
>> using Infiniband over Gigabit for MPI jobs or does it really depend on
>> the
>
> jeez: compare a 50 us interconnect to a 4 us one (or 80 MB/s vs >800).
>
> anything which doesn't speed up going from gigabit to IB/10G/quadrics is
> what I would call embarrassingly parallel...
True - I guess I'm trying to do some cost/benefit analysis, so the
magnitude of the improvement is important to me ... but maybe measuring
it on a test cluster is the only way to be sure of this one.
>> characteristics of the MPI job? What characteristics should I be
>> looking for?
>
> well, have you run a simple MPI benchmark, to make sure you're seeing
> reasonable performance? single-pair latency, bandwidth and some form of
> group communication are always good to know.
Not in a while - I did some testing early on when I was testing
different compilers but I don't think I did any specific MPI testing.
What would you recommend - Pallas or HPL? Or something else? What's a
good one that has good publicly available reference data?
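Failing that, even a crude ping-pong between one pair of nodes would
probably tell me whether single-pair latency and bandwidth are in the
right ballpark - a rough sketch along these lines (assuming an
mpicc/mpirun setup such as MPICH or Open MPI, run with one process on
each of two nodes):

/* pingpong.c - crude single-pair latency/bandwidth check, ranks 0 and 1.
 * Build: mpicc -O2 -o pingpong pingpong.c
 * Run:   mpirun -np 2 ./pingpong   (one process per node, to exercise the wire)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, s, i, n;
    int sizes[4] = { 1, 1024, 65536, 1048576 };
    double t0, t;
    char *buf = malloc(1048576);
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (s = 0; s < 4; s++) {
        n = sizes[s];
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {            /* rank 0 sends and waits for the echo */
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {     /* rank 1 echoes everything back */
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = (MPI_Wtime() - t0) / REPS / 2.0;   /* one-way time per message */
        if (rank == 0)
            printf("%8d bytes: %10.1f us   %8.1f MB/s\n",
                   n, t * 1e6, n / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

On plain gigabit I'd expect the small-message number to land around the
50 us you mention, which would make the comparison against IB/Quadrics
figures fairly direct.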
>> a) to identify what parts of the system any tuning exercises should
>> focus on.
>> - some possible low hanging fruit includes enabling jumbo frames [some
>> rough
>
> jumbo frames are mainly a way to recover some CPU overhead - most systems,
> especially those which are only 20%, can handle back-to-back 1500B frames.
> it's easy enough to measure (with ttcp, netperf, etc).
Interestingly enough - I enabled this on Friday and the first model we
tested with showed a 2-3% performance improvement in some quick testing.
We tested it over the weekend with another model which uses a larger
test set, and it showed a 30% improvement. So that's good news, but
it's still not entirely obvious why we're seeing such a huge improvement
when the network utilisation doesn't indicate that the switch is
saturated - but I guess latency could be a big factor here.
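One thing I do want to verify is that the larger MTU is actually in
effect on every node (and hasn't quietly been left at 1500 somewhere) -
a small sketch for checking it programmatically; "eth0" is just an
assumption for whatever the cluster interface is called on our nodes:

/* mtucheck.c - print the MTU of a given interface via the SIOCGIFMTU ioctl.
 * Build: gcc -O2 -o mtucheck mtucheck.c
 * Run:   ./mtucheck eth0     (interface name is an assumption)
 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (argc < 2 || fd < 0)
        return 1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
        perror("SIOCGIFMTU");
        return 1;
    }
    printf("%s mtu %d\n", argv[1], ifr.ifr_mtu);
    close(fd);
    return 0;
}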
> I don't think you mentioned what your network looks like - all into one
> switch? what kind is it? have you verified that all the links are at
> 1000/fullduplex?
All the nodes are Tyan S2891 boards with onboard Broadcom BCM5704
NICs. They are all connected to a single HP ProCurve 3400cl 24-port
switch. And I've verified that all ports are running at 1000/full (the
switch is reporting some ports as using MDI and some using MDI-X, but
I'm not sure that's a cause for concern, although it is mildly
surprising since they all use a standard cable and mobo).
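For completeness, the speed/duplex can also be read off each node
rather than trusting the switch - a rough sketch using the ETHTOOL_GSET
ioctl, with the same caveat that the interface name is an assumption:

/* linkcheck.c - report speed and duplex of an interface via ETHTOOL_GSET.
 * Build: gcc -O2 -o linkcheck linkcheck.c
 * Run:   ./linkcheck eth0    (interface name is an assumption)
 */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct ifreq ifr;
    struct ethtool_cmd ecmd;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (argc < 2 || fd < 0)
        return 1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
    ecmd.cmd = ETHTOOL_GSET;             /* ask the driver for link settings */
    ifr.ifr_data = (char *)&ecmd;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("SIOCETHTOOL");
        return 1;
    }
    printf("%s: %u Mb/s, %s duplex\n", argv[1], (unsigned)ecmd.speed,
           ecmd.duplex == DUPLEX_FULL ? "full" : "half");
    close(fd);
    return 0;
}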
>> I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters
>> available through their developer program for testing. Has anyone
>> actually used these?
>
> I haven't. but if you'd like to try on our systems, we have quite a range.
> (no IB, but our quadrics systems are roughly comparable.)
Thanks for the offer (if it was one :) - I need to have a think about
the effort required to set this up and see how much assistance the
AMD/Mellanox/Pathscale cluster folks give for this kind of testing - if
they don't make it too difficult I'm inclined to avail of their kit
rather than hassle you.
Thanks again,
-stephen
--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. http://www.aplpi.com