[Beowulf] Performance characterising a HPC application
stephen mulcahy
smulcahy at aplpi.com
Thu Mar 15 09:03:34 PDT 2007
Hi,
I'm looking for any suggestions people might have on performance
characterising an HPC application (how's that for a broad query :)
Background:
We have a 20-node Opteron 270 (2.0GHz dual core, 4GB RAM, diskless)
cluster with gigabit Ethernet interconnect. It is used primarily to run
an oceanography numerical model called ROMS (http://www.myroms.org/ in
case anyone is interested). The nodes are running Debian GNU/Linux Etch
(AMD64 version) and we're using the Portland Group Fortran 90 compiler
and MPICH2 for our MPI needs. The cluster has been in production mode
pretty much since it was commissioned, so I haven't gotten a chance to
do much tuning and benchmarking.
I'm currently trying to characterise the performance of the model, in
particular to determine whether it is
1. processor bound.
2. memory bound.
3. interconnect bound.
4. headnode bound.
I'm curious about how others go about this kind of characterisation -
I'm not at all familiar with the model at a code level (my expertise, if
any, is in the area of Linux and hardware rather than in Fortran 90
code) so I don't have any particular insights from that perspective. I'm
hoping I can characterise the app from outside using various measurement
tools.
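One thing I've been toying with (emphasis on toying - this is an
untested sketch and the details are my own assumptions) is the MPI
profiling interface: a small C file along the lines of the one below,
linked into the executable ahead of the MPI library, should report how
much wall-clock time each rank spends inside a few of the common MPI
calls without touching the ROMS source. I believe tools like mpiP do
this properly, but even something hand-rolled would give a rough
compute-versus-communication split per rank.

/* mpitime.c - rough sketch of a PMPI wrapper: time spent inside a few
 * MPI calls, per rank, printed at MPI_Finalize.  Untested; only a
 * handful of calls are covered (ROMS presumably also uses MPI_Isend,
 * MPI_Irecv, MPI_Allreduce etc., which would need the same treatment).
 * Compile with the MPI compiler wrapper and add the object file to the
 * ROMS link line ahead of the MPI library.
 */
#include <stdio.h>
#include <mpi.h>

static double send_time, recv_time, wait_time;

int MPI_Send(void *buf, int count, MPI_Datatype type, int dest,
             int tag, MPI_Comm comm)
{
    double t = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    send_time += MPI_Wtime() - t;
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    double t = MPI_Wtime();
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    recv_time += MPI_Wtime() - t;
    return rc;
}

int MPI_Waitall(int count, MPI_Request reqs[], MPI_Status stats[])
{
    double t = MPI_Wtime();
    int rc = PMPI_Waitall(count, reqs, stats);
    wait_time += MPI_Wtime() - t;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Send %.1fs  MPI_Recv %.1fs  MPI_Waitall %.1fs\n",
           rank, send_time, recv_time, wait_time);
    return PMPI_Finalize();
}

My understanding is that the MPICH2 Fortran bindings end up calling the
C MPI functions, so a C wrapper like this should catch the Fortran calls
too, but I haven't verified that.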
So far, I've used a mix of things including Ganglia, htop, iostat,
vmstat, wireshark, ifstat (and a few others) to try and get a picture
of how the app behaves when running. One of my problems is having too
much data to analyse and not being entirely certain what is significant
and what isn't.
So far I've seen the following characteristics,
On the head node:
* Memory usage is pretty constant at about 1GB while the model is
running. An additional 2-3GB is used in memory buffers and memory
caches, presumably because this node does a lot of I/O.
* Network traffic in averages about 40 Mbit/sec but peaks at about
940 Mbit/sec (I was surprised by this - I didn't think gigabit was
capable of even approaching this in practice; is this figure dubious or
are bursts at this speed possible on good gigabit hardware?). Network
traffic out averages about 35 Mbit/sec but peaks at about 200 Mbit/sec.
The peaks are very short (maybe a few seconds in duration, presumably at
the end of an MPI "run", if that is the correct term).
* Processor usage averages about 25%, but if I watch htop for a while I
see bursts of 80-90% user activity on each core, so the average is
misleading.
On a compute node:
* Memory usage is pretty constant at about 700MB while the model is
running with very little used in buffers or caches.
* Network traffic in averages about 50 Mbit/sec but peaks at about
200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks
at about 200 Mbit/sec. The peaks are very short (maybe a few seconds in
duration, presumably at the end of an MPI "run", if that is the correct
term).
* Processor usage averages about 20%, but if I watch htop for a while I
see bursts of 50-60% user activity on each core, so the average is
misleading.
I'm inclined to install sar on these nodes and run it for a while -
although, again, I'm wary of generating lots of performance data if I'm
not sure what I'm looking for. I'm also a little wary of some of the
RRD-based tools, which (for space-saving reasons) seem to do a lot of
averaging that may actually hide information about bursts. Given that
the model run here seems to be quite bursty, I think that peak
information is important.
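To get at the bursts, I've been thinking of just sampling the interface
counters once a second rather than relying on the averaged graphs -
something along these lines (untested, and the interface name and the
field layout of /proc/net/dev are assumptions on my part):

/* netburst.c - crude once-a-second sampler of the /proc/net/dev byte
 * counters, to catch the short bursts that get averaged away.
 *   gcc -o netburst netburst.c
 *   ./netburst eth0
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fetch receive/transmit byte counters for one interface */
static int read_counters(const char *ifname, unsigned long long *rx,
                         unsigned long long *tx)
{
    char line[512];
    char *p, *name;
    FILE *f = fopen("/proc/net/dev", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        p = strchr(line, ':');
        if (!p)
            continue;            /* header lines have no ':' */
        *p = '\0';
        name = line;
        while (*name == ' ')
            name++;              /* strip leading spaces from the name */
        if (strcmp(name, ifname) != 0)
            continue;
        /* field 1 after the ':' is rx bytes, field 9 is tx bytes */
        if (sscanf(p + 1, "%llu %*u %*u %*u %*u %*u %*u %*u %llu",
                   rx, tx) != 2) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return 0;
    }
    fclose(f);
    return -1;
}

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";
    unsigned long long rx0, tx0, rx1, tx1;

    if (read_counters(ifname, &rx0, &tx0) != 0) {
        fprintf(stderr, "can't find %s in /proc/net/dev\n", ifname);
        return 1;
    }
    for (;;) {
        sleep(1);
        if (read_counters(ifname, &rx1, &tx1) != 0)
            break;
        printf("%s  in %7.1f Mbit/s  out %7.1f Mbit/s\n", ifname,
               (rx1 - rx0) * 8.0 / 1e6, (tx1 - tx0) * 8.0 / 1e6);
        fflush(stdout);
        rx0 = rx1;
        tx0 = tx1;
    }
    return 0;
}

ifstat run with a 1-second delay probably gives much the same numbers,
but logging them myself makes it easier to line the peaks up with what
the model is doing at the time.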
I'm still unsure what the bottleneck currently is. My hunch is that a
faster interconnect *should* give better performance but I'm not sure
how to quantify that. Do others here running MPI jobs see big
improvements using InfiniBand over gigabit, or does it really depend on
the characteristics of the MPI job? What characteristics should I be
looking for?
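To try to put numbers on the interconnect side, I'm planning to run a
simple ping-pong test between a pair of nodes at roughly our two message
sizes - something like the sketch below (untested; the 40k and 205k
sizes are my own rough estimates). The Intel MPI Benchmarks would do a
much more thorough job, I gather, but even this should show what latency
and bandwidth the gigabit fabric actually delivers at these sizes, for
comparison against published InfiniBand figures.

/* pingpong.c - minimal two-rank ping-pong to measure latency and
 * bandwidth at a few message sizes.  Untested sketch.
 *   mpicc -o pingpong pingpong.c
 *   mpiexec -n 2 ./pingpong     (one process per node)
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int sizes[] = { 8, 40 * 1024, 205 * 1024 };   /* bytes */
    int iters = 1000;
    int rank, s, i;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(sizes[2]);

    for (s = 0; s < 3; s++) {
        int bytes = sizes[s];
        double t0, t1;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0) {
            double rtt = (t1 - t0) / iters;   /* seconds per round trip */
            printf("%7d bytes: %8.1f us round trip, %7.1f Mbit/s\n",
                   bytes, rtt * 1e6, bytes * 16.0 / rtt / 1e6);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}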
The goals of this characterisation exercise are two-fold,
a) to identify what parts of the system any tuning exercises should
focus on.
- some possible low-hanging fruit includes enabling jumbo frames [some
rough calculations suggest that we have 2 sizes of MPI messages, one at
40k and one at 205k ... at the standard 1500-byte MTU a 205k message
needs something like 145 frames (roughly 1448 bytes of TCP payload
each), versus about 24 frames at a 9000-byte MTU, so jumbo frames should
significantly reduce the number of packets needed to transmit a message,
but would the gains be significant?].
- Do people here normally tune the TCP/IP stack? My experience is that
it is very easy to reduce performance by trying to tweak kernel buffer
sizes due to the trade-offs in memory ... and 2.6 Linux kernels should
be reasonably smart about this.
- Have people had much success with bonding and gigabit, or are there
significant overheads in bonding?
b) to allow us to specify a new cluster which will run the model *faster*!
- from a perusal of past postings it sounds like current Opterons lag
current Xeons in raw numeric performance (but only by a little) but that
the memory controller architecture of the Opterons gives them an overall
performance edge in most typical HPC loads - is that a correct 36,000 ft
summary, or does it still depend very much on the application?
I notice that AMD (and Mellanox and PathScale/QLogic) have clusters
available through their developer programs for testing. Has anyone
actually used these? It sounds like what we really need before spec'ing
a new system is to list our assumptions and then go and test them on
some similar hardware - these clusters would seem to offer an ideal
environment for doing that, but I'm wondering, in practice, how many
hoops one has to jump through to avail of them ... and whether parties
from outside the US are even allowed access to these.
Apologies for the long-winded email but all feedback welcome. I'll be
happy to summarise any off-list comments back to the list,
-stephen
--
Stephen Mulcahy, Applepie Solutions Ltd, Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland. http://www.aplpi.com