[Beowulf] Performance characterising a HPC application

Tue Mar 20 06:03:58 PDT 2007

A non-intrusive test you could try is to replace your MPI (mpich) with a 
lower-latency one. Scali or MPI/Gamma are just to name two. These can lower 
your latency down to 15muS or so.

If this drastically ups your efficiency you know where your bottleneck is.

More intrusive is to change your MPI communication pattern. Blocking v.s. 
non-blocking. Buffered v.s. non-buffered. But I guess this would determine 
whether the model is

5. software bound.

Mattijs

On Thursday 15 March 2007 16:03, stephen mulcahy wrote:
> Hi,
>
> I'm looking for any suggestions people might have on performance
> characterising a HPC application (how's that for a broad query :)
>
> Background:
> We have a 20 node opteron 270 (2.0GHz dual core, 4GB ram, diskless)
> cluster with gigabit ethernet interconnect. It is used primarily to run
> an Oceanography numerical model called ROMS (http://www.myroms.org/ in
> case anyone is interested). The nodes are running Debian GNU/Linux Etch
> (AMD64 version) and we're using the portland group fortan90 compiler and
> mpich2 for our MPI needs. The cluster has been in production mode pretty
> much since it was commissioned so I haven't gotten a chance to do much
> tuning and benchmarking.
>
> I'm currently trying to characterise the performance of the model, in
> particular to determine where it is
>
> 1. processor bound.
>
> 2. memory bound.
>
> 3. interconnect bound.
>
> 4. headnode bound.
>
> I'm curious about how others go about this kind of characterisation -
> I'm not at all familiar with the model at a code level (my expertise, if
> any!, is in the area of Linux and hardware rather than in fortran90
> code) so I don't have any particular insights from that perspective. I'm
> hoping I can characterise the app from outside using various measurement
> tools.
>
> So far, I've used a mix of things including Ganglia, htop, iostat,
> vmtstat, wireshark, ifstat (and a few others) to try and get a picture
> of how the app behaves when running. One of my problems is having too
> much data to analyse and not being entirely certain what is significant
> and what isn't.
>
> So far I've seen the following characteristics,
>
> On the head node:
> * Memory usage is pretty constant at about 1GB while the model is
> running. An additional 2-3GB is used in memory buffers and memory
> caches, presumably because this node does a lot of I/O.
> * Network traffic in averages at about 40 Mbit/sec but peaks to about
> 940 Mbit/sec (I was surprised by this - I didn't think gigabit was
> capable of even approaching this in practice, is this figure dubious or
> are bursts at this speed possible on good Gigabit hardware?). Network
> traffic out averages about 35 Mbit/sec but peaks to about 200Mbit/sec.
> The peaks are very short (maybe a few seconds in duration, presumably at
> the end of an MPI "run" if that is the correct term).
> * Processor usage averages about 25% but if I watch htop activity for a
> while I see bursts of 80-90% user activity on each core so the average
> is misleading.
>
> On a compute node:
> * Memory usage is pretty constant at about 700MB while the model is
> running with very little used in buffers or caches.
> * Network traffic in averages at about 50 Mbit/sec but peaks to about
> 200 Mbit/sec. Network traffic out averages about 50 Mbit/sec but peaks
> to about 200Mbit/sec. The peaks are very short (maybe a few seconds in
> duration, presumably at the end of an MPI "run" if that is the correct
> term).
> * Processor usage averages about 20% but if I watch htop activity for a
> while I see bursts of 50-60% user activity on each core so the average
> is misleading.
>
> I'm inclined to install sar on these nodes and run it for a while -
> although again I'm wary about generating lots of performance data if I'm
> not sure what I'm looking for. I'm also a little wary of some of the RRD
> based tools which (for space-saving reasons) seem to do a lot of
> averaging which may actually hide information about bursts. Given that
> the model run here seems to be quite bursty I think that peak
> information is important.
>
> I'm still unsure what the bottleneck currently is. My hunch is that a
> faster interconnect *should* give a better performance but I'm not sure
> how to quantify that. Do others here running MPI jobs see big
> improvements in using Infiniband over Gigabit for MPI jobs or does it
> really depend on the characteristics of the MPI job? What
> characteristics should I be looking for?
>
> The goals of this characterisation exercise are two-fold,
>
> a) to identify what parts of the system any tuning exercises should
> focus on.
> - some possible low hanging fruit includes enabling jumbo frames [some
> rough calculations suggest that we have 2 sizes of MPI messages, one at
> 40k and one at 205k ... use of jumbo frames should significantly reduce
> the number of packets to transmit a message, but would the gains be
> significant?].
> - Do people here normally tune the tcp/ip stack? My experience is that
> it is very easy to reduce the performance by trying to tweak kernel
> buffer sizes due to the trade-offs in memory ... and 2.6 Linux kernels
> should be reasonably smart about this.
> - Have people had much success with bonding and gigabit or is there
> significant overheads in bonding?
>
> b) to allow us to specify a new cluster which will run the model *faster*!
> - from a perusal of past postings it sounds like current Opterons lag
> current Xeons in raw numeric performance (but only by a little) but that
> the memory controller architecture of Opterons give them an overall
> performance edge in most typical HPC loads, is that a correct 36,000ft
> summary or does it still depend very much on the application?
>
> I notice that AMD (and Mellanox and Pathscale/Qlogic) have clusters
> available through their developer program for testing. Has anyone
> actually used these? It sounds like what we really need before spec'ing
> a new system is to list our assumptions and then go and test them on
> some similar hardware - these clusters would seem to offer an ideal
> environment for doing that but I'm wondering, in practice, how many
> hoops one has to jump through to avail of them ... and whether parties
> from outside of the US are even allowed access to these.
>
> Apologies for the long-winded email but all feedback welcome. I'll be
> happy to summarise any off-list comments back to the list,
>
> -stephen

-- 

Mattijs Janssens

OpenCFD Ltd.
9 Albert Road,
Caversham,
Reading RG4 7AN.
Tel: +44 (0)118 9471030
Email: M.Janssens at OpenCFD.co.uk
URL: http://www.OpenCFD.co.uk