[Warewulf] Re: [Beowulf] hpl size problems
Ashley Pittman
ashley at quadrics.com
Tue Sep 27 03:58:40 PDT 2005
On Mon, 2005-09-26 at 15:43 -0400, Andrew Piskorski wrote:
> It's amusing that Mark Hahn is already participating in this thread,
> because his post to the Beowulf list gave a link explaining a detailed
> real-world example of that effect very nicely:
>
> http://www.beowulf.org/archive/2005-July/013215.html
> http://www.sc-conference.org/sc2003/paperpdfs/pap301.pdf
>
> Basically, daemons cause interrupts which are not synchronized across
> nodes, which causes lots of variation in barrier latency across the
> nodes - AKA, jitter. And with barrier-heavy code, lots of jitter
> causes disastrous performance. On the 8192 processor ASCI Q, they saw
> a FACTOR OF TWO performance loss due to those effects...
Well remembered; that is indeed a very good paper, and one that everybody
should read.
If I remember correctly, though, the effect didn't really kick in until
~700 nodes, and it was only reproducible when the barrier time was
significantly shorter than the scheduler's tick interval. I doubt it's
relevant in this case.
There is a wonderful tool written by LANL specifically for measuring
this kind of background "jitter" on nodes. It's called 'whatelse' and is
a Perl script that samples node state before and after <something> and
reports on the difference; <something> can be either an application run
or a sample time. It lets you see precisely how many CPU cycles are
free for the application to use.
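For reference, the basic idea is easy to approximate yourself. The sketch
below is my own, not the LANL script, and it only looks at the aggregate
CPU jiffy counters in /proc/stat rather than per-process state or page
faults: it samples before and after a sleep and reports the idle fraction.

  #!/usr/bin/perl
  # Rough jitter probe (not 'whatelse'): sample /proc/stat before and
  # after a fixed interval and report how much CPU time was idle.
  use strict;
  use warnings;

  sub cpu_jiffies {
      open my $fh, '<', '/proc/stat' or die "cannot open /proc/stat: $!";
      my $line = <$fh>;                  # aggregate "cpu" line comes first
      close $fh;
      my (undef, @j) = split ' ', $line; # user nice system idle [iowait ...]
      return @j;
  }

  my $interval = shift @ARGV || 60;      # sample window in seconds

  my @before = cpu_jiffies();
  sleep $interval;
  my @after  = cpu_jiffies();

  my @delta = map { $after[$_] - $before[$_] } 0 .. $#after;
  my $total = 0;
  $total += $_ for @delta;
  my $idle  = $delta[3];                 # 4th field of /proc/stat is idle

  printf "%.3f%% idle over %d seconds (%d of %d jiffies consumed)\n",
         100 * $idle / $total, $interval, $total - $idle, $total;

Run it as "./jitter.pl 60" on an otherwise idle node; anything much below
~99.9% idle points at daemons or kernel work stealing cycles.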
Running it on one of the (not particularly tuned) systems here, I see
99.983% IDLE CPU time over a minute, with two processes using JIFFIES and
four page faults. My desktop did worse, at 70% idle, whilst I was writing
this mail.
Ashley,