[Beowulf] hpl size problems

Ashley Pittman ashley at quadrics.com
Tue Sep 27 07:03:22 PDT 2005

On Tue, 2005-09-27 at 09:41 -0400, Robert G. Brown wrote:
> Ashley Pittman writes:
> > There is a wonderful tool written by LANL specifically for measuring
> > this kind of background "jitter" on nodes, it's called 'whatelse' and is
> > a perl script that samples node state before and after <something> and
> > reports on the difference.  <something> can either be an application or a
> > sample time.  It allows you to see precisely how many CPU cycles are
> > free for the application to use.
> > 
> > Running it on one (of the not particularly tuned) systems here I see
> > 99.983% IDLE CPU time over a minute with two processes using JIFFIES and
> > four page faults.  My desktop did worse with 70% idle whilst writing
> > this mail.
> I'm very curious as to just what it does.  Something different than the
> /usr/bin/time command or what you can see running e.g. vmstat or top
> while the task is running?

I guess it's like time on steroids: you start it and it prints a single
report at the end.  It gives a lot more information than top or vmstat.

> Granted that a well-parallelized task is often a CPU bound task seeing
> how long a task spends in userspace, kernelspace and so on (and what the
> overal system duty cycle is while it is running) is certainly useful,
> but there are a lot of tools that can return this information already,
> including at least one (xmlsysd/wulfstat) that can do so for a whole
> cluster at once.  What exactly are they parsing and looking at?

It's different to this in that its aim is to see what else (hence the
name) was happening on the node whilst your job was running.

> I ask because if it is NOT something implicitly in xmlsysd/wulfstat,
> I'll bet it is pretty easy to add.  I already can parse fields out of
> the pid structs -- I just haven't bothered returning utime, stime,
> cutime, cstime because it wasn't clear that most users would have any
> need for it while monitoring their tasks.

Basically it reads the contents
of /proc/*/stat, /proc/meminfo, /proc/stat, /proc/interrupts
and /proc/net/dev, waits for a time, reads them again and
reports the difference.
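The snapshot-and-diff approach described above can be sketched roughly as
follows. This is a minimal illustration in Python rather than the original
700-line Perl script (which is not reproduced here); the function names are
hypothetical, and only the aggregate "cpu" line of /proc/stat is handled,
whereas the real tool also diffs per-process stats, meminfo, interrupts and
network counters.

```python
import time

def parse_cpu_line(lines):
    """Extract jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    fields = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    for line in lines:
        if line.startswith("cpu "):
            return dict(zip(fields, map(int, line.split()[1:8])))
    return {}

def snapshot():
    """Take one sample of the system-wide CPU counters."""
    with open("/proc/stat") as f:
        return parse_cpu_line(f)

def diff_report(before, after):
    """Report the percentage of jiffies spent in each CPU state
    between two snapshots -- the 'what else ran here' figure."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values()) or 1
    return {k: 100.0 * v / total for k, v in delta.items()}

if __name__ == "__main__":
    # Sample, wait for the measurement interval, sample again, diff.
    before = snapshot()
    time.sleep(60)
    print(diff_report(before, snapshot()))
```

On a quiet compute node the idle percentage from such a report should be very
close to 100, which is exactly the figure quoted earlier in this thread.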

One of its really nice features (and presumably design requirements) is
that it consumes essentially no CPU cycles itself, so it can be used during
benchmark runs.

I'll send copies off-list to anyone who asks; it's a 700-line perl
script, so probably too big to post to the list.
