[Beowulf] Re: Tracing down 250ms open/chdir calls

Mon Feb 16 05:24:00 PST 2009

Carsten Aulbert wrote:
> Hi all,
> 
> sorry in advance for this vague subject and also the vague email, I'm
> trying my best to summarize the problem:
> 
> On our large cluster we sometimes encounter the problem that our main
> scheduling processes are often in state D and in the end not capable
> anymore of pushing work to the cluster.

Hi Carsten

   Are you using a "standard" cluster scheduler (SGE, PBS, ...) or a 
locally written one?

> The head nodes are 8 core boxes with Xeon CPUs and equipped with 16 GB
> of memory, when certain types of jobs are running we see system loads of
> about 20-30 which might go up to 80-100 from time to time. Looking at
> the individual cores they are mostly busy with system tasks (e.g. htop
> shows 'red' bars).

Hmmm...  These are your head nodes?  Not your NFS server nodes?  Sounds 
like there are a large number of blocked IO processes ... try a

	vmstat 1

and look at the "b" column (usually second from left ... or nearly 
there).  Lots of blocked IO processes can have the affect of introducing 
significant latency into all system calls.

> stat -tt -c showed that several system calls of the scheduler take a
> long time to complete, most notably open and chdir which took between
> 180 and 230ms to complete (during our testing). Since most of these open
> and chdir are via NFSv3 I'm including that list as well. The NFS servers
> are Sun Fire X4500 boxes running Solaris 10u5 right now.

Hmmm.... What happens if you make these local to each box?  What are the 
mount options for the mount points?  We have spoken to other users with 
performance problems on such servers.  The NFS server/OS combination you 
indicate above isn't known to be very fast.  This isn't your problem 
(see later), but it looks like your own data suggests you are giving up 
nearly an order of magnitude performance using this NFS server/OS 
combination, likely at a premium price as well.

> A standard output line looks like:
> 93.37   38.997264     230753      169    78      open
> 
> i.e. 93.37% of the system-related time was spend in 169 successful open
> calls which took 230753us/call, thus 39 wall clock seconds were spend in
> a minute just doing open.
> 
> We tried several things to understand the problem, but apart from moving
> more files (mostly log files of currently running jobs) off NFS we did
> not move far ahead so far. On
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/H2Problems
> we have summarized some things.

I didn't catch if these are tcp/udp mounts.  What are the mount options?

> With the help of 'stress' and a tiny program just doing open/putc/close
> into a single file, I've tried to get a feeling how good or bad things
> are when compared to other head nodes with different tasks/loads:
> 
> https://n0.aei.uni-hannover.de/twiki/bin/view/ATLAS/OpenCloseIotest

... and these results do in fact confirm the comment I made about the 
NFS server/OS combo not known to be fast.

 From my laptop running a Ubuntu 8.10 variant over gigabit to a known 
very slow NFS server ...

./measure.exe /misc/crunch/home/landman/test2
Wrote 24905 times into /misc/crunch/home/landman/test2, speed is 
421.14/s, time elapsed: 59.14s

from a different server in a different location to a JackRabbit-S server 
over NFS on a single gigabit with default mounts (this server running a 
long IOzone)

./measure.exe /mnt/d/1
Wrote 124312 times into /mnt/d/1, speed is 2072.99/s, time elapsed: 59.97s

and from the same starting point but to a JackRabbit-S server over NFS 
on a single gigabit with default mounts (this server idle)

./measure.exe /mnt/d/x
Wrote 134426 times into /mnt/d/x, speed is 2270.79/s, time elapsed: 59.20s

This system would be roughly similar to your d02 machine.

ly
Assuming you aren't using mount options of noac,sync,...  Could you 
enlighten us as to what mount options are for the head nodes?

Also, the way the code is written, you are doing quite a few calls to 
gettimeofday ... you could probably avoid this with a little re-writing 
of the main loop.

If you are using noac or sync on your NFS mounts, then this could 
explain some of the differences you are seeing (certainly the 100/s vs 
800/s ... but not likely the 4/s)

However, if you notice that h2 in your table is an apparent outlier, 
there may be something more systematic going on there.  Since you 
indicate there is a high load going on while you are testing, this is 
likely what you need to explore.

Grab the atop program.  It is pretty good about letting you explore what 
is causing load.  That is, despite your x4500's/Solaris combo showing 
itself not to be a fast NFS server the problem appears to be more likely 
more localized on the h2 machine than on the NFS machines.

http://freshmeat.net/projects/atop/

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615