[Beowulf] serious NFS problem on Mandrake 10.0

David Mathog mathog at mendel.bio.caltech.edu
Tue Dec 7 09:59:33 PST 2004


Greg Lindahl pointed out the problem - soft NFS mounts.  I'm still
not entirely clear why this was triggering problems in the
original case, where the entire set of NFS copies took <4.5
seconds.  From my understanding of timeo and retrans there
should have been delays of at least 0.7 + 1.4 + 2.8 seconds for
3 retransmissions before a major timeout was declared.  That adds
up to 4.9 seconds.  In any case, changing the mount to "hard"
eliminated the problem.  Maybe it only happened when there was a
short burst of other network activity at the same time?
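For the record, the hard-mount fix expressed as an /etc/fstab line might
look like this (the server name "master" and the paths are placeholders,
not our real ones; timeo is in tenths of a second):

```
# compute-node /etc/fstab -- "master" and /wherever are placeholders
# hard,intr: retry indefinitely but allow operations to be interrupted
# timeo=7 is a 0.7 s initial timeout; retrans=3 is the usual default
master:/wherever  /wherever  nfs  hard,intr,timeo=7,retrans=3  0 0
```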

Oddly, leaving the mount at soft and changing retrans to 50
did not completely eliminate the problem when the test script
ran; it only reduced the errors to 2 out of 1800 transfers.  The
script ran in <360 seconds, so apparently the 60 second minor
timeout is promoted to a major timeout no matter how many
retransmissions are left.  Either that, or something else went
wrong that was unrelated to retrans.
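As a sanity check on that arithmetic, here is a back-of-the-envelope
model of the retransmission delays, assuming the behaviour documented
in nfs(5): timeo doubles on each retransmission, capped at 60 seconds
per try.  This is a model of the documented scheme, not kernel source
truth:

```shell
# Sum the per-try timeouts for timeo=7 (0.7 s), doubling on each
# retransmission and capped at 60 s per try, per nfs(5).
awk 'BEGIN {
    t = 0.7; total = 0
    for (i = 0; i < 3; i++) { total += (t < 60 ? t : 60); t *= 2 }
    printf "retrans=3: %.1f s\n", total   # 0.7 + 1.4 + 2.8 = 4.9
}'
```

With retrans=50 the same loop saturates at 60 s per try after a handful
of doublings, which is consistent with the 60 second timeout dominating
no matter how many retransmissions remain.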

Is there any facility in Linux, or as an add-on, to serialize file
transfers?  In other words, in this case we know that N files of
roughly the same size must be transferred to one disk.  The current
method sends the data asynchronously, so there is some interference
between the nodes, and it also hops the disk head around as it
tries to write to all N files simultaneously.  (Subject to whatever
the disk subsystem can do to sort that out.)  Ideally, rather than
just doing "mv /tmp/blah.nodename /wherever" on each compute node
in this situation, a script could instead do:

   "queuemv /tmp/blah.nodename /wherever"

where "queuemv" would take care of moving the data as fast as
possible over the network _without contention_, writing it
sequentially to the N files, one file at a time.  Is there
something like queuemv available?  I can see how to do this
with a standard SGE-style qsub, but the overhead of a
conventional queue system is awfully high for this particular
application.  If it were just one node then my msgqueue
application (a command line interface to System V message
queues) could be used:

ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/msgqueue.html

but this particular operation requires synchronization between
multiple nodes, and System V IPC doesn't share message queues
across the network.  Hmm, I suppose each node could rsh to the
master node and run msgqueue in a script there.  Alternatively,
is there a network/cluster variant of System V IPC?

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


