josip at icase.edu
Mon May 13 11:19:53 PDT 2002
> The production code writes separate binary files from each processor
> using different file names (processor 1 creates a file called f1...
> processor n creates a file called fn) but all files are written to
> the NFS-mounted disk which resides on n1.
> Following completion of the production code, I run a routine that
> joins up the individual files into one large file. What I discovered
> is that some of these files created by the production code were
> corrupt (i.e. they contained extraneous bytes) which prevented my
> post-processing job from completing.
Try all of the following:
(1) Insert a short sleep (3 seconds) before starting the joining routine
(2) Increase the number of nfsd threads on your server to about "n" (one per writing processor)
(3) Use the "noac" option in your /etc/fstab files
(4) If you use soft NFS mounts: add the "retry=10" option in /etc/fstab
We had a similar problem, which we believe was caused by the timing of
events. The joining process was told to proceed as soon as the last of
the processes completed, which was before NFS had actually sent all of
the files to the server.
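
If a fixed sleep feels too fragile, the joining routine can instead poll
the files until they stop changing. Here is a minimal sketch in Python,
not our actual code: the function names, the polling interval and the
"unchanged for a few polls" rule are all assumptions made for
illustration. Note that with NFS attribute caching enabled, the sizes
seen by the client may themselves be stale, so this probably works best
together with "noac" or with a polling interval longer than the cache
timeout.

```python
import os
import shutil
import time


def wait_until_stable(paths, settle_checks=3, poll_interval=1.0, timeout=60.0):
    """Block until every file exists and all sizes have stayed the same
    for `settle_checks` consecutive polls; return the final sizes.

    Hypothetical helper: a stable size is only a heuristic for
    "the writer has finished and the data has reached the server".
    """
    deadline = time.monotonic() + timeout
    last_sizes = None
    unchanged = 0
    while True:
        try:
            sizes = [os.stat(p).st_size for p in paths]
        except FileNotFoundError:
            sizes = None  # at least one file has not appeared yet
        if sizes is not None and sizes == last_sizes:
            unchanged += 1
            if unchanged >= settle_checks:
                return sizes
        else:
            unchanged = 0  # something changed (or is missing): start over
        last_sizes = sizes
        if time.monotonic() >= deadline:
            raise TimeoutError("files did not settle within %.1f s" % timeout)
        time.sleep(poll_interval)


def join_files(paths, out_path):
    """Concatenate `paths`, in the order given, into `out_path`."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
```

With files named f1 ... fn as in your description, the joining step would
call wait_until_stable() on the list of names and only then join_files().
If the expected size of each file is known in advance, comparing against
that size would be a stronger check than waiting for the sizes to settle.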
Better yet: use MPI instead of NFS to retrieve your code's results.
Dr. Josip Loncaric, Research Fellow mailto:josip at icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134