Writing/Reading Files

Mon May 13 14:04:06 PDT 2002

On Mon, 13 May 2002, Wheeler.Mark wrote:

> We have a problem writing and then reading files across the nodes.
>
> We have a fluid dynamics production code running on a 16-node, 32
> processor cluster (Portland compilers, ScaMPI, 933 MHz Pentium III
> processors with 512 MB of memory per node).
>
> We have cross mounted a disk (/home./cluster1) from the master node
> (n1) with all other nodes and can edit, RW and copy files from any node
> to the disk with no trouble.
>
> The production code writes separate binary files from each processor
> using different file names (processor 1 creates a file called
> f1...processor n creates a file called fn) but all files are written to
> the NFS-mounted disk which resides on n1.
>
> Following completion of the production code, I run a routine that
> joins up the individual files into one large file. What I discovered is
> that some of these files created by the production code were corrupt
> (i.e. they contained extraneous bytes) which prevented my
> post-processing job from completing.

How big is n?  How big are the f files?  Which kernel are you using?  If
you run the routine more than once and do NOT destroy the fn files in
between, do you get the right result the second or third time (always)
but not the first?

When you say "following completion" do you mean after (well after) the
production code has closed the files and exited, or do you mean
"immediately within milliseconds to seconds after" or even "during"?

To be specific, your application aside:

if one creates n files called f1...fn with (say) perl so that all n
files are sitting in a directory and can be read, one at a time, with
less or otherwise (md5sum) validated, can you run

  cat f* > big

or even more conservatively (tcsh, obvious equivalent in bash)

  cp /dev/null big
  foreach f (f*)
    > cat $f >> big
    > end

and find that big is corrupted, or is it only corrupted if a single
application is creating the f's at more or less the same time that the
concatenation is occurring?

I'd be moderately surprised if cat f* > big fails, provided that the
files are closed and actually written back to the NFS server at the time
the cat is initiated.  If the files are "closed" but still in cache or
buffer (perhaps delayed because of memory being hammered) then you're
likely reading from the files before they properly exist.

In this case, your real problem is in using the ASYNCHRONOUS NFS as an
IPC mechanism.  How can you tell when NFS has completed the write back
to disk on another host over a variable load network and server setup?
If you try to open the file before it is completed, you'll get
truncation and garbage as it probably won't even have an EOF mark.  

If this is the problem, you might still achieve success by any of many
means:

  a) increasing NFS server capacity, tuning.  Frankly, I wouldn't
recommend this hardly makes it reliable, only more reliable.

  b) increasing the delay between running and reaping.  Run now, reap
minutes to hours later, not milliseconds to seconds later.  This still
isn't reliable, but might well be "reliable enough" as NFS generally
completes writebacks as rapidly as it can, and few indeed won't be done
an hour after the file is closed unless the server load is
stratospheric.

This is what I do and I've never encountered this problem for N up to
dozens EXCEPT when one of the files I've tried reaping was actually not
yet closed by the writing application and written back to disk and was
in fact actively writing.  On the other hand, it works fine to graze
"results to date" that ARE written and flushed out to disk (after a
delay of seconds, to ensure that NFS has time to get them with a very
high degree of probability), even from a still-open file, as long as the
result is well and truly flushed and the writeback is completed.

  c) add some removable, terminating "feature" to the file that can be
used to flag proper completion -- e.g. a special EOF marker like <all
done now> as the last line.  That way your reaper application can loop,
polling (closing and reopening) EACH file until this marker appears,
then do the aggregation of that file.  Kludgy and wasteful, but
reliable.

  d) use a reliable messaging system -- mpi, pvm, raw sockets -- instead
of NFS.

Note well that many folks wouldn't consider even tcp sockets any more
"reliable" as an IPC mechanism than NFS (that is to say, not very)
without a fair bit of effort.  Precisely the same issues exist -- TCP
doesn't grok "message", is asynchronous (so you can never be sure a read
is "finished") and yet is "reliable" in that sooner or later it
sort-of-guarantees that whatever you write on one socket end will be
delivered, reliably, sequenced and all, to be read at the other end.
Consequently, one has to either decide on a message terminator at both
ends (often a \n, for example, in line-at-a-time contexts) or start your
message with the length of the message to be sent (as http 1.1 does in
the Content_Length message header, which is ITSELF sent on the \n line
at a time basis with a blank line as a terminator).

If you use NFS as a messaging protocol, you will probably need to adopt
a similar trick to ensure that NFS has a chance to behave "reliably" in
an asynchronous environment where the NFS protocol CANNOT guarantee that
it is finished reliably writing a file, only that given time
(uninterrupted by a crash, of course) that it WILL reliably finish
writing the file.  Note well that unlike a raw socket (where you can
just keep reading until you get the whole message) you will likely have
to start an NFS read OVER that hasn't finished when you start --
otherwise buffering and the lack of an EOF marker will probably
introduce both truncation and corruption.

This is not an NFS bug, it is just that you aren't correctly
anticipating its features.

If tests above reveal the problem, I'd recommend c) or d) as a solution,
depending on how efficient you need to be (c is fast to implement but
kludgy and expensive in runtime, but may be fast ENOUGH for all that, d
is more difficult to implement but can be made as efficient as your
network will stand).

HOWEVER, it is POSSIBLE that NFS has actual bugs in it (it certainly has
in the past:-), and even possible that those bugs are load or memory
occupancy dependent or (more likely) hardware dependent.  That's why the
kernel revision matters, and why you need to proceed to demonstrate that
the corruption occurs reproducibly with mundane tools, not just with a
custom written application that might be corrupting things for reasons
that have nothing to do with the "reliability" of NFS.

If you can make the corruption consistently occur for a cat of (say)
1024 dummy f-files that are DEFINITELY closed and totally static --
files that can be cat'd one at a time or md5'd to the exactly correct
checksum -- you should probably report this to the kernel and/or NFS
lists.  I'd be a bit surprised if you can, because as I think I
mentioned in a previous post, you can use e.g. tiobench to absolutely
hammer on an NFS mount (much harder than my modest application of cat,
or even than a bunch of dd if=/dev/zero of=testfile count=1000000&'s)
and I know that at least one of the NFS guys does.  OTOH, that SAME guy
readily acknowledges NFS problems in certain 2.4.x snapshots.

> I noted that the extraneous bytes were never in the same location or
> same file.  I thought it was the application so I wrote a small test job
> to mimic the fluid dynamics code in loading up the memory and then
> performing I/O. What I discovered is that under memory load, I could
> usually reproduce the problem - when I just performed I/O (with minimal
> or no memory load), all files seemed to be OK. I repeated the test using
> formatted I/O and still get corrupt files. I also cross mounted a disk
> on node4 (i.e. NFS mounted with all other nodes) and created corrupt
> files there as well suggesting that there is nothing unique (or bad)
> with disk sectors on /home/cluster1.
>
> I then ran a test where I had the application and test job write to
> local disks (on each node but not cross mounted with any other nodes).
> Under these conditions I cannot reproduce the problem. However, I found
> that upon completion of my production job, I produced corrupt files when
> my script executes a rcp to collect files from the individual nodes. I
> checked the files on the local nodes of each disk and they are fine
> (i.e. not corrupt so my application code does NOT seem to be the
> culprit). When I redo the rcp for the handful of corrupt files, I get
> successful transfer and can run my post-processing job to patch the
> files together.
>
> It seems to me that this problem is somehow related to NFS mounted
> disks and file transfers perhaps under memory load (i.e. even though my
> production code completes BEFORE I execute the rcp).

I'm trying to understand this one.  Are you saying that you do something
like this:

rgb at ganesh|T:284>dd if=/dev/zero of=/tmp/dummy count=1024
1024+0 records in
1024+0 records out
rgb at ganesh|T:285>md5sum /tmp/dummy
59071590099d21dd439896592338bf95  /tmp/dummy

(where /tmp is local disk space) and then scp the result to a local disk
(NFS mounted or not):

rgb at ganesh|T:286>scp /tmp/dummy .
dummy 100% |*****************************| 512 KB 00:00    
rgb at ganesh|T:287>md5sum dummy
59071590099d21dd439896592338bf95  /home/einstein/rgb/dummy

and get DIFFERENT md5sums?  I find that very, very hard to understand,
and am not sure what it has to do with NFS (as the local target
directory may or may not be an NFS mount).  That would sound like a
fairly serious memory corruption problem.

In summary: Try inserting an md5 checksum step in between completion and
the scp, and playing some of the games above to try to isolate whether
or not the problem is just that you're not waiting for or testing
whether an ASYNCHRONOUS process has completed before proceeding.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu