wheeler.mark at ensco.com
Mon May 13 09:53:50 PDT 2002
We have a problem writing and then reading files across the nodes.
We have a fluid dynamics production code running on a 16-node, 32-processor cluster (Portland compilers, ScaMPI, 933 MHz Pentium III processors with 512 MB of memory per node).
We have cross-mounted a disk (/home/cluster1) from the master node (n1) on all other nodes and can edit, read, write and copy files from any node to the disk with no trouble.
The production code writes separate binary files from each processor using different file names (processor 1 creates a file called f1...processor n creates a file called fn) but all files are written to the NFS-mounted disk which resides on n1.
Following completion of the production code, I run a routine that joins up the individual files into one large file. What I discovered is that some of these files created by the production code were corrupt (i.e. they contained extraneous bytes) which prevented my post-processing job from completing.
I noted that the extraneous bytes were never in the same location or the same file. I thought it was the application, so I wrote a small test job to mimic the fluid dynamics code by loading up the memory and then performing I/O. What I discovered is that under memory load I could usually reproduce the problem; when I just performed I/O (with minimal or no memory load), all files seemed to be OK. I repeated the test using formatted I/O and still got corrupt files. I also cross-mounted a disk on node4 (i.e. NFS-mounted on all other nodes) and created corrupt files there as well, suggesting that there is nothing unique (or bad) about the disk sectors on /home/cluster1.
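For anyone who wants to try reproducing this, here is a minimal sketch of the kind of test job described above. It is not the actual test code; it is a Python illustration, and the names make_pattern, write_under_load and first_mismatch, as well as the ballast_mb parameter, are all made up for this example. The idea is to write a known, deterministic pattern while holding a block of memory, then report the byte offset where the file on disk first differs from what was written.

```python
import os


def make_pattern(rank: int, nbytes: int) -> bytes:
    """Deterministic payload: a 16-byte record tagged with the rank, repeated."""
    record = f"rank{rank:04d}".encode().ljust(16, b".")
    reps = nbytes // len(record) + 1
    return (record * reps)[:nbytes]


def write_under_load(path: str, rank: int, nbytes: int, ballast_mb: int = 0) -> None:
    """Write the pattern to path while holding ballast_mb megabytes of memory."""
    ballast = bytearray(ballast_mb * 1024 * 1024)
    for i in range(0, len(ballast), 4096):
        ballast[i] = 1  # touch every page so the memory is really resident
    with open(path, "wb") as f:
        f.write(make_pattern(rank, nbytes))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out before returning


def first_mismatch(path: str, rank: int, nbytes: int):
    """Return the first differing byte offset, or None if the file is intact."""
    expected = make_pattern(rank, nbytes)
    with open(path, "rb") as f:
        actual = f.read()
    if actual == expected:
        return None
    for i, (a, b) in enumerate(zip(actual, expected)):
        if a != b:
            return i
    # One is a prefix of the other: the difference starts where the shorter ends.
    return min(len(actual), len(expected))
```

Each rank would call write_under_load on its own file (f1 ... fn) on the NFS-mounted disk, and first_mismatch would then be run from n1. Because the extraneous bytes show up at different offsets each time, recording the offset and the file length for every bad file may help show whether the damage lines up with page or NFS block boundaries.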
I then ran a test where I had the application and test job write to local disks (one on each node, not cross-mounted on any other node). Under these conditions I cannot reproduce the problem. However, I found that upon completion of my production job, I produced corrupt files when my script executed an rcp to collect the files from the individual nodes. I checked the files on the local disks of each node and they are fine (i.e. not corrupt, so my application code does NOT seem to be the culprit). When I redo the rcp for the handful of corrupt files, I get a successful transfer and can run my post-processing job to patch the files together.
It seems to me that this problem is somehow related to NFS-mounted disks and file transfers, perhaps under memory load, even though my production code completes BEFORE I execute the rcp.
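The check-and-retry step above could be automated along these lines. This is only a sketch under assumptions: the helper names sha256_of and corrupt_copies are invented for illustration, and it assumes both the original and the copy can be read from the host running the script (in practice the digest of the original would have to be computed on the node that owns the local disk).

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large output files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def corrupt_copies(pairs):
    """Given (original, copy) path pairs, return the pairs whose contents differ."""
    bad = []
    for src, dst in pairs:
        if sha256_of(src) != sha256_of(dst):
            bad.append((src, dst))
    return bad
```

The list returned by corrupt_copies is the set of files to transfer again before running the post-processing job. Comparing digests also gives a count of how many transfers fail per run, which may be useful when testing whether the failure rate changes with memory load.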
Does anyone have any thoughts on what might be causing this problem including additional tests we can perform to isolate the cause?