restartability problem
Kelley Wittmeyer
kelley at atmos.colostate.edu
Mon May 6 10:34:43 PDT 2002
Hi gang.
We've just stepped into the cluster arena and have everything working
except program restartability. I hope someone can shed some light
on the problem.
hardware: 16 nodes, dual-processor AMD Athlon MP on a Tyan Thunder
          Intel 10/100/1000 network cards (the switch is currently
          still 100 Mbit, however)
          lots of RAM, SCSI disks
software: Red Hat 7.2
          MPICH 1.2.x
          netCDF v3
          f90 (PGI 3.3-1)
          NFS (from the Red Hat distribution)
When our model is distributed over more than one process using MPI, we are
not getting correct output files. The output is a single direct-access
file in which each process writes to its own specific records.
This produces output files that are process-independent. This
technique has been used successfully on IBM SP, SGI O2K, and Apple G4
platforms. The symptoms are that the restart files are generally too
large (too many records); larger files (say, over 10MB) are always
corrupted, while smaller files (say, under 1MB) are generally, but not
always, OK. We suspect this is a problem with our file system setup (NFS).
Any ideas on how to fix this would be greatly appreciated.
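
To make the output scheme concrete, here is a minimal sketch of the
record-addressed write described above, together with the size check that
would catch the "too many records" symptom. It is Python rather than the
model's Fortran 90, and it is not the model's code: the record length, the
number of records per process, the block layout of records across ranks, and
the function names are all assumptions chosen for illustration. The ranks
are simply called one after another in a single process, so the sketch shows
the intended file layout and does not itself exercise concurrent writers
over NFS.

```python
import os
import tempfile

RECORD_LEN = 64        # bytes per record (assumed for illustration)
RECORDS_PER_RANK = 4   # records owned by each process (assumed)


def write_records(path, rank):
    """Write one rank's records at their fixed offsets in the shared file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for i in range(RECORDS_PER_RANK):
            # 0-based record number; a simple block layout is assumed here.
            rec = rank * RECORDS_PER_RANK + i
            payload = bytes([rec % 256]) * RECORD_LEN
            # Positional write: the offset depends only on the record number,
            # never on which other ranks have already written.
            os.pwrite(fd, payload, rec * RECORD_LEN)
    finally:
        os.close(fd)


def expected_size(nranks):
    """File size in bytes if every record was written exactly once."""
    return nranks * RECORDS_PER_RANK * RECORD_LEN


def check_file(path, nranks):
    """Return True if the file has the expected size and record contents."""
    if os.path.getsize(path) != expected_size(nranks):
        return False
    with open(path, "rb") as f:
        for rec in range(nranks * RECORDS_PER_RANK):
            if f.read(RECORD_LEN) != bytes([rec % 256]) * RECORD_LEN:
                return False
    return True


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "restart.dat")
    for rank in (2, 0, 3, 1):   # write order should not matter
        write_records(path, rank)
    print("file ok:", check_file(path, 4))
```

A check of this kind, run once against a local disk and once against the
NFS-mounted directory, might help separate a file system problem from a
problem in the model itself.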
Thank you in advance!
Kelley Wittmeyer
Atmospheric Science
Colorado State University