restartability problem
Robert G. Brown rgb at phy.duke.edu
Mon May 6 11:40:05 PDT 2002
- Previous message: restartability problem
- Next message: nfs issues
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 6 May 2002, Kelley Wittmeyer wrote:

> When our model is distributed over more than 1 process using MPI, we are
> not getting the correct output files. The output file is a direct access
> file in which each process writes to specific records in this single
> file. This produces output files that are process independent. This

When you say "direct access file" just what do you mean? hash? btree?
flat (recno)?

If it is flat (and I'd guess that it is) it sounds like the processes
aren't successfully locking the file when they attempt their independent
writes -- I get corruption just like this if I accidentally try opening
and writing to an ordinary file that is already open and being written
to by another process. Writes are not guaranteed to be atomic, and the
two programs can interleave the writing back of their images and get all
sorts of "interesting" results like you describe. Collisions will be
more likely for larger files, of course.

For help towards a solution (if you think this is the problem after
looking at it further) see "man open" and read about the O_EXCL flag and
NFS. It sounds like you may have to play some games with link() and
stat() for each MPI process to get the lock on the file in turn.

Alternatively, you could have each process write to its own unique file
and do a merge, or you could carefully control the timing of the
dbopens, writes and closes so that two processes can never have the file
open at once.

Hope this helps. Although your problem seems familiar, I don't usually
write to dbopen db's and it might be something else entirely.

   rgb

> technique has been used successfully on IBM SP, SGI O2K, and Apple G4
> platforms. Problem symptoms are that the restart files are generally too
> large (too many records) and larger files (say, over 10MB) are always
> corrupted, but smaller files (say, under 1MB) are generally, but not
> always, ok. We suspect this is a problem w/ our file system setup (NFS).
> Any ideas on how to fix this would be greatly appreciated.
>
> Thank you in advance!
> Kelley Wittmeyer
> Atmospheric Science
> Colorado State University
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
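
For readers who want to see what the link()/stat() games mentioned above
might look like, here is a minimal sketch following the recipe in the
open(2) man page. It is not code from this thread: Python stands in for
whatever language the model is written in, and the names acquire_lock,
release_lock and write_record, the lock-file naming scheme, and the
timeout values are all assumptions made for illustration. Whether it
behaves well on a given cluster also depends on the NFS version and
mount options (attribute and data caching in particular), so treat it
as a starting point for experiments rather than a fix.

```python
import os
import socket
import time


def acquire_lock(lock_path, timeout=30.0, poll=0.05):
    """Try to take an advisory lock at lock_path; return True on success."""
    # Hypothetical naming scheme: one scratch file per host and process,
    # created on the same filesystem as the lock file itself.
    unique = "%s.%s.%d" % (lock_path, socket.gethostname(), os.getpid())
    fd = os.open(unique, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)
    deadline = time.monotonic() + timeout
    try:
        while True:
            try:
                # link() fails if lock_path already exists, i.e. is held.
                os.link(unique, lock_path)
                return True
            except OSError:
                # Over NFS, link() may report an error even though it
                # succeeded, so check the link count of our scratch file.
                if os.stat(unique).st_nlink == 2:
                    return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
    finally:
        # The scratch name is no longer needed; lock_path (if we got it)
        # keeps the inode alive until release_lock() removes it.
        os.unlink(unique)


def release_lock(lock_path):
    """Give the lock up so the next process can take it."""
    os.unlink(lock_path)


def write_record(data_path, recno, reclen, payload):
    """Write one fixed-length record at 0-based index recno."""
    if len(payload) != reclen:
        raise ValueError("payload must be exactly reclen bytes")
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, recno * reclen, os.SEEK_SET)
        os.write(fd, payload)
        os.fsync(fd)  # push the data out before the lock is released
    finally:
        os.close(fd)
```

Each MPI process would call acquire_lock() before opening the shared
restart file, write its records, close the file, and then call
release_lock(), so that only one process has the file open at a time.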
