Writing/Reading Files
Donald Becker
becker at scyld.com
Mon May 13 10:54:22 PDT 2002
On Mon, 13 May 2002, Wheeler.Mark wrote:
> We have a problem writing and then reading files across the nodes.
...
> Following completion of the production code, I run a routine that
> joins up the individual files into one large file. What I discovered is
> that some of these files created by the production code were corrupt
> (i.e. they contained extraneous bytes) which prevented my
> post-processing job from completing.
>
> It seems to me that this problem is somehow related to NFS mounted
> disks and file transfers, perhaps under memory load (even though my
> production code completes BEFORE I execute the rcp).
Are you using NFS v2 or v3?
What network hardware are you using?
Are you seeing any network errors reported in /proc/net/dev?
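
As a rough way to answer the last question, the per-interface counters in /proc/net/dev can be scanned for non-zero error fields. The sketch below is only an illustration: the function names are hypothetical, and the column order is an assumption based on the usual Linux layout (eight receive counters followed by eight transmit counters after two header lines), so check it against the header on your own nodes.

```python
# Assumed column order of /proc/net/dev after the two header lines.
RX_NAMES = ["bytes", "packets", "errs", "drop", "fifo", "frame",
            "compressed", "multicast"]
TX_NAMES = ["bytes", "packets", "errs", "drop", "fifo", "colls",
            "carrier", "compressed"]
FIELDS = ["rx_" + n for n in RX_NAMES] + ["tx_" + n for n in TX_NAMES]

# Counters that point at a problem when they are non-zero.
WATCHED = ("rx_errs", "rx_drop", "rx_fifo", "rx_frame",
           "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier")


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")  # also handles "eth0:5000"
        values = [int(v) for v in rest.split()]
        stats[name.strip()] = dict(zip(FIELDS, values))
    return stats


def interfaces_with_errors(stats):
    """Return only the interfaces and counters that show non-zero errors."""
    report = {}
    for iface, counters in stats.items():
        bad = {k: v for k, v in counters.items() if k in WATCHED and v}
        if bad:
            report[iface] = bad
    return report


def check_node(path="/proc/net/dev"):
    """Read the counters on the local node and return the error report."""
    with open(path) as f:
        return interfaces_with_errors(parse_net_dev(f.read()))
```

Calling check_node() on each node before and after a production run, and comparing the two reports, shows whether the error counters grew during the run.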
My first guess would be that the data is being corrupted in memory.
The next most likely problem is that you are using network hardware that
computes and checks the UDP/IP packet checksum in the NIC, rather than
having the CPU compute the checksum.
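
For reference, the checksum in question is the 16-bit one's-complement sum described in RFC 1071. The sketch below is a simplified illustration, not the kernel's or any NIC's implementation: it checksums a bare payload and leaves out the UDP pseudo-header, and the function name is hypothetical.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                     # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")  # example words from RFC 1071
good = internet_checksum(payload)

# Flip one bit to imitate corruption; the checksum no longer matches.
damaged = bytearray(payload)
damaged[3] ^= 0x04
assert internet_checksum(bytes(damaged)) != good
```

The check only covers the path between the two points where the sum is computed and verified. When the CPU does the work, that path runs from host memory to host memory; when the NIC does it, the bus transfer between the NIC and memory is not covered.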
It's also possible that your disk hardware is corrupting writes during
heavy PCI bus usage.
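
One way to narrow down where the extra bytes appear is to record a digest of each file on the node that wrote it and compare it with a digest taken after the copy. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and a read-back on the same machine may be served from the page cache rather than the disk, so a match there does not prove the on-disk copy is good.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_and_verify(path, data):
    """Write data, force it out with fsync, then read it back and compare."""
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())                # ask the OS to push data to storage
    return sha256_of_file(path) == expected


def same_contents(path_a, path_b):
    """Compare two copies of a file, e.g. before and after an rcp."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Running sha256_of_file on the compute node's local copy and again on the copied file tells you whether the corruption was already present before the transfer.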
--
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993