becker at scyld.com
Mon May 13 10:54:22 PDT 2002
On Mon, 13 May 2002, Wheeler.Mark wrote:
> We have a problem writing and then reading files across the nodes.
> Following completion of the production code, I run a routine that
> joins up the individual files into one large file. What I discovered is
> that some of these files created by the production code were corrupt
> (i.e. they contained extraneous bytes), which prevented my
> post-processing job from completing.
> It seems to me that this problem is somehow related to NFS-mounted
> disks and file transfers, perhaps under memory load (even though my
> production code completes BEFORE I execute the rcp).
Are you using NFS v2 or v3?
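You can check what each client actually negotiated from /proc/mounts (a sketch; `nfsstat -m` shows the same detail if it's installed, and the exact option names vary by kernel):

```shell
# List NFS mounts and their mount options; look for "v2"/"v3" or a
# "vers=" field in the option string to see which protocol is in use.
grep nfs /proc/mounts || echo "no NFS mounts found"
```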
What network hardware are you using?
Are you seeing any network errors reported in /proc/net/dev?
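A quick way to watch those counters (interface names like eth0 will vary per node):

```shell
# Print interface name, RX errors/drops, and TX errors/drops, using the
# standard /proc/net/dev column layout (two header lines, then one line
# per interface).
awk -F'[: ]+' 'NR > 2 { print $2, "rx_errs="$5, "rx_drop="$6, "tx_errs="$13, "tx_drop="$14 }' /proc/net/dev
```

Error counts that climb under load point at the NIC, driver, or cabling rather than at NFS itself.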
My first guess would be that the data is being corrupted in memory.
The next likely problem is that you are using network hardware that
computes and checks the UDP/IP packet checksum in the NIC, rather than
having the CPU compute the checksum.
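If your hardware does offload checksums, one quick experiment (assuming a driver and ethtool build that support the -K offload flags, and eth0 as the interface name) is to turn the offload off and re-run the job:

```shell
# Force the CPU to compute and verify checksums instead of the NIC; if
# the corruption disappears afterwards, suspect the NIC's checksum
# hardware or its driver.
ethtool -K eth0 rx off tx off
```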
It's also possible that your disk hardware is corrupting writes during
heavy PCI bus usage.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993