[Beowulf] Strange NFS corruption on Linux cluster to AIX 5.2 NFS server
csamuel at vpac.org
Thu Oct 7 18:54:06 PDT 2004
(system details at the end)
I'm having a really hard time tracking down a bizarre NFS-related issue on
some clusters we're helping out on, and I'm wondering if anyone here knows
the answer offhand before I go off trawling through the kernel sources.
I have a 72K assembler file (the result of a day of narrowing down the
problem). When I do:
as -o /tmp/file.o file.s
it generates a valid .o file, but when I do:
as -o /some/nfs/directory/file.o file.s
it creates a corrupted object file (which in the original case led to a link
error due to the corrupted ELF format).
However, cp'ing or cat'ing the object file from /tmp to the NFS filesystem is
fine; it's just the assembler's output that is corrupted.
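For anyone wanting to reproduce the comparison, here is a minimal sketch of a byte-for-byte check between the local and the NFS copy. The function name compare_objects is just an illustration, not something from the original test; it relies only on the standard cmp and head utilities.

```shell
# Compare two copies of the same object file byte for byte.
# compare_objects is a hypothetical helper name used for illustration.
compare_objects() {
    local_obj=$1
    nfs_obj=$2
    if cmp -s "$local_obj" "$nfs_obj"; then
        echo "identical"
        return 0
    fi
    echo "different"
    # cmp -l lists each differing byte: offset (decimal, 1-based),
    # then the byte value in each file (octal). Show the first 20.
    cmp -l "$local_obj" "$nfs_obj" | head -n 20
    return 1
}
```

After running the two as commands above, it would be called as compare_objects /tmp/file.o /some/nfs/directory/file.o; the offsets it prints may hint at whether the damage lines up with rsize/wsize boundaries.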
I thought that this was just an NFS problem until I used strace to dump out the
entire contents of the file descriptors that 'as' reads from and writes to for
the assembler file and for the object file, and then diff'd them.
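A rough sketch of how such a capture can be done with strace's dump options follows. The function names and log paths are made up for illustration, and for simplicity it dumps all descriptors rather than only the two of interest; the assumption that hex-dump lines start with " | " is based on strace's usual output format.

```shell
# Run the assembler under strace, logging every read(2) and write(2).
# -e read=all and -e write=all ask strace for a full hex and ASCII dump
# of the data on all file descriptors.
# trace_assembler is a hypothetical helper name.
trace_assembler() {
    src=$1
    obj=$2
    log=$3
    strace -o "$log" -e trace=read,write \
        -e read=all -e write=all as -o "$obj" "$src"
}

# Keep only the hex-dump lines of a log (assumed to start with " | "),
# so that two runs can be diffed on the data alone.
hex_dump_lines() {
    grep '^ | ' "$1"
}
```

One run would log the /tmp case and another the NFS case, for example trace_assembler file.s /tmp/file.o local.log followed by trace_assembler file.s /some/nfs/directory/file.o nfs.log, after which the two logs (or their hex_dump_lines output) can be passed to diff.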
The only significant difference is that the write(2)s to the object files
are not the same, which I find extremely puzzling. I can see no way that the
assembler can generate different output depending on whether the file it has
just open()'d is on NFS or local disk. :-(
My only thought is that strace (which uses ptrace(2)) is reading the data from
the kernel at some point after it has been corrupted, presumably at some
point in the NFS parts of the kernel.
The problem with this file goes away (the MD5 matches that of the one created
in /tmp) if I change rsize & wsize from 8192 to 4096, but then other object
files get corrupted instead. :-(
We've tried this out on three nodes in the cluster, and they all corrupt the
output file, so it's unlikely to be a particular hardware problem.
What is hurting my brain is that there is a mirror of this cluster both in OS
installs (identical RPMs of the OS, especially kernel, gcc, assembler &
libraries were used) and in firmware (BIOS and firmware updates were from the
same CD) where this problem does not occur at all.
In both situations the NFS server is an AIX 5.2 box. It is possible that there
are minor differences there, but I cannot see how a difference in the NFS
server could affect the output of the assembler on the Linux box before it
goes anywhere near hitting the wire, let alone making it to the NFS server.
The mount options are identical (we've checked both /etc/fstab
and /proc/mounts) and rpm -Va doesn't show any unusual discrepancies between
the two clusters.
Hardware: IBM x335 and IBM x345 dual Xeons.
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia