[Beowulf] Strange NFS corruption on Linux cluster to AIX 5.2 NFS server

Chris Samuel csamuel at vpac.org
Thu Oct 7 18:54:06 PDT 2004


Hi folks,

(system details at the end)

I'm having a real hard time trying to track down a really bizarre NFS-related 
issue on some clusters we're helping out on, and I'm wondering if anyone here 
happens to know the answer to this question before I go off trawling through 
the kernel sources.

I have a 72K assembler file (the result of a day of narrowing down the 
problem). When I run:

 as -o /tmp/file.o file.s

I get a valid .o file, but when I run:

 as -o /some/nfs/directory/file.o file.s

I get a corrupted object file (which in the original case leads to a link 
error due to the corrupted ELF format).

However, cp'ing or cat'ing the object file from /tmp to the NFS filesystem is 
fine; it's just the assembler's output that is corrupted.

I thought that this was just an NFS problem until I used strace to dump out the 
entire contents of the file descriptors that 'as' reads from and writes to for 
the assembler file and for the object file, and then diff'd them.

The only significant difference is that the write(2)s to the object files 
are not the same, which I find extremely puzzling. I can see no way that the 
assembler can generate different output depending on whether the file it's 
just open()'d is on NFS or local disk. :-(

My only thought is that strace (which uses ptrace(2)) is reading the data from 
the kernel at some point after it has been corrupted, presumably at some 
point in the NFS parts of the kernel.

The problem with this file goes away (the MD5 matches that of the one created 
in /tmp) if I change rsize & wsize from 8192 to 4096, but then other object 
files get corrupted instead. :-(
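
Since the symptom moves when rsize/wsize change, one thing worth checking is 
whether the damaged bytes line up with transfer-size boundaries. The following 
is only a minimal sketch of such a check, not something from the original 
investigation: the function names md5sum and differing_blocks are hypothetical, 
and it assumes both the good (/tmp) and bad (NFS) copies are small enough to 
read into memory.

```python
import hashlib


def md5sum(path, bufsize=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_blocks(good_path, bad_path, block_size=8192):
    """Compare two files block by block.

    Returns a list of (block_index, first_differing_offset) tuples, one per
    block of block_size bytes in which the files differ. block_size is meant
    to be set to the rsize/wsize in use (8192 or 4096 here).
    """
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()

    result = []
    for start in range(0, max(len(good), len(bad)), block_size):
        g = good[start:start + block_size]
        b = bad[start:start + block_size]
        if g == b:
            continue
        # Find the first byte that differs; if one block is a prefix of the
        # other (files of different length), report the end of the shorter.
        common = min(len(g), len(b))
        first = next((i for i in range(common) if g[i] != b[i]), common)
        result.append((start // block_size, start + first))
    return result
```

Calling differing_blocks("/tmp/file.o", "/some/nfs/directory/file.o", 8192) 
and again with 4096 would show whether the first bad offset in each block sits 
at or near a multiple of the transfer size.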

We've tried this out on three nodes in the cluster, and they all corrupt the 
output file, so it's unlikely to be a particular hardware problem.

What is hurting my brain is that there is a mirror of this cluster both in OS 
installs (identical RPMs of the OS, especially kernel, gcc, assembler & 
libraries were used) and in firmware (BIOS and firmware updates were from the 
same CD) where this problem does not occur at all.

In both situations the NFS server is an AIX 5.2 box. It is possible that there 
are minor differences there, but I cannot see how a difference in the NFS 
server could affect the output of the assembler on the Linux box before it 
goes anywhere near hitting the wire, let alone making it to the NFS server.

The mount options are identical (we've checked both /etc/fstab 
and /proc/mounts) and rpm -Va doesn't show any unusual discrepancies between 
the two clusters.
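
For anyone wanting to repeat the mount-option comparison mechanically, here is 
a rough sketch that diffs the NFS options found in two saved copies of 
/proc/mounts (one from each cluster). The helper names nfs_mount_options and 
option_differences are made up for illustration, and it assumes the usual 
whitespace-separated /proc/mounts layout of device, mount point, filesystem 
type and options.

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to its set of mount options.

    mounts_text is the content of /proc/mounts as a string.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4"):
            result[mount_point] = set(options.split(","))
    return result


def option_differences(text_a, text_b):
    """Return {mount_point: (only_in_a, only_in_b)} for NFS mounts whose
    options differ between the two /proc/mounts snapshots."""
    a = nfs_mount_options(text_a)
    b = nfs_mount_options(text_b)
    diffs = {}
    for mount_point in sorted(set(a) | set(b)):
        opts_a = a.get(mount_point, set())
        opts_b = b.get(mount_point, set())
        if opts_a != opts_b:
            diffs[mount_point] = (sorted(opts_a - opts_b),
                                  sorted(opts_b - opts_a))
    return diffs
```

An empty result means the NFS options match; note that options such as addr= 
will legitimately differ if the two clusters talk to different servers.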

OS:  RHEL3
Kernel: kernel-smp-2.4.21-15.EL
Binutils: binutils-2.14.90.0.4-35
NFS-utils: nfs-utils-1.0.6-21EL

Hardware: IBM x335 and IBM x345 dual Xeons.

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
