[Beowulf] Strange NFS corruption on Linux cluster to AIX 5.2 NFS server
Chris Samuel csamuel at vpac.org
Thu Oct 7 18:54:06 PDT 2004
- Previous message: [Beowulf] Storage - vendors
- Next message: [Beowulf] NPC2004: Call For Participation
Hi folks,

(system details at the end)

I'm having a real hard time trying to track down a really bizarre NFS-related issue on some clusters we're helping out on, and I'm wondering if anyone here quickly knows the answer to this question before I go off trawling through the kernel sources.

I have a 72K assembler file (the result of a day of narrowing down the problem) that, when I do:

    as -o /tmp/file.o file.s

generates a valid .o file, but when I do:

    as -o /some/nfs/directory/file.o file.s

creates a corrupted object file (and in the original case leads to a link error due to the corrupted ELF format).

However, cp'ing or cat'ing the object file from /tmp to the NFS filesystem is fine; it's just the assembler's output that is corrupted.

I thought that this was just an NFS problem until I used strace to dump out the entire contents of the file descriptors that 'as' reads from and writes to, for the assembler file and for the object file, and then diff'd them.

The only significant difference is that the write(2)s to the object files are not the same, which I find extremely puzzling: I can see no way that the assembler can generate different output depending on whether the file it has just open()'d is on NFS or local disk. :-(

My only thought is that strace (which uses ptrace(2)) is reading the data from the kernel at some point after it has been corrupted, presumably somewhere in the NFS parts of the kernel.

The problem with this file goes away (the MD5 matches that of the one created in /tmp) if I change rsize & wsize from 8192 to 4096, but then other object files get corrupted instead. :-(

We've tried this out on three nodes in the cluster, and they all corrupt the output file, so it's unlikely to be a particular hardware problem.
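For anyone wanting to repeat the comparison described above, here is a minimal sketch in Python. It is not what was run in the original investigation: it simply compares the MD5 of a known-good file against a suspect copy and lists which fixed-size blocks differ, which may help show whether the damage lines up with the 4096 or 8192 byte rsize/wsize boundaries. The function names and the command-line usage are illustrative assumptions.

```python
import hashlib
import sys
from pathlib import Path


def first_difference(good: bytes, bad: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            return offset
    # Common prefix matches; a length mismatch is itself a difference.
    if len(good) != len(bad):
        return min(len(good), len(bad))
    return None


def differing_blocks(good: bytes, bad: bytes, block_size: int = 8192):
    """Return the indices of block_size-sized blocks in which the data differ."""
    longest = max(len(good), len(bad))
    blocks = []
    for start in range(0, longest, block_size):
        if good[start:start + block_size] != bad[start:start + block_size]:
            blocks.append(start // block_size)
    return blocks


if __name__ == "__main__":
    # Hypothetical usage: python compare_obj.py /tmp/file.o /some/nfs/directory/file.o
    good = Path(sys.argv[1]).read_bytes()
    bad = Path(sys.argv[2]).read_bytes()
    print("md5 local:", hashlib.md5(good).hexdigest())
    print("md5 nfs:  ", hashlib.md5(bad).hexdigest())
    print("first differing offset:", first_difference(good, bad))
    # Block sizes chosen to match the rsize/wsize values mentioned above.
    for size in (4096, 8192):
        print(f"differing {size}-byte blocks:", differing_blocks(good, bad, size))
```

If the differing blocks always start on a multiple of the mount's wsize, that would point more towards the client's NFS write path than towards the assembler, though this sketch alone cannot prove either.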
What is hurting my brain is that there is a mirror of this cluster, both in OS installs (identical RPMs of the OS; in particular the same kernel, gcc, assembler & libraries were used) and in firmware (BIOS and firmware updates were from the same CD), where this problem does not occur at all.

In both situations the NFS server is an AIX 5.2 box. It is possible that there are minor differences there, but I cannot see how a difference in the NFS server could affect the output of the assembler on the Linux box before it goes anywhere near hitting the wire, let alone making it to the NFS server.

The mount options are identical (we've checked both /etc/fstab and /proc/mounts) and rpm -Va doesn't show any unusual discrepancies between the two clusters.

OS: RHEL3
Kernel: kernel-smp-2.4.21-15.EL
Binutils: binutils-2.14.90.0.4-35
NFS-utils: nfs-utils-1.0.6-21EL
Hardware: IBM x335 and IBM x345 dual Xeons.

cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
