[Beowulf] serious NFS problem on Mandrake 10.0

Fri Dec 3 15:23:02 PST 2004

> > This is very, very, VERY bad. 
> 
> indeed.  is it safe to assume your machines are quite stable
> (memtest86-wise)?

Yes.  They run days and days without any errors.  (S2466 motherboards
with single Athlon MP 2200+ processors, ECC enabled.)

>the fact that it's 32K is interesting, since 
> I suspect your NFS block size is that (see /proc/mounts to verify).

Yes, that is the size for /u1 in /proc/mounts. 
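
For anyone wanting to make the same check, here is a minimal sketch that pulls rsize and wsize out of a mounts table. The function name nfs_block_sizes is just for illustration, and it assumes the options appear as rsize=N and wsize=N in the fourth field, as they do in /proc/mounts on Linux.

```shell
#!/bin/sh
# Print mount point, rsize and wsize for every NFS mount listed in a
# mounts table.  The table defaults to /proc/mounts; another file can
# be given as the first argument.
nfs_block_sizes() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        rsize = "?"; wsize = "?"
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++) {
            # "rsize=" and "wsize=" are 6 characters, so the value
            # starts at position 7.
            if (opt[i] ~ /^rsize=/) rsize = substr(opt[i], 7)
            if (opt[i] ~ /^wsize=/) wsize = substr(opt[i], 7)
        }
        print $2, "rsize=" rsize, "wsize=" wsize
    }' "${1:-/proc/mounts}"
}

if [ -r /proc/mounts ]
then
    nfs_block_sizes
fi
```

A 32K block size shows up as rsize=32768 wsize=32768 next to the mount point.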

I wrote a little script to beat on the NFS system with copies
from remote nodes to the master; it is attached after my signature.
It was run on 18 nodes simultaneously.  On each node it did the mv
over NFS 100 times and ran md5sum on the file once it got there.

For 1800 of these network copies there were 142 where the md5sum
didn't match.  I couldn't get an error in a couple of tries running
just two at a time, so total load seems to matter, presumably on the
master node, but maybe on the switch?  Every node logged these sorts
of errors, and the variation between nodes looks like random scatter.
Given the relatively low rate, it is probably one event per file.
The files are around 1.3 MB each, or about 40 blocks of 32 KB per mv,
so the error rate per 32 KB block is roughly
142/(1800*40) = 0.00197.
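
The arithmetic above can be reproduced with a few lines of shell. This is only a sketch: it uses the rounded figure of 40 blocks per file and assumes one bad block per bad file, as in the estimate. The variable names are mine.

```shell
#!/bin/sh
failures=142          # copies whose md5sum did not match
copies=1800           # 18 nodes x 100 copies each
blocks_per_copy=40    # about 1.3 MB per file in 32 KB blocks

# Fraction of 32 KB blocks that arrived damaged.
rate=$(awk -v f="$failures" -v c="$copies" -v b="$blocks_per_copy" \
    'BEGIN { printf "%.5f", f / (c * b) }')

# Percentage of whole files that arrived damaged.
pct=$(awk -v f="$failures" -v c="$copies" \
    'BEGIN { printf "%.1f", 100 * f / c }')

echo "per-block error rate: $rate"
echo "damaged files: $pct%"
```

That is roughly one bad block in every 500 written, or about 8% of the files.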

> does your
> server have DIRECT_IO enabled on NFS? 

CONFIG_NFS_DIRECTIO=y

> what kind of block device is it
> writing to,

It's an IBM SCSI disk going through the Adaptec controller
on the Tyan S2468UGN motherboard.  It shows up as /dev/sde.
Not sure if that answers the question.

> and what filesystem for that matter?

ext2

>  or have you already
> tried a different filesystem?

No spare disk to build another filesystem on.

It doesn't seem likely to be the file system: if it were
giving those error rates in normal writes, that disk would be
Swiss cheese by now.

One last thing.  One link-beat event was logged:

Dec  3 14:52:10 safserver ifplugd(eth1)[1649]: Link beat lost.
Dec  3 14:52:11 safserver ifplugd(eth1)[1649]: Link beat detected.

That wouldn't explain all the errors, though, because they were
scattered throughout the run, and the run took a lot longer than
1 second to complete.

Network is 100baseT through a DLINK DSS-24 switch.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

cat testmv.sh #this may wrap!!!!
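#!/bin/sh
# Copy a reference file over NFS repeatedly and log md5sum mismatches.
# ~safrun is where the NFS copy lands; /tmp is local to each node.
cd ~safrun || exit 1
NODE=`hostname`
count=100
# "set" splits the md5sum output so that $1 is the checksum itself.
set `md5sum /tmp/SAVELASTMEGABLAST.txt`
HOLDMD=$1
echo "initial md5sum is $HOLDMD" > /tmp/ERRORS.$NODE
# Count down from 100 to 0 so the body runs exactly 100 times.
while [ $count -gt 0 ]
do
  count=`expr $count - 1`
  cp /tmp/SAVELASTMEGABLAST.txt /tmp/TESTLAST.txt
  # The mv from local /tmp into this directory is the copy under test.
  mv /tmp/TESTLAST.txt ./TESTLAST.txt.$NODE
  set `md5sum TESTLAST.txt.$NODE`
  NEWMD=$1
  /bin/rm ./TESTLAST.txt.$NODE
  if [ "$NEWMD" != "$HOLDMD" ]
  then
    echo "error:  md5sum is $NEWMD at $count" >>/tmp/ERRORS.$NODE
  fi
done
echo "error:  done" >>/tmp/ERRORS.$NODE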
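
The script above only records that a checksum differed, not where. One possible follow-up, sketched here and not part of the test that was run, is to keep a bad copy and list which 32K blocks differ from the original. The function name bad_blocks and the default block size of 32768 are assumptions for illustration.

```shell
#!/bin/sh
# List the 0-based indices of the blocks in which two files differ.
# Usage: bad_blocks GOOD BAD [BLOCKSIZE]   (BLOCKSIZE defaults to 32768)
bad_blocks() {
    # cmp -l prints one line per differing byte; the first field is
    # the byte offset, counted from 1.
    cmp -l "$1" "$2" 2>/dev/null |
    awk -v bs="${3:-32768}" '{
        blk = int(($1 - 1) / bs)
        if (!(blk in seen)) { seen[blk] = 1; print blk }
    }'
}
```

For example, bad_blocks /tmp/SAVELASTMEGABLAST.txt ./TESTLAST.txt.$NODE, run before the rm, prints one block number per damaged 32K block. A single number per bad file would fit the guess of one event per file.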