[Beowulf] serious NFS problem on Mandrake 10.0

Mark Hahn hahn at physics.mcmaster.ca
Fri Dec 3 15:41:44 PST 2004

> > > This is very, very, VERY bad. 
> > 
> > indeed.  is it safe to assume your machines are quite stable
> > (memtest86-wise)?
> Yes.  They run days and days without any errors.  (S2466 motherboards
> with single Athlon MP 2200+ processors, ECC enabled.)

hmm.  I don't think I've ever talked to anyone who had a cluster
of AthlonMP's that they considered completely stable.  at least 
not at the level of stability of clusters of Opterons, Xeons, etc.

> just running two at a time in a couple of tries so total load
> seems to matter, presumably on the master node, but maybe on
> the switch?

I'd be most surprised if the switch was implicated, simply
because the 32K doesn't correspond well with switch behavior.

> Every node logged these sorts of errors and the variation
> looks like random scatter. Given the low rate (relatively speaking)
> it is probably one event per file, and since the files are around
> 1.3 Mb each, or about 40 blocks of 32k per mv, it seems like
> the error rate per 32k block is 142/(1800*40) = .00197. 

not bad ;)

seriously, I guess it's also worth asking whether the 32K is aligned.
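a quick way to check that (a sketch — good.bin/bad.bin stand in for a clean copy and an NFS-damaged copy of the same file; the demo here just plants one bad byte at offset 32768):

```shell
# demo setup: a 64K all-zero file with a single byte flipped at offset 32768
head -c 65536 /dev/zero > good.bin
cp good.bin bad.bin
printf '\001' | dd of=bad.bin bs=1 seek=32768 conv=notrunc 2>/dev/null

# cmp -l lists the 1-based byte numbers of the differences; reduce mod 32K.
# a remainder of 0 on every line would mean the damage is 32K-aligned.
cmp -l good.bin bad.bin | awk '{ off = $1 - 1; print off, off % 32768 }'
```

run against your real saved-good/transferred-bad pairs, clustering of the remainders near 0 (or any single value) would finger a particular layer.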

> > does your
> > server have DIRECT_IO enabled on NFS? 

hmm.  that was a "this is marked experimental and could be a race
of some sort <vigorous handwaving>" sort of hmm.

> > what kind of block device is it 
> > writing to,
> It's an IBM scsi disk going through the Adaptec controller
> on the Tyan S2468UGN motherboard.  Shows up as /dev/sde.
> Not sure if that answers the question.

that's fine - the driver/disk seems unlikely to just lose track 
of the occasional 32K chunk ;)

> >and what filesystem for that matter?
> ext2

OK, so 32K doesn't really match there, since ext2 tends to think in 
terms of 4k blocks.
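easy enough to confirm on the server (the /dev/sde1 partition name is a guess on my part, and dumpe2fs needs root):

```shell
# on the real server (as root, hypothetical partition name):
#   dumpe2fs -h /dev/sde1 | grep 'Block size'
# rootless approximation for whatever fs holds the current directory:
stat -f -c %s .
```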

> >  or have you already
> > tried a different filesystem?
> No spare disk to build another filesystem on.

I was mainly thinking ahead if you said xfs or reiserfs;
the former has some larger-sized blocks, and the latter is something 
I don't really trust.  I would definitely not rank ext2 as high risk.

> Doesn't seem likely to be the file system since if it was
> giving those error rates in normal writes that disk would be
> swiss cheese by now.

right, though this is a different load than you generate by other means.

> One last thing, one of these events was registered:
> Dec  3 14:52:10 safserver ifplugd(eth1)[1649]: Link beat lost.
> Dec  3 14:52:11 safserver ifplugd(eth1)[1649]: Link beat detected.
> but it wouldn't explain all the errors because they were scattered
> through the run time, and that took a lot longer than 1 second
> to complete.
> Network is 100baseT through a DLINK DSS-24 switch.

ouch!  no jumbo frames then :(

>   cp /tmp/SAVELASTMEGABLAST.txt /tmp/TESTLAST.txt
>   mv /tmp/TESTLAST.txt ./TESTLAST.txt.$NODE
>   set `md5sum TESTLAST.txt.$NODE`
>   NEWMD=$1
>   /bin/rm ./TESTLAST.txt.$NODE
>   if [ "$NEWMD" != "$HOLDMD" ] 

hmm. you're doing both the writes and reads from the slave node here.
was that part of your original description?  I'm wondering about 
bad writes vs bad reads.  what happens if you run the md5sum on 
the master instead?
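a local stand-in for that split (just a sketch; on the real setup the second md5sum would run on the master against the server's local copy of the file, e.g. via ssh — all paths here are made up):

```shell
# stand-in for the NFS transfer: copy ~1.3 MB and checksum both ends.
# on the cluster, "$DST" would be the server-side path and the second
# md5sum would run on the master (ssh master md5sum ...).
SRC=$(mktemp); DST=$(mktemp)
head -c 1310720 /dev/zero > "$SRC"     # ~1.3 MB, like the real files

cp "$SRC" "$DST"                       # stands in for the mv over NFS
w=$(md5sum "$SRC" | awk '{print $1}')  # checksum as written
r=$(md5sum "$DST" | awk '{print $1}')  # checksum as read back
if [ "$w" = "$r" ]; then echo match; else echo MISMATCH; fi
```

if the server-side sum matches the original but the sum computed through the mount doesn't, your reads are at fault; if the server-side copy is already wrong, it's the writes.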

in any case, I think I'd turn off DIRECT_IO first.  it's an attractive
feature, but it's easy to imagine how it might not quite work right,
given the length of time nfs has been doing IO only to/from page cache.

switching to a different wsize would be even easier.
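e.g. an fstab line for the clients (server name, export, and mount point are made up; 8K is just a commonly-used smaller value):

```
server:/export  /mnt/export  nfs  rw,rsize=8192,wsize=8192,hard,intr  0  0
```

if the corruption stops at wsize=8192, that points pretty squarely at the 32K transfer path.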

regards, mark hahn.
