[Beowulf] NFS Read Errors

Mon Dec 3 16:39:45 PST 2007

We were having trouble restarting from our homegrown parallel 
magnetohydrodynamic code's checkpoint files.  The files could be 
read, but funny things happened in the run afterward.  Eventually we 
figured out that the restarted parallel run differed from the serial 
restarted run from the same checkpoint.

After much gnashing of teeth and rending of apparel, we found that 
the checkpoint files were being read incorrectly across NFS.  That 
let us simplify our search for the problem.  We first found that the 
local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed 
version of the file was different from that produced on the original 
file.  What was interesting was that the copy either took forEVER -- 
like 10 minutes or 20 minutes for a 1 GB file -- when the final 
result was bad or it took about a minute when the file was 
perfect.  I'm guessing that whatever error checking gets done on 
the packets was rejecting so many of them that a bad packet it 
couldn't detect eventually slipped through.
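
For anyone who wants to repeat the check, here is a minimal Python sketch of the same comparison. It is not what we ran (we used cp and openssl dgst -md5 by hand); the function names md5_of and copy_and_check and the 1 MiB chunk size are assumptions made up for illustration. It returns the copy time along with the verdict, since in our case the slow copies were the bad ones.

```python
import hashlib
import shutil
import time

CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary choice)


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def copy_and_check(src, dst, expected):
    """Copy src to dst, then compare dst's digest with `expected`.

    `expected` should be the digest computed on the machine that
    physically holds the disk. Returns (digests_match, copy_seconds).
    """
    start = time.time()
    shutil.copyfile(src, dst)
    elapsed = time.time() - start
    return md5_of(dst) == expected, elapsed
```

Here src would be the path of the file on the NFS mount and dst a path on the local disk.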

When we found that doing the md5 digest on a remote file produced a 
different result than doing it on the processor on which the disk was 
mounted, our tests got simpler.  And shorter still, after we found 
that we could get fairly frequent failures with 10 MB files or 
smaller.  Clearly we had an NFS failure, probably associated with hardware.
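
The shorter test can be sketched along these lines, again as an illustration and not the commands we used. The names make_test_file and count_bad_reads, the 10 MB default and the trial count are all assumptions. The idea is to create the file and its reference digest on the machine that owns the disk, then hash it repeatedly through the NFS mount on another node. Note that repeated reads may be served from the reading machine's cache, so later trials do not necessarily touch the network.

```python
import hashlib
import os


def make_test_file(path, size_bytes=10 * 1024 * 1024):
    """Write size_bytes of random data to path; return its MD5 hex digest."""
    data = os.urandom(size_bytes)
    with open(path, "wb") as f:
        f.write(data)
    return hashlib.md5(data).hexdigest()


def count_bad_reads(path, reference, trials=20):
    """Hash the file `trials` times; count digests that differ from `reference`."""
    bad = 0
    for _ in range(trials):
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest != reference:
            bad += 1
    return bad
```

On healthy hardware count_bad_reads should return 0; anything else means at least one read came back different from what was written.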

This was all between two specific nodes of our small cluster.  [Old 
hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual 
(Tyan...) chip motherboards, both running Redhat 9 with the 
2.4.20-8 kernel, though one is the smp version; NetGear GA311 NICs; 
and a NetGear GS108 8 port copper 1 Gb/s switch.  The single 
processor motherboards have 32-bit PCI slots so their network speeds 
are limited to about 300 Mbps as shown by netpipe.  All of the LEDs 
at the ends of the cables show 1000 Mb/s connections.]

Then we started checking other pairs.  Some were fine.  Some were bad 
in the same way.  So we replaced the switch, changing to a 16 port 
NetGear GS216.  That seemed to cure most of the problem.  But we 
continued to have problems when the others copy a file that lives 
on one particular single processor machine.

That's where we are now.  The md5 digest run on that machine 
consistently shows the same result, whereas the digest for that file 
produced on a remote machine will be almost stochastic.  In some 
cases it will eventually settle on the right answer, and then the 
speed goes WAY up.  I suppose that happens because the file request 
can then be served from the cache on the machine doing the read.  But 
why doesn't the same thing happen after it has received bad blocks?
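
One way to look at the bad blocks directly, assuming a known-good copy and a bad copy of the same file are both at hand, is to compare them block by block and see where they differ. The sketch below is hypothetical: the name differing_blocks and the 4096-byte block size are assumptions, and the block size is not meant to match whatever transfer size NFS is actually using.

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Return the indices of fixed-size blocks whose contents differ.

    A block counts as different if the bytes differ or if one file
    ends before the other within that block.
    """
    bad = []
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

Whether the bad indices cluster, repeat from run to run, or move around might say something about where the corruption happens.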

Most, if not all, of the original network cards in those machines went 
bad and have been replaced in the last few years, so I decided to try 
a brand new GA311.  No joy there.  It still gives out the wrong 
info.  I guess the motherboard PCI bus controller is hinky, but I'm 
far from sure.

We are in the process of upgrading and thus replacing all the 
machines we have of that configuration due to space limitations and 
their age, but I'm still curious what the problem could be.

Suggestions?  Comments?

Mike