[Beowulf] Solved: SATA(?) errors locks up node
gebhardt at hrz.uni-marburg.de
Mon Jul 2 07:05:34 PDT 2007
Thank you all for your advice!
After a firmware upgrade (to 20.06C06) of the SATA disks, we have had no
further incidents so far, so I'm pretty sure that we have caught the bug.
Thanks again, Th. Gebhardt
On Wednesday 23 May 2007 11:13, Gebhardt Thomas wrote:
> We are running a cluster of 57 dual-Opteron nodes. Once or twice a week
> one of these nodes enters an error state and can no longer connect to the
> I/O subsystem; I then need to reboot that node. As far as I can see,
> the problem occurs randomly on any of our nodes, i.e., the MTBF of a single
> node is about 6-12 months.
> I still don't know whether this is a problem with the Linux kernel SATA
> driver, a hardware problem, a flaw in the disk firmware, or something else.
> I'm looking for a way to track down the problem without
> substantially interfering with the jobs on the cluster.
> This is our environment:
> TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
> 2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
> Western Digital 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev.
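
The per-node MTBF figure in the quoted message can be checked with a few
lines of Python. This is only a rough sketch: it assumes the failures are
spread evenly over all 57 nodes and takes a year as 52 weeks. The function
name node_mtbf_months is made up for this illustration.

```python
# Rough per-node MTBF estimate from the figures in the quoted message.
# Assumptions (not from the original post): failures are spread evenly
# over all nodes, and a year has 52 weeks.

NODES = 57                      # cluster size given in the message
WEEKS_PER_MONTH = 52.0 / 12.0   # average number of weeks per month


def node_mtbf_months(nodes, failures_per_week):
    """Return the mean time between failures of one node, in months."""
    # With `failures_per_week` failures across `nodes` machines, a single
    # node fails on average once every nodes / failures_per_week weeks.
    weeks_between_failures = nodes / failures_per_week
    return weeks_between_failures / WEEKS_PER_MONTH


if __name__ == "__main__":
    for rate in (1, 2):  # "once or twice a week"
        months = node_mtbf_months(NODES, rate)
        print(f"{rate} failure(s)/week -> about {months:.1f} months per node")
```

With one failure per week this gives roughly 13 months per node, and with
two per week roughly 6.6 months, which is in line with the "about 6-12
months" quoted above.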