[Beowulf] SATA(?) error locks up node
gebhardt at hrz.uni-marburg.de
Thu May 24 01:19:03 PDT 2007
Dear Mr. Hahn,
> the logs show that a command times out, and defies recovery. I don't think
> your chipset is the most common - is the SATA controller integrated, or
> something like a Promise chip?
The HT1000 is an integrated controller for USB, IDE and SATA. As far as I
understand, it is the same chip as the Broadcom BCM5785.
> do you have any guess about whether your disks are getting enough power?
> it seems to be a fairly common occurrence for people to report this kind of
> "stops working" bug to the list (linux-ide at vger.kernel.org), only later to
> discover that the problem was a marginal power supply.
24 of the 57 nodes have an additional InfiniBand host adapter. If power were
marginal, I would expect this subset of nodes to have a higher error rate than
the other nodes. But there seems to be no statistically significant difference.
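
A comparison of this kind can be checked with Fisher's exact test on a 2x2 table (nodes with and without the extra adapter, failed versus not failed). The following is only a minimal sketch: the function name is made up for illustration, and the failure counts in the usage lines are placeholders, not measured data from this cluster. Only the group sizes (24 and 33 of 57 nodes) come from the message.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table
    [[a, b], [c, d]] (rows: node groups, columns: failed / not failed)."""
    row1 = a + b          # nodes in the first group
    row2 = c + d          # nodes in the second group
    col1 = a + c          # total number of failed nodes
    n = row1 + row2       # all nodes

    def prob(x):
        # Hypergeometric probability that exactly x of the failed
        # nodes fall into the first group, with the margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables that are at least as
    # unlikely as the observed one (small tolerance for rounding).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Placeholder counts, NOT real data: failures among the 24 nodes with
# the extra adapter and among the 33 nodes without it.
failed_with, failed_without = 5, 7
p_value = fisher_exact_two_sided(failed_with, 24 - failed_with,
                                 failed_without, 33 - failed_without)
print(f"two-sided p-value: {p_value:.3f}")
```

A large p-value would be consistent with the observation that the extra adapter makes no measurable difference. With only 57 nodes, however, the test has limited power, so it cannot rule out a marginal power supply on its own.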
> > I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> > the disk. This does not help either.
> it wouldn't, unless you had a noise problem with the cable.
That was advice from our hardware vendor.
Eoin McHugh gave me a hint that our disks might have a firmware bug and
that an update is available. (For whatever reason I had associated our disks
with Maxtor, so I did not find any firmware update on their website.
But of course the disks are from Western Digital.) This is the most
promising lead I am following now.
Thanks for your advice! SY, Th. Gebhardt