[Beowulf] SATA(?) errors locks up node
Mark Hahn
hahn at mcmaster.ca
Wed May 23 08:37:12 PDT 2007
> I still don't know whether this is a problem of the linux kernel sata driver,
> a hardware problem, a flaw of the disk firmware or something else. I'm
the logs show that a command times out, and defies recovery. I don't think
your chipset is the most common - is the SATA controller integrated, or
something like a Promise chip?
do you have any guess about whether your disks are getting enough power?
it seems to be a fairly common occurrence for people to report this kind of
"stops working" bug to the list (linux-ide at vger.kernel.org), only later to
discover that the problem was a marginal power supply.
> looking for a possibility to track down the problem without substantially
> interfering with the jobs on the cluster.
the sata developers hang out on linux-ide, and seem very responsive.
quite a lot of work has been done on exception handling, but as always,
it's the most common controllers which are best tested/supported.
> I tried several linux kernel versions (eg. 2.6.18.1, currently: 2.6.20.3
> from kernel.org) which seems to make no difference.
well, by kernel standards, 2.6.20.3 is fairly old; there have certainly been
plenty of SATA updates this year.
> I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> the disk. This does not help either.
it wouldn't, unless you had a noise problem with the cable.
> NCQ is disabled:
> # cat /sys/block/sda/device/queue_depth
> 1
such features wouldn't cause the fairly low-level hang in your logs -
to me it looks like power, given that it appears to affect even the phy-level
disk interface. it wouldn't hurt to see what SMART says about it (health,
attributes and even a self-test). you might also try stressing the disk with
IO to see whether you can repeatably trigger the problem.
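a rough sketch of those two checks, in case it helps. this is not taken from the original report: it assumes smartmontools is installed, and the function names (smart_check, read_stress) and the device path you pass in are placeholders of mine. the read loop only reads, so it should be safe on a disk that is in use, but on a real device you would probably want to bypass the page cache (e.g. GNU dd's iflag=direct) so the reads actually reach the disk.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative, not from the post.

# smart_check DEVICE: print SMART health and attributes, then start a short
# self-test. Requires smartmontools and usually root.
smart_check() {
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found; skipping SMART checks" >&2
        return 0
    fi
    smartctl -H "$1"        # overall health assessment
    smartctl -A "$1"        # vendor attributes (reallocated sectors, CRC errors, ...)
    smartctl -t short "$1"  # start a short self-test in the background
    # read the result a few minutes later with: smartctl -l selftest "$1"
}

# read_stress TARGET [PASSES]: read TARGET end to end PASSES times (default 3).
# Read-only; stops and returns non-zero on the first failed pass.
read_stress() {
    target=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! dd if="$target" of=/dev/null bs=1048576 2>/dev/null; then
            echo "read failed on pass $i" >&2
            return 1
        fi
        echo "pass $i ok"
        i=$((i + 1))
    done
}
```

for example, run "smart_check /dev/sda" as root, then "read_stress /dev/sda 5" while watching the kernel log on another terminal; if the timeouts show up reliably under load, that at least gives you a reproducible case to report.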
regards, mark hahn.