[Beowulf] SATA(?) errors locks up node

Wed May 23 09:08:38 PDT 2007

Well, just generally, I was thinking about a block design: spend some money
on extra 1) cooling, 2) shielding, and 3) power for overlapping sections
of the cluster, and see whether the incidence rate of failures correlates
with anything. You can imagine stacking your nodes in a (3-dimensional) cube;
the top X percent get extra shielding, the front X get cooling, and the right
X get power. If X is 50% you are spending a lot of time and money on the
experiment but would get a statistically meaningful result (which might be
no correlation at all) in a few weeks; if X is tiny you would have to wait
long enough for a random failure to occur in the upgraded volumes, so you'd
invest less but have a longer wait. If this has been an issue for a long
time and the cluster is expected to stay in service well into the future,
it could be worth doing something like this with X fairly small. A
side benefit would be data for a broader cost-benefit analysis of plausible
upgrades, if you can measure other performance characteristics besides the
failures.
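
A rough sketch of the bookkeeping, in Python, in case it helps. Everything
here is made up for illustration: the function names, the n x n x n cube of
node positions, and the sample failure list. It only compares raw failure
rates with and without each upgrade; it does not do a significance test.

```python
from itertools import product

FACTORS = ("power", "cooling", "shielding")


def assign_treatments(n, x):
    """Map each (i, j, k) position in an n*n*n cube to its upgrades.

    The last fraction x of positions along each axis gets one upgrade:
    i -> power (right), j -> cooling (front), k -> shielding (top).
    """
    cut = n - round(n * x)  # indices >= cut receive the upgrade
    return {
        (i, j, k): {"power": i >= cut, "cooling": j >= cut, "shielding": k >= cut}
        for i, j, k in product(range(n), repeat=3)
    }


def _rate(group, failed):
    """Fraction of the nodes in group that appear in the failed set."""
    return sum(pos in failed for pos in group) / len(group) if group else float("nan")


def failure_rates(layout, failed):
    """Return {factor: (rate_with_upgrade, rate_without_upgrade)}."""
    rates = {}
    for factor in FACTORS:
        treated = [pos for pos, t in layout.items() if t[factor]]
        control = [pos for pos, t in layout.items() if not t[factor]]
        rates[factor] = (_rate(treated, failed), _rate(control, failed))
    return rates


# Hypothetical example: 64 nodes, X = 50%, four invented failures.
layout = assign_treatments(4, 0.5)
failed = {(0, 0, 0), (1, 3, 3), (0, 2, 1), (1, 1, 2)}
for factor, (with_upgrade, without) in failure_rates(layout, failed).items():
    print(f"{factor}: {with_upgrade:.4f} with upgrade, {without:.4f} without")
```

Because the three upgraded regions overlap, every combination of upgrades
shows up somewhere in the cube, so one run of the experiment gives a
comparison for each factor.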
Peter

On 5/23/07, Mark Hahn <hahn at mcmaster.ca> wrote:
>
> > I still don't know whether this is a problem of the linux kernel sata
> > driver, a hardware problem, a flaw of the disk firmware or something
> > else. I'm
>
> the logs show that a command times out, and defies recovery.  I don't
> think your chipset is the most common - is the SATA controller
> integrated, or something like a Promise chip?
>
> do you have any guess about whether your disks are getting enough power?
> it seems to be a fairly common occurrence for people to report this kind
> of "stops working" bug to the list (linux-ide at vger.kernel.org), only
> later to discover that the problem was a marginal power supply.
>
> > looking for a possibility to track down the problem without
> > substantially interfering with the jobs on the cluster.
>
> the sata developers hang out on linux-ide, and seem very responsive.
> quite a lot of work has been done on exception handling, but as always,
> it's the most common controllers which are best tested/supported.
>
> > I tried several linux kernel versions (eg. 2.6.18.1, currently: 2.6.20.3
> > from kernel.org) which seems to make no difference.
>
> well, by kernel standards, 2.6.20.3 is fairly old; there have certainly
> been plenty of SATA updates this year.
>
> > I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> > the disk. This does not help either.
>
> it wouldn't, unless you had a noise problem with the cable.
>
> > NCQ is disabled:
> > # cat  /sys/block/sda/device/queue_depth
> > 1
>
> such features wouldn't cause the fairly low-level hang in your logs -
> to me it looks like power, given that it appears to affect even the
> phy-level disk interface.  it wouldn't hurt to see what smart says about
> it (health, metrics and even a self-test.)  you might also try stressing
> the disk with IO to see whether you can repeatably trigger the problem.
>
> regards, mark hahn.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>