[Beowulf] SATA(?) errors locks up node

Gebhardt Thomas gebhardt at hrz.uni-marburg.de
Wed May 23 02:13:59 PDT 2007


Hi,

we are running a cluster of 57 dual-Opteron nodes. Once or twice a week
one of these nodes enters an error state and can no longer reach its
I/O subsystem, so I have to reboot it. As far as I can see, the problem
strikes any of our nodes at random, i.e., the MTBF of a single node is
about 6-12 months.
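
The per-node MTBF figure follows from simple arithmetic. A minimal sketch, assuming the failures are spread evenly over all 57 nodes; the function name and the 30.44-day average month are illustrative assumptions, not part of our setup:

```python
DAYS_PER_MONTH = 30.44  # assumption: average length of a calendar month


def per_node_mtbf_months(nodes, failures_per_week):
    """Per-node MTBF in months, given the cluster-wide failure rate."""
    # With N nodes and r failures per week cluster-wide, a single node
    # fails on average once every N / r weeks.
    weeks_between_failures = nodes / failures_per_week
    return weeks_between_failures * 7 / DAYS_PER_MONTH


if __name__ == "__main__":
    for rate in (1, 2):
        months = per_node_mtbf_months(57, rate)
        print(f"{rate} failure(s) per week -> {months:.1f} months per node")
```

One to two failures a week over 57 nodes gives roughly 6.5 to 13 months per node, in line with the range quoted above.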

I still don't know whether this is a problem with the Linux kernel SATA driver,
a hardware problem, a flaw in the disk firmware, or something else. I'm
looking for a way to track down the problem without substantially
interfering with the jobs on the cluster.

This is our environment:
TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
Maxtor 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03
Debian sarge amd64 (custom kernel)

I tried several Linux kernel versions (e.g. 2.6.18.1; currently 2.6.20.3
from kernel.org), which seems to make no difference.

I also tried limiting the SATA bandwidth to 150MB/s with a jumper on
the disk. This does not help either.

NCQ is disabled:
# cat /sys/block/sda/device/queue_depth
1
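
To repeat this check on every node without touching the running jobs, the same sysfs file can be read from a small script. A minimal sketch, assuming the disk is sda on each node; the function name `ncq_disabled` is hypothetical:

```python
from pathlib import Path

# Assumption: the system disk is sda on every node.
DEFAULT_QUEUE_DEPTH_FILE = "/sys/block/sda/device/queue_depth"


def ncq_disabled(queue_depth_file=DEFAULT_QUEUE_DEPTH_FILE):
    """Return True when the queue depth is 1, i.e. NCQ is not in use."""
    depth = int(Path(queue_depth_file).read_text().strip())
    return depth == 1


if __name__ == "__main__":
    # Only query sysfs when the file exists (e.g. not on a non-Linux host).
    if Path(DEFAULT_QUEUE_DEPTH_FILE).exists():
        print("NCQ disabled:", ncq_disabled())
```

It only reads the file, so it should be safe to run while jobs are active.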

Any ideas?

Thanks, Thomas

+++++++++++++++++++

Here is a typical console error log. As far as I can see, it means that the
communication between the kernel and the disk suddenly gets interrupted.

May 17 04:39:51 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
May 17 04:39:51 ata1.00: cmd ca/00:50:9a:32:7b/00:00:00:00:00/e0 tag 0 cdb 0x0 data 40960 out
May 17 04:39:51          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 04:39:58 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:21 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:21 ata1: soft resetting port
May 17 04:40:28 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:51 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:51 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:41:21 ata1.00: qc timeout (cmd 0xec)
May 17 04:41:22 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 17 04:41:22 ata1.00: revalidation failed (errno=-5)
May 17 04:41:22 ata1: failed to recover some devices, retrying in 5 secs
May 17 04:41:26 ata1: hard resetting port
May 17 04:41:34 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:41:57 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:41:57 ata1: COMRESET failed (device not ready)
May 17 04:41:57 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:02 ata1: hard resetting port
May 17 04:42:09 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:42:32 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:42:32 ata1: COMRESET failed (device not ready)
May 17 04:42:32 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:37 ata1: hard resetting port
May 17 04:42:45 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:43:08 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:43:08 ata1: COMRESET failed (device not ready)
May 17 04:43:08 ata1: reset failed, giving up
May 17 04:43:08 ata1.00: disabled
May 17 04:43:08 ata1: EH complete
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 8073882
May 17 04:43:08 Buffer I/O error on device sda2, logical block 9189
May 17 04:43:08 lost page write due to I/O error on sda2
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 16099660
May 17 04:43:08 Buffer I/O error on device sda3, logical block 12365
May 17 04:43:08 lost page write due to I/O error on sda3
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 73606884
May 17 04:43:08 Buffer I/O error on device sda3, logical block 7200768
May 17 04:43:08 lost page write due to I/O error on sda3
....
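
The "cmd" field and the "Status 0xd0" value in the log can be decoded to see which request timed out and what state the port reports. A minimal sketch, assuming the field order command/feature:count:LBA low:LBA mid:LBA high/(extended registers)/device and a 28-bit LBA command; the helper names are hypothetical, and only the two opcodes that occur in the log are listed:

```python
# ATA status register bits (bit 4 is DSC in older ATA revisions).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}

# Only the opcodes that appear in the log above.
COMMANDS = {0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS.items() if status & bit]


def decode_taskfile(cmd_field):
    """Decode e.g. 'ca/00:50:9a:32:7b/00:00:00:00:00/e0' (28-bit LBA only)."""
    command, regs, _extended, device = cmd_field.split("/")
    _feature, count, lba_low, lba_mid, lba_high = (
        int(x, 16) for x in regs.split(":"))
    # In 28-bit addressing the low nibble of the device register
    # carries LBA bits 24-27.
    lba = (((int(device, 16) & 0x0F) << 24) | (lba_high << 16)
           | (lba_mid << 8) | lba_low)
    opcode = int(command, 16)
    return {"command": COMMANDS.get(opcode, hex(opcode)),
            "sectors": count or 256,  # a count of 0 means 256 sectors
            "lba": lba}


if __name__ == "__main__":
    taskfile = decode_taskfile("ca/00:50:9a:32:7b/00:00:00:00:00/e0")
    print(taskfile, "bytes:", taskfile["sectors"] * 512)
    print(decode_status(0xD0))
```

Under these assumptions the timed-out command is a WRITE DMA of 80 sectors (40960 bytes with 512-byte sectors, matching "data 40960 out") at LBA 8073882, which is the first sector reported by end_request. Status 0xd0 has BSY set, so the drive apparently never leaves the busy state through all the resets; while BSY is set the remaining status bits are not meaningful.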


