[Beowulf] SATA(?) errors locks up node
Gebhardt Thomas gebhardt at hrz.uni-marburg.de
Wed May 23 02:13:59 PDT 2007
Hi,

we are running a cluster of 57 dual Opteron nodes. Once or twice a week one of these nodes enters an error state and can no longer reach its I/O subsystem; I then have to reboot that node. As far as I can see, the problem occurs randomly on any of our nodes, i.e., the MTBF of a single node is about 6-12 months.

I still don't know whether this is a problem with the Linux kernel SATA driver, a hardware problem, a flaw in the disk firmware, or something else. I'm looking for a way to track down the problem without substantially interfering with the jobs on the cluster.

This is our environment:

  TYAN S3992 motherboard with Serverworks HT1000+2000 chipset
  2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
  Maxtor 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03
  Debian sarge amd64 (custom kernel)

I tried several Linux kernel versions (e.g. 2.6.18.1, currently 2.6.20.3 from kernel.org), which seems to make no difference. I also tried reducing the SATA bandwidth to 150MB/s with a jumper on the disk. This does not help either. NCQ is disabled:

  # cat /sys/block/sda/device/queue_depth
  1

Any ideas?

Thanks, Thomas

+++++++++++++++++++

Here is a typical console error log. As far as I can see, this means that the communication between the kernel and the disk suddenly gets interrupted.

May 17 04:39:51 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
May 17 04:39:51 ata1.00: cmd ca/00:50:9a:32:7b/00:00:00:00:00/e0 tag 0 cdb 0x0 data 40960 out
May 17 04:39:51          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 04:39:58 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:21 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:21 ata1: soft resetting port
May 17 04:40:28 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:51 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:51 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:41:21 ata1.00: qc timeout (cmd 0xec)
May 17 04:41:22 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 17 04:41:22 ata1.00: revalidation failed (errno=-5)
May 17 04:41:22 ata1: failed to recover some devices, retrying in 5 secs
May 17 04:41:26 ata1: hard resetting port
May 17 04:41:34 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:41:57 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:41:57 ata1: COMRESET failed (device not ready)
May 17 04:41:57 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:02 ata1: hard resetting port
May 17 04:42:09 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:42:32 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:42:32 ata1: COMRESET failed (device not ready)
May 17 04:42:32 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:37 ata1: hard resetting port
May 17 04:42:45 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:43:08 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:43:08 ata1: COMRESET failed (device not ready)
May 17 04:43:08 ata1: reset failed, giving up
May 17 04:43:08 ata1.00: disabled
May 17 04:43:08 ata1: EH complete
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 8073882
May 17 04:43:08 Buffer I/O error on device sda2, logical block 9189
May 17 04:43:08 lost page write due to I/O error on sda2
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 16099660
May 17 04:43:08 Buffer I/O error on device sda3, logical block 12365
May 17 04:43:08 lost page write due to I/O error on sda3
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 73606884
May 17 04:43:08 Buffer I/O error on device sda3, logical block 7200768
May 17 04:43:08 lost page write due to I/O error on sda3
....
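
With incidents this rare across 57 nodes, it may help to reduce each one to a few comparable numbers. The following is only a minimal sketch, not part of the original report: it parses syslog-style libata lines like those in the log above and reports how long the kernel spent between the first exception and disabling the device, and how many port resets it attempted. The function names (parse, summarize), the regular expression, and the default year are illustrative assumptions; syslog timestamps carry no year, so 2007 is taken from the message date. The sample lines are copied from the log above.

```python
import re
from datetime import datetime

# A few lines copied from the console log above, enough to show the timeline.
LOG = """\
May 17 04:39:51 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
May 17 04:40:21 ata1: soft resetting port
May 17 04:41:26 ata1: hard resetting port
May 17 04:42:02 ata1: hard resetting port
May 17 04:42:37 ata1: hard resetting port
May 17 04:43:08 ata1: reset failed, giving up
May 17 04:43:08 ata1.00: disabled
"""

# Timestamp, ataN or ataN.MM source, then the message text.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (ata\d+(?:\.\d+)?): (.*)$")


def parse(log, year=2007):
    """Return (timestamp, source, message) tuples for lines emitted by an ata port."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not come from an ataN source
        # The year is an assumption; syslog lines do not include one.
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, m.group(2), m.group(3)))
    return events


def summarize(events):
    """Time from the first exception to 'disabled', and the number of resets."""
    start = next(t for t, _, msg in events if msg.startswith("exception"))
    end = next(t for t, _, msg in events if msg == "disabled")
    resets = sum(1 for _, _, msg in events if msg.endswith("resetting port"))
    return {
        "seconds_to_disable": int((end - start).total_seconds()),
        "reset_attempts": resets,
    }


if __name__ == "__main__":
    print(summarize(parse(LOG)))
```

For the incident shown, this gives 197 seconds from the first exception to the device being disabled, with four reset attempts (one soft, three hard). Running the same summary over the logs collected from each failed node would show whether every incident follows this pattern or whether some nodes fail differently.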
