Riser card - mainboard conflicts?
Robert G. Brown rgb at phy.duke.edu
Wed Jan 8 09:03:22 PST 2003
- Previous message: Scyld contracts.
- Next message: Looking for remote file view utility
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 8 Jan 2003 tegner at nada.kth.se wrote:

> Hi all!
>
> We have a cluster consisting of 30 athlon 2000+ nodes on a KT3 Ultra
> MS-6380E mainboard (using ide discs) connected by a fast Ethernet
> network.
>
> For the nodes we use 2U chassis, and the NIC and the graphic card sit on a
> PCI-301 riser card.
>
> We are experiencing odd problems;
>
> On one of the nodes we can never get the network to function, there
> are messages about bus-master dirty, PCI bus error, etc, and we never
> get any contact with the rest of the cluster.
>
> The other nodes "seem" to work OK, but for some parallel applications
> one or more of the nodes just "give up" after some time, and in those
> cases we get similar messages as above - but it has also happened
> that a node just died in which case we have to use the reset button to
> get it back.
>
> We start to suspect that mainboard and the riser card are in some way
> incompatible, but we would greatly appreciate any hints of other
> reasons for these problems.

We had heinous problems with Tyan 2460 motherboards and risers in our first attempt to build dual Athlon nodes. Our original 32 bit three slot riser cards would simply not support 3c905's and an inexpensive video card simultaneously. Video wouldn't work at all, and if video was even present the network wouldn't work. If we took the riser out, pulled the backplate off the cards, and mounted them vertically in the mobo slots themselves (most cards these days are half-height and if the backplate is taken off will fit into a 2U chassis just fine) they'd work perfectly.

Part of the problem appeared to be how the risers wired the lines to the keys for the neighboring slots. We ended up having to return all the risers we originally got and replaced them with risers that seemed to connect more lines through and finally got video and network to work simultaneously. We had less of a problem with the 64 bit risers we got in the 2466's we got for the next cluster increments.
We had the best success with risers that piped the AGP slot through to an AGP riser slot -- it looked like the motherboards might have had a bit of a problem with cheap PCI video even aside from the riser.

Eventually we got the nodes sort-of-stabilized. I say sort-of because even after the riser problem was resolved, the nodes turned out to both run hot and be VERY sensitive to thermally induced crashes and VERY sensitive to harmonic distortion of the incoming power supply.

Unfortunately our incoming power lines had a significant degree of harmonic distortion because the idiots who wired our server room ignored the architect's specifications and wired three supply phases with a single common neutral. As described in considerable detail in the list archives, this led to a significant overload of the neutral wire, a sizeable 3V ground loop on the neutral in the power poles, a harmonic brownout on the supply line (shifting the supply sinusoid towards a square wave and starving the systems for power), a consequent DC saturation of the primary supply capacitors, and an inferred loss of a power supply's inherent surge absorption capability. Whatever, the nodes crashed all the time and lots of them broke, blowing the power supplies the first time they were plugged in or burning out e.g. a disk over a few months.

Since the room was rewired to add a neutral per phase according to the architect's spec and our thermal control has improved, node downtime and crashes have much improved.

I'm listing all of this so that you can see the range of problems one can encounter: thermal, electrical, and yes, riser/timing. In the worst case scenario (which alas we've lived through) all three. All I can suggest is patience and trying a different source of risers as at least an experiment, especially ones that connect AGP through on a separate key so you don't have to share the PCI bus with both video and network.
Surge protectors and power factor/harmonic correcting power supplies are also a good idea. Use heavy duty fans in your cases, and make sure that e.g. ribbon cables don't obstruct airflow (we ultimately opted for round cables and replaced all the ribbon cables as well). Keep ambient air entering at the case fronts at 70F or lower, ideally much lower. Watch out for line overload, brownout, and ground loops -- our systems seem remarkably intolerant of marginal power supply.

HTH, although it IS a different mobo than ours and everything above might turn out to be utterly irrelevant to you.

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
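As a rough illustration of why three phases sharing one neutral can overload that neutral when the loads draw harmonic-rich current, here is a minimal sketch. It assumes balanced phases and a load current made of a fundamental plus a third harmonic only; the function name `neutral_rms` and the amplitudes used are hypothetical and are not measurements from the installation described in the message.

```python
import math

def neutral_rms(i1, i3, samples=3600):
    """RMS current in a neutral shared by three balanced phases.

    i1: peak amplitude of the fundamental in each phase (amps, illustrative)
    i3: peak amplitude of the third harmonic in each phase (amps, illustrative)
    """
    total = 0.0
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        neutral = 0.0
        for phase in range(3):
            # Each phase is shifted by 120 degrees relative to the previous one.
            shifted = theta - phase * 2 * math.pi / 3
            neutral += i1 * math.sin(shifted) + i3 * math.sin(3 * shifted)
        total += neutral ** 2
    return math.sqrt(total / samples)

# Per-phase RMS for comparison (fundamental plus third harmonic).
phase_rms = math.sqrt((10.0 ** 2 + 7.0 ** 2) / 2)

print("neutral, pure sine loads:    ", round(neutral_rms(10.0, 0.0), 3))
print("neutral, with third harmonic:", round(neutral_rms(10.0, 7.0), 3))
print("each phase conductor:        ", round(phase_rms, 3))
```

Under these assumptions the fundamentals cancel in the neutral, while the third harmonics from all three phases are in phase with each other and add, so the shared neutral can end up carrying more current than any single phase conductor.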
More information about the Beowulf mailing list
