Riser card - mainboard conflicts?
Robert G. Brown
rgb at phy.duke.edu
Wed Jan 8 09:03:22 PST 2003
On Wed, 8 Jan 2003 tegner at nada.kth.se wrote:
> Hi all!
>
> We have a cluster consisting of 30 Athlon 2000+ nodes on KT3 Ultra
> MS-6380E mainboards (using IDE disks) connected by a Fast Ethernet
> network.
>
> For the nodes we use 2U chassis, and the NIC and the graphics card sit on a
> PCI-301 riser card.
>
> We are experiencing odd problems:
>
> On one of the nodes we can never get the network to function; there
> are messages about bus-master dirty, PCI bus error, etc., and we never
> get any contact with the rest of the cluster.
>
> The other nodes "seem" to work OK, but for some parallel applications
> one or more of the nodes just "give up" after some time, and in those
> cases we get messages similar to those above - but it has also happened
> that a node simply died, in which case we have to use the reset button
> to get it back.
>
> We are starting to suspect that the mainboard and the riser card are in
> some way incompatible, but we would greatly appreciate any hints about
> other possible reasons for these problems.
We had heinous problems with Tyan 2460 motherboards and risers in our
first attempt to build dual Athlon nodes. Our original 32-bit, three-slot
riser cards simply would not support 3c905s and an inexpensive
video card simultaneously. Video wouldn't work at all, and if video was
even present the network wouldn't work. If we took the riser out,
pulled the backplates off the cards, and mounted them vertically in the
mobo slots themselves (most cards these days are half-height and, with
the backplate removed, fit into a 2U chassis just fine), they'd work
perfectly.
Part of the problem appeared to be how the risers wired the lines to the
keys for the neighboring slots. We ended up having to return all the
risers we originally bought and replace them with risers that seemed to
pass more lines through; with those we finally got video and network to
work simultaneously.
We had less of a problem with the 64-bit risers in the 2466s we bought
for the next cluster increments. We had the best success with
risers that piped the AGP slot through to an AGP riser slot -- it looked
like the motherboards might have had a bit of a problem with cheap PCI
video even aside from the riser. Eventually we got the nodes
sort-of stabilized.
I say sort-of because even after the riser problem was resolved, the
nodes turned out both to run hot and to be VERY sensitive to thermally
induced crashes, and VERY sensitive as well to harmonic distortion of
the incoming power. Unfortunately our incoming power lines had a
significant degree of harmonic distortion, because the idiots who wired
our server room ignored the architect's specifications and wired three
supply phases with a single common neutral. As described in
considerable detail in the list archives, this led to a significant
overload of the neutral wire, a sizeable 3V ground loop on the neutral
in the power poles, a harmonic brownout on the supply line (shifting the
supply sinusoid towards a square wave and starving the systems for
power), a consequent DC saturation of the primary supply capacitors, and
an inferred loss of the power supplies' inherent surge absorption
capability.
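For anyone curious why a single shared neutral is so bad for a room full
of switching power supplies: the third-harmonic (triplen) currents they
draw are in phase across all three phases -- 3 x 120 degrees is a full
cycle -- so they add in the neutral instead of cancelling the way the
fundamentals do. Here is a back-of-the-envelope sketch in Python; the
60% third-harmonic content is an assumed figure purely for illustration,
not a measurement from our room:

    import numpy as np

    # Illustrative numbers only (assumed, not measured):
    I1 = 1.0            # fundamental phase current, arbitrary units
    I3 = 0.6            # assumed third-harmonic content, 60% of fundamental

    t = np.linspace(0.0, 1.0 / 60.0, 2000)   # one 60 Hz cycle
    w = 2.0 * np.pi * 60.0

    phases = []
    for k in range(3):
        shift = k * 2.0 * np.pi / 3.0         # phases 120 degrees apart
        phases.append(I1 * np.sin(w * t - shift)
                      + I3 * np.sin(3.0 * (w * t - shift)))

    i_neutral = phases[0] + phases[1] + phases[2]

    # Fundamentals cancel; third harmonics are mutually in phase and add.
    print("peak phase current  : %.2f" % max(abs(p).max() for p in phases))
    print("peak neutral current: %.2f" % abs(i_neutral).max())

Even with perfectly balanced loads the neutral ends up carrying roughly
three times the per-phase third-harmonic current -- more than the phase
current itself if the harmonic content is high enough.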
Whatever, the nodes crashed all the time and lots of them broke, blowing
their power supplies the first time they were plugged in or burning out
e.g. a disk over a few months. Since the room was rewired to add a
neutral per phase according to the architect's spec, and since our
thermal control has improved, node downtime and crash rates have dropped
considerably.
I'm listing all of this so that you can see the range of problems one
can encounter: thermal, electrical, and yes, riser/timing. In the
worst-case scenario (which alas we've lived through), all three. All I
can suggest is patience and trying a different source of risers, at
least as an experiment, especially ones that connect AGP through on a
separate key so you don't have to share the PCI bus between video and
network.
Surge protectors and power-factor/harmonic-correcting power supplies are
also a good idea. Put heavy-duty fans in your cases, and make sure that
e.g. ribbon cables don't obstruct airflow (we ultimately opted for round
cables and replaced all the ribbon cables as well). Keep the ambient air
entering at the case fronts at 70F or lower, ideally much lower. Watch
out for line overload, brownout, and ground loops -- our systems seem
remarkably intolerant of marginal power.
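On the thermal side, even something as dumb as a periodic logging loop
over the motherboard sensors helps catch thermal creep before nodes
start crashing. A minimal sketch, assuming the Linux hwmon sysfs
interface; the paths, labels, and the 60C alert threshold are
assumptions for illustration and will vary by board and kernel (older
lm_sensors setups expose things differently):

    #!/usr/bin/env python
    # Minimal sensor-logging watchdog (a sketch only; adjust to taste).
    import glob
    import time

    ALERT_C = 60.0        # assumed example limit for CPU/case sensors
    POLL_SECONDS = 60

    def read_temps():
        temps = {}
        # hwmon tempN_input files hold millidegrees Celsius.
        for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
            try:
                with open(path) as f:
                    temps[path] = int(f.read().strip()) / 1000.0
            except (IOError, ValueError):
                pass      # sensor vanished or returned garbage; skip it
        return temps

    while True:
        for sensor, celsius in sorted(read_temps().items()):
            flag = "  ALERT" if celsius > ALERT_C else ""
            print("%s %s %.1fC%s" % (time.ctime(), sensor, celsius, flag))
        time.sleep(POLL_SECONDS)

Run something like that on each node and collect the output centrally,
and you'll at least know whether your crashes correlate with
temperature.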
HTH, although it IS a different mobo than ours and everything above
might turn out to be utterly irrelevant to you.
rgb
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu