MP S2460 Problem
Robert G. Brown
rgb at phy.duke.edu
Wed Feb 26 09:50:46 PST 2003
On Wed, 26 Feb 2003, John Morelle wrote:
> Hi all,
> I would like to let everyone here know about the problems we have had with
> the TYAN MP based Tiger S2460 motherboard.
> We have been buying this mobo since its official launch and have integrated
> more than a hundred of them in our beowulf cluster, but we have
> replaced more than half of them with its successor, the Tiger S2466,
> because of "hang" problems.
> I have seen here and there many people who describe their technical problems
> with this Tiger MP board.
> We are trying to collect all the information that users can send
> us about the same problem.
> So please feel free to mail us a brief account of your experience.
> Thanks in advance.
You can probably find plenty of them in the list archives. I've
detailed ours numerous times.
a) They persist in crashing (even to this day) when hammered by a
memory-intensive application. We have some 23 2460's, and probably 2 or
3 times that many 2466's. When the owners run the computation(s) on
them that they got the workstations for, they crash, often within three
days to a week.
b) Even that is maddeningly inconsistent. The same job run with the
same parameters might crash in one day one time, four days another time,
and not crash at all on a different 2460 -- until the time it does.
c) We had incredibly horrible problems initially getting them to work
with off the shelf risers. Some cards would work, in some slots,
sometimes. Some of the same cards that failed would work in some of the
slots on the motherboard if plugged directly in (no riser at all).
We're not talking odd cards, either -- things like 3c905's and
off-the-shelf PCI video.
d) When we finally found risers that would work, and cards that would
work in the risers, we started struggling with the BIOS, which needed
reflashing. For that matter, the 2466's have had plenty of BIOS
problems, some of which are just plain stupid design.
e) Such as the fact that if you flash the BIOS, it turns the serial
console off again (the serial console doesn't work horribly well, as it
requires a keyboard to be plugged directly in if you want to do all
sorts of important things, but it does work). So if you actually bought
2466's WITHOUT a video card, expecting to use the serial console, then
after you reflash the BIOS you have to disassemble the case, insert a
video card, reenter the BIOS, turn the serial console on again, shut it
down and take out the video card, rerack it, power it up and do whatever
via the serial console. And God help you if you made any sort of mistake
or anything failed to "take", because you then get to do it all over
again.
f) Then there is their general sensitivity to heat, power supply,
memory, and the phase of the moon. We replaced all the power supplies
and all the cooling fans once just trying to find a combination that
would stabilize them.
Overall, the 2460's are just plain broken unstable pieces of shit that
suck systems administration time like a black hole and have cost us
something like 1/3 of the productivity of the cluster in question and
infinite annoyance at the management level. We are finally biting the
bullet and trying to gradually replace them with 2466's (reusing all the
rest of the hardware).
If you can get Tyan to replace them for free, please let us know. God
knows that they should -- they should replace ours as well and those
belonging to any other poor suckers who bought them. These systems
overall drove us to seriously consider e.g. dual Xeons (at a fairly
similar price) just because they are relatively stable. Alas, the Xeons
don't run my particular problem as well as Athlons...
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu