MP S2460 Problem

Ken Chase math at velocet.ca
Wed Feb 26 11:39:50 PST 2003


On Wed, Feb 26, 2003 at 12:50:46PM -0500, Robert G. Brown's all...
> On Wed, 26 Feb 2003, John Morelle wrote:
> 
> > Hi all,
> > 
> > I would like to let everyone here know about the problems we are having
> > with the TYAN MP-based Tiger S2460 motherboard.

> You can probably find plenty of them in the list archives.  I've
> detailed ours numerous times.
> 
>   a) They persist in crashing (even to this day) when hammered by a
> memory-intensive application.  We have some 23 2460's, and probably 2 or
> 3 times that many 2466's.  When the owners run the computation(s) on
> them that they got the workstations for, they crash, often within three
> days to a week.

We put Tbird 1.33s into our 2460s with the original BIOS, which we were
ADVISED NOT TO DO, and we've had NO PROBLEMS at all.

I wish there was a way to flash back to this old BIOS, because you can
still get 1.33 and 1.4 Tbirds, and they're rock solid.

>   b) Even that is maddeningly inconsistent.  The same job run with the
> same parameters might crash in one day one time, four days another time,
> and not crash at all on a different 2460 -- until the time it does.

Now you need a RAIC - Redundant Array of Inexpensive Clusters. Then
you verify the results and toss out the garbage.

I actually did this with a bunch of overclocked boxen to weed out the
bad ones. I ran each job 3 times, and started flagging which boxes
gave results that differed from others. When all 3 differed, I ran it
two more times and hand picked through results.
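A minimal sketch of that triple-run check, for anyone who wants to script
it. Everything here is hypothetical (the function names, the host names,
the shape of the results); it assumes each run of a deterministic job
yields a value that can be compared exactly, e.g. a checksum of the output
file. It is not the script I actually used.

```python
from collections import Counter


def vote(results):
    """Majority vote over one job's runs.

    results: list of (host, value) pairs, one per run of the same job.
    Returns (value, suspects) when a strict majority agrees, where
    suspects are the hosts that disagreed. Returns (None, []) when
    there is no strict majority (e.g. all 3 runs differ).
    Assumption: a real result value is never None.
    """
    counts = Counter(value for _, value in results)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(results):
        return None, []
    suspects = [host for host, v in results if v != value]
    return value, suspects


def flag_hosts(jobs):
    """Tally disagreements per host across many jobs.

    jobs: dict mapping job id -> list of (host, value) pairs.
    Returns (strikes, needs_rerun): a Counter of how often each host
    was outvoted, and the job ids with no majority, which need extra
    runs and a look by hand.
    """
    strikes = Counter()
    needs_rerun = []
    for job_id, results in jobs.items():
        value, suspects = vote(results)
        if value is None:
            needs_rerun.append(job_id)
        else:
            strikes.update(suspects)
    return strikes, needs_rerun


if __name__ == "__main__":
    jobs = {
        "j1": [("n1", 42), ("n2", 42), ("n3", 41)],
        "j2": [("n1", 7), ("n2", 8), ("n3", 9)],
        "j3": [("n1", 5), ("n3", 6), ("n2", 5)],
    }
    strikes, needs_rerun = flag_hosts(jobs)
    print(strikes.most_common())  # hosts outvoted most often come first
    print(needs_rerun)            # jobs to run two more times
```

Hosts that keep piling up strikes are the ones to pull. Jobs with no
majority still need the extra runs and a human eye, same as above.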

(I don't condone OC at all anymore, though; this was back in the C300A
days -- though it is interesting to note my C300A OC'd to 450 is going
on 5 years at home as my firewall -- I WISH they still built things as
stable as that! ;)

>   c) We had incredibly horrible problems initially getting them to work
> with off the shelf risers.  Some cards would work, in some slots,
> sometimes.  Some of the same cards that failed would work in some of the
> slots on the motherboard if plugged directly in (no riser at all).
> We're not talking odd cards, either -- things like 3c905's and
> off-the-shelf PCI video.

Man, that BIOS must have been garbage. How's the new BIOS? We had no problems
with the BIOS on the 2460 (mfg'd between Oct and Dec 2001) or on the
2466 (initial release, mfg'd between Dec 2001 and Feb 2002). We NEVER
upgraded the BIOS and don't intend to, for fear of having the same problems.

Oh, do you mean PCI riser cards to let 2U servers have full-height
PCI cards in them? Bad karma on those - we've used them in a few
machines around here and we've had problems with almost every board
we've tried them on. PCI doesn't like to have extra goop in the way of
its cards.

>   e) Such as the fact that if you flash the BIOS, it resets the serial
> console (which doesn't work horribly well, as it requires a keyboard to
> be plugged directly in if you want to do all sorts of important things
> but which does work).  So if you actually bought 2466's WITHOUT a video
> card, expecting to use the serial console, if you reflash the BIOS you
> have to disassemble the case, insert a video card, reenter the BIOS,
> turn the serial console on again, shut it down and take out the video
> card, rerack it, power it up and do whatever via the serial console, and
> God help you if you made any sort of mistake or anything failed to
> "take" because you then get to do it all over again.

WOW. That's amazing. We can run them without keyboards fine. Occasionally
the BIOS is reset by a power fluctuation or something like that, and yes,
you have to get out the video card and stuff, but that's a minor deal.
It happens infrequently enough - when there's a power failure and a subsequent
very unfriendly repowering (where it flickers and comes back on and
just stresses ALL the gear as much as possible..) we find only 2-3% of
machines have a problem. [ Don't ask why there's no UPS. It's political
at the customer end. ]

> Overall, the 2460's are just plain broken unstable pieces of shit that
> suck systems administration time like a black hole and have cost us
> something like 1/3 of the productivity of the cluster in question and
> infinite annoyance at the management level.  We are finally biting the
> bullet and trying to gradually replace them with 2466's (reusing all the
> rest of the hardware).

My experience has been completely contrary to this. Perhaps it's because
we bought them at different times, or perhaps because we don't have MPs
in them. We have regular Tbirds, which were "not supported" by AMD.
We put them into service after 2 months of testing with no problems, and
here we are a year after the install, still with no problems. We've had to
replace two boards, but they were both 2466s (and then we had to buy
MPs because the replacements came with the 4.01 BIOS, which doesn't work
with Tbirds. Those new-BIOS boards have had more problems than all the
2460s and 2466s in the cluster since then as well! Yay!).
 
> If you can get Tyan to replace them for free, please let us know.  God

We got RMAs on dead boards pretty easily through our supplier. Not
sure what your deal is. The extra 10% for a 3-year warranty on CPUs and
boards was worth it.

> knows that they should -- they should replace ours as well and those
> belonging to any other poor suckers who bought them.  These systems
> overall drove us to seriously consider e.g. dual Xeons (at a fairly
> similar price) just because they are relatively stable.  Alas, the Xeons
> don't run my particular problem as well as Athlons...

We use a lot of Athlon gear around here and have built two clusters of
it; so far, no complaints. They blow around the same speed as all
our Intel gear, in the long run.

Again, however, if you buy inexpensive gear, just set things up so it's most
cost effective to throw it away! You and I have come to this conclusion
before, actually. :) I'm sure it makes environmentalists (including the
tiny one inside me) cringe. As long as you see ordinary failure
rates, like perhaps a board a month, you're set. If you have
problems across the whole cluster, where 30-60% of your gear is affected,
you have a different problem.

I'm just waiting to see what happens at 18-24 months -- see if the 246x's
get bitten by the exploding capacitor problem, yay!

Anyone else experiencing that?

/kc

> 
>    rgb
> 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


