MP S2460 Problem

Robert G. Brown rgb at phy.duke.edu
Wed Feb 26 14:13:56 PST 2003


On Wed, 26 Feb 2003, Ken Chase wrote:

> Man that BIOS musta been garbage. How's the new BIOS? We had no problems
> with the BIOS on the 2460 (mfg'd between oct and dec 2001) and on the
> 2466 (initial release mfg'd between dec 2001 and feb 2002). We NEVER
> upgraded the BIOS and don't intend to for fear of having the same problems.
> 
> Oh do you mean PCI riser cards to let 2U servers have full height
> PCI cards in them? Bad karma on those - we've used them in a few
> machines around here and we've had problems with almost every board
> we've tried them on. PCI doesn't like to have extra goop in the way of
> its cards.

Yes, that kind of riser.  And we finally did find risers that worked
adequately.  The PCI bus on the 2460 has (I would guess) some critical
timing issues with cards.  Some cards that work fine in any OTC box just
won't work or will instantly crash the box (preboot or midboot).  The
2460's were incredibly picky about which slots would work, even with
cards that DID work.

We worked our way through various BIOS flashes, and have them running as
stably as possible (for us) now.  They don't generally crash under no
load, they don't generally crash under MY application, but certain
applications that run like a charm on P4's or 2466's crash the
essentially IDENTICALLY configured 2460's as described.

Tyan refused to own the riser problem OR the card problem.  It was just
plain the fault of the card manufacturer, not theirs.  Damn those
incompetent folks at 3com, anyway.  Strange how Tyan opted to use a 3com
chipset onboard for the same device on the 2466.

> WOW. That's amazing. We can run them without keyboards fine. Occasionally
> the BIOS is reset by a power flux or something like that, and yes
> you have to get out the video card and stuff, but that's a minor deal.
> It happens infrequently enough - when there's a power fail and subsequent
> very unfriendly repowering (where it flickers and comes back on and
> just stresses ALL the gear as much as possible..) we find only 2-3% of
> machines have a problem. [ don't ask why there's no UPS. it's political
> at the customer end. ]

Well, there is the reboot issue.  Hard to do a Ctrl-Alt-Del reboot
through the serial line.  But the killer turned out to be entering the
PXE BIOS of the onboard 3com on the 2466's.  To do this you have to
press Ctrl-B or some such as the appropriate line flashes by in the boot
process.  You NEED to do this to reset e.g. whether PXE times out or
waits forever for a keystroke if you put PXE boots first.  The horror is
that only Ctrl-B from an attached keyboard will do.  We could never get
that to work from a serial port.  Ditto escaping out of the INFINITELY
long POST memory test -- only from a keyboard.  I've gotten to the point
where I stick my laptop on the serial port, plug a real keyboard into
the keyboard port, and ignore the laptop keyboard when working with the
BIOS.

> > If you can get Tyan to replace them for free, please let us know.  God
> 
> We got RMAs on dead boards pretty easily through our supplier. Not
> sure what your deal is. extra 10% for 3 year warranty on CPUs and boards
> was worth it.

Not too many boards died, alas; they just crash a lot.  I don't know if
I can still RMA them if I smash them with a hammer into little pieces
first.  D'ya think? ;-)

> We use a lot of Athlon gear around here and have built two clusters of
> them, so far, no complaints. They blow around the same speed as all
> our intel gear, in the long run.

We're happy enough with 2466 clusters ditto, and >>I<< can even use the
2460's.  It seems that only certain kinds of tasks destabilize them.
The worst kind of YMMV -- it can vary from run to run on a single task,
but not show up at all on somebody else's code.

We haven't tried to figure out what, precisely, the program that kills
them is doing when they die.  It could be something really horrible,
like a particular subroutine that heats up one little part of the CPU
past some dissipation threshold so they drop bits and die, as their
stability is also very visibly heat dependent.  However, they have high
quality fans, the room itself is cold, the CPUs don't register as
particularly hot overall... so who knows?

That would explain a Tbird difference, as I'm sure they are different
masks.  Not so easy to see why 2466's run stably with CPUs moved from
unstable 2460's, though... (an experiment we've done and in fact are
still doing).

> Again, however, if you buy inexpensive gear, just set up so it's most
> cost effective to throw it away! You and I have come to this conclusion
> before actually. :) I'm sure it makes environmentalists (including the

Well, they aren't THAT inexpensive when you buy 23 dual processor units
for on the order of $50K with "all" the money in a grant...

Eventually we'll do just that, but it still costs us productivity in
the short run while we have to live with it.

> tiny one inside me) cringe. As long as you experience regular fail
> rates, like perhaps a board a month, then yer set. If you have
> problems across the whole cluster, where 30-60% of your gear is affected
> you have a different problem.

A goodly chunk of a cluster, or all of one or two groups' part of a
cluster, depending on how you view it.

> I'm just waiting to see what happens at 18-24 months -- see if the 246x's
> get bit by the exploding capacitor problem, yay!
> 
> Anyone else experiencing that?

Not experiencing, but what an "interesting" problem!  In the sense of
the Chinese curse...

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





