MP S2460 Problem
Robert G. Brown rgb at phy.duke.edu
Wed Feb 26 14:13:56 PST 2003
On Wed, 26 Feb 2003, Ken Chase wrote:

> Man that bios musta been garbage. How's the new bios? We had no problems
> with the bios on the 2460 (mfg'd between oct and dec 2001) and on the
> 2466 (initial release mfg'd between dec 2001 and feb 2002). We NEVER
> upgraded the bios and don't intend to for fear of having the same problems.
>
> Oh do you mean PCI riser cards to let 2U servers have full height
> PCI cards in them? Bad karma on those - we've used them in a few
> machines around here and we've had problems with almost every board
> we've tried them on. PCI doesn't like to have extra goop in the way of
> its cards.

Yes, that kind of riser. And we finally did find risers that worked
adequately. The PCI bus on the 2460 has (I would guess) some critical
timing issues with cards. Some cards that work fine in any OTC box just
won't work or will instantly crash the box (preboot or midboot). The
boards were incredibly picky about which slots would work, even with
cards that DID work.

We worked our way through various BIOS flashes, and have them running as
stably as possible (for us) now. They don't generally crash under no
load, they don't generally crash under MY application, but certain
applications that run like a charm on P4's or 2466's crash the
essentially IDENTICALLY configured 2460's as described.

Tyan refused to own the riser problem OR the card problem. It was just
plain the fault of the card manufacturer, not theirs. Damn those
incompetent folks at 3com, anyway. Strange how they opted to use their
onboard chipset for the same device on the 2466.

> WOW. That's amazing. We can run them without keyboards fine. Occasionally
> the BIOS is reset by a power flux or something like that, and yes
> you have to get out the video card and stuff, but that's a minor deal.
> It happens infrequently enough - when there's a power fail and subsequent
> very unfriendly repowering (where it flickers and comes back on and
> just stresses ALL the gear as much as possible..) we find only 2-3% of
> machines have a problem. [ don't ask why there's no ups. it's political
> at the customer end. ]

Well, there is the reboot issue. Hard to do a Ctrl-Alt-Del reboot
through the serial line. But the killer turned out to be entering the
PXE bios of the onboard 3com on the 2466's. To do this you have to enter
Ctrl-B or some such as the appropriate line flashes by in the boot
process. You NEED to do this to reset e.g. whether PXE times out or
waits forever for a keystroke if you put PXE boots first.

The horror is that only Ctrl-B from an attached keyboard will do. We
could never get that to work from a serial port. Ditto escaping out of
the INFINITELY long POST memory test -- only from a keyboard. I got to
the point where I stick my laptop on the serial port and plug a real
keyboard into the keyboard port, and ignore the laptop keyboard when
working with the BIOS.

> > If you can get Tyan to replace them for free, please let us know. God
>
> We got RMAs on dead boards pretty easily through our supplier. Not
> sure what your deal is. extra 10% for 3 year warranty on CPUs and boards
> was worth it.

Not too many boards died, alas, they just crash a lot. I don't know if I
can still RMA them if I smash them with a hammer into little pieces
first. D'ya think? ;-)

> We use a lot of athlon gear around here and have built two clusters of
> them, so far, no complaints. They blow around the same speed as all
> our intel gear, in the long run.

We're happy enough with 2466 clusters ditto, and >>I<< can even use the
2460's. It seems that only certain kinds of tasks destabilize them. The
worst kind of YMMV -- it can vary from run to run on a single task, but
not show up at all on somebody else's code. We haven't tried to figure
out what, precisely, the program that kills them is doing when they die.
It could be something really horrible, like a particular subroutine that
heats up one little part of the CPU past some dissipation threshold so
they drop bits and die, as their stability is also very visibly heat
dependent. However, they have high quality fans, the room itself is
cold, the CPUs don't register as particularly hot overall... so who
knows? That would explain a Tbird difference, as I'm sure they are
different masks. Not so easy to see why 2466's run stably with CPUs
moved from unstable 2460's, though... (an experiment we've done and in
fact are still doing).

> Again, however, if you buy inexpensive gear, just set up so it's most
> cost effective to throw it away! You and I have come to this conclusion
> before actually. :) I'm sure it makes environmentalists (including the

Well, they aren't THAT inexpensive when you buy 23 dual processor units
for order of $50K with "all" the money in a grant... Eventually we'll do
just that, but it still costs you productivity in the short run while we
still have to live with it.

> tiny one inside me) cringe. As long as you experience regular fail
> rates, like perhaps a board a month, then yer set. If you have
> problems across the whole cluster, where 30-60% of your gear is affected
> you have a different problem.

A goodly chunk of a cluster, or all of one or two groups' part of a
cluster, depending on how you view it.

> I'm just waiting to see what happens at 18-24 months -- see if the 246x's
> get bit by the exploding capacitor problem, yay!
>
> Anyone else experiencing that?

Not experiencing, but what an "interesting" problem! In the sense of the
Chinese curse...

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
