[Beowulf] Tyan 2466 crashes, no obvious reason why
Robert G. Brown rgb at phy.duke.edu
Mon Sep 6 07:45:16 PDT 2004
On Sun, 5 Sep 2004, David Mathog wrote:

> After a few more crashes with nothing in the log files a shell
> script was run that logged all sensors readings every 10 seconds
> to a file. When it next crashed (6 hours after a restart) there
> was no significant difference between any of the numbers, be
> they voltage, RPM, or Temp.
>
> I would have expected that if the power supply or on board
> voltage regulator was flaking out it would most likely result
> in noise showing up in sensors - but it didn't.
>
> This time I also left a monitor plugged into the node
> and was greeted by this message on the down machine:
>
>   CPU 0: Machine Check Exception: 000000000000004
>   Bank 0: e67aa00000000833 at 000000003f9c8688
>   Bank 1: f600200000000853 at 00000000001ab948
>
>   Kernel panic CPU context corrupt
>   In interrupt handler - not syncing
>
> That message must be new though, because when I plugged in
> that monitor the system had recently crashed, and there
> was nothing on the screen then.

fwiw, we have had pretty miserable total experiences with the entire
246x line from tyan. The 2460 was openly broken, the 2466 works ok but
is damned finicky and breaks easily.

IIRC, the 2466's come with a three year warranty from Tyan and the
processors typically are also warranted by AMD (depending a bit on
where/how you got the systems).

The original CPU fans distributed by AMD with the CPUs totally suck. We
have had a tremendous failure rate with them -- literally a box or so to
send back and/or replace ourselves, maybe 25-35% of our total cluster.

We have found that a dying fan is a common source of trouble in a
cluster node -- if it goes up, stays up for a while, then crashes
chances are good that it is a load/heat related problem and that as soon
as the load reaches a critical point a slightly dying fan can no longer
keep one or the other CPU cool enough and it destabilizes and the system
crashes. Sometimes the fans die all the way.
Sometimes the underlying CPUs then cook (we also have a smaller pile of
cooked CPUs). Sometimes the power supplies themselves smoke. Sometimes
the smoking PS's take other system components with them (or rather, it
may be that a smoking motherboard is shorted internally enough to take
the PS with it).

We have had pretty low reliability overall in these systems, to the
point where we have only RARELY had our entire 2466 cluster up and
running perfectly. This is in strict counterpoint to e.g. the Opteron
cluster(s), which have functioned perfectly since powerup.

It isn't about AMD (except for the fan issue, which they have owned up
to, and they are willingly replacing any fans that we have trouble
with). It is to some extent about Tyan -- it would take some effort to
convince us at this point that Tyan's motherboards are built with
tremendously great quality control, and the 2460 was so bad that they
should have just swapped it for 2466's across the board for free.

> The motherboard capacitors have all been visually inspected
> and none of them are leaking, bulging, or otherwise showing
> signs of failure.
>
> memtest86 is running now (and for the next 36 hours or so) but
> if it doesn't find anything, does the console error suggest
> a region of memory to test more intensively, or a particular test
> to run in memtest86???
>
> Looks like I'm going to need a bunch of spare parts for a "fun"
> game of "swap components and wait for the crash"...

That is exactly what we do. Only rarely do we get a clean signal of
failure, and we have 2-3 systems running at any time that crash
intermittently just as you describe. We know that it is hardware only
because all the systems are identical and running identical tasks, and
the failures tend to appear in particular systems and then persist in
those systems until eventually they crash all the way.

In order, for an intermittent crash we suspect:

  a) The CPU fans. Knee jerk replace if there is ANY wobble, noise, or
     visible difference in speed (don't trust sensors output for fan
     speed or temperature).

  b) The case fans. Athlons are notoriously sensitive to heat, and
     obstructions in case flow, insufficient flow (too small fans), or
     hot components between the intake and the CPUs can all cause case
     overheating and destabilize memory or possibly components on the
     motherboard (who knows?).

  c) The power supply. Just because it is one of the most common parts
     to fail, although fortunately they tend to fail all the way (often
     with smoke) or not at all.

  d) Roughly equally, motherboard, memory, other components, gremlins,
     CPU. Again motherboards USUALLY fail catastrophically (eventually)
     but a few systems' flakiness has followed the motherboard and then
     the motherboard has died the rest of the way. Memory failure is
     not uncommon, but memtest86 or at worst the tried and true
     swap-the-DIMM game usually finds the culprit, eventually. "Other
     components" is actually pretty rare but you gotta look. CPUs
     definitely fail (often associated with failure of fan(s),
     motherboard, PS) and sometimes they fail slowly -- flaking out
     under load or intermittently. Swapping the CPUs around has proven
     that some CPUs are perfectly capable of booting a system as CPU0,
     running for a day, and then exhibiting a fault. If a CPU has EVER
     been overheated this is not even that unlikely. Since we have had
     a lot more fan failures than CPU failures, we have plenty of CPUs
     at risk.

Basically, we're just trying to keep our 2466's going until new grant
money buys replacement nodes and we can sanely retire them. They are
all less than 3 years old, though, and we really need them to run one to
two more years. We >>have<< gotten lots of work done on them, and they
>>do<< run well a lot of the time -- just not a terribly stable design
and SO sensitive to heat and power problems...

Hope this helps,

   rgb

(with WAY more experience fixing 2466's than he ever hoped to accrue).
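
A minimal sketch for reading the "Bank N" values in the Machine Check
Exception message quoted above, assuming they are 64-bit status words in
the standard x86 machine-check layout (flag bits at the top, a 16-bit
error code at the bottom). The function name and the dictionary keys are
made up for illustration, and what the error code itself means is
model-specific, so this only splits the word into fields:

```python
# Flag bits at the top of a 64-bit machine-check bank status word.
# ASSUMPTION: the values printed on the console follow the architectural
# x86 MCi_STATUS layout; check the CPU vendor's manual before relying on it.
FLAG_BITS = {
    63: "VAL",    # status word holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the bank's MISC register is valid
    58: "ADDRV",  # the address printed after "at" is valid
    57: "PCC",    # processor context corrupt
}

def decode_mc_status(status_hex):
    """Split a bank status word (hex string) into its flags and codes."""
    status = int(status_hex, 16)
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mca_error_code": status & 0xFFFF,               # bits 15..0
    }

# The two bank values from the console message quoted above.
for bank, value in ((0, "e67aa00000000833"), (1, "f600200000000853")):
    fields = decode_mc_status(value)
    print("Bank %d: flags=%s code=0x%04x" %
          (bank, ",".join(fields["flags"]), fields["mca_error_code"]))
```

Under that assumption both banks have PCC set, which is at least
consistent with the "CPU context corrupt" text in the panic, and both
have ADDRV set, so the addresses after "at" are the ones the hardware
reported. Whether those addresses map usefully onto a memtest86 range is
a separate question that this sketch does not answer.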
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
