Problems with dual Athlons

Steve Gaudet SGaudet at turbotekcomputer.com
Wed Jul 31 10:06:15 PDT 2002


Hello Ray,

> > You might try the noapic option. I'm thinking there may be some kind
> > of issues with APIC, AMD and 2.4.18.
> 
> We don't have ASUS systems but instead a mix of Tyan 2460 and 2466
> systems and see very similar things, including the bizarreness of the
> blind crash problems appearing on one system (consistently and
> repeatably) but not another IDENTICAL system sitting right next to it.

We've run them and they appear to be solid.
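
If you do end up testing the noapic suggestion, it's just a kernel boot
parameter.  A minimal sketch, assuming LILO and Red Hat's 2.4.18-5 SMP
kernel (adjust the image path and root device to match your install):

    # /etc/lilo.conf -- add "noapic" to the SMP kernel stanza
    image=/boot/vmlinuz-2.4.18-5smp
        label=linux-smp
        root=/dev/hda1
        read-only
        append="noapic"

Re-run /sbin/lilo afterward, or for a one-off test just type
"linux-smp noapic" at the LILO boot prompt.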
 
> We have found that power supplies (both the power line itself and the
> switching power supply in the chassis) can make a difference on the
> 2466's -- a marginal power supply is an invitation to problems for
> sure on these beasties.  This is reflected in the completely
> outrageous observation that I have some nodes that will boot and run
> stably when plugged into certain receptacles on the power pole, but
> not other receptacles.  If I put a polarity/circuit tester on the
> receptacles, they pass.  If I check the line voltages, they are
> nominal (120+ VAC).  If I plug any 2466 into them (I tried 3), it
> fails to POST.  If I move the plug two receptacles up on the same pole
> and same circuit, it POSTS, installs, and works fine.  I haven't put
> an oscilloscope on the line when plugging it in, but I'm sure it would
> be fascinating to do so.

I agree here: a 400W+ power supply is needed.  Tyan claims that on the
2466 it isn't; however, if you look at AMD's site, their recommendations
don't support that conclusion.
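
One cheap check while you're at it: if lm_sensors supports the
monitoring chip on these boards (an assumption on my part -- I haven't
verified the 2466 specifically), you can log the voltage rails under
load and see whether a marginal supply is sagging:

    # assumes lm_sensors is installed and sensors-detect has been run
    while true; do date; sensors; sleep 60; done >> /tmp/rails.log

A +5V or +12V line that droops when both CPUs spin up is exactly the
kind of marginal-supply symptom described above.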

> We're also in the process of investigating kernel snapshot
> dependencies and the SMP issues aforementioned as we continue to try
> to stabilize our 2460's, which seem even more sensitive than the
> 2466's (which so far seem to run stably and give decent performance
> overall).  Unfortunately, our crashes occur with a mean time of days
> to a week or two under load in between (consistent with a rare
> interrupt conflict or SMP issue) so it takes a long time to test a
> potential fix.  We did avoid a crash for about 9 days on a 2460
> running 2.4.18-5 (Red Hat's build id) after experiencing crashes on
> the node every 5-10 days, but are only just now accumulating better
> statistics on a group of nodes instead of just the one.
> 
> So overall, I concur -- try different smp kernel releases and
> snapshots, try rearranging the cards (order often seems to matter) and
> bios settings, try noapic (which we should probably also do -- we
> haven't so far) and yes, try rearranging the way the nodes are plugged
> in.  Notice that this is evil and insidious -- you can pull a node
> from a rack and bench it and it will run fine forever, but if you plug
> it back in to the same receptacle when you put it back, it has
> problems.  Maddening.

A few things I'd look at: memory and cooling.  The Athlon MPs, I feel,
must have copper-core heat sinks with a well-matched fan.  I noticed you
didn't mention the case.  If it's a rackmount, make sure there is
adequate space between the case cover and the fan; if not, this could be
the problem.

Look at the memory and verify all the chips are the same.  Some memory
chip sets don't play well together.
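
If the BIOS populates the DMI tables, dmidecode (where available) can
give a first pass at what's in each socket, though pulling the DIMMs and
reading the labels is still the more reliable check:

    # dump the DMI table and pull out the memory device entries
    dmidecode | grep -A 12 "Memory Device"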

You might want to try memtest86, which can be found at
http://www.memtest86.com/
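
Per the memtest86 README, you build the boot image, write it to a
floppy, and boot the node from it -- roughly the following, though the
version number here is just an example:

    tar xzf memtest86-3.0.tar.gz && cd memtest86-3.0
    make
    dd if=memtest.bin of=/dev/fd0 bs=8192   # write boot image to floppy

Let it run overnight; a single full pass is often not enough to catch a
marginal DIMM.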

Another one is CTCS: http://sourceforge.net/projects/va-ctcs/

We use ctcs for a 72-hour burn-in, and it works well at finding hardware
problems.
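
From memory -- check the README in the tarball, since the script names
may differ between releases -- a run looks something like:

    tar xzf ctcs-*.tgz && cd ctcs
    make
    ./newburn    # builds a test control file and starts the burn-in;
                 # watch the output/logs for FAIL lines

We let it grind for the full 72 hours before calling a node good.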

Once you verify that the hardware is in fact solid, I'd just reload the
software from scratch and start over.  In the long run, it's sometimes
quicker.

Hope this helps.

Cheers,

Steve Gaudet 
Linux Solutions Engineer
   ..... 
  <(©¿©)> 
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 161 Abby Rd                fax:603-666-4519                     |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
| toll free:800-573-5393     web: http://www.turbotekcomputer.com |
===================================================================
