[Beowulf] Errors on IBM e325

Jeff Layton jeffrey.b.layton at lmco.com
Mon Jun 28 09:50:41 PDT 2004

Michael Will wrote:

>Was this not tested before it was deployed? Or is it a problem that only
>recently developed?

Well, supposedly it was tested before deployment. We're seeing these
errors (among others) on a number of nodes at random times. :(

>It sounds similar to http://lists.suse.com/archive/suse-amd64/2003-Sep/0063.html 
>suggesting that you should make sure that you run the latest kernel, and if the problem
>persists it is a case for your service contract (i.e. broken hardware).

Well, I hate to say it, but it's not SuSE. It's the other guy :) The
kernel is only 2.4.21 but has been patched quite a bit. The NUMA
patches are in there, but not built into the binary kernel. I'm not sure
if we will continue to get support if we rebuild the kernel with
NUMA activated (our IT people require support at all times).
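
One way to confirm what the binary kernel was built with is to look at its
saved build configuration, if the distribution ships one. The following is
only a rough sketch: the function name is made up for illustration, the
sample text is invented, and CONFIG_EXAMPLE_NUMA_OPTION is a placeholder
(the real option names depend on the kernel tree). It relies only on the
usual kernel config format, where an enabled option reads `CONFIG_X=y` and
a disabled one reads `# CONFIG_X is not set`.

```python
import re

# Illustrative config text only; not taken from the cluster's kernel.
SAMPLE_CONFIG = """\
CONFIG_SMP=y
# CONFIG_NUMA is not set
CONFIG_EXAMPLE_NUMA_OPTION=y
"""

SET_RE = re.compile(r"^(CONFIG_\w*NUMA\w*)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w*NUMA\w*) is not set$")


def numa_options(config_text):
    """Map every config option whose name contains NUMA to its value.

    Options recorded as '# CONFIG_X is not set' map to None.
    """
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        m = SET_RE.match(line)
        if m:
            found[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            found[m.group(1)] = None
    return found


if __name__ == "__main__":
    for name, value in numa_options(SAMPLE_CONFIG).items():
        print(name, "=", value if value is not None else "not set")
```

On a real node the text would come from wherever the distribution keeps the
kernel config (often a file under /boot, but that location is an assumption).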

>also see http://www.cs.caltech.edu/~weixl/research/fast-mon/arch/x86_64/kernel/bluesmoke.c

I'll try this code to see what it finds out.



>Michael Will
>On Friday 25 June 2004 08:21 am, Jeff Layton wrote:
>>Good morning,
>>   We've got a shiny new IBM cluster with e325 nodes (Opteron).
>>However, we're having some trouble with a number of nodes.
>>We keep getting 'GART' errors showing up in the logs. Here is
>>an example,
>>Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
>>Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
>>Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
>>Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
>>Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
>>Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error
>>   Does anybody have any ideas what the cause might be?
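
Since the errors show up on several nodes at random times, it may help to
tally them per node before opening a service call. Below is a minimal sketch,
assuming syslog lines shaped like the ones quoted above (timestamp, host,
"kernel:", message). The function name and regular expression are
illustrative choices, and the sample is just the quoted lines with the
quote markers removed.

```python
import re
from collections import Counter, defaultdict

# The log lines quoted above, with the ">>" quote markers stripped.
SAMPLE = """\
Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error
"""

# "Mon DD HH:MM:SS host kernel: message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<node>\S+)\s+kernel:\s+(?P<msg>.*\S)\s*$"
)


def tally_kernel_errors(lines):
    """Return {node: Counter({message: count})} for kernel log lines."""
    per_node = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            per_node[m.group("node")][m.group("msg")] += 1
    return dict(per_node)


if __name__ == "__main__":
    for node, counts in sorted(tally_kernel_errors(SAMPLE.splitlines()).items()):
        for msg, n in counts.most_common():
            print(f"{node}: {n} x {msg}")
```

Fed the collected logs from all nodes, a tally like this would show whether
the GART and ECC messages cluster on a few machines or are spread evenly.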

Dr. Jeff Layton
Aerodynamics and CFD
Lockheed-Martin Aeronautical Company - Marietta
