[Beowulf] Errors on IBM e325

Michael Will mwill at penguincomputing.com
Mon Jun 28 09:39:12 PDT 2004


Was this not tested before it was deployed? Or is it a problem that only
recently developed?

It sounds similar to the report at
http://lists.suse.com/archive/suse-amd64/2003-Sep/0063.html
which suggests making sure that you run the latest kernel. If the problem
persists, it is a case for your service contract (i.e. the hardware is broken).

Also see http://www.cs.caltech.edu/~weixl/research/fast-mon/arch/x86_64/kernel/bluesmoke.c
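
Before calling service, it may help to see whether the messages pile up on
a few nodes or are spread across the whole cluster. Here is a minimal sketch
that tallies them per host. It assumes the syslog layout of the lines quoted
below (timestamp, hostname, "kernel:", message); the function name and the
list of substrings are my own choices for illustration, not anything taken
from the kernel source.

```python
import re
from collections import Counter, defaultdict

# Assumed syslog layout, based on the quoted sample lines:
#   "Jun 21 07:07:42 c3n32.cluster kernel: GART error 3"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+kernel:\s+(.*\S)\s*$")

# Substrings to look for; all of them appear in the quoted log excerpt.
PATTERNS = ("northbridge error", "GART error", "ecc error")


def tally_errors(lines):
    """Return {hostname: Counter({pattern: count})} for matching kernel lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a kernel line in the assumed layout
        host, message = match.group(2), match.group(3)
        for pattern in PATTERNS:
            if pattern in message:
                counts[host][pattern] += 1
    return counts
```

Feed it the lines of your collected syslog file (for example
tally_errors(open(path)), with path pointing at wherever your cluster gathers
its logs) and print the result per host. If a handful of nodes account for
most of the hits, those would be the ones I would bring to the service
contract first.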

Michael Will
On Friday 25 June 2004 08:21 am, Jeff Layton wrote:
> Good morning,
> 
>    We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
> 
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error
> 
> 
>    Does anybody have any ideas what the cause might be?
> 
> Thanks!
> 
> Jeff
> 

-- 
Michael Will, Linux Sales Engineer
NEWS: We have moved to a larger iceberg :-)
NEWS: 300 California St., San Francisco, CA.
Tel:  415-954-2822  Toll Free:  888-PENGUIN
Fax:  415-954-2899 
www.penguincomputing.com



