[Beowulf] Errors on IBM e325
Michael Will
mwill at penguincomputing.com
Mon Jun 28 09:39:12 PDT 2004
Was this not tested before it was deployed? Or is it a problem that only
recently developed?
It sounds similar to http://lists.suse.com/archive/suse-amd64/2003-Sep/0063.html,
which suggests making sure that you run the latest kernel. If the problem
persists, it is a case for your service contract (i.e. broken hardware).
Also see http://www.cs.caltech.edu/~weixl/research/fast-mon/arch/x86_64/kernel/bluesmoke.c
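
Before calling in the service contract, it may help to tally the messages per
node, to see whether the errors cluster on a few machines. A minimal sketch in
Python, assuming plain syslog lines shaped like the ones quoted below; the
function name `tally` and the pattern list are illustrative only and not part
of any existing tool:

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+(.*)$")

# Substrings taken from the messages quoted in this thread (assumption:
# other kernel versions may word these messages differently).
PATTERNS = ("GART error", "northbridge error", "ecc error")


def tally(lines):
    """Count matching kernel messages per (host, pattern) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel syslog line in the expected shape
        host, msg = m.groups()
        for pattern in PATTERNS:
            if pattern in msg:
                counts[(host, pattern)] += 1
    return counts


def report(counts):
    """Format the tally as tab-separated lines, sorted by host and pattern."""
    return [f"{host}\t{pattern}\t{n}"
            for (host, pattern), n in sorted(counts.items())]
```

Something like `print("\n".join(report(tally(open(path)))))` would then print
one line per node and error type, where `path` is wherever your syslog daemon
collects the cluster's kernel messages. Nodes that keep showing up with ECC
errors are the first candidates for a hardware call.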
Michael Will
On Friday 25 June 2004 08:21 am, Jeff Layton wrote:
> Good morning,
>
> We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
>
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel: extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel: corrected ecc error
>
>
> Does anybody have any ideas what the cause might be?
>
> Thanks!
>
> Jeff
>
--
Michael Will, Linux Sales Engineer
NEWS: We have moved to a larger iceberg :-)
NEWS: 300 California St., San Francisco, CA.
Tel: 415-954-2822 Toll Free: 888-PENGUIN
Fax: 415-954-2899
www.penguincomputing.com