[Beowulf] Errors on IBM e325

Joe Landman landman at scalableinformatics.com
Mon Jun 28 11:21:34 PDT 2004


On Fri, 2004-06-25 at 11:21, Jeff Layton wrote:
> Good morning,
> 
>    We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
> 
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error

Does booting with iommu=off help?
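
It would also help to know whether the errors cluster on a few nodes. A rough sketch like the one below could tally kernel messages per host. It assumes syslog lines in the same layout as the ones you quoted; the function name `tally_kernel_errors` and the sample list are purely illustrative.

```python
import re
from collections import Counter, defaultdict

# Assumed syslog layout: "Mon DD HH:MM:SS host kernel: message"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+(.*)$")

def tally_kernel_errors(lines):
    """Return {host: Counter({message: count})} for kernel log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            host, message = m.group(1), m.group(2).strip()
            counts[host][message] += 1
    return counts

# Sample lines copied from the log excerpt quoted above.
sample = [
    "Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error",
    "Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error",
    "Jun 21 07:07:42 c3n32.cluster kernel: GART error 3",
    "Jun 21 07:40:52 c1n4.cluster kernel: GART error 3",
    "Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error",
    "Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error",
]

for host, messages in sorted(tally_kernel_errors(sample).items()):
    for message, n in messages.items():
        print(host, n, message)
```

Feeding it the full log instead of the sample should show whether the GART and ECC messages come from the same handful of nodes.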

> 
> 
>    Does anybody have any ideas what the cause might be?

The e325s have an onboard ATI VGA chip.  Last I checked it was PCI-based
(I don't have a unit here to check).  There was some discussion of
GART-related issues on the Red Hat amd64 list:
https://www.redhat.com/archives/amd64-list/2004-May/date.html
Which kernel are you running, how much memory is in each node, and how
is it distributed?  I have noticed that some vendors do not configure
the memory on Opteron systems correctly, though I would expect the IBM
folks not to have a problem with this.
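
On the "how is it distributed" question, a sketch along these lines could report the memory attached to each NUMA node. It assumes a kernel that exposes per-node files under /sys/devices/system/node/ with lines of the form "Node 0 MemTotal: ... kB"; whether your kernel provides them is something to verify. The function names are hypothetical.

```python
import glob
import re

# Assumed per-node meminfo line: "Node 0 MemTotal:  2097152 kB"
MEMTOTAL_RE = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")

def parse_node_memtotal(text):
    """Return {node_id: MemTotal in kB} parsed from per-node meminfo text."""
    return {int(node): int(kb) for node, kb in MEMTOTAL_RE.findall(text)}

def read_numa_layout(pattern="/sys/devices/system/node/node*/meminfo"):
    """Collect MemTotal for every node file matching the (assumed) sysfs path."""
    layout = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            layout.update(parse_node_memtotal(fh.read()))
    return layout

if __name__ == "__main__":
    # Prints nothing if the sysfs files are absent on this machine.
    for node, kb in sorted(read_numa_layout().items()):
        print("node %d: %.1f GB" % (node, kb / 1048576.0))
```

A node reporting zero, or far less than its neighbour, would suggest the DIMMs are all hanging off one CPU.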

There are also some BIOS settings on the e325 that directly affect
memory layout, NUMA use, etc.  Of course, I don't remember what they
are :(.

Joe

> 
> Thanks!
> 
> Jeff
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



