[Beowulf] Tyan S2882

Fri Sep 29 04:32:29 PDT 2006

Hi,

thank you very much for your reply!

On Thursday 28 September 2006 16:17, you wrote:
> > * Dual AMD Opteron DP270 (2.0 GHz)
>
> which rev?

How can I find that out? Is

# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2
cpu MHz         : 1992.624
cache size      : 1024 KB

the information you are looking for?

I remember that someone mentioned an "E*" stepping, but I'm not sure about
that.
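
For what it's worth, the family/model/stepping fields above can be packed
back into the raw CPUID signature (function 1, EAX), which is the value that
processor revision guides are usually indexed by. The following is only a
minimal sketch: the helper names are made up for illustration, and it assumes
the AMD convention that the extended model bits are used when the base family
is 0xF. It does not map the signature to a revision name; that still has to
be looked up in AMD's revision guide.

```python
# Fields copied from the /proc/cpuinfo output quoted above.
SAMPLE = """\
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2
"""


def parse_first_cpu(text):
    """Return the key/value fields of the first processor block (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def cpuid_signature(family, model, stepping):
    """Rebuild the CPUID function-1 EAX signature from the displayed values."""
    base_family = min(family, 0xF)
    ext_family = family - base_family
    if base_family == 0xF:
        # Assumption: AMD convention, displayed model = (ext_model << 4) | base_model
        base_model = model & 0xF
        ext_model = model >> 4
    else:
        base_model, ext_model = model & 0xF, 0
    return ((ext_family << 20) | (ext_model << 16) | (base_family << 8)
            | (base_model << 4) | (stepping & 0xF))


def signature_from_cpuinfo(text):
    f = parse_first_cpu(text)
    return cpuid_signature(int(f["cpu family"]), int(f["model"]), int(f["stepping"]))


if __name__ == "__main__":
    print(hex(signature_from_cpuinfo(SAMPLE)))  # 0x20f12
```

On a node itself, passing open("/proc/cpuinfo").read() instead of SAMPLE
should give the same kind of value.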

> > * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung
> > CM72SD1024RLP-3200/SB ( 12 nodes have 8*2GB)
>
> this dimm is 2-rank, I believe; corsair's datasheet is pretty lame.
> that means that each bank of memory is 4x2=8 ranks.  that's definitely
> pushing the limit; I'm sure it can be done in some cases, but it's
> definitely not supported by some rev's of the opteron, and will always be
> pretty bleeding-edge.

http://www.tyan.com/support/html/memory_s2882d.html lists

1 GB     DDR RAM (reg., ECC) Samsung CM72SD1024RLP-3200/S

as  "Recommended PC3200 (DDR 400) Memory Modules".
(I don't know whether the difference between /SB and /S is significant;
the /S stands for Samsung.)

They have eighteen 64Mx8 DDR SDRAM units. 
The BIOS lowers the memory bus speed to 166 MHz when all (2x4) memory slots
are populated.
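
The chip count is consistent with the 2-rank guess. As a small sanity check
(a sketch only, assuming the usual 72-bit wide registered ECC layout of 64
data bits plus 8 ECC bits per rank, and 4 DIMMs per CPU; the function name is
just for illustration):

```python
def dimm_layout(chips_per_dimm=18, chip_depth_m=64, chip_width_bits=8,
                data_bits=64, ecc_bits=8, dimms_per_cpu=4):
    """Derive ranks and capacity from the chip organisation of one DIMM."""
    rank_width = data_bits + ecc_bits                   # 72 bits per rank
    chips_per_rank = rank_width // chip_width_bits      # x8 chips needed per rank
    ranks_per_dimm = chips_per_dimm // chips_per_rank
    # Only the data share of the chips counts towards usable capacity.
    data_chips = chips_per_dimm * data_bits // rank_width
    capacity_mb = data_chips * chip_depth_m * chip_width_bits // 8
    return {
        "chips_per_rank": chips_per_rank,
        "ranks_per_dimm": ranks_per_dimm,
        "ranks_per_cpu": ranks_per_dimm * dimms_per_cpu,
        "capacity_mb": capacity_mb,
    }


if __name__ == "__main__":
    print(dimm_layout())
    # {'chips_per_rank': 9, 'ranks_per_dimm': 2, 'ranks_per_cpu': 8, 'capacity_mb': 1024}
```

So eighteen 64Mx8 chips give 2 ranks of 9 chips each and 1 GB of usable
capacity per module, i.e. 8 ranks per CPU with four modules installed.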

> try running mcelog periodically; I bet you see lots of corrected ECC's.

I'm already doing so. About 10-20 nodes show corrected ECC errors at a rate
of about 1-100 events/week. It seems that the latest BIOS upgrade reduced the
ECC error rates. We have both nodes that have never crashed but suffer from
corrected ECC errors, and nodes that have crashed but never had a single
corrected ECC error.

Most ECC errors and crashes are hard to reproduce. Sometimes the ECC error
rate suddenly drops to zero (under the same mix of test jobs). Sometimes
it helped just to pull out the memory modules and reinsert them (into the
same slots).

>
> IMO, the config is the problem, not the boards, cpus, dimms, etc.
>
> > It seems that the last BIOS upgrade has reduced the ECC error rate
> > somewhat.

> it would be interesting to know whether derating the clock (->pc2700)
> helps this situation more or less than derating the latency.

It is very difficult to test that since we cannot trigger the crashes
reliably. The cluster has now been running stably for more than a week.
If I slowed down the memory bus speed, it would take months
to reach a statistically significant conclusion. On the other hand, an
average rate of 2 crashes per week is rather annoying.
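
To put a rough number on "months": the following back-of-the-envelope sketch
assumes that crashes arrive as a Poisson process with a constant rate (which
our nodes do not strictly obey, since the rates drift on their own) and uses
a crude z-sigma criterion of my own choosing, not a proper power analysis.
The function name and the example rates other than 2/week are hypothetical.

```python
def weeks_to_distinguish(old_rate, new_rate, z=2.0):
    """Weeks of observation until the expected drop in crash count
    exceeds z standard deviations of the count at the old rate.

    Expected drop after t weeks:  (old_rate - new_rate) * t
    Poisson std. dev. at old rate: sqrt(old_rate * t)
    Solving drop >= z * std. dev. for t gives the expression below.
    """
    if new_rate >= old_rate:
        raise ValueError("new_rate must be lower than old_rate")
    return z * z * old_rate / (old_rate - new_rate) ** 2


if __name__ == "__main__":
    # Baseline from this cluster: about 2 crashes per week.
    for new_rate in (1.5, 1.0, 0.0):
        weeks = weeks_to_distinguish(2.0, new_rate)
        print(f"2.0 -> {new_rate} crashes/week: about {weeks:.0f} weeks")
```

Under these assumptions, a change that halves the crash rate needs about 8
weeks to show up, and a 25% reduction about 32 weeks, for each setting that
is tried. Only a change that removes the crashes completely would be visible
within a couple of weeks.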

Cheers, Thomas