[Beowulf] power usage, Intel 5160 vs. AMD 2216

Fri Jul 13 06:00:10 PDT 2007

Mark Hahn wrote:
>> The 2GB dimms emit the same heat as the 1 GB dimms.  So if you have a 
>> 1000 node cluster, and you use the larger (slightly more expensive) 
>> 2GB dimms vs the 1GB dimms, you will emit somewhat less heat.  I 
>> haven't done the
> 
> assuming the same number of chips per dimm.  if your 1G are single-sided,
> and 2G are double, you save nothing.  it's also interesting that for a
> given generation chip, the higher-clocked dimms are significantly hotter
> (say, 200 vs 300 mA max draw for 1G pc2/667 vs /800).

I haven't seen too many single sided DIMMs these days for registered ECC 
ram in x4/x8 flavors.  Maybe my horizons are not broad enough :)  You 
are correct though, I had been assuming the same number of chips. 
Though as I understand things ...

> 
> also, I notice that x16 chips dissipate a lot more than x4 or x8, even
> though the chips have the same number of onchip banks.  I guess this 
> says that the main power issue is driving wide parallel buses at speed...

... the drivers are the power expensive elements.

>> That and fewer parts means a lower absolute number of failures, but that 
>> is another issue.
> 
> a very interesting one.  I wonder how many people have scrubbing turned 
> on in their cluster, and how many use mcelog to monitor the ECC rate.  

We do on clusters we ship/build.  I specifically run tests to flush 
out the memory errors.  Sadly, memtest86 only gets the "obvious" errors, 
though in most cases it catches those fairly quickly.  I run 
several heavy duty (electronic structure) codes that pound on memory and 
CPU.  Using those, we have found many mce errors that memtest86 misses. 
Most of the mce errors are single bit ecc errors, triggered more often by 
timing and access patterns than by a simple sequential walk through memory 
(memtest86).  Nothing stresses memory like real applications.

Moreover, it is pretty easy to deduce which chip is problematic 
(assuming it is ram) based upon the address.  It isn't always ram, though: 
mcelog has shown us some northbridge/southbridge type errors as well.

From this

MCE 0
CPU 0 4 northbridge TSC 2ce665a9f4c0
ADDR 117600
   Northbridge Chipkill ECC error
   Chipkill ECC syndrome = e214
        bit32 = err cpu0
        bit46 = corrected ecc error
        bit62 = error overflow (multiple errors)
   bus error 'local node origin, request didn't time out
       generic read mem transaction
       memory access, level generic'
STATUS d40a4001e2080813 MCGSTATUS 0

you see address 0x117600.  With a quick bit of Octave, you can convert 
that address to a DIMM pair (you are inserting them in pairs, right?) if 
you have bank interleaving on, and node interleaving off.  Turning node 
interleaving on messes up this calculation.

octave:1> gigabyte=1024*1024*1024
gigabyte = 1073741824
octave:2> 0x117600/gigabyte
ans = 0.0010657

which suggests it is in the 0-1 DIMM pair (gigabyte sized dimms).  You 
can replace one, and try it again.  I err on the side of replacing both 
(the banking impacts the calculation as well).
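The same arithmetic can be wrapped in a small helper.  This is only a 
rough sketch in Python, not a tool we ship.  It assumes DIMMs are 
installed in pairs, that each pair covers one contiguous address range 
twice the size of a single DIMM, and that node interleaving is off.  The 
function name and its parameters are made up for illustration, and the 
real mapping depends on your board and BIOS settings.

```python
GIB = 1024 ** 3  # one gigabyte, same constant as in the Octave session


def dimm_pair_for_address(addr, dimm_bytes=GIB, dimms_per_group=2):
    """Guess which group of DIMM slots a physical address falls in.

    Assumption (not from any vendor documentation): groups are mapped
    back to back, each covering dimm_bytes * dimms_per_group bytes,
    with node interleaving disabled.
    """
    group_bytes = dimm_bytes * dimms_per_group
    group = addr // group_bytes            # 0 for the first pair, 1 for the next
    first_slot = group * dimms_per_group   # e.g. slots 0-1, 2-3, 4-5
    return first_slot, first_slot + dimms_per_group - 1


if __name__ == "__main__":
    addr = 0x117600                # ADDR field from the mcelog record
    print(addr / GIB)              # fraction of a gigabyte into memory
    print(dimm_pair_for_address(addr))
```

Treat the answer as a hint about where to start swapping, not as proof of 
which DIMM is bad.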

> comments?

mcelog is your friend.  Install it/use it if possible.  Keep a few spare 
RAM dimms on hand in a storage locker somewhere for fast swap out.
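If you save the mcelog output to a file, counting how often each address 
shows up is a quick way to spot a DIMM that is going bad.  Below is a 
minimal sketch in Python.  It assumes records look like the one quoted 
earlier in this message, with a line of the form "ADDR <hex>", and other 
mcelog versions may print something different.  The function name and the 
sample text are illustrative only.

```python
import re
from collections import Counter

# Shortened copy of the record quoted in this message, repeated twice
# to stand in for a longer log.
SAMPLE = """\
MCE 0
CPU 0 4 northbridge TSC 2ce665a9f4c0
ADDR 117600
   Northbridge Chipkill ECC error
MCE 1
CPU 0 4 northbridge TSC 2ce665a9f4c0
ADDR 117600
   Northbridge Chipkill ECC error
"""

ADDR_LINE = re.compile(r"\s*ADDR\s+([0-9a-fA-F]+)")


def tally_addresses(text):
    """Count how many records report each ADDR value (parsed as hex)."""
    counts = Counter()
    for line in text.splitlines():
        match = ADDR_LINE.match(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


if __name__ == "__main__":
    for addr, hits in tally_addresses(SAMPLE).most_common():
        print(hex(addr), hits)
```

An address (or a tight range of addresses) that keeps coming back is the 
one to chase first.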

> 
> thanks, mark.

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615