[Beowulf] power usage, Intel 5160 vs. AMD 2216
landman at scalableinformatics.com
Fri Jul 13 06:00:10 PDT 2007
Mark Hahn wrote:
>> The 2GB dimms emit the same heat as the 1 GB dimms. So if you have a
>> 1000 node cluster, and you use the larger (slightly more expensive)
>> 2GB dimms vs the 1GB dimms, you will emit somewhat less heat. I
>> haven't done the
> assuming the same number of chips per dimm. if your 1G are single-sided,
> and 2G are double, you save nothing. it's also interesting that for a
> generation chip, the higher-clocked dimms are significantly hotter
> (say, 200 vs 300 mA max draw for 1G pc2/667 vs /800).
I haven't seen too many single-sided DIMMs these days for registered ECC
RAM in x4/x8 flavors. Maybe my horizons are not broad enough :) You
are correct though, I had been assuming the same number of chips.
Though as I understand things ...
> also, I notice that x16 chips dissipate a lot more than x4 or x8, even
> though the chips have the same number of onchip banks. I guess this
> says that the main power issue is driving wide parallel buses at speed...
... the drivers are the power expensive elements.
>> That and few parts means lower absolute number of failures, but that
>> is another issue.
> a very interesting one. I wonder how many people have scrubbing turned
> on in their cluster, and how many use mcelog to monitor the ECC rate.
We do on clusters we ship/build. I specifically run tests to flush
out the memory errors. Sadly, memtest86 only catches the "obvious"
errors; those it will find fairly quickly in most cases. I run
several heavy-duty (electronic structure) codes that pound on memory and
CPU. Using those, we have found many MCE errors that memtest86 misses.
Most of the MCE errors are single-bit ECC errors, triggered more often
by timing and access patterns than by the simple sequential walk through
memory that memtest86 does. Nothing stresses memory like real applications.
Moreover, it is pretty easy to deduce which chip is problematic
(assuming it is RAM) based upon the address. It isn't always RAM;
mcelog has shown us some northbridge/southbridge type errors as well.
CPU 0 4 northbridge TSC 2ce665a9f4c0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = e214
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d40a4001e2080813 MCGSTATUS 0
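As a quick illustration (mine, not part of the original report): the flag
bits mcelog decoded above can be checked against the raw STATUS word
directly, using the same bit positions (32, 46, 62) that mcelog names.
A short Python sketch:

```python
# Sketch: check the flag bits mcelog reported against the raw
# STATUS value from the log above.
status = 0xd40a4001e2080813

def bit_set(value, n):
    """Return True if bit n of value is set."""
    return (value >> n) & 1 == 1

print(bit_set(status, 32))  # bit32 = err cpu0                 -> True
print(bit_set(status, 46))  # bit46 = corrected ecc error      -> True
print(bit_set(status, 62))  # bit62 = error overflow           -> True
```

All three come back True, matching mcelog's decode of that status word.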
You see address 0x117600. With a quick bit of Octave, you can convert
that address to a DIMM pair (you are inserting them in pairs, right?) if
you have bank interleaving on and node interleaving off. The latter
messes up this calculation.
gigabyte = 1073741824
hex2dec("117600")/gigabyte
ans = 0.0010657
which suggests it is in the 0-1 DIMM pair (gigabyte-sized DIMMs). You
can replace one and try it again. I err on the side of replacing both
(the banking impacts the calculation as well).
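For those without Octave handy, the same arithmetic is easy in Python.
This is a minimal sketch under the assumptions stated above (1 GB DIMMs
installed in pairs, bank interleaving on, node interleaving off); the
names DIMM_GB and PAIR_BYTES are mine, not anything standard:

```python
# Assumptions (from the post): 1 GB DIMMs installed in pairs,
# bank interleaving on, node interleaving off.
GIGABYTE = 1073741824            # bytes per GB (2**30)
DIMM_GB = 1                      # assumed capacity of each DIMM
PAIR_BYTES = 2 * DIMM_GB * GIGABYTE  # a pair of 1 GB DIMMs spans 2 GB

addr = 0x117600                  # address from the mcelog report

fraction = addr / GIGABYTE       # the Octave-style calculation
pair = addr // PAIR_BYTES        # index of the DIMM pair holding addr

print(round(fraction, 7))        # -> 0.0010657
print(pair)                      # -> 0, i.e. the first (0-1) DIMM pair
```

The fraction matches the Octave result, and the pair index confirms the
address lands in the first (0-1) pair.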
mcelog is your friend. Install it/use it if possible. Keep a few spare
RAM dimms on hand in a storage locker somewhere for fast swap out.
> thanks, mark.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615