[Beowulf] cheap PCs this christmas

Tony Travis ajt at rri.sari.ac.uk
Mon Nov 14 12:45:08 PST 2005


Douglas Eadline wrote:
> [...]
> I read your experience with interest. Here are some of my experiences
> working with COTS memory. The value cluster
> (http://clustermonkey.net//content/view/41/29/) Jeff Layton and I built uses
> non-ECC PC2700 memory. Aware that not all memory is the same, I purchased
> Infineon memory and ran Memtest86 (http://www.memtest86.com) on each node
> for at least 4 hours. I found no problems with any of the memory running
> Memtest86. I have run the system quite hard on several occasions. I have
> run the NAS suite (which is self checking) and never had an error. I also
> ran HPL a "whole lot" and never had an issue with bad residuals (while HPL
> is not self checking, a memory problem might cause a bad residual). So on
> Kronos I have a certain level of confidence that the memory is sound.

Hello, Doug.

I was, initially, confident that the systems we bought were reliable and, 
as you did, I tested them using "memtest86/memtest86+" - most of them 
passed this test, and I got new memory from the vendor to replace the 
memory in systems that failed the burn-in test.

However, we soon noticed simulations that were producing odd results, and 
when I investigated it looked like faulty memory. Few of the faults were 
picked up, even running memtest86+ overnight. The picture was different 
when I used "memtester" under Linux (openMosix): I was able to identify 
faulty RAM in all the machines producing suspect results...
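
For anyone unfamiliar with the approach, a user-space memory test boils 
down to writing known patterns into a large buffer and reading them back. 
The Python sketch below only illustrates that idea; it is not memtester's 
actual algorithm, and the function names, patterns and buffer size are 
made up for the example.

```python
def fill(buf, pattern):
    """Overwrite every byte of buf with the given byte pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def find_mismatches(buf, pattern):
    """Return (offset, value_read) for every byte that differs from pattern."""
    return [(off, got) for off, got in enumerate(buf) if got != pattern]


def pattern_test(size_bytes=1 << 20, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern across a buffer and collect any read-back errors.

    Returns a list of (offset, expected, value_read) tuples; an empty list
    means every byte read back as written.
    """
    buf = bytearray(size_bytes)
    errors = []
    for p in patterns:
        fill(buf, p)
        errors.extend((off, p, got) for off, got in find_mismatches(buf, p))
    return errors


if __name__ == "__main__":
    bad = pattern_test()
    print("mismatches:", len(bad))
```

A toy like this has obvious limits: it touches only the small region the 
interpreter happens to be given, and reads may well be served from cache 
rather than from the DIMM. A real tester has to work much harder, covering 
most of free RAM and running many more patterns over many passes.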

> I also know that the possibility of errors does exist and that without
> ECC, you are living a bit on the dangerous side, but I think with any
> cluster you get certain feel for when things are not right.

That's true, and it's the reason I started to investigate the problem.

> I have also found that leading edge memory also seems to be the least
> stable and if you wait a while (6 months?) the memory seems to get more
> stable. Plus, there are DIMMS and then there are DIMMS. Companies can make
> "junk" DIMMS and sell them in the Windows Market because users have the
> expectation that the system is not stable and when problems occur they just
> reboot and everything is fine.

I know, you get what you pay for...

> Now, about your experience. I am curious:
> 
> Were the systems prebuilt with memory installed or did you buy the
> components yourself?

Built the first eight systems myself, and they all work perfectly ;-)
	Athlon XP2400+/DDR266, 1GB RAM, 40GB IDE disk, 2@100Base-T NIC

Bought 24 nodes pre-built by a company that sells Beowulf clusters - OK!
	Athlon XP2400+/DDR266, 1GB RAM, 40GB IDE disk, 2@100Base-T NIC

Bought 32 nodes at £287 each from a different company - memory problems
	Athlon XP2600+/DDR333, 1GB RAM, 40GB IDE disk, 100Base-T+Gb NIC

> If you bought them yourself, did you buy name brand DIMMS or the cheapest
> you could find (or somewhere in between)?

I prefer to use brand-name memory, but cost was a factor when we bought 
the last 32 nodes. To be fair to the company, we got quite a lot for our 
money, but we ordered slimline cases and had CPU temperature problems. 
Moving the cluster into another room and upgrading the air conditioning 
failed to solve them(!), so I had to replace all the CPU coolers and fit 
twin 'server grade' 40mm case cooling fans.

> Are you mixing DIMMs from two vendors?

No, but I have twinned up DIMMs where the commercially built systems 
had different memory. I also replaced the DDR400 memory they had fitted 
in some of the machines with 'Kingston' DDR333 memory because I don't 
think the (Abit) motherboards run reliably using DDR400 memory.

> Have you changed any memory settings in BIOS?

Setup defaults on all machines, except for PXE boot and power on after 
AC loss (to reboot automatically).

> If you run memtest86 on the DIMMS for several hours, do they show problems
> later on or is this an indication that they will work in the future?

No problems running memtest86 on any of the nodes: All pass these tests.

> Finally, one of the problems with value PC systems from Walmart is that
> you have no control of what is inside. Having purchased several of these
> to play with, I found that every system could be different unless they
> were ordered at the same time and even then there were slight differences.
> I found that for about the same cost (and some screw driver time) you can
> build systems with a higher level of quality and reproducibility.

You're right, but time *is* a factor - I won't press another vendor as 
hard to bring the price down after this experience: I think there is a 
lesson in that for many of us. We got three quotes and took the 
cheapest, but it cost me a lot of time to fix all the problems that 
resulted from that decision. Anyway, I learned a valuable lesson :-)

Best wishes,

	Tony.
-- 
Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687


