[Beowulf] Node Drop-Off

Tim Moore twm at tcg-hsv.com
Mon Dec 4 07:15:34 PST 2006

Update to node drop-off:

I wrote a few weeks ago to ask about node drop-off.  A quick recap: I 
had a cluster that ran for 3 years without failure, and then I upgraded 
the Opteron 240 CPUs to 250s.  The upgrade required a BIOS upgrade and, 
while I was at it, I also upgraded the OS and security.  Some readers 
provided good suggestions for diagnosis.  As it turned out, two of the 
batch of 16 CPUs were flawed.  Replacing the power supplies and HDDs did 
not help, nor did resetting the memory and the cooling solution.  The 
CPU flaw only manifested itself (at first) after several hours of CPU 
usage.  With each failure, the time before the next failure shortened, 
and by the time I figured it out it was down to about 2 minutes.

The AMD engineer with whom I talked was amazed that such CPUs made it 
past quality control.  He also suggested that the vendor may have 
inadvertently mixed returned processors (previously determined to be 
flawed) with the new ones and sent them out (again) as new.

Just for future reference...is there an easy way to determine if a CPU 
is flawed without 2 weeks of down time and extensive hair extraction?
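
One possible approach, offered only as a rough sketch and not as a 
proven diagnostic: pin a deterministic workload to each core in turn, 
repeat it many times, and flag any core whose results ever disagree 
with its own first result.  The names workload, check_core and scan are 
made up for illustration, the core pinning relies on 
os.sched_setaffinity (Linux only), and since the flaw described above 
took hours to show up at first, the pass count would presumably need to 
be large.  A node that hangs outright will not report a mismatch, so 
the last core printed before the hang is the clue in that case.

```python
import hashlib
import os
import struct


def workload(seed: int, rounds: int = 20000) -> str:
    """Deterministic mix of floating-point and integer work, returned as a digest."""
    h = hashlib.sha256()
    x = float(seed) + 0.5
    n = seed & 0xFFFFFFFF
    for _ in range(rounds):
        x = (x * 1.000001 + 1.0) % 1000.0          # floating-point path
        n = (n * 1103515245 + 12345) & 0xFFFFFFFF  # integer path
        h.update(struct.pack("<dI", x, n))
    return h.hexdigest()


def check_core(core: int, passes: int, rounds: int = 20000) -> int:
    """Pin this process to one core (where supported) and count inconsistent passes."""
    previous = None
    if hasattr(os, "sched_setaffinity"):  # Linux only; skipped elsewhere
        previous = os.sched_getaffinity(0)
        os.sched_setaffinity(0, {core})
    try:
        # The reference comes from the same core, so this detects
        # inconsistency between runs, not a consistently wrong answer.
        reference = workload(core, rounds)
        mismatches = 0
        for _ in range(passes):
            if workload(core, rounds) != reference:
                mismatches += 1
        return mismatches
    finally:
        if previous is not None:
            os.sched_setaffinity(0, previous)  # restore the original affinity


def scan(passes: int = 5, rounds: int = 20000) -> dict:
    """Check every core this process may run on; returns {core: mismatch count}."""
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = [0]  # no affinity support: a single unpinned check
    results = {}
    for core in cores:
        print("checking core", core, flush=True)  # last line seen before a hang
        results[core] = check_core(core, passes, rounds)
    return results
```

Calling scan(passes=100000) on each node and looking for non-zero 
counts (or for the node that never finishes) would be the idea.  A 
clean run does not prove a CPU is good; it only fails to show a fault 
under this particular load.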

