[Beowulf] Node Drop-Off
Vincent Diepeveen
diep at xs4all.nl
Mon Dec 4 08:57:50 PST 2006
To make bad nodes fail quickly, search for BIG primes nonstop with the latest
GMP on the Opterons.
After a few days of nonstop calculation, bad nodes should hang completely
once you hit the worst-case path.
Most people use prime95 for this kind of thing, but it relies exclusively on
"Intel" assembly SIMD code,
and the CPUs seem quite well tested/fixed for SIMD.
It seems to me that a mix of integer multiplication and modulo is the
worst-case path on those Opteron chips.
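
As a rough illustration of that kind of load, here is a minimal sketch in
Python. It is not the GMP run described above: it uses Python's built-in big
integers, and the names lucas_lehmer, KNOWN and burn_in are made up for this
example. It repeats Lucas-Lehmer tests, which are nothing but big-integer
squaring and modulo, on exponents whose outcome is already known, so a wrong
answer points at the hardware (or the software stack) rather than the math.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime. p must be an odd prime."""
    m = (1 << p) - 1
    s = 4
    # p - 2 rounds of big-integer squaring followed by a modulo.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


# Exponents with well-known outcomes: True means 2**p - 1 is a Mersenne prime.
KNOWN = {521: True, 607: True, 1277: False, 1279: True, 2203: True}


def burn_in(rounds: int = 1) -> int:
    """Repeat the known tests and return how many results were wrong."""
    mismatches = 0
    for _ in range(rounds):
        for p, expected in KNOWN.items():
            if lucas_lehmer(p) != expected:
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Hypothetical usage: raise rounds and start one process per core.
    print("mismatches:", burn_in(rounds=1))
```

On a healthy node burn_in() should always return 0. A flawed CPU may just as
well hang or crash instead of returning a wrong result, so the run also needs
to be watched from outside, for example by checking that it keeps printing.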
It is of course also possible that it hits a bug in the Linux kernel, because
my own code doesn't hang at all when running on 4 cores, whereas GMP did
exactly that.
My code currently runs fastest on Windows, as it doesn't have inline
assembly for Linux yet.
I wouldn't rule out that the Linux kernel simply has bugs there. The testing
of those kernels is totally amateurish.
Vincent
----- Original Message -----
From: "Tim Moore" <twm at tcg-hsv.com>
To: <beowulf at beowulf.org>
Sent: Monday, December 04, 2006 4:15 PM
Subject: Re: [Beowulf] Node Drop-Off
> Update to node drop-off:
>
> I wrote a few weeks ago to ask about node drop-off. A quick note...I
> had a cluster run for 3 years without failure and I upgraded the Opteron
> 240 CPUs to 250s. The upgrade required a BIOS upgrade and while I was
> at it, upgraded the OS and security. Some readers provided good
> suggestions for diagnosis. As it turned out, of the 16 CPU batch...two
> were flawed. No success was derived from replacing power supplies, HDD,
> resetting memory and the cooling solution. The CPU flaw only manifested
> itself (at first) after several hours of CPU usage. With each failure,
> the time duration shortened before the next failure and by the time I
> figured it out it was down to about 2 minutes.
>
> The AMD engineer with whom I talked was amazed that such CPUs made it
> beyond quality control. He also suggested that the vendor may have
> inadvertently mixed returned processors (previously determined to be
> flawed) with the new ones and sent them out (again) as new.
>
> Just for future reference...is there an easy way to determine if a CPU
> is flawed without 2 weeks of down time and extensive hair extraction?
>
> Tim