[Beowulf] Odd AMD quad core SuperMicro power off issues
csamuel at vpac.org
Thu Jul 2 22:17:17 PDT 2009
----- "Chris Samuel" <csamuel at vpac.org> wrote:
In April I wrote:
> Well we've been gradually replacing the Barcelona chips
> with Shanghai (same clockspeed) and we are yet to see a
> power off on a Shanghai node!
Since I wrote that we have seen far fewer with 2.3GHz
Shanghai (2376, a 75W part), *but* we have some nodes
upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W
part) which do exhibit this issue very regularly! :-(
Gaussian is still a classic for doing this, but we've
also been able to trigger it with VASP, Amber and (far
less frequently) InterProScan.
The compute nodes are based on SuperMicro H8DM8-2 boards
with 32GB of ECC RAM. The boxes are running CentOS 5.3
with mainline kernels (currently 18.104.22.168, though we
have demonstrated it with 2.6.30-rc6 and the EDAC patches
which catch nothing before it dies). We've seen the
same behaviour with the standard CentOS kernels too.
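
For anyone wanting to compare notes, here is a minimal sketch of how the
EDAC counters could be polled on a node before it powers off. It assumes
the EDAC driver exposes the usual sysfs layout under
/sys/devices/system/edac/mc (mcN/ce_count and mcN/ue_count). The function
name read_edac_counts and its base argument are made up for illustration
and are not part of any tool we run.

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    Assumes the standard EDAC sysfs layout; a counter that is missing
    or unreadable is reported as None rather than raising.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    # Print corrected (ce) and uncorrected (ue) error counts per controller.
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

Run from cron or a loop and logged to a remote host, this would at least
show whether any corrected errors accumulate before a node goes down. If
no memory controller directories are present, the function simply returns
an empty dict.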
This is driving us up the wall!
Is nobody else seeing this?
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency