[Beowulf] Odd SuperMicro power off issues
csamuel at vpac.org
Sun Dec 7 19:33:57 PST 2008
We've been tearing our hair out over this for a little
while, so I'm wondering if anyone else has seen anything
like this before, or has any thoughts about what could be
causing it.
Very occasionally we find one of our Barcelona nodes with
a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
it as powered down too.
No kernel panic, no crash, nothing in the system logs.
Nothing in the IPMI logs either, it's just sitting there
as if someone has yanked the power cable (and we're pretty
sure that's not the cause!).
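
Since the logs give no timestamp for the event, one way to at least pin down when a node drops is to poll the BMC's power state from another machine. The following is a minimal sketch of that idea, not something described in this post: it assumes `ipmitool` is installed on the monitoring host, that the BMCs are reachable over the network, and that the password is supplied via the `IPMI_PASSWORD` environment variable. The host names, user name, and the function names are hypothetical.

```python
import subprocess
import time
from datetime import datetime, timezone


def parse_power_status(output):
    """Return 'on' or 'off' from `ipmitool chassis power status` output, else None."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return "on"
    if text.endswith("is off"):
        return "off"
    return None


def power_status_command(bmc_host, user, interface="lanplus"):
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment variable.
    # The interface is an assumption; older BMCs may need "lan" instead of "lanplus".
    return ["ipmitool", "-I", interface, "-H", bmc_host, "-U", user, "-E",
            "chassis", "power", "status"]


def watch(bmc_hosts, user, interval_s=60):
    """Poll each BMC forever and print a timestamped line when its power state changes."""
    last = {}
    while True:
        for host in bmc_hosts:
            try:
                result = subprocess.run(power_status_command(host, user),
                                        capture_output=True, text=True, timeout=30)
                state = parse_power_status(result.stdout)
            except (subprocess.TimeoutExpired, OSError):
                state = None  # BMC unreachable, or ipmitool not installed
            if state != last.get(host):
                stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
                print(f"{stamp} {host} power state: {last.get(host)} -> {state}", flush=True)
                last[host] = state
        time.sleep(interval_s)
```

Calling `watch(["node01-bmc", "node02-bmc"], "admin")` (hypothetical names) would print one line per node on the first poll and then only on changes, which bounds the time of a power-off to within one polling interval.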
There has not been any discernible pattern to the nodes
affected. Only a couple of nodes have had it happen twice;
the rest have had it happen once, and they are scattered
over the 3 racks of the cluster.
For the longest time we had no way to reproduce it, but then
we noticed that for 3 of the power-offs a particular
user was running Fluent on the node. They've provided us with a copy
of their problem and we can (often) reproduce it now with that
problem. Sometimes it'll take 30 minutes or so, sometimes it'll
take 4-5 hours, sometimes it'll take 3 days or so and sometimes
it won't do it at all.
It doesn't appear to be a thermal issue, as (a) there's nothing in
the IPMI logs about such problems and (b) we inject CPU and system
temperature into Ganglia and we don't see anything out of the
ordinary in those logs. :-(
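
For readers unfamiliar with feeding temperatures into Ganglia, here is a minimal sketch of one way it can be done. It is illustrative only and not the script used on this cluster: it assumes `ipmitool` and Ganglia's `gmetric` are both available on the node, the `ipmi_` metric prefix and function names are hypothetical, and sensor names vary by board.

```python
import subprocess


def parse_temperatures(sensor_output):
    """Extract {sensor name: degrees C} from pipe-separated `ipmitool sensor` output."""
    temps = {}
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or fields[2] != "degrees C":
            continue  # not a temperature sensor
        try:
            temps[fields[0]] = float(fields[1])
        except ValueError:
            continue  # reading is "na" or otherwise unreadable
    return temps


def gmetric_command(sensor_name, celsius):
    # The metric naming scheme (ipmi_<sensor>) is a hypothetical choice.
    metric = "ipmi_" + sensor_name.lower().replace(" ", "_")
    return ["gmetric", f"--name={metric}", f"--value={celsius:.1f}",
            "--type=float", "--units=Celsius"]


def publish_local_temperatures():
    """Read the local BMC sensors and push each temperature to Ganglia."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    for name, celsius in parse_temperatures(out).items():
        subprocess.run(gmetric_command(name, celsius), check=True)
```

Running `publish_local_temperatures()` from cron every minute or so would give a temperature history per node, so a thermal excursion just before a power-off would show up in the graphs even when the IPMI event log is empty.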
We've tried other codes, including HPL and Advanced Clustering's
Breakin PXE version, but haven't (yet) managed to get one of the
nodes to fail with anything except Fluent. :-(
The only oddity about Fluent is that it's the only code on
the system that uses HP-MPI, but we used the command line
switches to tell it to use the Intel MPI it ships with and
it did the same then too!
I just cannot understand what is special about Fluent,
or even how a user code could cause a node to just turn
off without a trace in the logs.
Obviously we're pursuing this through the local vendor
and (through them) SuperMicro, but to be honest we're
all pretty stumped by this.
Does anyone have any bright ideas?
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency