[Beowulf] Odd SuperMicro power off issues
Don Holmgren
djholm at fnal.gov
Mon Dec 8 11:05:19 PST 2008
Hi Chris -
We've had similar problems on two different clusters using Barcelonas with two
different motherboards.
Our new cluster uses SuperMicro TwinU's (two H8DMT-INF+ motherboards in each)
and was delivered in early November. Out of the roughly 590 motherboards, we
had maybe 20 that powered down under load. Like yours, IPMI was still working,
and so we could power these up remotely.
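In case it's useful, recovery has just been an ordinary IPMI power-on over the
network; something along these lines (the BMC hostname and credentials are
placeholders, not our real ones):
    # check chassis state, look for anything in the event log, then power back on
    ipmitool -I lanplus -H nodeNN-bmc -U ADMIN -P password chassis power status
    ipmitool -I lanplus -H nodeNN-bmc -U ADMIN -P password sel list
    ipmitool -I lanplus -H nodeNN-bmc -U ADMIN -P password chassis power on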
For nearly all of these, swapping memory fixed the problem. For systems where
multiple memory swaps did not fix the problem, the vendor swapped motherboards.
I do not believe we've had to swap a power supply yet for this.
On an older, smaller cluster, which uses Asus KFSN4-DRE motherboards, the
incidence rate has been much higher - 20% or so - and swapping memory has not
fixed the problem. On some of the systems, slowing the memory clock fixes it,
but of course at the cost of computational throughput. We are still working
with the vendor to repair the problem nodes; for now, we are scheduling only 6
of the 8 available cores. For the job mix on that cluster, this has been a
workable temporary fix for most of the power-off issues.
As in your case, many of the codes that our users run do not cause a problem. On the
Asus-based cluster, a computational cosmology code will trigger the power
shutdowns. The best torture code that we've found has been xhpl (linpack) built
using a threaded version of libgoto; when this is executed on a single dual
Barcelona node with "-np 8", each of the 8 MPI processes spawns 8 threads.
This particular binary will cause our bad nodes to power off very quickly
(you are welcome to a copy of the binary - just let me know).
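For reference, a launch of that test looks roughly like the following (the
thread-count variable and paths are from memory, so treat this as a sketch
rather than our exact script):
    # 8 MPI ranks on one node, each spawning 8 GotoBLAS threads: 64 threads on 8 cores
    export GOTO_NUM_THREADS=8
    mpirun -np 8 ./xhpl
Nothing special about the launch itself; the load it generates is just the
heaviest we've found.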
The power draw from our Barcelona systems is very strongly dependent on the
code. The power draw difference between the xhpl binary mentioned above and the
typical Lattice QCD codes we run is at least 25%. Because of this, we've always
suspected thermal or power issues, but the vendor of our Asus-based cluster has
done the obvious things to check both (e.g., using active coolers on the CPUs,
using larger power supplies, and so forth) and hasn't had any luck. Also, the
fact that swapping memory on our SuperMicro systems helps without affecting
computational performance probably means that it is not a thermal issue on the
CPUs.
Don Holmgren
Fermilab
On Mon, 8 Dec 2008, Chris Samuel wrote:
> Hi folks,
>
> We've been tearing our hair out over this for a little
> while and so I'm wondering if anyone else has seen anything
> like this before, or has any thoughts about what could be
> happening?
>
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
>
> No kernel panic, no crash, nothing in the system logs.
>
> Nothing in the IPMI logs either; it's just sitting there
> as if someone has yanked the power cable (and we're pretty
> sure that's not the cause!).
>
> There hasn't been any discernible pattern to the nodes
> affected; only a couple of nodes have had it happen twice,
> and the rest have had it happen just once, scattered
> over the 3 racks of the cluster.
>
> For the longest time we had no way to reproduce it, but then
> we noticed that for 3 of the power-offs there was a particular
> user running Fluent on the node. They've provided us with a copy
> of their problem case and we can now (often) reproduce the failure
> with it. Sometimes it'll take 30 minutes or so, sometimes it'll
> take 4-5 hours, sometimes it'll take 3 days or so, and sometimes
> it won't happen at all.
>
> It doesn't appear to be a thermal issue, as (a) there's nothing in
> the IPMI logs about such problems and (b) we inject CPU and system
> temperature readings into Ganglia and we don't see anything out of the
> ordinary in those logs. :-(
>
> We've tried other codes, including HPL, and Advanced Clustering's
> Breakin PXE version, but haven't yet managed to get one of the
> nodes to fail with anything except Fluent. :-(
>
> The only oddity about Fluent is that it's the only code on
> the system that uses HP-MPI, but we used the command-line
> switches to tell it to use the Intel MPI it ships with and
> it did the same thing then too!
>
> I just cannot understand what is special about Fluent,
> or even how a user code could cause a node to just turn
> off without a trace in the logs.
>
> Obviously we're pursuing this through the local vendor
> and (through them) SuperMicro, but to be honest we're
> all pretty stumped by this.
>
> Does anyone have any bright ideas?
>
> cheers,
> Chris