[Beowulf] Odd SuperMicro power off issues
stephen mulcahy
smulcahy at aplpi.com
Mon Dec 8 04:59:00 PST 2008
Chris Samuel wrote:
> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
Hi Chris,
We had a similar experience with one of our compute nodes - intermittent
power-offs when running our model and absolutely nothing in the logs. I
modified Ganglia to track voltage and temperature in an effort to see if
anything unusual happened to those beforehand, but there were no
discernible trends.
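For anyone wanting to try the same kind of tracking, here is a rough
sketch. It is not the actual Ganglia modification mentioned above; it
assumes the readings come from `ipmitool sensor` (pipe-separated
columns: name, value, units, ...) and are pushed with Ganglia's
`gmetric` tool. The function names and the `ipmi_` metric prefix are
made up for illustration.

```python
import subprocess


def parse_ipmi_sensors(text):
    """Extract temperature and voltage readings from `ipmitool sensor`-style text.

    Assumes pipe-separated columns: name | value | units | status | ...
    Returns a dict mapping sensor name -> (value, units).
    """
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, raw, units = fields[0], fields[1], fields[2]
        try:
            value = float(raw)
        except ValueError:
            continue  # skip "na" and discrete (e.g. "0x0") readings
        if units in ("degrees C", "Volts"):
            readings[name] = (value, units)
    return readings


def gmetric_args(name, value, units):
    """Build a gmetric command line for one reading (hypothetical metric naming)."""
    metric = "ipmi_" + name.lower().replace(" ", "_")
    return ["gmetric", "--name", metric, "--value", str(value),
            "--type", "float", "--units", units]


def publish():
    """Read the sensors once and push each reading to Ganglia.

    Needs ipmitool and gmetric installed on the node; run it from cron
    or a loop to build up a history.
    """
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    for name, (value, units) in parse_ipmi_sensors(out).items():
        subprocess.run(gmetric_args(name, value, units), check=True)
```

Calling `publish()` every minute or so should give graphs that can be
lined up against the time of a power-off, assuming the sensor names on
your boards are reported in the same format.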
I ran memtest86+ a number of times on the problem node and neither it
nor mcelog showed any problems.
Subsequent to that, I found a BIOS upgrade for those systems which
included an Opteron microcode update to fix an AMD processor erratum
(sp?) - I can dig out the details if the specific problem is of interest.
Around the same time, we finally started to see memory errors, so we
also replaced the bad memory in the system.
Unfortunately I can't tell you which was responsible for fixing the
problem. My understanding is that Fluent is quite memory and I/O
intensive - do you run other equally intensive models without seeing the
failure?
Anyways, in summary - if you're totally stumped - try swapping out the
memory and/or rolling to the latest firmware and see if that improves
the stability.
-stephen
--
Stephen Mulcahy Applepie Solutions Ltd. http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)