[Beowulf] Tips for diagnosing intermittent problems on a small cluster

Wed Nov 21 09:27:57 PST 2007

Hi,

As I mentioned in my previous posting, our 20-node Tyan S2891 
dual-Opteron, dual-core Debian cluster (1 head node providing NFS, 19 
diskless compute nodes) is currently experiencing 2 intermittent 
problems which I'm trying to diagnose.

After a few days of testing and digging through system logs I'm pretty 
much stumped as to what may be causing these. There are 2 separate 
problems - anyone's opinions on how to go about diagnosing them, or on 
things I might have missed, would be most welcome.

Problem #1
Over the last 6 months, 3 different nodes have been found in a powered 
down state - the nodes seem to have powered off during a run of the 
model. There are no interesting messages in the system logs coinciding 
with the time of these shutdowns. My first suspect was the power supply 
to the cluster, but the UPS power system has logged no errors coinciding 
with these failures. I've run a bunch of stress testers on the systems 
that failed, including cpuburn and cpustress, in the hope that a failing 
component such as a PSU or processor would be triggered again -- but all 
the systems happily ran 24 hours of tests without any problems.

2 of the 3 failing systems are logging some MCE messages - but they seem 
to be standard memory errors which are being corrected by the system. 
Any suggestions on where to go next?
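
One way to check whether the MCE messages line up with the nodes that 
powered off is to tally them per host from collected syslog output. The 
following is only a rough sketch: it assumes traditional syslog lines of 
the form "Mon DD HH:MM:SS host tag: message" and treats any message 
containing "machine check" or the word "mce" as MCE-related. Both the 
line format and the function name mce_counts are assumptions made for 
illustration, not something taken from the cluster's actual logs.

```python
import re
from collections import Counter

# Assumption: classic syslog layout, "Mon DD HH:MM:SS host tag: message".
SYSLOG_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+(.*)$")

# Assumption: MCE-related lines mention "machine check" or the word "mce".
MCE_RE = re.compile(r"machine check|\bmce\b", re.IGNORECASE)


def mce_counts(lines):
    """Return a Counter mapping hostname -> number of MCE-looking lines."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match is None:
            continue  # skip anything that is not a syslog-style line
        host, message = match.group(2), match.group(3)
        if MCE_RE.search(message):
            counts[host] += 1
    return counts
```

Passing an open log file (or the concatenated logs of all nodes) to 
mce_counts() and printing counts.most_common() would show which nodes 
log corrected errors most often, which can then be compared against the 
list of nodes that powered off.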

Problem #2
On 2 occasions over the last 6 months one of the 2 oceanographic models 
we run on this cluster (ROMS, the other being SWAN) has gone into a 
state where it runs significantly slower than usual. This seems to 
have been preceded by us running the other model, but we can't 
reproducibly get the system into this state. Looking at various process 
stats, when the model is in the slowed-down state it goes from 
about 30% system cpu time, 60% user cpu time to about 60% system cpu 
time and 30% user cpu time. Again, nothing unusual in the logs, nor in 
the gigabit switch logs. A quick strace of one of the running model 
processes didn't show anything significantly unusual (although I don't 
normally sit there watching straces of the model during normal 
operation so I could well have missed all sorts of things here).
Again, any suggestions on where to go next on this would be welcome. I'm 
wondering if I'm seeing some strange kernel-level or MPI-level problem 
which only manifests under certain conditions, but I can't even guess at 
this stage what those conditions might be.
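
Since the slowdown shows up as a shift from roughly 30% system / 60% 
user to 60% system / 30% user, one option is to record that split 
periodically on each compute node so the moment of the shift can be 
lined up with whatever else was running. The sketch below computes the 
split from two readings of the aggregate "cpu" line in /proc/stat. The 
function names are hypothetical, and counting irq and softirq time as 
"system" is a choice made here rather than anything the tools used 
above are known to do.

```python
import time

# Order of the first seven counters on the aggregate "cpu" line.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_split(before, after):
    """Return percent user/system/idle/iowait time between two samples."""
    delta = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {
        "user": 100.0 * (delta["user"] + delta["nice"]) / total,
        # Assumption: interrupt handling is counted as system time here.
        "system": 100.0 * (delta["system"] + delta["irq"]
                           + delta["softirq"]) / total,
        "idle": 100.0 * delta["idle"] / total,
        "iowait": 100.0 * delta["iowait"] / total,
    }


def sample(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return the split."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return cpu_split(before, after)
```

Calling sample() from a small loop and appending the result with a 
timestamp to a per-node file would give a history to compare against 
the job start times for ROMS and SWAN.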

Thanks,

-stephen

-- 
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland.  +353.91.751262  http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)