[Beowulf] Tips for diagnosing intermittent problems on a small cluster

Andrew M.A. Cater amacater at galactic.demon.co.uk
Sun Nov 25 03:27:21 PST 2007


On Thu, Nov 22, 2007 at 01:53:04PM +0100, Jürgen Kabelitz wrote:
> 
> Hi,
> 
> We had the same problems with a cluster of 40 nodes. The motherboard has problems under heavy I/O. We have some test programs that use only the CPU and do little or no I/O. These programs run and run without trouble. But with a program like Gaussian, which does heavy I/O, this can happen.
> In the end we replaced the motherboard with the S2882.
> J. Kabelitz
> 
> 
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of stephen mulcahy
> Sent: Wednesday, 21 November 2007 18:28
> To: beowulf at beowulf.org
> Subject: [Beowulf] Tips for diagnosing intermittent problems on a small cluster
> 
> Hi,
> 
> As I mentioned in my previous posting, the 20 node Tyan S2891 Dual
> Opteron dual core Debian cluster (1 NFS providing head node, 19 diskless
> compute nodes) is currently experiencing 2 intermittent problems which
> I'm trying to diagnose.
> 
> After a few days of testing and digging through system logs I'm pretty
> much stumped as to what may be causing these. There are 2 separate
> problems - anyone's opinions on how to go about diagnosing these problems
> or things I might have missed would be most welcome.
> 
> Problem #1
> Over the last 6 months, 3 different nodes have been found in a powered
> down state - the nodes seem to have powered off during a run of the
> model. 

Same here on a single machine with an earlier model Tyan board - it 
happened to us either after a very occasional kernel panic/exception or 
after 25-28 days of continuous running. I've got a 2885 here, if I can 
just find two Opterons, memory and a case :-) I'll let you know if this 
one does it too. 
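
One way to tell those two cases apart after the fact (a panic versus 
something that only shows up after weeks of uptime) is to have each node 
record its uptime periodically to storage that survives the failure. A 
minimal sketch, assuming Python 3 is available on the nodes; the log path, 
the function names and the idea of running it from cron every few minutes 
are illustrative assumptions, not part of the setup described here:

```python
import os
import socket
import time
from pathlib import Path

# Hypothetical location; on diskless nodes this would need to sit on the
# NFS export (or be shipped to a remote syslog) so it survives a power-off.
LOG_PATH = "/var/log/node-heartbeat.log"


def parse_uptime(text: str) -> float:
    """Return seconds since boot from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: uptime and idle time, in seconds.
    return float(text.split()[0])


def heartbeat_line(host: str, uptime_s: float, now: float) -> str:
    """Format one log line with a UTC timestamp and uptime in days."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return f"{stamp} UTC {host} up {uptime_s / 86400.0:.2f} days"


if __name__ == "__main__":
    uptime = parse_uptime(Path("/proc/uptime").read_text())
    line = heartbeat_line(socket.gethostname(), uptime, time.time())
    with open(LOG_PATH, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # push the line out before the next failure
```

The last line written before a node goes quiet then shows how long it had 
been up, which should make a 25-28 day pattern stand out if there is one.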

There _may_ be some PSU involvement with ours: the machine and fans are 
running but it is not accepting connections. You have to disconnect the 
power for a few minutes for it to even boot again properly. Power-cycling 
from the front panel doesn't always work.

Debian etch, stock Debian kernel (2.6.18-5 from memory).

Andy




