[Beowulf] Cluster doesn't like being moved

Steve Herborn herborn at usna.edu
Tue Mar 10 11:35:39 PDT 2009


I have a small test cluster built on Novell SUSE Linux Enterprise Server 10.2
that is giving me fits.  It seems that every time the hardware is physically
moved (I keep getting kicked out of the space I'm using), I end up with any
number of different problems.

Personally, I suspect some type of hardware issue (this equipment is about 5
years old), but one of my co-workers isn't so sure hardware is in play.  After
an earlier move I was having problems with the RAID initializing, which I
resolved by reseating the RAID controller card.

This time it appears that the file system & configuration databases became
corrupted after moving the equipment. Several services aren't starting up
(LDAP, DHCP, PBS to name a few) and YaST2 hangs any time an attempt is made
to use it, for example when adding a printer or software package. My co-worker
feels the issue may be related to the ReiserFS file system with AMD
processors. ReiserFS was the default file system presented when I
initially installed SLES, so I went with it.

Do you know of any issues with using the ReiserFS file system on AMD-based
systems, or have any other ideas what I may be facing?


Steven A. Herborn
U.S. Naval Academy
Advanced Research Computing
410-293-6480 (Desk)
757-418-0505 (Cell)


  _____  

From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
Behalf Of gossips J
Sent: Monday, March 09, 2009 5:08 AM
To: beowulf at beowulf.org
Subject: [Beowulf] HPCC "intel_mpi" error


Hi,

We are using ICR validation.

We are facing the following problem while running the command below:

cluster-check --debug --include_only intel_mpi /root/sample.xml

The problem is:

The output of Cluster Checker shows that "intel_mpi" FAILED, whereas the
debug.out file shows that "Hello World" is returned from all nodes.

I have a 16-node configuration and we are running 8 proc/node.

The same behavior is observed with 1 proc/node, 2 proc/node, and 4 proc/node
as well. I also tried "rdma" and "rdssm" as the DEVICE in the XML file, but
no luck.
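
One way to cross-check the debug.out observation is to count the greeting
lines and compare the total against the expected number of ranks (16 nodes
x 8 proc/node = 128). The following is only a minimal sketch in Python,
assuming each rank prints exactly one line containing "Hello World"; the
real layout of debug.out is not shown in this message, so the matching
pattern and the function names below are hypothetical.

```python
import re

# Values taken from the message: 16 nodes, 8 processes per node.
NODES = 16
PROCS_PER_NODE = 8

# Assumption: every rank writes one line containing "Hello World".
GREETING = re.compile(r"hello world", re.IGNORECASE)


def count_hello_lines(text: str) -> int:
    """Count the lines of `text` that contain the greeting."""
    return sum(1 for line in text.splitlines() if GREETING.search(line))


def check_debug_output(text: str,
                       nodes: int = NODES,
                       procs_per_node: int = PROCS_PER_NODE) -> bool:
    """Return True if the greeting count equals nodes * procs_per_node."""
    expected = nodes * procs_per_node
    found = count_hello_lines(text)
    print(f"expected {expected} greetings, found {found}")
    return found == expected
```

To use it, read the file contents (for example with
open("debug.out").read()) and pass the text to check_debug_output(). A
count below the expected total would suggest that some ranks never
reported, even though a visual scan of the file looks complete.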

If anyone can shed some light on this issue, it would be a great help.

Another thing I would like to know is:

Is there a way to specify the "-env RDMA_TRANSLATION_CACHE" option with Intel
Cluster Checker?

Awaiting your kind response,

Thanks in advance,
Polk.