[Beowulf] Cluster doesn't like being moved
Bart Jennings
bart at attglobal.net
Tue Mar 10 12:00:27 PDT 2009
Just some thoughts... Since you are physically moving the machines,
things like loose cards, processors, heat sinks/fans, memory, and
cables come to mind. I've personally had loose heat sinks cause
processors to do funky things (software crashes, corruption, etc.).
I've also heard of disk heads hitting the platters while drives were
being moved, which led to data loss. Have you tried running a full
file system check? I think most modern disks park and lock the
armatures automatically now, but the disk/RAID device might still have
software to do this for you. Other problem sources might include weird
environmental ones, like excessive heat or magnetic fields playing
havoc with the hardware during the move.
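If you do go the file system check route, here's a rough sketch
(assuming the root volume is something like /dev/sda2 -- substitute
your actual device): boot the node from the SLES install/rescue media
so the volume is unmounted, then run

    reiserfsck --check /dev/sda2

and only reach for --fix-fixable or --rebuild-tree if --check actually
reports corruption. Pointing smartmontools at each drive, e.g.

    smartctl -a /dev/sda

wouldn't hurt either, just to see whether the disks themselves started
logging errors after the move.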
Good luck figuring it out.
Bart
Steve Herborn wrote:
>
> I have a small test cluster built on Novell SUSE Linux Enterprise
> Server (SLES) 10.2 that is giving me fits. It seems that every time
> the hardware is physically moved (I keep getting kicked out of the
> space I'm using), I end up with any number of different problems.
>
> Personally I suspect some type of hardware issue (this equipment is
> about 5 years old), but one of my co-workers isn't so sure hardware
> is in play. After one earlier move I had problems with the RAID
> initializing, which I resolved a while back by reseating the RAID
> controller card.
>
> This time it appears that the file system & configuration databases
> became corrupted after moving the equipment. Several services aren't
> starting up (LDAP, DHCP, and PBS, to name a few) and YaST2 hangs any
> time an attempt is made to use it, for example to add a printer or a
> software package. My co-worker feels the issue may be related to
> using the ReiserFS file system with AMD processors. ReiserFS was the
> default presented when I initially installed SLES, so I went with it.
>
> Do you know of any issues with using the ReiserFS file system on
> AMD-based systems, or have any other ideas about what I may be facing?
>
>
>
> Steven A. Herborn
>
> U.S. Naval Academy
>
> Advanced Research Computing
>
> 410-293-6480 (Desk)
>
> 757-418-0505 (Cell)
>
>
>
> ------------------------------------------------------------------------
> From: beowulf-bounces at beowulf.org
> [mailto:beowulf-bounces at beowulf.org] On Behalf Of gossips J
> Sent: Monday, March 09, 2009 5:08 AM
> To: beowulf at beowulf.org
> Subject: [Beowulf] HPCC "intel_mpi" error
>
> Hi,
>
> We are using ICR validation.
>
> We are facing the following problem when running the command below:
>
> cluster-check --debug --include_only intel_mpi /root/sample.xml
>
>
> Problem is:
>
> The output of the cluster checker shows that "intel_mpi" FAILED,
> whereas looking into the debug.out file shows that "Hello World" is
> returned from all nodes.
>
>
> I have a 16-node configuration and we are running 8 processes per node.
>
> The above behavior is also observed with 1, 2, and 4 processes per
> node. I also tried "rdma" and "rdssm" as the DEVICE in the XML file,
> but no luck.
>
> If anyone can shed some light on this issue, it would be a great help.
>
>
> Another thing I would like to know is:
>
> Is there a way to specify the "-env RDMA_TRANSLATION_CACHE" option
> with Intel Cluster Checker?
> Awaiting your kind response,
>
> Thanks in advance,
> Polk.
> ------------------------------------------------------------------------
>
>