BEOWULF cluster hangs

Michael Prinkey mprinkey at aeolusresearch.com
Thu Sep 26 08:23:13 PDT 2002


There are many problems with the virtual memory manager in the early 
2.4 series of kernels.  These have mostly been fixed in the later 2.4 
releases.  I recommend trying 2.4.19 to see whether it fixes the problem.
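
To see which nodes still run a kernel older than 2.4.19, the release 
strings can be compared numerically.  This is only a minimal sketch in 
Python: the helper names parse_kernel_version and is_older_than are 
made up for illustration, and it assumes the release string begins 
with three dot-separated numbers, as 2.4.7-10 does.

```python
import re


def parse_kernel_version(release):
    """Return the leading 'X.Y.Z' of a kernel release string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_older_than(release, minimum=(2, 4, 19)):
    """True if the release is numerically older than the given minimum version."""
    return parse_kernel_version(release) < minimum


# The release string below is the one named in the quoted message.
print(is_older_than("2.4.7-10"))  # True
```

On a live node the string could be taken from platform.release() or 
from the output of uname -r.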

Mike Prinkey
Aeolus Research, Inc.


G.de-With wrote:

> Hello
>
> We have had a Linux Beowulf cluster for a month now. The cluster 
> contains 7 dual-processor 2 GHz P4 computers with 8 GB of RAM per 
> machine. For our network we use Gigabit Ethernet.
>
> The problem we have with our cluster is as follows.
> When running large computational fluid simulations, the simulation 
> starts to slow down. At some point the response of the computer is so 
> poor that we have to kill the simulation. In the worst case, when a 
> simulation was running overnight, the computer hung and had to be 
> reset.
> Even when we manage to kill the simulation in time, the computer 
> remains very slow and a reboot is necessary to regain full 
> performance.
>
> My first suspicion was that the computer had simply started swapping, 
> but no swap space is being used. Instead, some RAM is still free 
> (between 5 and 10%), and of the roughly 90% of RAM that is in use, 
> half is cached and the other half is actually in use. The CPU usage 
> is also very low.
>
> It may be useful to mention that this problem occurs with both 
> sequential and parallel simulations.
>  
>
> On our cluster we are running RH7.2 with Linux kernel version 
> 2.4.7-10. We set up our cluster using oscar-1.2.1rh72. The /home 
> partition on the world client is shared over the network using NFS.
>
> /etc/fstab
>
> 192.168.1.100:/home /home nfs rw 0 2
>  
>  
>
> 1) If anyone can suggest why our computers are slowing down or 
> hanging, or has had a similar experience, please let me know.
> 2) To my understanding, the most important indicators of computer 
> usage are:
> - memory usage
> - CPU usage
> Are there other key components or indicators that could account for a 
> reduction in computer performance, and if so, how can I see their 
> status?
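
As a rough sketch of how the memory and swap indicators mentioned here 
can be read programmatically, the Python below parses text in the 
"Name: value kB" format of /proc/meminfo.  The function names and the 
sample numbers are invented for illustration (they are not 
measurements from this cluster), and the percentages are simple ratios 
against MemTotal.

```python
def parse_meminfo(text):
    """Map each 'Name: <number> ...' line of /proc/meminfo text to its number."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Header lines without a leading number (as in the 2.4 summary table)
        # are skipped.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Return free and cached memory as a percentage of RAM, plus swap used in kB."""
    total = fields["MemTotal"]
    return {
        "free_pct": 100.0 * fields["MemFree"] / total,
        "cached_pct": 100.0 * fields.get("Cached", 0) / total,
        "swap_used_kb": fields.get("SwapTotal", 0) - fields.get("SwapFree", 0),
    }


# Illustrative sample only; on a node, use open("/proc/meminfo").read() instead.
SAMPLE = """MemTotal:  8000000 kB
MemFree:    600000 kB
Cached:    3700000 kB
SwapTotal: 2000000 kB
SwapFree:  2000000 kB
"""

summary = summarize(parse_meminfo(SAMPLE))
print(summary["free_pct"], summary["swap_used_kb"])  # 7.5 0
```

Sampling this at intervals during a run would show whether free memory 
and cache drift while swap stays unused.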
>
> Govert
>  
>
>-- 
> ------------------------------------------------------------
>| Dr. Govert de With     Research Fellow                     |
>| Fluid Mechanics Research Group                             |
>| University of Hertfordshire                                |
>| Tel: 01707 284124 Fax: 01707 285086                        |
> ------------------------------------------------------------
>| The horizon of many people is a circle of radius zero,     |
>| and they call it their point of view.                      |
> ------------------------------------------------------------
>
>  






