BEOWULF cluster hangs
Michael Prinkey
mprinkey at aeolusresearch.com
Thu Sep 26 08:23:13 PDT 2002
There are many problems with the virtual memory manager in the 2.4
series of kernels. Most of them have been fixed in the later 2.4
releases. I recommend trying 2.4.19 to see if it fixes the problem.
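
To check which nodes are still on an older kernel, something like the following could be run on each machine. This is only a minimal sketch: the function names are made up for illustration, and the 2.4.19 threshold is simply the version suggested above, not a guaranteed cut-off for the VM problems.

```python
import platform
import re


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a release string such as '2.4.7-10'.

    A missing patch number is treated as 0; anything after the numeric
    part (for example the '-10' vendor suffix) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % (release,))
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_older_than(release, minimum=(2, 4, 19)):
    """Return True if the release is older than the given minimum version."""
    return parse_kernel_version(release) < minimum


def running_kernel_is_older(minimum=(2, 4, 19)):
    """Apply the same check to the kernel this script is running on."""
    return is_older_than(platform.release(), minimum)
```

For the kernel mentioned in the quoted message, `is_older_than("2.4.7-10")` is true, so that node would be a candidate for the upgrade.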
Mike Prinkey
Aeolus Research, Inc.
G.de-With wrote:
> Hello
>
> For the past month we have had a Linux Beowulf cluster. It contains 7
> dual-processor 2 GHz P4 machines with 8 GB of RAM each. For our
> network we use Gigabit Ethernet.
>
> The problem we have with our cluster is as follows.
> When we run large computational fluid simulations, the simulation
> starts to slow down. At some point the response of the computer is so
> poor that we have to kill the simulation. In the worst case, when a
> simulation runs overnight, the computer hangs and has to be reset.
> Even when we manage to kill the simulation in time, the computer
> remains very slow and a reboot is necessary to regain full
> performance.
>
> My first suspicion was that the computer had simply started swapping,
> but no swap space is being used. Instead, some RAM is still free
> (between 5% and 10%) and only 90% of the RAM is in use, of which 50%
> is cached and the other 50% is in active use. In addition, the CPU
> usage is very low.
>
> It may be useful to mention that this problem occurs with both
> sequential and parallel simulations.
>
>
> On our cluster we are running RH7.2 with Linux kernel version
> 2.4.7-10. We set up our cluster using oscar-1.2.1rh72. The /home
> partition on the world client is shared over the network using NFS.
>
> /etc/fstab
>
> 192.168.1.100:/home /home nfs rw 0 2
>
>
>
> 1) If anyone has suggestions as to why our computers are slowing
> down or hanging, or has had a similar experience, please let me know.
> 2) To my understanding, the most important indicators of computer
> usage are:
> - memory usage
> - cpu usage
> Are there other key components or indicators that could lead to a
> reduction in computer performance, and if so, how can I see their
> status?
>
> Govert
>
>
>--
> ------------------------------------------------------------
>| Dr. Govert de With Research Fellow |
>| Fluid Mechanics Research Group |
>| University of Hertfordshire |
>| Tel: 01707 284124 Fax: 01707 285086 |
> ------------------------------------------------------------
>| The horizon of many people is a circle with radius zero |
>| and they call it their point of view. |
> ------------------------------------------------------------
>
>
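
On the second question in the quoted message (which indicators to watch besides memory and CPU usage), the kernel exposes memory, swap and load figures in /proc/meminfo and /proc/loadavg. The sketch below shows one way to sample them from a script. The function names are made up for illustration, and it assumes the usual "Name: value kB" lines with the MemTotal, MemFree, Cached, SwapTotal and SwapFree fields; lines in any other layout are skipped or ignored.

```python
def parse_meminfo(text):
    """Return {field: number} for every 'Name: <number> ...' line.

    Lines whose first value is not a plain integer (for example a
    column-header line) are skipped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Reduce the parsed fields to a few percentages that are easy to log."""
    total = info["MemTotal"]
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "free_pct": 100.0 * info["MemFree"] / total,
        "cached_pct": 100.0 * info.get("Cached", 0) / total,
        "swap_used_kb": swap_used,
    }


def read_snapshot():
    """Sample memory and the 1-minute load average on the local machine."""
    with open("/proc/meminfo") as f:
        summary = memory_summary(parse_meminfo(f.read()))
    with open("/proc/loadavg") as f:
        summary["load_1min"] = float(f.read().split()[0])
    return summary
```

Calling `read_snapshot()` at intervals while a simulation runs, and logging the result, would show whether free memory, cache size or load changes as the slowdown sets in.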