[Beowulf] Performance degrading
Gus Correa
gus at ldeo.columbia.edu
Tue Dec 15 11:36:51 PST 2009
Hi Jörg,
If you have single quad core nodes as you said,
then top shows that you are oversubscribing the cores.
There are five nwchem processes running.
In my experience, oversubscription works only with relatively
light MPI programs (say, the example programs that come with OpenMPI or
MPICH).
Real-world applications tend to become very inefficient,
and can even hang, on oversubscribed CPUs.
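A quick, non-privileged way to check for oversubscription is to compare the number of nwchem ranks on the node against the core count. A minimal sketch, Linux-specific (it reads /proc), with the process name "nwchem" taken from the top output quoted below:

```python
import os

def count_procs(name):
    """Count running processes whose command name matches `name` (Linux /proc)."""
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited between listdir and open
    return count

cores = os.cpu_count()
ranks = count_procs("nwchem")
if ranks > cores:
    print(f"oversubscribed: {ranks} nwchem ranks on {cores} cores")
else:
    print(f"ok: {ranks} nwchem ranks on {cores} cores")
```

Any user can run this on a compute node; no admin rights are needed.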
What happens when you launch four or fewer processes
on a node instead of five?
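For example, with OpenMPI one way to cap the ranks per node is a hostfile with explicit slots. This is only a sketch: the node names and input file are placeholders, and the exact flags differ between MPI implementations.

```shell
# Hostfile capping MPI ranks at the physical core count per node
# (node names are placeholders)
cat > hostfile <<'EOF'
node01 slots=4
node02 slots=4
EOF

# OpenMPI: launch 8 ranks, at most 4 per quad-core node
mpirun -np 8 --hostfile hostfile nwchem input.nw
```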
My $0.02.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Jörg Saßmannshausen wrote:
> Dear all,
>
> I am scratching my head but apart from getting splinters into my fingers I
> cannot find a good answer for the following problem:
> I am running a DFT program (NWChem) in parallel on our cluster (AMD Opterons,
> single quad cores in the node, 12 GB of RAM, Gigabit network) and at certain
> stages of the run top shows me the following:
>
> top - 15:10:48 up 13 days, 22:20, 1 user, load average: 0.26, 0.24, 0.19
> Tasks: 106 total, 1 running, 105 sleeping, 0 stopped, 0 zombie
> Cpu0 : 8.0% us, 2.7% sy, 0.0% ni, 82.7% id, 0.0% wa, 1.3% hi, 5.3% si
> Cpu1 : 4.1% us, 1.4% sy, 0.0% ni, 94.6% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu2 : 2.7% us, 0.0% sy, 0.0% ni, 97.3% id, 0.0% wa, 0.0% hi, 0.0% si
> Cpu3 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
> Mem: 12250540k total, 5581756k used, 6668784k free, 273396k buffers
> Swap: 16779884k total, 0k used, 16779884k free, 3841688k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 16885 sassy 15 0 3928m 1.7g 1.4g S 4 14.4 312:19.92 nwchem
> 16886 sassy 15 0 3928m 1.7g 1.4g S 4 14.5 313:08.77 nwchem
> 16887 sassy 15 0 3920m 1.7g 1.4g S 3 14.4 316:18.24 nwchem
> 16888 sassy 15 0 3923m 1.6g 1.3g S 3 13.3 316:13.55 nwchem
> 16890 sassy 15 0 2943m 1.7g 1.7g S 3 14.8 104:32.33 nwchem
>
> This does not last just a few seconds; it appears to persist for a prolonged
> period. I checked at random intervals of about a minute, and CPU utilisation
> is well below 50% (most of the time around 20%). I have not noticed this when
> running the job within a single node.
>
> I suspect that the Gigabit network is the problem, but I really
> would like to pinpoint that, so I can get my boss to upgrade to a better
> network for parallel computing (hence my previous question about Open-MX).
> As I am not an admin of the cluster, how would I be able to do that?
>
> Thanks for your comments.
>
> Best wishes from Glasgow!
>
> Jörg
>
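On the question of pinpointing the network without admin rights: any user can read the per-interface byte counters in /proc/net/dev while the job runs. A minimal sketch, Linux-specific; if the Gigabit link sits near its ceiling while the cores idle, the network is the likely bottleneck:

```python
import time

def nic_rates(interval=1.0):
    """Sample /proc/net/dev twice and return (rx, tx) bytes/sec per interface."""
    def snapshot():
        stats = {}
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:  # skip the two header lines
                name, data = line.split(":", 1)
                fields = data.split()
                # field 0 = RX bytes, field 8 = TX bytes
                stats[name.strip()] = (int(fields[0]), int(fields[8]))
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {nic: ((after[nic][0] - before[nic][0]) / interval,
                  (after[nic][1] - before[nic][1]) / interval)
            for nic in after if nic in before}

if __name__ == "__main__":
    for nic, (rx, tx) in nic_rates().items():
        print(f"{nic}: {rx / 1e6:.2f} MB/s in, {tx / 1e6:.2f} MB/s out")
```

Run it on a compute node during the slow phases; no root access is required.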