Poor scaling (was Re: Question about custers)
bogdan.costescu at iwr.uni-heidelberg.de
Mon Feb 10 02:47:38 PST 2003
On Fri, 7 Feb 2003, Ken Chase wrote:
> I'm curious, when people see really poor scaling on their clusters (HSI
> or GBE or 100BT, doesn't matter) at like 16 or 32 or more nodes (I'm thinking
> CHARMM and Gromacs here), what do you do with the extra CPU?
Well, this is for the case when you have SMP nodes and run only one
process per node. I agree that this scales better, but for administrative
reasons this is not always the case...
Based on similar reasoning and Moore's law, we have never bought a large
number of nodes at one time for running CHARMM. Instead, we always buy in
multiples of 4, in the range 4-16, sometimes UP, sometimes SMP (the multiple
of 4 is not because it's a power of 2 or somehow related to CHARMM,
but because it's the number of computer cases that fit on one of our
shelves :-)). Because the batches often differ in speed and are sometimes
connected to different switches, it's really not efficient to run one job
across nodes bought in different batches, so the batch size is effectively
the maximum number of CPUs that can be allocated to a job.
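The "one job stays inside one batch" rule above can be sketched as a tiny allocator. This is only an illustration under stated assumptions, not the scheduler setup actually used here: the node names, the batch labels and the `allocate` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: node name -> purchase batch (names are made up).
NODES = {
    "n01": "batch-a", "n02": "batch-a", "n03": "batch-a", "n04": "batch-a",
    "n05": "batch-b", "n06": "batch-b", "n07": "batch-b", "n08": "batch-b",
    "n09": "batch-b", "n10": "batch-b", "n11": "batch-b", "n12": "batch-b",
}

def allocate(requested, free_nodes, inventory=NODES):
    """Return `requested` free nodes that all belong to one batch, or None."""
    by_batch = defaultdict(list)
    for node in sorted(free_nodes):
        by_batch[inventory[node]].append(node)
    # Keep only batches with enough free nodes for the whole job.
    candidates = [nodes for nodes in by_batch.values() if len(nodes) >= requested]
    if not candidates:
        return None  # the job would have to span batches, so refuse it
    # Prefer the smallest batch that fits, leaving larger batches free.
    best = min(candidates, key=len)
    return best[:requested]
```

With this inventory, a request for 4 nodes lands entirely in the 4-node batch, and a request for 10 nodes is refused because no single batch is that large.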
> Just let it float away unused? Do you use it? Do you run other jobs on
> them at the same time?
We usually use the SMP nodes in 2 ways:
- 2 jobs: one parallel and one single-process. As CHARMM still has important
features that do not run in parallel (e.g. normal modes), we run one of
these (usually memory-hungry) jobs alongside a (usually low-memory)
parallel one. This requires SMP nodes with large amounts of memory (>= 1 GB).
- 2 parallel jobs. We have found (by trying, so don't take this as the
definitive answer!) that the total throughput is higher; this however is
true only when the jobs have similar data sizes - if the numbers of atoms
differ by an order of magnitude or more, this is no longer true.
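One way to check the throughput claim for the "2 parallel jobs" case on your own cluster is to compare wall-clock times for the same pair of jobs run back to back versus side by side. The sketch below assumes you have measured those times yourself; the `sharing_speedup` helper is hypothetical and the numbers in the usage note are placeholders, not measurements.

```python
def sharing_speedup(solo_times, shared_times):
    """Compare two ways of finishing the same set of jobs on the same nodes.

    solo_times:   wall time of each job when it has the nodes to itself
                  (one process per node), jobs run one after another.
    shared_times: wall time of each job when all jobs run concurrently
                  on the same nodes (one process per CPU).
    A result above 1.0 means sharing the nodes finishes the set sooner.
    """
    if len(solo_times) != len(shared_times):
        raise ValueError("need one solo and one shared time per job")
    back_to_back = sum(solo_times)    # total time when run sequentially
    side_by_side = max(shared_times)  # the slowest concurrent job sets the total
    return back_to_back / side_by_side
```

For example, with placeholder times of 100 s per job alone and 160 s per job when sharing, `sharing_speedup([100.0, 100.0], [160.0, 160.0])` gives 1.25, i.e. sharing would win despite each job running slower.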
> Do you nice those jobs to 19?
No, we usually let them run at normal priority. At least in the first case
they don't seem to interfere with each other. In the second case, we know
that there will be interference (even in the case of HSI - Myrinet here),
so we just say "That's life" :-)
> Do you see your cache being thrashed by this
How do you quantify cache thrashing?
I've found that CHARMM scales surprisingly well with CPU speed for
classical MD jobs and is not affected much by cache size. This comes only
from job-level timing; for various reasons I've never been able to
include a CPU counter library in the "production" cluster kernels.
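For anyone wanting to repeat this kind of job-level comparison, a minimal sketch follows. It assumes the same job was timed on two machines with known clock frequencies; the `clock_scaling_efficiency` helper is hypothetical, and it says nothing about the cache directly, only how closely run time tracks clock speed.

```python
def clock_scaling_efficiency(time_slow, clock_slow, time_fast, clock_fast):
    """Observed speedup divided by the clock-frequency ratio.

    Times are wall-clock seconds for the same job; clocks are in any
    consistent unit (e.g. GHz). A value near 1.0 means the job speeds up
    in step with the CPU clock; a value well below 1.0 suggests something
    else (memory, cache, network) is limiting it.
    """
    observed_speedup = time_slow / time_fast
    ideal_speedup = clock_fast / clock_slow
    return observed_speedup / ideal_speedup
```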
Until now (P4-Xeon era) I haven't seen a significant speed improvement
when compiling with PGI compilers vs. fort77+f2c+gcc-2.x (which for me
always generated faster code than g77 -O6 and all other optimizations
present in the CHARMM Makefile). I haven't yet had time to test how the
speed compares on Xeons with f2c+gcc-2.x vs. g77-3.x vs. PGI vs. Intel.
The reason I'm saying this: I believe that the speed gain from
gcc-3.x/PGI/Intel compilers would come from better memory access patterns,
which would use the cache more fully, so the effect of cache thrashing
would be more evident - please correct me if I'm talking rubbish.
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De