Poor scaling (was Re: Question about clusters)

Mon Feb 10 02:47:38 PST 2003

On Fri, 7 Feb 2003, Ken Chase wrote:

> Im curious, when people see really poor scaling on their clusters (HSI
> or GBE or 100BT, doesnt matter) at like 16 or 32 or more nodes (Im thinking
> CHARMM and Gromacs here), what do you do with the extra cpu?

Well, this applies when you have SMP nodes and run only one process per
node. I agree that this scales better, but for administrative reasons it
is not always done that way...

Following similar reasoning, and with Moore's law in mind, we have never
bought a large number of nodes at once for running CHARMM. Instead, we
always buy a multiple of 4 in the range 4-16, sometimes UP, sometimes SMP
(the multiple of 4 is not because it's a power of 2 or somehow related to
CHARMM, but because that's the number of computer cases that fit on one
of our shelves :-)). Because the batches often differ in speed and are
sometimes connected to different switches, it's really not efficient to
run one job on nodes bought in different batches, so in this way we limit
the maximum number of CPUs that can be allocated to a job.
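
The "never span batches" rule above can be sketched as follows. This is
only an illustration under stated assumptions: the `Node` record, the
`pick_nodes` helper, the batch labels and the sample inventory are all
hypothetical and do not describe any real scheduler configuration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str    # hypothetical host name
    batch: str   # hypothetical label for the purchase batch
    cpus: int    # 1 for UP nodes, 2 for SMP nodes


def pick_nodes(nodes, cpus_wanted):
    """Return nodes from a single batch covering cpus_wanted CPUs, or None.

    A job never spans batches, so the largest batch caps the job size.
    """
    by_batch = defaultdict(list)
    for node in nodes:
        by_batch[node.batch].append(node)

    for batch in sorted(by_batch):
        chosen, total = [], 0
        for node in by_batch[batch]:
            if total >= cpus_wanted:
                break
            chosen.append(node)
            total += node.cpus
        if total >= cpus_wanted:
            return chosen
    return None  # no single batch is large enough


# Made-up inventory: one batch of 4 SMP nodes, one batch of 8 UP nodes.
inventory = (
    [Node(f"smp{i}", "batch-a", 2) for i in range(4)]
    + [Node(f"up{i}", "batch-b", 1) for i in range(8)]
)
```

With this made-up inventory, a request for 8 CPUs fits in one batch,
while a request for 9 CPUs is refused even though 16 CPUs exist in total.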

> Just let it float away unused? Do you use it? Do you run other jobs on
> them at the same time?

We usually use the SMP nodes in 2 ways:
- 2 jobs: one parallel and one single-CPU. As CHARMM still has important
features that do not run in parallel (e.g. normal modes), we run one of
these (usually large-memory) jobs along with a (usually low-memory)
parallel one. This requires SMP nodes with large amounts of memory (>= 1 GB).
- 2 parallel jobs. We have found (by trying, so don't take this as the
definitive answer!) that the total throughput is higher; however, this is
true only when the jobs have similar data sizes - if the numbers of atoms
differ by an order of magnitude or more, this is not true anymore.
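
The size condition for pairing 2 parallel jobs can be written as a simple
check. This is a sketch only: the function name `similar_size` is
hypothetical, and the cutoff of a factor of 10 in atom count is an
assumed reading of the threshold above, not a measured limit.

```python
def similar_size(atoms_a, atoms_b, max_ratio=10.0):
    """True if the two atom counts differ by less than max_ratio.

    max_ratio=10.0 is an assumption (one order of magnitude).
    """
    if atoms_a <= 0 or atoms_b <= 0:
        raise ValueError("atom counts must be positive")
    big, small = max(atoms_a, atoms_b), min(atoms_a, atoms_b)
    return big / small < max_ratio
```

For example, jobs with 20000 and 50000 atoms would be paired on the same
nodes, while jobs with 2000 and 50000 atoms would not.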

> Do you nice those jobs to 19?

No, we usually let them run at normal priority. At least in the first case 
they don't seem to interfere with each other. In the second case, we know 
that there will be interference (even in the case of HSI - Myrinet here), 
so we just say "That's life" :-)

> Do you see your cache being thrashed by this

How do you quantify cache thrashing?
I've found that CHARMM scales surprisingly well with CPU speed for
classical MD jobs and is not affected much by cache size. This comes only
from job-level timing; for various reasons I've never been able to
include a CPU counter library in the "production" cluster kernels.
Until now (P4-Xeon era) I haven't seen a significant speed improvement
when compiling with PGI compilers vs. fort77+f2c+gcc-2.x (which for me
always generated faster code than g77 -O6 and all the other optimizations
present in the CHARMM Makefile). I haven't yet had time to test how the
speed compares on Xeons with f2c+gcc-2.x vs. g77-3.x vs. PGI vs. Intel
compilers.
Why I'm saying this: I believe that the speed gain from
gcc-3.x/PGI/Intel compilers would come from better memory access patterns
that use the cache more fully, so the effect of cache thrashing
would be more evident - please correct me if I'm talking rubbish.
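
A job-level timing comparison of the kind mentioned above (same job, CPUs
of different clock speeds) can be reduced to one number. This is a rough
sketch: the function name `clock_scaling` and its parameters are
hypothetical, and it says nothing about the cache directly; it only shows
how much of the clock-speed increase appears as wall-clock speedup.

```python
def clock_scaling(time_slow, mhz_slow, time_fast, mhz_fast):
    """Fraction of the clock-speed ratio recovered as wall-clock speedup.

    1.0 means the job got faster in exact proportion to the clock;
    values well below 1.0 hint at a bottleneck other than the CPU core
    (memory, cache, network), without saying which one.
    """
    if min(time_slow, mhz_slow, time_fast, mhz_fast) <= 0:
        raise ValueError("times and clock speeds must be positive")
    speedup = time_slow / time_fast      # observed, from job timings
    clock_ratio = mhz_fast / mhz_slow    # expected if purely CPU-bound
    return speedup / clock_ratio
```

The timings fed to it would be wall-clock times of the same input run on
otherwise comparable nodes; any numbers used to try it out are made up.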

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De