[Beowulf] single machine with 500 GB of RAM

Wed Jan 9 11:25:57 PST 2013

On 01/09/2013 12:00 PM, Andrew Holway wrote:
> As it's a single thread, I doubt that faster memory is going to help you much. It's going to suck whatever you do.
>
Hi

I don't know anything about computational chemistry
or its grid/mesh requirements,
but if you look at the makefile, you'll see that
the code is compiled with OpenMP:

CCFLAGS = -O3 -Wno-deprecated -ffor-scope -fopenmp
LFLAGS  = -O3 -fopenmp -static

Hence, it should work multi-threaded.
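
A quick way to confirm that -fopenmp really took effect is to check
the _OPENMP macro and count the threads in a parallel region. This is
only a minimal sketch of mine, not something from the dgrid sources;
the function names are made up for illustration.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>   // only usable when the compiler is invoked with -fopenmp
#endif

// True when the compiler defined _OPENMP, i.e. OpenMP support was enabled.
bool built_with_openmp() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Number of threads a parallel region actually used,
// or 1 if the file was compiled without -fopenmp.
int threads_in_parallel_region() {
    int count = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // One thread records the team size; the others wait at the
        // implicit barrier that ends the single construct.
        #pragma omp single
        count = omp_get_num_threads();
    }
#endif
    assert(count >= 1);  // a team always has at least one thread
    return count;
}
```

Calling both functions from main() and printing the results should show
more than one thread on a multicore box when built with -fopenmp, and
exactly one when built without it.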

Whether the use of OpenMP in the code leads to good scaling
is a different story.
There are just a handful of OpenMP pragmas, restricted
to three source files:

basin_eval-4.6.cxx:    #pragma omp parallel
basin_eval-4.6.cxx:    #pragma omp for
basin_eval-4.6.cxx:    #pragma omp parallel
basin_eval-4.6.cxx:    #pragma omp for
compute_wf-4.6.cxx:    #pragma omp parallel
compute_wf-4.6.cxx:    #pragma omp for
grid_util-4.6.cxx:    #pragma omp parallel
grid_util-4.6.cxx:        #pragma omp for private (ijk, position)

Who knows, if these cover the algorithm's core/expensive loops,
that may be enough.
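
For readers unfamiliar with the pattern, the "parallel" / "for private"
pair seen in grid_util-4.6.cxx typically wraps a loop over grid points.
The sketch below is my own illustration of that pattern, not the dgrid
code: fill_grid, evaluate_at and the unit-cube grid are hypothetical,
and only the names ijk and position are borrowed from the pragma above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an expensive per-grid-point evaluation.
static double evaluate_at(double x, double y, double z) {
    return x * x + y * y + z * z;
}

// Fills a flattened n*n*n grid on the unit cube (illustrative only).
std::vector<double> fill_grid(int n) {
    assert(n >= 2);
    const long total = static_cast<long>(n) * n * n;
    const double step = 1.0 / (n - 1);
    std::vector<double> values(total);
    long ijk;              // flattened grid index
    double position[3];    // scratch coordinates of the current point

    #pragma omp parallel
    {
        // Each thread gets its own ijk and position. values is shared,
        // but every iteration writes a distinct element, so no race.
        #pragma omp for private(ijk, position)
        for (ijk = 0; ijk < total; ++ijk) {
            position[0] = (ijk / (static_cast<long>(n) * n)) * step;
            position[1] = ((ijk / n) % n) * step;
            position[2] = (ijk % n) * step;
            values[ijk] = evaluate_at(position[0], position[1], position[2]);
        }
    }
    return values;
}
```

Without -fopenmp the pragmas are ignored and the loop simply runs
serially, which is why the same source can serve as its own
single-thread baseline.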

It may be worth testing the scaling on any multicore machine
before buying a bigger one.
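
To show what such a scaling test measures, here is a rough sketch that
times a loop with 1, 2, 4, ... threads and reports the speedup over one
thread. The kernel is a made-up compute-bound placeholder, so it says
nothing about how dgrid itself (which may well be memory-bound) will
behave; all function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical kernel standing in for the real workload.
double run_kernel(long n, int threads) {
    double sum = 0.0;
#ifdef _OPENMP
    omp_set_num_threads(threads);  // team size for the next parallel region
#else
    (void)threads;                 // serial build: thread count is ignored
#endif
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += std::sqrt(static_cast<double>(i));
    }
    return sum;
}

// Wall-clock seconds for one kernel run with the given thread count.
double time_kernel(long n, int threads) {
    const auto start = std::chrono::steady_clock::now();
    volatile double sink = run_kernel(n, threads);  // keep the result alive
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup relative to one thread for 1, 2, 4, ... up to max_threads.
std::vector<double> measure_speedup(long n, int max_threads) {
    assert(n > 0 && max_threads >= 1);
    const double base = time_kernel(n, 1);
    std::vector<double> speedup;
    for (int t = 1; t <= max_threads; t *= 2) {
        speedup.push_back(base / time_kernel(n, t));
    }
    return speedup;
}
```

In practice the more telling test is to run the real binary on a
representative input with the standard OMP_NUM_THREADS environment
variable set to 1, 2, 4 and so on, and compare the wall-clock times.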

Also, the code seems to be all C++ (irregular grid/adaptive mesh
folks seem to love it).
I couldn't find any good ol' Fortran or bona fide C.

I hope this helps,
Gus Correa

> On 9 Jan 2013 at 17:29, Jörg Saßmannshausen<j.sassmannshausen at ucl.ac.uk> wrote:
>
>> Dear all,
>>
>> many thanks for the quick reply and all the suggestions.
>>
>> The code we want to use is that one here:
>>
>> http://www.cpfs.mpg.de/~kohout/dgrid.html
>>
>> Feel free to download and dig into the code. I am no expert in Fortran so I
>> won't be able to help you much if you have specific questions about the code :-(
>> However, my understanding is that it will only run on one core/thread.
>>
>> As for the budget: That is where it is getting a bit tricky. The ceiling is
>> 10k GBP. I know that machines with less memory, say 256 GB, are cheaper, so
>> one solution would be to get two of the beasts so we can do two calculations at
>> the same time. If there are enough slots free, we could upgrade to 500 GB once
>> we got another pot of money.
>>
>> I guess I would go for DDR3, simply as it is faster. Waiting 2 weeks for a
>> calculation is no fun, so if we can save a bit of time here (faster RAM) we
>> actually gain quite a bit.
>>
>> I am not convinced by the AMD Bulldozer, to be honest. From what I understand,
>> the Sandy Bridge has the faster memory access (higher bandwidth). Is that
>> correct, or am I missing something here?
>>
>> I gather that the idea of just using one CPU is not a good one. So we need to
>> have a dual CPU machine, which is fine with me.
>>
>> I am wondering about the vSMP / ScaleMP suggestion from Joe. If I am using an
>> InfiniBand network here, would I be able to spread the 'bottlenecks' a bit
>> better? What I mean is: when I tested the InfiniBand on the new cluster
>> we got, I noticed that if you run a job in parallel across nodes, the
>> same number of cores is marginally faster. At the time I put that down to
>> slightly faster memory access, as there was no bottleneck to the RAM.
>> I am not familiar with vSMP (i.e. I never used it), but is it possible to
>> aggregate RAM from a number of nodes (say 40) and use it as a large virtual
>> SMP? So one node would be slaving away with the calculations and the other
>> nodes are only doing memory IO. Is that possible with vSMP?
>> In a related context, how about NUMAScale?
>>
>> The idea of the aggregated SSDs is nice as well. I know some storage vendors
>> are using a mixture of RAM and SSD for their meta-data (fast access) and that
>> seems to work quite well. So that would be a large swap file / partition, or is
>> there another way to use disc space as RAM? I need to read the NVMalloc paper,
>> I suppose. Is that actually used, or is it just a good idea for which we
>> have a working example?
>>
>> I don't think there is much disc IO here. There is most certainly no network
>> bound traffic as it is a single thread. A fast CPU would be an advantage as
>> well; however, I have the feeling the trade-off would be the memory access speed
>> (bandwidth).
>>
>> I have tried to answer the questions raised. Let me know whether there are
>> still some unclear points.
>>
>> Thanks for all your help and suggestions so far. I will need to digest that.
>>
>> All the best from a sunny London
>>
>> Jörg
>>
>> --
>> *************************************************************
>> Jörg Saßmannshausen
>> University College London
>> Department of Chemistry
>> Gordon Street
>> London
>> WC1H 0AJ
>>
>> email: j.sassmannshausen at ucl.ac.uk
>> web: http://sassy.formativ.net
>>
>> Please avoid sending me Word or PowerPoint attachments.
>> See http://www.gnu.org/philosophy/no-word-attachments.html
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf