[Beowulf] single machine with 500 GB of RAM

Wed Jan 9 09:18:31 PST 2013

On 1/9/13 11:29 AM, Jörg Saßmannshausen wrote:

> I am wondering about the vSMP / ScaleMP suggestion from Joe. If I am using an
> InfiniBand network here, would I be able to spread the 'bottlenecks' a bit
> better? What I am after is, when I tested out the InfiniBand on the new cluster
Well, it depends on how the code hits memory.

> we got, I noticed that if you are running a job in parallel between nodes, the
> same amount of cores are marginally faster. At the time I put that down due to
> a slightly faster memory access as there was no bottleneck to the RAM.
> I am not familiar with vSMP (i.e. I never used it), but is it possible to
> aggregate RAM from a number of nodes (say 40) and use it as a large virtual
> SMP? So one node would be slaving away with the calculations and the other
Yes ... though this will probably exceed the budget for licenses. I'd 
suggest contacting a UK reseller to see if there are edu discounts that 
you can leverage, but this is one of the nicer features of vSMP, in that 
you can create the large machine when you need it.  It helps if you have 
a set of machines to work with.

[disclaimer:  we have no financial interest in ScaleMP]

> nodes are only doing memory IO. Is that possible with vSMP?
> In a related context, how about NUMAScale?
Yes, it is possible in vSMP.  I don't know much about NUMAScale, so I 
couldn't tell you.

>
> The idea of the aggregated SSDs is nice as well. I know some storage vendors
> are using a mixture of RAM and SSD for their meta-data (fast access) and that
> seems to work quite well. So that would be a large swap file / partition or is

Yes it does.

> there another way to use disc-space as RAM? I need to read the paper of
> NVMalloc I suppose. Is that actually used or is that just a good idea and we
> got a working example here?

You could look at RAM as a cache against a large SSD/Flash storage 
system.  To get to a few GB/s, you would need sequential large block 
access, so your calculation should have fairly good locality if possible.
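
As a rough illustration of what "sequential large block access" could look 
like from the application side, here is a minimal Python sketch that maps 
a file and walks it front to back in large blocks.  The function name, the 
4 MiB block size, and the byte-sum workload are assumptions made up for 
the example; they are not taken from any real code discussed in this thread.

```python
import mmap

BLOCK = 4 * 1024 * 1024  # 4 MiB per step; an assumed value, tune per device


def sum_bytes_blockwise(path, block=BLOCK):
    """Walk a file-backed mapping front to back in large blocks.

    The byte sum is only a stand-in for a real calculation.
    """
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Hint sequential access where the platform exposes madvise
            # (Python 3.8+ on Linux); skipped silently elsewhere.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, len(m), block):
                # Each slice touches one contiguous block of the mapping.
                total += sum(m[off:off + block])
    return total
```

The point is only the access pattern: each step touches one contiguous 
block, so the storage underneath sees large sequential reads rather than 
scattered small ones.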

Worst case, you could simply add N SSDs, add them all to swap with the 
same priority (very important so you don't serialize access too badly), 
and then just run your code.  Performance is going to be pretty poor 
without significant app tuning though (see locality above).  The kernel 
paging mechanism is not a high performance pathway, and paging is done 
only 4 kB at a time.  If you could trick the system into using mmap'ed 
files and could use huge pages, this would be better, but you would 
still be (badly) performance bound.
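
To make the "same priority" point concrete, here is a small Python sketch 
that writes out /etc/fstab swap entries sharing one pri= value; Linux 
allocates pages round-robin across swap areas of equal priority.  The 
helper name and the device names are hypothetical, and the priority of 10 
is an arbitrary choice.

```python
def swap_fstab_lines(devices, priority=10):
    """Return /etc/fstab swap entries that all share one priority."""
    # swapon accepts user-set priorities up to 32767.
    if not 0 <= priority <= 32767:
        raise ValueError("swap priority must be in 0..32767")
    return [f"{dev} none swap sw,pri={priority} 0 0" for dev in devices]


if __name__ == "__main__":
    # Hypothetical device names; substitute the real SSD partitions.
    for line in swap_fstab_lines(["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]):
        print(line)
```

If the priorities differ, the kernel fills the highest-priority area first 
before touching the next one, which is the serialization to avoid.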

>
> I don't think there is much disc IO here. There is most certainly no network
> bound traffic as it is a single thread. A fast CPU would be of advantage as
> well, however, I got the feeling the trade-off would be the memory access speed
> (bandwidth).
>
> I have tried to answer the questions raised. Let me know whether there are
> still some unclear points.
>
> Thanks for all your help and suggestions so far. I will need to digest that.
>
> All the best from a sunny London
>
> Jörg
>

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/siflash
phone: +1 734 786 8423 x121
cell : +1 734 612 4615