[Beowulf] Can one Infiniband net support MPI and a parallel filesystem?

Wed Aug 13 10:00:23 PDT 2008

Hi resource-management experts and list,

I started this thread before it took on a life of its own and reached
its current incarnation.
So, please let me add my two cents to this interesting discussion.

We do share nodes on our cluster.
After all, we have only 32 nodes (64 processors: dual-processor,
single-core) on our 6.5+ year old cluster,
and many climate and ocean modeling projects to run there.
We gladly thank NOAA for helping us get the cluster in 2002!  :)
It's been hard to get support for a replacement ...

We share nodes for the reasons pointed out by Chris, Joe, and others.
One reason not yet mentioned is serial programs.
Granted, a cluster is meant to run parallel jobs.
However, we don't have the money to buy a farm of serial machines,
and a few users have valuable scientific projects but don't know
(or don't want to know) how to turn their serial code into a
parallel algorithm.
You really want multiple instances of this type of job to share nodes
whenever possible.

Last time I checked, our cluster was used more heavily than the NCAR
machines, for instance.
We have averaged less than 72 hours of downtime per year
(I stayed awake; I am the IT team, programmer, and factotum),
and about 75% use of its maximum capacity
(i.e. all nodes and processors working all the time, 24 / 7 / 365, for
6.5+ years).
I couldn't find usage data for other public, academic, or industry
machines to compare.
However, I guess there are small clusters like ours out there which are
under more intensive use
than ours, doing good science and useful applications, and sharing nodes.
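
As a rough back-of-the-envelope check on those figures, here is a minimal Python sketch. It assumes a non-leap year of 8760 hours, treats "less than 72 hours" as an upper bound on downtime, and assumes the 75% utilization applies uniformly to all 64 processors over 6.5 years. The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability(downtime_hours_per_year: float) -> float:
    """Fraction of the year the cluster was up, given yearly downtime in hours."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR


def busy_cpu_hours(n_cpus: int, utilization: float, years: float) -> float:
    """Processor-hours actually used, assuming uniform utilization (assumption)."""
    return n_cpus * utilization * HOURS_PER_YEAR * years


if __name__ == "__main__":
    # 72 hours/year is the upper bound quoted above, so this is a lower bound.
    print(f"availability: at least {availability(72):.2%}")
    # 64 processors at about 75% for 6.5 years.
    print(f"busy processor-hours: about {busy_cpu_hours(64, 0.75, 6.5):,.0f}")
```

Under these assumptions, 72 hours of downtime per year corresponds to an availability of at least about 99.2%.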

Yes, we did have cases of jobs failing on a node and breaking another
job sharing the node.
After I banned the use of Matlab (to the dismay and revolt of many
users), things improved on this front,
but it still happens occasionally.

As Craig pointed out, the current trend of overpopulating a single node
with many cores
may pose further challenges in managing things like processor and memory
affinity requests, etc.

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Craig Tierney wrote:

> Joe Landman wrote:
>
>> Craig Tierney wrote:
>>
>>> Chris Samuel wrote:
>>>
>>>> ----- "I Kozin (Igor)" <i.kozin at dl.ac.uk> wrote:
>>>>
>>>>>> Generally speaking, MPI programs will not be fetching/writing data
>>>>>> from/to storage at the same time they are doing MPI calls so there
>>>>>> tends to not be very much contention to worry about at the node
>>>>>> level.
>>>>>
>>>>> I tend to agree with this. 
>>>>
>>>>
>>>> But that assumes you're not sharing a node with other
>>>> jobs that may well be doing I/O.
>>>>
>>>> cheers,
>>>> Chris
>>>
>>>
>>> I am wondering, who shares nodes in cluster systems with
>>> MPI codes?  We never have shared nodes for codes that need
>>
>>
>> The vast majority of our customers/users do.  Limited resources, they 
>> have to balance performance against cost and opportunity cost.
>>
>> Sadly not every user has an infinite budget to invest in contention 
>> free hardware (nodes, fabrics, or disks).  So they have to maximize 
>> the utilization of what they have, while (hopefully) not trashing the 
>> efficiency too badly.
>>
>>> multiple cores since we built our first SMP cluster
>>> in 2001.  The contention for shared resources (like memory
>>> bandwidth and disk IO) would lead to unpredictable code performance.
>>
>>
>> Yes it does.  As does OS jitter and other issues.
>>
>>> Also, a poorly behaved program can cause the other codes on
>>> that node to crash (which we don't want).
>>
>>
>> Yes this happens as well, but some users simply have no choice.
>>
>>>
>>> Even at TACC (62000+ cores) with 16 cores per node, nodes
>>> are dedicated to jobs.
>>
>>
>> I think every user would love to run on a TACC-like system.  I think
>> most users have a budget for something less than 1/100th the size.
>> It's easy to forget how much resource (un)availability constrains
>> actions when you have very large resources to work with.
>>
>
> TACC probably wasn't a good example for the "rest of us".  It hasn't been
> difficult to dedicate nodes to jobs when the number of cores was 2 or 4.
> We now have some 8 core nodes, and we are wondering if the policy of
> not sharing nodes is going to continue, or at least be modified to
> minimize waste.
>
> Craig
>
>
>> Joe
>>
>>
>
>