[Beowulf] first cluster [was [OMPI users] trouble using openmpi under slurm]

Gus Correa gus at ldeo.columbia.edu
Fri Jul 9 16:06:05 PDT 2010


Douglas Guptill wrote:
> On Thu, Jul 08, 2010 at 09:43:48AM -0400, Gus Correa wrote:
>> Douglas Guptill wrote:
>>> On Wed, Jul 07, 2010 at 12:37:54PM -0600, Ralph Castain wrote:
>>>
>>>> No....afraid not. Things work pretty well, but there are places
>>>> where things just don't mesh. Sub-node allocation in particular is
>>>> an issue as it implies binding, and slurm and ompi have conflicting
>>>> methods.
>>>>
>>>> It all can get worked out, but we have limited time and nobody cares
>>>> enough to put in the effort. Slurm just isn't used enough to make it
>>>> worthwhile (too small an audience).
>>> I am about to get my first HPC cluster (128 nodes), and was
>>> considering slurm.  We do use MPI.
>>>
>>> Should I be looking at Torque instead for a queue manager?
>>>
>> Hi Douglas
>>
>> Yes, works like a charm along with OpenMPI.
>> I also have MVAPICH2 and MPICH2, no integration w/ Torque,
>> but no conflicts either.
> 
> Thanks, Gus.
> 
> After some lurking and reading, I plan this:
>   Debian (lenny)
>   + fai                   - for compute-node operating system install
>   + Torque                - job scheduler/manager
>   + MPI (Intel MPI)       - for the application
>   + MPI (OpenMPI)         - alternative MPI
> 
> Does anyone see holes in this plan?
> 
> Thanks,
> Douglas


Hi Douglas

I never used Debian, fai, or Intel MPI.

We have two clusters with cluster management software, i.e.,
mostly the operating system install stuff.

I made a toy Rocks cluster out of old computers.
Rocks is a minimum-hassle way to deploy and maintain a cluster.
Of course you can do the same from scratch, or do more, or do better,
which makes some people frown at Rocks.
However, Rocks works fine, particularly if your network(s)
is (are) Gigabit Ethernet,
and if you don't mix different processor architectures (i.e. only i386 
or only x86_64, although there is some support for mixed stuff).
It is developed/maintained by UCSD under an NSF grant (I think).
It's been around for quite a while too.

You may want to take a look, perhaps experiment with a subset of your
nodes before you commit:

http://www.rocksclusters.org/wordpress/

There is a decent user guide:

http://www.rocksclusters.org/roll-documentation/base/5.3/

and additional documentation/tutorials:

http://www.rocksclusters.org/wordpress/?page_id=4

The basic software comes in what they call "rolls".
The (default) OS is actually CentOS.
They only support a few "Red-Hat-type" distributions (IIRC, RHEL and
Scientific Linux), but CentOS is fine.
You could use just the mandatory rolls (Kernel/Boot, Core, and
OS disks 1 and 2).  I would suggest installing all of the OS disks,
though, so as to have any packages that you may need later on.
In addition, there is a roll with Torque+Maui that you can get
from the Univ. of Tromso, Norway:

ftp://ftp.uit.no/pub/linux/rocks/torque-roll/
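On the frontend, adding a roll like the one above goes roughly as
follows (this is a sketch from memory of the Rocks 5.x commands; the
ISO file name is a placeholder for whatever you download):

```shell
# Register the downloaded roll ISO with the Rocks distribution,
# enable it, and rebuild the distribution tree:
rocks add roll torque-roll.iso
rocks enable roll torque
(cd /export/rocks/install && rocks create distro)

# The compute nodes pick up the new roll on their next reinstall, e.g.:
# rocks run host compute "/boot/kickstart/cluster-kickstart"
```

Check the roll's own README, though; some rolls have extra
post-install steps on the frontend.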

If you want to install Torque,
*don't install the SGE (Sun Grid Engine) roll*.
It is either one resource manager or the other (they're incompatible).
I am a big fan and old user of Torque, so my bias is to
recommend Torque, but other people prefer SGE.
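For what it's worth, Torque jobs need very little boilerplate once
OpenMPI is built with Torque support.  A minimal submission script
might look like this (job name, node counts, and program name are
placeholders, of course):

```shell
#!/bin/sh
#PBS -N mpi_test
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
# An OpenMPI built with Torque (TM) support reads the node list and
# process count from the Torque environment, so mpirun needs no
# -np or -hostfile arguments here.
mpirun ./my_mpi_program
```

Submit it with "qsub job.sh" and Torque/Maui take care of the rest.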

The basic software takes care of compute node installation,
administration of user accounts, etc.
It can be customized in several ways
(e.g. if you have two networks, one for MPI, another
for cluster control and I/O, which I would recommend).
It also includes a basic web page for your cluster (via Wordpress),
which you can also customize, and very nice web-based
monitoring of your nodes through Ganglia.
It also has support for upgrades, and they tend to come up with a
new release once a year or so.

There is also a large user base and an active mailing list:

https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion
http://marc.info/?l=npaci-rocks-discussion

You can build OpenMPI (and MPICH2) from source
with any/all of your favorite compilers,
and install them, along with any other external
software (even Matlab, if you are so inclined, or your users demand),
in an NFS-mounted directory
(typically /share/apps in Rocks),
so as to make them accessible to the compute nodes.
You could do the same for, say, NetCDF libraries and utilities
(NCO, NCL), etc.
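A typical OpenMPI build into the shared tree might go like this
(version number, compiler choices, and the Torque prefix are
assumptions; adjust to your site):

```shell
# Build OpenMPI from a source tarball (assumed already downloaded
# from open-mpi.org) into an NFS-shared prefix the nodes can see.
VERSION=1.4.2
PREFIX=/share/apps/openmpi/$VERSION-gcc   # NFS-exported in stock Rocks

tar xjf openmpi-$VERSION.tar.bz2
cd openmpi-$VERSION
# --with-tm points at the Torque install and enables the tight
# Torque integration mentioned above (default prefix is often
# /opt/torque, but check yours).
./configure --prefix=$PREFIX \
    --with-tm=/opt/torque \
    CC=gcc CXX=g++ F77=gfortran FC=gfortran
make -j 4 && make install
```

Repeat with a different PREFIX for each compiler suite you support.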

What is the interconnect/network hardware you have for MPI?
Gigabit Ethernet?  Infiniband?  Myrinet? Other?

If Gigabit Ethernet, Rocks won't have any problem.
If InfiniBand, you may need to add the OFED packages, but they may
come with CentOS now; I am not sure.
If Myrinet, I am not sure either: Myricom provided a Rocks roll up to
Rocks 5.0, but I don't know the current status (Rocks is now at 5.3).

If you are going to handle a variety of compilers and MPI
flavors, in various versions, I recommend using the
"Environment Modules" package.
It is quite a convenient (and consistent) way to let users switch
from one environment to another (change compilers, MPI, etc.)
with good flexibility.
You can install Environment Modules separately (say, via yum or RPM)
with no compatibility issues whatsoever with Rocks:

http://modules.sourceforge.net/
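A modulefile for the OpenMPI install above is just a short Tcl
fragment; here is a sketch that writes one (the directory layout,
version, and install prefix are all assumptions for illustration):

```shell
# Create a modulefile for a hypothetical OpenMPI 1.4 under /share/apps.
# MODDIR would typically be something like /share/apps/modulefiles/openmpi,
# added to the MODULEPATH on all nodes.
MODDIR=${MODDIR:-./modulefiles/openmpi}
mkdir -p "$MODDIR"
cat > "$MODDIR/1.4" <<'EOF'
#%Module1.0
## Hypothetical modulefile for OpenMPI 1.4 (gcc build)
proc ModulesHelp { } { puts stderr "OpenMPI 1.4 built with gcc" }
set prefix /share/apps/openmpi/1.4.2-gcc
prepend-path PATH            $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
prepend-path MANPATH         $prefix/share/man
EOF
```

After that, users just do "module load openmpi/1.4" (or swap it for
another flavor) and get a consistent PATH/LD_LIBRARY_PATH.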

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------




