[Beowulf] Can one Infiniband net support MPI and a parallel filesystem?

Wed Aug 13 22:37:35 PDT 2008

----- "Kilian CAVALOTTI" <kilian.cavalotti.work at gmail.com> wrote:

> Hi Chris,

Hello Kilian,

> On Tuesday 12 August 2008 08:29:31 pm Chris Samuel wrote:
>
> > We do use things like cpusets to try and limit the impact
> > that jobs can have on other jobs on the same nodes, 
> 
> I'm actually curious about how you implemented that.

Not a problem.

> Do you have NUMA hardware?

Yes, the cluster we're using this on has dual quad core
Barcelona CPUs and 32GB RAM per node (to get it to the
4GB/core level).  It's running CentOS 5 with the mainline
kernel.

> Do you use a resources manager, and is the cpusets creation
> process integrated with it?

We are using Torque (an open source PBS derivative)
and that has built in cpusets support.   It previously
had some support for the older SGI Altix cpusets but
that has now been replaced with support for the 2.6
kernel implementation (which itself has now been pulled
into the more generic cgroups work).

The 2.6 cpuset support in Torque came out of a long
discussion between Garrick Staples and myself at SC'07
where we nutted out the basic design and Garrick then
did the hard work of implementing it.

> How do you manage concurrent jobs running on the same
> machine: do you pin them on specific CPUs and keep track
> of  what CPU is busy and which is not, or do you have a
> way to just limit the number of CPUs they're using?

There are two major assumptions in the current Torque
code:

1) There is a direct mapping between Torque's concept
of vnodes (cpus) and cores.  I.e. if you have told Torque
a node has 8 cpus then it has 8 cores to bind to.

2) The cpus are contiguous and start at 0.  So if you
are using a boot cpuset then it's best to reserve the
*last* core in the box for that and not the first. You
will also need to tell Torque that the node has N-1 cpus.

The design is sort of hierarchical:

1) A top level "torque" cpuset is created by the pbs_mom
when it starts if it does not already exist.  It adds all
the cpus and mems into it.

2) When a job is scheduled onto the node(s) the pbs_mom
creates a job cpuset which includes the specific cpus
(vnodes) that have been allocated by the scheduler, and
all the mems present (it currently makes no attempt to be
clever about that).

3) Prior to the 2.3.2 release there was a per vnode (core)
cpuset created within the job cpuset and then processes
launched via the PBS tm_spawn interface by tools like
Pete Wyckoff's mpiexec would get locked to a core.

Great in theory, but...

That's been changed now to just put processes in the
job cpuset as MPI tools like OpenMPI's mpiexec only
make a single tm_spawn call *per node* and then fork
the MPI processes from that so you would end up with
all the processes of an OpenMPI job locked to a single
core with the old code.

This still leaves issues for codes that use rsh/rsh
based MPI launchers but we're playing around with a
drop in script that makes it do the right thing using
pbsdsh instead.

> As you can guess, I'd be interested in some technical details. :)

Hope that's useful!

We also have an init script that does:

mkdir /dev/cpuset
mount -t cpuset none /dev/cpuset

to make sure the cpuset VFS is there on boot.

Tangent: Linux cpusets were how we found that the noacpi
boot option broke the kernels detection of NUMA capabilities [1]
on Barcelona as /dev/cpuset/mems only had "0" in it, not "0-1"
as it should have had!

[1] - it first tries a K8 specific hack and then uses
ACPI, so for K10 no ACPI - no NUMA. ;-)

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency