[Beowulf] Intel Quad-Core or AMD Opteron

Steffen Persvold steffen.persvold at scali.com
Fri Aug 24 05:53:06 PDT 2007


> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
On
> Behalf Of Greg Lindahl
> Sent: Thursday, August 23, 2007 12:14 PM
> To: beowulf at beowulf.org
> Subject: Re: [Beowulf] Intel Quad-Core or AMD Opteron
> 
> On Thu, Aug 23, 2007 at 09:09:57AM -0400, Douglas Eadline wrote:
> 
> >    Naturally, if you have four processes running
> >    it is best if each one gets its own Woodcrest. To the OS
> >    they all look the same. Other than Intel MPI, I don't
> >    know of any other MPI that attempts to optimize this.
> 
> InfiniPath MPI has always optimized this. Of course, there's no way it
> can read your mind and know if you are going to run a second 4-core
> job on the same node, so there is no perfect default. But it has
> switches to give you either behavior, tightly packed or spread out.
> 

You mean like this (with Scali MPI Connect, verbose output):

1st job (4 processes on a dual-socket quad-core machine):

Affinity 'automatic' policy BANDWIDTH granularity CORE nprocs 4
Will bind process 0 with mask 1000000000000000 [(socket: 0 core: 0 execunit: 0)]
Will bind process 1 with mask 0000100000000000 [(socket: 1 core: 0 execunit: 0)]
Will bind process 2 with mask 0100000000000000 [(socket: 0 core: 1 execunit: 0)]
Will bind process 3 with mask 0000010000000000 [(socket: 1 core: 1 execunit: 0)]


2nd job on the same machine (while the other one is running):

Affinity 'automatic' policy BANDWIDTH granularity CORE nprocs 4
Will bind process 0 with mask 0010000000000000 [(socket: 0 core: 2 execunit: 0)]
Will bind process 1 with mask 0000001000000000 [(socket: 1 core: 2 execunit: 0)]
Will bind process 2 with mask 0001000000000000 [(socket: 0 core: 3 execunit: 0)]
Will bind process 3 with mask 0000000100000000 [(socket: 1 core: 3 execunit: 0)]

In Scali MPI Connect, a "bandwidth" policy means that you want the
processes to use as many sockets as possible (i.e., spread out) to
optimize for memory bandwidth. There is also a "latency" policy, which
uses as few sockets as possible (to optimize for shared cache usage).
These automatic policies take into account processes running on the
node that already have their affinity set (MPI jobs or not). The policy
type (bandwidth vs. latency) is of course user-controllable (who are we
to say how your application performs best :) )
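
To give a feel for the difference, here is a toy sketch of the two
placement strategies, not the actual Scali implementation; the linear
core numbering (core id = socket * cores_per_socket + core) is purely
an assumption for the example:

#include <stdio.h>

/* "bandwidth": round-robin ranks across sockets so each job spreads
 * out and gets more aggregate memory bandwidth */
static int place_bandwidth(int rank, int sockets, int cores_per_socket)
{
    int socket = rank % sockets;
    int core   = rank / sockets;
    return socket * cores_per_socket + core;
}

/* "latency": fill one socket before moving to the next, keeping ranks
 * close together so they can share cache */
static int place_latency(int rank, int sockets, int cores_per_socket)
{
    int socket = rank / cores_per_socket;
    int core   = rank % cores_per_socket;
    (void)sockets;  /* kept only for a symmetric signature */
    return socket * cores_per_socket + core;
}

int main(void)
{
    const int sockets = 2, cores_per_socket = 4, nprocs = 4;
    int rank;

    for (rank = 0; rank < nprocs; rank++)
        printf("rank %d -> core %d (bandwidth) / core %d (latency)\n",
               rank,
               place_bandwidth(rank, sockets, cores_per_socket),
               place_latency(rank, sockets, cores_per_socket));
    return 0;
}

With 2 sockets and 4 cores per socket, the bandwidth policy places 4
ranks the same way as the first job above (cores 0 and 1 on each of
socket 0 and socket 1), while the latency policy would pack all 4 ranks
onto socket 0.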

Anyway, enough "marketing" for now, I guess. The bottom line is that
having the right kind of processor affinity control mechanisms has
proven to be key to getting good application performance (but y'all
knew that already, I guess..). One thing we have seen is that on these
quad-core/dual-socket machines, your total job throughput is somewhat
higher if you use half the cores per job but run twice as many jobs
(i.e., like my example above). I don't yet have a good explanation for
this, but you'd probably like to discuss it :)

Cheers,

Steffen Persvold
Chief Software Architect - Scali MPI Connect

http://www.scali.com/
Higher Performance Computing




