[Beowulf] shared memory versus MPI and bootless boot

Vincent Diepeveen diep at xs4all.nl
Wed Jun 28 18:38:47 PDT 2006


----- Original Message ----- 
From: "Brian Dobbins" <brian.dobbins at yale.edu>
To: "Vincent Diepeveen" <diep at xs4all.nl>
Cc: "pauln" <pauln at psc.edu>; "Eray Ozkural" <examachine at gmail.com>; 
<beowulf at beowulf.org>
Sent: Saturday, June 03, 2006 11:04 AM
Subject: Re: [Beowulf] Building my own highend cluster


> Hi Vincent (and others),
>
>  I just wanted to add my own two cents after having fairly recently
[snip]

Thanks, I'll have a look at it!

Of course I prefer to just put in a CD-ROM, hit enter and then connect the
cables.

But really, when you guys talk about cfengine I have no clue what universe
you're talking about.

If I boot a machine without a hard drive, the machine basically says: "F you,
error! Press enter to reboot"

OK, please let's start there. What do I do after getting that message?

Which key do I hit?

> recalled the relative complexity of creating diskless nodes 'by hand' a
> few years back and subsequently finding the wonderful simplicity of
> tools such as Warewulf (or Rocks).  So, in the interest of providing
> more information to the discussion at hand, here's a bit more detail and
> other assorted thoughts:

So I put a Warewulf CD-ROM in the 'masternode', press enter, select
"boot over network" in the BIOS of all the 'diskless nodes', and it all works fine?

By the way, does that 'boot over network' mean I need a 16-port hub for 100 mbit
and have to connect all the machines also to 100 mbit besides the Quadrics network?

About Warewulf, small problem: how do I boot it together with OpenSSI and the
Elan3 drivers?

Now don't tell me it's based upon open-BS; learning Linux when Linus started
releasing it at the start of the 90s was already hard enough for me :)

>>[From pauln]
>>.. my apologies in advance:
>>http://www.psc.edu/~pauln/Diskless_Boot_Howto.html
>
>  While I think cfengine and custom scripts gives a ton of flexibility,
> I've found it much easier on our diskless clusters to use the Warewulf
> software ( http://www.warewulf-cluster.org/ ).  It handles a lot of the
> behind-the-scenes dirty work for you (ie, making the RAM disks/tmpfs,
> configuring PXE & DHCP, etc.) and the people on the mailing list tend to
> be quick to respond to troubles with effective solutions.  Also, it's
> actively supported by other people and it just makes life a lot easier,
> in my opinion.  It isn't hard at all to tweak, either, and I'd happily
> go into more detail if you wish, but I'd really recommend a quick look
> through the website as well, just to get a rough idea of the process.
>
>  Secondly, though I haven't used it myself, I recently spoke with a
> friend who was very knowledgeable about Rocks, which also has a diskless
> mode, I'm told.  Here's the link for that: (
> http://www.rocksclusters.org/ )
>
>> Programming in MPI/SHMEM is by the way pretty retarded way to program.
>
>  If ease-of-use and shared-memory style are more important to you than
> performance, you might be interested in checking out the "Cluster
> OpenMP" developments in the Intel compilers.

Of course OpenMP doesn't even enter the picture here.

No, no, shared memory programming is way easier.

Just share some memory in Linux with shmget and shmat and you have shared
memory.

That's basically how Diep works.
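
For the sake of argument, a minimal sketch of what that looks like (the key
and the size here are just placeholder values, not what Diep actually uses):

    /* minimal sketch: allocate a System V segment by key and attach it;
       the key and size are placeholders for illustration */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void) {
        int shmid;
        char *mem;

        shmid = shmget((key_t)0x4449, 64UL * 1024 * 1024, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        mem = (char *)shmat(shmid, NULL, 0);
        if (mem == (char *)-1) { perror("shmat"); return 1; }

        /* every process that does the same shmget/shmat with the same key
           reads and writes exactly the same memory */

        shmdt(mem);
        return 0;
    }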

If I go and add all kinds of fancy MPI calls to that, it of course first
slows me down by a factor of 2 or so on a single processor.

Much easier is to just keep using what I've got: start n processes on n
cores, and use shared memory to divide the memory into segments.

The assumption in Diep is that the process which allocates a shared memory
segment first and also cleans it (or initializes it, whatever you want to
call it) is the processor at which the memory physically gets allocated.

If that principle gets followed, then Diep runs fine in parallel, even with
pretty bad latencies from processor to processor.
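
In code it is roughly something like the sketch below (simplified, not the
actual Diep source; the HashEntry layout and the key values are made up for
the example):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <string.h>

    #define MAXPROCESSORS 64

    /* dummy entry layout, just for this sketch */
    typedef struct { unsigned long long key; unsigned long long data; } HashEntry;

    HashEntry *globaltrans[MAXPROCESSORS];

    /* every one of the nprocesses processes calls this with its own myprocnr */
    void attach_hashtables(int myprocnr, int nprocesses, unsigned long nentries) {
        key_t keybase = (key_t)0x4449;   /* arbitrary, the same in every process */
        size_t bytes = nentries * sizeof(HashEntry);
        int i, shmid;

        /* create my own segment and zero it myself, so its pages get
           allocated on my own memory node */
        shmid = shmget(keybase + myprocnr, bytes, IPC_CREAT | 0600);
        globaltrans[myprocnr] = (HashEntry *)shmat(shmid, NULL, 0);
        memset(globaltrans[myprocnr], 0, bytes);

        /* attach the segments that the other processes create and zero
           for themselves */
        for (i = 0; i < nprocesses; i++) {
            if (i == myprocnr) continue;
            shmid = shmget(keybase + i, bytes, IPC_CREAT | 0600);
            globaltrans[i] = (HashEntry *)shmat(shmid, NULL, 0);
        }
    }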

The luck I've got with Diep is that it has the most chess knowledge in its
evaluation function of all chess programs in the world. That's a result of me
having been dogfood for world top players over the years, some of them even
in the world top 10 (and I actually managed to draw a world top 6 player once
myself in an official major league game). You learn the game quickly then :)

So needing those 64 bytes from a remote node doesn't happen too frequently
in Diep, and with 4 cores on a dual Opteron the odds of the entry being on a
remote memory node are of course far less than 50%.

An example of access to remote memory is the hash table lookup:

    unsigned int
      l, procnr, hindex;
    /* pick the owning process: multiply + shift instead of a modulo */
    procnr = ((((unsigned int)(hashpos.lo & 0x000000000000ffff)) * nprocesses) >> 16);
    /* pick the slot in that process's table, same multiply + shift trick */
    hindex = (unsigned int)((((hashpos.lo >> 16) & 0x00000000ffffffff) * abmod) >> 32);
    hentry = &(globaltrans[procnr][hindex]);

So basically there exists:
   HashEntry *globaltrans[MAXPROCESSORS];

I simply attach the shared memory of the remote processes to that array with
shmget/shmat.
Then what happens is a lookup.

This is of course a lot simpler than OpenMP, not to mention MPI.

This is, simply put, also how you program for a shared memory machine such as
a quad Opteron or a quad Xeon.

This is how the commercial version of the software looks too, of course.

As you see, I also avoid a slow 'modulo' instruction or two in the code.
Average coders would write something like:

    procnr = ((unsigned int)hashpos.lo) % ((unsigned int)nprocesses);
    hindex = (unsigned int)((hashpos.lo >> 16) % abmod);

Modulo and division are BAD on the processor. Very, very slow.

Though not nearly as slow as an MPI call.
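
In case it isn't obvious why the multiply-and-shift does the same job: for a
16-bit input x and n buckets, (x * n) >> 16 always lands in [0, n), just like
x % n would. A tiny standalone check (just an illustration, not Diep code):

    /* check that (x * n) >> 16 maps every 16-bit x into [0, n)
       without a division instruction; n = 16 is just an example */
    #include <stdio.h>

    int main(void) {
        unsigned int n = 16, x, bucket, highest = 0;

        for (x = 0; x < 65536; x++) {
            bucket = (x * n) >> 16;          /* multiply + shift */
            if (bucket >= n) { printf("out of range at %u\n", x); return 1; }
            if (bucket > highest) highest = bucket;
        }
        printf("all 16-bit keys map into [0, %u), highest bucket seen: %u\n",
               n, highest);
        return 0;
    }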

>  This is mostly an aside, but why would you need to strip MPI commands to
> run on a 4 or 8 processor system?

The basic point is: most scientists first slow down their program by a factor
of 20 to get MPI in, and then simply throw a factor of 1000 in hardware at it.

I can't afford that loss on a single-mainboard machine. This software is
written very optimized, to run optimally on a single-mainboard machine.
No slowdowns.

So if I add MPI calls, that slows me down.

If I move from my dual-core dual Opteron to a 16-node cluster using MPI
calls, my first priority is to be faster than something very well optimized
for a single-mainboard machine.

THAT IS NOT EASY.

> matter.  I agree shared memory methods are easier to program, but I

It's not about stripping.

We're talking about 2.2 MB of optimized C code that I would be ADDING MPI
calls to, with all the bugs you get from that and that then need to be fixed.
Bugfixing that takes years.

Vincent

>  Finally, going back to the beginning of the discussion, I'd just caution
> you about putting motherboards on a slab of wood in a garage.  The
> filter might keep dust out of the garage, but other things always seem
> to manage to get into garages, and lots of creepy-crawly things love
> warmth and light - two things your system are bound to give off.  :)

Bugs :)

Thanks,
Vincent



