From lindahl at pbm.com  Sat Dec  1 15:15:31 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Sat, 1 Dec 2007 15:15:31 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
Message-ID: <20071201231531.GA4736@bx9.net>

On Thu, Nov 29, 2007 at 11:26:45AM -0800, Tom Elken wrote:

> The SPEC HPG (High Performance Group) is having discussions about using
> a hybrid of MPI and thread-level parallelism on the SPEC MPI2007
> benchmark suite.

I'd find it useful to debunk the notion that hybrid programming
actually gives a speedup. That's probably not what HPG has in mind,
but it'd be useful to the community.

-- greg


From quantummechanicsllc at msn.com  Sat Dec  1 13:25:09 2007
From: quantummechanicsllc at msn.com (Donald Shillady)
Date: Sat, 1 Dec 2007 16:25:09 -0500
Subject: [Beowulf] Really efficient MPIs??
In-Reply-To: <e4d4fd070711280737s28316b9amfd0993610b6925cf@mail.gmail.com>
References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com>
	<474D6779.5010000@charter.net>
	<FF0718A6-29AD-44C4-B2D5-DDE023D20B47@cs.earlham.edu> 
	<e4d4fd070711280737s28316b9amfd0993610b6925cf@mail.gmail.com>
Message-ID: <BAY115-W18FBD5C06AC4856F7D7B2CB4720@phx.gbl>


Please pardon my naive questions, but this is surely the place to get an expert answer.  I am enthused by the recent microWulf built by Prof. Adams and his student at Calvin College.  That device approached a "homogeneous" parallel system with all the same core frequencies and achieved over 26 GFLOPS for about $1200.  With private funds and a curious 3-year-old grandson, I prefer to enclose the "System" in four PC cases and an external Ethernet switch.  I also want to maintain the performance I now have with a 3 GHz Toshiba Pentium 4 laptop, so I prefer an "asymmetric" system with a fast Master node with a lot of frills and three slower dual-core satellite PCs.

About ten years ago I was able to link an HP 9000/720 running HP-UX in one building with three SGI nodes running IRIX in another building, connected by TCP/IP Ethernet.  The link was a pretty bad mismatch between the slower HP 9000/720 and the faster SGI CPUs of that time.  Sadly I do not recall the name of that message passing system, but it was something with "Theoretical Chemists ......".  Was that TCP/IP?  Anyway I know it is possible to link CPU/cores with different speeds and different memory-bus speeds so my question is whether "Open MPI" can handle this situation?  Specifically, suppose I set up:
 
1. a Master box with an AMD X2 5800+ overclocked to 3.0 GHz with DDR2 800 memory (at least 4GB, maybe 8GB) and one 300 GB SATA drive; there would also be other creature comfort frills on the Master box like CD R/W, floppy drive, graphics card etc.
 
2. three cheaper AMD X2 4000+ (2.1 GHz) and running cheaper DDR2 667 memory; bare bones, no drives just CPU, memory and gigE switch.
 
3. connected by a Trendware TEG-S80TXE 8-port Gigabit Ethernet switch with associated NIC switches.
 
If all the CPU/nodes/cores were AMD X2 4000+ units this should be similar to the Calvin College microWulf and run at about 27 GFLOPS (LINPACK) due to the slightly faster 2.1 GHz AMD 4000+ CPUs compared to the microWulf AMD 3800+ 2.0 GHz units.  I do not seek the ultimate (GFLOP/$) minimum, just an inexpensive system to run GAMESS for molecular calculations and a chance to learn about parallel software late in my career.
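
As a rough check of the "about 27 GFLOPS" estimate above, here is a minimal sketch. It assumes, and this is only an assumption, that LINPACK throughput scales linearly with core clock and that everything else matches the microWulf. The function name `scaled_gflops` is made up for illustration.

```python
# Back-of-the-envelope clock scaling; the linear-scaling model is an assumption.
MICROWULF_GFLOPS = 26.0   # "over 26 GFLOPS" quoted for the microWulf
MICROWULF_GHZ = 2.0       # AMD X2 3800+ clock, from the message
NEW_GHZ = 2.1             # AMD X2 4000+ clock, from the message


def scaled_gflops(base_gflops, base_ghz, new_ghz):
    """Scale a measured rate by the ratio of core clocks."""
    return base_gflops * new_ghz / base_ghz


estimate = scaled_gflops(MICROWULF_GFLOPS, MICROWULF_GHZ, NEW_GHZ)
print(f"{estimate:.1f} GFLOPS")
```

Real LINPACK numbers also depend on memory and interconnect, so this is a ceiling-style estimate rather than a prediction.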
 
So, can "Open MPI" handle different CPU/core frequencies and different memory bus frequencies over gigE?  I note that the writer of GAMESS (Mike Schmidt) recommends TCP/IP for GAMESS rather than Open MPI, and GAMESS is the overwhelming goal for my use, but using UBUNTU I would like to be able to access the Internet as well from the Master box.  While I have your attention, could you comment on whether Open MPI will run under LINSPIRE?  I have messed around with LINSPIRE more than UBUNTU (although I have both source disks) and I like LINSPIRE because it looks more like WINDOWS.
 
Summary:
 
1. Can Open MPI handle different clock speeds across several node/cores?
 
2. Can Open MPI handle different memory bus clock speeds across several node/cores?
 
3. Why not LINSPIRE instead of UBUNTU?
 
Sorry about the dumb questions but I seem to recall that the Duke Beowulf managed to run using many different X86 PCs so what I want to do should be possible, but is Open MPI the best choice or what else?
 
Don Shillady
Emeritus Professor of Chemistry, VCU
Ashland VA (working at home)
 

Date: Wed, 28 Nov 2007 10:37:45 -0500
From: peter.st.john at gmail.com
To: charliep at cs.earlham.edu
Subject: Re: [Beowulf] Really efficient MPIs??
CC: beowulf at beowulf.org

For the sake of others as easily confused as myself, I note (now, thanks!) that OpenMP and OpenMPI are two different things:
OpenMP (an alternative to the MPI method) is http://en.wikipedia.org/wiki/OpenMP
OpenMPI (an implementation of MPI) is http://en.wikipedia.org/wiki/OpenMPI
Cool.
Peter
On Nov 28, 2007 8:49 AM, Charlie Peck <charliep at cs.earlham.edu> wrote:

On Nov 28, 2007, at 8:04 AM, Jeffrey B. Layton wrote:
> If you don't want to pay money for an MPI, then go with Open-MPI.
> It too can run on various networks without recompiling. Plus it's
> open-source.

If you are using gigabit ethernet, Open-MPI is noticeably less
efficient than LAM-MPI over that fabric.  I suspect at some point in the
future gige will catch up, but for now my (limited) understanding is that
the Open-MPI folks are focusing their time on higher bandwidth/lower
latency fabrics than gige.

charlie


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

From hahn at mcmaster.ca  Sat Dec  1 17:54:21 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat, 1 Dec 2007 20:54:21 -0500 (EST)
Subject: [Beowulf] Really efficient MPIs??
In-Reply-To: <BAY115-W18FBD5C06AC4856F7D7B2CB4720@phx.gbl>
References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com>
	<474D6779.5010000@charter.net>
	<FF0718A6-29AD-44C4-B2D5-DDE023D20B47@cs.earlham.edu>
	<e4d4fd070711280737s28316b9amfd0993610b6925cf@mail.gmail.com>
	<BAY115-W18FBD5C06AC4856F7D7B2CB4720@phx.gbl>
Message-ID: <Pine.LNX.4.64.0712012044440.11256@coffee.psychology.mcmaster.ca>

>with "Theoretical Chemists ......".  Was that TCP/IP?  Anyway I know it is
>possible to link CPU/cores with different speeds and different memory-bus
>speeds so my question is whether "Open MPI" can handle this situation?

sure.  nothing about MPI assumes that nodes are homogenous in speed,
just that they can somehow get packets from sender to receiver.

>Specifically, suppose I set up:

the cluster you describe is basically a normal beowulf.

> 1. Can Open MPI handle different clock speeds across several node/cores?

of course.

> 2. Can Open MPI handle different memory bus clock speeds across several node/cores?

of course.  MPI itself doesn't know or care about what cpu/mem are in the
nodes, though individual applications may work best with homogenous nodes.
(consider if the app has a periodic global collective operation such as 
broadcast or reduce.  the work done between these collectives should ideally
take the same amount of elapsed/wallclock time, or else some nodes will wind
up waiting for the slower nodes.)
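
The waiting effect described above can be sketched with a toy model. This is only an illustration: it treats clock rate as throughput (an assumption), uses the node speeds from the original question, and the helper names `step_times` and `idle_fraction` are hypothetical.

```python
# Toy model: time per step = work assigned / node speed.
# Treating clock rate as throughput is an assumption made for illustration.
speeds = [3.0, 2.1, 2.1, 2.1]   # GHz: one fast master, three slower nodes
total_work = 100.0              # arbitrary work units between two collectives


def step_times(shares, speeds):
    """Wall-clock time each node needs before it reaches the collective."""
    return [w / s for w, s in zip(shares, speeds)]


def idle_fraction(times):
    """Fraction of total node-time spent waiting for the slowest node."""
    slowest = max(times)
    return sum(slowest - t for t in times) / (slowest * len(times))


# Equal shares: the fast node finishes early and waits at the collective.
equal = [total_work / len(speeds)] * len(speeds)
# Shares proportional to speed: every node finishes at about the same time.
weighted = [total_work * s / sum(speeds) for s in speeds]

equal_idle = idle_fraction(step_times(equal, speeds))
weighted_idle = idle_fraction(step_times(weighted, speeds))
print(f"equal split idle: {equal_idle:.1%}, weighted split idle: {weighted_idle:.1%}")
```

In this model the equal split wastes 7.5% of the node-time, while the speed-weighted split wastes essentially none. Whether an application can rebalance its work this way is up to the application, not to MPI.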

> 3. Why not LINSPIRE instead of UBUNTU?

it doesn't matter.  an MPI application just wants basic OS functionality
like a network stack, process scheduler, memory manager.  distros differ 
only in desktop and config features - they all use pretty much the same 
kernel, so from MPI's perspective are nearly equivalent.

> possible, but is Open MPI the best choice or what else?

the MPI implementation won't have any effect on how well your application
tolerates heterogeneity of nodes.

regards, mark hahn.


From gdjacobs at gmail.com  Sat Dec  1 19:23:04 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sat, 01 Dec 2007 21:23:04 -0600
Subject: [Beowulf] Really efficient MPIs??
In-Reply-To: <Pine.LNX.4.64.0712012044440.11256@coffee.psychology.mcmaster.ca>
References: <428810f20711272131r5cc3bb08w2431083b9cd85b97@mail.gmail.com>	<474D6779.5010000@charter.net>	<FF0718A6-29AD-44C4-B2D5-DDE023D20B47@cs.earlham.edu>	<e4d4fd070711280737s28316b9amfd0993610b6925cf@mail.gmail.com>	<BAY115-W18FBD5C06AC4856F7D7B2CB4720@phx.gbl>
	<Pine.LNX.4.64.0712012044440.11256@coffee.psychology.mcmaster.ca>
Message-ID: <47522518.8010206@gmail.com>

Mark Hahn wrote:
>> possible, but is Open MPI the best choice or what else?
> 
> the MPI implementation won't have any effect on how well your application
> tolerates heterogeneity of nodes.

True within this context of a single binary executable image. Of course,
few run totally heterogeneous nodes anymore. What is the new hybrid
K10-Cell computer going to be using for interconnect?

-- 
Geoffrey D. Jacobs


From lindahl at pbm.com  Sun Dec  2 15:17:00 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Sun, 2 Dec 2007 15:17:00 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<20071201231531.GA4736@bx9.net>
	<320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com>
Message-ID: <20071202231700.GA22575@bx9.net>

On Sun, Dec 02, 2007 at 12:05:50PM +0200, Eray Ozkural wrote:

> I wouldn't be so sure!
> 
> Sounds like a great match for clusters of multi-core architectures.

People said the same thing when SMP became common on the low end.

> And obviously many papers have been written about programming clusters
> of SMP's so what exactly is your point here?

The hybrid MPI/OpenMP emperor has no clothes.

-- greg


From nelsoneci at gmail.com  Sat Dec  1 19:38:34 2007
From: nelsoneci at gmail.com (Nelson Castillo)
Date: Sat, 1 Dec 2007 22:38:34 -0500
Subject: [Beowulf] Recommended paper for parallel sorting?
Message-ID: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>

Hi.

Could you please recommend a paper for reading? I'd like to know about parallel
sorting algorithms for this architecture.

Regards,
Nelson.-

-- 
http://arhuaco.org


From examachine at gmail.com  Sun Dec  2 02:05:50 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Sun, 2 Dec 2007 12:05:50 +0200
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <20071201231531.GA4736@bx9.net>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<20071201231531.GA4736@bx9.net>
Message-ID: <320e992a0712020205r32f5eb4fld4d9272b1b958cc8@mail.gmail.com>

On Dec 2, 2007 1:15 AM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Thu, Nov 29, 2007 at 11:26:45AM -0800, Tom Elken wrote:
>
> > The SPEC HPG (High Performance Group) is having discussions about using
> > a hybrid of MPI and thread-level parallelism on the SPEC MPI2007
> > benchmark suite.
>
> I'd find it useful to debunk the notion that hybrid programming
> actually gives a speedup. That's probably not what HPG has in mind,
> but it'd be useful to the community.

I wouldn't be so sure!

Sounds like a great match for clusters of multi-core architectures.
And obviously many papers have been written about programming clusters
of SMP's so what exactly is your point here?

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://www.cs.bilkent.edu.tr/~erayo  Malfunct: http://myspace.com/malfunct
ai-philosophy: http://groups.yahoo.com/group/ai-philosophy


From toon.knapen at gmail.com  Sun Dec  2 06:51:53 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Sun, 02 Dec 2007 15:51:53 +0100
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries
	with MPI
In-Reply-To: <Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<474FEF18.6020308@obs.unige.ch>
	<Pine.LNX.4.64.0711300936050.11868@coffee.psychology.mcmaster.ca>
	<d5bdff000711300759r17f381fap25e34bc1821f73ff@mail.gmail.com>
	<Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>
Message-ID: <4752C689.5030102@gmail.com>

Mark Hahn wrote:
>> IMHO the hybrid approach (MPI+threads) is interesting in case every
>> MPI-process has lots of local data.
> 
> yes.  but does this happen a lot?  the appealing case would be threads 
> that make lots of heavy use of some large data, _but_
> without needing synchronization/locking.  once you need locking
> among the threads, message passing starts to catch up.

Direct solvers (for Finite Elements for instance) need a lot of data.
Additionally, distributing the matrix generates interfaces (between the
different submatrices) which are hard to solve. In such a situation, one
tries to minimize the number of interfaces (by having one submatrix per
MPI-process) and speed up the solving of each submatrix using threads.

Finance is another example. Financial applications need to evaluate a 
large number of open positions based on the simulated, current or past 
market-data. There are many dependencies between all the different data,
which makes it hard to decompose the data into largely independent
chunks.


> 
>> latter is simpler because it only requires MPI-parallelism but if the 
>> code
>> is memory-bound and every mpi-process has much of the same data, it 
>> will be
>> better to share this common data with all processes on the same cpu 
>> and thus
>> use threads intra-node.
> 
> what kind of applications behave like that?  I agree that if your MPI 
> app is keeping huge amounts of (static) data replicated in each rank,
> you should rethink your design.
> 


See above.


From Hakon.Bugge at scali.com  Mon Dec  3 01:11:51 2007
From: Hakon.Bugge at scali.com (Håkon Bugge)
Date: Mon, 03 Dec 2007 10:11:51 +0100
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <200712022000.lB2K08cL014118@bluewest.scyld.com>
References: <200712022000.lB2K08cL014118@bluewest.scyld.com>
Message-ID: <20071203091158.9BBED35AD18@mail.scali.no>

At Sat, 1 Dec 2007 15:15:31,Greg Lindahl <lindahl at pbm.com> wrote:
> > The SPEC HPG (High Performance Group) is having discussions about using
> > a hybrid of MPI and thread-level parallelism on the SPEC MPI2007
> > benchmark suite.
>
>I'd find it useful to debunk the notion that hybrid programming
>actually gives a speedup. That's probably not what HPG has in mind,
>but it'd be useful to the community.
>
>-- greg

I have a slightly different view. Hybrid 
programming is used for performance reasons, but 
only in cases where parallelization (to the same 
level) is impossible/impractical using the pure 
MPI mode, or the parallelization yields low 
efficiency. So, if you're able to achieve your 
performance with MPI, you probably will. But 
there are cases where you cannot; a) the 
"decomposition parallel efficiency" is not good 
enough or b) the processes need a huge (shared) table.

As to a), in the past I worked with a synthetic 
aperture radar application where I ended up with 
the hybrid model. The problem could only be 
decomposed in one dimension, and each process had 
33% overhead. Obviously, the hybrid model was a good choice in this case.

As to b), it might be more economic to size the
memory on each node to the size of a single
table and share it through shared memory. It is
of course possible to share it from several MPI 
processes as well, but implementors might find 
their reason for using a hybrid model here.

Relevance to the SPEC MPI2007? To my knowledge, 
the applications here do not have any of the 
constraints above, so I would be severely 
surprised if anyone uses the hybrid model on them.


Håkon


From rgb at phy.duke.edu  Mon Dec  3 06:09:10 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 3 Dec 2007 09:09:10 -0500 (EST)
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>

On Sat, 1 Dec 2007, Nelson Castillo wrote:

> Hi.
>
> Could you please recommend a paper for reading? I'd like to know about parallel
> sorting algorithms for this architecture.

You might check out Ian Foster's free online book on parallel
algorithms.  It is worth buying if you're going to be doing a lot of
parallel programming.  Or there are two or three other decent textbooks
on parallel programming at the algorithm level.  I don't recall offhand
if Foster covers sorting, but you can easily find out for free.

Remember, GIYF here -- just enter search strings like "Foster Parallel
Programming" to find his book, "Parallel Sorting Algorithms" or the like
to see if there is anything out there on the web.

    rgb

>
> Regards,
> Nelson.-
>
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From peter.st.john at gmail.com  Mon Dec  3 07:27:49 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Mon, 3 Dec 2007 10:27:49 -0500
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
	<Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
Message-ID: <e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>

(re Ian Foster, *Designing and Building Parallel Programs *online as below
or Addison Wesley):

I did that search and right at the top was this link, which looks like homebase
for the original material:
http://www-unix.mcs.anl.gov/dbpp/
Very cool, thanks RGB for what looks like a toothsome book.
Peter


From rgb at phy.duke.edu  Mon Dec  3 10:38:12 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 3 Dec 2007 13:38:12 -0500 (EST)
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com> 
	<Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
	<e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712031337340.11771@lilith.rgb.private.net>

On Mon, 3 Dec 2007, Peter St. John wrote:

> (re Ian Foster, *Designing and Building Parallel Programs *online as below
> or Addison Wesley):
>
> I did that search and right the top was this link, which looks like homebase
> for the original material:
> http://www-unix.mcs.anl.gov/dbpp/
> Very cool, thanks RGB for what looks like toothsome book.

I went ahead and bought a paper copy, but it is nice to be able to
access the material from a workstation because I don't carry the copy
around with me all the time...;-)

   rgb


-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From lindahl at pbm.com  Mon Dec  3 12:55:45 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 3 Dec 2007 12:55:45 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <20071203091158.9BBED35AD18@mail.scali.no>
References: <200712022000.lB2K08cL014118@bluewest.scyld.com>
	<20071203091158.9BBED35AD18@mail.scali.no>
Message-ID: <20071203205545.GA11220@bx9.net>

On Mon, Dec 03, 2007 at 10:11:51AM +0100, Håkon Bugge wrote:

> But 
> there are cases where you cannot; a) the 
> "decomposition parallel efficiency" is not good 
> enough or b) the processes need a huge (shared) table.

You can accomplish (b) using a mmaped file, which is much easier than
hybrid programming. I agree that (a) is theoretically useful, but I
have only once seen a benchmark situation where (a) was the case. I
have seen several situations where a hybrid code had a 1D MPI
decomposition and "needed" OpenMP for more scaling, but could have
been a pure MPI 2D or 3D code, with less complexity than the hybrid
code.
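
As a minimal sketch of the mmaped-file idea for (b): the table is written to a file once, and each process on the node maps it read-only, so the OS page cache holds a single copy. The file layout (one float64 per entry) and the helper names `write_table`, `open_table` and `lookup` are assumptions for illustration, and the two mappings below merely stand in for two MPI ranks on one node.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct("<d")  # assumed layout: one little-endian float64 per entry


def write_table(path, values):
    """Write the table once; every process on the node maps this same file."""
    with open(path, "wb") as f:
        for v in values:
            f.write(ENTRY.pack(v))


def open_table(path):
    """Map the whole file read-only; mappings of one file share page-cache pages."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, index):
    """Read entry `index` directly from the mapping, without loading the file."""
    return ENTRY.unpack_from(table, index * ENTRY.size)[0]


fd, table_path = tempfile.mkstemp()
os.close(fd)
write_table(table_path, [i * 0.5 for i in range(1000)])

# Two independent mappings stand in for two MPI ranks on the same node.
rank0 = open_table(table_path)
rank1 = open_table(table_path)
value0 = lookup(rank0, 10)
value1 = lookup(rank1, 999)
print(value0, value1)

rank0.close()
rank1.close()
os.remove(table_path)
```

In a real job each rank would call something like `open_table` on a file visible on its own node; no threading is involved, which is the simplification being argued for.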

-- greg


From richard.walsh at comcast.net  Mon Dec  3 13:47:41 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 03 Dec 2007 21:47:41 +0000
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
Message-ID: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net>


-------------- Original message -------------- 
From: Håkon Bugge <Hakon.Bugge at scali.com> 

> I have a slightly different view. Hybrid 
> programming is used for performance reasons, but 
> only in cases where parallelization (to the same 
> level) is impossible/impractical using the pure 
> MPI mode, or the parallelization yields low 
> efficiency. So, if you're able to achieve your 
> performance with MPI, you probably will. But 
> there are cases where you cannot; a) the 
> "decomposition parallel efficiency" is not good 
> enough or b) the processes need a huge (shared) table. 

I think that what is being said here is that applications may be
decomposable in some number of dimensions, but not in all.  If the
performance benefits of locally managing the "unruly" dimensions are
great enough, then a hybrid program may be worth the trouble.  I think
that the number of real-world apps in this class is perhaps not large,
or there would be more hybrid code.

Another perhaps relevant alternative, which will at some point be able
to take on both the partitionable and unpartitionable extreme cases and
everything in between, is the PGAS language extensions (UPC and CAF).
They are not yet at distributed-memory performance parity with
well-coded MPI, but they have, arguably, an intrinsic programmability
advantage in LOC and in data structure coverage.  AMR codes tracking
shedding vortices are inherently non-partitionable (or in need of
regular repartitioning).  Managing them in either MPI or OpenMP in a
distributed memory environment is a chore.

And if you believe that ... ;-) ... then there is of course the "magic"
of many-threaded latency hiding (can't say I am a true believer for the
data intensive OZ of HPC).  Some would have you believe that a 32
thread, 8 core Niagara 2 (or perhaps a future design at some higher
active thread to core ratio) can hide all your data latency events
behind its active thread horizon.

Maybe the key is to combine PGAS with many-threads ... mmm ... anyone
doing this?
;-)
rbw
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

Phone #: 612-382-4620

From lindahl at pbm.com  Mon Dec  3 13:57:53 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 3 Dec 2007 13:57:53 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net>
References: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <20071203215752.GB6727@bx9.net>

On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh at comcast.net wrote:

> I think that the number of real-world apps in this class is perhaps
> not large, or there would be more hybrid code.

Ah, but you've missed the random element here: People start writing
hybrid code before they have any proof that it helps them. Or they
don't write it at all because they know it's complicated. Either way,
you can't assume cause and effect of "hybrid helps me" and "my code
is hybrid".

-- greg


From gerry.creager at tamu.edu  Mon Dec  3 14:18:46 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Mon, 03 Dec 2007 16:18:46 -0600
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded	libraries
	with MPI
In-Reply-To: <20071203215752.GB6727@bx9.net>
References: <120320072147.28375.4754797D000678BC00006ED72200748184089C040E99D20B9D0E080C079D@comcast.net>
	<20071203215752.GB6727@bx9.net>
Message-ID: <475480C6.1070309@tamu.edu>

Greg Lindahl wrote:
> On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh at comcast.net wrote:
> 
>> I think that the number of real-world apps in this class is perhaps
>> not large, or there would be more hybrid code.
> 
> Ah, but you've missed the random element here: People start writing
> hybrid code before they have any proof that it helps them. Or they
> don't write it at all because they know it's complicated. Either way,
> you can't assume cause and effect of "hybrid helps me" and "my code
> is hybrid".

Or their code turns out to be 'hybrid' because they didn't really know 
what they were writing...

gerry
-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


From richard.walsh at comcast.net  Mon Dec  3 14:29:56 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 03 Dec 2007 22:29:56 +0000
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
Message-ID: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>


-------------- Original message -------------- 
From: Greg Lindahl <lindahl at pbm.com> 

> On Mon, Dec 03, 2007 at 09:47:41PM +0000, richard.walsh at comcast.net wrote: 
> 
> > I think that the number of real-world apps in this class is perhaps 
> > not large, or there would be more hybrid code. 
> 
> Ah, but you've missed the random element here: People start writing 
> hybrid code before they have any proof that it helps them. Or they 
> don't write it at all because they know it's complicated. Either way, 
> you can't assume cause and effect of "hybrid helps me" and "my code 
> is hybrid". 

True, enough ... one must consider both the kinetic and thermodynamic
requirements for existence, but I was thinking that the system was
perhaps at equilibrium by now.  Still, it was careless of me to use
non-existence to argue for either the absence of cause or presence of
impossibility.  I am still waiting to get a straight flush in 5-card draw.
;-)
rbw

From lindahl at pbm.com  Mon Dec  3 15:23:44 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 3 Dec 2007 15:23:44 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <20071203232343.GA27291@bx9.net>

On Mon, Dec 03, 2007 at 10:29:56PM +0000, richard.walsh at comcast.net wrote:

> True, enough ... one must consider both the kinetic and
> thermodynamic requirements for existence, but I was thinking that the
> system was perhaps at equilibrium by now.

No, people keep on producing hybrid codes and finding that they aren't
any faster than pure MPI. It's an amazing waste of money.

> I am still waiting to get a straight flush in 5-card draw.

Riiiight.

-- greg


From gmichal at uow.edu.au  Mon Dec  3 15:20:26 2007
From: gmichal at uow.edu.au (Guillaume MICHAL)
Date: Tue, 04 Dec 2007 10:20:26 +1100
Subject: [Beowulf] A cluster for material simulation
Message-ID: <op.t2r8gcybnfpofi@guizmo>

Good morning all,
Our faculty is thinking about a cluster for material simulations. At the
moment we would like to use FEM, MD, MPM and maybe in some cases a
multiscale FEM/MD or MPM/MD. We will start with a very small cluster
of around 5 nodes to become familiar with this kind of system and then
extend it to around 20 nodes. Task sizes could vary between 1G and, let's
say, 10G. FEM will use Abaqus or CODE_ASTER. I don't really know the names
of the software packages for MPM and MD.
I did some research and reading (by the way, Building Clustered Linux
Systems by Robert W. Lucke is a bit scary!) and defined 2 kinds of systems.
I'd like your opinion on these.

Both systems use Gigabit ethernet, 2GB of memory per CPU, and 80GB of SATA
hard drive per node.
Dekstop motherboard based system:
1 asus P5E WS Professional motherboard, 1066FSB, DDR2 800 NON ECC  
unbuffered, 2GigE ports, 1 Intel Q6600 CPU @2.4GHz 8MB L2 cache

Server motherboard based system:
Supermicro SuperServer 6015C-MTB, 1333/1066FSB, DDR2 667 ECC FB-DIMM, 2 GigE
ports, 2 Intel Xeon 5410 CPUs @2.3GHz, 12MB L2 cache

It might seem I'm comparing apples and oranges, but theoretical peak
performance is equivalent and in terms of cost/CPU there is not a huge
difference (150 to 250 A$). Also, the server solution uses half as many
nodes, which could be interesting in terms of space, cables, switch... For
recycling, the desktop option seems better, unless we use the servers for
some kind of graphics cluster in the future.

Now the real questions:
	1- If I understood properly, FEM is kind of memory bound, so which is
better: DDR2 800/1066FSB/8MB L2 cache or DDR2 667/1333FSB/12MB L2 cache?
(I'm kind of a newbie to these things!)
	2- Which one seems better in terms of performance and reliability?
	3- Do I need a distinct network for NFS sharing (that's why I wanted 2
GigE ports per node), or can I put the shared data on the master node (quote
from R. W. Lucke's book: "This is bad, bad, bad")?
	4- There is also the Supermicro SuperServer 6015TW-TB with two dual
socket motherboards in a 1U form factor (note: it's just two nodes put in
one box, no interconnections whatsoever apart from the PSU) with roughly
the same price per CPU compared to the other Supermicro solution. It could be
interesting for an even more compact system. Do you have any knowledge
about this system?
	5- Anything I didn't think of that might be worth checking, such as "Oh!
you need a fast hard drive as I/O is critical... ;-)"


Thank you for your advice!
Guillaume Michal




From Michael.Frese at NumerEx.com  Mon Dec  3 16:39:45 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Mon, 03 Dec 2007 17:39:45 -0700
Subject: [Beowulf] NFS Read Errors
Message-ID: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>

We were having trouble restarting from our homegrown parallel 
magnetohydrodynamic code's checkpoint files.  The files could be 
read, but funny things happened in the run afterward.  Eventually we 
figured out that the restarted parallel run differed from the serial 
restarted run from the same checkpoint.

After much gnashing of teeth and rending of apparel, we found that 
the checkpoint files were being read incorrectly across NFS.  That 
let us simplify our search for the problem.  We first found that the 
local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed 
version of the file was different from that produced on the original 
file.  What was interesting was that the copy either took forEVER -- 
like 10 minutes or 20 minutes for a 1 GB file -- when the final 
result was bad or it took about a minute when the file was 
perfect.  I'm guessing that whatever error checking gets done on 
the packets was rejecting so many that it finally got a bad packet it 
couldn't tell was bad.

When we found that doing the md5 digest on a remote file produced a 
different result than doing it on the processor on which the disk was 
mounted, our tests got simpler.  And shorter, still, after we found 
that we could get fairly frequent failures with 10 MB files or 
smaller.  Clearly we had an NFS failure, probably associated with hardware.

This was all between two specific nodes of our small cluster.  [Old 
hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual 
(Tyan...) chip motherboards, both running Red Hat 9 with the 
2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; 
and a NetGear GS108 8 port Copper 1 Gb/s switch.  The single 
processor motherboards have 32-bit PCI slots so their network speeds 
are limited to 300 kbps as shown by netpipe.  All of the LEDs at the 
ends of the cables show 1000Mb connections.]

Then we started checking other pairs.  Some were fine.  Some were bad 
in the same way.  So we replaced the switch, changing to a 16 port 
NetGear GS216.  That seemed to cure most of the problem.  But we 
continued to have problems copying a file on one particular single 
processor machine from the others.

That's where we are now.  The md5 digest run on that machine 
consistently shows the same result, whereas the digest for that file 
produced on a remote machine will be almost stochastic.  In some 
cases it will eventually settle in to the right answer, and then the 
speed goes WAY up.  I suppose that happens because the file request 
can be served from the local machine's cache.  But why doesn't it 
happen after it received bad blocks?

Most, if not all of the original network cards in those machines went 
bad and have been replaced in the last few years, so I decided to try 
a brand new GA311.  No joy there.  It still gives out the wrong 
info.  I guess the motherboard PCI bus controller is hinky, but I'm 
far from sure.

We are in the process of upgrading and thus replacing all the 
machines we have of that configuration due to space limitations and 
their age, but I'm still curious what the problem could be.

Suggestions?  Comments?


Mike


From landman at scalableinformatics.com  Mon Dec  3 17:21:37 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 03 Dec 2007 20:21:37 -0500
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
Message-ID: <4754ABA1.9030105@scalableinformatics.com>

Hi Michael:

Michael H. Frese wrote:
> We were having trouble restarting from our homegrown parallel 
> magnetohydrodynamic code's checkpoint files.  The files could be read, 
> but funny things happened in the run afterward.  Eventually we figured 
> out that the restarted parallel run differed from the serial restarted 
> run from the same checkpoint.
> 
> After much gnashing of teeth and rending of apparel, we found that the 
> checkpoint files were being read incorrectly across NFS.  That let us 
> simplify our search for the problem.  We first found that the local md5 
> digest [openssl dgst -md5 (file...)] on an NFS cp'ed version of the file 

	md5sum filename

does the same thing with a slightly simpler syntax.  There is mounting 
evidence that you should use sha1sum rather than md5sum.

> was different from that produced on the original file.  What was 
> interesting was that the copy either took forEVER -- like 10 minutes or 
> 20 minutes for a 1 GB file -- when the final result was bad or it took 
> about a minute when the file was perfect.  I'm guessing that whatever 
> error checking that gets done on the packets was rejecting so many it 
> finally got a bad packet it couldn't tell was bad.

Sounds a great deal like a bad disk/disk system or something mucking 
with your connection to the data.  1 GB file, even at 1 MB/s is 1000 
seconds, or 16 minutes.  If you have a disk which keeps timing out, or 
has bad blocks, and keeps retrying, well, stuff like this can happen, 
especially on old kernels (and old hardware).

Could also be a RAM error.

> 
> When we found that doing the md5 digest on a remote file produced a 
> different result than doing it on the processor on which the disk was 
> mounted, our tests got simpler.  And shorter, still, after we found that 
> we could get fairly frequent failures with 10 MB files or smaller.  
> Clearly we had an NFS failure, probably associated with hardware.

Yes.  I would venture a guess that you are seeing *lots* of errors in 
your /var/log/syslog or /var/log/messages files.


> This was all between two specific nodes of our small cluster.  [Old 
> hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual 
> (Tyan...) chip motherboards both running Redhat 9 one with the 2.4.20-8 
> kernels, though one is the smp version; NetGear GA311 NICs; and a 

Owie...

> NetGear GS108 8 port Copper 1 GB/s switch.  The single processor 
> motherboards have 32-bit PCI slots so their network speeds are limited 
> to 300 kbps as shown by netpipe.  All of the LEDs at the ends of the 
> cables show 1000Mb connections.]

300 kbps?  That's 300 kilobits per second (abbreviations are *very* 
important to get right; kB/s is not the same as kb/s), or about 
37.5 kB/s, which is about the average speed of various DSL lines.

I hope you mean 30 MB/s (or 240 Mb/s).
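
To make the bits-versus-bytes arithmetic above concrete, here is a minimal
Python sketch.  The helper names are made up for illustration, and it
assumes decimal (SI) prefixes and 8 bits per byte.

```python
# Illustrative helpers (hypothetical names); decimal SI prefixes assumed.
BITS_PER_BYTE = 8

def kilobits_to_kilobytes_per_s(kilobits_per_second):
    """Convert kb/s (kilobits per second) to kB/s (kilobytes per second)."""
    return kilobits_per_second / BITS_PER_BYTE

def megabytes_to_megabits_per_s(megabytes_per_second):
    """Convert MB/s (megabytes per second) to Mb/s (megabits per second)."""
    return megabytes_per_second * BITS_PER_BYTE

def transfer_seconds(size_megabytes, rate_megabytes_per_second):
    """Time to move a file of the given size at a steady rate."""
    return size_megabytes / rate_megabytes_per_second

print(kilobits_to_kilobytes_per_s(300))   # 37.5 kB/s
print(megabytes_to_megabits_per_s(30))    # 240 Mb/s
print(transfer_seconds(1000, 1) / 60)     # a 1 GB file at 1 MB/s, in minutes
```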

> 
> Then we started checking other pairs.  Some were fine.  Some were bad in 
> the same way.  So we replaced the switch, changing to a 16 port NetGear 
> GS216.  That seemed to cure most of the problem.  But we continued to 

We have seen bad switches a few times.

> have problems copying a file on one particular single processor machine 
> from the others.
> 
> That's where we are now.  The md5 digest run on that machine 
> consistently shows the same result, whereas the digest for that file 
> produced on a remote machine will be almost stochastic.  In some cases 
> it will eventually settle in to the right answer, and then the speed 
> goes WAY up.  I suppose that happens because the file request can be 
> served from the local machine's cache.  But why doesn't it happen after 
> it received bad blocks?

I am guessing you are using TCP NFS mounts as well?  TCP forces retries 
in the event of bad packets.  UDP doesn't force this, but the NFS 
protocol will try.  RAM errors, bad cables, burnt switches, and machines 
with interrupt problems (old machines often shared interrupts without 
being able to do a very good job of it) can all produce symptoms like this.

> Most, if not all of the original network cards in those machines went 
> bad and have been replaced in the last few years, so I decided to try a 
> brand new GA311.  No joy there.  It still gives out the wrong info.  I 
> guess the motherboard PCI bus controller is hinky, but I'm far from sure.

Did you try a new cable?  Had a few cables go bad, usually they are 
marginal to begin with.

> 
> We are in the process of upgrading and thus replacing all the machines 
> we have of that configuration due to space limitations and their age, 
> but I'm still curious what the problem could be.

There are quite a few possibilities unfortunately.  Unless you plan to 
use these existing machines for quite a while longer, it might be less 
painful to shut off the malfunctioning node.

> 
> Suggestions?  Comments?

2.4.20?  Athlons?  I would say a serious hardware/OS refresh is in order :)


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From hahn at mcmaster.ca  Mon Dec  3 22:15:00 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 4 Dec 2007 01:15:00 -0500 (EST)
Subject: [Beowulf] Re: CSharifi Next generation of HPC
In-Reply-To: <!&!AAAAAAAAAAAYAAAAAAAAAIuzoW6S3BlMjcsQQUsueybCgAAAEAAAAGqXv4ZTAVFNmgYhAXsY3wABAAAAAA==@GMAIL.COM>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<474FEF18.6020308@obs.unige.ch>
	<Pine.LNX.4.64.0711300936050.11868@coffee.psychology.mcmaster.ca>
	<d5bdff000711300759r17f381fap25e34bc1821f73ff@mail.gmail.com>
	<Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>
	<4752C689.5030102@gmail.com>
	<!&!AAAAAAAAAAAYAAAAAAAAAIuzoW6S3BlMjcsQQUsueybCgAAAEAAAAGqXv4ZTAVFNmgYhAXsY3wABAAAAAA==@GMAIL.COM>
Message-ID: <Pine.LNX.4.64.0712040050520.4169@coffee.psychology.mcmaster.ca>

> C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level

OK, how about providing some meaty content?  google shows me that 
you've put this fairly content-light PR on several groups and websites.

> as Usability.  Although the latter belief was hard to realize, a sample

why was it hard?  there have been a fair number of kernel-based
dist-OS approaches (MOSIX comes to mind, but also scyld, and a host of older
academic systems.)

> byproduct called DIPC was built purely based on this thesis and openly
> announced to the Linux community worldwide in 1993.  This was admired for
> being able to provide necessary supports for distributed communication at
> the Kernel Level of Linux for the first time in the world, and for providing

page-based distributed shared memory has been done many, many times,
and operates in a very easy-to-understand manner (like a cache with 4KB 
rather than 64B lines.)  can you quantify the advantage to managing
the DSM in the kernel?  I'm sure you're aware that "playing MMU games"
is not highly regarded in many circles because of its slowness - 
have you figured out a way around that?

regards, mark hahn.


From hahn at mcmaster.ca  Mon Dec  3 22:31:05 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 4 Dec 2007 01:31:05 -0500 (EST)
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <4754ABA1.9030105@scalableinformatics.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
Message-ID: <Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>

> does the same thing with a slightly simpler syntax.  There is mounting 
> evidence that you should use sha1sum rather than md5sum.

for general checking, md5 is still fine (ie not security-related stuff).

> I am guessing you are using TCP NFS mounts as well?  TCP forces retries in 
> the event of bad packets.  UDP doesn't force this, but the NFS protocol will

UDP has a checksum as well, though it's only 16b.  then again, the TCP
checksum isn't all that strong for today's data rates either.

you should definitely examine /proc/net/dev on involved machines.
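
for anyone who hasn't looked at that file before, here is a minimal python
sketch of pulling the per-interface counters out of it.  it assumes the
usual linux layout (two header lines, then one line per interface with 8
receive fields followed by 8 transmit fields); the function names and the
sample text are made up for illustration, not taken from any machine in
this thread.

```python
def parse_net_dev(text):
    """Return {interface: counters} from the contents of /proc/net/dev."""
    stats = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        stats[name.strip()] = {
            "rx_bytes": fields[0], "rx_packets": fields[1], "rx_errs": fields[2],
            "tx_bytes": fields[8], "tx_packets": fields[9], "tx_errs": fields[10],
        }
    return stats

def average_packet_size(byte_count, packets):
    """Bytes per packet, or 0.0 when no packets have been counted."""
    return byte_count / packets if packets else 0.0

# Made-up sample in the usual layout (hypothetical numbers).
SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 1430000    1000    2    0    0     0          0         0   715000     500    0    0    0     0       0          0
"""

stats = parse_net_dev(SAMPLE)
for iface, c in stats.items():
    print(iface, "rx errs:", c["rx_errs"], "tx errs:", c["tx_errs"],
          "avg rx size:", average_packet_size(c["rx_bytes"], c["rx_packets"]))
```

on a real node you would read the text with open("/proc/net/dev").read()
instead of using the sample string.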

>> We are in the process of upgrading and thus replacing all the machines we 
>> have of that configuration due to space limitations and their age, but I'm 
>> still curious what the problem could be.

I would attempt to reduce the complexity of your testing.
for instance, can a node write and verify to its local disk
without problem?  can it stream data over tcp sockets (netcat 
or the like) without corruption or obvious problems reflected
in /proc/net/dev?  does ethtool tell you anything about the 
config of the nic?  comparing tcp vs udp NFS would be sensible
as well - varying the packet size, too.  switching client and/or 
server to a modern 2.6 kernel may be instructive.
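
as one way to shrink the test, here is a minimal python sketch that hashes
the same file several times and reports how many distinct digests it saw.
the function names are made up for illustration; hashlib is the standard
library module.  note that repeated reads on one client may be served from
its cache, so this is only a starting point, not a full diagnostic.

```python
import hashlib
import os
import tempfile

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoint files need not fit in RAM."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def repeated_digests(path, runs=5, algorithm="md5"):
    """Hash the file `runs` times and return the set of distinct digests."""
    return {file_digest(path, algorithm) for _ in range(runs)}

# Self-check on a local scratch file (hypothetical data).
payload = b"checkpoint data" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    scratch = tmp.name
distinct = repeated_digests(scratch, runs=3)
os.remove(scratch)
print(len(distinct), "distinct digest(s)")
```

a healthy path should give exactly one distinct digest; more than one on
an NFS-mounted file, with a stable result on the server's local disk,
points at the transport rather than the disk.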


From hahn at mcmaster.ca  Mon Dec  3 23:00:19 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 4 Dec 2007 02:00:19 -0500 (EST)
Subject: [Beowulf] A cluster for material simulation
In-Reply-To: <op.t2r8gcybnfpofi@guizmo>
References: <op.t2r8gcybnfpofi@guizmo>
Message-ID: <Pine.LNX.4.64.0712040134480.4169@coffee.psychology.mcmaster.ca>

> be familiar with this kind of system and then extend it to around 20 nodes. 
> Task sizes could vary from 1G to, let's say, 10G.

10G is quite modest, especially for 20 nodes (ram is cheap!).
are you sure you need a cluster?  a single nicely configured 
SMP system will handle 10G jobs quite neatly, and save considerable
effort.  of course, you can't really scale memory bandwidth
without going to a cluster, but I would guess that a 4-socket,
quad-core AMD system with all memory banks active would be tempting.

> I did some research and reading (by the way, Building Clustered Linux 
> Systems by Robert W. Lucke is a bit scary!)

well, it tries to cover a lot of ground.  it's really pretty simple
to get a basic cluster up and running.

> Both systems use Gigabit Ethernet, 2GB of memory per CPU, and an 80GB SATA 
> hard drive per node.
> Desktop motherboard based system:
> 1 Asus P5E WS Professional motherboard, 1066FSB, DDR2 800 non-ECC unbuffered, 
> 2 GigE ports, 1 Intel Q6600 CPU @2.4GHz, 8MB L2 cache
>
> Server motherboard based system:
> Supermicro SuperServer 6015C-MTB, 1333/1066FSB, DDR2 667 ECC FB-DIMM, 2 GigE 
> ports, 2 Intel Xeon 5410 CPUs @2.3GHz, 12MB L2 cache

the main thing here is that Intel has, for a long time, had a mediocre 
reputation for memory bandwidth.  I probably would not consider buying
anything older than the 45nm penryn-generation chips with 1333 or higher FSB.

> It might seem I'm comparing apples and oranges, but theoretical peak 
> performance is equivalent and in terms of cost/CPU there is not a huge 
> difference (150 to 250 A$). Also, the server solution uses half as many nodes, 
> which could be interesting in terms of space, cables, switch...

a 20-node cluster is half a rack, and not really complicated in cabling.
how's your cooling?  I'd probably worry about cooling before I worried 
about cabling...

> For recycling, 
> the desktop option seems better, unless we use the servers for some kind of 
> graphics cluster in the future.

perhaps.  my experience is that well-adapted cluster nodes are not 
good for desktops precisely because of those adaptations.

> 	1- If I understood properly, FEM is kind of memory bound, so which is 
> better: DDR2 800/1066FSB/8MB L2 cache or DDR2 667/1333FSB/12MB L2 cache? 
> (I'm kind of a newbie to these things!)

10G/20 nodes is 512M/node - divided among 4 cores is 128M/core, so I 
suspect the cache size isn't going to make much difference.  the FSB will
matter, though.

> 	2- Which one seems better in terms of performance and reliability?

faster FSB and ram will be noticeably better in performance.  I don't see
why there would be much difference in reliability, though.  the parts that
break are mainly fans.  server parts tend to offer nicer monitoring options
as well as the comfort of ECC (one less place for a heisenbug to live.)

> 	3- Do I need a distinct network for NFS sharing (that's why I wanted

certainly not.  my experience is that a single job doesn't tend to overlap
its MPI and NFS traffic much.  if you share a single node among multiple 
jobs, this could be an issue.

> 2 GigE ports per node), or can I put the shared data on the master node (quote 
> from R. W. Lucke's book: "This is bad, bad, bad")?

well, he's wrong.  sure, it's a hotspot, but it's also convenient, cheap
and effective.  going to a parallel filesystem will be a significant 
increase in complexity, though only you can know how badly you need the 
IO performance.  a shared fileserver can deliver higher bandwidth through
trunking or even a 10Gb link.  configuring a couple fileservers obviously
scales nicely at the expense of having a partitioned namespace.

> 	4- There is also the Supermicro SuperServer 6015TW-TB with two dual 
> socket motherboards in a 1U form factor (note: it's just two nodes put in one 
> box, no interconnections whatsoever apart from the PSU) with roughly the 
> same price per CPU compared to the other Supermicro solution. It could be 
> interesting for an even more compact system. Do you have any knowledge about 
> this system?

AFAIK, the only downside is a custom formfactor (chassis, boards, PSU).
but why is space such an issue for you?  a stack of 20 1U servers is not
all that big.  it's also a newer system design which, given low-volt cpus,
would be nicely heat-efficient.

> 	5- Anything I didn't think of that might be worth checking, such as 
> "Oh! you need a fast hard drive as I/O is critical... ;-)"

your IO will be over gigabit, so you don't need fast HD (current single
disks average about 70 MB/s).

even for a 20-node cluster, I'd seriously consider getting IPMI
or at least controllable power.


From rgb at phy.duke.edu  Tue Dec  4 04:53:10 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 4 Dec 2007 07:53:10 -0500 (EST)
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded libraries
	with MPI
In-Reply-To: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <Pine.LNX.4.64.0712040748050.11771@lilith.rgb.private.net>

On Mon, 3 Dec 2007, richard.walsh at comcast.net wrote:

> impossibility.  I am still waiting to get a straight flush in 5-card
> draw.

Are ye, now... interesting.

Sometime we'll have to wait together.  In the meantime, I find that if
you play the game with a wild card or eight it alters the odds
magnificently.  Why, you can get a straight flush and still lose the
game...;-)

    rgb

(Who's lurking but busy and who never, ever writes hybrid code.  Sounds
positively -- um -- sexual.  Or radioactive.  Involving white coated men
with large ears and thick glasses.  Not for me.)

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From larry.stewart at sicortex.com  Tue Dec  4 05:46:35 2007
From: larry.stewart at sicortex.com (Larry Stewart)
Date: Tue, 04 Dec 2007 08:46:35 -0500
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
Message-ID: <47555A3B.3080609@sicortex.com>

Nelson Castillo wrote:

>Hi.
>
>Could you please recommend a paper for reading? I'd like to know about parallel
>sorting algorithms for this architecture.
>
>Regards,
>Nelson.-
>
>  
>
I was looking into this a few months ago.  Here are some good papers I 
found:

http://citeseer.ist.psu.edu/393851.html  -- Communications Conscious 
Radix Sort

http://citeseer.ist.psu.edu/569483.html  -- Parallel Algorithms for 
Personalized Communication and Sorting With an Experimental Study

Martin Schmollinger: Improving Communication Sensitive Parallel Radix 
Sort for Unbalanced Data. Euro-Par 2003 
<http://www.informatik.uni-trier.de/%7Eley/db/conf/europar/europar2003.html#Schmollinger03>: 
885-893

Schmollinger's PhD dissertation has a good chapter on this as well.

-- 
-Larry / Sector IX


From Michael.Frese at NumerEx.com  Tue Dec  4 06:55:12 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Tue, 04 Dec 2007 07:55:12 -0700
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <4754ABA1.9030105@scalableinformatics.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
Message-ID: <6.2.5.6.2.20071204042727.04f6d1f0@NumerEx.com>

Joe,

Thanks for the suggestions.  Let me make some quick corrections.

At one point I knew about md5sum, but, as they say in Spanish, it 
forgot itself on me.

You are right about the data rate on the 32 bit PCI cards: I meant 300 Mbps.

As for the time for wire speed transmission of 1 GB, at 300 Mbps it 
is only about 30 seconds.  It turns out the biggest file I am dealing 
with is 400 MB, not 1 GB, and the local md5sum takes only 10 seconds, 
indicating that the disk-to-memory speed is at least 40 MBps, which 
is about what I expect from this hardware, and about equal to the 300 
Mbps ethernet speed on the single processor.  But the remote md5sum 
takes almost 6 minutes to get the wrong answer.
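
For the record, here is the arithmetic behind those figures as a minimal 
Python sketch.  The function names are made up for illustration, it 
assumes decimal units (1 GB = 1000 MB, 8 bits per byte), and it takes 
"almost 6 minutes" as roughly 360 seconds.

```python
def wire_seconds(size_mb, rate_mbps):
    """Seconds to send size_mb megabytes at rate_mbps megabits per second."""
    return size_mb * 8 / rate_mbps

def throughput_mb_per_s(size_mb, seconds):
    """Effective rate in megabytes per second."""
    return size_mb / seconds

print(round(wire_seconds(1000, 300), 1))        # 1 GB at 300 Mbps: 26.7 s
print(throughput_mb_per_s(400, 10))             # local md5sum: 40.0 MB/s
print(throughput_mb_per_s(400, 10) * 8)         # the same rate in Mbps: 320.0
print(round(throughput_mb_per_s(400, 360), 2))  # remote md5sum: 1.11 MB/s
```

So the remote read runs at a few percent of what either the disk or the 
wire can deliver.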

The problem with disk system or memory hypotheses is that the local 
md5sum is consistent, and fast.

There are no unexpected messages in /var/log/messages, and there is 
no /var/log/syslog.

The only thing I haven't checked outside the box is the cable, so I 
will do that, but it seems unlikely.

And yes, these boxes are old, but they have served me well, and my 
replacements won't be up and running till the end of the month.  I 
also was hoping to find a better configuration choice, if there is one.


Mike




From dnlombar at ichips.intel.com  Tue Dec  4 07:17:48 2007
From: dnlombar at ichips.intel.com (Lombard, David N)
Date: Tue, 4 Dec 2007 07:17:48 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <4752C689.5030102@gmail.com>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<474FEF18.6020308@obs.unige.ch>
	<Pine.LNX.4.64.0711300936050.11868@coffee.psychology.mcmaster.ca>
	<d5bdff000711300759r17f381fap25e34bc1821f73ff@mail.gmail.com>
	<Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>
	<4752C689.5030102@gmail.com>
Message-ID: <20071204151748.GA26106@nlxdcldnl2.cl.intel.com>

On Sun, Dec 02, 2007 at 03:51:53PM +0100, Toon Knapen wrote:
> Mark Hahn wrote:
> >>IMHO the hybrid approach (MPI+threads) is interesting in case every
> >>MPI-process has lots of local data.
> >
> >yes.  but does this happen a lot?  the appealing case would be threads 
> >that make lots of heavy use of some large data, _but_
> >without needing synchronization/locking.  once you need locking
> >among the threads, message passing starts to catch up.
> 
> Direct solvers (for Finite Elements for instance) need a lot of data. 
> Additionally, distributing the matrix generates interfaces (between the 
> different submatrices) which are hard to solve. In such a situation, one 
> tries to minimize the number of interfaces (by having one submatrix per 
> MPI-process) and speed up the solving of each submatrix using threads.

Yes, this is my direct experience with hybrid programming.  An automated
domain decomp is used to partition the model, and then threads (either
native or OpenMP) are used within the domain.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From dnlombar at ichips.intel.com  Tue Dec  4 07:28:52 2007
From: dnlombar at ichips.intel.com (Lombard, David N)
Date: Tue, 4 Dec 2007 07:28:52 -0800
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <Pine.LNX.4.64.0712031337340.11771@lilith.rgb.private.net>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
	<Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
	<e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>
	<Pine.LNX.4.64.0712031337340.11771@lilith.rgb.private.net>
Message-ID: <20071204152852.GB26106@nlxdcldnl2.cl.intel.com>

On Mon, Dec 03, 2007 at 01:38:12PM -0500, Robert G. Brown wrote:
> On Mon, 3 Dec 2007, Peter St. John wrote:
> 
> >(re Ian Foster, *Designing and Building Parallel Programs *online as below
> >or Addison Wesley):
> >
> >I did that search and right at the top was this link, which looks like 
> >homebase
> >for the original material:
> >http://www-unix.mcs.anl.gov/dbpp/
> >Very cool, thanks RGB for what looks like a toothsome book.
> 
> I went ahead and bought a paper copy, but it is nice to be able to
> access the material from a workstation because I don't carry the copy
> around with me all the time...;-)

Whenever I find an example of both print and online copies of any
reasonable text, I'll make sure I buy the print copy to reward such
behavior.  The Rute and SVN books are two additional examples.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From Michael.Frese at NumerEx.com  Tue Dec  4 07:54:24 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Tue, 04 Dec 2007 08:54:24 -0700
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
Message-ID: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>

Mark,

Thanks for your helpful comments.

At 11:31 PM 12/3/2007, you wrote:
>>I am guessing you are using TCP NFS mounts as well?  TCP forces 
>>retries in the event of bad packets.  UDP doesn't force this, but 
>>the NFS protocol will
>
>UDP has a checksum as well, though it's only 16b.  then again, the TCP
>checksum isn't all that strong for today's data rates either.
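
For reference, a small sketch of the 16-bit ones' complement sum that both checksums are built on, written in the style of RFC 1071. The function name inet_checksum is my own, and the real UDP and TCP checksums also cover a pseudo-header, which this sketch leaves out. It illustrates one well-known weakness: the sum does not depend on the order of the 16-bit words, so two words swapped in transit go undetected.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement checksum over a byte buffer.  Bytes are summed
 * as big-endian 16-bit words; an odd trailing byte is padded with zero. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint64_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint64_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The buffers {0x12,0x34,0x56,0x78} and {0x56,0x78,0x12,0x34} hold different data but give the same checksum, 0x9753.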

 From reading the man page on nfs on the systems with the 2.4 
kernels, it looks like the default for an nfs mount is udp.  It also 
looks like tcp is not really an option until nfs v4, so it may be 
something to try on the 2.6 kernels that I have on some of my newer 
machines at another site.

>you should definitely examine /proc/net/dev on involved machines.

I hadn't known about /proc/net/dev.  When I check there, I see no 
transmit errors on the server side and no receive errors on the 
client side.  That's odd, because the other thing I see is that the 
average packet size received (bytes received divided by packets 
received) on the client side is 3.9, while on the server side, the 
average packet size sent is 1430.  In other words, there are many 
more packets received than there ought to be.  That's very 
fishy.  It's probably the result of the way the packet count is done 
and reported.  I.e., it may be that all the received packets -- good 
and bad -- are counted, but only the bytes in the good ones are 
counted, with some similar problem on the server side.  I think the 
statistics are aggregate since the last boot, so they may not be just 
from the troublesome tests I was performing, either.
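
The arithmetic above can be sketched in C as follows. This assumes the usual /proc/net/dev layout of eight receive counters followed by eight transmit counters after the interface name; the struct and function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Counters taken from one interface line of /proc/net/dev. */
struct netdev_counts {
    char name[32];
    unsigned long long rx_bytes, rx_packets, rx_errs;
    unsigned long long tx_bytes, tx_packets, tx_errs;
};

/* Parse one line such as
 *   "  eth0: 14300 10 0 0 0 0 0 0 39 10 0 0 0 0 0 0"
 * Returns 1 on success, 0 for a header line or a malformed line. */
int parse_netdev_line(const char *line, struct netdev_counts *c)
{
    unsigned long long f[16];
    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return 0;                       /* the two header lines have no ':' */

    while (*line == ' ' || *line == '\t')
        line++;
    size_t n = (size_t)(colon - line);
    if (n == 0 || n >= sizeof c->name)
        return 0;
    memcpy(c->name, line, n);
    c->name[n] = '\0';

    int got = sscanf(colon + 1,
                     "%llu %llu %llu %llu %llu %llu %llu %llu"
                     " %llu %llu %llu %llu %llu %llu %llu %llu",
                     &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7],
                     &f[8], &f[9], &f[10], &f[11], &f[12], &f[13], &f[14],
                     &f[15]);
    if (got != 16)
        return 0;

    c->rx_bytes = f[0];  c->rx_packets = f[1];  c->rx_errs = f[2];
    c->tx_bytes = f[8];  c->tx_packets = f[9];  c->tx_errs = f[10];
    return 1;
}

/* Average bytes per packet, or 0 when no packets were counted. */
double avg_packet_size(unsigned long long bytes, unsigned long long packets)
{
    return packets ? (double)bytes / (double)packets : 0.0;
}
```

One possible explanation for an implausibly small average, which is an assumption and not something confirmed in this thread: on 32-bit kernels these counters are 32-bit values, so the byte counter may wrap after 4 GiB long before the packet counter does. Taking the difference of two snapshots around a single test avoids both that and the since-boot aggregation.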

>I would attempt to reduce the complexity of your testing.
>for instance, can a node write and verify to its local disk
>without problem?

The local disk read seems rock solid in comparison to the NFS 
one.  The local md5sum produces the same result time after time, 
which is just not the case for the remote.
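
The repeated-read test can be sketched in C as below. Note the swap: this uses a 64-bit FNV-1a hash as a stand-in for md5sum, only because it fits in a few lines, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a 64-bit hash of a file's contents (a stand-in for md5sum).
 * Returns 0 on success, -1 if the file cannot be opened or read. */
int file_hash(const char *path, uint64_t *out)
{
    unsigned char buf[65536];
    uint64_t h = 0xcbf29ce484222325ULL;       /* FNV offset basis */
    size_t n;
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;            /* FNV prime */
        }
    }
    int err = ferror(fp);
    fclose(fp);
    if (err)
        return -1;
    *out = h;
    return 0;
}

/* Hash `path` `rounds` times.  Returns the number of rounds whose hash
 * differs from the first round, or -1 on an I/O error. */
int count_mismatches(const char *path, int rounds)
{
    uint64_t first = 0, h = 0;
    int bad = 0;
    for (int r = 0; r < rounds; r++) {
        if (file_hash(path, &h) != 0)
            return -1;
        if (r == 0)
            first = h;
        else if (h != first)
            bad++;
    }
    return bad;
}
```

One caveat: repeated reads on the same client may be served from the page cache rather than from the wire, so a file larger than RAM, or a remount between rounds, may be needed to exercise the network path every time.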

>can it stream data over tcp sockets (netcat or the like) without 
>corruption or obvious problems reflected
>in /proc/net/dev?

netcat is not on my systems.  Looks like I have to get someone to 
download and build it for me, and try the streaming tests you recommend.

>does ethtool tell you anything about the config of the nic?

Not on the 2.4 systems, though it seems to tell me a little on the 2.6's.

>comparing tcp vs udp NFS would be sensible
>as well - varying the packet size, too.  switching client and/or 
>server to a modern 2.6 kernel may be instructive.

Upgrading the kernel is probably the only way I'll get nfs over 
tcp.  Given that these systems are headed out the door, I'm not sure 
that's a good use of our time.  But it may be worth doing on our new 
and newer systems.

Thanks again!


Mike  


From jlb17 at duke.edu  Tue Dec  4 09:24:54 2007
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Tue, 4 Dec 2007 12:24:54 -0500 (EST)
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
Message-ID: <alpine.LRH.0.99999.0712041209430.11349@hogwarts.egr.duke.edu>

On Tue, 4 Dec 2007 at 8:54am, Michael H. Frese wrote

> From reading the man page on nfs on the systems with the 2.4 kernels, it 
> looks like the default for an nfs mount is udp.  It also looks like tcp is 
> not really an option until nfs v4, so it may be something to try on the 2.6 
> kernels that I have on some of my newer machines at another site.

NFSv3 over TCP is the default for most modern distros (obviously this 
rules out your setup ;).  I honestly don't remember if it was supported in 
RH9 (I think it was, but that was many moons ago) but it'd be easy to 
test.  Just add 'tcp' to the mount options in /etc/fstab and try the 
mount.  If it's not supported, it won't work.
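
To check which transport an fstab entry actually asks for before trying the mount, something like the following C sketch would do. The helper names are hypothetical; the code only inspects the comma-separated option field of an fstab line and knows nothing about what the kernel supports.

```c
#include <string.h>

/* Return 1 if the comma-separated option string `opts` (the fourth field
 * of an /etc/fstab line) contains `name` as a complete option, else 0.
 * A name that is only part of a longer option does not match. */
int has_mount_opt(const char *opts, const char *name)
{
    size_t len = strlen(name);
    const char *p = opts;

    if (len == 0)
        return 0;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t n = end ? (size_t)(end - p) : strlen(p);
        if (n == len && strncmp(p, name, len) == 0)
            return 1;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}

/* Transport explicitly requested by an option string.  nfs(5) accepts
 * both the short form ("tcp") and the long form ("proto=tcp"). */
const char *requested_transport(const char *opts)
{
    if (has_mount_opt(opts, "tcp") || has_mount_opt(opts, "proto=tcp"))
        return "tcp";
    if (has_mount_opt(opts, "udp") || has_mount_opt(opts, "proto=udp"))
        return "udp";
    return "unspecified";   /* the mount falls back to the system default */
}
```

For example, requested_transport("rw,hard,bg") returns "unspecified", while requested_transport("rw,hard,bg,tcp") returns "tcp".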

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


From mousavi.ehsan at gmail.com  Mon Dec  3 21:47:36 2007
From: mousavi.ehsan at gmail.com (Ehsan Mousavi)
Date: Tue, 4 Dec 2007 09:17:36 +0330
Subject: [Beowulf] CSharifi Next generation of HPC
In-Reply-To: <4752C689.5030102@gmail.com>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>	<474FEF18.6020308@obs.unige.ch>	<Pine.LNX.4.64.0711300936050.11868@coffee.psychology.mcmaster.ca>	<d5bdff000711300759r17f381fap25e34bc1821f73ff@mail.gmail.com>	<Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>
	<4752C689.5030102@gmail.com>
Message-ID: <!&!AAAAAAAAAAAYAAAAAAAAAIuzoW6S3BlMjcsQQUsueybCgAAAEAAAAGqXv4ZTAVFNmgYhAXsY3wABAAAAAA==@GMAIL.COM>

C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level
Paradigm" for Distributed Computing Support

Contrary to the two schools of thought in providing system software support for
distributed computation that advocate either the development of a whole new
distributed operating system (like Mach), or the development of
library-based or patch-based middleware on top of existing operating systems
(like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another
school of thought as his thesis in 1986 that believes all distributed
systems software requirements and supports can be and must be built at the
Kernel Level of existing operating systems; requirements like Ease of
Programming, Simplicity, Efficiency, Accessibility, etc which may be coined
as Usability.  Although the latter belief was hard to realize, a sample
byproduct called DIPC was built purely based on this thesis and openly
announced to the Linux community worldwide in 1993.  This was admired for
being able to provide necessary supports for distributed communication at
the Kernel Level of Linux for the first time in the world, and for providing
Ease of Programming as a consequence of being realized at the Kernel Level.
However, it was criticized at the same time as being inefficient. Rather
than trade Ease of Programming for Efficiency, the school worked hard to
achieve efficiency alongside ease of programming and simplicity, without
abandoning the principle that all needs be provided at the kernel level.
The result of this effort is now manifested in the C-Sharifi Cluster
Engine.
 C-Sharifi is a cost effective distributed system software engine in support
of high performance computing by clusters of off-the-shelf computers. It is
wholly implemented in Kernel, and as a consequence of following this school,
it has Ease of Programming, Ease of Clustering, Simplicity, and it can be
configured to fit as best as possible to the efficiency requirements of
applications that need high performance.  It supports both distributed
shared memory and message passing styles, it is built in Linux, and its
cost/performance ratio in some scientific applications (like meteorology and
cryptanalysis) has been shown to be far better than non-kernel-based solutions
and engines (like MPI, Kerrighed and Mosix). 

Best Regards
~Ehsan Mousavi
C-Sharifi  Development Team

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
Behalf Of Toon Knapen
Sent: Sunday, December 02, 2007 6:22 PM
To: Mark Hahn
Cc: Beowulf Mailing List
Subject: Re: [Beowulf] Using Autoparallel compilers or Multi-Threaded
libraries with MPI

Mark Hahn wrote:
>> IMHO the hybrid approach (MPI+threads) is interesting in case every
>> MPI-process has lots of local data.
> 
> yes.  but does this happen a lot?  the appealing case would be threads 
> that make lots of heavy use of some large data, _but_
> without needing synchronization/locking.  once you need locking
> among the threads, message passing starts to catch up.

Direct solvers (for Finite Elements for instance) need a lot of data. 
Additionally, distributing the matrix generates interfaces (between the 
different submatrices) which are hard to solve. In such a situation, one 
tries to minimize the number of interfaces (by having one submatrix per 
MPI-process) and speed up the solving of each submatrix using threads.

Finance is another example. Financial applications need to evaluate a 
large number of open positions based on the simulated, current or past 
market-data. There are many dependencies between all the different data, 
which makes it hard to decompose the data into largely independent 
chunks.


> 
>> latter is simpler because it only requires MPI-parallelism but if the 
>> code
>> is memory-bound and every mpi-process has much of the same data, it 
>> will be
>> better to share this common data with all processes on the same cpu 
>> and thus
>> use threads intra-node.
> 
> what kind of applications behave like that?  I agree that if your MPI 
> app is keeping huge amounts of (static) data replicated in each rank,
> you should rethink your design.
> 


See above.

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf


From gmichal at uow.edu.au  Tue Dec  4 00:15:18 2007
From: gmichal at uow.edu.au (Guillaume Michal)
Date: Tue, 04 Dec 2007 19:15:18 +1100
Subject: [Beowulf] A cluster for material simulation
In-Reply-To: <Pine.LNX.4.64.0712040134480.4169@coffee.psychology.mcmaster.ca>
References: <op.t2r8gcybnfpofi@guizmo>
	<Pine.LNX.4.64.0712040134480.4169@coffee.psychology.mcmaster.ca>
Message-ID: <op.t2sw7sy6kq1em0@nomad.dune.org>

Space is not an issue at all in fact, but as a mech engineer I'm more "the
fewer parts the better", so I tend to factorise and make it as simple as
possible. The heat won't be a problem as air conditioning exists in the
room. In terms of task sizes, 10G is what we need "tomorrow"; as our
understanding of the cluster increases, we will increase the size of the
problems.

By the way, thank you for your indications. I'm going to think a bit
more about all that, and... try to understand what IPMI is all about ;-)

Guillaume


From nelsoneci at gmail.com  Tue Dec  4 05:52:43 2007
From: nelsoneci at gmail.com (Nelson Castillo)
Date: Tue, 4 Dec 2007 08:52:43 -0500
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <47555A3B.3080609@sicortex.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>
	<47555A3B.3080609@sicortex.com>
Message-ID: <2accc2ff0712040552n5e5f222fo2c241aa11fb577b6@mail.gmail.com>

On Dec 4, 2007 8:46 AM, Larry Stewart <larry.stewart at sicortex.com> wrote:
(cut)
> I was looking into this a few months ago.  Here are some good papers I
> found:
>
> http://citeseer.ist.psu.edu/393851.html  -- Communications Conscious
> Radix Sort
>
> http://citeseer.ist.psu.edu/569483.html  -- Parallel Algorithms for
> Personalized Communication and Sorting With an Experimental Study
>
> Martin Schmollinger: Improving Communication Sensitive Parallel Radix
> Sort for Unbalanced Data. Euro-Par 2003
> <http://www.informatik.uni-trier.de/%7Eley/db/conf/europar/europar2003.html#Schmollinger03>:
> 885-893
>
> Schmollinger's PhD dissertation has a good chapter on this as well.
>
> --
> -Larry / Sector IX

Thanks a lot for all your responses. I am very curious about Parallel
Radix Sort. I've read and watched the 5th lecture of this course, and I
wanted to know more about parallel implementations. I've found many papers
on the subject, but in this case I preferred to ask for the relevant ones
since it is easy to get lost with papers that are not that good.

http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-046JFall-2005/LectureNotes/index.htm

Regards.

-- 
http://arhuaco.org


From examachine at gmail.com  Tue Dec  4 06:18:35 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Tue, 4 Dec 2007 16:18:35 +0200
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	libraries with MPI
In-Reply-To: <Pine.LNX.4.64.0712040748050.11771@lilith.rgb.private.net>
References: <120320072229.3467.475483640006184A00000D8B2200748184089C040E99D20B9D0E080C079D@comcast.net>
	<Pine.LNX.4.64.0712040748050.11771@lilith.rgb.private.net>
Message-ID: <320e992a0712040618t5e8a2f9ch7bb26867ffeba8ba@mail.gmail.com>

On Dec 4, 2007 2:53 PM, Robert G. Brown <rgb at phy.duke.edu> wrote:
> On Mon, 3 Dec 2007, richard.walsh at comcast.net wrote:
>
> > impossibility.  I am still waiting to get a straight flush in 5-card
> > draw.
>
> Are ye, now... interesting.
>
> Sometime we'll have to wait together.  In the meantime, I find that if
> you play the game with a wild card or eight it alters the odds
> magnificently.  Why, you can get a straight flush and still lose the
> game...;-)

I think for many types of code pure MPI code would be much easier to
develop, granted, but an auto parallel compiler can choose to use
either type  (for certain types of codes where the compiler would work
at all). Am I speaking the obvious? Where it suits, the multithreaded
code can be much faster than MPI code as it can avoid copying large
messages. Depends very much on what type of communication there is in
the algorithm. Maybe Greg is right for the majority of X kind of code,
I wouldn't have a problem with that statement, but in general I'm
quite doubtful that there can be no performance gains.

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://www.cs.bilkent.edu.tr/~erayo  Malfunct: http://myspace.com/malfunct
ai-philosophy: http://groups.yahoo.com/group/ai-philosophy


From rokrau at yahoo.com  Tue Dec  4 09:02:27 2007
From: rokrau at yahoo.com (Roland Krause)
Date: Tue, 4 Dec 2007 09:02:27 -0800 (PST)
Subject: [Beowulf] Recommended paper for parallel sorting?
In-Reply-To: <Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
Message-ID: <861246.65508.qm@web81113.mail.mud.yahoo.com>

Speaking of Ian Foster's books. Does anyone have an opinion about this
one? 

The Sourcebook of Parallel Computing (The Morgan Kaufmann Series in
Computer Architecture and Design)

This book seems to be quite a bit newer, has a different focus
obviously, but I'd like to know what you think about it? 

Thanks,
Roland


--- "Robert G. Brown" <rgb at phy.duke.edu> wrote:

> You might check out Ian Foster's free online book on parallel
> algorithms.  It is worth buying if you're going to be doing a lot of
> parallel programming.  Or there are two or three other decent
> textbooks
> on parallel programming at the algorithm level.  I don't recall
> offhand
> if Foster covers sorting, but you can easily find out for free.
> 


From mg.mailing-list at laposte.net  Tue Dec  4 11:03:34 2007
From: mg.mailing-list at laposte.net (Mathieu Gontier)
Date: Tue, 04 Dec 2007 20:03:34 +0100
Subject: [Beowulf] use a MPI library thought a shared library
Message-ID: <4755A486.5000109@laposte.net>

Hi all,

I am currently working with a project named MorphMPI. Its main purpose 
is to offer a generic interface to the developers of parallel 
applications, and to choose the MPI library/interconnect at runtime by 
rebuilding a shared morph library against the desired MPI library. (The 
final application is linked against a shared morph library instead of 
the real MPI library.)
For more information about that, you can follow these links:
- http://www.clustermonkey.net//content/view/213/32/
- http://sourceforge.net/projects/morphmpi

I have run into a small problem, whatever the MPI library used (I tried 
MPICH-1.2.5.2, MPICHGM and IntelMPI).
When MorphMPI is linked statically with my parallel application, 
everything is OK; but when MorphMPI is linked dynamically with my 
parallel application, MPI_Get_count returns a wrong value.

I concluded it is difficult to use an MPI library through a shared 
library. I wonder if someone has more information about it (in this 
case, you're welcome ;-) )

Thank you for your support,
Mathieu.

PS: my problem happens in the following example,

#include <morphmpi.h>
#include <mpi.h>
#include <stdio.h>

int main( int argc, char* argv[] )
{
  int np, me, ier, flag=0, msglen=-1 ;
  MorphMPI_Request request ;
  MorphMPI_Status status ;
  int buf[1] ; buf[0]=-1 ;

  ier = MorphMPI_Init( &argc, &argv ) ;
  ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ;
  ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ;

  if( me > 1 ) printf( "I am the useless processor #%d on %d\n", me, np ) ;
  else printf( "I am the working processor #%d on %d\n", me, np ) ;

  ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;

  /* Print the address of the status object; %p is the pointer format. */
  printf( "<<< %p >>>\n", (void*)&status ) ;

  /* Rank 0 sends one int to rank 1 with tag 1. */
  if( ! me ) {
    buf[0] = 69 ;
    ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1,1, MorphMPI_COMM_WORLD, &request ) ;
    ier = MorphMPI_Wait( &request, &status ) ;
  }

  ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;

  /* Rank 1 receives it and checks the element count of the message. */
  if( me == 1 ) {
    ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, MorphMPI_COMM_WORLD, &request ) ;
    ier = MorphMPI_Wait( &request, &status ) ;
    ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ;

    if( msglen != 1 ) printf( "ERROR: The length of the message is not 1\n" ) ;
    else printf( "SUCCESS !\n" ) ;
  }

  ier = MorphMPI_Finalize() ;
  return 0 ;
}


-- 
Mathieu Gontier
Core Development Engineer

Read the attached v-card for telephone, fax, adress
Look at our web-site http://www.fft.be
 

From larry.stewart at sicortex.com  Tue Dec  4 11:49:11 2007
From: larry.stewart at sicortex.com (Larry Stewart)
Date: Tue, 04 Dec 2007 14:49:11 -0500
Subject: [Beowulf] Intel MPI Benchmark maintainers?
In-Reply-To: <Pine.LNX.4.64.0712040050520.4169@coffee.psychology.mcmaster.ca>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>	<474FEF18.6020308@obs.unige.ch>	<Pine.LNX.4.64.0711300936050.11868@coffee.psychology.mcmaster.ca>	<d5bdff000711300759r17f381fap25e34bc1821f73ff@mail.gmail.com>	<Pine.LNX.4.64.0711301724470.28765@coffee.psychology.mcmaster.ca>	<4752C689.5030102@gmail.com>	<!&!AAAAAAAAAAAYAAAAAAAAAIuzoW6S3BlMjcsQQUsueybCgAAAEAAAAGqXv4ZTAVFNmgYhAXsY3wABAAAAAA==@GMAIL.COM>
	<Pine.LNX.4.64.0712040050520.4169@coffee.psychology.mcmaster.ca>
Message-ID: <4755AF37.2000807@sicortex.com>

Does anyone know where to send bug fixes for the Intel MPI Benchmarks?

Simple stuff - bad printfs in error handling paths, but I can't find an 
email
address for such things.

-L


From peter.st.john at gmail.com  Tue Dec  4 12:05:23 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Tue, 4 Dec 2007 15:05:23 -0500
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <4755A486.5000109@laposte.net>
References: <4755A486.5000109@laposte.net>
Message-ID: <e4d4fd070712041205i87ab7f8g4ba96a830f56743e@mail.gmail.com>

Mathieu,
I didn't spot why you included <mpi.h>? It seems you work thru morph_mpi.h
wrappers, right? Perhaps I misunderstand?
Peter

On Dec 4, 2007 2:03 PM, Mathieu Gontier <mg.mailing-list at laposte.net> wrote:

> Hi all,
>
> I am currently working with a project named MorphMPI. Its main purpose
> is to offer a generic interface for the developers of parallel
> applications, and chose the MPI library/interconnect at the runtime by
> rebuilding a shared morph library against the desire MPI library. (The
> final application is linked against a shared morph library instead of
> the real MPI library.)
> For more information about that, you can follow these links:
> - http://www.clustermonkey.net//content/view/213/32/
> - http://sourceforge.net/projects/morphmpi
>
> So, I meet a little problem whatever the MPI library used (I tried with
> MPICH-1.2.5.2, MPICHGM and IntelMPI).
> When MorphMPI is  linked statically with my parallel application,
> everything is ok; but when MorphMPI is  linked dynamically with my
> parallel application, MPI_Get_count return a wrong value.
>
> I concluded it is difficult to use a MPI library thought a shared
> library. I wonder if someone have more information about it (in this
> case, you're welcome ;-) )
>
> Thank you for your support,
> Mathieu.
>
> PS: my problem happens in the the following example,
>
> #  include<morphmpi.h>
>
> #  include <mpi.h>
>
> #include<stdio.h>
>
>
> int main( int argc, char* argv[] )
>
> {
>
>  int np, me, ier, flag=0, msglen=-1 ;
>
>  MorphMPI_Request request ;
>
>  MorphMPI_Status status ;
>
>  int buf[1] ; buf[0]=-1 ;
>
>
>  ier = MorphMPI_Init( &argc, &argv ) ;
>
>  ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ;
>
>  ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ;
>
>
>  if( me > 1 ) printf( "I am the useless processor #%d on %d\n", me, np ) ;
>
>  else printf( "I am the working processor #%d on %d\n", me, np ) ;
>
>
>  ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;
>
>
> printf( "<<< %d >>>\n", &status ) ;
>
>
>  if( ! me ) {
>
>    buf[0] = 69 ;
>
>    ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1,1, MorphMPI_COMM_WORLD,
> &request ) ;
>
>    ier = MorphMPI_Wait( &request, &status ) ;
>
>  }
>
>
>  ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;
>
>
>  if( me == 1 ) {
>
>    ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, MorphMPI_COMM_WORLD,
> &request ) ;
>
>    ier = MorphMPI_Wait( &request, &status ) ;
>
>    ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ;
>
>
>    if( msglen != 1 ) printf( "ERROR: The lengh of the message is not 1\n"
> ) ;
>
>    else printf( "SUCCESS !\n" ) ;
>
>  }
>
>
>  ier = MorphMPI_Finalize() ;
>
> }
>
>
>
> --
> Mathieu Gontier
> Core Development Engineer
>
> Read the attached v-card for telephone, fax, adress
> Look at our web-site http://www.fft.be
>
>

From henry.gabb at intel.com  Tue Dec  4 12:13:15 2007
From: henry.gabb at intel.com (Gabb, Henry)
Date: Tue, 4 Dec 2007 12:13:15 -0800
Subject: [Beowulf] RE: Intel MPI Benchmark maintainers?
In-Reply-To: <200712042000.lB4K03ms028979@bluewest.scyld.com>
Message-ID: <4D97B70CF7F72144881F66DFF4BD7A12031AFE82@fmsmsx413.amr.corp.intel.com>

Hi Larry,
The Intel MPI Benchmarks are part of the Intel Cluster Toolkit
(http://www3.intel.com/cd/software/products/asmo-na/eng/307696.htm) so
you can submit bug reports to your Premier Support account for ICT. If
you don't have a Premier account, you can send the bug reports directly
to me. I'll make sure they get to the right place.

Henry Gabb
Intel Cluster Software and Technologies


From mathog at caltech.edu  Tue Dec  4 12:45:25 2007
From: mathog at caltech.edu (David Mathog)
Date: Tue, 04 Dec 2007 12:45:25 -0800
Subject: [Beowulf] Re: NFS Read Errors
Message-ID: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>

I missed the beginning of this thread - what were the parameters
in /etc/fstab on the client?

Unless hard mounts are used it is possible for a block of 
null bytes to end up in the file where data was supposed to be.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From rosing at peakfive.com  Tue Dec  4 13:08:08 2007
From: rosing at peakfive.com (Matt Rosing)
Date: Tue, 4 Dec 2007 14:08:08 -0700
Subject: [Beowulf] Re: use a MPI library thought a shared library
In-Reply-To: <200712042000.lB4K03ms028979@bluewest.scyld.com>
References: <200712042000.lB4K03ms028979@bluewest.scyld.com>
Message-ID: <18261.49592.175837.718416@lala.site>

 > From: Mathieu Gontier <mg.mailing-list at laposte.net>
 > 
 > So, I meet a little problem whatever the MPI library used (I tried with 
 > MPICH-1.2.5.2, MPICHGM and IntelMPI).
 > When MorphMPI is  linked statically with my parallel application, 
 > everything is ok; but when MorphMPI is  linked dynamically with my 
 > parallel application, MPI_Get_count return a wrong value.

I'm guessing your machine is suffering from version hell and your
LD_LIBRARY_PATH environment variable doesn't match your Makefile.

We use modules and someone else figures all that out.

Hope this helps,

Matt


From landman at scalableinformatics.com  Tue Dec  4 13:22:32 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 04 Dec 2007 16:22:32 -0500
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
References: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
Message-ID: <4755C518.5070409@scalableinformatics.com>

David Mathog wrote:
> I missed the beginning of this thread - what were the parameters
> in /etc/fstab on the client?
> 
> Unless hard mounts are used it is possible for a block of 
> null bytes to end up in the file where data was supposed to be.

I think his issue is one of an over-zealous retry loop somewhere ...  He 
is using udp mounts by default (could do a "mount -o remount,tcp /path" 
to change to tcp, but I don't think this will help).

It sounded to me like a bad HD, but his local HD reads/writes seem ok 
(is this correct)?

It could be

	a) bad driver
	b) bad NIC
	c) bad PCI slot
	d) bad cable
	e) bad switch
	f) bad switch port
	g) other things :)

The gear he was using is *old*, and the distro is a 2.4.20 based thing 
(RH9 I think?).

If it is worth the time and effort to hunt it down, I might suggest 
investing in a pair of new (different) NICs, putting them in a node with 
a crossover cable, and making sure he can pass data back and forth 
without issue.  Then see if the problem emerges in changing one thing at 
a time (or bisect the search space, but the list is short enough that 
either one would work well).

> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From landman at scalableinformatics.com  Tue Dec  4 13:49:05 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 04 Dec 2007 16:49:05 -0500
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <4755A486.5000109@laposte.net>
References: <4755A486.5000109@laposte.net>
Message-ID: <4755CB51.5050802@scalableinformatics.com>

Greetings Mathieu:

Mathieu Gontier wrote:

[...]

> So, I meet a little problem whatever the MPI library used (I tried with 
> MPICH-1.2.5.2, MPICHGM and IntelMPI).
> When MorphMPI is  linked statically with my parallel application, 
> everything is ok; but when MorphMPI is  linked dynamically with my 
> parallel application, MPI_Get_count return a wrong value.
> 
> I concluded it is difficult to use a MPI library thought a shared 
> library. I wonder if someone have more information about it (in this 

Not likely.  I would suggest ldd.  It is your friend.

For example:

joe at pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
         libm.so.6 => /lib/libm.so.6 (0x00002b5409d17000)
         libmpi.so.0 => not found
         libopen-rte.so.0 => not found
         libopen-pal.so.0 => not found
         librt.so.1 => /lib/librt.so.1 (0x00002b5409f99000)
         libdl.so.2 => /lib/libdl.so.2 (0x00002b540a1a2000)
         libnsl.so.1 => /lib/libnsl.so.1 (0x00002b540a3a6000)
         libutil.so.1 => /lib/libutil.so.1 (0x00002b540a5c0000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x00002b540a7c3000)
         libc.so.6 => /lib/libc.so.6 (0x00002b540a9de000)
         /lib64/ld-linux-x86-64.so.2 (0x00002b5409af9000)

Notice that libmpi.so.0 is not found, so I can't run this by hand. 
Unless I force the issue using LD_LIBRARY_PATH

joe at pegasus-i:~/workspace/source-mpi$ export LD_LIBRARY_PATH="/home/joe/local/lib64/:/home/joe/local/lib/"
joe at pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
         libm.so.6 => /lib/libm.so.6 (0x00002ae35ca50000)
         libmpi.so.0 => /home/joe/local/lib/libmpi.so.0 (0x00002ae35ccd1000)
         libopen-rte.so.0 => /home/joe/local/lib/libopen-rte.so.0 (0x00002ae35cfe8000)
         libopen-pal.so.0 => /home/joe/local/lib/libopen-pal.so.0 (0x00002ae35d2b3000)
         librt.so.1 => /lib/librt.so.1 (0x00002ae35d514000)
         libdl.so.2 => /lib/libdl.so.2 (0x00002ae35d71d000)
         libnsl.so.1 => /lib/libnsl.so.1 (0x00002ae35d921000)
         libutil.so.1 => /lib/libutil.so.1 (0x00002ae35db3b000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x00002ae35dd3e000)
         libc.so.6 => /lib/libc.so.6 (0x00002ae35df59000)
         /lib64/ld-linux-x86-64.so.2 (0x00002ae35c832000)

and it might even run ...

joe at pegasus-i:~/workspace/source-mpi$ ./matmul_mpi_3.exe
D[tid=0]: running on machine = pegasus-i
D: checking arguments: N_args=1
D: arg[0] = ./matmul_mpi_3.exe
Allocating memory ...
array size in MB = 7.629 MB
  (remember, you have 2 of these)normalization a: 0.05510,  b: 0.00173
0 : loop_min = 0, loop_max = 1000
...

Do you have some sort of LD_LIBRARY_PATH set up?  Or something set in 
/etc/ld.so.conf that points to where these things are?  Remember, 
mpirun/mpiexec's alternative purpose in life is to set up the correct 
run time environment for you, so you might want to see what is going on 
with the environment in your equivalent command.
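
A rough C sketch of the lookup that the LD_LIBRARY_PATH step above relies on, useful for checking which copy of a library a given path list would pick up first. The function name is hypothetical, and this covers only the colon-separated directory walk; the real dynamic loader also consults RPATH/RUNPATH entries, the ld.so cache and the default directories, so treat it as a diagnostic aid and not as a model of ld.so.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Search a colon-separated directory list (the LD_LIBRARY_PATH format)
 * for a readable file called `name`.  On success the full path of the
 * first match is written to `out` and 1 is returned; otherwise `out` is
 * set to the empty string and 0 is returned.  Empty entries are skipped
 * here. */
int find_in_path_list(const char *list, const char *name,
                      char *out, size_t outlen)
{
    const char *p = list;

    if (list == NULL || outlen == 0)
        return 0;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t n = end ? (size_t)(end - p) : strlen(p);
        if (n > 0) {
            int w = snprintf(out, outlen, "%.*s/%s", (int)n, p, name);
            if (w > 0 && (size_t)w < outlen && access(out, R_OK) == 0)
                return 1;
        }
        if (end == NULL)
            break;
        p = end + 1;
    }
    out[0] = '\0';
    return 0;
}
```

Called as find_in_path_list(getenv("LD_LIBRARY_PATH"), "libmpi.so.0", buf, sizeof buf) (with stdlib.h included for getenv), it reports the first directory in the list that holds that file name. If that is not the MPI build the application was linked against, the mismatch between environment and Makefile is a likely suspect.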


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From Michael.Frese at NumerEx.com  Tue Dec  4 13:54:51 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Tue, 04 Dec 2007 14:54:51 -0700
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
References: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
Message-ID: <6.1.2.0.2.20071204144808.06568008@themis.numerex.com>

David,

The fstab mount parameters are 'rw,hard,bg', so I think that's not the problem.

I'll send you my original missive separately.

Thanks.


Mike


At 03:01 PM 12/4/2007, David Mathog wrote:
>I missed the beginning of this thread - what were the parameters
>in /etc/fstab on the client?
>
>Unless hard mounts are used it is possible for a block of
>null bytes to end up in the file where data was supposed to be.
>
>Regards,
>
>David Mathog
>mathog at caltech.edu
>Manager, Sequence Analysis Facility, Biology Division, Caltech

From becker at scyld.com  Tue Dec  4 14:00:52 2007
From: becker at scyld.com (Donald Becker)
Date: Tue, 4 Dec 2007 14:00:52 -0800 (PST)
Subject: [Beowulf] CSharifi Next generation of HPC
In-Reply-To: <!&!AAAAAAAAAAAYAAAAAAAAAIuzoW6S3BlMjcsQQUsueybCgAAAEAAAAGqXv4ZTAVFNmgYhAXsY3wABAAAAAA==@GMAIL.COM>
Message-ID: <Pine.LNX.4.44.0712041307270.28631-100000@bluewest.scyld.com>


[[ Hmmmm, OK, I seem to have moderation-approved pretty much a repeat of 
a widespread posting.  So I'll answer with the response I was planning a 
few days ago. ]]

On Tue, 4 Dec 2007, Ehsan Mousavi wrote:

> C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level
> Paradigm" for Distributed Computing Support
> 
> Contrary to two schools of thought in providing system software support for
> (like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another
> school of thought as his thesis in 1986 that believes all distributed
> systems software requirements and supports can be and must be built at the
> Kernel Level of existing operating systems;

In 1986 I had been working for a few years on shared memory systems with 
a hefty proportion of custom-designed hardware.

I learned from that experience.  That's why I now work on distributed 
memory systems based on off-the-shelf commodity hardware.

I also think that there are some important aspects of cluster 
infrastructure that (at present) can only be implemented by tweaking the 
kernel.  But most of the features to make a cluster easy to use don't need 
special kernel support, and indeed can't be implemented inside the kernel 
at all.

You might initially think "you can put any program inside the kernel, 
therefore you can do everything inside the kernel".  But as a 
counter-example consider name services.  Essentially all programs use the 
standard library interface to name services, which in turn uses the Name 
Service Switch.  You can add a bunch of really powerful features by 
using a cluster-specific name service.  And this can only be done by 
working with the existing user-level library code.  (Well, unless you 
build a new library within your kernel.)


This argument almost misses the main point:
Cluster systems exist to simplify the system for the end users.

When you think in terms of kernel modifications, most of the changes end 
up being tricks to prove to other developers how clever you are, not 
features that make the system easier to use (example: Plan 9). And most of 
the clever tricks end up getting in the way of the developer, rather than 
speeding up the application or really simplifying the programming model.

DSM / Distributed Shared Memory (which I prefer to call NVM, Network 
Virtual Memory) is a perfect example of this.  It certainly doesn't help 
the end user.  The only aspect an end user or system administrator sees is 
that NVM causes cascading system failures when one machine drops out of 
the cluster.

The programmer doesn't benefit either.  They initially 
think that NVM gives them an easy to use shared memory model.  They 
quickly find that it only appears to be normal memory.  To get even barely 
acceptable performance they have to treat the shared memory very 
differently than regular memory.  Variables written by different processes 
have to be segregated into different pages.  Writes have to be grouped.  You 
have to think about when to manually cache structures to avoid a re-read 
that might trigger a network page fault, but refresh that structure when 
you need potentially updated values.  Many independent attempts have 
concluded that most application ports take a long time to tune for NVM, 
and almost all end up using NVM as a stylized message passing mechanism.
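
The cost of sharing a page between writers can be illustrated with a toy 
model.  This is only a sketch under stated assumptions, not a description of 
any real NVM/DSM implementation: it assumes a single-writer protocol in which 
a page lives on exactly one node and a write from any other node first pulls 
the whole page across the network.  The names count_page_transfers and 
PAGE_SIZE are made up for the illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed page granularity of the toy model


def count_page_transfers(writes, page_size=PAGE_SIZE):
    """Count page migrations for a sequence of (node, address) writes.

    Toy single-writer protocol (an assumption, not a real DSM): a page is
    held by one node at a time, and a write from another node migrates it.
    """
    owner = {}       # page number -> node currently holding the page
    transfers = 0
    for node, address in writes:
        page = address // page_size
        if page in owner and owner[page] != node:
            transfers += 1          # page has to cross the network
        owner[page] = node
    return transfers


# Two nodes each update their own 8-byte counter 1000 times, taking turns.
same_page = [(i % 2, 8 * (i % 2)) for i in range(2000)]
separate_pages = [(i % 2, PAGE_SIZE * (i % 2)) for i in range(2000)]
# Same data layout as same_page, but each node does all its writes in one batch.
grouped_writes = [(0, 0)] * 1000 + [(1, 8)] * 1000

print("interleaved writes, same page:     ", count_page_transfers(same_page))
print("interleaved writes, separate pages:", count_page_transfers(separate_pages))
print("grouped writes, same page:         ", count_page_transfers(grouped_writes))
```

In this model, interleaved writes to two variables on one page migrate the 
page on almost every write, while segregating the variables or grouping the 
writes removes nearly all of the traffic.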


-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA


From hahn at mcmaster.ca  Wed Dec  5 08:22:55 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 5 Dec 2007 11:22:55 -0500 (EST)
Subject: [Beowulf] CSharifi Next generation of HPC
In-Reply-To: <Pine.LNX.4.44.0712041307270.28631-100000@bluewest.scyld.com>
References: <Pine.LNX.4.44.0712041307270.28631-100000@bluewest.scyld.com>
Message-ID: <Pine.LNX.4.64.0712041744400.21128@coffee.psychology.mcmaster.ca>

> DSM / Distributed Shared Memory (which I prefer to call NVM, Network
> Virtual Memory) is a perfect example of this.  It certainly doesn't help

I think the 'N' is a valuable change, but would suggest NSM is even better.
to me, the V hints too much of paging-type VM, and doesn't hint at the 
main point (sharing).

> the end user.  The only aspect an end user or system administrator sees is
> that NVM causes cascading system failures when one machine drops out of
> the cluster.

a really good NSM implementation might well provide some kind of 
persistence, even replication of the space.  it would be tricky to do
without introducing some sort of transactional support, though, and 
that seriously complicates the user-level interface.  of course, people
who do this sort of thing often worry about different consistency models
which require transaction-like directives anyway.  again the programmer's
interface becomes not so simple.

> The programmer doesn't benefit either.  They initially
> think that NVM gives them an easy to use shared memory model.  They
> quickly find that it only appears to be normal memory.  To get even barely
> acceptable performance they have to treat the shared memory very
> differently than regular memory.  Variables written by different processes
> have to be segregated into different pages.  Writes have to be grouped.  You
> have to think about when to manually cache structures to avoid a re-read
> that might trigger a network page fault, but refresh that structure when
> you need potentially updated values.

well put.  I was pondering how to say this while also pointing out that 
even within a single machine, programmers really cannot think memory is 
flat.  that is, you have to program for your caches.

level      latency     size     concurrency
register   <.5 ns      8B       1-10? (renaming)
L1         1-2 ns      64B      ~2
L2/3       4-20 ns     64B      ~1
ram        50-80 ns    64B      1-4
remote     5+ us       4KB      1
swap       10 ms       >=4KB    1

the 'remote' there is for a reference to an NSM page that has to be brought
over the net, and is assuming a fast interconnect.  it's effectively the same
as an MPI send and receive.  notice that you can't really express just a send
with NSM (it would be a blind write).
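
As a rough worked example of what the table implies, the sketch below 
estimates the time to obtain 1 MiB of useful data at each level.  It is 
back-of-the-envelope arithmetic only: it assumes one outstanding request at a 
time (the concurrency column is ignored), takes the upper end of each latency 
range (and 5 us for 'remote', which the table gives as a lower bound), and the 
function name seconds_to_fetch is made up for the illustration.

```python
# (level, latency in seconds, transfer size in bytes), taken from the table.
LEVELS = [
    ("L1",     2e-9,  64),
    ("L2/3",   20e-9, 64),
    ("ram",    80e-9, 64),
    ("remote", 5e-6,  4096),
    ("swap",   10e-3, 4096),
]


def seconds_to_fetch(useful_bytes, latency, used_per_transfer):
    """Time to obtain useful_bytes when each transfer costs one latency
    and only used_per_transfer bytes of it are actually needed."""
    transfers = -(-useful_bytes // used_per_transfer)  # ceiling division
    return transfers * latency


ONE_MIB = 1 << 20
for name, latency, size in LEVELS:
    dense = seconds_to_fetch(ONE_MIB, latency, size)   # every byte is used
    sparse = seconds_to_fetch(ONE_MIB, latency, 8)     # 8 bytes used per transfer
    print(f"{name:7s} dense {dense:.2e} s   sparse {sparse:.2e} s")
```

The 'dense' column uses every byte of each line or page; the 'sparse' column 
uses only 8 bytes per transfer, which is where poor locality becomes very 
expensive for page-sized remote references.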

I think NSM is attractive mainly at a shallow level: either for very simple,
limited applications which just want to replicate a chunk of read-only shared
memory across machines, or cases where details like locking and locality
haven't been thought out yet.


From Michael.Frese at NumerEx.com  Wed Dec  5 08:55:50 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Wed, 05 Dec 2007 09:55:50 -0700
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
Message-ID: <6.2.5.6.2.20071205085308.04eff510@NumerEx.com>

This tale is at an end, I think, because I can't bear to tell it much 
longer.  As many have suggested, there is probably a hardware 
problem, and since the hardware is old, I will do without the 
services of the troublesome machines -- It turns out that there is 
another acting up as well -- till they are replaced in a couple of weeks.

Many thanks to all who racked their brains for helpful suggestions.

I want to tell a little more of what I have learned, before I drop 
the subject altogether.

First, I did swap the cable of the bad machine with that of a good 
one with no effect on either machine.  This eliminates the 
possibility of the cable or the switch port being bad.  Since I had 
previously changed out the NIC and the switch, the only possibility is 
something inside the machine itself, probably the motherboard, but 
possibly a corrupted kernel module for handling udp -- more on that below.

Second, we could find no sign of this failure in any log.  Nor did 
/proc/net/dev show any errors.  The suggestion is that older kernels 
aren't going to detect and report such errors.  I think that's 
because they do nfs over udp.  More about that in a moment.
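
For anyone who wants to watch those counters over time, here is a minimal 
sketch of reading /proc/net/dev programmatically.  It assumes the usual Linux 
layout of two header lines followed by one line per interface with 8 receive 
and 8 transmit counters; the sample text and all numbers in it are made up, 
and the function names are hypothetical.

```python
# Illustrative /proc/net/dev contents; every number here is made up.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:14300000   10000    2    0    0     0          0         0   390000  100000    0    0    0     0       0          0
"""


def parse_proc_net_dev(text):
    """Return {interface: counters} for the byte, packet and error columns."""
    stats = {}
    for line in text.splitlines()[2:]:           # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)      # also copes with "eth0:123"
        fields = [int(value) for value in counters.split()]
        stats[name.strip()] = {
            "rx_bytes": fields[0], "rx_packets": fields[1], "rx_errs": fields[2],
            "tx_bytes": fields[8], "tx_packets": fields[9], "tx_errs": fields[10],
        }
    return stats


def average_packet_size(byte_count, packet_count):
    """Bytes per packet, or 0.0 if no packets were counted."""
    return byte_count / packet_count if packet_count else 0.0


for name, s in parse_proc_net_dev(SAMPLE).items():
    print(name,
          "rx avg", average_packet_size(s["rx_bytes"], s["rx_packets"]),
          "tx avg", average_packet_size(s["tx_bytes"], s["tx_packets"]),
          "errs", s["rx_errs"], s["tx_errs"])
```

On a live machine the same function can be fed open("/proc/net/dev").read().  
The counters are totals since boot, so differences between two readings taken 
around a test are more telling than the absolute values.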

Third, though netcat isn't on these systems, nc is.  We didn't get 
around to trying it, because we found ttcp.

Fourth, with ttcp over tcp, I found that the troubled machine could 
send 800 MB in about 20 seconds -- the wire speed for those 32-bit 
PCI slots as tested by netpipe.  However, if I used ttcp over udp, I 
couldn't reliably send even ten 8192-byte blocks!  Successive sends 
and receives would receive 3, or 1, or 5 blocks.  Don't ask me how 
these two facts are compatible.  I don't know.

Clearly, this puts a premium on using tcp for nfs.  All our attempts 
to do that failed.  Well, both of them, anyway.  In the first one, we 
unmounted the offending disk, modified its fstab entry, and remounted 
it.  We were pretty careful in the second one, where we added tcp to 
the fstab argument, unmounted all the remote disks, restarted all the 
nfsd's, and did 'mount -a'.  We got an error message in both cases 
that didn't obviously refer to the tcp argument, but the mount didn't 
happen.  As I write this, I see references to tcp mount requests in 
the mountd man page, so maybe we need to do a bit more here.

The Wikipedia article on nfs says this:  "At the time of introduction 
of Version 3, vendor support for TCP as a transport-layer protocol 
began increasing. While several vendors had already added support for 
NFS Version 2 with TCP as a transport, Sun Microsystems added support 
for TCP as a transport for NFS at the same time it added support for 
Version 3."

I'd like to know what version of nfs this server supports, but the 
man page on nfsd doesn't say.  The man page on rpc.mountd says that 
it supports nfs version 2 and version 3, but that "If the NFS kernel 
module was compiled without support for NFSv3, rpc.mountd must be 
invoked with the option --no-nfs-version 3."  Yet the 
/proc/procnum/cmdline for the running rpc.mountd doesn't show a 
--no-nfs-version argument.  Clearly, both the kernel and the server 
need to support the use of tcp.

I'd like to get any of our other machines with these older kernels at 
other sites to using tcp for nfs where possible, in order to avoid 
this in the future.  We are already seeing signs of network problems 
on them.  If that's not possible, then in order to avoid a complete 
rebuild of those systems -- there are 12 of them -- we are going to 
put a testing script together using remote invocations of md5sum and 
comparison of results to recorded local results.
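
A minimal sketch of the checking part of such a script follows.  It is not 
the script itself: the remote invocation is left out, the function names are 
made up, and the demo runs against a throwaway local file rather than an NFS 
mount.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(path, expected_hex, repeats=5):
    """Re-read the file several times; count digests that differ from the record."""
    return sum(1 for _ in range(repeats) if md5_of_file(path) != expected_hex)


# Demo on a temporary local file.  On a cluster, the path would be a file on
# the NFS mount and the recorded digest would come from the server's local disk.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 81920)
    demo_path = handle.name
recorded = md5_of_file(demo_path)
print("mismatches:", count_mismatches(demo_path, recorded))
os.remove(demo_path)
```

One caveat: repeated reads on the same client may be served from its page 
cache instead of going back over the wire, so a real test would want files 
larger than client memory, or a remount between passes.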

Thanks again!


Mike


At 08:54 AM 12/4/2007, you wrote:
>Mark,
>
>Thanks for your helpful comments.
>
>At 11:31 PM 12/3/2007, you wrote:
>>>I am guessing you are using TCP NFS mounts as well?  TCP forces 
>>>retries in the event of bad packets.  UDP doesn't force this, but 
>>>the NFS protocol will
>>
>>UDP has a checksum as well, though it's only 16b.  then again, the TCP
>>checksum isn't all that strong for today's data rates either.
>
> From reading the man page on nfs on the systems with the 2.4 
> kernels, it looks like the default for an nfs mount is udp.  It 
> also looks like tcp is not really an option until nfs v4, so it may 
> be something to try on the 2.6 kernels that I have on some of my 
> newer machines at another site.
>
>>you should definitely examine /proc/net/dev on involved machines.
>
>I hadn't known about /proc/net/dev.  When I check there, I see no 
>transmit errors on the server side and no receive errors on the 
>client side.  That's odd, because the other thing I see is that the 
>average packet size received (bytes received divided by packets 
>received) on the client side is 3.9, while on the server side, the 
>average packet size sent is 1430.  In other words, there are many 
>more packets received than there ought to be.  That's very 
>fishy.  It's probably the result of the way the packet count is done 
>and reported.  I.e., it may be that all the received packets -- good 
>and bad -- are counted, but only the bytes in the good ones are 
>counted, with some similar problem on the server side.  I think the 
>statistics are aggregate since the last boot, so they may not be 
>just from the troublesome tests I was performing, either.
>
>>I would attempt to reduce the complexity of your testing.
>>for instance, can a node write and verify to its local disk
>>without problem?
>
>The local disk read seems rock solid in comparison to the NFS 
>one.  The local md5sum produces the same result time after time, 
>which is just not the case for the remote.
>
>>can it stream data over tcp sockets (netcat or the like) without 
>>corruption or obvious problems reflected
>>in /proc/net/dev?
>
>netcat is not on my systems.  Looks like I have to get someone to 
>download and build it for me, and try the streaming tests you recommend.
>
>>does ethtool tell you anything about the config of the nic?
>
>Not on the 2.4 systems, though it seems to tell me a little on the 2.6's.
>
>>comparing tcp vs udp NFS would be sensible
>>as well - varying the packet size, too.  switching client and/or 
>>server to a modern 2.6 kernel may be instructive.
>
>Upgrading the kernel is probably the only way I'll get nfs over 
>tcp.  Given that these systems are headed out the door, I'm not sure 
>that's a good use of our time.  But it may be worth doing on our new 
>and newer systems.
>
>Thanks again!
>
>
>Mike
>


From jlb17 at duke.edu  Wed Dec  5 09:26:20 2007
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Wed, 5 Dec 2007 12:26:20 -0500 (EST)
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
	<6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
Message-ID: <alpine.LRH.0.99999.0712051221360.11349@hogwarts.egr.duke.edu>

On Wed, 5 Dec 2007 at 9:55am, Michael H. Frese wrote

> Clearly, this puts a premium on using tcp for nfs.  All our attempts to do 
> that failed.  Well, both of them, anyway.  In the first one, we unmounted the 
> offending disk, modified its fstab entry, and remounted it.  We were pretty 
> careful in the second one, where we added tcp to the fstab argument, 
> unmounted all the remote disks, restarted all the nfsd's, and did 'mount -a'. 
> We got an error message in both cases that didn't obviously refer to the tcp 
> argument, but the mount didn't happen.  As I write this, I see references to 
> tcp mount requests in the mountd man page, so maybe we need to do a bit more 
> here.
>
> The Wikipedia article on nfs says this:  "At the time of introduction of 
> Version 3, vendor support for TCP as a transport-layer protocol began 
> increasing. While several vendors had already added support for NFS Version 2 
> with TCP as a transport, Sun Microsystems added support for TCP as a 
> transport for NFS at the same time it added support for Version 3."
>
> I'd like to know what version of nfs this server supports, but the man page 
> on nfsd doesn't say.  The man page on rpc.mountd says that it supports nfs 
> version 2 and version 3, but that "If the NFS kernel module was compiled 
> without support for NFSv3, rpc.mountd must be invoked with the option 
> --no-nfs-version 3."  Yet the /proc/procnum/cmdline for the running 
> rpc.mountd doesn't show a --no-nfs-version argument.  Clearly, both the 
> kernel and the server need to support the use of tcp.

Looking back through this thread, I don't see any details on the NFS 
server, only the clients.  What are the hardware and OS version of the NFS 
server?

Grepping through the kernel config for RH9 shows it definitely did not 
support NFS over TCP *as a server*.  If your server is newer, though, and 
does support a TCP nfsd, then you may have to look at other stuff 
(firewall rules, TCP wrappers, etc) as to why the TCP mounts didn't work.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


From Michael.Frese at NumerEx.com  Wed Dec  5 09:57:23 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Wed, 05 Dec 2007 10:57:23 -0700
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <alpine.LRH.0.99999.0712051221360.11349@hogwarts.egr.duke.edu>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
	<6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
	<alpine.LRH.0.99999.0712051221360.11349@hogwarts.egr.duke.edu>
Message-ID: <6.2.5.6.2.20071205105246.04f3e988@NumerEx.com>

Joshua,

Thanks for the info on the nfs server in RH9.  We are using that 
distro unmodified out of the box, so to speak, so that clearly 
blocks any possibility of fixing the problem in software.  As for 
the hardware, it was described earlier as follows:

[Old hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual 
(Tyan...) chip motherboards, both running Red Hat 9 with the 
2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; 
and a NetGear GS108 8-port copper 1 Gb/s switch.  The single 
processor motherboards have 32-bit PCI slots so their network speeds 
are limited to 300 Mbps as shown by netpipe.  All of the LEDs at the 
ends of the cables show 1000Mb connections.]

Thanks again for your help.


Mike


At 10:26 AM 12/5/2007, Joshua Baker-LePain wrote:
>On Wed, 5 Dec 2007 at 9:55am, Michael H. Frese wrote
>
>>Clearly, this puts a premium on using tcp for nfs.  All our 
>>attempts to do that failed.  Well, both of them, anyway.  In the 
>>first one, we unmounted the offending disk, modified its fstab 
>>entry, and remounted it.  We were pretty careful in the second one, 
>>where we added tcp to the fstab argument, unmounted all the remote 
>>disks, restarted all the nfsd's, and did 'mount -a'. We got an 
>>error message in both cases that didn't obviously refer to the tcp 
>>argument, but the mount didn't happen.  As I write this, I see 
>>references to tcp mount requests in the mountd man page, so maybe 
>>we need to do a bit more here.
>>
>>The Wikipedia article on nfs says this:  "At the time of 
>>introduction of Version 3, vendor support for TCP as a 
>>transport-layer protocol began increasing. While several vendors 
>>had already added support for NFS Version 2 with TCP as a 
>>transport, Sun Microsystems added support for TCP as a transport 
>>for NFS at the same time it added support for Version 3."
>>
>>I'd like to know what version of nfs this server supports, but the 
>>man page on nfsd doesn't say.  The man page on rpc.mountd says that 
>>it supports nfs version 2 and version 3, but that "If the NFS 
>>kernel module was compiled without support for NFSv3, rpc.mountd 
>>must be invoked with the option --no-nfs-version 3."  Yet the 
>>/proc/procnum/cmdline for the running rpc.mountd doesn't show a 
>>--no-nfs-version argument.  Clearly, both the kernel and the server 
>>need to support the use of tcp.
>
>Looking back through this thread, I don't see any details on the NFS 
>server, only the clients.  What are the hardware and OS version of 
>the NFS server?
>
>Grepping through the kernel config for RH9 shows it definitely did 
>not support NFS over TCP *as a server*.  If your server is newer, 
>though, and does support a TCP nfsd, then you may have to look at 
>other stuff (firewall rules, TCP wrappers, etc) as to why the TCP 
>mounts didn't work.
>
>--
>Joshua Baker-LePain
>QB3 Shared Cluster Sysadmin
>UCSF


From mg.mailing-list at laposte.net  Wed Dec  5 00:15:17 2007
From: mg.mailing-list at laposte.net (Mathieu Gontier)
Date: Wed, 05 Dec 2007 09:15:17 +0100
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <e4d4fd070712041205i87ab7f8g4ba96a830f56743e@mail.gmail.com>
References: <4755A486.5000109@laposte.net>
	<e4d4fd070712041205i87ab7f8g4ba96a830f56743e@mail.gmail.com>
Message-ID: <47565E15.8090502@laposte.net>

Sorry. Indeed, the #include <mpi.h> should not be there: it is a relic of 
some flags added to understand the problem. The test case is 
correct without this include.
So, Peter, you understood morphmpi.h correctly ;-)

Mathieu Gontier
Core Development Engineer

Read the attached v-card for telephone, fax, address
Look at our web-site http://www.fft.be
 

Peter St. John wrote:
> Mathieu,
> I didn't spot why you included <mpi.h>? It seems you work thru 
> morph_mpi.h wrappers, right? Perhaps I misunderstand?
> Peter
>
> On Dec 4, 2007 2:03 PM, Mathieu Gontier <mg.mailing-list at laposte.net> wrote:
>
>     Hi all,
>
>     I am currently working on a project named MorphMPI. Its main purpose
>     is to offer a generic interface to the developers of parallel
>     applications, and to let them choose the MPI library/interconnect at
>     runtime by rebuilding a shared morph library against the desired MPI
>     library. (The final application is linked against a shared morph
>     library instead of the real MPI library.)
>     For more information about that, you can follow these links:
>     - http://www.clustermonkey.net//content/view/213/32/
>     - http://sourceforge.net/projects/morphmpi
>
>     I have run into a small problem, whatever MPI library is used (I
>     tried MPICH-1.2.5.2, MPICHGM and IntelMPI).
>     When MorphMPI is linked statically with my parallel application,
>     everything is OK; but when MorphMPI is linked dynamically with my
>     parallel application, MPI_Get_count returns a wrong value.
>
>     I concluded that it is difficult to use an MPI library through a
>     shared library. I wonder if someone has more information about it (in
>     this case, you're welcome ;-) )
>
>     Thank you for your support,
>     Mathieu.
>
>     PS: my problem happens in the following example,
>
>     #include <morphmpi.h>
>     #include <mpi.h>   /* leftover from debugging; not needed */
>     #include <stdio.h>
>
>     int main( int argc, char* argv[] )
>     {
>       int np, me, ier, flag=0, msglen=-1 ;
>       MorphMPI_Request request ;
>       MorphMPI_Status status ;
>       int buf[1] ; buf[0]=-1 ;
>
>       ier = MorphMPI_Init( &argc, &argv ) ;
>       ier = MorphMPI_Comm_size( MorphMPI_COMM_WORLD, &np ) ;
>       ier = MorphMPI_Comm_rank( MorphMPI_COMM_WORLD, &me ) ;
>
>       if( me > 1 ) printf( "I am the useless processor #%d on %d\n", me, np ) ;
>       else printf( "I am the working processor #%d on %d\n", me, np ) ;
>
>       ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;
>
>       /* print the address of the status object; %p expects a void* */
>       printf( "<<< %p >>>\n", (void*)&status ) ;
>
>       /* rank 0 sends one int to rank 1 */
>       if( ! me ) {
>         buf[0] = 69 ;
>         ier = MorphMPI_Isend( buf, 1, MorphMPI_INT, 1, 1, MorphMPI_COMM_WORLD, &request ) ;
>         ier = MorphMPI_Wait( &request, &status ) ;
>       }
>
>       ier = MorphMPI_Barrier( MorphMPI_COMM_WORLD ) ;
>
>       /* rank 1 receives it and checks the element count */
>       if( me == 1 ) {
>         ier = MorphMPI_Irecv( buf, 1, MorphMPI_INT, 0, 1, MorphMPI_COMM_WORLD, &request ) ;
>         ier = MorphMPI_Wait( &request, &status ) ;
>         ier = MorphMPI_Get_count( &status, MorphMPI_INT, &msglen ) ;
>
>         if( msglen != 1 ) printf( "ERROR: The length of the message is not 1\n" ) ;
>         else printf( "SUCCESS !\n" ) ;
>       }
>
>       ier = MorphMPI_Finalize() ;
>       return 0 ;
>     }
>
>
>
>     --
>     Mathieu Gontier
>     Core Development Engineer
>
>     Read the attached v-card for telephone, fax, address
>     Look at our web-site http://www.fft.be
>
>
>
>


From mg.mailing-list at laposte.net  Wed Dec  5 00:28:05 2007
From: mg.mailing-list at laposte.net (Mathieu Gontier)
Date: Wed, 05 Dec 2007 09:28:05 +0100
Subject: [Beowulf] use a MPI library thought a shared library
In-Reply-To: <4755CB51.5050802@scalableinformatics.com>
References: <4755A486.5000109@laposte.net>
	<4755CB51.5050802@scalableinformatics.com>
Message-ID: <47566115.5000009@laposte.net>

Yep, I use ldd every day. But here the problem comes from a corrupted 
structure in MorphMPI and MPI:

typedef struct{
  int MorphMPI_SOURCE;
  int MorphMPI_TAG;
  int MorphMPI_ERROR;
  void* mpi_status ;
} MorphMPI_Status ;

where the member mpi_status is used to point to a real MPI_Status. In MPICH:

typedef struct{
  int MPI_SOURCE;
  int MPI_TAG;
  int MPI_ERROR;
  int count ;
} MPI_Status ;

Then, when my MorphMPI_Status is passed to MorphMPI_Get_count(), the 
member MorphMPI_Status::mpi_status itself is not corrupted, but 
MorphMPI_Status::mpi_status::count is: the value should be 4 
and not "random".

I tried to manipulate the structure MorphMPI_Status (adding another integer 
to align it on 64 bits, keeping only the void*, ...) without success.
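
One way to inspect the layout question independently of any MPI build is to 
mirror the two structures with Python's ctypes and print sizes and member 
offsets.  This is only a sketch: the class names are made up, the fields are 
copied from the definitions above, and it says nothing about the actual cause 
of the problem.  It does show that the wrapper stores nothing but an address, 
so the MPI_Status behind it has to stay alive for as long as the wrapper is 
read.

```python
import ctypes


class MPIStatus(ctypes.Structure):
    """Mirror of the MPICH MPI_Status shown above (illustration only)."""
    _fields_ = [("MPI_SOURCE", ctypes.c_int),
                ("MPI_TAG", ctypes.c_int),
                ("MPI_ERROR", ctypes.c_int),
                ("count", ctypes.c_int)]


class MorphMPIStatus(ctypes.Structure):
    """Mirror of MorphMPI_Status: three ints plus a pointer to the real status."""
    _fields_ = [("MorphMPI_SOURCE", ctypes.c_int),
                ("MorphMPI_TAG", ctypes.c_int),
                ("MorphMPI_ERROR", ctypes.c_int),
                ("mpi_status", ctypes.c_void_p)]


def layout(struct_type):
    """Return {member name: byte offset} for a ctypes structure."""
    return {name: getattr(struct_type, name).offset
            for name, _ in struct_type._fields_}


print("MPI_Status      size", ctypes.sizeof(MPIStatus), layout(MPIStatus))
print("MorphMPI_Status size", ctypes.sizeof(MorphMPIStatus), layout(MorphMPIStatus))

# The wrapper only stores an address; 'real' must outlive every use of it.
real = MPIStatus(0, 1, 0, 4)
wrapped = MorphMPIStatus(0, 1, 0, ctypes.addressof(real))
recovered = MPIStatus.from_address(wrapped.mpi_status)
print("count read back through the pointer:", recovered.count)
```

On a typical 64-bit platform the compiler already pads the three ints so that 
the pointer is naturally aligned, which the printed offsets make visible.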

As a reminder, this problem appears only when MPI is used through a 
dynamically linked MorphMPI library.

Does anyone have an idea?

Mathieu Gontier
Core Development Engineer

Read the attached v-card for telephone, fax, address
Look at our web-site http://www.fft.be
 

Joe Landman wrote:
> Greetings Mathieu:
>
> Mathieu Gontier wrote:
>
> [...]
>
>> I have run into a small problem, whatever MPI library is used (I tried 
>> MPICH-1.2.5.2, MPICHGM and IntelMPI).
>> When MorphMPI is linked statically with my parallel application, 
>> everything is OK; but when MorphMPI is linked dynamically with my 
>> parallel application, MPI_Get_count returns a wrong value.
>>
>> I concluded that it is difficult to use an MPI library through a shared 
>> library. I wonder if someone has more information about it (in this 
>
> Not likely.  I would suggest ldd.  It is your friend.
>
> For example:
>
> joe at pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
>         libm.so.6 => /lib/libm.so.6 (0x00002b5409d17000)
>         libmpi.so.0 => not found
>         libopen-rte.so.0 => not found
>         libopen-pal.so.0 => not found
>         librt.so.1 => /lib/librt.so.1 (0x00002b5409f99000)
>         libdl.so.2 => /lib/libdl.so.2 (0x00002b540a1a2000)
>         libnsl.so.1 => /lib/libnsl.so.1 (0x00002b540a3a6000)
>         libutil.so.1 => /lib/libutil.so.1 (0x00002b540a5c0000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0x00002b540a7c3000)
>         libc.so.6 => /lib/libc.so.6 (0x00002b540a9de000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002b5409af9000)
>
> Notice that libmpi.so.0 is not found, so I can't run this by hand. 
> Unless I force the issue using LD_LIBRARY_PATH
>
> joe at pegasus-i:~/workspace/source-mpi$ export LD_LIBRARY_PATH="/home/joe/local/lib64/:/home/joe/local/lib/"
> joe at pegasus-i:~/workspace/source-mpi$ ldd matmul_mpi_3.exe
>         libm.so.6 => /lib/libm.so.6 (0x00002ae35ca50000)
>         libmpi.so.0 => /home/joe/local/lib/libmpi.so.0 (0x00002ae35ccd1000)
>         libopen-rte.so.0 => /home/joe/local/lib/libopen-rte.so.0 (0x00002ae35cfe8000)
>         libopen-pal.so.0 => /home/joe/local/lib/libopen-pal.so.0 (0x00002ae35d2b3000)
>         librt.so.1 => /lib/librt.so.1 (0x00002ae35d514000)
>         libdl.so.2 => /lib/libdl.so.2 (0x00002ae35d71d000)
>         libnsl.so.1 => /lib/libnsl.so.1 (0x00002ae35d921000)
>         libutil.so.1 => /lib/libutil.so.1 (0x00002ae35db3b000)
>         libpthread.so.0 => /lib/libpthread.so.0 (0x00002ae35dd3e000)
>         libc.so.6 => /lib/libc.so.6 (0x00002ae35df59000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002ae35c832000)
>
> and it might even run ...
>
> joe at pegasus-i:~/workspace/source-mpi$ ./matmul_mpi_3.exe
> D[tid=0]: running on machine = pegasus-i
> D: checking arguments: N_args=1
> D: arg[0] = ./matmul_mpi_3.exe
> Allocating memory ...
> array size in MB = 7.629 MB
>  (remember, you have 2 of these)normalization a: 0.05510,  b: 0.00173
> 0 : loop_min = 0, loop_max = 1000
> ...
>
> Do you have some sort of LD_LIBRARY_PATH set up?  Or something set in 
> /etc/ld.so.conf that points to where these things are?  Remember, 
> mpirun/mpiexec's alternative purpose in life is to set up the correct 
> run time environment for you, so you might want to see what is going 
> on with the environment in your equivalent command.
>
>


From gdjacobs at gmail.com  Wed Dec  5 13:09:33 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Wed, 05 Dec 2007 15:09:33 -0600
Subject: [Beowulf] BIOS
In-Reply-To: <a8d96dec0708121612q35825dddh795a80c6e6fff807@mail.gmail.com>
References: <20070809141520.GA605@gretchen.aei.uni-hannover.de>	<Pine.LNX.4.64.0708112338220.6982@lilith.rgb.private.net>	<46BEA3B7.4010806@aei.mpg.de>	<Pine.LNX.4.64.0708121538460.10277@lilith.rgb.private.net>
	<a8d96dec0708121612q35825dddh795a80c6e6fff807@mail.gmail.com>
Message-ID: <4757138D.6070901@gmail.com>

Bruno Coutinho wrote:
> 
> 
> 2007/8/12, Robert G. Brown <rgb at phy.duke.edu>:
> 
>     On Sun, 12 Aug 2007, Carsten Aulbert wrote:
> 
>     > Thanks for the link. In principle we have everything working already
>     > that way, but want to "excel" a bit more:
> 
>     No, no, no.  You want to "ooffice" a little more...;-)
> 
>     >
>     > (1) Right now we use memdisk from the syslinux/isolinux family to boot
>     > the dos image. Booting an exact floppy image works fine, but for some
>     > part in (2) we might need more space than a 2.88 MB floppy or its
>     > extended counterpart gives us. Thus we are currently trying to boot a hd
>     > image which seems to be a bit trickier than a simple floppy image
>     > (getting boot code, partition table right for example).
>     >
>     > (2) We want to have some feedback from the process and don't want to
>     > have an automatic reboot after a possible failure because in the worst
>     > case this might "brickify" a node. Once I had the problem that
>     > automatic BIOS flashing worked, but one node - which looked similar but
>     > behaved differently - was not able to finish the flashing procedure
>     > successfully. Since I was monitoring the node I was able to redo the
>     > flashing with a different option [1].
>     >
>     > Anyway, that's the reason why we want to include a dhcp client and some
>     > means, possibly a ssh or rsh client along with the needed packetdriver,
>     > in the image and notify the server that way that it successfully
>     > flashed the BIOS and set our custom settings correctly. Only after that
>     > should the nodes continue FAIing.
> 
>     No, that's reasonable -- I just didn't understand.   Autoexec.bat is dumb
>     as a post in comparison even with /bin/sh, too.
> 
> 
> It's dumb, but not so dumb. :-)
> The syntax is crappy but it has this feature:
> http://www.robvanderwoude.com/errorlevel.html
> 
> Note: REM is a comment initiator like #.

DOS batch files were actually surprisingly capable. It's just that what
they could do was not as well documented as, for example, Bash is today.

-- 
Geoffrey D. Jacobs


From Bogdan.Costescu at iwr.uni-heidelberg.de  Thu Dec  6 09:37:11 2007
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 6 Dec 2007 18:37:11 +0100 (CET)
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To: <4755C518.5070409@scalableinformatics.com>
References: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
	<4755C518.5070409@scalableinformatics.com>
Message-ID: <Pine.LNX.4.64.0712061830460.27457@dingo.iwr.uni-heidelberg.de>

On Tue, 4 Dec 2007, Joe Landman wrote:

> 	 a) bad driver
> 	 b) bad NIC

... or a combination of these which translates into RX and/or TX 
checksumming offload not working properly; the driver then lies to the 
upper levels and the error is passed through. I don't remember if this 
was even possible at the time of RHL9, but try to run:

ethtool -k ethX

and if any of the checksums are turned on, you can turn them off with:

ethtool -K ethX rx off
ethtool -K ethX tx off

(note: ethtool might not be installed by default, check the install 
media if there was a package with this name)

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De


From landman at scalableinformatics.com  Thu Dec  6 14:18:12 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 06 Dec 2007 17:18:12 -0500
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To: <Pine.LNX.4.64.0712061830460.27457@dingo.iwr.uni-heidelberg.de>
References: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
	<4755C518.5070409@scalableinformatics.com>
	<Pine.LNX.4.64.0712061830460.27457@dingo.iwr.uni-heidelberg.de>
Message-ID: <47587524.3070107@scalableinformatics.com>

Bogdan Costescu wrote:
> On Tue, 4 Dec 2007, Joe Landman wrote:
> 
>>      a) bad driver
>>      b) bad NIC
> 
> ... or a combination of these which translates into RX and/or TX 
> checksumming offload not working properly; the driver then lies to the 
> upper levels and the error is passed through. I don't remember if this 
> was even possible at the time of RHL9, but try to run:

I think ethtool was a post-2.4 kernel utility.  As I remember, there was
a mii-tool that gave something roughly like that in functionality.


-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From Bogdan.Costescu at iwr.uni-heidelberg.de  Thu Dec  6 14:44:14 2007
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 6 Dec 2007 23:44:14 +0100 (CET)
Subject: [Beowulf] Re: NFS Read Errors
In-Reply-To: <47587524.3070107@scalableinformatics.com>
References: <E1Izedx-0000NO-3y@mendel.bio.caltech.edu>
	<4755C518.5070409@scalableinformatics.com>
	<Pine.LNX.4.64.0712061830460.27457@dingo.iwr.uni-heidelberg.de>
	<47587524.3070107@scalableinformatics.com>
Message-ID: <Pine.LNX.4.64.0712062339240.18727@kenzo.iwr.uni-heidelberg.de>

On Thu, 6 Dec 2007, Joe Landman wrote:

> I think ethtool was a post-2.4 kernel utility.  As I remember, there was a
> mii-tool that gave something roughly like that in functionality.

I just looked in the pristine 2.4.20 source and found 8139cp with 
references to the 8169 chip used on the Netgear GA311 and a routine 
called "cp_ethtool_ioctl" with switch statements for RX and TX 
checksumming... whether ethtool was "officially" used at that moment, 
I can't remember, but some drivers certainly had the support.

mii-tool was supposed to be used only for management of the media
(which MII transceiver, what speed, restart autonegotiation, etc.)

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De


From fumie.costen at manchester.ac.uk  Thu Dec  6 06:28:46 2007
From: fumie.costen at manchester.ac.uk (f.costen@cs.man.ac.uk)
Date: Thu, 06 Dec 2007 14:28:46 +0000
Subject: [Beowulf] example BlueGene fortran MPI program
In-Reply-To: <e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>
References: <2accc2ff0712011938r58867701tc9988135f9edeb2a@mail.gmail.com>	<Pine.LNX.4.64.0712030906190.11771@lilith.rgb.private.net>
	<e4d4fd070712030727h2c9fe4c9j50c378ac33e65131@mail.gmail.com>
Message-ID: <4758071E.1030702@cs.man.ac.uk>

Dear All,

I am developing and running a Fortran MPI program
on our local cluster and a national Beowulf cluster.
The code runs happily in these environments with
the Intel Fortran compiler.
I now have to port my code to IBM BlueGene.

Initially a hello-world program worked.
Then our real code compiled without any problem.
But when we launch the job
we get lots of error messages from MPI_Attr_get.
If you have dealt with BlueGene, in the past or currently,
and still have
some example program in Fortran, may I have it?

Thank you very much indeed
Best wishes
Fumie


From amjad11 at gmail.com  Thu Dec  6 21:47:42 2007
From: amjad11 at gmail.com (amjad ali)
Date: Fri, 7 Dec 2007 10:47:42 +0500
Subject: [Beowulf] specific motherboard???
Message-ID: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>

 Hello all,

I want to build a Beowulf cluster of 16+1 nodes, with each node having one
Intel Core2Duo (2.66 GHz, FSB 1333MHz, 4MB L2) processor and GigE as the
interconnect. On this cluster, I would run my PETSc-based CFD/FEM codes
(REQUIRING VERY FAST MEMORY/high memory bandwidth). Please help me
select one of the following boards:

1) Intel Server board S3200SH, System Bus 1333MHz, supporting 240-pin DDR2
800 MHz RAM
2) Intel Desktop board DX38BT, System Bus 1333MHz, supporting 240-pin DDR3
1333 MHz RAM

Note the RAM speed difference. Keeping the cluster running all the time
and having many users logged on simultaneously is not a concern. The
cluster may be dedicated to one user whenever required. But it
may be necessary to run a code for several days.

Would the desktop board DX38BT be suitable to run the cluster for several
hours/days?
Which board do you recommend for this scenario?

Regards,
Amjad Ali.

From rgb at phy.duke.edu  Fri Dec  7 04:28:53 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 7 Dec 2007 07:28:53 -0500 (EST)
Subject: [Beowulf] specific motherboard???
In-Reply-To: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712070716010.29089@lilith.rgb.private.net>

On Fri, 7 Dec 2007, amjad ali wrote:

> Hello all,
>
> I want to build a Beowulf cluster of 16+1 nodes, with each node having one
> Intel Core2Duo (2.66 GHz, FSB 1333MHz, 4MB L2) processor and GigE as the
> interconnect. On this cluster, I would run my PETSc-based CFD/FEM codes
> (REQUIRING VERY FAST MEMORY/high memory bandwidth). Please help me
> select one of the following boards:
>
> 1) Intel Server board S3200SH, System Bus 1333MHz, supporting 240-pin DDR2
> 800 MHz RAM
> 2) Intel Desktop board DX38BT, System Bus 1333MHz, supporting 240-pin DDR3
> 1333 MHz RAM
>
> Note the RAM speed difference. Keeping the cluster running all the time
> and having many users logged on simultaneously is not a concern. The
> cluster may be dedicated to one user whenever required. But it
> may be necessary to run a code for several days.
>
> Would the desktop board DX38BT be suitable to run the cluster for several
> hours/days?

If you're asking whether or not it will work or wear out early or
something, it will be fine.  You can run "Desktop" boards 24x7 all year
long as long as you mount them in a case configuration and air
conditioned environment that keeps them COOL.  And that's true for the
server mobo too.

> Which Board you recommend for this scenario?

If memory is your (or "a") significant bottleneck, the one with the
highest memory bandwidth seems like it would outperform the other,
doesn't it?

But the best thing to do is to get one of each now and run your
particular code mix on each of them and compare times vs costs.  Loser
(speedwise on a memory-intensive computation) gets to be the head/server
node, which won't care quite so much about memory if it doesn't
participate in the computation.  [If it IS going to participate, use the
loser for your desktop or something.]

     rgb

>
> Regards,
> Amjad Ali.
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From Michael.Frese at NumerEx.com  Fri Dec  7 09:28:14 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Fri, 07 Dec 2007 10:28:14 -0700
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <47597B84.30704@cs.earlham.edu>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
	<6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
	<47597B84.30704@cs.earlham.edu>
Message-ID: <6.2.5.6.2.20071207102734.04fd0110@NumerEx.com>

Skylar,

An interesting suggestion.  I'll look into it.  Thanks.


Mike


At 09:57 AM 12/7/2007, Skylar Thompson wrote:
>Michael H. Frese wrote:
> > Fourth, with ttcp over tcp, I found that the troubled machine could
> > send 800 MB in about 20 seconds -- the wire speed for those 32-bit PCI
> > slots as tested by netpipe.  However, if I used ttcp over udp, I
> > couldn't reliably send even ten 8192-byte blocks!  Successive sends
> > and receives would receive 3, or 1, or 5 blocks.  Don't ask me how
> > these two facts are compatible.  I don't know.
>
>This sounds like it could be a flow control issue. TCP does its own flow
>control, but UDP doesn't. Some Ethernet switches do IEEE 802.3x, which
>might help your UDP performance.
>
>--
>-- Skylar Thompson (skylar at cs.earlham.edu)
>-- http://www.cs.earlham.edu/~skylar/
>
>
>


From bill at cse.ucdavis.edu  Fri Dec  7 10:40:07 2007
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Fri, 07 Dec 2007 10:40:07 -0800
Subject: [Beowulf] specific motherboard???
In-Reply-To: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
Message-ID: <47599387.6080400@cse.ucdavis.edu>

amjad ali wrote:
>  Hello all,
> 
> I want to build a Beowulf cluster of 16+1 nodes, with each node having one
> Intel Core2Duo (2.66 GHz, FSB 1333MHz, 4MB L2) processor and GigE as the
> interconnect. On this cluster, I would run my PETSc-based CFD/FEM codes
> (REQUIRING VERY FAST MEMORY/high memory bandwidth). Please help me
> select one of the following boards:

Keep in mind that the Petsc faq mentions:
 A fast, low-latency interconnect; any ethernet, even 10 gigE cannot provide
 the needed performance.

I wouldn't assume that your workload will scale with gigE to 16 nodes.

> 1) Intel Server board S3200SH, System Bus 1333MHz, supporting 240-pin DDR2
> 800 MHz RAM
> 2) Intel Desktop board DX38BT, System Bus 1333MHz, supporting 240-pin DDR3
> 1333 MHz RAM

The application performance difference should be minimal to nil.  Two 800 MHz
64-bit DIMMs can easily saturate the 64-bit 1333 MHz FSB.  I've not personally
tested DDR3, but all the tests I've seen show minimal performance improvements
at substantial price increases.
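
As a rough check on the saturation argument, here is a minimal Python sketch
of the peak-bandwidth arithmetic. It assumes the quoted figures are transfer
rates in MT/s, that every bus is 64 bits wide, and that the two DIMMs run in
dual-channel mode (one per channel). The function name is made up for
illustration, and sustained bandwidth on real hardware will be lower than
these peaks.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s (decimal), ignoring all overheads."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: "1333 MHz FSB" means 1333 MT/s on a single 64-bit bus.
fsb = peak_bandwidth_gbs(1333)
# Assumption: two DIMMs, one per memory channel.
ddr2_800 = peak_bandwidth_gbs(800, channels=2)
ddr3_1333 = peak_bandwidth_gbs(1333, channels=2)

print(f"FSB       : {fsb:.1f} GB/s")        # 10.7 GB/s
print(f"DDR2-800  : {ddr2_800:.1f} GB/s")   # 12.8 GB/s
print(f"DDR3-1333 : {ddr3_1333:.1f} GB/s")  # 21.3 GB/s
```

Under those assumptions dual-channel DDR2-800 already exceeds what the FSB
can carry, so the extra DDR3 bandwidth would have nowhere to go.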

> Note the RAM speed difference. Keeping the cluster running all the time
> and having many users logged on simultaneously is not a concern. The
> cluster may be dedicated to one user whenever required. But it
> may be necessary to run a code for several days.
> 
> Would the desktop board DX38BT be suitable to run the cluster for several
> hours/days?
> Which board do you recommend for this scenario?

Either. You didn't mention the budget, but I'd at least consider a faster network.


From coutinho at dcc.ufmg.br  Fri Dec  7 11:36:37 2007
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Fri, 7 Dec 2007 17:36:37 -0200
Subject: [Beowulf] specific motherboard???
In-Reply-To: <47599387.6080400@cse.ucdavis.edu>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
	<47599387.6080400@cse.ucdavis.edu>
Message-ID: <a8d96dec0712071136m6db525a8k49e93bd972692b6f@mail.gmail.com>

2007/12/7, Bill Broadley <bill at cse.ucdavis.edu>:
>
> amjad ali wrote:
> >  Hello all,
> >
> > I want to build a Beowulf cluster of 16+1 nodes, with each node having one
> > Intel Core2Duo (2.66 GHz, FSB 1333MHz, 4MB L2) processor and GigE as the
> > interconnect. On this cluster, I would run my PETSc-based CFD/FEM codes
> > (REQUIRING VERY FAST MEMORY/high memory bandwidth). Please help me
> > select one of the following boards:
>
> Keep in mind that the Petsc faq mentions:
> A fast, low-latency interconnect; any ethernet, even 10 gigE cannot provide
> the needed performance.
>
> I wouldn't assume that your workload will scale with gigE to 16 nodes.
>
> > 1) Intel Server board S3200SH, System Bus 1333MHz, supporting 240-pin DDR2
> > 800 MHz RAM
> > 2) Intel Desktop board DX38BT, System Bus 1333MHz, supporting 240-pin DDR3
> > 1333 MHz RAM
>
> The application performance difference should be minimal to nil.  Two 800 MHz
> 64-bit DIMMs can easily saturate the 64-bit 1333 MHz FSB.  I've not personally
> tested DDR3, but all the tests I've seen show minimal performance improvements
> at substantial price increases.
>
> > Note the RAM speed difference. Keeping the cluster running all the time
> > and having many users logged on simultaneously is not a concern. The
> > cluster may be dedicated to one user whenever required. But it
> > may be necessary to run a code for several days.
> >
> > Would the desktop board DX38BT be suitable to run the cluster for several
> > hours/days?
> > Which board do you recommend for this scenario?
>
> Either. You didn't mention the budget, but I'd at least consider a faster
> network.


The DX38BT has an 82566DC Gigabit controller, which isn't capable of jumbo frames
(frame size is limited to 1500 bytes).
The S3200SH has an 82566 controller too, but also an 82541PI that supports jumbo
frames up to 16 KB (your switch will probably support frames up to 10 KB).

From angelv at iac.es  Fri Dec  7 11:48:45 2007
From: angelv at iac.es (angelv at iac.es)
Date: Fri,  7 Dec 2007 19:48:45 +0000 (WET)
Subject: [Beowulf] Automated vacation reply
Message-ID: <20071207194845.06081156A84@ginebra>


Sobre su mensaje / about your message:

       De/From: beowulf-request at beowulf.org
       Para/To: beowulf at beowulf.org
Asunto/Subject: Beowulf Digest, Vol 46, Issue 10


I will be away from my office for the biggest part of December, January
and February. I will check and reply e-mails occasionally, but expect
delays.

Ángel de Vicente


From toon.knapen at gmail.com  Fri Dec  7 04:26:42 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Fri, 7 Dec 2007 13:26:42 +0100
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>

On this list there is almost unanimous agreement that MPI is the way to go
for parallelism and that combining multi-threading (MT) and message-passing
(MP) is not even worth it; just sticking to MP is all that is necessary.

However, in real-life most are talking and investing in MT while very few
are interested in MP. I also just read on the blog of Arch Robison: "TBB
perhaps gives up a little performance short of optimal so you don't have to
write message-passing" (here:
http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/
 )

How come there is almost unanimous agreement in the Beowulf community while
the rest are almost unanimously convinced of the opposite? Are we just patting
ourselves on the back, or is MP not sufficiently disseminated, or ... ?

toon

From lindahl at pbm.com  Fri Dec  7 12:24:31 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 7 Dec 2007 12:24:31 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
Message-ID: <20071207202431.GA17274@bx9.net>

On Fri, Dec 07, 2007 at 01:26:42PM +0100, Toon Knapen wrote:

> However, in real-life most are talking and investing in MT while very few
> are interested in MP.

In real life (i.e. not HPC), everyone uses message passing between
nodes.  So I don't see what you're getting at.

-- greg


From toon.knapen at gmail.com  Fri Dec  7 12:41:37 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Fri, 07 Dec 2007 21:41:37 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071207202431.GA17274@bx9.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net>
Message-ID: <4759B001.4090004@gmail.com>

Greg Lindahl wrote:
> 
> In real life (i.e. not HPC), everyone uses message passing between
> nodes.  So I don't see what you're getting at.
> 


Many on this list suggest that using multiple MPI-processes on one and 
the same node is superior to MT approaches IIUC. However I have the 
impression that almost the whole industry is looking into MT to benefit 
from multi-core without even considering message-passing. Why is that so?

toon


From brian.ropers.huilman at gmail.com  Fri Dec  7 13:24:34 2007
From: brian.ropers.huilman at gmail.com (Brian D. Ropers-Huilman)
Date: Fri, 7 Dec 2007 15:24:34 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <4759B001.4090004@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
Message-ID: <f0e11c480712071324h615690e2y920b40f21fcaa7df@mail.gmail.com>

On Dec 7, 2007 2:41 PM, Toon Knapen <toon.knapen at gmail.com> wrote:
> Greg Lindahl wrote:
> >
> > In real life (i.e. not HPC), everyone uses message passing between
> > nodes.  So I don't see what you're getting at.
>
> Many on this list suggest that using multiple MPI-processes on one and
> the same node is superior to MT approaches IIUC. However I have the
> impression that almost the whole industry is looking into MT to benefit
> from multi-core without even considering message-passing. Why is that so?

Because most of the "real" world does everything in a single node.
Given multi-cores, they have to move down the MT path.

The complications and effort of mixing MT on a node with the MP
between nodes -- absolutely required in our world -- are seen as too
great compared with the performance benefit. Simply staying with
multiple MP tasks, whether they are between nodes or within a node, is
much simpler -- it's what we already do.
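
To make the two programming styles concrete, here is a minimal Python
sketch that computes the same sum both ways. It is only an illustration of
the models, not of MPI or OpenMP themselves: Python threads stand in for
both the MT workers and the MP "ranks" so that the example stays
self-contained, and both function names are hypothetical.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """MT style: every worker updates one shared total, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)   # private work, nothing shared yet
        with lock:             # shared state must be protected explicitly
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """MP style: workers own their data and interact only via messages."""
    inbox = queue.Queue()

    def worker(chunk):
        inbox.put(sum(chunk))  # the only interaction is an explicit send

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One explicit receive per worker, like gathering results on rank 0.
    return sum(inbox.get() for _ in chunks)


data = list(range(100))
chunks = [data[i::4] for i in range(4)]   # four workers, disjoint slices
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

The message-passing version works unchanged in spirit whether the workers
share a node or not, which is the simplicity argument made above; the
shared-memory version only makes sense when all workers see the same memory.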

Now, whether MP as we currently do it will scale to
larger and larger systems remains to be seen, but that's another
thread.

-- 
Brian D. Ropers-Huilman, Director
Systems Administration and Technical Operations
Minnesota Supercomputing Institute                 <bropers at msi.umn.edu>
599 Walter Library                                   +1 612-626-5948 (V)
117 Pleasant Street S.E.                             +1 612-624-8861 (F)
University of Minnesota                               Twin Cities Campus
Minneapolis, MN 55455-0255                       http://www.msi.umn.edu/


From richard.walsh at comcast.net  Fri Dec  7 14:15:25 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Fri, 07 Dec 2007 22:15:25 +0000
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>


-------------- Original message -------------- 
From: "Toon Knapen" <toon.knapen at gmail.com> 


> How come there is almost unanimous agreement in the Beowulf community while
> the rest are almost unanimously convinced of the opposite? Are we just patting
> ourselves on the back, or is MP not sufficiently disseminated, or ... ?

Mmm ... I think the answer to this is that the rest of the world (the non-HPC world) is in a
time warp.  HPC went through its SMP-threads phase in the early-mid 1990s with OpenMP, and
then we needed a more scalable approach (MPI).  Now that multi-core and multi-socket
have brought parallelism to the rest of the Universe, SMP-based parallelism has had a
resurgence ... this has also naturally caused some in HPC to revisit the question as
nodes have fattened.

The allure of a programming model that is intuitive, expressive, symbolically light-weight,
and provides a way to manage the latency variance across memory partitions is irresistible.

I kind of like the CAF extension to Fortran and the concept of co-arrays.  The co-array is
an array of identical normal arrays, one per active image/process.  They are defined as such:

          real, dimension (N) [*] ::  X, Y

If the program is run on 8 cores/processors/images the * becomes 8, and eight 1D arrays of
size N are created for each co-array, one per image.  In any reference to the local component
of the co-array (the one on the image doing the referencing), you can drop the []s ... all
other (remote) references must include them.  This is symbolically light, but reminds the
programmer of every costly non-local reference with the presence of the []s in the assignment
or operation.  There is much more to it than that of course, but as the performance gap
between carefully constructed MPI applications and CAF compiled code shrinks I can see the
latter gaining some traction purely for reasons of programming elegance.  If you accept the
notion that most MPI programs are written at a B- level in terms of efficiency then the idea
of the gap closing may not be so far-fetched.  CAF is supposed to be included in the
Fortran 2008 standard.

rbw

-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

From mwill at penguincomputing.com  Fri Dec  7 14:16:54 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Fri, 7 Dec 2007 14:16:54 -0800
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <433093DF7AD7444DA65EFAFE3987879C33DEC2@orca.penguincomputing.com>

Distributed objects... CORBA... SOAP... This all predates multiple cores per server and is essentially message passing in the enterprise.

Michael

Sent from my GoodLink synchronized handheld (www.good.com)


 -----Original Message-----
From: 	Toon Knapen [mailto:toon.knapen at gmail.com]
Sent:	Friday, December 07, 2007 01:31 PM Pacific Standard Time
To:	Greg Lindahl
Cc:	beowulf at beowulf.org
Subject:	Re: [Beowulf] multi-threading vs. MPI

Greg Lindahl wrote:
> 
> In real life (i.e. not HPC), everyone uses message passing between
> nodes.  So I don't see what you're getting at.
> 


Many on this list suggest that using multiple MPI-processes on one and 
the same node is superior to MT approaches IIUC. However I have the 
impression that almost the whole industry is looking into MT to benefit 
from multi-core without even considering message-passing. Why is that so?

toon
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From hahn at mcmaster.ca  Fri Dec  7 16:51:58 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 7 Dec 2007 19:51:58 -0500 (EST)
Subject: [Beowulf] NFS Read Errors
In-Reply-To: <6.2.5.6.2.20071207102734.04fd0110@NumerEx.com>
References: <6.2.5.6.2.20071128132559.04fcbba8@NumerEx.com>
	<4754ABA1.9030105@scalableinformatics.com>
	<Pine.LNX.4.64.0712040116300.4169@coffee.psychology.mcmaster.ca>
	<6.2.5.6.2.20071204085359.04f72018@NumerEx.com>
	<6.2.5.6.2.20071205085308.04eff510@NumerEx.com>
	<47597B84.30704@cs.earlham.edu>
	<6.2.5.6.2.20071207102734.04fd0110@NumerEx.com>
Message-ID: <Pine.LNX.4.64.0712071947300.2110@coffee.psychology.mcmaster.ca>

>> > Fourth, with ttcp over tcp, I found that the troubled machine could
>> > send 800 MB in about 20 seconds -- the wire speed for those 32-bit PCI
>> > slots as tested by netpipe.  However, if I used ttcp over udp, I
>> > couldn't reliably send even ten 8192-byte blocks!  Successive sends

it's worth noting that ttcp with udp means that you're producing 8k 
datagrams, which means you're actually sending 6x 1500B packets back-to-back.
that's somewhat hard on the transmitter, on the receiver, and on the 
receiver's net stack, since it has to reassemble the original 8k datagram.
(there have been interesting cases where fragments get reordered for 
various reasons, and this causes real problems for the stack, since there's
often a fixed window for how many fragments will be remembered to be 
reordered...)

anyway, tcp is smart enough to send a stream of path-mtu packets,
so to some degree avoids this.
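
for anyone who wants to check the "6x 1500B" figure, here is a minimal
Python sketch of the fragment count.  it assumes IPv4 with a 20-byte header
and no options, an 8-byte udp header, and a 1500-byte mtu; the function name
is just for illustration.

```python
import math


def ip_fragments(udp_payload, mtu=1500, ip_header=20, udp_header=8):
    """Number of IPv4 fragments needed to carry one UDP datagram."""
    ip_payload = udp_payload + udp_header
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


print(ip_fragments(8192))   # 6: one 8k ttcp block becomes six packets
print(ip_fragments(1472))   # 1: largest datagram that fits unfragmented
```

losing any one of the six fragments loses the whole 8k datagram, which is
one way a modest packet-loss rate can turn into a much worse datagram-loss
rate.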

>> This sounds like it could be a flow control issue. TCP does its own flow
>> control, but UDP doesn't. Some Ethernet switches do IEEE 802.3x, which
>> might help your UDP performance.

otoh, nfs, even over udp, does its own flow-control.  it might not be very 
good, or might even not be working right here, though.


From hahn at mcmaster.ca  Fri Dec  7 16:58:13 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 7 Dec 2007 19:58:13 -0500 (EST)
Subject: [Beowulf] specific motherboard???
In-Reply-To: <a8d96dec0712071136m6db525a8k49e93bd972692b6f@mail.gmail.com>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
	<47599387.6080400@cse.ucdavis.edu>
	<a8d96dec0712071136m6db525a8k49e93bd972692b6f@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712071957100.2110@coffee.psychology.mcmaster.ca>

> The DX38BT has an 82566DC Gigabit controller, which isn't capable of jumbo frames
> (frame size is limited to 1500 bytes).
> The S3200SH has an 82566 controller too, but also an 82541PI that supports jumbo
> frames up to 16 KB (your switch will probably support frames up to 10 KB).

do you have any data showing that jumbo frames will make a significant
difference?  yes, they will reduce interrupt-handling overhead (and perhaps
other overheads), but those tend to be fairly minor concerns on modern,
obscenely-fast cores...
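
to put rough numbers on the overhead in question, here is a minimal Python
sketch of frame rate and payload efficiency at gigabit line rate.  the
assumptions are mine, not measurements: 38 bytes of per-frame wire overhead
(8 preamble, 14 header, 4 FCS, 12 inter-frame gap), 40 bytes of IPv4+TCP
headers without options, and a 9000-byte jumbo mtu picked as a common value.
the worst case of one interrupt per frame ignores interrupt coalescing.

```python
WIRE_OVERHEAD = 38   # assumed: preamble 8 + header 14 + FCS 4 + gap 12
TCP_IP_HEADERS = 40  # assumed: IPv4 20 + TCP 20, no options


def frames_per_second(mtu, link_bits_per_s=1e9):
    """Frames per second needed to fill the link with full-size frames."""
    return link_bits_per_s / 8 / (mtu + WIRE_OVERHEAD)


def payload_efficiency(mtu):
    """Fraction of wire time that carries TCP payload."""
    return (mtu - TCP_IP_HEADERS) / (mtu + WIRE_OVERHEAD)


for mtu in (1500, 9000):
    print(f"mtu {mtu}: {frames_per_second(mtu):8.0f} frames/s, "
          f"{payload_efficiency(mtu):.1%} payload")
```

under these assumptions jumbo frames cut the frame rate by roughly a factor
of six but improve payload efficiency by only a few percent, so any gain
would have to come from per-frame cpu cost rather than from the wire.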


From hahn at mcmaster.ca  Fri Dec  7 17:10:41 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 7 Dec 2007 20:10:41 -0500 (EST)
Subject: [Beowulf] specific motherboard???
In-Reply-To: <47599387.6080400@cse.ucdavis.edu>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
	<47599387.6080400@cse.ucdavis.edu>
Message-ID: <Pine.LNX.4.64.0712072002410.2110@coffee.psychology.mcmaster.ca>

> Keep in mind that the Petsc faq mentions:
> A fast, low-latency interconnect; any ethernet, even 10 gigE cannot provide
> the needed performance.

which seems foolish on the face of it, since 10G can be fairly low latency.
for instance, Myrinet quotes 2.3 us for mx-over-myrinet on their 10G nic,
and 2.63 for mx-over-ethernet (same nic and host, fujitsu eth switch).
that's not the lowest latency interconnect, but it's a far cry from 
50 us Gb latency...
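
a minimal Python sketch of why the latency figure matters more than the
link rate for small messages.  it uses the simple model
time = latency + size / bandwidth, with the 50 us and 2.63 us latencies
quoted above and peak link rates of 1 and 10 Gbit/s; the 1 KB message size
and the function name are assumptions for illustration only.

```python
def transfer_time_us(message_bytes, latency_us, link_gbit_per_s):
    """One-way transfer time in microseconds under a latency+bandwidth model."""
    bytes_per_us = link_gbit_per_s * 1e9 / 8 / 1e6
    return latency_us + message_bytes / bytes_per_us


size = 1024  # assumed small MPI message, in bytes
gige = transfer_time_us(size, latency_us=50.0, link_gbit_per_s=1)
mx_eth = transfer_time_us(size, latency_us=2.63, link_gbit_per_s=10)
print(f"GigE           : {gige:.2f} us")    # 58.19 us
print(f"MX over 10G eth: {mx_eth:.2f} us")  # 3.45 us
```

for a message this small, latency accounts for most of the time in both
cases, so the model says little about the ten-fold difference in link rate.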

> I wouldn't assume that your workload will scale with gigE to 16 nodes.

I haven't scrutinized petsc docs, but I would expect the answer to 
depend on what you're doing and also how big a chunk you can put onto 
a single node.  standard volume/surface argument.  from 10,000 feet,
it _looks_ like petsc could be used in a pretty high-work-per-communication
manner.
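
the volume/surface argument can be sketched in a few lines of Python.  the
assumptions are mine: a cubic subdomain per node, equal work per cell, and a
one-cell-deep halo exchanged on all six faces.  the 240-cell global grid and
the function name are hypothetical.

```python
def compute_to_comm_ratio(cells_per_side):
    """Cells computed per halo cell exchanged, for a cubic subdomain."""
    volume = cells_per_side ** 3        # work scales with the volume
    surface = 6 * cells_per_side ** 2   # communication scales with the faces
    return volume / surface


global_side = 240  # hypothetical global grid of 240^3 cells
for nodes_per_side in (1, 2, 4):
    n = global_side // nodes_per_side
    print(f"{nodes_per_side ** 3:3d} nodes: {n}^3 cells each, "
          f"ratio {compute_to_comm_ratio(n):.0f}")
```

the ratio is n/6, so every doubling of the node count per side halves the
work done per byte communicated; whether gigE keeps up depends on where the
problem sits on that curve.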

> tested DDR3, but all the tests I've seen show minimal performance improvements
> at substantial price increases.

my understanding is that ddr3-ddr2 changes are largely system-level
engineering (lower voltage/power, some changes in signal handling to
improve snr and scalability.)


From lindahl at pbm.com  Fri Dec  7 17:19:48 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 7 Dec 2007 17:19:48 -0800
Subject: [Beowulf] specific motherboard???
In-Reply-To: <Pine.LNX.4.64.0712072002410.2110@coffee.psychology.mcmaster.ca>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>
	<47599387.6080400@cse.ucdavis.edu>
	<Pine.LNX.4.64.0712072002410.2110@coffee.psychology.mcmaster.ca>
Message-ID: <20071208011948.GA4839@bx9.net>

On Fri, Dec 07, 2007 at 08:10:41PM -0500, Mark Hahn wrote:

> which seems foolish on the face of it, since 10G can be fairly low latency.
> for instance, Myrinet quotes 2.3 us for mx-over-myrinet on their 10G nic,

Most people don't mean Myrinet MX when they say 10G. All the usual
ethernet protocols are much worse.

-- greg


From richard.walsh at comcast.net  Fri Dec  7 19:51:25 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 08 Dec 2007 03:51:25 +0000
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>


 -------------- Original message ----------------------
From: Toon Knapen <toon.knapen at gmail.com>
> Greg Lindahl wrote:
> > 
> > In real life (i.e. not HPC), everyone uses message passing between
> > nodes.  So I don't see what you're getting at.
> > 
> 
> Many on this list suggest that using multiple MPI-processes on one and 
> the same node is superior to MT approaches IIUC. However I have the 
> impression that almost the whole industry is looking into MT to benefit 
> from multi-core without even considering message-passing. Why is that so?

I think what Greg and others are really saying is that if you have to use a distributed memory
model (MPI) as a first order response to meet your scalability requirements, then
the extra coding effort and complexity required to create a hybrid code may not be
a good performance return on your investment.  If on the other hand you only
need to scale within a single SMP node (with cores and sockets on a single
board growing in number, this returns more performance than in the past), then you
may be able to avoid using MPI and choose a simpler model like OpenMP.  If you
have already written an efficient MPI code, then (with some exceptions) the
performance gain divided by the hybrid coding effort may seem small.

Development in an SMP environment is easier.  I know of a number of sites
that work this way.  The experienced algorithm folks work up the code in 
OpenMP on say an SGI Altix or Power6 SMP, then they get a dedicated MPI
coding expert to convert it later for scalable production operation on a cluster.
In this situation, they do end up with hybrid versions in some cases.  In non-HPC
or smaller workgroup contexts your production code may not need to be converted.

Cheers,

rbw

--

"Making predictions is hard, especially about the future."

Niels Bohr

--

Richard Walsh
Thrashing River Consulting--
5605 Alameda St.
Shoreview, MN 55126

Phone #: 612-382-4620


From gerry.creager at tamu.edu  Fri Dec  7 20:46:44 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Fri, 07 Dec 2007 22:46:44 -0600
Subject: [Beowulf] specific motherboard???
In-Reply-To: <Pine.LNX.4.64.0712071957100.2110@coffee.psychology.mcmaster.ca>
References: <428810f20712062147k21feaa6ax94432b9df4705834@mail.gmail.com>	<47599387.6080400@cse.ucdavis.edu>	<a8d96dec0712071136m6db525a8k49e93bd972692b6f@mail.gmail.com>
	<Pine.LNX.4.64.0712071957100.2110@coffee.psychology.mcmaster.ca>
Message-ID: <475A21B4.4000306@tamu.edu>

No hard data, but our subjective evaluation running WRF and MM5 showed a
notable improvement in overall wall-clock time-to-completion with jumbo frames.

gerry

Mark Hahn wrote:
>> DX38BT have a 82566DC Gigabit controller, that isn't capable of jumbo 
>> frames
>> (frame size is limited to 1500 bytes).
>> But S3200SH have a 82566 controller too and a  82541PI that support jumbo
>> frames up to 16kb (probably your switch will support frames up to 10kb).
> 
> do you have any data showing that jumbo frames will make a significant
> difference?  yes, they will reduce interrupt-handling overhead (and perhaps
> other overheads), but those tend to be fairly minor concerns on modern,
> obscenely-fast cores...
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


From kir at lapshin.net  Fri Dec  7 14:01:11 2007
From: kir at lapshin.net (Kirill Lapshin)
Date: Sat, 08 Dec 2007 01:01:11 +0300
Subject: [Beowulf] Re: multi-threading vs. MPI
In-Reply-To: <4759B001.4090004@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>
Message-ID: <fjcfr5$19v$1@ger.gmane.org>

Toon Knapen wrote:
> Many on this list suggest that using multiple MPI-processes on one and 
> the same node is superior to MT approaches IIUC. However I have the 
> impression that almost the whole industry is looking into MT to benefit 
> from multi-core without even considering message-passing. Why is that so?

My understanding is that in a cluster setup you will have MP anyway, so
you have a choice of MP+MT vs MP, i.e. internode is MP anyway, and on
every node you can try to do MT (so-called hybrid) or just go with MP
all the way. The latter is simpler, and real-world examples of MP+MT
outperforming MP remain to be seen.

However, if your scope is limited to a single node then MT may very
well be a better choice than MP due to a) universal availability of MT
libraries (pthread, winthreads) and b) the fact that you can easily and
efficiently implement MP over MT but not the other way round.
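
To make point b) concrete, here is a minimal sketch of message passing layered
on top of threads. It is only an illustration: the Mailbox class, the worker
function and parallel_sum are hypothetical names, not part of pthread,
winthreads or any MPI library, and the per-rank inbox is simply a thread-safe
queue from the Python standard library.

```python
import queue
import threading


class Mailbox:
    """Hypothetical helper: one inbox per rank, used only via send/recv."""

    def __init__(self, size):
        self.inboxes = [queue.Queue() for _ in range(size)]

    def send(self, dest, msg):
        # Deliver a message to the inbox of rank `dest`.
        self.inboxes[dest].put(msg)

    def recv(self, rank):
        # Block until a message for `rank` arrives, then return it.
        return self.inboxes[rank].get()


def worker(rank, size, box, data):
    # Each rank sums its own strided slice and mails the result to rank 0.
    partial = sum(data[rank::size])
    box.send(0, (rank, partial))


def parallel_sum(data, size=4):
    box = Mailbox(size)
    threads = [threading.Thread(target=worker, args=(r, size, box, data))
               for r in range(size)]
    for t in threads:
        t.start()
    # The calling thread acts as the collector: one message per rank.
    total = sum(box.recv(0)[1] for _ in range(size))
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(100))))
```

The workers never touch each other's data directly; everything goes through
send and recv, which is the MP discipline, while the transport underneath is
plain shared memory.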

As for the multinode scenario, I think I am buying the argument that MP wins
over MT+MP in the majority of cases, if not all.


From quantummechanicsllc at msn.com  Fri Dec  7 16:22:50 2007
From: quantummechanicsllc at msn.com (Donald Shillady)
Date: Fri, 7 Dec 2007 19:22:50 -0500
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
References: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <BAY115-W7B6D71D7E8998CB5BB309B4690@phx.gbl>


This is a very interesting discussion to me.  I have started to purchase
components for an 8 core microWulf based on the Calvin College microWulf
constructed by Prof. Joel Adams and his student, except I will use
slightly faster cores with an AMD X2 5400+ in the Master node (dual
core) and three AMD X2 4000+ dual core processors enclosed in
inexpensive boxes.  The Master node has an MSI K9N SLI Platinum
motherboard which has two Gigabit ports, so perhaps the initial
configuration with three satellite dual core CPUs can be extended to a
second set of boxes later.

All these AM2-socket CPUs are dual core and apparently Prof. Adams was
able to address them in the microWulf as individual cores, but there is,
I believe, some hyperthreading between the dual cores.  So what is the
story about how the dual cores can be addressed individually but still
have hyperthreading between them?  I am an experienced programmer for
von Neumann architectures and a total novice on parallel systems, but as
I build the microWulf I wonder whether MPI will decouple the
hyperthreading, or is it not there?  From what little I have learned so
far, the microWulf switch depends on the relatively slow Gigabit
Ethernet, so there is probably time within each dual core CPU for
hyperthreading to occur, if indeed provision is made for hyperthreading
in the AMD X2 dual cores.  Sorry to ask such a dumb question but I am
trying to learn.
 
Don Shillady
Emeritus Professor of Chemistry, VCU
Ashland Va (working at home)


From: richard.walsh at comcast.net
To: toon.knapen at gmail.com; beowulf at beowulf.org
Subject: Re: [Beowulf] multi-threading vs. MPI
Date: Fri, 7 Dec 2007 22:15:25 +0000
 
-------------- Original message --------------
From: "Toon Knapen" <toon.knapen at gmail.com>
 
How come there is almost unanimous agreement in the beowulf-community while the rest is almost unanimously convinced of the opposite ? Are we just patting ourselves on the back or is MP not sufficiently disseminated or ... ?
 
Mmm ... I think the answer to this is that the rest of the world (the non-HPC world) is in a time
warp.  HPC went through its SMP-threads phase in the early-mid 1990s with OpenMP, and then we
needed a more scalable approach (MPI).  Now that multi-core and multi-socket have brought
parallelism to the rest of the Universe, SMP-based parallelism has had a resurgence ... this has
also naturally caused some in HPC to revisit the question as nodes have fattened.
 
The allure of a programming model that is intuitive, expressive, symbolically light-weight,
and provides a way to manage the latency variance across memory partitions is irresistible.
 
I kind of like the CAF extension to Fortran and the concept of co-arrays.  The co-array is
an array of identical normal arrays, but one per active image/process.  They are defined as such:
 
          real, dimension (N) [*] ::  X, Y
 
If the program is run on 8 cores/processors/images the * becomes 8.  Eight 1D arrays of size
N are created, one on each processor.  In any reference to the local component of the co-array
(the image on the processor referencing it), you can drop the []s ... all other (remote) references
must include them.  This is symbolically light, but reminds the programmer of every costly
non-local reference with the presence of the []s in the assignment or operation.  There is much
more to it than that of course, but as the performance gap between carefully constructed
MPI applications and CAF compiled code shrinks I can see the latter gaining some traction
purely for reasons of programming elegance.  If you accept the notion that most MPI
programs are written at a B- level in terms of efficiency then the idea of gap closing may not
be so far-fetched.  CAF is supposed to be included in the Fortran 2008 standard.
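 
For readers without a CAF compiler, the idea can be mimicked in a few lines of
Python. This is only a toy model under my own assumptions: the CoArray class
and its local/remote methods are hypothetical names, images are numbered from
0 here (CAF numbers them from 1), and nothing is actually communicated. The
point is just that a remote reference has to name the image, which makes the
costly accesses visible and countable.

```python
class CoArray:
    """Toy co-array: one ordinary length-n array per image (process)."""

    def __init__(self, n, num_images):
        self.images = [[0.0] * n for _ in range(num_images)]
        self.remote_refs = 0  # how many non-local references were made

    def local(self, me):
        # Analogue of writing X(i) with no brackets: purely local access.
        return self.images[me]

    def remote(self, me, other):
        # Analogue of writing X(i)[other]: the image index marks the cost.
        if other != me:
            self.remote_refs += 1
        return self.images[other]


def demo(n=4, num_images=8):
    x = CoArray(n, num_images)
    # Every image fills its own copy with its image number (local work).
    for me in range(num_images):
        mine = x.local(me)
        for i in range(n):
            mine[i] = float(me)
    # Every image reads element 0 from its right-hand neighbour (remote).
    got = [x.remote(me, (me + 1) % num_images)[0] for me in range(num_images)]
    return got, x.remote_refs


if __name__ == "__main__":
    print(demo())
```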
 
rbw
 
--

"Making predictions is hard, especially about the future."

Niels Bohr

--

Richard Walsh
Thrashing River Consulting--
5605 Alameda St.
Shoreview, MN 55126

From gerry.creager at tamu.edu  Fri Dec  7 20:56:14 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Fri, 07 Dec 2007 22:56:14 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <475A23EE.4020302@tamu.edu>

WRF has been under development for 10 years.  It's got an OpenMP flavor, 
an MPI flavor and a hybrid one.  We still don't have all the bugs worked 
out of the hybrid so that it can handle large, high resolution domains 
without being slower than the MPI version.  And, yeah, the OpenMP geeks 
working on this... and the MPI folks, are good.

Hybrid isn't easy and isn't always foolproof.  And, as another thought, 
OpenMP isn't always the best solution to the problem.

gerry

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


From laytonjb at charter.net  Sat Dec  8 04:37:24 2007
From: laytonjb at charter.net (laytonjb at charter.net)
Date: Sat, 8 Dec 2007 4:37:24 -0800
Subject: [Beowulf] specific motherboard???
In-Reply-To: <20071208011948.GA4839@bx9.net>
Message-ID: <20071208073724.E8N2V.116801.root@fepweb06>

On a related topic...

I remember some old N/2 numbers for Chelsio 10GigE NICs that were above
100,000 bytes. Does anyone have any recent N/2 numbers for 10GigE NICs?

Thanks!

Jeff

---- Greg Lindahl <lindahl at pbm.com> wrote: 
> On Fri, Dec 07, 2007 at 08:10:41PM -0500, Mark Hahn wrote:
> 
> > which seems foolish on the face of it, since 10G can be fairly low latency.
> > for instance, Myrinet quotes 2.3 us for mx-over-myrinet on their 10G nic,
> 
> Most people don't mean Myrinet MX when they say 10G. All the usual
> ethernet protocols are much worse.
> 
> -- greg


From diep at xs4all.nl  Sat Dec  8 07:09:46 2007
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Sat, 8 Dec 2007 16:09:46 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
Message-ID: <0D372EFA-BFFA-4E30-8E74-1167EF8830D3@xs4all.nl>

Well, there is a difference between being lazy and quickly writing
something generic that works and is embarrassingly parallel, and
something like a game-tree search, where you really want the maximum
out of it and are prepared to optimize at the cycle level. In the
latter case you definitely want two-layer parallelism.

Note that multiprocessing is far easier than multithreading under
Linux. Of course, seen from a high level that is not a very
interesting difference, because you can do multiprocessing in a way
that you could call multithreading, and vice versa.

Another concern is that, in the megabytes of source code, quite a lot
of communication happens between the different processes, simply
because that speeds up the software exponentially. Ideally,
communication happens at every node (position) you search. When less
communication is possible, because it takes several microseconds to
get a cache line from a remote node, the speedup gets worse.

A problem with doing all that communication between the nodes is that
in shared memory it is simply a single pointer you read or write,
whereas with MPI you have to do a lot to get it done. It totally
screws up your source code, so to speak.

In the end, it all depends entirely upon the software you intend to run.

Yet MPI has a huge overhead, which shared memory parallelism doesn't
have at all.

This is totally irrelevant when your software is embarrassingly
parallel, though.

If it is relevant, then you'll have to look for creative ways to
parallelize your software well, as latencies within one node are up
to a factor of 1000 lower than between nodes.

To compare: over shmem/MPI I'm really happy when I get 50% speedup out
of a few nodes (and about 20-25% when the number of CPUs grows to 512),
while in shared memory my Diep chess program gets about 3.75 out of 4
on a quad-core Intel, where the scaling is roughly 3.8 out of 4. So
that's 3.75 / 3.8 = 98.7% speedup.

There is a huge difference between 95%+ speedup and 50%.

So for example, on a 16-node quad-core cluster with 64 CPUs in total,
if I were to use MPI only, I'd get perhaps 25% speedup or so.

25% * 64 = 16 out of 64, or a bit less

Now using two-layer parallelism the calculation is more like:

3.75 speedup out of 1 node and 50% speedup out of 16 nodes = 3.75 * 8
=  30 out of 64
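
The same arithmetic as a small sketch, for anyone who wants to plug in their
own numbers. The function names are just illustrative, and the model is
nothing more than the multiplication done above: an efficiency times a
resource count.

```python
def mpi_only_speedup(total_cores, efficiency):
    """Flat MPI: every core is a rank, overall efficiency applies to all."""
    return efficiency * total_cores


def two_layer_speedup(nodes, per_node_speedup, node_efficiency):
    """Shared memory inside each node, message passing between nodes."""
    return per_node_speedup * (node_efficiency * nodes)


if __name__ == "__main__":
    # 16 quad-core nodes, 64 cores in total, numbers taken from the text.
    print("MPI only  :", mpi_only_speedup(64, 0.25), "out of 64")
    print("two layers:", two_layer_speedup(16, 3.75, 0.50), "out of 64")
    # Shared-memory result relative to the achievable scaling on one node.
    print("in-node   :", round(3.75 / 3.8 * 100, 1), "%")
    # The budget alternative: flat MPI at 25% on 256 cores.
    print("256 cores :", mpi_only_speedup(256, 0.25))
```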

Of course, don't forget the huge effort of first building parallelism
that also runs on PCs; years of full-time work have been put into it.

When you have thousands of nodes, of course, making this big effort is
nearly irrelevant. Yet you don't get hundreds of nodes easily.
With the biggest effort you can perhaps buy a 64-core machine
yourself. As soon as big institutes start paying, things change though.

Why make the effort? If you want a bigger speedup, just ask for more
budget and get yourself more CPUs, as 25% out of 256 cores is still
more than the 30 out of 64 that years of huge effort get you on a
cluster.

Vincent

On Dec 7, 2007, at 1:26 PM, Toon Knapen wrote:

> On this list there is almost unanimous agreement that MPI is the  
> way to go for parallelism and that combining multi-threading (MT)  
> and message-passing (MP) is not even worth it, just sticking to MP  
> is all that is necessary.
>
> However, in real-life most are talking and investing in MT while
> very few are interested in MP. I also just read on the blog of Arch
> Robison " TBB perhaps gives up a little performance short of
> optimal so you don't have to write message-passing " (here:
> http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/ )
>
> How come there is almost unanimous agreement in the beowulf-community
> while the rest is almost unanimously convinced of the opposite ? Are we
> just patting ourselves on the back or is MP not sufficiently
> disseminated or ... ?
>
> toon

From gdjacobs at gmail.com  Sat Dec  8 09:25:19 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sat, 08 Dec 2007 11:25:19 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <4759B001.4090004@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>
Message-ID: <475AD37F.3040004@gmail.com>

Toon Knapen wrote:
> Greg Lindahl wrote:
>>
>> In real life (i.e. not HPC), everyone uses message passing between
>> nodes.  So I don't see what you're getting at.
>>
> Many on this list suggest that using multiple MPI-processes on one and
> the same node is superior to MT approaches IIUC.

I'm not sure where this conclusion came from.

> However I have the
> impression that almost the whole industry is looking into MT to benefit
> from multi-core without even considering message-passing. Why is that so?
> 
> toon

These are developers working on NUMA and SMA, not network distributed
parallel processing. Different problem, different tool.

If an application is going to be using only shared memory, I have no
doubt the consensus here is to use native threads or OpenMP. If the
application is going to be working over a network, MPI (or PVM) is the
way to go.

Don't misconstrue any strategies and opinions on this list to be
generalizable to all types of parallel computing. This list is here to
discuss the engineering of Beowulf style systems, so opinions (unless
qualified) will generally apply to Beowulf computers as well.

-- 
Geoffrey D. Jacobs

To have no errors
  would be life without meaning
  No struggle, no joy


From gdjacobs at gmail.com  Sat Dec  8 09:35:03 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sat, 08 Dec 2007 11:35:03 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
References: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <475AD5C7.9040906@gmail.com>

richard.walsh at comcast.net wrote:
>  
> 
>     -------------- Original message --------------
>     From: "Toon Knapen" <toon.knapen at gmail.com>
>      
>     How come there is almost unanimous agreement in the
>     beowulf-community while the rest is almost unanimously convinced of
>     the opposite ? Are we just patting ourselves on the back or is MP
>     not sufficiently disseminated or ... ?
>      
>     Mmm ... I think the answer to this is that the rest of the world
>     (the non-HPC world) is in a time warp.  HPC went through its
>     SMP-threads phase in the early-mid 1990s with OpenMP, and then we
>     needed a more scalable approach (MPI).  Now that multi-core and
>     multi-socket have brought parallelism to the rest of the Universe,
>     SMP-based parallelism has had a resurgence ... this has also
>     naturally caused some in HPC to revisit the question as nodes have
>     fattened.
>      
>     The allure of a programming model that is intuitive, expressive,
>     symbolically light-weight, and provides a way to manage the latency
>     variance across memory partitions is irresistible.
>      
>     I kind of like the CAF extension to Fortran and the concept of
>     co-arrays.  The co-array is an array of identical normal arrays,
>     but one per active image/process.  They are defined as such:
>      
>               real, dimension (N) [*] ::  X, Y
>      
>     If the program is run on 8 cores/processors/images the * becomes 8.
>     Eight 1D arrays of size N are created, one on each processor.  In
>     any reference to the local component of the co-array (the image on
>     the processor referencing it), you can drop the []s ... all other
>     (remote) references must include them.  This is symbolically light,
>     but reminds the programmer of every costly non-local reference with
>     the presence of the []s in the assignment or operation.  There is
>     much more to it than that of course, but as the performance gap
>     between carefully constructed MPI applications and CAF compiled
>     code shrinks I can see the latter gaining some traction purely for
>     reasons of programming elegance.  If you accept the notion that
>     most MPI programs are written at a B- level in terms of efficiency
>     then the idea of gap closing may not be so far-fetched.  CAF is
>     supposed to be included in the Fortran 2008 standard.
>      
>     rbw
>      
>     -- 
> 
>     "Making predictions is hard, especially about the future."
> 
>     Niels Bohr
> 
>     -- 
> 
>     Richard Walsh
>     Thrashing River Consulting--
>     5605 Alameda St.
>     Shoreview, MN 55126 

But isn't CAF (and UPC, and Titanium) implicitly message passing for a
Beowulf anyway? It's attractive because it simplifies the process and
might be able to optimize communication, but underneath everything it's
still message passing.

-- 
Geoffrey D. Jacobs

To have no errors
  would be life without meaning
  No struggle, no joy


From gdjacobs at gmail.com  Sat Dec  8 09:49:41 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sat, 08 Dec 2007 11:49:41 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <BAY115-W7B6D71D7E8998CB5BB309B4690@phx.gbl>
References: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
	<BAY115-W7B6D71D7E8998CB5BB309B4690@phx.gbl>
Message-ID: <475AD935.1040208@gmail.com>

Donald Shillady wrote:
> This is a very interesting discussion to me.  I have started to purchase
> components for an 8 core microWulf based on the Calvin College microWulf
> constructed by Prof. Joel Adams and his student except I will use
> slightly faster cores with an AMD X2 5400+ in the Master node (dual
> core) and three AMD X2 4000+ dual core processors enclosed in
> inexpensive boxes.  The Master node has an MSI K9N SLI Platinum
> motherboard which has two Gigabit ports so perhaps the initial
> configuration with three satellite dual core CPUs can be extended to a
> second set of boxes later.  All these AM2-socket CPUs are dual core and
> apparently Prof. Adams was able to address them in the microWulf as
> individual cores but there is, I believe, some hyperthreading between
> the dual cores so what is the story about how the dual cores can be
> addressed individually but still have hyperthreading between the dual
> cores?  I am an experienced programmer for von Neumann architectures and a
> total novice on parallel systems but as I build the microWulf I wonder
> if MPI will decouple the hyperthreading or is it not there?  From what
> little I have learned so far the microWulf switch depends on the
> relatively slow Gigabit Ethernet so there is probably time within each
> dual core CPU for hyperthreading to occur if indeed provision is
> provided for hyperthreading in the AMD X2 dual cores.  Sorry to ask such
> a dumb question but I am trying to learn.
>  
> Don Shillady
> Emeritus Professor of Chemistry, VCU
> Ashland Va (working at home)
> 

I have always programmed in the past with a flat model utilizing MPI.
This has been for dual CPU, single core per CPU computers, but applies
equally to dual core.

Here is how the processes tended to map to physical computers, but it
varied depending on MPI configuration. I also had a lot of fun abusing
the process group file in MPICH1.x.

Computer 1	CPU 1	rank(1)
		CPU 2	rank(2)
Computer 2	CPU 1	rank(3)
		CPU 2	rank(4)
...
...
...

and so forth. Using threads, you could potentially do this:

Computer 1	CPU 1	rank(1)		thread(1)
		CPU 2			thread(2)
Computer 2	CPU 1	rank(2)		thread(1)
		CPU 2			thread(2)
...
...
...

Only half as many MPI ranks are started as in the first example, one per
computer, but each process spawned by MPI creates two threads, which the
operating system on each computer load balances across the two CPUs.
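
As a rough sketch, the two tables above can be generated like this. The
function names are made up for illustration, ranks and threads are numbered
from 1 to match the tables (real MPI ranks start at 0), and in the hybrid
case the thread-to-CPU pairing is only nominal, since the operating system
decides where each thread actually runs.

```python
def flat_layout(computers, cpus_per_computer):
    """One MPI rank per CPU: rows of (computer, cpu, rank, thread)."""
    rows = []
    rank = 1
    for comp in range(1, computers + 1):
        for cpu in range(1, cpus_per_computer + 1):
            rows.append((comp, cpu, rank, None))  # no threads in flat MPI
            rank += 1
    return rows


def hybrid_layout(computers, cpus_per_computer):
    """One MPI rank per computer, one thread per CPU of that computer."""
    return [(comp, cpu, comp, cpu)  # rank == computer, thread == cpu
            for comp in range(1, computers + 1)
            for cpu in range(1, cpus_per_computer + 1)]


if __name__ == "__main__":
    for name, rows in (("flat", flat_layout(2, 2)),
                       ("hybrid", hybrid_layout(2, 2))):
        print(name)
        for comp, cpu, rank, thread in rows:
            print(f"  Computer {comp}  CPU {cpu}  rank({rank})  thread({thread})")
```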

<snip>

-- 
Geoffrey D. Jacobs

To have no errors
  would be life without meaning
  No struggle, no joy


From tom.elken at qlogic.com  Sat Dec  8 09:51:48 2007
From: tom.elken at qlogic.com (Tom Elken)
Date: Sat, 8 Dec 2007 09:51:48 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <BAY115-W7B6D71D7E8998CB5BB309B4690@phx.gbl>
References: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
	<BAY115-W7B6D71D7E8998CB5BB309B4690@phx.gbl>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A018501E1@AVEXCH1.qlogic.org>

Don,

You asked:  
"...  AMD X2 5400+ in the Master node (dual core) and three AMD X2 4000+
dual core processors enclosed in inexpensive boxes. .... I believe, some
hyperthreading between the dual cores so what is the story about how the
dual cores can be addressed individually but still have hyperthreading
between the dual cores?"

There is no hyperthreading (hardware threading) in AMD CPUs.  Each core
appears like a separate CPU to the Operating System and is treated as
such by MPI libraries.   So you can happily use your Microwulf running
two MPI processes per node with good performance.  The thrust of the
discussion is that, for the average user, you can ignore software
threading between the cores of a node, just use MPI to obtain good
parallel speed-ups.
 
-Tom



From richard.walsh at comcast.net  Sat Dec  8 11:14:38 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Sat, 08 Dec 2007 19:14:38 +0000
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <120820071914.25507.475AED1D000E9FF4000063A32205886014089C040E99D20B9D0E080C079D@comcast.net>

-------------- Original message -------------- 
From: Geoff Jacobs <gdjacobs at gmail.com> 

> 
> But isn't CAF (and UPC, and Titanium) implicitly message passing for a 
> Beowulf anyway? It's attractive because it simplifies the process and 
> might be able to optimize communication, but underneath everything it's 
> still message passing. 
> 

Most of what you say here is true ...

It is low-level message passing between nodes, and can be either within ... depending
on what optimizations the compiler does.  Still, the code generated is one layer closer to
the network adapter hardware and has a small potential performance advantage because
of this (although MPI can be used as a conduit).   

PGAS languages push the problem of managing latency off onto the compiler 
while offering a more implicit, language integrated approach to dealing with remote
references.  The []s are light-weight symbols that remind the programmer of the
overhead implicit in making remote references, but the work of actually making them
efficient is left up to the compiler.
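
To make the role of the []s concrete, here is a toy model in Python (not CAF,
and with no real communication).  The CoArray class and its get/put methods are
hypothetical names invented for this sketch; the only point is that a reference
naming another image looks different from a local one and can be counted as a
potentially costly remote access.

```python
# Toy model of a co-array: one ordinary array per image (process).
# All names are hypothetical; this is not CAF and nothing is sent anywhere.

class CoArray:
    def __init__(self, n, num_images):
        # One length-n array per image, loosely mirroring
        #   real, dimension (N) [*] :: X
        self.data = [[0.0] * n for _ in range(num_images)]
        self.remote_refs = 0  # references that would cross image boundaries

    def get(self, me, i, image=None):
        """Read element i as seen from image `me`.

        image=None stands for dropping the []s (a local reference);
        image=k stands for an explicit [k] (possibly remote).
        """
        target = me if image is None else image
        if target != me:
            self.remote_refs += 1
        return self.data[target][i]

    def put(self, me, i, value, image=None):
        """Write element i, locally by default or on another image."""
        target = me if image is None else image
        if target != me:
            self.remote_refs += 1
        self.data[target][i] = value


x = CoArray(n=4, num_images=8)
for me in range(8):
    x.put(me, 0, float(me))          # local write, no brackets needed

# Image 0 reads element 0 held by image 1: the bracketed, remote case.
neighbour_value = x.get(0, 0, image=1)
print(neighbour_value, x.remote_refs)
```

In real CAF the compiler and runtime decide how such a reference is carried
out; the counter here only marks where the source code would show brackets.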

rbw
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

Phone #: 612-382-4620

From siegert at sfu.ca  Sat Dec  8 12:55:07 2007
From: siegert at sfu.ca (Martin Siegert)
Date: Sat, 8 Dec 2007 12:55:07 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475A23EE.4020302@tamu.edu>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
Message-ID: <20071208205507.GA16760@stikine.ucs.sfu.ca>

Over the last months I have done quite a bit of benchmarking of
applications. One of the aspects we are interested in is the performance
of applications that are available in MPI, OpenMP and hybrid versions.
So far we looked at WRF and CPMD; we'll probably look at POP as well.

MPI vs. OpenMP on an SMP (64-core Power5):
walltime for cpmd benchmark on 32 cores:
MPI: 93.13s   OpenMP: 446.86s

Results for WRF on the same platform are similar. 
In short: the performance of OpenMP code isn't even close to that of the
MPI code.
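
For a sense of scale, the two wall-clock times quoted above can be turned into
a ratio.  This is plain arithmetic on the numbers in this message, nothing more;
the variable names are mine.

```python
# Wall-clock times quoted above for the CPMD benchmark on 32 cores.
mpi_seconds = 93.13
openmp_seconds = 446.86

slowdown = openmp_seconds / mpi_seconds       # how many times longer OpenMP took
extra_percent = (slowdown - 1.0) * 100.0      # extra run time relative to MPI

print(f"OpenMP took {slowdown:.2f}x as long as MPI ({extra_percent:.0f}% longer)")
```

That works out to roughly a factor of 4.8 for this one benchmark on this one
machine.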

We also looked at the hybrid versions of these codes on clusters.
The differences in run times are in the 1% range - less than the
accuracy of the measurement.

Thus, if you have the choice, why would you even look at anything other
than MPI? Even if the programming effort for OpenMP is lower,
the performance penalty is huge.

That's my conclusion drawn from the cases we've looked at.
If anybody knows of applications where the OpenMP performance comes close
to the MPI performance and of applications where the hybrid performance
is significantly better than the pure MPI performance, then I would
love to hear from you. Thanks!

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
Academic Computing Services                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

On Fri, Dec 07, 2007 at 10:56:14PM -0600, Gerry Creager wrote:
> WRF has been under development for 10 years.  It's got an OpenMP flavor, 
> an MPI flavor and a hybrid one.  We still don't have all the bugs worked 
> out of the hybrid so that it can handle large, high resolution domains 
> without being slower than the MPI version.  And, yeah, the OpenMP geeks 
> working on this... and the MPI folks, are good.
> 
> Hybrid isn't easy and isn't always foolproof.  And, as another thought, 
> OpenMP isn't always the best solution to the problem.
> 
> gerry
> 
> richard.walsh at comcast.net wrote:
> > -------------- Original message ----------------------
> >From: Toon Knapen <toon.knapen at gmail.com>
> >>Greg Lindahl wrote:
> >>>In real life (i.e. not HPC), everyone uses message passing between
> >>>nodes.  So I don't see what you're getting at.
> >>>
> >>Many on this list suggest that using multiple MPI-processes on one and 
> >>the same node is superior to MT approaches IIUC. However I have the 
> >>impression that almost the whole industry is looking into MT to benefit 
> >>from multi-core without even considering message-passing. Why is that so?
> >
> >I think what Greg and others are really saying is that if you have to use 
> >a distributed memory
> >model (MPI) as a first order response to meet your scalability 
> >requirements, then
> >the extra coding effort and complexity required to create a hybrid code 
> >may not be
> >a good performance return on your investment.  If on the other hand you 
> >only
> >need to scale within a singe SMP node (with cores and sockets on a single
> >board growing in number, this returns more performance than in the past), 
> >then you
> >may be able to avoid using MPI and chose a simpler model like OpenMP.  If 
> >you
> >have already written an efficient MPI code,  then (with some exceptions) 
> >the performance-gain divided by the hybrid coding-effort may seem small.
> >
> >Development in an SMP environment is easier.  I know of a number of sights
> >that work this way.  The experienced algorithm folks work up the code in 
> >OpenMP on say an SGI Altix or Power6 SMP, then they get a dedicated MPI
> >coding expert to convert it later for scalable production operation on a 
> >cluster.
> >In this situation, they do end up with hybrid versions in some cases.  In 
> >non-HPC
> >or smaller workgroup contexts your production code may not need to be 
> >converted.
> >
> >Cheers,
> >
> >rbw
> >
> >--
> >
> >"Making predictions is hard, especially about the future."
> >
> >Niels Bohr
> >
> >--
> >
> >Richard Walsh
> >Thrashing River Consulting--
> >5605 Alameda St.
> >Shoreview, MN 55126
> >
> >Phone #: 612-382-4620
> >
> >_______________________________________________
> >Beowulf mailing list, Beowulf at beowulf.org
> >To change your subscription (digest mode or unsubscribe) visit 
> >http://www.beowulf.org/mailman/listinfo/beowulf
> 
> -- 
> Gerry Creager -- gerry.creager at tamu.edu
> Texas Mesonet -- AATLT, Texas A&M University	
> Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
> Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
Academic Computing Services                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6


From lindahl at pbm.com  Sat Dec  8 13:17:25 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Sat, 8 Dec 2007 13:17:25 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071208205507.GA16760@stikine.ucs.sfu.ca>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
	<20071208205507.GA16760@stikine.ucs.sfu.ca>
Message-ID: <20071208211725.GA669@bx9.net>

On Sat, Dec 08, 2007 at 12:55:07PM -0800, Martin Siegert wrote:

> In short: the performance of OpenMP code isn't even close to that of the
> MPI code.

Thanks for more concrete data, Martin, do you have a URL for some more
detailed results? I'd love to have it next time this question comes
up.

Here's an interesting recent Usenet posting on this topic:

http://groups.google.com/group/comp.parallel/msg/7f4de81edc0575d1

-- greg


From siegert at sfu.ca  Sat Dec  8 14:04:50 2007
From: siegert at sfu.ca (Martin Siegert)
Date: Sat, 8 Dec 2007 14:04:50 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071208211725.GA669@bx9.net>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
	<20071208205507.GA16760@stikine.ucs.sfu.ca>
	<20071208211725.GA669@bx9.net>
Message-ID: <20071208220450.GA16858@stikine.ucs.sfu.ca>

On Sat, Dec 08, 2007 at 01:17:25PM -0800, Greg Lindahl wrote:
> On Sat, Dec 08, 2007 at 12:55:07PM -0800, Martin Siegert wrote:
> 
> > In short: the performance of OpenMP code isn't even close to that of the
> > MPI code.
> 
> Thanks for more concrete data, Martin, do you have a URL for some more
> detailed results? I'd love to have it next time this question comes
> up.

I'll post a note about our benchmarking work in a few weeks (the work
isn't quite finished yet) and ask for comments at that time.

Nevertheless, if you know of well performing OpenMP applications I am
happy to include them in our benchmark collection.

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
Academic Computing Services                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6


From deadline at eadline.org  Sun Dec  9 10:19:23 2007
From: deadline at eadline.org (Douglas Eadline)
Date: Sun, 9 Dec 2007 13:19:23 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
Message-ID: <46345.192.168.1.1.1197224363.squirrel@mail.eadline.org>


I like answering these types of questions with numbers,
so in my Sept 2007 Linux magazine column (which should
be showing up on the website soon) I did the following.

Downloaded the latest NAS benchmarks written in both
OpenMP and MPI. Ran them both on an 8 core Clovertown
(dual socket) system (multiple times) and reported
the following results:

Test      OpenMP              MPI
       gcc/gfortran 4.2    LAM 7.1.2
------------------------------------
CG         790.6             739.1
EP         166.5             162.8
FT        3535.9            2090.8
IS          51.1             122.5
LU        5620.5            5168.8
MG        1616.0            2046.2

My conclusion: it was a draw of sorts.
The article was basically looking at the
lazy assumption that threads (OpenMP) are
always better than MPI on an SMP machine.
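
The "draw of sorts" can be read off the table by taking per-benchmark ratios.
A small sketch, assuming the figures are rates where larger is better (a
follow-up in this thread gives the units as Mops); the dictionary layout and
variable names are mine.

```python
# NAS Parallel Benchmark figures from the table above: (OpenMP, MPI).
# Assumption: values are rates (Mops), so a larger number means faster.
results = {
    "CG": (790.6, 739.1),
    "EP": (166.5, 162.8),
    "FT": (3535.9, 2090.8),
    "IS": (51.1, 122.5),
    "LU": (5620.5, 5168.8),
    "MG": (1616.0, 2046.2),
}

# Ratio > 1 means the OpenMP version was faster, < 1 means MPI was faster.
ratios = {name: omp / mpi for name, (omp, mpi) in results.items()}
openmp_wins = [name for name, r in ratios.items() if r > 1.0]
mpi_wins = [name for name, r in ratios.items() if r < 1.0]

for name, r in ratios.items():
    print(f"{name}: OpenMP/MPI = {r:.2f}")
print("OpenMP faster:", openmp_wins)
print("MPI faster:   ", mpi_wins)
```

Four tests favour OpenMP and two favour MPI, but the two largest gaps (FT and
IS) point in opposite directions, which is what makes it a draw rather than a
win for either side.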

I'm going to re-run the tests using Harpertowns
real soon, maybe try other compilers and MPI
versions. It is easy to do. You can get the code here:

http://www.nas.nasa.gov/Resources/Software/npb.html

--
Doug


> On this list there is almost unanimous agreement that MPI is the way to go
> for parallelism and that combining multi-threading (MT) and
> message-passing
> (MP) is not even worth it, just sticking to MP is all that is necessary.
>
> However, in real-life most are talking and investing in MT while very few
> are interested in MP. I also just read on the blog of Arch Robison " TBB
> perhaps gives up a little performance short of optimal so you don't have
> to
> write message-passing " (here:
> http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/
>  )
>
> How come there is almost unanimous agreement in the beowulf-community
> while
> the rest is almost unanimous convinced of the opposite ? Are we just
> tapping
> ourselves on the back or is MP not sufficiently dissiminated or ... ?
>
> toon
>


--
Doug


From toon.knapen at gmail.com  Sun Dec  9 11:37:27 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Sun, 09 Dec 2007 20:37:27 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475AD37F.3040004@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
Message-ID: <475C43F7.5070908@gmail.com>

Geoff Jacobs wrote:
> 
> If an application is going to be using only shared memory, I have no
> doubt the consensus here is to use native threads or OpenMP. If the
> application is going to be working over a network, MPI (or PVM) is the
> way to go.


Threads may be better for SMP, but what about NUMA machines like the SGI Altix or HP
Integrity? On the latter, for instance, inter-cell (cell = 1 board with
4 CPUs) memory access suffers 200 times higher latency compared to
intra-cell memory access.
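
A back-of-the-envelope sketch of why that matters: if a fraction f of memory
accesses go to a remote cell, the mean latency in units of one local access is
(1 - f) + 200 f.  This takes the 200x figure above at face value and ignores
caches, prefetching and bandwidth, so treat it as an illustration only; the
function name is made up.

```python
REMOTE_PENALTY = 200.0  # inter-cell latency relative to intra-cell, as quoted above

def average_latency(remote_fraction, penalty=REMOTE_PENALTY):
    """Mean access latency in units of one local (intra-cell) access."""
    return (1.0 - remote_fraction) + penalty * remote_fraction

for f in (0.0, 0.01, 0.10, 0.50):
    print(f"{f:4.0%} remote -> {average_latency(f):6.2f}x local latency")
```

Under these assumptions even 1% remote references roughly triples the mean
latency.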

But the AMD (and future Intel) processors are NUMA too. With threads
you have no idea whether you are referencing remote memory; is that a good thing?

And considering that future processors are going even further in
the NUMA direction (e.g. the Intel 80-core), isn't it more future-proof
to go with MPI if one were to start a large coding project now?

thanks for all the reactions,

toon


From gdjacobs at gmail.com  Sun Dec  9 18:19:50 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Sun, 09 Dec 2007 20:19:50 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475C43F7.5070908@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
Message-ID: <475CA246.6000501@gmail.com>

Toon Knapen wrote:
> Geoff Jacobs wrote:
>>
>> If an application is going to be using only shared memory, I have no
>> doubt the consensus here is to use native threads or OpenMP. If the
>> application is going to be working over a network, MPI (or PVM) is the
>> way to go.
> 
> 
> If threads are better for SMP, but what about Numa like SGI Altix or HP
> Integrity ? On the latter for instance, inter-cell (cell = 1 board with
> 4 cpu's) memory access suffers 200 times higher latency compared to
> intra-cell memory-access.
> 
> But also the AMD (and future Intel) processors are Numa. With threads
> you have no idea about referencing remote memory but is that a good thing?
> 
> And considering that future processors are even going more extreme in
> the Numa direction (e.g. the Intel 80-core), is'nt it more future-safe
> to go with MPI if one would start a large coding-project now?
> 
> thanks for all the reactions,
> 
> toon
I do physics and computing. I am not a system programmer, so take this
with a grain of salt. However, it seems to me that a sufficiently
intelligent operating system would be able to attach the thread local
store to the local memory pool.

I've used a Superdome before (FORTRAN and HP MPI), so I know whereof you
speak, but MPI is not the general rule for software development in either
the Windows or the UNIX world. Taking on the complexity of MPI isn't very
popular when a simpler method works okay.

However, it seems like threads have taken a little beating lately in
favor of discrete address spaces in security-conscious system software
(most famously, DJB's stuff) which communicate with pipes and shm. And
lessons from QNX and BeOS tell us we might see some resurgence in the
use of message queues and other means of low level message passing in
the standard library.

-- 
Geoffrey D. Jacobs


From hahn at mcmaster.ca  Sun Dec  9 21:48:31 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 10 Dec 2007 00:48:31 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475CA246.6000501@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<475CA246.6000501@gmail.com>
Message-ID: <Pine.LNX.4.64.0712092130270.12457@coffee.psychology.mcmaster.ca>

> with a grain of salt. However, it seems to me that a sufficiently
> intelligent operating system would be able to attach the thread local
> store to the local memory pool.

yes, everyone does numa-local allocation, to some extent or other.
but it's far from obvious which node should actually be the home of 
particular data - most of the time, a "first touch" policy is used
(whichever node touches the page first gets to host it.)  attempting
to dynamically adjust this affinity later sounds very tricky to me,
since afaict it would require messing with TLB entries (slow).
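
a toy model of first touch, to show why the initialization pattern matters.
everything here is hypothetical (2 nodes, 8 pages, made-up function names);
it simulates the placement rule only and says nothing about any real kernel.

```python
# Toy model of a "first touch" NUMA page-placement policy.
# Hypothetical setup: 2 nodes, 8 pages, each access is a (page, node) pair.

def place_pages(touch_order):
    """Return {page: home_node}; the first node to touch a page hosts it."""
    home = {}
    for page, node in touch_order:
        home.setdefault(page, node)   # later touches never move the page
    return home

def remote_accesses(home, accesses):
    """Count accesses made from a node other than the page's home node."""
    return sum(1 for page, node in accesses if home[page] != node)

pages = range(8)
# Compute phase: node 0 works on pages 0-3, node 1 on pages 4-7.
compute = [(p, 0 if p < 4 else 1) for p in pages]

# Case A: a single thread on node 0 initializes every page first.
serial_init = [(p, 0) for p in pages]
home_a = place_pages(serial_init + compute)

# Case B: each node initializes the pages it will later work on.
home_b = place_pages(compute + compute)

print("serial init, remote accesses:  ", remote_accesses(home_a, compute))
print("parallel init, remote accesses:", remote_accesses(home_b, compute))
```

in case A all of node 1's pages end up hosted on node 0, so every one of its
accesses in the compute phase is remote; in case B none are.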

> the Windows and UNIX world. Using the complexity of MPI isn't very
> popular when a simpler method works okay.

if you just want a quick hack and are satisfied with modest speedup,
threading is great!  I wouldn't agree that MPI is more complex - 
compared to an application which has had the same level of tuning
and refactoring to reach a high level of scalability.  in other words,
once you undertake to scale a code to hundreds of CPUs, using threads 
won't give you greater simplicity.

> However, it seems like threads have taken a little beating lately in
> favor of discrete address spaces in security conscious system software
> (most famously, DJBs stuff) which communicate with pipes and shm. And

threads, of course, are antithetical to security, since the whole point
is freedom to read/write anything.  I wouldn't say threads are antithetical
to The Unix Way, necessarily, though TUW obviously emphasizes fast fork/exec,
pipes, etc.  sendmail is less unixy than postfix, for instance.

and bear in mind that there are clever optimizations (splice) to optimize
the performance of pipes.

> lessons from QNX and BeOS tell us we might see some resurgence in the
> use of message queues and other means of low level message passing in
> the standard library.

I don't think so.  doing RPC-like message-passing over untyped channels
would be the modern trend (SOA).


From christian.bell at qlogic.com  Sun Dec  9 23:16:18 2007
From: christian.bell at qlogic.com (Christian Bell)
Date: Sun, 9 Dec 2007 23:16:18 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475AD5C7.9040906@gmail.com>
References: <120720072215.3444.4759C5FD00014E8500000D742207003201089C040E99D20B9D0E080C079D@comcast.net>
	<475AD5C7.9040906@gmail.com>
Message-ID: <20071210071618.GF19236@mv.qlogic.com>

On Sat, 08 Dec 2007, Geoff Jacobs wrote:

> But isn't CAF (and UPC, and Titanium) implicitly message passing for a
> Beowulf anyway? It's attractive because it simplifies the process and
> might be able to optimize communication, but underneath everything it's
> still message passing.

It's message passing to the extent that two processes exchange
"messages" over a network -- but it isn't MPI message passing which
would mean receiver-directed matching and placement of data.  On
clusters with advanced network interfaces, the level of message
passing would translate into low-level RDMA operations whereas an SMP
would implement these "messages" as reads and writes to physically
addressable memory.

What's meant by "message" requires a definition -- one can argue that
invalidating a cache line means sending a "message" to the cache
controller, but it's far from what people usually think of as MPI-level
messages.  PGAS attempts to provide "better programmability" while
targeting low-level communication primitives that do not involve the
MPI-level message passing baggage (matching, two-sided, pairwise
synchronization).
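
A toy contrast of the two styles, in Python rather than MPI or a PGAS
language.  Both classes are hypothetical and there is no networking; the
sketch only shows where the matching step lives in the two-sided case and that
it is absent in the one-sided case.

```python
# Toy contrast between two-sided (MPI-style) and one-sided (PGAS/RDMA-style)
# transfers. Hypothetical classes; no real network or MPI library is involved.
from collections import deque

class TwoSidedEndpoint:
    """The receiver must ask for a message by (source, tag) to obtain it."""
    def __init__(self):
        self.arrived = deque()   # messages waiting for a matching receive

    def deliver(self, source, tag, payload):
        self.arrived.append((source, tag, payload))

    def recv(self, source, tag):
        # Matching step: scan for the first message with this (source, tag).
        for msg in list(self.arrived):
            if msg[0] == source and msg[1] == tag:
                self.arrived.remove(msg)
                return msg[2]
        raise LookupError("no matching message yet")

class OneSidedMemory:
    """The initiator names the target offset; the target does no matching."""
    def __init__(self, size):
        self.cells = [0] * size

    def put(self, offset, payload):
        self.cells[offset:offset + len(payload)] = payload

two = TwoSidedEndpoint()
two.deliver(source=3, tag=7, payload=[1, 2, 3])
received = two.recv(source=3, tag=7)      # receiver-side matching

one = OneSidedMemory(size=8)
one.put(offset=2, payload=[1, 2, 3])      # lands directly at the named offset

print(received, one.cells)
```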


    . . christian

-- 
christian.bell at qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)


From bencer at cauterized.net  Mon Dec 10 03:50:03 2007
From: bencer at cauterized.net (Jorge Salamero Sanz)
Date: Mon, 10 Dec 2007 12:50:03 +0100
Subject: [Beowulf] Request for comments: diskless cluster
Message-ID: <200712101250.03907.bencer@cauterized.net>


Hi all,

I'm going to move a 42-node Beowulf to diskless mode (currently all nodes have
local cloned installations).

Which system / tools do you recommend to manage the client images?

I was thinking of a debootstrapped directory shared as the NFS root. The differences
between the nodes (/etc/hostname, /etc/fstab, /etc/exports ...) could be
managed with unionfs.

Debian has a couple of tools that could help (live-helper for making custom
images) but maybe lessdisk would be more suitable. Which one do you use?

How do you manage this kind of cluster setup?

Thanks !


From Hakon.Bugge at scali.com  Mon Dec 10 05:34:04 2007
From: Hakon.Bugge at scali.com (=?iso-8859-1?Q?H=E5kon?= Bugge)
Date: Mon, 10 Dec 2007 14:34:04 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <200712092000.lB9K0BVj003880@bluewest.scyld.com>
References: <200712092000.lB9K0BVj003880@bluewest.scyld.com>
Message-ID: <20071210133405.7C95035AC61@mail.scali.no>

At Sun, 9 Dec 2007 13:19:23, "Douglas Eadline" <deadline at eadline.org> wrote:


>Test      OpenMP              MPI
>        gcc/gfortran 4.2    LAM 7.1.2
>------------------------------------
>FT        3535.9            2090.8

I have some suspicion that if you run this with one MPI-rank/thread,
you will find the OMP version to be significantly faster. Not that this
undermines your finding, but I thought it would complement the picture.


Hakon


From deadline at eadline.org  Mon Dec 10 11:52:25 2007
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 10 Dec 2007 14:52:25 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <46345.192.168.1.1.1197224363.squirrel@mail.eadline.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<46345.192.168.1.1.1197224363.squirrel@mail.eadline.org>
Message-ID: <56120.192.168.1.1.1197316345.squirrel@mail.eadline.org>

Some people had asked for more details:

NAS suite version 3.2.1
Test class was: B
Units are Mops (Million operations per second)
see the NAS docs for more information

--
Doug


> I like answering these types of questions with numbers,
> so in my Sept 2007 Linux magazine column (which should
> be showing up on the website soon) I did the following.
>
> Downloaded the latest NAS benchmarks written in both
> OpenMP and MPI. Ran them both on an 8 core Clovertown
> (dual socket) system (multiple times) and reported
> the following results:
>
> Test      OpenMP              MPI
>        gcc/gfortran 4.2    LAM 7.1.2
> ------------------------------------
> CG         790.6             739.1
> EP         166.5             162.8
> FT        3535.9            2090.8
> IS          51.1             122.5
> LU        5620.5            5168.8
> MG        1616.0            2046.2
>
> My conclusion, it was a draw of sorts.
> The article was basically looking at the
> lazy assumption that threads (OpenMP) are
> always better than MPI on a SMP  machine.
>
> I'm going to re-run the tests using Harpertowns
> real soon, maybe try other compilers and MPI
> versions. It is easy to do. You can get the code here:
>
> http://www.nas.nasa.gov/Resources/Software/npb.html
>
> --
> Doug
>
>> On this list there is almost unanimous agreement that MPI is the way to
>> go
>> for parallelism and that combining multi-threading (MT) and
>> message-passing
>> (MP) is not even worth it, just sticking to MP is all that is necessary.
>>
>> However, in real-life most are talking and investing in MT while very
>> few
>> are interested in MP. I also just read on the blog of Arch Robison " TBB
>> perhaps gives up a little performance short of optimal so you don't have
>> to
>> write message-passing " (here:
>> http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/
>>  )
>>
>> How come there is almost unanimous agreement in the beowulf-community
>> while
>> the rest is almost unanimous convinced of the opposite ? Are we just
>> tapping
>> ourselves on the back or is MP not sufficiently dissiminated or ... ?
>>
>> toon
>>
>
>
> --
> Doug


--
Doug


From wseas.headquarters at gmail.com  Sat Dec  8 10:46:54 2007
From: wseas.headquarters at gmail.com (Nikos Mastorakis)
Date: Sat, 8 Dec 2007 20:46:54 +0200
Subject: [Beowulf] WSEAS, Call for Papers
Message-ID: <5e0aeeab0712081046g7c4b0ac8t8401b6e7a80007f7@mail.gmail.com>

 CALL FOR PAPERS
=================

The annual convention and gathering of all the WSEAS entities (Working
Groups, Technical Committees, Editors,  Associate Editors, Research
Directors, Projects Coordinators,etc...) is held in July during the
CSCC.
http://www.wseas.org/conferences/2008/greece/

12th WSEAS CSCC Multiconference. Heraklion, Crete Island, Greece, July
22-25, 2008
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -

12th WSEAS Int. Conf. on CIRCUITS (July 22-24, 2008)
http://www.wseas.org/conferences/2008/greece/icc

12th WSEAS Int. Conf. on SYSTEMS (July 22-24, 2008)
http://www.wseas.org/conferences/2008/greece/ics

12th WSEAS Int. Conf. on COMMUNICATIONS (July 23-25, 2008)
http://www.wseas.org/conferences/2008/greece/iccom

12th WSEAS Int. Conf. on COMPUTERS (July 23-25, 2008)
http://www.wseas.org/conferences/2008/greece/iccomp

Heraklion, Crete, Greece, July 22-24, 2008


ENGINEERING EDUCATION (EE'08)
http://www.wseas.org/conferences/2008/greece/education

Best Regards

In 2006, the CSCC Multiconference received 1302 papers and approved
623 papers which was the maximum number of papers in its brilliant
history. In 2007, the organizers did not give extension in the
deadline and the accepted papers were approximately 550.


Prof. Nikos E. Mastorakis
www.wseas.org/mastorakis

From johannesrs at gmail.com  Sat Dec  8 13:50:29 2007
From: johannesrs at gmail.com (Jones de Andrade)
Date: Sat, 8 Dec 2007 18:50:29 -0300
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071208205507.GA16760@stikine.ucs.sfu.ca>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
	<20071208205507.GA16760@stikine.ucs.sfu.ca>
Message-ID: <54e4355e0712081350g530f2a1s6e556e7ab9112829@mail.gmail.com>

Hi all.

I usually just lurk on this list, but this discussion really
caught my attention.

Could someone explain a bit more clearly *why* OpenMP would be such a bad
performer in comparison to MPI?

Moreover, concerning your tests, Dr. Siegert, could you please show us a bit
more? I mean, for example, the scaling you observed as the
number of cores increases.

I really don't immediately understand how OpenMP can perform so much worse than
MPI on an SMP machine, given that not having to communicate with the
other 31 cores (in the CPMD case, all the huge matrices that would have to be
exchanged) should at least make things a bit easier.

But all of that is from the point of view of a young amateur.  ;)

Thanks a lot in advance,

Jones

On Dec 8, 2007 5:55 PM, Martin Siegert <siegert at sfu.ca> wrote:

> Over the last months I have done quite a bit of benchmarking of
> applications. One of the aspects we are interested in is the performance
> of applications that are available in MPI, OpenMP and hybrid versions.
> So far we looked at WRF and CPMD; we'll probably look at POP as well.
>
> MPI vs. OpenMP on a SMP (64 core Power5):
> walltime for cpmd benchmark on 32 cores:
> MPI: 93.13s   OpenMP: 446.86s
>
> Results for WRF on the same platform are similar.
> In short: the performance of OpenMP code isn't even close to that of the
> MPI code.
>
> We also looked at the hybrid version of these codes on clusters.
> The difference in run times are in the 1% range - less than the
> accuracy of the measurement.
>
> Thus, if you have the choice, why would you even look at anything other
> than MPI? Even if the programming effort for OpenMP is lower,
> the performance penalty is huge.
>
> That's my conclusion drawn from the cases we've looked at.
> If anybody knows of applications where the OpenMP performance comes close
> to the MPI performance and of applications where the hybrid performance
> is significantly better than the pure MPI performance, then I would
> love to hear from you. Thanks!
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid Site Lead
> Academic Computing Services                phone: 778 782-4691
> Simon Fraser University                    fax:   778 782-4242
> Burnaby, British Columbia                  email: siegert at sfu.ca
> Canada  V5A 1S6
>
> On Fri, Dec 07, 2007 at 10:56:14PM -0600, Gerry Creager wrote:
> > WRF has been under development for 10 years.  It's got an OpenMP flavor,
> > an MPI flavor and a hybrid one.  We still don't have all the bugs worked
> > out of the hybrid so that it can handle large, high resolution domains
> > without being slower than the MPI version.  And, yeah, the OpenMP geeks
> > working on this... and the MPI folks, are good.
> >
> > Hybrid isn't easy and isn't always foolproof.  And, as another thought,
> > OpenMP isn't always the best solution to the problem.
> >
> > gerry
> >
> > richard.walsh at comcast.net wrote:
> > > -------------- Original message ----------------------
> > >From: Toon Knapen <toon.knapen at gmail.com>
> > >>Greg Lindahl wrote:
> > >>>In real life (i.e. not HPC), everyone uses message passing between
> > >>>nodes.  So I don't see what you're getting at.
> > >>>
> > >>Many on this list suggest that using multiple MPI-processes on one and
> > >>the same node is superior to MT approaches IIUC. However I have the
> > >>impression that almost the whole industry is looking into MT to
> benefit
> > >>from multi-core without even considering message-passing. Why is that
> so?
> > >
> > >I think what Greg and others are really saying is that if you have to
> use
> > >a distributed memory
> > >model (MPI) as a first order response to meet your scalability
> > >requirements, then
> > >the extra coding effort and complexity required to create a hybrid code
> > >may not be
> > >a good performance return on your investment.  If on the other hand you
> > >only
> > >need to scale within a singe SMP node (with cores and sockets on a
> single
> > >board growing in number, this returns more performance than in the
> past),
> > >then you
> > >may be able to avoid using MPI and chose a simpler model like OpenMP.
>  If
> > >you
> > >have already written an efficient MPI code,  then (with some
> exceptions)
> > >the performance-gain divided by the hybrid coding-effort may seem
> small.
> > >
> > >Development in an SMP environment is easier.  I know of a number of
> sights
> > >that work this way.  The experienced algorithm folks work up the code
> in
> > >OpenMP on say an SGI Altix or Power6 SMP, then they get a dedicated MPI
> > >coding expert to convert it later for scalable production operation on
> a
> > >cluster.
> > >In this situation, they do end up with hybrid versions in some cases.
>  In
> > >non-HPC
> > >or smaller workgroup contexts your production code may not need to be
> > >converted.
> > >
> > >Cheers,
> > >
> > >rbw
> > >
> > >--
> > >
> > >"Making predictions is hard, especially about the future."
> > >
> > >Niels Bohr
> > >
> > >--
> > >
> > >Richard Walsh
> > >Thrashing River Consulting--
> > >5605 Alameda St.
> > >Shoreview, MN 55126
> > >
> > >Phone #: 612-382-4620
> > >
> > >_______________________________________________
> > >Beowulf mailing list, Beowulf at beowulf.org
> > >To change your subscription (digest mode or unsubscribe) visit
> > >http://www.beowulf.org/mailman/listinfo/beowulf
> >
> > --
> > Gerry Creager -- gerry.creager at tamu.edu
> > Texas Mesonet -- AATLT, Texas A&M University
> > Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
> > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid Site Lead
> Academic Computing Services                phone: 778 782-4691
> Simon Fraser University                    fax:   778 782-4242
> Burnaby, British Columbia                  email: siegert at sfu.ca
> Canada  V5A 1S6

From examachine at gmail.com  Sun Dec  9 15:45:25 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Mon, 10 Dec 2007 01:45:25 +0200
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475C43F7.5070908@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
Message-ID: <320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>

On Dec 9, 2007 9:37 PM, Toon Knapen <toon.knapen at gmail.com> wrote:

> And considering that future processors are going even more extreme in
> the NUMA direction (e.g. the Intel 80-core), isn't it more future-safe
> to go with MPI if one would start a large coding-project now?
>
> thanks for all the reactions,

I think that's a good point. For NUMA obviously MPI is more useful.

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://www.cs.bilkent.edu.tr/~erayo  Malfunct: http://myspace.com/malfunct
ai-philosophy: http://groups.yahoo.com/group/ai-philosophy


From rssr at lncc.br  Mon Dec 10 01:39:41 2007
From: rssr at lncc.br (Renato S. Silva)
Date: Mon, 10 Dec 2007 09:39:41 +0000
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <46345.192.168.1.1.1197224363.squirrel@mail.eadline.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<46345.192.168.1.1.1197224363.squirrel@mail.eadline.org>
Message-ID: <475D095D.4090001@lncc.br>

Hi

What CLASS are you using?

One point is how the OpenMP version deals with memory, compared with the
MPI version.

I have two suggestions:

It would be a good idea to try the Intel compiler, and
why not download and run the multi-zone NAS benchmarks?
They can give a more complete view of the "problem".

They have:

>    NPB3.2-MZ-SER:  a serial version
>    NPB3.2-MZ-MPI:  a hybrid MPI + OpenMP version
>    NPB3.2-MZ-SMP:  a hybrid SMP + OpenMP version

I ran them on a very small cluster (4 Core 2 Duo machines) using the Intel
compilers, and for CLASS=C I get these numbers for the MPI version and for
the MPI + OpenMP version with one (Hybrid (1)) and two (Hybrid (2)) threads:

N. Processors       MPI     Hybrid (1)   Hybrid (2)
      1           2496.59       *          1682.07
      2           1085.58       *           846.82
      4            624.69     674.28        498.65
      8            447.7      467.79         *

Unfortunately they don't have an OpenMP-only version, and I don't know
whether the results with one MPI process and several threads would be useful.
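
As a rough reading aid for the table above, the short sketch below computes how the hybrid runs compare with pure MPI at the same process count. It assumes the figures are run times where lower is better (the post does not state the unit), and the variable and function names are made up for illustration.

```python
# Timings copied from the CLASS=C table above; None marks a run not reported ("*").
# Assumption: values are run times (lower is better); the unit is not stated.
timings = {
    # processes: (MPI, hybrid with 1 thread, hybrid with 2 threads)
    1: (2496.59, None, 1682.07),
    2: (1085.58, None, 846.82),
    4: (624.69, 674.28, 498.65),
    8: (447.7, 467.79, None),
}


def ratio_vs_mpi(table):
    """Return {processes: (hybrid1/MPI, hybrid2/MPI)}; a value below 1 beats MPI."""
    out = {}
    for procs, (mpi, hyb1, hyb2) in table.items():
        out[procs] = tuple(None if t is None else t / mpi for t in (hyb1, hyb2))
    return out


def fmt(r):
    """Format a ratio, or a star for a run that was not reported."""
    return "   *  " if r is None else f"{r:6.3f}"


ratios = ratio_vs_mpi(timings)
for procs in sorted(ratios):
    r1, r2 = ratios[procs]
    print(f"{procs:2d} processes: hybrid(1)/MPI = {fmt(r1)}   hybrid(2)/MPI = {fmt(r2)}")
```

If the first column counts MPI processes, the two-thread runs presumably occupy twice as many cores as the MPI run in the same row, so these ratios are not a like-for-like comparison per core.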

Note that these benchmarks use a different LU version from NPB2.0:
they changed the "memory access".


Renato


Douglas Eadline wrote:

>I like answering these types of questions with numbers,
>so in my Sept 2007 Linux magazine column (which should
>be showing up on the website soon) I did the following.
>
>Downloaded the latest NAS benchmarks written in both
>OpenMP and MPI. Ran them both on an 8 core Clovertown
>(dual socket) system (multiple times) and reported
>the following results:
>
>Test      OpenMP              MPI
>       gcc/gfortran 4.2    LAM 7.1.2
>------------------------------------
>CG         790.6             739.1
>EP         166.5             162.8
>FT        3535.9            2090.8
>IS          51.1             122.5
>LU        5620.5            5168.8
>MG        1616.0            2046.2
>
>My conclusion, it was a draw of sorts.
>The article was basically looking at the
>lazy assumption that threads (OpenMP) are
>always better than MPI on a SMP  machine.
>
>I'm going to re-run the tests using Harpertowns
>real soon, maybe try other compilers and MPI
>versions. It is easy to do. You can get the code here:
>
>http://www.nas.nasa.gov/Resources/Software/npb.html
>
>--
>Doug
>
>>On this list there is almost unanimous agreement that MPI is the way to go
>>for parallelism and that combining multi-threading (MT) and message-passing
>>(MP) is not even worth it, just sticking to MP is all that is necessary.
>>
>>However, in real-life most are talking and investing in MT while very few
>>are interested in MP. I also just read on the blog of Arch Robison "TBB
>>perhaps gives up a little performance short of optimal so you don't have
>>to write message-passing" (here:
>>http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/
>>)
>>
>>How come there is almost unanimous agreement in the beowulf-community
>>while the rest is almost unanimously convinced of the opposite? Are we
>>just patting ourselves on the back or is MP not sufficiently
>>disseminated or ... ?
>>
>>toon

From siegert at sfu.ca  Mon Dec 10 16:27:57 2007
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 10 Dec 2007 16:27:57 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <54e4355e0712081350g530f2a1s6e556e7ab9112829@mail.gmail.com>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
	<20071208205507.GA16760@stikine.ucs.sfu.ca>
	<54e4355e0712081350g530f2a1s6e556e7ab9112829@mail.gmail.com>
Message-ID: <20071211002757.GA28685@stikine.ucs.sfu.ca>

Hi Jones,

On Sat, Dec 08, 2007 at 06:50:29PM -0300, Jones de Andrade wrote:
> 
>    Hi all.
>    I usually just keep watching this list, but this discussion
>    really caught my attention.
>    Could someone clarify a bit better *why* OpenMP would be such a bad
>    performer in comparison to MPI?
>    Moreover, concerning your tests, Dr. Siegert, could you please show us
>    a bit more? I mean, for example, the scaling you observed when
>    increasing the number of cores.

As I said, I will post the full results in a few weeks, we are not quite
done yet.

>    I really don't immediately understand how OpenMP can perform so
>    much worse than MPI on an SMP machine, given the fact that not having
>    to communicate with the other 31 cores (in the cpmd case, all the huge
>    matrices that would have to be exchanged) should at least make things
>    a bit easier.

The way code is parallelized is different when using MPI and OpenMP.
OpenMP code usually uses loop-level parallelization, i.e., it is
very fine grained. In the cases that I have mentioned, MPI uses domain
decomposition, which is very coarse grained.
All I am saying is that for the codes that I have seen, the
MPI way of parallelizing the code is by far more efficient than
the OpenMP way. That does not mean that you could not use domain
decomposition with OpenMP; it just appears that this is not usually
done. I am speculating that this may be a consequence of the
(perceived?) easier way of programming using OpenMP. If you do
domain decomposition you probably end up writing code that looks very
similar to the MPI code even if you use OpenMP.
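
To make the contrast concrete, here is a small serial Python sketch. It is not taken from CPMD or WRF, and all function names are made up for illustration. smooth_global is the kind of single loop that loop-level OpenMP would split among threads; smooth_decomposed gives each hypothetical rank its own subdomain and lets it see its neighbours' data only through an explicit halo copy, which is where the MPI messages would go.

```python
def smooth_global(u):
    """One Jacobi smoothing step on the whole array; end points stay fixed.

    Loop-level parallelization would split the iterations of this one loop.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def split(u, parts):
    """Cut u into contiguous subdomains, one per (hypothetical) rank."""
    n = len(u)
    bounds = [n * p // parts for p in range(parts + 1)]
    return [u[bounds[p]:bounds[p + 1]] for p in range(parts)]


def smooth_decomposed(chunks):
    """The same step, but each subdomain only touches its own data plus halos."""
    # "Halo exchange": every rank receives one boundary value from each
    # neighbour. With MPI this would be sends and receives; here it is a copy.
    left_halo = [None] + [c[-1] for c in chunks[:-1]]
    right_halo = [c[0] for c in chunks[1:]] + [None]

    result = []
    for rank, local in enumerate(chunks):
        padded = [left_halo[rank]] + local + [right_halo[rank]]
        new_local = []
        for j in range(1, len(padded) - 1):
            left, right = padded[j - 1], padded[j + 1]
            if left is None or right is None:
                new_local.append(padded[j])        # global boundary: keep fixed
            else:
                new_local.append(0.5 * (left + right))
        result.append(new_local)
    return result


u = [float(i * i) for i in range(10)]
flat = [x for chunk in smooth_decomposed(split(u, 3)) for x in chunk]
print("decomposed result matches global result:", flat == smooth_global(u))
```

The decomposed version carries more bookkeeping, but every access to another subdomain's data is visible in the code, which is the property the MPI versions get by construction.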

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
Academic Computing Services                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

>    But all that from the eyes and view of a young "amateur".  ;)
>    Thanks a lot in advance,
>    Jones
>
>    On Dec 8, 2007 5:55 PM, Martin Siegert <siegert at sfu.ca> wrote:
>
>      Over the last months I have done quite a bit of benchmarking of
>      applications. One of the aspects we are interested in is the
>      performance of applications that are available in MPI, OpenMP and
>      hybrid versions.
>      So far we looked at WRF and CPMD; we'll probably look at POP as well.
>      MPI vs. OpenMP on a SMP (64 core Power5):
>      walltime for cpmd benchmark on 32 cores:
>      MPI: 93.13s   OpenMP: 446.86s
>      Results for WRF on the same platform are similar.
>      In short: the performance of OpenMP code isn't even close to that
>      of the MPI code.
>      We also looked at the hybrid version of these codes on clusters.
>      The differences in run times are in the 1% range - less than the
>      accuracy of the measurement.
>      Thus, if you have the choice, why would you even look at anything
>      other than MPI? Even if the programming effort for OpenMP is lower,
>      the performance penalty is huge.
>      That's my conclusion drawn from the cases we've looked at.
>      If anybody knows of applications where the OpenMP performance comes
>      close to the MPI performance and of applications where the hybrid
>      performance is significantly better than the pure MPI performance,
>      then I would love to hear from you. Thanks!
>      Cheers,
>      Martin
>
>      On Fri, Dec 07, 2007 at 10:56:14PM -0600, Gerry Creager wrote:
>      > WRF has been under development for 10 years.  It's got an OpenMP
>      > flavor, an MPI flavor and a hybrid one.  We still don't have all the
>      > bugs worked out of the hybrid so that it can handle large, high
>      > resolution domains without being slower than the MPI version.  And,
>      > yeah, the OpenMP geeks working on this... and the MPI folks, are good.
>      >
>      > Hybrid isn't easy and isn't always foolproof.  And, as another
>      > thought, OpenMP isn't always the best solution to the problem.
>      >
>      > gerry


From landman at scalableinformatics.com  Mon Dec 10 16:40:15 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Mon, 10 Dec 2007 19:40:15 -0500
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
Message-ID: <475DDC6F.3060007@scalableinformatics.com>

Eray Ozkural wrote:
> On Dec 9, 2007 9:37 PM, Toon Knapen <toon.knapen at gmail.com> wrote:
> 
>> And considering that future processors are going even more extreme in
>> the NUMA direction (e.g. the Intel 80-core), isn't it more future-safe
>> to go with MPI if one would start a large coding-project now?
>>
>> thanks for all the reactions,
> 
> I think that's a good point. For NUMA obviously MPI is more useful.

I have been staying out of the debate thus far, as I believe that it is 
more likely to generate heat than light.

A few obvious points:

a) single benchmarks do not a definitive statement make
b) the only code that matters is your code (really, this should be 
everyone's mantra with benchmarking in general).

12 years ago, after starting work at SGI, I had to work hard to convince 
people that a 75 MHz R8000 chip could actually be faster (e.g. lower 
wall clock time on real app with real data) than a 233 MHz (or whatever 
it was) Alpha chip.  It was "obvious" to most people that Alpha was 
faster.  That is, it was obvious from the cpu clock, various "standard" 
benchmark cases, and so forth ... until they ran their own codes, and 
saw some rather different results.

The point of this is that I see the same thing playing out here, with 
people's opinions and notes generating the heat.  I would prefer to try 
to shed a little light if possible, and keep the heat level as low as 
possible.

FWIW:  I have been using OpenMP for something like 11 years (pretty much 
since inception), and MPI since about 1997.  I have used both in 
projects with customers, end users, collaborators.  I have taught 
graduate level courses in HPC programming using both.

Generally speaking, I find scientists/engineers "get" OpenMP 
more easily than MPI.  They have to work less hard to get some benefit 
from OpenMP than from MPI.

The above statement I expect to generate a great deal of heat, which is 
a shame, as the next statement should generate a great deal of light.

This said, since OpenMP does stuff for you, you have to think and work 
harder to prevent the performance-killing conditions which can and often 
do show up in real code.  OpenMP lets you share data, and as you 
increase the number of CPUs sharing the data, the shared data often 
becomes the bottleneck.  Then again, with some careful re-crafting of 
the code (not a complete rewrite), it is entirely possible to mitigate 
many of the issues.  That is, OpenMP saves you from thinking hard to get 
some benefit, but you need to think hard to get good benefit on larger 
systems.  More about this in a minute.
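
A minimal sketch of the kind of re-crafting meant here, using Python threads as a stand-in for OpenMP threads (function names are invented for this example). The first version funnels every update through one shared, lock-protected total; the second keeps a private accumulator per thread and combines the partial results once at the end, which is roughly what an OpenMP reduction clause arranges. CPython's global interpreter lock means this will not reproduce the speed difference; it only shows the structure.

```python
import threading


def sum_shared(data, nthreads):
    """Every update goes through one lock-protected shared total."""
    total = 0.0
    lock = threading.Lock()

    def worker(tid):
        nonlocal total
        for x in data[tid::nthreads]:
            with lock:              # all threads contend here on every element
                total += x

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def sum_private(data, nthreads):
    """Each thread accumulates privately; results are combined once at the end."""
    partial = [0.0] * nthreads

    def worker(tid):
        local = 0.0                 # thread-private accumulator
        for x in data[tid::nthreads]:
            local += x
        partial[tid] = local        # a single shared write per thread

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


# Integer-valued floats, so both summation orders give exactly the same result.
data = [float(i) for i in range(10000)]
print(sum_shared(data, 4) == sum_private(data, 4))
```

In a compiled OpenMP code the private version also removes the per-element synchronization, which is where the shared-data bottleneck described above tends to come from.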

MPI is harder (though some may disagree).  You have to rewrite and 
rethink your code.  While this is harder, this is also a good thing.  It 
forces you to explicitly consider data locality issues (NUMA is an 
example of a data locality hierarchy) which OpenMP does not explicitly 
force you to consider.  It forces you to avoid global data, and all the 
pain that goes with it (false sharing, atomic updates, ...).  It forces 
you to explicitly move data.

Also, unlike OpenMP, the communication model can be easily matched to 
the underlying problem.  Which tends to mean a tighter coupling of the 
computing resource to the algorithm.  OpenMP is a bag-o-threads, and you 
don't have an "explicit" communication pattern between threads.

I don't consider one "better" than the other for all problems.  For 
certain classes of problem, OpenMP is the logical and obvious choice, 
while MPI is the logical and obvious choice for other classes.  Aside 
from this, without channeling an ex-US president, we need to define what 
"better" means.  Faster execution on model problems?  Faster 
benchmarking?  Faster development, ease of code 
testing/debugging/management?

I do agree with Greg in that I have not to date seen a code where the 
hybrid model is better than the pure model.

Back to Eray's point.

For NUMA, you have a small set of data points which show that MPI 
provides superior performance on a code.  The question is whether or not 
the OpenMP code used first-touch or similar allocation ... without more 
information, it is fairly hard to draw conclusions, never mind general 
conclusions.  Large SGI machines have gobs of NUMA shared memory, and 
you can get very good scalability with (non-trivial) OpenMP codes.

What we see going forward are desktops with 4-16 cores (biased as this 
is what we are doing/selling) and a shared memory system.  NUMA for AMD, 
flat (non-NUMA) for Intel.  Intel is going to NUMA as far as I have seen 
at SC07 and elsewhere (and Intel folks, please do step in and let me 
know if I am wrong).  A well written OpenMP code, that knows how to use 
memory correctly, should be able to exploit these multiple memory buses 
without too many issues.  The streams code is an example of a "trivial" 
(sorry John) code which operates in OpenMP very nicely.

There are others.  A fair number of commercial codes with large solvers 
don't do decomposition very well, and tend not to have great MPI 
versions, or not so great MPI scalability.  They do shared memory quite 
nicely, and will scale well on large processor count machines with lots 
of memory buses (MSC/NASTRAN, various other similar codes, ...).

What I am belaboring here is that it is *not* obvious at all that 
one or the other method is "better" in a general sense (because 
"better" is not well defined to begin with in this context).

Our view has always been use what you are comfortable with, and what you 
need.  If you need to run across a cluster, use MPI.  If you need to run 
across a single large memory machine, use OpenMP.

FWIW:  I would suggest learning both.  With the advent of many-core 
workstations, and accelerator systems with many many cores, programming 
these things is more likely to be mediated by a compiler (OpenMP like) 
than putting MPI stacks on the Cell SPUs (not enough local scratchpad 
ram for it).

Just my $0.02, and I hope I generated light, and very little heat.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From lindahl at pbm.com  Mon Dec 10 16:46:31 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 10 Dec 2007 16:46:31 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <54e4355e0712081350g530f2a1s6e556e7ab9112829@mail.gmail.com>
References: <120820070351.17643.475A14BB000E8190000044EB2200735446089C040E99D20B9D0E080C079D@comcast.net>
	<475A23EE.4020302@tamu.edu>
	<20071208205507.GA16760@stikine.ucs.sfu.ca>
	<54e4355e0712081350g530f2a1s6e556e7ab9112829@mail.gmail.com>
Message-ID: <20071211004630.GA13308@bx9.net>

On Sat, Dec 08, 2007 at 06:50:29PM -0300, Jones de Andrade wrote:

> Could someone clarify a bit better *why* would openMP be such a bad
> performer in comparison to MPI?

MPI always gets locality right.

I learned this lesson on the BBN Butterfly in the early 90s; it's
funny that it's still true.

-- greg


From hahn at mcmaster.ca  Mon Dec 10 18:01:30 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Mon, 10 Dec 2007 21:01:30 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475DDC6F.3060007@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
Message-ID: <Pine.LNX.4.64.0712102029270.28977@coffee.psychology.mcmaster.ca>

> Generally speaking, I find scientists/engineers generally "get" OpenMP more 
> easily than MPI.  They have to work less hard to get some benefit from OpenMP 
> than MPI.
>
> This above statement I expect to generate great deals of heat, which is a

I don't think so many would disagree.  OpenMP presents a programming model
that, for simple codes, is very close to plain old serial.  getting "some"
benefit is very easy.

the issue, though, is whether it's practical and efficient to mix both.
I think the answer is no - sort of a corollary of the following:

 	"Debugging is twice as hard as writing the code in the first place.
 	Therefore, if you write the code as cleverly as possible, you are,
 	by definition, not smart enough to debug it."  (Brian W Kernighan)

doing just OpenMP or MPI well is hard enough - at the edge of most people's 
ability.  debugging it is therefore beyond their ability ;)

> channeling an ex-US president, we need to define what "better" means.  Faster 
> execution on model problems?  Faster benchmarking?  Faster development, ease 
> of code testing/debugging/management?

in my world, a code (or trivial variants) is run many times, and consumes a
lot of CPU resources, so I encourage people to write efficient code even if it
takes more effort.  the relative value of the compute hardware is different
in, for instance, an engineering company.

> What we see going forward are desktops with 4-16 cores (biased as this is 
> what we are doing/selling) and a shared memory system.

I'm skeptical about how quickly the market will reward 8-core chips.

> (non-NUMA) for Intel.  Intel is going to NUMA as far as I have seen at SC07 
> and elsewhere (and Intel folks, please do step in and let me know if I am

there's really no choice: you simply can't scale a flat memory system.

> issues.  The streams code is an example of a "trivial" (sorry John) code 
> which operates in OpenMP very nicely.

it's embarrassingly parallel, that's all.  I don't think John would disagree.
of course, stream-in-MPI scales even better ;)


From lindahl at pbm.com  Mon Dec 10 22:36:03 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Mon, 10 Dec 2007 22:36:03 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475DDC6F.3060007@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
Message-ID: <20071211063603.GA18419@bx9.net>

On Mon, Dec 10, 2007 at 07:40:15PM -0500, Joe Landman wrote:

> b) the only code that matters is your code (really, this should be 
> everyone's mantra with benchmarking in general).

Well, the problem here is that the question most people are asking is,
"how should I parallelize my code?" This question gets asked before
you know the performance on your code.

So the mantra doesn't help.

-- greg


From deadline at eadline.org  Tue Dec 11 05:17:19 2007
From: deadline at eadline.org (Douglas Eadline)
Date: Tue, 11 Dec 2007 08:17:19 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071211063603.GA18419@bx9.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
Message-ID: <60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>

This is indeed the issue. Where to invest time?

My opinion, and it is only my opinion, is the following.
Please share your own.

Threaded approaches do not scale across clusters. The memory
architecture of multi-core is making nodes look more like
small clusters i.e. memory is becoming more localized.
As Don Becker mentioned in a recent post, efforts to program
distributed memory as if it were shared memory often end
up looking like stylized message passing systems.

One other thing about messages. The problem of
trying to optimize the compute to communication issue is
easier than trying to optimize the compute to locality
issue.

Therefore, if I were to start a new parallel project of some sort
or parallelize an existing code, I would use MPI. Although
OpenMP might get me up and running quicker, I would feel more
comfortable with a problem cast in MPI.

I'm interested in others' opinions on this because I think it
is an important issue for the general programming audience
and not just us cluster geeks. The difference is we have had
a lot more time and experience with this stuff.

--
Doug


> On Mon, Dec 10, 2007 at 07:40:15PM -0500, Joe Landman wrote:
>
>> b) the only code that matters is your code (really, this should be
>> everyone's mantra with benchmarking in general).
>
> Well, the problem here is that the question most people are asking is,
> "how should I parallelize my code?" This question gets asked before
> you know the performance on your code.
>
> So the mantra doesn't help.
>
> -- greg


From landman at scalableinformatics.com  Tue Dec 11 05:52:11 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Dec 2007 08:52:11 -0500
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071211063603.GA18419@bx9.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
Message-ID: <475E960B.4070509@scalableinformatics.com>

Greg Lindahl wrote:
> On Mon, Dec 10, 2007 at 07:40:15PM -0500, Joe Landman wrote:
> 
>> b) the only code that matters is your code (really, this should be 
>> everyone's mantra with benchmarking in general).
> 
> Well, the problem here is that the question most people are asking is,
> "how should I parallelize my code?" This question gets asked before
> you know the performance on your code.
> 
> So the mantra doesn't help.

On the contrary, it is precisely because people are asking "how should I 
parallelize" that they need to ask the basic question of "where does my 
code spend time for my problems."

Most people I know working on parallelization or optimization have at 
least asked that question.

Starting parallelization without this knowledge in advance is an 
exercise pretty much guaranteed to fail.  Unless you happen to be an 
extremely lucky guesser.
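
One way to see why the profile has to come first is Amdahl's law: the
share of runtime spent in the region you parallelize caps the speedup,
however many cores you add. A minimal sketch, assuming a profiler has
already given you that share (the function name and the 0.50 / 0.95
fractions are made up for illustration, not measurements):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the runtime is parallelized.

    parallel_fraction: share of the serial runtime spent in the code being
                       parallelized (0.0 to 1.0), as measured by a profiler.
    workers:           number of cores/processes applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical profiles: a hot spot that is 50% vs. 95% of the runtime.
    for fraction in (0.50, 0.95):
        for workers in (4, 32):
            bound = amdahl_speedup(fraction, workers)
            print(f"fraction={fraction:.2f} workers={workers:2d} "
                  f"speedup<={bound:.2f}")
```

If the hot spot is only half the runtime, 32 workers cannot get you to
2x, which is the kind of thing you want to know before picking MPI,
OpenMP, or anything else.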


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From hahn at mcmaster.ca  Tue Dec 11 06:28:17 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 11 Dec 2007 09:28:17 -0500 (EST)
Subject: [Beowulf] ever heard of ScaleMP?
Message-ID: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>

there's a company, ScaleMP, which seems to be selling some kind of 
kit which enables fairly large shared-memory x86_64 systems.
their website is nearly useless (http://www.scalemp.com/), but a little
more info can be had from SGI, which apparently uses ScaleMP for their
f1200 product (rebadged Ciara?).

they claim to support up to 32 sockets, 512GB memory.  "Versatile SMP".

SGI seems to aim it purely at structural/cfd/crash sims - 
mainly using Abaqus and related tools.

as far as I can tell, the 32s configuration is 4x 8-socket boxes,
each with 7x 1 Gb links to their peers.  seems to claim that it runs 
unmodified rh/fc/suse systems.  marketing docs claim that the secret
sauce is bios firmware, and mention that "at least 10%" of the memory
is "reserved" for system cache.  I'm not sure how the bios is involved,
but it sounds like a pretty generic network-shared-memory system,
which would be OK for uncontended pages, but would thrash once procs
wait for ~80 us to reference a remote page...
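
A rough way to put numbers on the thrashing concern, assuming a simple
two-level model in which some fraction of references have to fetch a
remote page. The 80 us figure is the estimate above; the 0.1 us local
latency, the function name, and the sample fractions are assumptions
for illustration only:

```python
def mean_access_time_us(remote_fraction, local_us=0.1, remote_us=80.0):
    """Average cost of one memory reference in microseconds.

    remote_fraction: share of references that must fetch a remote page.
    local_us:        assumed cost of a local reference (hypothetical value).
    remote_us:       cost of a remote page reference (~80 us per the text).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_us + remote_fraction * remote_us


if __name__ == "__main__":
    for fraction in (0.0, 0.001, 0.01, 0.1):
        cost = mean_access_time_us(fraction)
        print(f"remote fraction {fraction:5.3f}: {cost:7.3f} us per reference")
```

Under those assumptions, even 1% remote references pushes the average
to about 0.9 us, roughly nine times the assumed local cost.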

I would be most grateful if anyone has experience or knowledge of what
ScaleMP actually does.

thanks, mark hahn.


From landman at scalableinformatics.com  Tue Dec 11 06:56:00 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Dec 2007 09:56:00 -0500
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
Message-ID: <475EA500.9030709@scalableinformatics.com>

Mark Hahn wrote:
> there's a company, ScaleMP, which seems to be selling some kind of kit 
> which enables to fairly large shared-memory x86_64 systems.
> their website is nearly useless (http://www.scalemp.com/), but a little
> more info can be had from SGI, which apparently uses ScaleMP for their
> f1200 product (rebadged Ciara?).
> 
> they claim to support up to 32 sockets, 512GB memory.  "Versatile SMP".

The signage at SC07 claimed up to 1 TB ram.  Works with Intel and AMD.

> SGI seems to aim it purely at structural/cfd/crash sims - mainly using 
> Abaqus and related tools.

Yup.  Big single memory/system image machines.

> as far as I can tell, the 32s configuration is 4x 8-socket boxes,
> each with 7x 1 Gb links to their peers.  seems to claim that it runs 
> unmodified rh/fc/suse systems.  marketing docs claim that the secret
> sauce is bios firmware, and mention that "at least 10%" of the memory

Yup.  This is about right.

> is "reserved" for system cache.  I'm not sure how the bios is involved,
> but it sounds like a pretty generic network-shared-memory system,
> which would be OK for uncontended pages, but would thrash once procs
> wait for ~80 us to reference a remote page...

They really want you to use a fast network connection ... think IB or 
similar.

This is BTW quite similar to what Panta was doing when it was alive.

> 
> I would be most grateful if anyone has experience or knowlege of what
> ScaleMP actually does.

Build big SMPs out of little SMPs.  Shared memory, NUMA, need to 
localize pages and access.  But you can allocate several hundred gigs if 
you want/need to.

The boxen are commodity.  Shai Fultheim (CTO) occasionally posts on LKML 
and other places (may lurk here).  Technology is neat, allows you to 
aggregate smaller units into larger units.  Not on the fly, but on boot 
(cold boot).


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From gdjacobs at gmail.com  Tue Dec 11 07:09:34 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Tue, 11 Dec 2007 09:09:34 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>
	<60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>
Message-ID: <475EA82E.2030101@gmail.com>

Douglas Eadline wrote:
> This is indeed the issue. Where to invest time?
> 
> My opinion, and it is only my opinion, is the following.
> Please share your own.
> 
> Threaded approaches do not scale across clusters. The memory
> architecture of multi-core is making nodes look more like
> small clusters i.e. memory is becoming more localized.
> As Don Becker mentioned in a recent post, efforts to program
> distributed memory like it were shared memory often end
> up looking like stylized message passing systems.
> 
> One other thing about messages. The problem of
> trying to optimize the compute to communication issue is
> easier than trying to optimize the compute to locality
> issue.
> 
> Therefore, if I were to start a new parallel project of some sort
> or parallelize an existing code, I would use MPI. Although
> OpenMP might get me up and running quicker, I would feel more
> comfortable with a problem cast in MPI.
> 
> I'm interested in others' opinions on this because I think it
> is an important issue for the general programming audience
> and not just us cluster geeks. The difference is we have had
> a lot more time and experience with this stuff.
> 
> --
> Doug

Please note that we are all speaking as developers on clusters. Although
this is a valuable niche in the market, it is still only a niche. Even
with workstations, systems still only come with a few cores. In this
regime, threads are still relevant from a performance standpoint.
Furthermore, threads (of whichever flavor) are a first class API in most
operating systems. Until Joe Sixpack can buy Windows ZX (or whatever)
with the new message passing system preloaded, the old system which is
preinstalled will still be the development target of most ISVs.

-- 
Geoffrey D. Jacobs


From laytonjb at charter.net  Tue Dec 11 07:19:07 2007
From: laytonjb at charter.net (laytonjb at charter.net)
Date: Tue, 11 Dec 2007 7:19:07 -0800
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
Message-ID: <20071211101907.9OFM9.11486.root@fepweb09>

---- Mark Hahn <hahn at mcmaster.ca> wrote: 
> there's a company, ScaleMP, which seems to be selling some kind of 
> kit which enables to fairly large shared-memory x86_64 systems.
> their website is nearly useless (http://www.scalemp.com/), but a little
> more info can be had from SGI, which apparently uses ScaleMP for their
> f1200 product (rebadged Ciara?).
> 
> 
> SGI seems to aim it purely at structural/cfd/crash sims - 
> mainly using Abaqus and related tools.

Abaqus is now MPI capable (the first of the implicit FEM codes that I know
of). So ScaleMP isn't needed for the newer version of Abaqus.

Jeff


From laytonjb at charter.net  Tue Dec 11 07:21:46 2007
From: laytonjb at charter.net (laytonjb at charter.net)
Date: Tue, 11 Dec 2007 7:21:46 -0800
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <475EA500.9030709@scalableinformatics.com>
Message-ID: <20071211102146.UJ68P.11651.root@fepweb09>

---- Joe Landman <landman at scalableinformatics.com> wrote: 
> Mark Hahn wrote:
> > there's a company, ScaleMP, which seems to be selling some kind of kit 
> > which enables to fairly large shared-memory x86_64 systems.
> > their website is nearly useless (http://www.scalemp.com/), but a little
> > more info can be had from SGI, which apparently uses ScaleMP for their
> > f1200 product (rebadged Ciara?).
> > 
> > they claim to support up to 32 sockets, 512GB memory.  "Versatile SMP".
> 
> The signage at SC07 claimed up to 1 TB ram.  Works with Intel and AMD.
> 
> > SGI seems to aim it purely at structural/cfd/crash sims - mainly using 
> > Abaqus and related tools.
> 
> Yup.  Big single memory/system image machines.
> 
> > as far as I can tell, the 32s configuration is 4x 8-socket boxes,
> > each with 7x 1 Gb links to their peers.  seems to claim that it runs 
> > unmodified rh/fc/suse systems.  marketing docs claim that the secret
> > sauce is bios firmware, and mention that "at least 10%" of the memory
> 
> Yup.  This is about right.
> 
> > is "reserved" for system cache.  I'm not sure how the bios is involved,
> > but it sounds like a pretty generic network-shared-memory system,
> > which would be OK for uncontended pages, but would thrash once procs
> > wait for ~80 us to reference a remote page...
> 
> They really want you to use a fast network connection ... think IB or 
> similar.

Flextronics was showing a small cluster where they had 4 boxes connected
by IB and within each box they had 4 systems connected by IB. They were
running ScaleMP on it. They had a graph of running Stream on top of the
system. They were plotting bandwidth vs. number of cores and it was fairly
linear (I didn't get a close look at it).

Jeff


From landman at scalableinformatics.com  Tue Dec 11 07:40:57 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Dec 2007 10:40:57 -0500
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475EA82E.2030101@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>
	<475EA82E.2030101@gmail.com>
Message-ID: <475EAF89.3050209@scalableinformatics.com>

Geoff Jacobs wrote:

> Please note that we are all speaking as developers on clusters. Although
> this is a valuable niche in the market, it is still only a niche. Even
> with workstations, systems still only come with a few cores. In this
> regime, threads are still relevant from a performance standpoint.
> Furthermore, threads (of whichever flavor) are a first class API in most
> operating systems. Until Joe Sixpack can buy Windows ZX (or whatever)
> with the new message passing system preloaded, the old system which is
> preinstalled will still be the development target of most ISVs.

This historical inertia is a killer.  I have heard that lots of windows 
ISVs had balked at doing 64 bit CCS ports due to the fact that their 
code already ran on windows.  This is one of the problems SGI faced with 
32 bit code during the 32->64 bit transition.  Many users of AMD 
Opterons in 2004-2005 were completely unaware of any performance 
advantage they got, effectively for free, by recompiling their code for 
64 bit vs 32 bit.  We still have many customers happily using 32 bit 
software (pre-compiled) on 64 bit hardware and OSes, as it is a path of 
least resistance.

With OpenMP now part of gcc (as of 4.2, and there should be a nice 
little article about using this coming out soon ... cough cough) I would 
expect to see a great deal more interest in using it for multi-core 
programming from people with serial codes or codes that could 
potentially take advantage of multiple threads.

Large cluster programming will always need an MPI or MPI-like system. 
Small SMP programming might have easier to use alternatives that are 
"good enough".  That "good enough" factor is not one to be discounted 
lightly, you do so at your own peril.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From i.kozin at dl.ac.uk  Tue Dec 11 07:44:14 2007
From: i.kozin at dl.ac.uk (Kozin, I (Igor))
Date: Tue, 11 Dec 2007 15:44:14 -0000
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071211002757.GA28685@stikine.ucs.sfu.ca>
Message-ID: <CC86E164E4091C4D88A62DADB4AC647903A5D22B@exchange06.fed.cclrc.ac.uk>

I'm no expert in CPMD but
- AFAIK CPMD is not meant to run as pure OpenMP; OpenMP parallelization is auxiliary to MPI
- in the paper "Dual-level parallelism for ab initio molecular dynamics: Reaching teraflop performance with the CPMD code" the developers (Jürg Hutter and Alessandro Curioni) compare performance of pure MPI and mixed MPI+OpenMP; while pure MPI wins on small problem sizes, MPI+OpenMP is more efficient on the largest problem.
Now you can argue whether one could create a more efficient pure MPI code or not, or whether there was a problem with the IBM pSeries 690 cluster, but it seems quite obvious that in order to achieve good scaling on 1000s of cores the programming difficulties are bound to advance to a new level and complexity will generally grow.


-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Martin Siegert
Sent: 11 December 2007 00:28
To: Jones de Andrade
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] multi-threading vs. MPI

Hi Jones,

On Sat, Dec 08, 2007 at 06:50:29PM -0300, Jones de Andrade wrote:
> 
>    Hi all.
>    I usually just keep on looking on this list, but this discussion
>    really caught my attention.
>    Could someone clarify a bit better *why* OpenMP would be such a bad
>    performer in comparison to MPI?
>    Moreover, concerning your tests, Dr. Siegert, could you please show us
>    a bit more? I mean, for example, the scaling you observed through
>    increasing the number of cores.

As I said, I will post the full results in a few weeks, we are not quite
done yet.

>    I really don't immediately understand how OpenMP can perform so much
>    worse than MPI on an SMP machine, given the fact that not having to
>    communicate with the other 31 cores (in the CPMD case, all the huge
>    matrices that should be exchanged) should at least make things a bit
>    easier.

The way code is parallelized is different when using MPI and OpenMP.
OpenMP code usually uses a loop level parallelization, i.e., this is
very fine grained. In the cases that I have mentioned MPI uses domain
decomposition - very coarse grained.
All I am saying is that for the codes that I have seen the
MPI way of parallelizing the code is by far more efficient than
the OpenMP way. That does not mean that you could not use domain
decomposition with OpenMP, it just appears that that is not done
usually. I am speculating that that may be a consequence of the
(perceived?) easier way of programming using OpenMP. If you do
domain decomposition you probably end up writing code that looks very
similar to the MPI code even if you use OpenMP.
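
The difference between the two styles can be sketched on a toy 1-D
smoothing sweep. This is a serial illustration only, not code from CPMD
or WRF; all function names are hypothetical, and the "halo" values that
an MPI code would receive in messages are simply read from the array:

```python
def smooth_global(u):
    """One Jacobi-style sweep over the whole 1-D array.

    Interior points become the average of their two neighbours; the two
    end points are held fixed. In an OpenMP code this single loop is what
    would typically get a loop-level parallel directive.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def split_domain(n, parts):
    """Split indices 0..n-1 into `parts` contiguous (start, stop) ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def smooth_decomposed(u, parts):
    """The same sweep, organised so that each worker owns one subdomain.

    Each worker needs only its own block plus one halo value from each
    neighbouring block. Here the halos are read directly from `u` to keep
    the sketch serial; in MPI they would be exchanged as messages.
    """
    n = len(u)
    new = list(u)
    for start, stop in split_domain(n, parts):
        left_halo = u[start - 1] if start > 0 else None
        right_halo = u[stop] if stop < n else None
        block = u[start:stop]
        for j, i in enumerate(range(start, stop)):
            if i == 0 or i == n - 1:
                continue  # global boundary points stay fixed
            left = block[j - 1] if j > 0 else left_halo
            right = block[j + 1] if j < len(block) - 1 else right_halo
            new[i] = 0.5 * (left + right)
    return new


if __name__ == "__main__":
    u = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
    print(smooth_global(u))
    print(smooth_decomposed(u, 3))
```

Both versions compute the same result. The decomposed one makes the
data each worker owns, and the little it needs from its neighbours,
explicit, which is why it ends up looking like message-passing code
whichever API is underneath.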

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
Academic Computing Services                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

>    But all that from the eyes and view of an young "amadorist".  ;)
>    Thanks a lot in advance,
>    Jones
> 
>    On Dec 8, 2007 5:55 PM, Martin Siegert <[1] siegert at sfu.ca> wrote:
> 
>      Over the last months I have done quite a bit of benchmarking of
>      applications.  One  of  the  aspects  we  are  interested in is the
>      performance
>      of  applications  that  are  available  in  MPI,  OpenMP and hybrid
>      versions.
>      So  far  we  looked  at WRF and CPMD; we'll probably look at POP as
>      well.
>      MPI vs. OpenMP on a SMP (64 core Power5):
>      walltime for cpmd benchmark on 32 cores:
>      MPI: 93.13s   OpenMP: 446.86s
>      Results for WRF on the same platform are similar.
>      In  short:  the performance of OpenMP code isn't even close to that
>      of the
>      MPI code.
>      We also looked at the hybrid version of these codes on clusters.
>      The difference in run times are in the 1% range - less than the
>      accuracy of the measurement.
>      Thus,  if  you have the choice, why would you even look at anything
>      other
>      than MPI? Even if the programming effort for OpenMP is lower,
>      the performance penalty is huge.
>      That's my conclusion drawn from the cases we've looked at.
>      If anybody knows of applications where the OpenMP performance comes
>      close
>      to  the  MPI  performance  and  of  applications  where  the hybrid
>      performance
>      is significantly better than the pure MPI performance, then I would
>      love to hear from you. Thanks!
>      Cheers,
>      Martin
>      --
>      Martin Siegert
>      Head, Research Computing
>      WestGrid Site Lead
>      Academic Computing Services                phone: 778 782-4691
>      Simon Fraser University                    fax:   778 782-4242
>      Burnaby, British Columbia                  email: [2]siegert at sfu.ca
>      Canada  V5A 1S6
>      On Fri, Dec 07, 2007 at 10:56:14PM -0600, Gerry Creager wrote:
>      >  WRF has been under development for 10 years.  It's got an OpenMP
>      flavor,
>      > an MPI flavor and a hybrid one.  We still don't have all the bugs
>      worked
>      >  out  of  the hybrid so that it can handle large, high resolution
>      domains
>      > without being slower than the MPI version.  And, yeah, the OpenMP
>      geeks
>      > working on this... and the MPI folks, are good.
>      >
>      >  Hybrid  isn't  easy and isn't always foolproof.  And, as another
>      thought,
>      > OpenMP isn't always the best solution to the problem.
>      >
>      > gerry
>      >
>      > [3]richard.walsh at comcast.net wrote:
>      > > -------------- Original message ----------------------
>      > >From: Toon Knapen <[4] toon.knapen at gmail.com>
>      > >>Greg Lindahl wrote:
>      >  >>>In  real  life  (i.e. not HPC), everyone uses message passing
>      between
>      > >>>nodes.  So I don't see what you're getting at.
>      > >>>
>      >  >>Many on this list suggest that using multiple MPI-processes on
>      one and
>      > >>the same node is superior to MT approaches IIUC. However I have
>      the
>      > >>impression that almost the whole industry is looking into MT to
>      benefit
>      >  >>from  multi-core without even considering message-passing. Why
>      is that so?
>      > >
>      >  >I  think  what Greg and others are really saying is that if you
>      have to use
>      > >a distributed memory
>      > >model (MPI) as a first order response to meet your scalability
>      > >requirements, then
>      >  >the  extra  coding  effort  and complexity required to create a
>      hybrid code
>      > >may not be
>      >  >a  good performance return on your investment.  If on the other
>      hand you
>      > >only
>      > >need to scale within a singe SMP node (with cores and sockets on
>      a single
>      >  >board  growing in number, this returns more performance than in
>      the past),
>      > >then you
>      >  >may  be  able to avoid using MPI and chose a simpler model like
>      OpenMP.  If
>      > >you
>      >  >have  already  written  an efficient MPI code,  then (with some
>      exceptions)
>      >  >the  performance-gain  divided  by the hybrid coding-effort may
>      seem small.
>      > >
>      > >Development in an SMP environment is easier.  I know of a number
>      of sights
>      > >that work this way.  The experienced algorithm folks work up the
>      code in
>      >  >OpenMP  on  say  an  SGI  Altix  or Power6 SMP, then they get a
>      dedicated MPI
>      >  >coding  expert  to  convert  it  later  for scalable production
>      operation on a
>      > >cluster.
>      >  >In  this situation, they do end up with hybrid versions in some
>      cases.  In
>      > >non-HPC
>      >  >or smaller workgroup contexts your production code may not need
>      to be
>      > >converted.
>      > >
>      > >Cheers,
>      > >
>      > >rbw
>      > >
>      > >--
>      > >
>      > >"Making predictions is hard, especially about the future."
>      > >
>      > >Niels Bohr
>      > >
>      > >--
>      > >
>      > >Richard Walsh
>      > >Thrashing River Consulting--
>      > >5605 Alameda St.
>      > >Shoreview, MN 55126
>      > >
>      > >Phone #: 612-382-4620
>      > >
>      >
>      > --
>      > Gerry Creager -- [7]gerry.creager at tamu.edu
>      > Texas Mesonet -- AATLT, Texas A&M University
>      > Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
>      >  Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX
>      77843
>      --
>      Martin Siegert
>      Head, Research Computing
>      WestGrid Site Lead
>      Academic Computing Services                phone: 778 782-4691
>      Simon Fraser University                    fax:   778 782-4242
>      Burnaby,   British   Columbia                                email:
>      [10]siegert at sfu.ca
>      Canada  V5A 1S6
> 
> References
> 
>    1. mailto:siegert at sfu.ca
>    2. mailto:siegert at sfu.ca
>    3. mailto:richard.walsh at comcast.net
>    4. mailto:toon.knapen at gmail.com
>    5. mailto:Beowulf at beowulf.org
>    6. http://www.beowulf.org/mailman/listinfo/beowulf
>    7. mailto:gerry.creager at tamu.edu
>    8. mailto:Beowulf at beowulf.org
>    9. http://www.beowulf.org/mailman/listinfo/beowulf
>   10. mailto:siegert at sfu.ca
>   11. mailto:Beowulf at beowulf.org
>   12. http://www.beowulf.org/mailman/listinfo/beowulf



From peter.st.john at gmail.com  Tue Dec 11 08:13:06 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Tue, 11 Dec 2007 11:13:06 -0500
Subject: [Beowulf] WSEAS, Call for Papers
In-Reply-To: <5e0aeeab0712081046g7c4b0ac8t8401b6e7a80007f7@mail.gmail.com>
References: <5e0aeeab0712081046g7c4b0ac8t8401b6e7a80007f7@mail.gmail.com>
Message-ID: <e4d4fd070712110813j59f4cc12n62dc9ec65d2e0771@mail.gmail.com>

I just recreated an article "WSEAS" at the English language Wikipedia, as I
had had to hunt around some to find out what WSEAS is. It turns out that
there had been an article previously, but it was deleted for "copyright
violation". Almost always that means someone from an organization created an
article by copying from the organization's web page; and usually the
organization's web page is more like advertising and self-promotion, than an
encyclopedia entry. This distinction is very difficult to explain to PR
goons. We'll try and do better with the WSEAS piece but unless RGB is going
to Greece, or something, I just won't fight for it. If the PR folks come and
puff it up, and then some editor who just doesn't care deletes it again, it
will just be business as usual and I'll swallow my pride.

Sorry for the off-topic rant, but I'm a great believer in the utility of
wiki -- it's peer-reviewed, just a new definition of "peer" and "review"
:-)  and I wish everyone would drop in to contribute bits and pieces from
time to time.

The WSEAS article is just a stub, http://en.wikipedia.org/wiki/WSEAS ;
please note the "Talk Page" (each article has at least two tabs, one for the
article, one for discussion about the article) where I try to warn others
about the recent deletion. But the PR goons don't even know that talk pages
exist.  I'd be more than happy to answer any questions, or review any
additions to the article. Please believe me that indulging the
organization's good mission just gets the article deleted, which does nobody
any good. If you have something to add to the article I suggest putting it
in the Talk page first, to get feedback from others.

Wiki questions, and requests for me to Watchlist technical articles, can be
dropped at my own talk page, see
http://en.wikipedia.org/wiki/User:PeterStJohn (click on the Discussion tab,
then the Edit This Page tab to add something to my Talk page.)

Peter


On Dec 8, 2007 1:46 PM, Nikos Mastorakis <wseas.headquarters at gmail.com>
wrote:

>
>  CALL FOR PAPERS
> =================
>
> The annual convention and gathering of all the WSEAS entities (Working
> Groups, Technical Committees, Editors,  Associate Editors, Research
> Directors, Projects Coordinators,etc...) is held in July during the
> CSCC.
> http://www.wseas.org/conferences/2008/greece/
> ...
>

From tjrc at sanger.ac.uk  Tue Dec 11 08:22:24 2007
From: tjrc at sanger.ac.uk (Tim Cutts)
Date: Tue, 11 Dec 2007 16:22:24 +0000
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071211101907.9OFM9.11486.root@fepweb09>
References: <20071211101907.9OFM9.11486.root@fepweb09>
Message-ID: <469CFAA8-9C0A-48DD-9CA3-E5FB02BCDA5D@sanger.ac.uk>


On 11 Dec 2007, at 3:19 pm, <laytonjb at charter.net> wrote:

> ---- Mark Hahn <hahn at mcmaster.ca> wrote:
>> there's a company, ScaleMP, which seems to be selling some kind of
>> kit which enables to fairly large shared-memory x86_64 systems.
>> their website is nearly useless (http://www.scalemp.com/), but a  
>> little
>> more info can be had from SGI, which apparently uses ScaleMP for  
>> their
>> f1200 product (rebadged Ciara?).
>>
>>
>> SGI seems to aim it purely at structural/cfd/crash sims -
>> mainly using Abaqus and related tools.
>
> Abaqus is now MPI capable (the first of the implicit FEM codes that  
> I know
> of). So ScaleMP isn't needed for the newer version of Abaqus.

I've been quite curious to try something like the f1200 as a potential  
replacement for our Altixes, which were bought predominantly for  
running single-threaded large-memory jobs.

Tim


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From landman at scalableinformatics.com  Tue Dec 11 08:53:18 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Dec 2007 11:53:18 -0500
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <469CFAA8-9C0A-48DD-9CA3-E5FB02BCDA5D@sanger.ac.uk>
References: <20071211101907.9OFM9.11486.root@fepweb09>
	<469CFAA8-9C0A-48DD-9CA3-E5FB02BCDA5D@sanger.ac.uk>
Message-ID: <475EC07E.6040905@scalableinformatics.com>

Tim Cutts wrote:

> I've been quite curious to try something like the f1200 as a potential 
> replacement for our Altixes, which were bought predominantly for running 
> single-threaded large-memory jobs.

It is fairly easy (barring cost issues) to get a single system image 
machine with 8-16 processor cores and 128 GB ram.  Beyond that, you need 
something like ScaleMP or a "proprietary" box to get more RAM.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From tjrc at sanger.ac.uk  Tue Dec 11 08:57:59 2007
From: tjrc at sanger.ac.uk (Tim Cutts)
Date: Tue, 11 Dec 2007 16:57:59 +0000
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <475EC07E.6040905@scalableinformatics.com>
References: <20071211101907.9OFM9.11486.root@fepweb09>
	<469CFAA8-9C0A-48DD-9CA3-E5FB02BCDA5D@sanger.ac.uk>
	<475EC07E.6040905@scalableinformatics.com>
Message-ID: <E5392389-D7E9-4C38-AECE-CFB431E4187C@sanger.ac.uk>


On 11 Dec 2007, at 4:53 pm, Joe Landman wrote:

> Tim Cutts wrote:
>
>> I've been quite curious to try something like the f1200 as a  
>> potential replacement for our Altixes, which were bought  
>> predominantly for running single-threaded large-memory jobs.
>
> It is fairly easy (barring cost issues) to get a single system image  
> machine with 8-16 processor cores and 128 GB ram.  Beyond that, you  
> need something like ScaleMP or a "proprietary" box to get more RAM.
>

Precisely.  Currently we have two machines, each with 192 GB RAM (one  
has four CPUs, the other has 16), which are nearing the end of their  
life.  The f1200 looks attractive partly because it can provide a  
similar size machine, and partly because it's no longer Itanium (which  
will remove a large body of software maintenance headache, at least  
for us - our Altixes are to all intents and purposes our only
Itanium machines).

Tim




From hahn at mcmaster.ca  Tue Dec 11 08:59:53 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 11 Dec 2007 11:59:53 -0500 (EST)
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071211102146.UJ68P.11651.root@fepweb09>
References: <20071211102146.UJ68P.11651.root@fepweb09>
Message-ID: <Pine.LNX.4.64.0712111155490.24221@coffee.psychology.mcmaster.ca>

> Flextronics was showing a small cluster where they had 4 boxes connected
> by IB and within each box they had 4 systems connected by IB. They were
> running ScaleMP on it. They had a graph of running Stream on top of the
> system. They were plotting bandwidth vs. number of cores and it was fairly
> linear (I didn't get a close look at it).

but stream is embarrassingly parallel, so even if their interconnect was 
wet string, it should scale perfectly with number of nodes.  (well, 
start and end-of-loop synchronization probably doesn't work well with 
wet string, but that just means you crank up the array size ;)
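
To see why, here is a serial sketch of a STREAM-style triad split
across nodes. It is not the actual STREAM benchmark source; the function
names and the even block split are assumptions for illustration:

```python
def triad(b, c, q):
    """STREAM-style triad kernel: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def triad_by_node(b, c, q, nodes):
    """Run the triad as if each node owned one contiguous slice.

    Every output element depends only on the matching input elements, so
    a node never needs data held by another node: no halo, no messages.
    """
    n = len(b)
    out = []
    for k in range(nodes):
        start = k * n // nodes
        stop = (k + 1) * n // nodes
        out.extend(triad(b[start:stop], c[start:stop], q))
    return out


if __name__ == "__main__":
    b = [1.0, 2.0, 3.0, 4.0, 5.0]
    c = [10.0, 20.0, 30.0, 40.0, 50.0]
    print(triad(b, c, 3.0))
    print(triad_by_node(b, c, 3.0, 2))
```

Since the per-node work never touches remote data, the aggregate
bandwidth is just the per-node bandwidth times the node count, and the
interconnect only matters at the start and end of the loop.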

does anyone know how the coherency actually works?  without a full-fledged
memory proxy (as SGI has in their NUMAlink machines, or as in the Newisys
Horus), it seems like this approach is going to spend a lot of time twiddling
the MMU and taking page faults.


From dnlombar at ichips.intel.com  Tue Dec 11 09:16:18 2007
From: dnlombar at ichips.intel.com (Lombard, David N)
Date: Tue, 11 Dec 2007 09:16:18 -0800
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071211101907.9OFM9.11486.root@fepweb09>
References: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
	<20071211101907.9OFM9.11486.root@fepweb09>
Message-ID: <20071211171618.GA10963@nlxdcldnl2.cl.intel.com>

On Tue, Dec 11, 2007 at 07:19:07AM -0800, laytonjb at charter.net wrote:
> ---- Mark Hahn <hahn at mcmaster.ca> wrote: 
> > 
> > SGI seems to aim it purely at structural/cfd/crash sims - 
> > mainly using Abaqus and related tools.
> 
> Abaqus is now MPI capable (the first of the implicit FEM codes that I know
> of). So ScaleMP isn't needed for the newer version of Abaqus.

MSC.Nastran has provided MPI capabilities for a very long time now; clearly,
NX Nastran is also MPI capable.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


From hahn at mcmaster.ca  Tue Dec 11 09:50:03 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 11 Dec 2007 12:50:03 -0500 (EST)
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <475EC07E.6040905@scalableinformatics.com>
References: <20071211101907.9OFM9.11486.root@fepweb09>
	<469CFAA8-9C0A-48DD-9CA3-E5FB02BCDA5D@sanger.ac.uk>
	<475EC07E.6040905@scalableinformatics.com>
Message-ID: <Pine.LNX.4.64.0712111200130.24221@coffee.psychology.mcmaster.ca>

>> I've been quite curious to try something like the f1200 as a potential 
>> replacement for our Altixes, which were bought predominantly for running 
>> single-threaded large-memory jobs.

we have an Altix as well, and I always cringe when I see a single-thread,
large-memory job running on it.  ours has 128p, 256G, and I think 6M/core caches.
so a large-mem serial job, assuming uniform memory access, would have a 
cache hit rate of .00002289.  and in any case, there is >800 GB/s of memory
bandwidth available, but at best 6.4 GB/s in use.  don't forget that the it2
is a fairly strict in-order chip, as well.

sure, perhaps a large-memory serial code has a small working set that 
fits in cache.  but doesn't it strike you as strange to have a 
working set that's 1/40000 of the total footprint?  I suspect that you 
could reformulate such a code as a "memory-extension" MPI job and avoid
the need for custom hardware.  (ie, let rank0 do all the work, and just 
operate a software cache of data fed by all the other ranks.  of course,
this begs the question of whether the code _has_ to be serial...)
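
A minimal sketch of the "memory-extension" idea, with the MPI part simulated
in a single process so it runs anywhere. RemoteRank and SoftwareCache are
hypothetical names; in a real job fetch() and store() would be message
exchanges with other ranks, and the page size and cache capacity here are
deliberately tiny.

```python
from collections import OrderedDict

PAGE = 4  # elements per page; tiny so the example is easy to follow

class RemoteRank:
    """Stand-in for an MPI rank that only stores and serves pages."""
    def __init__(self):
        self.pages = {}

    def fetch(self, page_no):
        return list(self.pages.get(page_no, [0] * PAGE))

    def store(self, page_no, data):
        self.pages[page_no] = list(data)

class SoftwareCache:
    """LRU page cache held by the single working rank ("rank0")."""
    def __init__(self, ranks, capacity):
        self.ranks = ranks
        self.capacity = capacity
        self.cache = OrderedDict()   # page_no -> list of values
        self.misses = 0

    def _owner(self, page_no):
        return self.ranks[page_no % len(self.ranks)]  # round-robin placement

    def _page(self, page_no):
        if page_no in self.cache:
            self.cache.move_to_end(page_no)           # mark as recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:      # evict least recent page
                old_no, old = self.cache.popitem(last=False)
                self._owner(old_no).store(old_no, old)  # always write back
            self.cache[page_no] = self._owner(page_no).fetch(page_no)
        return self.cache[page_no]

    def read(self, index):
        return self._page(index // PAGE)[index % PAGE]

    def write(self, index, value):
        self._page(index // PAGE)[index % PAGE] = value

ranks = [RemoteRank() for _ in range(3)]
mem = SoftwareCache(ranks, capacity=2)
for i in range(40):
    mem.write(i, i * i)
total = sum(mem.read(i) for i in range(40))
print(total, mem.misses)
```

Every miss here stands for a network round trip, so the miss counter is the
number to watch: with a working set that really does fit in the local cache
it stays small, and with uniform access it does not.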

> It is fairly easy (barring cost issues) to get a single system image machine 
> with 8-16 processor cores and 128 GB ram.  Beyond that, you need something 
> like ScaleMP or a "proprietary" box to get more RAM.

I'm guessing ScaleMP is approximately the same speed as a user-level 
network-shared-memory implementation, but would love to see real numbers.

regards, mark hahn.


From jlb17 at duke.edu  Tue Dec 11 09:57:00 2007
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Tue, 11 Dec 2007 12:57:00 -0500 (EST)
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071211101907.9OFM9.11486.root@fepweb09>
References: <20071211101907.9OFM9.11486.root@fepweb09>
Message-ID: <alpine.LRH.0.99999.0712111255330.7254@hogwarts.egr.duke.edu>

On Tue, 11 Dec 2007 at 7:19am, laytonjb at charter.net wrote

> Abaqus is now MPI capable (the first of the implicit FEM codes that I know
> of). So ScaleMP isn't needed for the newer version of Abaqus.

The implicit mode of LS-DYNA is also MPI capable, as of ls971 (i.e., a 
year or so ago, IIRC).

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


From apittman at concurrent-thinking.com  Tue Dec 11 10:21:21 2007
From: apittman at concurrent-thinking.com (Ashley Pittman)
Date: Tue, 11 Dec 2007 18:21:21 +0000
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
Message-ID: <1197397281.12303.16.camel@bruce.priv.wark.uk.streamline-computing.com>


On Tue, 2007-12-11 at 09:28 -0500, Mark Hahn wrote:
> there's a company, ScaleMP, which seems to be selling some kind of 
> kit which enables fairly large shared-memory x86_64 systems.
> their website is nearly useless (http://www.scalemp.com/), but a little
> more info can be had from SGI, which apparently uses ScaleMP for their
> f1200 product (rebadged Ciara?).

This reminds me of a talk at the Machine Evaluation Workshop a couple of
weeks ago by a company called "workstations uk".  They don't appear to
have a working website but also have a product called f1200 so I assume
they are related somehow.

The talk is on-line although I'll admit it was about half way through
before I understood what they were talking about.

http://www.cse.scitech.ac.uk/disco/mew18/Presentations/Day2/7th_Session/RobinHarker.pdf

Ashley,


From James.P.Lux at jpl.nasa.gov  Tue Dec 11 10:30:25 2007
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Tue, 11 Dec 2007 10:30:25 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>
Message-ID: <6.2.3.4.2.20071211102608.02d635f0@mail.jpl.nasa.gov>

At 05:17 AM 12/11/2007, Douglas Eadline wrote:
>This is indeed the issue. Where to invest time?
>
>My opinion, and it is only my opinion, is the following.
>Please share your own.
>
>Threaded approaches do not scale across clusters. The memory
>architecture of multi-core is making nodes look more like
>small clusters i.e. memory is becoming more localized.
>As Don Becker mentioned in a recent post, efforts to program
>distributed memory like it were shared memory often end
>up looking like stylized message passing systems.
>
>One other thing about messages. The problem of
>trying to optimize the compute to communication issue is
>easier than trying to optimize the compute to locality
>issue.
>
>Therefore, if I were to start a new parallel project of some sort
>or parallelize an existing code, I would use MPI. Although
>OpenMP might get me up and running quicker, I would feel more
>comfortable with a problem cast in MPI.
>
>I'm interested in others' opinions on this because I think it
>is an important issue for the general programming audience
>and not just us cluster geeks. The difference is we have had
>a lot more time and experience with this stuff.
>
>--
>Doug


Another huge advantage of going to a message passing paradigm is that 
it forces you to explicitly deal with the time synchronization (or 
lack thereof) among processes in that an underlying assumption is 
that passing the message takes non-zero time.   Therefore, in any 
message passing system, there's not necessarily any concept of 
"absolute time" among all processes.  (You have to pass time 
messages, just like any other).

As the propagation delay (light time) among processors gets to be a 
significant fraction of the message length, this is a bigger and bigger deal.

For myself, this is an issue because I work with systems that are 
distributed over huge distances (where light time is seconds or 
minutes and it varies), but it also applies on a finer grain where 
you have delays in the communications paths in the 
microseconds/milliseconds scale, especially if they are variable and 
non-deterministic.  (NTP, for instance, assumes that the delays are 
deterministic in the long term sense, even if there's a lot of short 
term variability)
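
To make the point about passing time messages concrete, here is a sketch of
the standard four-timestamp exchange that NTP-style protocols use to estimate
clock offset and round-trip delay. The function name and the numbers (a 5 s
offset, 0.1 s and 0.3 s path delays, a 0.01 s server hold time) are
hypothetical. The offset estimate is only right when the delay is the same in
both directions, which is the assumption that breaks when path delays vary.

```python
def offset_and_delay(t0, t1, t2, t3):
    """t0: client send, t1: server receive, t2: server send, t3: client receive.
    t0 and t3 are read from the client clock, t1 and t2 from the server clock."""
    delay = (t3 - t0) - (t2 - t1)            # round trip minus server hold time
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # assumes equal delay each way
    return offset, delay

true_offset = 5.0   # server clock is 5 s ahead of the client (hypothetical)
hold = 0.01         # time the server holds the message before replying
t0 = 100.0

# Symmetric path: 0.1 s out, 0.1 s back.
t1 = t0 + 0.1 + true_offset
t2 = t1 + hold
t3 = t0 + 0.1 + hold + 0.1
sym_offset, sym_delay = offset_and_delay(t0, t1, t2, t3)

# Asymmetric path: 0.1 s out, 0.3 s back.
t1a = t0 + 0.1 + true_offset
t2a = t1a + hold
t3a = t0 + 0.1 + hold + 0.3
asym_offset, asym_delay = offset_and_delay(t0, t1a, t2a, t3a)

print(f"symmetric : offset {sym_offset:.3f}, delay {sym_delay:.3f}")
print(f"asymmetric: offset {asym_offset:.3f}, delay {asym_delay:.3f}")
```

In the asymmetric case the estimate is off by half the difference between the
two one-way delays, and nothing in the four timestamps reveals that.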

Jim Lux


From lindahl at pbm.com  Tue Dec 11 12:06:08 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 11 Dec 2007 12:06:08 -0800
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.64.0712100920250.12457@coffee.psychology.mcmaster.ca>
Message-ID: <20071211200608.GC6379@bx9.net>

On Tue, Dec 11, 2007 at 09:28:17AM -0500, Mark Hahn wrote:

> there's a company, ScaleMP, which seems to be selling some kind of 
> kit which enables fairly large shared-memory x86_64 systems.

As far as I can tell, they are just another software distributed
shared memory company. Which has been proven to not work well 50 times
already.

-- greg


From toon.knapen at gmail.com  Tue Dec 11 12:20:55 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Tue, 11 Dec 2007 21:20:55 +0100
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071211101907.9OFM9.11486.root@fepweb09>
References: <20071211101907.9OFM9.11486.root@fepweb09>
Message-ID: <475EF127.8020303@gmail.com>

laytonjb at charter.net wrote:
> Abaqus is now MPI capable (the first of the implicit FEM codes that I know
> of).


I'm happy to correct you: Actran (http://www.fft.be/?id=10) has been 
MPI capable since 2003. Actran is also an implicit FEM code, 
but focused on aero-acoustics and vibro-acoustics (the latter distributed by 
MSC)

toon
(former employee of FFT working on parallelisation ;-)


From lindahl at pbm.com  Tue Dec 11 14:27:26 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 11 Dec 2007 14:27:26 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475E960B.4070509@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
Message-ID: <20071211222726.GC9072@bx9.net>

On Tue, Dec 11, 2007 at 08:52:11AM -0500, Joe Landman wrote:

> On the contrary, it is precisely because people are asking "how should I 
> parallelize" that they need to ask the basic question of "where does my 
> code spend time for my problems."

OK, so say I have a garden-variety finite-difference code. I know how
to use OpenMP to parallelize all the loops, and how to use halo
exchange with MPI to parallelize it. And maybe I'm even clever enough
to know how to use halo exchange with OpenMP, which is pretty ugly
code, but has better locality and scalability.

Before I've coded all 3 up, how does that help me pick which method to
use? It doesn't. And that's the question people are asking: should I
spend the time to implement X or Y in the hopes it'll be faster than
my existing code Z?
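
For readers who have not seen halo exchange, here is a small sketch of the
idea on a 1D three-point stencil. It runs in one process: the "exchange" is a
pair of list assignments where an MPI code would use send/receive between
neighbouring ranks. The function names, the 0.25 coefficient and the grid are
illustrative assumptions, not taken from any code discussed in this thread.

```python
def step(u):
    """One explicit 3-point update; the two end values are held fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + 0.25 * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

def serial(u, steps):
    for _ in range(steps):
        u = step(u)
    return u

def decomposed(u, steps, parts):
    """Split the interior across `parts` owners, one halo cell on each side."""
    n = len(u)
    bounds = [1 + (n - 2) * p // parts for p in range(parts + 1)]
    # Each local array is [left halo] + owned cells + [right halo].
    local = [u[bounds[p] - 1: bounds[p + 1] + 1] for p in range(parts)]
    for _ in range(steps):
        local = [step(chunk) for chunk in local]   # halos act as fixed ends
        for p in range(parts - 1):                 # halo exchange
            local[p][-1] = local[p + 1][1]         # neighbour's first owned cell
            local[p + 1][0] = local[p][-2]         # this owner's last owned cell
    result = [u[0]]
    for chunk in local:
        result.extend(chunk[1:-1])                 # drop halos when gathering
    result.append(u[-1])
    return result

u0 = [0.0] * 5 + [1.0] * 2 + [0.0] * 5
reference = serial(u0, 10)
split = decomposed(u0, 10, 3)
print(split == reference)
```

The decomposed version does the same arithmetic per cell as the serial one;
what changes is who owns which cells and when the edge values are copied.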

-- greg


From landman at scalableinformatics.com  Tue Dec 11 15:07:36 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 11 Dec 2007 18:07:36 -0500
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071211222726.GC9072@bx9.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
Message-ID: <475F1838.8030305@scalableinformatics.com>

Greg Lindahl wrote:
> On Tue, Dec 11, 2007 at 08:52:11AM -0500, Joe Landman wrote:
> 
>> On the contrary, it is precisely because people are asking "how should I 
>> parallelize" that they need to ask the basic question of "where does my 
>> code spend time for my problems."
> 
> OK, so say I have a garden-variety finite-difference code. I know how

[... deletia ...]

Greg, you missed my point, entirely.  By a wide margin.

This is why I note that talking about MPI vs OpenMP and other 
pseudo-debates generates mostly heat and very little light.  Reminds me 
of editor battles, shell scripting battles ...

Bowing out of this part of the discussion, so no more heat is generated.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From Michael.Frese at NumerEx.com  Tue Dec 11 16:21:36 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Tue, 11 Dec 2007 17:21:36 -0700
Subject: [Beowulf] multi-threading vs. MPI
Message-ID: <6.2.5.6.2.20071211172009.04fc0100@NumerEx.com>

Thanks for the results, and the link.  In section 6.7 of the NAS 
Parallel Benchmark report on MPI (NPB 2.1 Results Report, NAS-95-010, 
PDF, 213 KB: 
<http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas-96-010.pdf>), 
I found a discussion of the clustered-SMP issues discussed so far in this 
thread.  It's interesting that these issues, discussed twelve years ago, 
are coming around again.  Plus ça change..., I suppose.

In addition, there is a table of results in that section for an SGI 
Power Challenge Array showing that idling processors on a given node 
and using more nodes improves the speed per processor across four 
different code kernels and two different problem sizes.  This doesn't 
tell us how a hybrid MP/MT application would work within a 4 core 2 
CPU node, but it does hint that memory contention can be just as 
nasty a problem as high latency message transmission.


Mike

At 12:52 PM 12/10/2007, you wrote:
>Some people had asked for more details:
>
>NAS suite version 3.2.1
>Test class was: B
>Units are Mops (Million operations per second)
>see the NAS docs for more information
>
>--
>Doug
>
>
> > I like answering these types of questions with numbers,
> > so in my Sept 2007 Linux magazine column (which should
> > be showing up on the website soon) I did the following.
> >
> > Downloaded the latest NAS benchmarks written in both
> > OpenMP and MPI. Ran them both on an 8 core Clovertown
> > (dual socket) system (multiple times) and reported
> > the following results:
> >
> > Test      OpenMP              MPI
> >        gcc/gfortran 4.2    LAM 7.1.2
> > ------------------------------------
> > CG         790.6             739.1
> > EP         166.5             162.8
> > FT        3535.9            2090.8
> > IS          51.1             122.5
> > LU        5620.5            5168.8
> > MG        1616.0            2046.2
> >
> > My conclusion, it was a draw of sorts.
> > The article was basically looking at the
> > lazy assumption that threads (OpenMP) are
> > always better than MPI on a SMP  machine.
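
One way to read the table above is as per-test ratios. The numbers in this
sketch are copied from the table (Mops, class B); the script only divides
them and counts which side comes out ahead on each kernel.

```python
# (OpenMP, MPI) in Mops, copied from the table above.
results = {
    "CG": (790.6, 739.1),
    "EP": (166.5, 162.8),
    "FT": (3535.9, 2090.8),
    "IS": (51.1, 122.5),
    "LU": (5620.5, 5168.8),
    "MG": (1616.0, 2046.2),
}

ratios = {name: omp / mpi for name, (omp, mpi) in results.items()}
omp_wins = sorted(name for name, r in ratios.items() if r > 1.0)
mpi_wins = sorted(name for name, r in ratios.items() if r < 1.0)

for name, r in ratios.items():
    print(f"{name}: OpenMP/MPI = {r:.2f}")
print("OpenMP ahead:", omp_wins)
print("MPI ahead:   ", mpi_wins)
```

OpenMP is ahead on four kernels and MPI on two, but the two largest gaps (FT
and IS) point in opposite directions, which fits the "draw of sorts" reading.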
> >
> > I'm going to re-run the tests using Harpertowns
> > real soon, maybe try other compilers and MPI
> > versions. It is easy to do. You can get the code here:
> >
> > http://www.nas.nasa.gov/Resources/Software/npb.html
> >
> > --
> > Doug
> >
> >> On this list there is almost unanimous agreement that MPI is the way to go
> >> for parallelism and that combining multi-threading (MT) and message-passing
> >> (MP) is not even worth it, just sticking to MP is all that is necessary.
> >>
> >> However, in real life most are talking about and investing in MT while very
> >> few are interested in MP. I also just read on the blog of Arch Robison "TBB
> >> perhaps gives up a little performance short of optimal so you don't have to
> >> write message-passing" (here:
> >> http://softwareblogs.intel.com/2007/11/17/supercomputing-07-computer-environment-and-evolution/
> >> )
> >>
> >> How come there is almost unanimous agreement in the beowulf community while
> >> the rest is almost unanimously convinced of the opposite? Are we just patting
> >> ourselves on the back or is MP not sufficiently disseminated or ... ?
> >>
> >> toon
> >>
> >>
> >>
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org
> >> To change your subscription (digest mode or unsubscribe) visit
> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >>
> >>
> >
> >
> > --
> > Doug
>
>
>--
>Doug

From z_anthrops at yahoo.co.uk  Mon Dec 10 20:01:05 2007
From: z_anthrops at yahoo.co.uk (Ali Zahoor)
Date: Tue, 11 Dec 2007 09:01:05 +0500
Subject: [Beowulf] Re: Beowulf Digest, Vol 46, Issue 12
Message-ID: <001301c83baa$7549d070$4901a8c0@itconsultant>

I am new to parallel computing but want to learn more and to
use this technology for my organization's needs. Can anybody guide me to
the better available resources?


Ali Zahoor


From hahn at mcmaster.ca  Tue Dec 11 20:06:40 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 11 Dec 2007 23:06:40 -0500 (EST)
Subject: [Beowulf] Re: Beowulf Digest, Vol 46, Issue 12
In-Reply-To: <001301c83baa$7549d070$4901a8c0@itconsultant>
References: <001301c83baa$7549d070$4901a8c0@itconsultant>
Message-ID: <Pine.LNX.4.64.0712112238050.32328@coffee.psychology.mcmaster.ca>

> I am new to parallel computing but want to learn more and to
> use this technology for my organization's needs. Can anybody guide me to
> the better available resources?

parallel computing has been around a long time; lots of good books out there.

but what you should first do is look at your organization.  are there
aspects of processing that take too long?  analyses that are naive or 
truncated because doing it properly takes too long?  the books can show
you solutions, but you have to find the problems...

regards, mark hahn.


From toon.knapen at gmail.com  Wed Dec 12 04:08:31 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 12 Dec 2007 13:08:31 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475F1838.8030305@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
	<475F1838.8030305@scalableinformatics.com>
Message-ID: <d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>

On 12/12/07, Joe Landman <landman at scalableinformatics.com> wrote:
>
> This is why I note that talking about MPI vs OpenMP and other
> pseudo-debates generates mostly heat and very little light.  Reminds me
> of editor battles, shell scripting battles ...


I agree that discussions like these easily degenerate.

That is actually one of the reasons why I'm looking for authoritative
documents discussing the differences between the two approaches. Such documents
could come in handy when discussing the parallelisation strategy
for a project, to bring the discussion forward in an objective
way.

From gerry.creager at tamu.edu  Wed Dec 12 05:25:13 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Wed, 12 Dec 2007 07:25:13 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<20071211222726.GC9072@bx9.net>	<475F1838.8030305@scalableinformatics.com>
	<d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
Message-ID: <475FE139.6060108@tamu.edu>

Debates and differences aside, often-times, this forum *is* an 
authoritative source of information.

gerry

Toon Knapen wrote:
> 
> 
> On 12/12/07, *Joe Landman* <landman at scalableinformatics.com 
> <mailto:landman at scalableinformatics.com>> wrote:
> 
>     This is why I note that talking about MPI vs OpenMP and other
>     pseudo-debates generates mostly heat and very little light.  Reminds me
>     of editor battles, shell scripting battles ...
> 
>  
>  
> I agree that discussions like these easily degenerate.
>  
> That is actually one of the reasons why I'm looking for authoritative 
> documents discussing the differences between the two approaches. Such 
> documents could come in handy when discussing the parallelisation 
> strategy for a project, to bring the discussion forward 
> in an objective way.
> 
> 

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


From tom.elken at qlogic.com  Wed Dec 12 09:35:38 2007
From: tom.elken at qlogic.com (Tom Elken)
Date: Wed, 12 Dec 2007 09:35:38 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	librarieswith MPI
In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01850590@AVEXCH1.qlogic.org>

 > -----Original Message-----
> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Tom Elken
> Sent: Thursday, November 29, 2007 11:27 AM
> 
> Have you used compiler auto-parallel features mixed with MPI with
> success on your clusters?
> 
> Have you used multi-threaded math or scientific libraries 
> mixed with MPI
> with success on your clusters?
> 
> If you just want to 'reply' to me only with simpler Yes/No answers, I
> will report on a summary of the results to this list and to 
> the SPEC HPG committee.

Results of the VERY non-scientific survey:

# reporting use of Autoparallel features with MPI:          0

# reporting use of multi-threaded math libraries with MPI:  1

The '1' was using multithreaded BLAS and MPI on HPL (a benchmark, not an
application) and his recollection was that it was not a win over pure
MPI.

I'll let you know later when we've resolved the discussion in the HPG
committee on how this might affect SPEC MPI2007.

But the discussion engendered by this post (and similar ones) was quite
entertaining and educational, with a lot of heat and some light.
Especially between my ex-SGI colleague, Joe Landman, and my ex-PathScale
colleague, Greg Lindahl.

Thanks,
Tom

> 
> If you have success or failure stories that might be useful to the
> Beowulf list, please 'reply-all'.  
> 
> Thanks,
> Tom Elken,
> member SPEC HPG committee
> -----------------------------
> *  For example, if an autoparallelizing compiler could find effective
> 4-way thread-level parallelism in an MPI code and you were running on a
> cluster of 8 nodes each with two quad-core CPUs, 64 cores total, you
> might choose to run with 16 MPI processes and set your NUM_THREADS
> variable to 4, to run with all 64 cores of the cluster executing work
> with reasonable efficiency. 
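
The layout in that footnote works out as follows. This is just the
arithmetic, with illustrative variable names; it says nothing about whether
such a run would actually be efficient.

```python
nodes = 8
sockets_per_node = 2
cores_per_socket = 4
threads_per_rank = 4   # the NUM_THREADS setting in the example

total_cores = nodes * sockets_per_node * cores_per_socket
mpi_ranks = total_cores // threads_per_rank
ranks_per_node = mpi_ranks // nodes

print(f"{total_cores} cores = {mpi_ranks} ranks x {threads_per_rank} threads")
print(f"{ranks_per_node} ranks per node, i.e. one per quad-core socket")
```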
> 


From examachine at gmail.com  Wed Dec 12 02:28:25 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Wed, 12 Dec 2007 12:28:25 +0200
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475DDC6F.3060007@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
Message-ID: <320e992a0712120228i60be834fo46a4ae1084ff0474@mail.gmail.com>

On Dec 11, 2007 2:40 AM, Joe Landman <landman at scalableinformatics.com> wrote:
> Our view has always been use what you are comfortable with, and what you
> need.  If you need to run across a cluster, use MPI.  If you need to run
> across a single large memory machine, use OpenMP.
>
> FWIW:  I would suggest learning both.  With the advent of many-core
> workstations, and accelerator systems with many many cores, programming
> these things is more likely to be mediated by a compiler (OpenMP like)
> than putting MPI stacks on the Cell SPUs (not enough local scratchpad
> ram for it).
>
> Just my $0.02, and I hope I generated light, and very little heat.

Thanks for your comments. I've seen that OpenMP is much easier to
develop with than MPI, which in my experience takes a lot of time to
program with due to its low-level nature and complicated side-effects.
I would in fact prefer to use implicit parallelism in a high-level
language (functional) to program complicated memory architectures.
Which doesn't exist in the way I imagine it due to the
short-sightedness of programming language people. I don't think that
the programmer should be too involved with the exact details of
caches, etc. It doesn't make too much sense to me. Back in the day, I
learned how to write assembly code that fits in some 256 byte code
cache. But how do you do that kind of optimization in a large scale
NUMA architecture, what if you move the code to a slightly different
architecture? The OpenMP model is superior to old ways of programming
shared memory systems (like the awful pthreads), but as the
architectures get more complicated I doubt that it will allow the
programmers to extract sufficient performance from those systems.

BTW, is OpenMP usable with the Cell processor on the Playstation 3? I wondered.

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://www.cs.bilkent.edu.tr/~erayo  Malfunct: http://myspace.com/malfunct
ai-philosophy: http://groups.yahoo.com/group/ai-philosophy


From arnoldg at ncsa.uiuc.edu  Wed Dec 12 06:37:57 2007
From: arnoldg at ncsa.uiuc.edu (Galen Arnold)
Date: Wed, 12 Dec 2007 08:37:57 -0600 (CST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475FE139.6060108@tamu.edu>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
	<475F1838.8030305@scalableinformatics.com>
	<d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
	<475FE139.6060108@tamu.edu>
Message-ID: <Pine.LNX.4.64.0712120827050.17563@osage.ncsa.uiuc.edu>


Gerry,

> Debates and differences aside, often-times, this forum *is* an authoritative 
> source of information.
>

Indeed it is.

By the way, we've got an old-ish course on multilevel 
parallel programming at ci-tutor.ncsa.uiuc.edu in case anybody wants to go 
there and see what people were thinking a couple years ago when they wrote 
it.

I've seen the benchmark speedups with mixed-mode as well.  With the right 
code, on a day with the wind to your back, threads can make good use of 
communication induced idle time on a node [yeah MPI supports overlap...
that's more difficult to achieve than it would appear, requiring-- 
excellent programming, most excellent MPI implementation].

By the way, if you've used Intel's mkl, you may have run with hybrid code 
already and not know it [do you know how your system sets OMP_NUM_THREADS 
?].
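
A quick way to check for accidental hybrid runs is to look at
OMP_NUM_THREADS and compare ranks times threads against the cores on a node.
This is a sketch with hypothetical helper names. It only inspects the
environment variable; when the variable is unset the thread count is
implementation-defined, so the helper reports "unknown" rather than guessing.

```python
import os

def threads_per_process(env=None):
    """Return OMP_NUM_THREADS as an int, or None if unset or unparseable."""
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "")
    first = raw.split(",")[0].strip()   # the value may be a comma-separated list
    return int(first) if first.isdigit() else None

def oversubscribed(ranks_per_node, cores_per_node, env=None):
    """True/False if known, None when the thread count cannot be determined."""
    threads = threads_per_process(env)
    if threads is None:
        return None                     # the runtime will pick its own default
    return ranks_per_node * threads > cores_per_node

print("OMP_NUM_THREADS on this machine:", threads_per_process())
print(oversubscribed(8, 8, {"OMP_NUM_THREADS": "4"}))   # 8 ranks x 4 threads
print(oversubscribed(2, 8, {"OMP_NUM_THREADS": "4"}))   # 2 ranks x 4 threads
```

If the first line prints None on a cluster node, a threaded library may be
choosing its own thread count for every MPI rank.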

-Galen


From toon at moene.indiv.nluug.nl  Wed Dec 12 10:46:49 2007
From: toon at moene.indiv.nluug.nl (Toon Moene)
Date: Wed, 12 Dec 2007 19:46:49 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071211222726.GC9072@bx9.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
Message-ID: <47602C99.5020802@moene.indiv.nluug.nl>

Greg Lindahl wrote:

> On Tue, Dec 11, 2007 at 08:52:11AM -0500, Joe Landman wrote:
> 
>> On the contrary, it is precisely because people are asking "how should I 
>> parallelize" that they need to ask the basic question of "where does my 
>> code spend time for my problems."
> 
> OK, so say I have a garden-variety finite-difference code. I know how
> to use OpenMP to parallelize all the loops,

Well, our weather forecasting code is certainly garden-variety 
finite-difference code (we don't even use multi-grids), but I recently 
looked into the OpenMP parallelization (done by people who spent much, 
much more time on looking into performance issues than I did) and I 
noticed only a few loops were parallelized.

As atmospheric movement on Earth is (for weather forecasting purposes, 
i.e. on time scales of days) a primarily two-dimensional phenomenon, 
the parallelized loops are:

1. Over the vertical layers (while all loops over horizontal boxes are
    left alone).

2. (In a different part of the code) over tasks computing vertical
    phenomena in a set of columns.

In other words, the parallelization is pushed outwards as far as 
possible - the majority of the loops don't even know there's more than 
one processor in the machine.
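
The structure described above, sketched in miniature: the only parallel
construct is the map over vertical layers, and the horizontal loops inside
each layer are plain serial code. The smoothing kernel, grid sizes and
function names are invented for illustration and have nothing to do with the
actual forecasting code. In CPython the threads will not make this faster;
the point is where the parallelism sits, not the speed.

```python
from concurrent.futures import ThreadPoolExecutor

def smooth_layer(layer):
    """Serial horizontal work on one layer: a simple 1-2-1 smoothing in x."""
    out = []
    for row in layer:
        new_row = list(row)
        for i in range(1, len(row) - 1):
            new_row[i] = 0.25 * row[i - 1] + 0.5 * row[i] + 0.25 * row[i + 1]
        out.append(new_row)
    return out

def smooth_all(layers, workers=4):
    """The only parallel loop: one task per vertical layer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(smooth_layer, layers))

# 8 layers of 5 x 6 points each, filled with arbitrary values.
layers = [[[float((k * 7 + j * 3 + i * i) % 11) for i in range(6)]
           for j in range(5)] for k in range(8)]

parallel_result = smooth_all(layers)
serial_result = [smooth_layer(layer) for layer in layers]
print(parallel_result == serial_result)
```

Because smooth_layer knows nothing about threads, the same function serves
the serial and the parallel run unchanged.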

-- 
Toon Moene - e-mail: toon at moene.indiv.nluug.nl - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.indiv.nluug.nl/~toon/
GNU Fortran's path to Fortran 2003: http://gcc.gnu.org/wiki/Fortran2003


From toon.knapen at gmail.com  Wed Dec 12 12:20:18 2007
From: toon.knapen at gmail.com (Toon Knapen)
Date: Wed, 12 Dec 2007 21:20:18 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475EAF89.3050209@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>	<475EA82E.2030101@gmail.com>
	<475EAF89.3050209@scalableinformatics.com>
Message-ID: <47604282.2080504@gmail.com>

Joe Landman wrote:

> Large cluster programming will always need an MPI or MPI-like system. 
> Small SMP programming might have easier to use alternatives that are 
> "good enough".  That "good enough" factor is not one to be discounted 
> lightly, you do so at your own peril.


But what would be the advantages/disadvantages of both these 
technologies if one starts a new project now knowing that it will have 
to run in parallel and knowing that multi-core processors are 
increasingly NUMA?

toon


From gdjacobs at gmail.com  Wed Dec 12 12:35:56 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Wed, 12 Dec 2007 14:35:56 -0600
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <47604282.2080504@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<60349.192.168.1.1.1197379039.squirrel@mail.eadline.org>	<475EA82E.2030101@gmail.com>
	<475EAF89.3050209@scalableinformatics.com>
	<47604282.2080504@gmail.com>
Message-ID: <4760462C.8010601@gmail.com>

Toon Knapen wrote:
> Joe Landman wrote:
> 
>> Large cluster programming will always need an MPI or MPI-like system.
>> Small SMP programming might have easier to use alternatives that are
>> "good enough".  That "good enough" factor is not one to be discounted
>> lightly, you do so at your own peril.
> 
> 
> But what would be the advantages/disadvantages of both these
> technologies if one starts a new project now knowing that it will have
> to run in parallel and knowing that multi-core processors are
> increasingly NUMA?

Here are the advantages and disadvantages:
OpenMP is easy, but only works on one computer.
MPI requires more effort up front, but works across many computers.

In both cases, some careful thought has to be applied in partitioning
the problem to achieve good speedup, and even then the results can be mixed.

-- 
Geoffrey D. Jacobs


From rgb at phy.duke.edu  Wed Dec 12 14:28:37 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 12 Dec 2007 17:28:37 -0500 (EST)
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <475FE139.6060108@tamu.edu>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
	<475F1838.8030305@scalableinformatics.com>
	<d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
	<475FE139.6060108@tamu.edu>
Message-ID: <Pine.LNX.4.64.0712121702070.12612@lilith.rgb.private.net>

On Wed, 12 Dec 2007, Gerry Creager wrote:

> Debates and differences aside, often-times, this forum *is* an authoritative 
> source of information.

Not to mention the fun when people get all hot under the collar...;-)

Heat DOES make light, after all.

Seriously, I think that it has been a very productive thread.  I've
certainly learned a lot.  One very interesting part of which is that it
sounds like we're coming around the corner in a very, very long cycle to
where multiprocessor machines (e.g. quads and beyond) with large numbers
of processors (and cores per processor) in a single box with a single
large memory are going to once again be in vogue, be they CC-NUMA or
flat memory model boxes, and that this is likely to once again
significantly change the topology of the parallel computing landscape.

Or maybe not.  MPI was originally created for big iron machines like
this way back when only PVM or raw sockets were providing beowulfish
clustering of COTS boxes (more or less -- some of the COTS systems were
themselves supercomputers in the early days) on OTC networks.  Yes, it
should continue to be a productive paradigm as the wheel comes around
again that actually HELPS coders understand the limitations of CPU/IPC
bottlenecks, whether they are the result of shared memory bottlenecks of
one sort or another or due to a real external network.  It isn't just
about networking, even though on this list it has mostly been about
networking for some time.

I'm certainly interested in keeping an "open"(MP:-) mind, though, as the
hardware folks aren't exactly done turning the wheel, and it seems at
least possible that they'll be able to create hardware and associated
compilers and/or library support that permits the equally old shared
memory programming models to come around again as efficient paradigms
as well.  Many of the objections raised (e.g. processor affinity) SEEM
like they are in principle controllable by e.g. kernel and hardware
working in tandem, once a clear picture of what is required for
efficient operation emerges.  In that case the winner may be (if I
understand the arguments thus far) determined by ease of programming, or
the fact that with ENOUGH low-level support MPI represents at best an
additional layer of call structures that can only slow code down, not
speed it up.  Possibly trivially slow it down, allowing the MPI folks to
invoke ease of coding in the form of code portability the other way.

It is important to remember in both cases that not all parallel code
needs bleeding edge scaling -- all it needs is to scale "well enough"
across the available processors and be easy for the coder (whoever they
happen to be) to program.  Or is anyone asserting that embarrassingly
parallel programs, or very coarse grained, master-slave type parallel
programs, are going to perform vastly better with one paradigm than with
the other?  Surely there is a cut-off of sorts in IPC density below
which it really doesn't matter which one you use from a performance
point of view, just as there may or may not be tasks for which one or
the other is especially well suited beyond that threshold...

That's the debatable point I understand, but is it being asserted that
it is NEVER going to be sensible to use OpenMP in favor of MPI or just
that it is most LIKELY going to be smarter to use one or the other?  Or
even weaker, that there are now known to be certain specific tasks for
which one is better than the other, and a vast unknown elsewhere...?

    rgb

>
> gerry
>
> Toon Knapen wrote:
>> 
>> 
>> On 12/12/07, *Joe Landman* <landman at scalableinformatics.com 
>> <mailto:landman at scalableinformatics.com>> wrote:
>>
>>     This is why I note that talking about MPI vs OpenMP and other
>>     pseudo-debates generates mostly heat and very little light.  Reminds me
>>     of editor battles, shell scripting battles ...
>>
>>   I agree that discussions like these easily degenerate.
>>  That is actually one of the reasons why I'm looking for authoritative 
>> documents discussing the difference between both approaches. Such documents 
>> could come in handy when discussing the strategy to use concerning 
>> parallelisation of a project to bring the discussion forward in an 
>> objective way.
>> 
>> 
>> ------------------------------------------------------------------------
>> 
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From csamuel at vpac.org  Wed Dec 12 20:50:41 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Thu, 13 Dec 2007 15:50:41 +1100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <Pine.LNX.4.64.0712092130270.12457@coffee.psychology.mcmaster.ca>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<475CA246.6000501@gmail.com>
	<Pine.LNX.4.64.0712092130270.12457@coffee.psychology.mcmaster.ca>
Message-ID: <200712131550.44020.csamuel@vpac.org>

On Mon, 10 Dec 2007, Mark Hahn wrote:

> threads, of course, are antithetical to security, since the whole
> point is freedom to read/write anything.

This is one of the reasons that Tridge (of Samba fame) rants against 
the use of threads by people thinking that they'll be faster than 
independent processes.. :-)

Viz:

http://lists.samba.org/archive/samba-technical/2004-December/038301.html

> no, you're still clinging to the notion that threads are somehow
> inherently faster than processes. They aren't. They are inherently 
> slower, no matter what OS you are talking about. 
>
> Some OSes might implement processes so badly that threads come out 
> ahead. It is fundamental computer science that doing operations in a 
> threaded environment will be slower than doing operations in an
> equivalent process based environment, because they have to do more
> work. 
> 
> Using processes allows you to take advantage of a hardware memory
> protection system. Using threads doesn't. 

and:

http://lists.samba.org/archive/samba-technical/2004-December/038298.html

> What is it about the word "thread" that people find so damn sexy?
>
> Maybe it needs a name change 
> "slow-as-hell-no-memory-protection-locks-dont-work" API might be
> suitable, but I suspect the standards committees wouldn't like that 
> one. 
> 
> The MMU was added to CPUs for a very good reason. Why is it so hard
> to understand that trying to avoid it is a bad idea? 

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From fumie.costen at manchester.ac.uk  Thu Dec 13 05:37:43 2007
From: fumie.costen at manchester.ac.uk (f.costen@cs.man.ac.uk)
Date: Thu, 13 Dec 2007 13:37:43 +0000
Subject: [Beowulf] large array to run
In-Reply-To: <475E960B.4070509@scalableinformatics.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
Message-ID: <476135A7.8090509@cs.man.ac.uk>

Dear All,
I am facing a very strange problem with both g77 and ifort at the
moment.
When I try to use the array of
integer actual(9915,9915,9915)
in the test program which does not have
any other arrays,
the compilation works but
when I tried to run it
the program is immediately killed.
I do not get any error message for that

I tried this at our local cluster
and the Univ's supercomputer and
I get the same situation.
If any of you in the list has this sort of
problem and know the remedy for it
can you let me have your idea ?

Thank you very much
Fumie


From lindahl at pbm.com  Thu Dec 13 11:18:24 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Thu, 13 Dec 2007 11:18:24 -0800
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <Pine.LNX.4.64.0712121702070.12612@lilith.rgb.private.net>
References: <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
	<475F1838.8030305@scalableinformatics.com>
	<d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
	<475FE139.6060108@tamu.edu>
	<Pine.LNX.4.64.0712121702070.12612@lilith.rgb.private.net>
Message-ID: <20071213191824.GA18438@bx9.net>

On Wed, Dec 12, 2007 at 05:28:37PM -0500, Robert G. Brown wrote:

> That's the debatable point I understand, but is it being asserted that
> it is NEVER going to be sensible to use OpenMP in favor of MPI or just
> that it is most LIKELY going to be smarter to use one or the other?

The second. And that many people have wasted time when they make a
code do both.

-- greg


From hahn at mcmaster.ca  Thu Dec 13 11:56:58 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 13 Dec 2007 14:56:58 -0500 (EST)
Subject: [Beowulf] large array to run
In-Reply-To: <476135A7.8090509@cs.man.ac.uk>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
Message-ID: <Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>

> When I try to use the array of
> integer actual(9915,9915,9915)

I don't speak fortran natively, but isn't that array
approximately 3.6 TB in size?


From peter.st.john at gmail.com  Thu Dec 13 12:28:57 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Thu, 13 Dec 2007 15:28:57 -0500
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
Message-ID: <e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>

In C I'd get a return value from malloc (like, NULL in this case, assuming
the memory allocation failed). How does that work in modern fortran?
Peter

On Dec 13, 2007 2:56 PM, Mark Hahn <hahn at mcmaster.ca> wrote:

> > When I try to use the array of
> > integer actual(9915,9915,9915)
>
> I don't speak fortran natively, but isn't that array
> approximately 3.6 TB in size?

From gdjacobs at gmail.com  Thu Dec 13 13:07:37 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 13 Dec 2007 15:07:37 -0600
Subject: [Beowulf] large array to run
In-Reply-To: <476135A7.8090509@cs.man.ac.uk>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
Message-ID: <47619F19.1040206@gmail.com>

f.costen at cs.man.ac.uk wrote:
> Dear All,
> I am facing a very strange problem with both g77 and ifort at the
> moment.
> When I try to use the array of
> integer actual(9915,9915,9915)
> in the test program which does not have
> any other arrays,
> the compilation works but
> when I tried to run it
> the program is immediately killed.
> I do not get any error message for that
> 
> I tried this at our local cluster
> and the Univ's supercomputer and
> I get the same situation.
> If any of you in the list has this sort of
> problem and know the remedy for it
> can you let me have your idea ?
> 
> Thank you very much
> Fumie

You do realize that array is utilizing almost a gig of ram. Do you have
that much available (physical and virtual)? Is your kernel configured to
allow that high a ram limit per process? What happens when the size of
one dimension is tuned down slightly?

-- 
Geoffrey D. Jacobs


From gdjacobs at gmail.com  Thu Dec 13 13:11:03 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 13 Dec 2007 15:11:03 -0600
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
Message-ID: <47619FE7.3050409@gmail.com>

Mark Hahn wrote:
>> When I try to use the array of
>> integer actual(9915,9915,9915)
> 
> I don't speak fortran natively, but isn't that array
> approximately 3.6 TB in size?

Oops, forgot to put the decimal in the right place.

9915^3 * 8 bits/integer / 1024^3 bytes/GB = 907 GB.

It could be done with a 64 bit kernel. Too big for PAE.

-- 
Geoffrey D. Jacobs

To have no errors
  would be life without meaning
  No struggle, no joy


From gdjacobs at gmail.com  Thu Dec 13 13:17:35 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 13 Dec 2007 15:17:35 -0600
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>
	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
Message-ID: <4761A16F.30202@gmail.com>

Mark Hahn wrote:
>> When I try to use the array of
>> integer actual(9915,9915,9915)
> 
> I don't speak fortran natively, but isn't that array
> approximately 3.6 TB in size?

And quadruple that, 'cuz integers are 32 bits per in FORTRAN, by
default. So, you are correct.
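
For the record, the arithmetic can be checked in a few lines.  This is only
a sketch: the helper name array_bytes is made up, and the 4 bytes per
element assumes the compilers' default INTEGER size.

```python
def array_bytes(shape, bytes_per_element):
    """Total storage for a dense array with the given extents."""
    count = 1
    for extent in shape:
        count *= extent
    return count * bytes_per_element


shape = (9915, 9915, 9915)

one_byte = array_bytes(shape, 1)     # 974716135875 elements
four_byte = array_bytes(shape, 4)    # 3898864543500 bytes

print(one_byte // 1024**3, "GiB at 1 byte per element")    # 907
print(four_byte // 1024**3, "GiB at 4 bytes per element")  # 3631
print(round(four_byte / 1024**4, 2), "TiB")                # 3.55
```

So the 907 figure is the element count in units of 2^30, and the real
footprint with 4-byte integers is about 3.5 TiB (3.9 * 10^12 bytes).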


Note to self: make sure I am alert before I post.

-- 
Geoffrey D. Jacobs


From hahn at mcmaster.ca  Thu Dec 13 13:13:05 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 13 Dec 2007 16:13:05 -0500 (EST)
Subject: [Beowulf] large array to run
In-Reply-To: <e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com> 
	<4759B001.4090004@gmail.com> <475AD37F.3040004@gmail.com>
	<475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com> 
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712131608520.11484@coffee.psychology.mcmaster.ca>

> In C I'd get a return value from malloc (like, NULL in this case, assuming

in C, it depends on the size of the allocation.  in linux, the 
behavior depends on the /proc/sys/vm/overcommit_memory setting:
it can refuse to give you more memory than it has, or the max 
size can be set to swap + ram * overcommit_ratio/100 (with the
percentage read from /proc/sys/vm/overcommit_ratio), or 
(the default setting, overcommit_memory=0) a heuristic that lets
you make absurdly large allocations, as long as you don't touch 
them all...

(it also depends on whether your glibc contains a malloc that 
switches to mmap for large allocations rather than sbrk, etc.)
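
A rough sketch of how one might inspect these settings on a Linux box, in
Python.  The helper names read_vm_setting and strict_commit_limit are made
up for illustration, and the formula is the simplified strict-mode
(overcommit_memory=2) limit; it ignores huge pages and the
overcommit_kbytes alternative.

```python
def read_vm_setting(name, default=None):
    """Return the integer in /proc/sys/vm/<name>, or default if unreadable."""
    try:
        with open("/proc/sys/vm/" + name) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return default


def strict_commit_limit(ram_bytes, swap_bytes, ratio_percent):
    """Simplified limit in strict mode: swap + ram * ratio / 100."""
    return swap_bytes + ram_bytes * ratio_percent // 100


if __name__ == "__main__":
    # 0 = heuristic (default), 1 = always overcommit, 2 = strict accounting
    print("overcommit_memory:", read_vm_setting("overcommit_memory"))
    print("overcommit_ratio :", read_vm_setting("overcommit_ratio"))
```

For example, with 8 GiB of RAM, 2 GiB of swap and a ratio of 50, the strict
limit comes out at 6 GiB, far below the multi-terabyte array in question.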


From gdjacobs at gmail.com  Thu Dec 13 13:31:29 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Thu, 13 Dec 2007 15:31:29 -0600
Subject: [Beowulf] large array to run
In-Reply-To: <e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<476135A7.8090509@cs.man.ac.uk>	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
Message-ID: <4761A4B1.9010400@gmail.com>

Peter St. John wrote:
> In C I'd get a return value from malloc (like, NULL in this case,
> assuming the memory allocation failed). How does that work in modern
> fortran?
> Peter

It fails in amusingly unpredictable ways. Depending on how I spin the
counters, I can get it to repeat endlessly, or segfault initializing the
first few cells. This is with GFortran.

-- 
Geoffrey D. Jacobs


From rgb at phy.duke.edu  Thu Dec 13 14:56:43 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 13 Dec 2007 17:56:43 -0500 (EST)
Subject: [Beowulf] large array to run
In-Reply-To: <47619FE7.3050409@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<47619FE7.3050409@gmail.com>
Message-ID: <Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>

On Thu, 13 Dec 2007, Geoff Jacobs wrote:

> Mark Hahn wrote:
>>> When I try to use the array of
>>> integer actual(9915,9915,9915)
>>
>> I don't speak fortran natively, but isn't that array
>> approximately 3.6 TB in size?
>
> Oops, forgot to put the decimal in the right place.
>
> 9915^3 * 8 bits/integer / 1024^3 bytes/GB = 907 GB.
>
> It could be done with a 64 bit kernel. Too big for PAE.

Yeah, if you had a box with several hundred memory slots....

Which I say only semi-sarcastically.  They sound like they're coming,
they're coming.  Who knows, maybe they're here and I'm just out of
touch.

If it is a sparse matrix, then just maybe one can do something on this
scale, but otherwise, well, it's like telling Mathematica to go and
compute umpty-something factorial -- it will go out, make a heroic
effort, use all the free memory in the universe, and die valiantly
(perhaps taking down your computer with it if the kernel happens to need
some memory at a critical time when there isn't any).  Large scale
computation as a DOS attack...

    rgb

>
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From csamuel at vpac.org  Thu Dec 13 15:00:54 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 14 Dec 2007 10:00:54 +1100
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131608520.11484@coffee.psychology.mcmaster.ca>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
	<Pine.LNX.4.64.0712131608520.11484@coffee.psychology.mcmaster.ca>
Message-ID: <200712141000.59286.csamuel@vpac.org>

On Fri, 14 Dec 2007, Mark Hahn wrote:

> (it also depends on whether your glibc contains a malloc that
> switches to mmap for large allocations rather than sbrk, etc.)

I was looking at that the other day wondering why the maximum RAM & 
data segment size limits set by Torque with -l mem=2g were not being 
enforced.

After a bit of head scratching I tracked it down to the fact that 
somewhere around glibc 2.3 the old malloc() implementation 
using brk() was ripped out and replaced with one that uses mmap() for 
allocations of 128KB or more.

Unfortunately the kernel implementation of mmap() doesn't check the 
maximum memory size (RLIMIT_RSS) or maximum data size (RLIMIT_DATA) 
limits which were being set, but only the maximum virtual RAM size 
(RLIMIT_AS) - this is documented in the setrlimit(2) man page:

>       RLIMIT_AS
>              The maximum size of the process's virtual memory
>              (address space) in  bytes.   This  limit  affects calls
>              to brk(2), mmap(2) and  mremap(2), which fail with
>              the error ENOMEM upon exceeding  this limit.

(it also says that RLIMIT_RSS hasn't worked since 2.4.29, which seems 
to be borne out by a quick grep of 2.6.24-rc3 I have to hand)

In other words you can set a low memory limit of say 10MB with:

$ ulimit -m $((10*1024))

and then run a program that allocates 2GB RAM in large chunks 
successfully and only fails when it tries to request a trivial amount 
of RAM. :-(

I've submitted a patch for Torque to set RLIMIT_AS as well as 
RLIMIT_RSS & RLIMIT_DATA.
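
The RLIMIT_AS behaviour described in the man page excerpt above can be
seen from Python with the standard resource and mmap modules.  This is a
Linux-only sketch, not the Torque patch: the helper names and the sizes are
arbitrary choices for illustration.  It lowers the soft address-space limit
to a little above the current usage, attempts an anonymous mapping, and
then restores the old limit.

```python
import mmap
import os
import resource


def current_vm_bytes():
    """Current virtual size of this process (first field of /proc/self/statm)."""
    with open("/proc/self/statm") as f:
        pages = int(f.read().split()[0])
    return pages * os.sysconf("SC_PAGE_SIZE")


def map_under_as_limit(nbytes, headroom):
    """Try an anonymous mmap of nbytes with RLIMIT_AS = current size + headroom.

    Returns True if the mapping succeeded, False if it failed (ENOMEM).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    limit = current_vm_bytes() + headroom
    # Never try to exceed limits that are already in place.
    if soft != resource.RLIM_INFINITY:
        limit = min(limit, soft)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
    try:
        region = mmap.mmap(-1, nbytes)  # anonymous mapping, like a big malloc
    except OSError:
        return False
    else:
        region.close()
        return True
    finally:
        # The soft limit may be raised again, up to the hard limit.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```

Asking for 1 GiB with only 256 MiB of headroom, e.g.
map_under_as_limit(1 << 30, 256 << 20), should be refused, because the
address-space limit is checked on the mmap() path.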

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From csamuel at vpac.org  Thu Dec 13 15:17:11 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 14 Dec 2007 10:17:11 +1100
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<47619FE7.3050409@gmail.com>
	<Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>
Message-ID: <200712141017.11532.csamuel@vpac.org>

On Fri, 14 Dec 2007, Robert G. Brown wrote:

> Which I say only semi-sarcastically.  They sound like they're
> coming, they're coming.  Who knows, maybe they're here and I'm just
> out of touch.

Dunno how close to not being vapourware these are (they may already be 
available from the last part of the article):

http://www.theregister.co.uk/2007/12/10/amd_violin_memory/

> The company sells a Violin 1010 unit that holds up to 504GB of DRAM
> in a 2U box. Fill a rack, and you're looking at 10TB of DRAM. 

of course it's not cheap..

> Those of you who want to try Violin's gear now can get a
> 120GB "starter kit" for $50,000. 

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From toon at moene.indiv.nluug.nl  Thu Dec 13 11:44:19 2007
From: toon at moene.indiv.nluug.nl (Toon Moene)
Date: Thu, 13 Dec 2007 20:44:19 +0100
Subject: [Beowulf] large array to run
In-Reply-To: <476135A7.8090509@cs.man.ac.uk>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
Message-ID: <47618B93.3030208@moene.indiv.nluug.nl>

f.costen at cs.man.ac.uk wrote:

> Dear All,
> I am facing a very strange problem with both g77 and ifort at the
> moment.
> When I try to use the array of
> integer actual(9915,9915,9915)

That's roughly (10^4)^3 integers of 4 bytes, or 4*10^12 bytes (that's 4
Terabytes).  Are you sure you're allowed to use that many ?  Are that
many bytes present in your computer ?

-- 
Toon Moene - e-mail: toon at moene.indiv.nluug.nl - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.indiv.nluug.nl/~toon/
GNU Fortran's path to Fortran 2003: http://gcc.gnu.org/wiki/Fortran2003


From bdobbins at gmail.com  Thu Dec 13 12:51:27 2007
From: bdobbins at gmail.com (Brian Dobbins)
Date: Thu, 13 Dec 2007 15:51:27 -0500
Subject: [Beowulf] large array to run
In-Reply-To: <e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<e4d4fd070712131228g5fe9240cg5b7d890ddec6b834@mail.gmail.com>
Message-ID: <2b5e0c120712131251ne38775cpa067dbcd99502f8b@mail.gmail.com>

> In C I'd get a return value from malloc (like, NULL in this case, assuming
> the memory allocation failed). How does that work in modern fortran?
>

If it's a static array, it'll just crash without even 'starting' simply
because not enough memory is available.  If it's using the ALLOCATE statement
in F90, I believe the program stops on failure unless the 'STAT' (status)
specifier is provided; with STAT, the status variable comes back greater
than 0 and the code can handle the failure itself.

  Fortran *does* allow you to specify weird bounds, such as (10:12), and
even though the 'upper bound' in this example is the number twelve, the
number of elements is only 3 (with indices of 10, 11 and 12), but that
doesn't seem to be the case with the original poster's code, so the array
does seem to be allocating 3.6TB, assuming integers are 4 bytes.

  Cheers,
  - Brian

From csamuel at vpac.org  Thu Dec 13 15:29:57 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri, 14 Dec 2007 10:29:57 +1100
Subject: [Beowulf] large array to run
In-Reply-To: <200712141000.59286.csamuel@vpac.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<Pine.LNX.4.64.0712131608520.11484@coffee.psychology.mcmaster.ca>
	<200712141000.59286.csamuel@vpac.org>
Message-ID: <200712141029.57603.csamuel@vpac.org>

On Fri, 14 Dec 2007, Chris Samuel wrote:

> In other words you can set a low memory limit of say 10MB with:
>
> $ ulimit -m $((10*1024))

Mea culpa - should have been:

> In other words you can set a low data seg limit of say 10MB with:
>
> $ ulimit -d $((10*1024))

because, of course, the memory limit checking no longer exists in the 
kernel..

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

From rgb at phy.duke.edu  Thu Dec 13 15:34:49 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 13 Dec 2007 18:34:49 -0500 (EST)
Subject: [Beowulf] large array to run
In-Reply-To: <200712141017.11532.csamuel@vpac.org>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<47619FE7.3050409@gmail.com>
	<Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>
	<200712141017.11532.csamuel@vpac.org>
Message-ID: <Pine.LNX.4.64.0712131829410.12612@lilith.rgb.private.net>

On Fri, 14 Dec 2007, Chris Samuel wrote:

> On Fri, 14 Dec 2007, Robert G. Brown wrote:
>
>> Which I say only semi-sarcastically.  They sound like they're
>> coming, they're coming.  Who knows, maybe they're here and I'm just
>> out of touch.
>
> Dunno how close to not being vapourware these are (they may already be
> available from the last part of the article):
>
> http://www.theregister.co.uk/2007/12/10/amd_violin_memory/
>

Ya, reminiscent of Duke's old trapeze project -- build a cluster that
just acts like a virtual or otherwise extension of memory in a NUMA
model of one sort or another.

The question there is what is the bandwidth, what sort of latency, how
does one manage e.g. locking.  Should one REALLY parallelize the
project, and split the array up on a real cluster (and use the cluster
nodes to manage parallelized portions of e.g. matrix operations on the
array) or try to use a single processor with a huge virtual memory
attached via a network or use a multiprocessor with a huge virtual
memory attached via a network.  Oh my aching head.  Maybe we should just
try shrinking the size of the matrix by an order of magnitude in each
dimension and live in the 4 GB or so THAT would take...;-)

>> The company sells a Violin 1010 unit that holds up to 504GB of DRAM
>> in a 2U box. Fill a rack, and you're looking at 10TB of DRAM.
>
> of course it's not cheap..
>
>> Those of you who want to try Violin's gear now can get a
>> 120GB "starter kit" for $50,000.

Let me run right down to the bank...;-)

Maybe if my novels ever sell eight million copies and I can buy boxes
like this just to play with them in my garage...;-)

    rgb

>
> cheers,
> Chris
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From peter.skomoroch at gmail.com  Thu Dec 13 15:50:28 2007
From: peter.skomoroch at gmail.com (Peter Skomoroch)
Date: Thu, 13 Dec 2007 18:50:28 -0500
Subject: [Beowulf] large array to run
Message-ID: <e4fc0d2a0712131550t3cb83e7na733666e8b87e36d@mail.gmail.com>

This reminds me of a similar issue I had.  What approaches do you take for
large dense matrix multiplication in MPI, when the matrices are too large to
fit into cluster memory?  If I hack up something to cache intermediate
results to disk, the IO seems to drag everything to a halt and I'm looking
for a better solution.  I'd like to use some libraries like PETSc, but how
would you work around memory limitations like this (short of building a
bigger cluster)?


> >> I don't speak fortran natively, but isn't that array
> >> approximately 3.6 TB in size?
> >
> > Oops, forgot to put the decimal in the right place.
> >
> > 9915^3 * 8 bits/integer / 1024^3 bytes/GB = 907 GB.
> >
> > It could be done with a 64 bit kernel. Too big for PAE.
>
> Yeah, if you had a box with several hundred memory slots....
>
> Which I say only semi-sarcastically.  They sound like they're coming,
> they're coming.  Who knows, maybe they're here and I'm just out of
> touch.
>
> If it is a sparse matrix, then just maybe one can do something on this
> scale, but otherwise, well, it's like telling Mathematica to go and
> compute umpty-something factorial -- it will go out, make a heroic
> effort, use all the free memory in the universe, and die valiantly
> (perhaps taking down your computer with it if the kernel happens to need
> some memory at a critical time when there isn't any).  Large scale
> computation as a DOS attack...
>


-- 
Peter N. Skomoroch
peter.skomoroch at gmail.com
http://www.datawrangling.com

From landman at scalableinformatics.com  Thu Dec 13 18:01:27 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 13 Dec 2007 21:01:27 -0500
Subject: [Beowulf] large array to run
In-Reply-To: <e4fc0d2a0712131550t3cb83e7na733666e8b87e36d@mail.gmail.com>
References: <e4fc0d2a0712131550t3cb83e7na733666e8b87e36d@mail.gmail.com>
Message-ID: <4761E3F7.8010503@scalableinformatics.com>

Peter Skomoroch wrote:
> This reminds me of a similar issue I had.  What approaches do you take for
> large dense matrix multiplication in MPI, when the matrices are too large to
> fit into cluster memory?  If I hack up something to cache intermediate

Hi Peter:

> results to disk, the IO seems to drag everything to a halt and I'm looking
> for a better solution.  I'd like to use some libraries like PETSc, but how

   Disk memory has a latency of 10^-3 seconds or so, and a bandwidth of
10^7 to 10^8 bytes/second.  Compare that to physical RAM: latencies of
10^-7 seconds or less, and bandwidths of 10^9 to 10^10 bytes/second.

   If you are going to do disk IO, pay that latency cost once for many 
pages, not once per page with seeks.

   Just like with other streaming calculations, you likely need to do 
some sort of double buffering.  That said, disk IO is not really the answer.
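
   To put numbers on the "pay the latency once for many pages" point, here is
a minimal cost-model sketch.  The latency and bandwidth figures are the rough
ones quoted above; the 4096-byte page, the 64 MB chunk, the 1 GB total and the
function name are assumptions chosen only for illustration.

```python
def transfer_seconds(total_bytes, chunk_bytes, latency_s, bandwidth_bps):
    """Time to move total_bytes in chunks, paying latency once per chunk."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * latency_s + total_bytes / bandwidth_bps

GB = 10**9
PAGE = 4096            # assumed page size
BIG_CHUNK = 64 * 10**6  # assumed streaming chunk size

# Disk: ~1e-3 s latency, ~1e8 bytes/s.  RAM: ~1e-7 s latency, ~1e10 bytes/s.
disk_per_page = transfer_seconds(GB, PAGE, 1e-3, 1e8)
disk_streamed = transfer_seconds(GB, BIG_CHUNK, 1e-3, 1e8)
ram = transfer_seconds(GB, PAGE, 1e-7, 1e10)

print(f"disk, one seek per page : {disk_per_page:8.1f} s")
print(f"disk, large chunks      : {disk_streamed:8.1f} s")
print(f"RAM, page at a time     : {ram:8.3f} s")
```

   In this model the seek-per-page case is dominated by latency, while the
streamed case is dominated by bandwidth; either way the disk stays far behind
RAM.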

> would you work around memory limitations like this (short of building a
> bigger cluster)?

   20+ years ago I worked on a large dense Markov matrix calculation 
where after computing the relevant matrix elements and using them in the 
calculation, I would throw them away.  It was cheaper (less time 
consuming) than spilling them to disk and then trying to recover them 
later.  Then again, this was an IBM 3090 VF 180 ... so ...

   Since you are doing matrix multiplication, I might suggest looking at 
the Golub and Van Loan bible on Matrix Computations for some ideas. 
That said, Matrix multiplications are decomposable.  If you can 
reconstruct matrix elements easily (more quickly than storage/retrieval) 
this might be a good method.  Or if you can decompose it far enough, or 
if the problem has some sort of essential symmetry you can exploit in 
the matrix structure, this could help.  Symmetries not only imply 
conservation laws, they tend to reduce storage requirements.
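
   A minimal sketch of what "decomposable" means in practice: the product can
be built one tile at a time, so only three small tiles need to be resident at
any moment.  This pure-Python version keeps the whole matrices in memory and
only illustrates the access pattern; the function name and block size are
assumptions, and a real out-of-core code would load or regenerate each tile
where the inner loops begin.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices (lists of lists) one block x block tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Only the (i0,k0) tile of a, the (k0,j0) tile of b and the
                # (i0,j0) tile of c are touched inside these inner loops.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, n)):
                            c[i][j] += aik * b[k][j]
    return c

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(blocked_matmul(a, identity, block=2) == a)
```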

   Out of curiosity, what size matrices are you using?  I know some of
the structural folks can, with large enough DoF problems, hit 10^8 or so
on a side.  Not dense (usually with specific banded structure).

   And that brings up another possibility.  If you can perform various 
transforms on your matrix to get it into a well known form (banded, ...) 
this could make multiplications go much faster.
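
   As an illustration of why a banded form helps (not of how to reach it),
here is a sketch of a matrix-vector product that stores and touches only the
nonzero diagonals, so the work scales with n times the bandwidth rather than
n^2.  The storage layout and function name are assumptions made for this
example, and a matrix-vector product is used as the simplest case of a
multiplication.

```python
def banded_matvec(diagonals, x):
    """y = A x, where A is stored as {offset: list of diagonal entries}.

    Offset 0 is the main diagonal, +k the k-th superdiagonal, -k the k-th
    subdiagonal.  Entry t of diagonal k is A[t][t + k] for k >= 0 and
    A[t - k][t] for k < 0.
    """
    n = len(x)
    y = [0.0] * n
    for offset, diag in diagonals.items():
        for t, value in enumerate(diag):
            row = t if offset >= 0 else t - offset
            col = row + offset
            y[row] += value * x[col]
    return y

# Tridiagonal example: A = [[2,-1,0], [-1,2,-1], [0,-1,2]]
tridiagonal = {0: [2, 2, 2], 1: [-1, -1], -1: [-1, -1]}
print(banded_matvec(tridiagonal, [1, 2, 3]))
```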


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From rgb at phy.duke.edu  Thu Dec 13 18:35:37 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 13 Dec 2007 21:35:37 -0500 (EST)
Subject: [Beowulf] large array to run
In-Reply-To: <e4fc0d2a0712131550t3cb83e7na733666e8b87e36d@mail.gmail.com>
References: <e4fc0d2a0712131550t3cb83e7na733666e8b87e36d@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712132125580.1172@lilith.rgb.private.net>

On Thu, 13 Dec 2007, Peter Skomoroch wrote:

> This reminds me of a similar issue I had.  What approaches do you take for
> large dense matrix multiplication in MPI, when the matrices are too large to
> fit into cluster memory?  If I hack up something to cache intermediate
> results to disk, the IO seems to drag everything to a halt and I'm looking
> for a better solution.  I'd like to use some libraries like PETSc, but how
> would you work around memory limitations like this (short of building a
> bigger cluster)?

You can build a cluster differently, maybe -- designing a bunch of nodes
that basically just form a memory-network-memory cache.  Spending more
money on memory and network, less on CPU.  But if you have a fundamental
limitation of less aggregate memory than the size of your matrix, you
pretty much have to store it somewhere, the only question is where and
how fast the store is and how much it costs to build it.

    rgb

>
>
>
>
>
>>>> I don't speak fortran natively, but isn't that array
>>>> approximately 3.6 TB in size?
>>>
>>> Oops, forgot to put the decimal in the right place.
>>>
>>> 9915^3 * 8 bits/integer / 1024^3 bytes/GB = 907 GB.
>>>
>>> It could be done with a 64 bit kernel. Too big for PAE.
>>
>> Yeah, if you had a box with several hundred memory slots....
>>
>> Which I say only semi-sarcastically.  They sound like they're coming,
>> they're coming.  Who knows, maybe they're here and I'm just out of
>> touch.
>>
>> If it is a sparse matrix, then just maybe one can do something on this
>> scale, but otherwise, well, it's like telling Mathematica to go and
>> compute umpty-something factorial -- it will go out, make a heroic
>> effort, use all the free memory in the universe, and die valiantly
>> (perhaps taking down your computer with it if the kernel happens to need
>> some memory at a critical time when there isn't any).  Large scale
>> computation as a DOS attack...
>>
>
>
>
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From tjrc at sanger.ac.uk  Fri Dec 14 01:49:17 2007
From: tjrc at sanger.ac.uk (Tim Cutts)
Date: Fri, 14 Dec 2007 09:49:17 +0000
Subject: [Beowulf] large array to run
In-Reply-To: <Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>
	<20071207202431.GA17274@bx9.net> <4759B001.4090004@gmail.com>
	<475AD37F.3040004@gmail.com> <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<476135A7.8090509@cs.man.ac.uk>
	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<47619FE7.3050409@gmail.com>
	<Pine.LNX.4.64.0712131751260.12612@lilith.rgb.private.net>
Message-ID: <0FC65769-AC1D-4175-A99F-0148E3558753@sanger.ac.uk>


On 13 Dec 2007, at 10:56 pm, Robert G. Brown wrote:

> Yeah, if you had a box with several hundred memory slots....
>
> Which I say only semi-sarcastically.  They sound like they're coming,
> they're coming.  Who knows, maybe they're here and I'm just out of
> touch.

They've been around for a long time, just not as commodity.  The SGI  
Altix can certainly scale to terabytes of memory.  I have heard of one  
Altix installation with 8TB of physical memory.  Must have cost a  
fortune...  In the X86 space they are, as you say, only just starting  
to appear.

Tim


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From Daniel.Pfenniger at obs.unige.ch  Fri Dec 14 08:07:22 2007
From: Daniel.Pfenniger at obs.unige.ch (Daniel Pfenniger)
Date: Fri, 14 Dec 2007 17:07:22 +0100
Subject: [Beowulf] multi-threading vs. MPI
In-Reply-To: <20071213191824.GA18438@bx9.net>
References: <475C43F7.5070908@gmail.com>
	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>
	<475DDC6F.3060007@scalableinformatics.com>
	<20071211063603.GA18419@bx9.net>
	<475E960B.4070509@scalableinformatics.com>
	<20071211222726.GC9072@bx9.net>
	<475F1838.8030305@scalableinformatics.com>
	<d5bdff000712120408m34246d1cg2570d2ac519fc7cc@mail.gmail.com>
	<475FE139.6060108@tamu.edu>
	<Pine.LNX.4.64.0712121702070.12612@lilith.rgb.private.net>
	<20071213191824.GA18438@bx9.net>
Message-ID: <4762AA3A.2010002@obs.unige.ch>


Greg Lindahl wrote:
> On Wed, Dec 12, 2007 at 05:28:37PM -0500, Robert G. Brown wrote:
> 
>> That's the debateable point I understand, but is it being asserted that
>> it is NEVER going to be sensible to use OpenMP in favor of MPI or just
>> that it is most LIKELY going to be smarter to use one or the other?
> 
> The second. And that many people have wasted time when they make a
> code do both.
> 
> -- greg
> 

My experience tells me: it depends.

What I have seen in clusters of SMP nodes is that one first may well develop
a pure MPI code that scales well when running 1 process per node.  At this
stage the processes enjoy maximum network capacity, RAM space and disk,
but many CPUs stay idle.

The options to make use of these CPUs are:

1) Run several processes per node, keeping the MPI code unchanged.
    Depending on the code and cluster characteristics, however, scaling
    may drop due to the shared network capacity, RAM space, or disk.

2) Keep 1 process per node but use OpenMP within local processes.
    Depending on the type of code this may provide better speed-up than 1).
    At least it should improve performance wrt 1 process per node.

In summary, my recommendation would be to parallelize as much as possible
at a high level with MPI only.  But if network, RAM or disk become
bottlenecks when running several processes per node, parallelize the
code with OpenMP as well.  Such nested parallelism can easily be ported to
different SMP node clusters with different characteristics.

Notice that at the level of each CPU, compilers and microcode already
achieve a lower level of nested parallelism.  The same holds in networks
and in hard drives.  Throughout computing history, nesting parallelism
over increasingly many levels has proven to be the way to proceed as
codes become increasingly complex.

	Dan


From tom.elken at qlogic.com  Fri Dec 14 08:41:15 2007
From: tom.elken at qlogic.com (Tom Elken)
Date: Fri, 14 Dec 2007 08:41:15 -0800
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	librarieswith MPI
In-Reply-To: <320e992a0712140210t29c93fc8je2e2541533812461@mail.gmail.com>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<6DB5B58A8E5AB846A7B3B3BFF1B4315A01850590@AVEXCH1.qlogic.org>
	<320e992a0712140210t29c93fc8je2e2541533812461@mail.gmail.com>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A018F7FF1@AVEXCH1.qlogic.org>

> -----Original Message-----
> From: Eray Ozkural [mailto:examachine at gmail.com] 
> Sent: Friday, December 14, 2007 2:11 AM
> To: Tom Elken
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] Using Autoparallel compilers or 
> Multi-Threaded librarieswith MPI
> 
> On Dec 12, 2007 7:35 PM, Tom Elken <tom.elken at qlogic.com> wrote:
> > Results of the VERY non-scientific survey:
> >
> > # reporting use of Autoparallel features with MPI:          0
> >
> > # reporting use of multi-threaded math libraries with MPI:  1
> >
 
> Well, then, is there really such a thing that extracts 
> threads from those
> horrible C codes and generates MPI code?

I have heard of SW tools that try to do some of that, but they did not
achieve much commercial success.
But that is not what I meant.

I guess I was relying on readers' memory of my original post about
this subject.  Since that post was way back in November, that was a
dangerous assumption.  Thankfully we have an archive:
http://www.beowulf.org/archive/2007-November/020211.html

'Autoparallel features with MPI' came from this in the original post:
"I was wondering how many people use either auto-parallel compiler
features, or multi-threaded math libraries (Goto, MKL, ACML, etc.) to
provide some thread-level parallelism on a cluster where you primarily
use MPI to achieve your parallel execution.*"

So I meant that the source code is parallelized using MPI.  Then in an
effort to create something like a hybrid MPI/OpenMP program, but without
having to add the OpenMP directives, you use the automatic
parallelization feature of common compilers:
-parallel  in the Intel compiler
-apo       in the PathScale compiler
-Mconcur   in the PGI compiler,  etc.
to find loops which can profitably be parallelized using threads.

Here was the example I mentioned in the original post:
"For example, if an autoparallelizing compiler could find effective
4-way thread-level parallelism in an MPI code and you were running on a
cluster of 8 nodes each with two quad-core CPUs, 64 cores total, you
might choose to run with 16 MPI threads and set your NUM_THREADS
variable to 4, to run with all 64 cores of the cluster executing work
with reasonable efficiency. "

So no one responded that they have done this, let alone found it to be
faster than running with pure MPI ranks (no threads).
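
The arithmetic in the quoted example can be sketched as follows.  The function
name and its parameters are hypothetical, and the sketch only works out the
layout; it says nothing about whether the hybrid run would actually be
faster.

```python
def hybrid_layout(nodes, sockets_per_node, cores_per_socket, threads_per_rank):
    """Return (total cores, MPI ranks, ranks per node) for a hybrid run."""
    cores_per_node = sockets_per_node * cores_per_socket
    if cores_per_node % threads_per_rank:
        raise ValueError("threads per rank must divide the cores per node")
    total_cores = nodes * cores_per_node
    ranks = total_cores // threads_per_rank
    return total_cores, ranks, ranks // nodes

# 8 nodes, each with two quad-core CPUs, 4 threads per MPI rank.
print(hybrid_layout(nodes=8, sockets_per_node=2, cores_per_socket=4,
                    threads_per_rank=4))
```

For the cluster in the example this gives 64 cores, 16 ranks and 2 ranks per
node.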

-Tom


> Not that I believe 
> it is impossible
> (since I work for a company that does a similar thing) but I 
> would like to know
> which autoparallel MPI code the posters had in mind. Is there 
> a market for
> that kind of a compiler?
> 
> Best,
> 
> -- 
> Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent 
> University, Ankara
> 


From examachine at gmail.com  Fri Dec 14 02:10:54 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Fri, 14 Dec 2007 12:10:54 +0200
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	librarieswith MPI
In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01850590@AVEXCH1.qlogic.org>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<6DB5B58A8E5AB846A7B3B3BFF1B4315A01850590@AVEXCH1.qlogic.org>
Message-ID: <320e992a0712140210t29c93fc8je2e2541533812461@mail.gmail.com>

On Dec 12, 2007 7:35 PM, Tom Elken <tom.elken at qlogic.com> wrote:
> Results of the VERY non-scientific survey:
>
> # reporting use of Autoparallel features with MPI:          0
>
> # reporting use of multi-threaded math libraries with MPI:  1
>
> The '1' was using multithreaded BLAS and MPI on HPL (a benchmark, not an
> application) and his recollection was that it was not a win over pure
> MPI.
>
> I'll let you know later when we've resolved the discussion in the HPG
> committee on how this might affect SPEC MPI2007.

Well, then, is there really such a thing that extracts threads from those
horrible C codes and generates MPI code? Not that I believe it is impossible
(since I work for a company that does a similar thing) but I would like to know
which autoparallel MPI code the posters had in mind. Is there a market for
that kind of a compiler?

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara


From examachine at gmail.com  Fri Dec 14 09:49:30 2007
From: examachine at gmail.com (Eray Ozkural)
Date: Fri, 14 Dec 2007 19:49:30 +0200
Subject: [Beowulf] Using Autoparallel compilers or Multi-Threaded
	librarieswith MPI
In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A018F7FF1@AVEXCH1.qlogic.org>
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A0184F6FD@AVEXCH1.qlogic.org>
	<6DB5B58A8E5AB846A7B3B3BFF1B4315A01850590@AVEXCH1.qlogic.org>
	<320e992a0712140210t29c93fc8je2e2541533812461@mail.gmail.com>
	<6DB5B58A8E5AB846A7B3B3BFF1B4315A018F7FF1@AVEXCH1.qlogic.org>
Message-ID: <320e992a0712140949y7913e811v76501df1b170d483@mail.gmail.com>

On Dec 14, 2007 6:41 PM, Tom Elken <tom.elken at qlogic.com> wrote:
> > -----Original Message-----
> > From: Eray Ozkural [mailto:examachine at gmail.com]
> > Well, then, is there really such a thing that extracts
> > threads from those
> > horrible C codes and generates MPI code?
>
> I have heard of SW tools that try to do some of that, but they did not
> achieve much commercial success.
> But that is not what I meant.

Sorry for the misunderstanding.

> I guess I was relying on memory of readers about my original post about
> this subject.  Since that post was way back in November, that was a
> dangerous assumption.  Thankfully we have an archive:
> http://www.beowulf.org/archive/2007-November/020211.html
>
> 'Autoparallel features with MPI' came from this in the original post:
> "I was wondering how many people use either auto-parallel compiler
> features, or multi-threaded math libraries (Goto, MKL, ACML, etc.) to
> provide some thread-level parallelism on a cluster where you primarily
> use MPI to achieve your parallel execution.*"
>
> So I meant that the source code is parallelized using MPI.  Then in an
> effort to create something like a hybrid MPI/OpenMP program, but without
> having to add the OpenMP directives, you use the automatic
> parallelization feature of common compilers:
> -parallel  in the Intel compiler
> -apo       in the PathScale compiler
> -Mconcur   in the PGI compiler,  etc.
> to find loops which can profitably be parallelized using threads.

Well, then, I seem to recall, only in a very blurred fashion, some pragmas of
the SGI compiler. I even recall there was support in STL, or maybe I am
making up things. Quite possible.

I hadn't realized there were auto-parallel features in so many compilers;
thank you for the information. Do these guys work well?

Best,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://www.cs.bilkent.edu.tr/~erayo  Malfunct: http://myspace.com/malfunct
ai-philosophy: http://groups.yahoo.com/group/ai-philosophy


From ctierney at hypermall.net  Fri Dec 14 18:18:19 2007
From: ctierney at hypermall.net (Craig Tierney)
Date: Fri, 14 Dec 2007 19:18:19 -0700
Subject: [Beowulf] large array to run
In-Reply-To: <4761A16F.30202@gmail.com>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<476135A7.8090509@cs.man.ac.uk>	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<4761A16F.30202@gmail.com>
Message-ID: <4763396B.5040701@hypermall.net>

Geoff Jacobs wrote:
> Mark Hahn wrote:
>>> When I try to use the array of
>>> integer actual(9915,9915,9915)
>> I don't speak fortran natively, but isn't that array
>> approximately 3.6 TB in size?
> 
> And quadruple that, 'cuz integers are 32 bits per in FORTRAN, by
> default. So, you are correct.
> 

(Ready to be corrected)

Integer*4 is 32 bits in Fortran; default integers are not necessarily.
The standard does not specify the size of an integer.  On older Crays
(at least, not the AMD-based XT series), integers are 64-bit.

Craig


> 
> Note to self: make sure I am alert before I post.
> 


From gdjacobs at gmail.com  Fri Dec 14 19:46:42 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Fri, 14 Dec 2007 21:46:42 -0600
Subject: [Beowulf] large array to run
In-Reply-To: <4763396B.5040701@hypermall.net>
References: <d5bdff000712070426q2236991bk28add574bfa4d48f@mail.gmail.com>	<20071207202431.GA17274@bx9.net>	<4759B001.4090004@gmail.com>	<475AD37F.3040004@gmail.com>	<475C43F7.5070908@gmail.com>	<320e992a0712091545t5832dad6ocf828436111e5774@mail.gmail.com>	<475DDC6F.3060007@scalableinformatics.com>	<20071211063603.GA18419@bx9.net>	<475E960B.4070509@scalableinformatics.com>	<476135A7.8090509@cs.man.ac.uk>	<Pine.LNX.4.64.0712131456180.11484@coffee.psychology.mcmaster.ca>
	<4761A16F.30202@gmail.com> <4763396B.5040701@hypermall.net>
Message-ID: <47634E22.3070303@gmail.com>

Craig Tierney wrote:
<snippage />

Let me rephrase: Integers are 32 bits in FORTRAN on my system, which is
using gfortran 4.2.3 as compiler, targeting the i386 architecture.

IIRC, the CDC 6600 (and compatibles) had a 48 bit integer format as
default. So, yes it does vary. I guess I was rolling the dice a little.

-- 
Geoffrey D. Jacobs


From TPierce at rohmhaas.com  Fri Dec 14 17:07:24 2007
From: TPierce at rohmhaas.com (Thomas H Dr Pierce)
Date: Fri, 14 Dec 2007 20:07:24 -0500
Subject: Fw: [Beowulf] large array to run
Message-ID: <OF4774E0CF.B6BE0EAE-ON852573B2.0005AA8F-852573B2.00062C67@rohmhaas.com>

beowulf-bounces at beowulf.org wrote on 12/13/2007 06:50:28 PM:

> This reminds me of a similar issue I had.  What approaches do you 
> take for large dense matrix multiplication in MPI, when the matrices
> are too large to fit into cluster memory?  If I hack up something to
> cache intermediate results to disk, the IO seems to drag everything 
> to a halt and I'm looking for a better solution.  I'd like to use 
> some libraries like PETSc, but how would you work around memory 
> limitations like this (short of building a bigger cluster)? 
> 

Dear Peter, 

There are many algorithms for matrix operations that depend on the
properties of the matrix and the operation.
You can easily write to a tmpfs RAM disk filesystem to speed up methods
that involve reading and writing of temporary files.

So what I do now is take those old Fortran codes that read and write files 
and keep the intermediate result files in ramdisk. 
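
A minimal sketch of the idea, assuming that /dev/shm is a writable tmpfs mount
(as on most Linux systems) and falling back to the default temporary directory
otherwise.  The file name and helper function are hypothetical.

```python
import os
import tempfile

# Assumption: /dev/shm is a tmpfs (RAM-backed) mount.  If it is missing or
# not writable, use the default temporary directory instead.
shm = "/dev/shm"
scratch_root = shm if os.path.isdir(shm) and os.access(shm, os.W_OK) else None

def roundtrip(values):
    """Write intermediate values to a scratch file and read them back."""
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        path = os.path.join(scratch, "intermediate.dat")
        with open(path, "w") as f:
            f.write("\n".join(repr(v) for v in values))
        with open(path) as f:
            return [float(line) for line in f]

print(roundtrip([1.5, 2.25, 3.0]))
```

The files still occupy RAM, so this speeds up the I/O pattern of the code but
does not create any extra memory.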

------
Sincerely,

   Tom Pierce
 

From sigut at id.ethz.ch  Mon Dec 17 01:33:49 2007
From: sigut at id.ethz.ch (G.M.Sigut)
Date: Mon, 17 Dec 2007 10:33:49 +0100
Subject: [Beowulf] Re:
In-Reply-To: <200712152001.lBFK097n019366@bluewest.scyld.com>
References: <200712152001.lBFK097n019366@bluewest.scyld.com>
Message-ID: <1197884030.25232.6.camel@gms2.ethz.ch>

On Sat, 2007-12-15 at 12:01 -0800, beowulf-request at beowulf.org wrote:
...
>    2. Re: large array to run (Geoff Jacobs)
...
> Message: 2
> Date: Fri, 14 Dec 2007 21:46:42 -0600
> From: Geoff Jacobs <gdjacobs at gmail.com>
> Subject: Re: [Beowulf] large array to run
> To: Craig Tierney <ctierney at hypermall.net>
> Cc: beowulf at beowulf.org, Mark Hahn <hahn at mcmaster.ca>
...
> Craig Tierney wrote:
> <snippage />
...
> IIRC, the CDC 6600 (and compatibles) had a 48 bit integer format as
> default. So, yes it does vary. I guess I was rolling the dice a little.

My memory tells me that while the arithmetic on the 6000 series
might have been done with 48 bits, the memory assignment was in words.
One integer took one word of 60 bits (i.e. 5 bytes of 12 bits each).

George

-- 
 >>>>>>>>>>>>>>>>>>>>>>>>>>  George M. Sigut  <<<<<<<<<<<<<<<<<<<<<<<<<<
 ETH Zurich,  Informatikdienste, Abteilung Systemdienste, CH-8092 Zurich
 Swiss Federal Inst. of Technology Zurich, IT Services,  System Services
 e-mail: sigut at id.ethz.ch,  Phone:+41 44 632 5763,  Fax: +41 44 632 1022
 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


From sigut at id.ethz.ch  Mon Dec 17 01:36:36 2007
From: sigut at id.ethz.ch (G.M.Sigut)
Date: Mon, 17 Dec 2007 10:36:36 +0100
Subject: [Beowulf] large array to run
Message-ID: <1197884196.25232.8.camel@gms2.ethz.ch>

Blast, forgot the subject...

On Sat, 2007-12-15 at 12:01 -0800, beowulf-request at beowulf.org wrote:
...
>    2. Re: large array to run (Geoff Jacobs)
...
> Message: 2
> Date: Fri, 14 Dec 2007 21:46:42 -0600
> From: Geoff Jacobs <gdjacobs at gmail.com>
> Subject: Re: [Beowulf] large array to run
> To: Craig Tierney <ctierney at hypermall.net>
> Cc: beowulf at beowulf.org, Mark Hahn <hahn at mcmaster.ca>
...
> Craig Tierney wrote:
> <snippage />
...
> IIRC, the CDC 6600 (and compatibles) had a 48 bit integer format as
> default. So, yes it does vary. I guess I was rolling the dice a little.

My memory tells me that while the arithmetic on the 6000 series
might have been done with 48 bits, the memory assignment was in words.
One integer took one word of 60 bits (i.e. 5 bytes of 12 bits each).

George

-- 
 >>>>>>>>>>>>>>>>>>>>>>>>>>  George M. Sigut  <<<<<<<<<<<<<<<<<<<<<<<<<<
 ETH Zurich,  Informatikdienste, Abteilung Systemdienste, CH-8092 Zurich
 Swiss Federal Inst. of Technology Zurich, IT Services,  System Services
 e-mail: sigut at id.ethz.ch,  Phone:+41 44 632 5763,  Fax: +41 44 632 1022
 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>-<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


From fumie.costen at manchester.ac.uk  Mon Dec 17 01:47:11 2007
From: fumie.costen at manchester.ac.uk (f.costen@cs.man.ac.uk)
Date: Mon, 17 Dec 2007 09:47:11 +0000
Subject: [Beowulf] thanks for the array issue
Message-ID: <4766459F.7080601@cs.man.ac.uk>

Dear Donald (Shillady), James (Cownie),
Alan (Scheinine), Toon (Moene), Geoff (Jacobs),

Thank you very much for your feedback.
This array is meant to be used to build
an FDTD radio environment at the master ( MPI_WORLD == 0 ),
and part of it is planned to be sent to each slave.
That's why this particular array is used at the master.

Sadly, as you pointed out, the size of the memory per core
restricted my calculation,
even though the shell environment in .cshrc is set with:
limit coredumpsize 0
unlimit stacksize

These two lines did not turn the computers infinite ;-) > Mike(Frese)

So the solution I came up with,
as Brian (Dobbins) suggested on the 13th,
is to declare
#define ntmax 820
integer*4, ALLOCATABLE :: a( : , : , :)
allocate(a(ntmax,ntmax,localsmalldimension))
and to distribute this matrix a(,,) across the slaves,
so that each piece fits in the memory available at each slave (and master).

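The bookkeeping behind such a split can be sketched as below.  This is only
an illustration: the number of slaves, the assumption that the full third
dimension is also ntmax, and the function names are all hypothetical, and the
real code is Fortran with MPI rather than Python.

```python
def slab_sizes(n_planes, n_ranks):
    """Split n_planes along the third dimension as evenly as possible."""
    base, extra = divmod(n_planes, n_ranks)
    return [base + (1 if r < extra else 0) for r in range(n_ranks)]

def slab_bytes(ntmax, local_planes, element_bytes=4):
    """Memory for a(ntmax, ntmax, local_planes) with integer*4 elements."""
    return ntmax * ntmax * local_planes * element_bytes

ntmax = 820    # from the post
n_ranks = 8    # hypothetical number of slaves
for rank, planes in enumerate(slab_sizes(ntmax, n_ranks)):
    mib = slab_bytes(ntmax, planes) / 1024**2
    print(f"rank {rank}: localsmalldimension = {planes}, {mib:.1f} MiB")
```
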
Thank you very much
Fumie


From deadline at eadline.org  Mon Dec 17 06:56:24 2007
From: deadline at eadline.org (Douglas Eadline)
Date: Mon, 17 Dec 2007 09:56:24 -0500 (EST)
Subject: [Beowulf] Interesting Webinar: Multi-core in HPC
In-Reply-To: <4766459F.7080601@cs.man.ac.uk>
References: <4766459F.7080601@cs.man.ac.uk>
Message-ID: <38618.192.168.1.1.1197903384.squirrel@mail.eadline.org>


A bit of self promotion.

This Wednesday, December 19, 2007, there will be a free webinar
sponsored by IBM, Cisco, and Intel that will discuss
Multi-core in HPC. This will be a panel discussion
so you can *ask* questions!

The webinar will be live: 1:00 PM Eastern | 10:00 AM Pacific | 5:00 PM GMT

I will be moderating the webinar and you will be able to
submit questions during the webinar. The Webinar is called:

   Ask the Experts: Effective Use of Multi-core in HPC

You can get more information and sign-up at:

http://www.linux-mag.com/microsites.php?site=business-class-hpc&sid=main&p=4587

See you there.

--
Doug


From gdjacobs at gmail.com  Mon Dec 17 12:40:22 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Mon, 17 Dec 2007 14:40:22 -0600
Subject: [Beowulf] Re:
In-Reply-To: <1197884030.25232.6.camel@gms2.ethz.ch>
References: <200712152001.lBFK097n019366@bluewest.scyld.com>
	<1197884030.25232.6.camel@gms2.ethz.ch>
Message-ID: <4766DEB6.7060305@gmail.com>

G.M.Sigut wrote:
> On Sat, 2007-12-15 at 12:01 -0800, beowulf-request at beowulf.org wrote:
> ...
>>    2. Re: large array to run (Geoff Jacobs)
> ...
>> Message: 2
>> Date: Fri, 14 Dec 2007 21:46:42 -0600
>> From: Geoff Jacobs <gdjacobs at gmail.com>
>> Subject: Re: [Beowulf] large array to run
>> To: Craig Tierney <ctierney at hypermall.net>
>> Cc: beowulf at beowulf.org, Mark Hahn <hahn at mcmaster.ca>
> ...
>> Craig Tierney wrote:
>> <snippage />
> ...
>> IIRC, the CDC 6600 (and compatibles) had a 48 bit integer format as
>> default. So, yes it does vary. I guess I was rolling the dice a little.
> 
> My memory is telling me, that while the arithmetics on the 6000 series
> might have been done with 48 bits, the memory assignment was in words.
> One integer took one word of 60 bits (i.e. 5 bytes of 12 bits each).
> 
> George
> 

Right you are. Five 12-bit bytes. I was confusing it with the FP coefficient.
Still weird, which was the whole point. Namely, more obscure examples of
hardware and software will have differences in data representation.

-- 
Geoffrey D. Jacobs


From richard.walsh at comcast.net  Mon Dec 17 13:26:13 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 17 Dec 2007 21:26:13 +0000
Subject: [Beowulf] Stream numbers for SiCortex's MIPS based SOC ... 
Message-ID: <121720072126.8135.4766E9750001CC7B00001FC72207021573089C040E99D20B9D0E080C079D@comcast.net>

All,

Has anyone seen STREAM numbers for one or more cores from SiCortex, say on a SiCortex
Catapult system?  The chip has two memory controllers, and I have heard it provides:

"more than 10 Terabytes of bandwidth"

in the largest configuration, but I have not seen any measured memory bandwidth numbers
for this box.  Come to think of it, I have not seen measured numbers for its interconnect
performance either. Sustaining a reasonable ratio of bytes delivered from memory to flops
should be easier on this processor with its lower clock, but it does have 2 cores.  I am 
interested in how it looks compared to Opteron, etc. It is supposed to be a balanced 
design, but it seems there are few measured results available to validate this.

As always your thoughts are appreciated ...

Regards,

rbw 
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

Phone #: 612-382-4620

From richard.walsh at comcast.net  Mon Dec 17 13:54:20 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Mon, 17 Dec 2007 21:54:20 +0000
Subject: [Beowulf] Stream numbers for SiCortex's MIPS based SOC ... 
Message-ID: <121720072154.28414.4766F00C0004A7F800006EFE2207021553089C040E99D20B9D0E080C079D@comcast.net>


-------------- Original message -------------- 
From: richard.walsh at comcast.net

> should be easier on this processor with its lower clock, but it does have 2 cores.  I am 

Ugh! How did I type that ... that should read "6 cores"

Sorry,

rbw 
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

Phone #: 612-382-4620

From peter.st.john at gmail.com  Tue Dec 18 07:17:40 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Tue, 18 Dec 2007 10:17:40 -0500
Subject: [Beowulf] NY times re parallel computing
Message-ID: <e4d4fd070712180717h68172585yc4b048322230bb05@mail.gmail.com>

John Markoff (http://en.wikipedia.org/wiki/John_Markoff) has an item
about parallel computing, particularly Microsoft's effort, in the NY Times:
http://www.nytimes.com/2007/12/17/technology/17chip.html?pagewanted=1&_r=1

Peter

From moloney.brendan at gmail.com  Mon Dec 17 19:03:35 2007
From: moloney.brendan at gmail.com (Brendan Moloney)
Date: Mon, 17 Dec 2007 19:03:35 -0800
Subject: [Beowulf] Help with inconsistent network performance
Message-ID: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>

I have a cluster of 8 Linux machines connected with gigabit
ethernet (full duplex) to a HP Procurve 2848 switch.   I am using the
machines to do interactive distributed rendering.  I have noticed that the
final gather stage (where the intermediate images from the render nodes are
sent back to the viewing node) has "hiccups" in the performance.  These
hiccups occur with as few as two render nodes, and become more common as I
add more render nodes.  With a 512x512 image the final gather usually takes
a few milliseconds for each frame, but when the hiccups occur it is more
like 200+ milliseconds.

Since it is a full duplex switched network, there should not be any
collisions happening.  Since the image is less than 1 MB total, I don't
think I am saturating the switch.  I have checked the contents of
/sbin/ifconfig and there are zero erroneous packets being reported.  At this
point I am really at a loss as to what is causing this.  Any input on things
to check would be greatly appreciated.

Thanks,
Brendan

From smulcahy at aplpi.com  Tue Dec 18 08:31:28 2007
From: smulcahy at aplpi.com (stephen mulcahy)
Date: Tue, 18 Dec 2007 16:31:28 +0000
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
Message-ID: <4767F5E0.5090603@aplpi.com>

Brendan Moloney wrote:
> Any input on things to check would be greatly appreciated.

It might be useful to run some tools like dstat, top/htop, vmstat and 
iostat while performing the rendering and summarise any behaviour which 
coincides with the hiccups.

Have you ganglia on your cluster? You might notice system resource 
spikes which coincide with your hiccups.

-stephen

-- 
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland.  +353.91.751262  http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)


From smulcahy at aplpi.com  Tue Dec 18 08:41:18 2007
From: smulcahy at aplpi.com (stephen mulcahy)
Date: Tue, 18 Dec 2007 16:41:18 +0000
Subject: [Beowulf] Request for comments: diskless cluster
In-Reply-To: <200712101250.03907.bencer@cauterized.net>
References: <200712101250.03907.bencer@cauterized.net>
Message-ID: <4767F82E.2020706@aplpi.com>

Jorge Salamero Sanz wrote:
> Hi all,
> 
> I'm going to move a 42-nodes beowulf to diskless mode (currently all local 
> cloned installations).
> 
> Which system / tools do you recommend to manage the client-images ?
> 
> I was thinking on a debootstraped dir shared as NFS root. The differences 
> between the nodes (/etc/hostname, /etc/fstab, /etc/exportfs ...) could be 
> managed with unionfs.
> 
> Debian has a couple of tools that could help (live-helper for making custom 
> images) but maybe lessdisk would be more suitable. Which one do you use ?
> 
> How do you manage this kind of cluster setup ?

Hi Jorge,

We built a system like this for a customer a few years ago and it has 
performed very well. We have a head-node which acts as a management node 
and an NFS server for the diskless workstations in the cluster.

We used debootstrap to build images for the diskless nodes. We opted to 
keep separate disk images for each diskless node in order to keep things 
simple - I'm sure you could do the same with unionfs or similar but 
diskspace is cheap and the effort to put together something with unionfs 
didn't seem to be justified at the time.

To add a new node, we simply copy the debootstrapped directory contents 
and change the hostname.

Each diskless node uses PXE to boot and a monolithic kernel compiled 
with just the basics needed for the compute nodes. We did some 
experiments with initrd images and modular kernels, but there were some 
issues with Debian which caused us problems (see bugs 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=386959 and 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=388761). These may have 
been fixed since we last looked at it, but again, the effort to fix it, 
given that we had a working system, wasn't justified.

You may need to use a separate network for PXE booting your nodes - we 
experienced problems using a gigabit ethernet network for PXE booting 
where nodes randomly failed to get a response from the DHCP server. I 
suspect there was a bug in the PXE firmware which occasionally caused it 
to fail while the network cards were negotiating gigabit speed (but I 
have no evidence to back this up) - moving PXE booting to a separate 
fast ethernet network resolved the problem.

I'm not familiar with live-helper or lessdisk, perhaps I need to do more 
reading :)

Hope this info is of some use, I've probably only covered some random 
aspects of our config that spring to mind ...

-stephen


-- 
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center,
GMIT, Dublin Rd, Galway, Ireland.  +353.91.751262  http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)


From jlforrest at berkeley.edu  Tue Dec 18 09:21:56 2007
From: jlforrest at berkeley.edu (Jon Forrest)
Date: Tue, 18 Dec 2007 09:21:56 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
Message-ID: <476801B4.3090501@berkeley.edu>

Brendan Moloney wrote:
> Since it is a full duplex switched network, there should not be any 
> collisions happening. 

I have a similar situation with a slightly larger cluster.
At first I also thought it was a network performance
problem. But then I ran the iftop program to watch
the network in realtime and I saw that I wasn't
even close to sending enough data to tax the switch.

I still don't know the cause of the problem but
I'm pretty sure it's not caused by excessive
network traffic. I suggest you try iftop to
see what your program is really doing.

Cordially,
-- 
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlforrest at berkeley.edu


From hahn at mcmaster.ca  Tue Dec 18 09:08:48 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 18 Dec 2007 12:08:48 -0500 (EST)
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712181204480.29614@coffee.psychology.mcmaster.ca>

> final gather stage (where the intermediate images from the render nodes are
> sent back to the viewing node) has "hiccups" in the performance.  These

as perceived how?  do you mean your gather/gui machine pauses?  could it 
be as simple as allocating memory?  (if you do a significant memory
allocation under linux, the memory will be virtual until it's written,
at which time pages will be allocated.  those allocations can trigger
frantic scavenging of memory by the kernel, and this can certainly 
interfere with, for instance, networking, especially at the user level.)

simply pre-zeroing the frames might make the problem go away.
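
A minimal sketch of the pre-zeroing idea, written in Python purely for illustration (the actual renderer is presumably C/C++ with MPI, and the names `FRAME_BYTES` and `make_prefaulted_buffer` are hypothetical). The point is to allocate the receive buffers once, write to every page so the kernel assigns real memory up front, and then reuse them for every frame:

```python
import mmap

FRAME_BYTES = 512 * 512 * 4   # one 512x512 RGBA frame, 8 bits per channel

def make_prefaulted_buffer(size):
    """Allocate a buffer and write one byte per page so each page is touched now."""
    buf = bytearray(size)
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 0   # a write forces the page to be allocated, not just reserved
    return buf

# Allocate once, outside the render loop, and reuse the buffers for every frame.
frame_buffers = [make_prefaulted_buffer(FRAME_BYTES) for _ in range(5)]
```

Whether this removes the hiccups depends on whether page allocation is really the cause; it is cheap to try before blaming the network.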

if you really have some reason to blame the network, I'd go brute-force
and run tcpdump on all nodes to catch a hiccup, and see what's happening.


From landman at scalableinformatics.com  Tue Dec 18 09:32:36 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 18 Dec 2007 12:32:36 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
Message-ID: <47680434.2050504@scalableinformatics.com>

Hi Brendan:

Brendan Moloney wrote:
> I have a cluster of 8 Linux machines connected with gigabit
> ethernet (full duplex) to a HP Procurve 2848 switch.   I am using the
> machines to do interactive distributed rendering.  I have noticed that the
> final gather stage (where the intermediate images from the render nodes are
> sent back to the viewing node) has "hiccups" in the performance.  These

How are they sent?  NFS? Sockets? ...

> hiccups occur with as few as two render nodes, and become more common as I
> add more render nodes.  With a 512x512 image the final gather usually takes
> a few milliseconds for each frame, but when the hiccups occur it is more
> like 200+ milliseconds.

Is this "real time" rendering so that frame rate is the most important 
aspect?

> Since it is a full duplex switched network, there should not be any
> collisions happening.  Since the image is less than 1 MB total, I don't

There could be blocking ...  if one unit grabs the single network pipe 
of the display node while another node tries to send data, then the 
late node will back off (well with TCP it will) in a pre-determined manner.

> think I am saturating the switch.  I have checked the contents of
> /sbin/ifconfig and there are zero erroneous packets being reported.  At this

You wouldn't see it there.  It would be on the switch, and even then it 
wouldn't term it a collision.  It is a switch behaving normally.

> point I am really at a loss as to what is causing this.  Any input on things
> to check would be greatly appreciated.

I assume you have a single gigabit link from the display node to the switch. 
As you scale up the number of render nodes, you notice more of these 
"hiccups" scaling about linearly with the number of nodes.

This suggests resource contention.  Each image would be fragmented into 
units of 175  1500-byte packets.  This assumes 8 bit images.  If you are 
using 8 bits per color, 3 colors and an alpha channel, then this is ~700 
packets.  Each 1500 byte packet takes about 11us to transmit, and has a 
non-trivial latency associated with it.  I will estimate the latency at 
30us (this is switch latency of ~ 5us + network stack latency on each 
side of about 12.5us).  So for each packet, you have about 41us to 
transfer it.   If you have 8 bit images, then this corresponds to 7.2 
ms.  There may be some other caching effects that I am missing, or 
mis-computed.  For 32 bits (3x 8bit color channels + 1 alpha channel), 
this is looking like 28.8 ms for each image.  Best case you could do 
with this is about 34.7 frames per second.

If on the other hand, you used jumbo frames with 9000 byte packets, you 
would need 30 packets to transfer each image.  Each would require 67.1us 
to move, and still 30 us of latency, for 97.1us per packet.  For 30 
packets, this is 2.9ms.  For the 32 bit version as indicated previously 
(3x 8 bit color channels, and one alpha channel) this would be about 
11.6ms.  Or 85.9 frames per second.
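
The arithmetic above can be sketched as follows. This is only a model of the estimate, not a measurement: it charges the full 30us latency to every packet with no pipelining, and it assumes a link rate of 2^30 bits/s, which is the value that reproduces the ~11us and 67.1us per-packet figures. The names `LINK_BITS_PER_S`, `LATENCY_S` and `frame_time` are made up for the example.

```python
import math

LINK_BITS_PER_S = 2 ** 30   # assumption: gives ~11.2 us per 1500 bytes, ~67.1 us per 9000 bytes
LATENCY_S = 30e-6           # per-packet latency estimate (switch + both network stacks)

def frame_time(image_bytes, packet_bytes):
    """Return (packet count, seconds per image) with latency paid once per packet."""
    packets = math.ceil(image_bytes / packet_bytes)
    wire_time = packet_bytes * 8 / LINK_BITS_PER_S
    return packets, packets * (wire_time + LATENCY_S)

for label, image_bytes in (("8-bit", 512 * 512), ("32-bit", 512 * 512 * 4)):
    for mtu in (1500, 9000):
        packets, seconds = frame_time(image_bytes, mtu)
        print(f"{label} image, {mtu}-byte packets: {packets} packets, "
              f"{seconds * 1e3:.1f} ms, {1 / seconds:.1f} frames/s")
```

The packet counts come out as 175, 700 and 30 as in the text. The 32-bit jumbo case needs 117 packets when computed directly rather than 4 x 30, so its time is slightly lower than the 11.6ms quoted.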

Based on this, I would suggest seeing if changing mtu to 9000 helps.

	ifconfig eth0 mtu 9000

on all your nodes (every one).

The argument for this is that you have less latency to pay for, even 
though it takes longer to transfer the payload.

Another possibility is channel bonding on your display node.


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From Michael.Frese at NumerEx.com  Tue Dec 18 10:23:17 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Tue, 18 Dec 2007 11:23:17 -0700
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
Message-ID: <6.2.5.6.2.20071218112023.09c04718@NumerEx.com>

Brendan,

If you are doing this via nfs, you should be sure that mounts are 
done using the tcp parameter in /etc/fstab.  Otherwise you may get 
udp, and I have seen problems with that as recently as Fedora 8 this morning!


Mike

At 08:03 PM 12/17/2007, Brendan Moloney wrote:
>I have a cluster of 8 Linux machines connected with gigabit ethernet 
>(full duplex) to a HP Procurve 2848 switch.   I am using the 
>machines to do interactive distributed rendering.  I have noticed 
>that the final gather stage (where the intermediate images from the 
>render nodes are sent back to the viewing node) has "hiccups" in the 
>performance.  These hiccups occur with as few as two render nodes, 
>and become more common as I add more render nodes.  With a 512x512 
>image the final gather usually takes a few milliseconds for each 
>frame, but when the hiccups occur it is more like 200+ milliseconds.
>
>Since it is a full duplex switched network, there should not be any 
>collisions happening.  Since the image is less than 1 MB total, I 
>don't think I am saturating the switch.  I have checked the contents 
>of /sbin/ifconfig and there are zero erroneous packets being 
>reported.  At this point I am really at a loss as to what is causing 
>this.  Any input on things to check would be greatly appreciated.
>
>Thanks,
>Brendan

From patrick at myri.com  Tue Dec 18 15:21:35 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 18 Dec 2007 18:21:35 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <47680434.2050504@scalableinformatics.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com>
Message-ID: <476855FF.4030902@myri.com>

Hi Joe, Brendan

Joe Landman wrote:
>> Since it is a full duplex switched network, there should not be any
>> collisions happening.  Since the image is less than 1 MB total, I don't
> 
> There could be blocking ...  if one unit grabs the single network pipe 
> of the display node while the another node tries to send data, then the 
> late node will back off (well with TCP it will) in a pre-determined manner.

It definitely looks like natural switch contention (N->1 pattern). 
However, TCP's reaction will depend on how the switch itself handles 
contention. If the hardware flow-control is turned off, packets will be 
dropped in the switch, and TCP will quickly shrink its send window: big 
hiccup. If the hardware flow-control is turned on, the sender NICs will 
be paused and (hopefully) no packets are dropped. TCP will not be aware 
of the backpressure and the send window may even increase a bit because 
of the pausing delay: no big hiccup.

I don't know about the hardware flow-control implementation in the 
Procurve 2848, and it may just be off by default like most Ethernet 
switches. FWIW, there was no working hardware flow-control on the 10GigE 
Procurve switch that I have played with, even when turned on.

Patrick


From landman at scalableinformatics.com  Tue Dec 18 15:27:22 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue, 18 Dec 2007 18:27:22 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <47680434.2050504@scalableinformatics.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com>
Message-ID: <4768575A.1070507@scalableinformatics.com>

As has been pointed out to me offline, my numbers may be a bit more 
pessimistic than needed, in part due to pipelining and other effects.  If my 
numbers were the result of a correct analysis, the most you would be 
able to see from a gigabit link would be about 37 MB/s for 1500 byte 
packets.  This is obviously not the case, so assume this to be a "worst 
case" analysis  (and I am going to go back and review what I seem to 
have dropped from the TCP bits).

Joe

Joe Landman wrote:
> Hi Brendan:
> 
> Brendan Moloney wrote:
>> I have a cluster of 8 Linux machines connected with gigabit
>> ethernet (full duplex) to a HP Procurve 2848 switch.   I am using the
>> machines to do interactive distributed rendering.  I have noticed that 
>> the
>> final gather stage (where the intermediate images from the render 
>> nodes are
>> sent back to the viewing node) has "hiccups" in the performance.  These
> 
> How are they sent?  NFS? Sockets? ...
> 
>> hiccups occur with as few as two render nodes, and become more common 
>> as I
>> add more render nodes.  With a 512x512 image the final gather usually 
>> takes
>> a few milliseconds for each frame, but when the hiccups occur it is more
>> like 200+ milliseconds.
> 
> Is this "real time" rendering so that frame rate isthe most important 
> aspect?
> 
>> Since it is a full duplex switched network, there should not be any
>> collisions happening.  Since the image is less than 1 MB total, I don't
> 
> There could be blocking ...  if one unit grabs the single network pipe 
> of the display node while the another node tries to send data, then the 
> late node will back off (well with TCP it will) in a pre-determined manner.
> 
>> think I am saturating the switch.  I have checked the contents of
>> /sbin/ifconfig and there are zero erroneous packets being reported.  
>> At this
> 
> You wouldn't see it there.  It would be on the switch, and even then it 
> wouldn't term it a collision.  It is a switch behaving normally.
> 
>> point I am really at a loss as to what is causing this.  Any input on 
>> things
>> to check would be greatly appreciated.
> 
> I assume you have a single gigabit from the display node to the switch. 
>  As you scale up the number of render nodes, you notice more of these 
> "hiccups" scaling about linearly with the number of nodes.
> 
> This suggests resource contention.  Each image would be fragmented into 
> units of 175  1500-byte packets.  This assumes 8 bit images.  If you are 
> using 8 bits per color, 3 colors and an alpha channel, then this is ~700 
> packets.  Each 1500 byte packet takes about 11us to transmit, and has a 
> non-trivial latency associated with it.  I will estimate the latency at 
> 30us (this is switch latency of ~ 5us + network stack latency on each 
> side of about 12.5us).  So for each packet, you have about 41us to 
> transfer it.   If you have 8 bit images, then this corresponds to 7.2 
> ms.  There may be some other caching effects that I am missing, or 
> mis-computed.  For 32 bits (3x 8bit color channels + 1 alpha channel), 
> this is looking like 28.8 ms for each image.  Best case you could do 
> with this is about 34.7 frames per second.
> 
> If on the other hand, you used jumbo frames with 9000 byte packets, you 
> would need 30 to transfer each image, which would require 67.1us to 
> move, and still 30 us of latency, for 97.1us per packet.  For 30 
> packets, this is 2.9ms.  For the 32 bit version as indicated previously 
> (3x 8 bit color channels, and one alpha channel) this would be about 
> 11.6ms.  Or 85.9 frames per second.
> 
> Based on this, I would suggest seeing if changing mtu to 9000 helps.
> 
>     ifconfig eth0 mtu 9000
> 
> on all your nodes (every one).
> 
> The argument for this is that you have less latency to pay for, even 
> though it takes longer to transfer the payload.
> 
> Another possibility is channel bonding on your display node.
> 
> 


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From lindahl at pbm.com  Tue Dec 18 15:41:17 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 18 Dec 2007 15:41:17 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <476855FF.4030902@myri.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com>
	<476855FF.4030902@myri.com>
Message-ID: <20071218234117.GA29318@bx9.net>

On Tue, Dec 18, 2007 at 06:21:35PM -0500, Patrick Geoffray wrote:

> I don't know about the hardware flow-control implementation in the 
> Procurve 2848, and it may just be off by default like most Ethernet 
> switches. FWIW, there was no working hardware flow-control on the 10GigE 
> Procurve switch that I have played with, even when turned on.

If I do

ethtool -a eth0

and it says RX/TX pause are on, doesn't that mean that the switch
supports it? And ethtool -S eth0 will show if you've actually
had some pauses or flow-control events.

My dumb Netgear 24-port 1-gig switch supports hw flow control.  Sounds
like things are a bit more difficult with low-end 10gigE ports.

-- greg


From patrick at myri.com  Tue Dec 18 18:05:41 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Tue, 18 Dec 2007 21:05:41 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <20071218234117.GA29318@bx9.net>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>	<47680434.2050504@scalableinformatics.com>	<476855FF.4030902@myri.com>
	<20071218234117.GA29318@bx9.net>
Message-ID: <47687C75.9090907@myri.com>

Hi Greg,

Greg Lindahl wrote:
> ethtool -a eth0
> 
> and it says RX/TX pause are on, doesn't that mean that the switch
> supports it?

No, it just means the NIC supports it. RX means that the NIC will send 
PAUSE packets if the host does not consume fast enough (rare) and TX 
means that the NIC will stop sending when receiving a PAUSE packet (more 
likely). It's independent of the switch flow control settings.
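
A small sketch for checking those two flags across many nodes, assuming the usual `ethtool -a` text layout of one `Name: on/off` pair per line. The sample text is illustrative and was not captured from a real NIC, and `parse_pause_params` is a hypothetical helper. As noted above, the result only describes the NIC; it says nothing about the switch.

```python
def parse_pause_params(text):
    """Parse `ethtool -a <iface>` output into a dict such as {'rx': True, 'tx': False}."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only "Name: on" / "Name: off" lines; the header line has no value.
        if sep and value in ("on", "off"):
            params[key.strip().lower()] = (value == "on")
    return params

# Illustrative sample only; real output would come from running `ethtool -a eth0`.
SAMPLE = """Pause parameters for eth0:
Autonegotiate:\ton
RX:\t\ton
TX:\t\toff
"""

pause = parse_pause_params(SAMPLE)
print(pause)
```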

> My dumb Netgear 24-port 1-gig switch supports hw flow control.  Sounds
> like things are a bit more difficult with low-end 10gigE ports.

For RX hardware flow-control, you need enough buffer space to keep one 
full frame plus the latency on the longest wire, for every port. It is a 
bit more expensive to do with 10GigE, because you need faster memory and 
more of it. Some recent 10GigE chips use a shared SRAM buffer that is 
not big enough for the worst case with 9K packets: it works fine as long 
as a few ports are blocked, then it happily collapses and drops packets.

Flow-control is not for everyone, and that's why it is often turned off 
by default. When a sender is paused, it will stop sending anything, 
including packets for different destinations. Dropping packets is 
expensive to recover from, but it keeps things moving.

Patrick


From moloney.brendan at gmail.com  Tue Dec 18 20:14:49 2007
From: moloney.brendan at gmail.com (Brendan Moloney)
Date: Tue, 18 Dec 2007 20:14:49 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <47687C75.9090907@myri.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com>
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com>
Message-ID: <204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>

Ok guys, thanks for all the feedback.

I guess I should have provided some more specific details.  I am using
sockets with TCP/IP for the final gather stage.  I am doing real-time
(volume) rendering.  The images are 32-bit (RGBA with 8 bits per channel).
 The machines are running the 2.6 kernel and I have confirmed that the max
TCP send/recv buffer sizes are 4MB (more than enough to store the full
512x512 image).

I wrote two simple test programs to make sure that it was not something else
in my rather complex rendering pipeline (memory allocation etc.).  The
server side test program launches N nodes using mpich2, each of which
establishes a connection to the view client with a socket over TCP/IP. Then
I loop with the client side program sending a single integer to rank 0, then
rank 0 broadcasts this integer to the other nodes, and then all nodes send
back 1MB / N of data.

To make sure there was not an issue with the MPI broadcast, I did one test
run with 5 nodes only sending back 4 bytes of data each.  The result was a
RTT of less than 0.3 ms. Next I did a run with one node sending 1 MB back to
the client, the result was an RTT of less than 12ms.  Letting the test run
in a loop I saw that the first ~100 packets were a bit slower (~16 ms) and
then not a single packet took longer than 14 ms.  So the performance was
very consistent, as expected for a single node. Then I did a run with two
nodes sending back 1/2 MB each, the result was an RTT of ~16 ms on frames
without a hiccup.  About 0.2% of the frames were hiccups. On a run with 3
nodes sending back 1/3 MB each I got an RTT of ~19-20 ms and again about
0.2% of the frames were hiccups.
With 4 nodes sending 1/4 MB each I got an RTT of ~20-21 ms and about
3.5% of the frames were hiccups.
Finally with 5 nodes sending 1/5 MB each I got an RTT of ~21ms and about
13.5% of the frames were hiccups.  I could not test on more nodes as the
other computers were in use by other people.

One interesting pattern I noticed is that the hiccup frame RTTs, almost
without exception, fall into one of three ranges (approximately 50-60 ms,
200-210 ms, and 250-260 ms). Could this be related to exponential back-off?

Tomorrow I will experiment with jumbo frames and flow control settings (both
of which the HP Procurve claims to support).  If these do not solve the
problems I will start sifting through tcpdump.

Thanks,
Brendan

From hahn at mcmaster.ca  Tue Dec 18 20:52:25 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Tue, 18 Dec 2007 23:52:25 -0500 (EST)
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com>
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com>
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>

> The machines are running the 2.6 kernel and I have confirmed that the max
> TCP send/recv buffer sizes are 4MB (more than enough to store the full
> 512x512 image).

the bandwidth-delay product in a lan is low enough to not need 
this kind of tuning.

> I loop with the client side program sending a single integer to rank 0, then
> rank 0 broadcasts this integer to the other nodes, and then all nodes send
> back 1MB / N of data.

hmm, that's a bit harsh, don't you think?  why not have the rank0/master
ask each slave for its contribution sequentially?  sure, it introduces a bit
of "dead air", but it's not as if two slaves can stream to a single master 
at once anyway (each can saturate its link, therefore the master's link is 
N-times overcommitted.)

> To make sure there was not an issue with the MPI broadcast, I did one test
> run with 5 nodes only sending back 4 bytes of data each.  The result was a
> RTT of less than 0.3 ms.

isn't that kind of high?  a single ping-pong latency should be ~50 us - 
maybe I'm underestimating the latency of the broadcast itself.

> One interesting pattern I noticed is that the hiccup frame RTTs, almost
> without exception, fall into one of three ranges (approximately 50-60,
> 200-210, and 250-260). Could this be related to exponential back-off?

perhaps introduced by the switch, or perhaps by the fact that the bcast
isn't implemented as an atomic (eth-level) broadcast.

> Tommorow I will experiment with jumbo frames and flow control settings (both
> of which the HP Procurve claims to support).  If these do not solve the
> problems I will start sifting through tcpdump.

I would simply serialize the slaves' responses first.  the current design
tries to trigger all the slaves to send results at once, which is simply
not logical if you think about it, since any one slave can saturate
the master's link.
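
a rough sketch of the serialized gather, in Python with local socket pairs and threads standing in for the render nodes (an assumption made so the example runs on one machine; the real setup uses TCP sockets between hosts, and `CHUNK`, `worker` and `gather_serialized` are made-up names). the master sends a one-byte token to one worker, reads that worker's whole chunk, and only then polls the next one:

```python
import socket
import threading

CHUNK = 64 * 1024   # bytes each worker contributes (arbitrary size for the demo)

def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed before sending all data")
        buf.extend(part)
    return bytes(buf)

def worker(sock, rank):
    """Wait for the master's one-byte token, then send this rank's chunk."""
    recv_exact(sock, 1)
    sock.sendall(bytes([rank]) * CHUNK)
    sock.close()

def gather_serialized(n_workers):
    """Poll workers one at a time so only one sender uses the master's link at once."""
    pairs = [socket.socketpair() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(w, rank))
               for rank, (_, w) in enumerate(pairs)]
    for t in threads:
        t.start()
    frame = bytearray()
    for master_side, _ in pairs:
        master_side.sendall(b"g")                      # ask this worker for its part
        frame.extend(recv_exact(master_side, CHUNK))   # drain it before polling the next
        master_side.close()
    for t in threads:
        t.join()
    return bytes(frame)

frame = gather_serialized(3)
```

the cost is one extra small round trip per worker, which is the "dead air" mentioned above.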

regards, mark hahn.


From moloney.brendan at gmail.com  Tue Dec 18 21:40:48 2007
From: moloney.brendan at gmail.com (Brendan Moloney)
Date: Tue, 18 Dec 2007 21:40:48 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com>
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com>
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
Message-ID: <204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>

On 12/18/07, Mark Hahn <hahn at mcmaster.ca > wrote:
>
> > The machines are running the 2.6 kernel and I have confirmed that the
> max
> > TCP send/recv buffer sizes are 4MB (more than enough to store the full
> > 512x512 image).
>
> the bandwidth-delay product in a lan is low enough to not need
> this kind of tuning.


I didn't actually do any tuning, I just checked the max buffer size that the
linux auto-tuning can use is sufficient.

> I loop with the client side program sending a single integer to rank 0,
> then
> > rank 0 broadcasts this integer to the other nodes, and then all nodes
> send
> > back 1MB / N of data.
>
> hmm, that's a bit harsh, don't you think?  why not have the rank0/master
> ask each slave for its contribution sequentially?  sure, it introduces a
> bit
> of "dead air", but it's not as if two slaves can stream to a single master
> at once anyway (each can saturate its link, therefore the master's link is
>
> N-times overcommitted.)


I guess I figured that the data is relatively small compared to the
bandwidth, whereas the latency for ethernet is relatively high.  I also
thought the switch would be able to
efficiently buffer and forward the data.  I am not much of a
networking guy (more a graphics guy) so I realize I could be way off
base here.


> To make sure there was not an issue with the MPI broadcast, I did one test
> > run with 5 nodes only sending back 4 bytes of data each.  The result was
> a
> > RTT of less than 0.3 ms.
>
> isn't that kind of high?  a single ping-pong latency should be ~50 us -
> maybe I'm underestimating the latency of the broadcast itself.


This is quite a bit more than a single ping-pong. The viewer sends to the
master node (rank 0), and then the master node broadcasts to all other
nodes, and then all nodes send back to the viewer node.  I don't know if
this still seems high?


> One interesting pattern I noticed is that the hiccup frame RTTs, almost
> > without exception, fall into one of three ranges (approximately 50-60,
> > 200-210, and 250-260). Could this be related to exponential back-off?
>
> perhaps introduced by the switch, or perhaps by the fact that the bcast
> isn't implemented as an atomic (eth-level) broadcast.
>

But the bcast is always just sending 4 bytes (a single integer), and as
mentioned above no hiccups occur until the size of the final gather packets
(from all nodes to the viewer) is increased.


>
> > Tomorrow I will experiment with jumbo frames and flow control settings
> (both
> > of which the HP Procurve claims to support).  If these do not solve the
> > problems I will start sifting through tcpdump.
>
> I would simply serialize the slaves' responses first.  the current design
> tries to trigger all the slaves to send results at once, which is simply
> not logical if you think about it, since any one slave can saturate
> the master's link.
>

I still have the feeling that the switch should be able to handle this more
efficiently, but since your idea is relatively simple to implement I will
give it a try and see what the performance is like.

Thanks for your input.


>
> regards, mark hahn.
>

From lindahl at pbm.com  Tue Dec 18 21:50:53 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Tue, 18 Dec 2007 21:50:53 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <47687C75.9090907@myri.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com>
	<476855FF.4030902@myri.com> <20071218234117.GA29318@bx9.net>
	<47687C75.9090907@myri.com>
Message-ID: <20071219055052.GA8536@bx9.net>

On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:

> No, it just means the NIC supports it.

Well, then how about ethtool -S? That looks like an actual count of
flow control events, so rx flow control events means the switch
must support it in some fashion.

> For RX hardware flow-control, you need enough buffer space to keep one 
> full frame plus the latency on the longest wire, for every port. It is a 
> bit more expensive to do with 10GigE, because you need faster memory and 
> more of it. Some recent 10GigE chips use a shared SRAM buffer that is 
> not big enough for the worst case with 9K packets:

Well, we know it can be done perfectly, it's done in InfiniBand
switches, and that other 10 gig non-ethernet switch, what's it called?
Oh yeah, Myrinet. They do it, too.

> Flow-control is not for everyone, and that's why it is often turned off 
> by default. When a sender is paused, it will stop sending anything, 
> including packets for different destinations. Dropping packets is 
> expensive to recover but it keeps things moving.

Can Myrinet even disable flow control? Odd that Ethernet is any
different; dropping any packets is an utter disaster for TCP.

-- greg


From hahn at mcmaster.ca  Tue Dec 18 21:55:51 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 19 Dec 2007 00:55:51 -0500 (EST)
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com> 
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com> 
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com> 
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com> 
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>

> I guess I figured that the data is relatively small compared to the
> bandwidth,

I agree, in principle.  and relatively small compared to the amount of ram
in the switch as well.

> whereas the latency for ethernet is relatively high.  I also

not _that_ high, though.  with a little tuning (coalesce parameters),
I think 30-40 us half-rtt is pretty common, even over a normal 
tcp stack.  yes, that's 2+ 1.5k packets, but it's not _that_ much
compared to 1M images.

>> To make sure there was not an issue with the MPI broadcast, I did one test
>>> run with 5 nodes only sending back 4 bytes of data each.  The result was
>> a
>>> RTT of less than 0.3 ms.
>>
>> isn't that kind of high?  a single ping-pong latency should be ~50 us -
>> maybe I'm underestimating the latency of the broadcast itself.
>
>
> This is quite a bit more than a single ping-pong. The viewer sends to the
> master node (rank 0), and then the master node broadcasts to all other
> nodes, and then all nodes send back to the viewer node.  I don't know if
> this still seems high?

the first message should take <50 us.  the broadcast to 5 nodes should 
take 2-3 more 50 us times.  so at about 200 us, all the slaves will start
the DOS attack on the viewer node's nic...
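
The estimate above can be reproduced with a short calculation. This is a sketch that assumes the broadcast is a binomial tree of point-to-point sends (real MPI implementations may pick a different algorithm) and uses 50 us per hop as the upper bound quoted above.

```python
# Rough timeline: viewer -> rank 0 (one hop), then a tree broadcast.
# In a binomial tree the number of ranks holding the message doubles each
# round, so reaching P ranks takes ceil(log2(P)) rounds.

def bcast_rounds(num_ranks):
    """Rounds for a binomial-tree broadcast to reach num_ranks (root included)."""
    return (num_ranks - 1).bit_length()

def replies_start_us(num_ranks, hop_us=50):
    """Microseconds until the last rank has the request and starts replying."""
    return hop_us + bcast_rounds(num_ranks) * hop_us

for ranks in (4, 5, 6, 8):
    print("%d ranks: %d rounds, replies start at ~%d us"
          % (ranks, bcast_rounds(ranks), replies_start_us(ranks)))
```

For 5 to 8 ranks this gives 3 rounds and about 200 us, and for 4 ranks 2 rounds and about 150 us, which matches the "2-3 more 50 us times" figure.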

> But the bcast is always just sending 4 bytes (a single integer), and as

no, afaik no mpi implementations actually utilize the eth-level bcast,
but rather implement bcast as a tree of (uni) sends.


From moloney.brendan at gmail.com  Tue Dec 18 22:24:39 2007
From: moloney.brendan at gmail.com (Brendan Moloney)
Date: Tue, 18 Dec 2007 22:24:39 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com>
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com>
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>
	<Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>
Message-ID: <204b3d180712182224o362a564dgfa490ba624866992@mail.gmail.com>

> the first message should take <50 us.  the broadcast to 5 nodes should
> take 2-3 more 50 us times.  so at about 200 us, all the slaves will start
> the DOS attack on the viewer node's nic...
>

I am not sure why you compare this to a DOS attack.  The same amount of data
(and roughly the same amount of packets) should be arriving at the viewer
node.  Yes it is stressing the switch more, but this switch should be able
to handle much more traffic than this.


>
> > But the bcast is always just sending 4 bytes (a single integer), and as
>
> no, afaik no mpi implementations actually utilize the eth-level bcast,
> but rather implement bcast as a tree of (uni) sends.


I realize this.  I was just pointing out that the amount of data I am
broadcasting is always 4 bytes.  Since I saw no hiccups when the final
gather packets were only 4 bytes, but I do when the final gather packets are
1MB / N -- then the hiccups must be coming from the final gather and not the
broadcast.

From patrick at myri.com  Wed Dec 19 00:26:46 2007
From: patrick at myri.com (Patrick Geoffray)
Date: Wed, 19 Dec 2007 03:26:46 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <20071219055052.GA8536@bx9.net>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<47680434.2050504@scalableinformatics.com>
	<476855FF.4030902@myri.com> <20071218234117.GA29318@bx9.net>
	<47687C75.9090907@myri.com> <20071219055052.GA8536@bx9.net>
Message-ID: <4768D5C6.8010600@myri.com>

Greg Lindahl wrote:
> On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:
> 
>> No, it just means the NIC supports it.
> 
> Well, then how about ethtool -S? That looks like an actual count of
> flow control events, so rx flow control events means the switch
> must support it in some fashion.

If this counter is not null, then you can say the switch does support RX 
flow control, which is the most important. However, the NIC driver may 
not report these events to ethtool, and you eventually need to generate 
some contention in the switch. A simple test is to run a simple MPI code 
where several senders stream to a single receiver. If you see a
cumulated bandwidth equal to the receiver link bandwidth, then flow 
control works. If you see that all senders have the same bandwidth, then 
the switch is fair on top of that.
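
One way to read the results of such a test is sketched below. The function name, the 10% tolerance, and the sample bandwidths are all hypothetical; the two criteria (aggregate equal to the receiver link, and equal shares per sender) are the ones described above.

```python
# Interpret an N -> 1 streaming test from per-sender bandwidth measurements.

def assess_incast(sender_mbps, link_mbps, tolerance=0.10):
    """Return (flow_control_ok, fair) for measured per-sender bandwidths.

    flow_control_ok: the senders together fill the receiver's link.
    fair: every sender is within `tolerance` of the mean share.
    """
    if not sender_mbps:
        raise ValueError("need at least one sender measurement")
    total = sum(sender_mbps)
    flow_control_ok = total >= (1.0 - tolerance) * link_mbps
    mean = total / len(sender_mbps)
    fair = all(abs(b - mean) <= tolerance * mean for b in sender_mbps)
    return flow_control_ok, fair

# Hypothetical measurements in Mb/s against a ~941 Mb/s usable link.
print(assess_incast([235, 230, 240, 236], 941))   # fills the link, even shares
print(assess_incast([600, 300, 41], 941))         # fills the link, uneven shares
print(assess_incast([120, 90, 60, 30], 941))      # far below the link rate
```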

> Well, we know it can be done perfectly, it's done in InfiniBand
> switches, and that other 10 gig non-ethernet switch, what's it called?
> Oh yeah, Myrinet. They do it, too.

In Ethernet, the sender has to finish sending the current packet before 
stopping, so your switch buffers should be able to store a full frame 
in addition to the wire delay. In Myrinet (and I presume in IB), the 
hardware flow control can stop a sender in the middle of a packet, so 
you only have to buffer the wire delay. It's 4 KB per port versus 12 
to 16 KB per port. Not trivial and some corners may be cut to save 
space/money in the switch chips.
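
The per-port figures can be sanity-checked with simple arithmetic. This sketch reads the 4 KB figure as the wire-delay allowance alone and takes 9000 bytes as the jumbo frame size; both readings are assumptions, and frame headers are ignored.

```python
# Per-port buffer needed to honour hardware flow control.
# If the sender can be stopped mid-packet, only the bytes already on the
# wire must be absorbed; otherwise add one full frame on top of that.

KB = 1024
WIRE_DELAY_BYTES = 4 * KB    # assumed reading of the "4 KB per port" figure

def pause_buffer_bytes(max_frame_bytes, wire_delay_bytes=WIRE_DELAY_BYTES,
                       can_stop_mid_packet=False):
    """Return the buffer (bytes) a port needs to avoid dropping after a pause."""
    if can_stop_mid_packet:
        return wire_delay_bytes
    return wire_delay_bytes + max_frame_bytes

print("stop mid-packet:      %5d bytes" % pause_buffer_bytes(9000, can_stop_mid_packet=True))
print("Ethernet, 1500 bytes: %5d bytes" % pause_buffer_bytes(1500))
print("Ethernet, 9000 bytes: %5d bytes" % pause_buffer_bytes(9000))
```

With 9000-byte frames the Ethernet case comes to about 12.8 KB, inside the 12 to 16 KB range quoted above.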

>> Flow-control is not for everyone, and that's why it is often turned off 
>> by default. When a sender is paused, it will stop sending anything, 
>> including packets for different destinations. Dropping packets is 
>> expensive to recover but it keeps things moving.
> 
> Can Myrinet even disable flow control? Odd that Ethernet is any
> different; dropping any packets is an utter disaster for TCP.

I think it's technically possible to disable flow control in the switch 
crossbars in Myrinet, but you would not want to. The NICs can change 
routes quickly when they sense contention on a specific path (Quadrics 
does the same thing, others can't). That helps a lot for internal hot 
spots that are frequent in HPC, but it does nothing against the N->1 
communication pattern of death. As Mark pointed out, the best way around 
it is to not have it in the first place.

Ethernet switches are often used in more hostile environments where you 
can not prevent such N->1 traffic: I could flood a particular machine on 
a campus from a couple of hosts to produce contention, that would 
saturate some internal links in the switch that would propagate the 
contention to other ports, more links are blocked, etc. If you can 
sustain the contention a few seconds on a busy switch, then you can 
block the whole thing, complete meltdown.

That's why high-end switch/routers are super expensive, they are way 
over-dimensioned inside to be able to handle contentions. That's also 
why the FCoE folks are pushing for per-priority flow-control in 
Ethernet, so that untrusted/misbehaving traffic can be dropped to not 
affect trusted/important FCoE traffic that should not be dropped. And 
that's why switch flow-control is turned off by default most of the time.

Patrick


From hahn at mcmaster.ca  Wed Dec 19 06:09:37 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 19 Dec 2007 09:09:37 -0500 (EST)
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <204b3d180712182224o362a564dgfa490ba624866992@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com> 
	<47680434.2050504@scalableinformatics.com> <476855FF.4030902@myri.com> 
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com> 
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com> 
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca> 
	<204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com> 
	<Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182224o362a564dgfa490ba624866992@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712190856080.6542@coffee.psychology.mcmaster.ca>

>> the first message should take <50 us.  the broadcast to 5 nodes should
>> take 2-3 more 50 us times.  so at about 200 us, all the slaves will start
>> the DOS attack on the viewer node's nic...
>
> I am not sure why you compare this to a DOS attack.  The same amount of data
> (and roughly the same amount of packets) should be arriving at the viewer
> node.  Yes it is stressing the switch more, but this switch should be able
> to handle much more traffic than this.

it's the _timing_ of the data.  using bcast, you attempt to cause the 
render nodes to, as simultaneously as possible, saturate their own
links, and therefore (N-1)-times oversaturate the viewer link.
it's exactly what you'd do if you wanted to provoke the switch to see
how it deals with congestion.
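
A rough sketch of how much data piles up in the switch in that case. The model is idealized and its assumptions are not from the thread: all links run at the same rate, every render node starts at the same instant and sends at line rate (TCP slow start is ignored), and the switch drops nothing.

```python
# N senders each push total/N bytes at line rate R toward one port that
# drains at R.  Data arrives at N*R for total/(N*R) seconds while leaving
# at R, so the backlog grows at (N-1)*R and peaks at total * (N-1) / N.

def peak_queue_bytes(total_bytes, n_senders):
    """Peak backlog (bytes) at the receiver's switch port in the ideal model."""
    return total_bytes * (n_senders - 1) / n_senders

ONE_MB = 1024 * 1024
for n in (1, 2, 5, 7):
    print("%d senders: peak queue ~%.0f KB" % (n, peak_queue_bytes(ONE_MB, n) / 1024))
```

With 5 senders and a 1MB frame, roughly 0.8 MB would have to sit in the switch (or be dropped, or be held back by flow control), however small the total looks next to the link bandwidth.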

some form of credit or backpressure-based flow control would solve 
the problem entirely, but ethernet doesn't have that.  pause frames
might well solve the problem, but since it's not universally implemented,
I would guess it doesn't work that well.  normal TCP flow-control 
(switch drops packet(s), sender eventually notices lack of ack, etc)
will work, but is probably too aggressive in backing off.  do you happen
to know which TCP version your kernel is implementing (cubic? probably
listed in the boot messages or in /proc/sys/net/ipv4/).  it's hard to 
find a TCP congestion algorithm that handles both lan and wan rates 
sensibly...
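
A small sketch for checking the congestion algorithm from a script. It reads /proc/sys/net/ipv4/tcp_congestion_control, which Linux kernels with pluggable congestion control expose; the function name is made up for illustration, and on systems without that file it just returns None.

```python
# Report the TCP congestion-control algorithm currently selected by the kernel.

def tcp_congestion_control(path="/proc/sys/net/ipv4/tcp_congestion_control"):
    """Return the algorithm name (for example 'cubic' or 'reno'), or None."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        # Not Linux, or the kernel does not expose this setting.
        return None

print("congestion control:", tcp_congestion_control())
```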

> 1MB / N -- then the hiccups must be coming from the final gather and not the
> broadcast.

yes, that's the part I'm calling the DOS ;)


From gerry.creager at tamu.edu  Wed Dec 19 06:20:00 2007
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Wed, 19 Dec 2007 08:20:00 -0600
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <476801B4.3090501@berkeley.edu>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<476801B4.3090501@berkeley.edu>
Message-ID: <47692890.5080400@tamu.edu>

One consideration is the size of the messages being exchanged.  Even 
today, small packets can markedly reduce switch performance.  RFC 2544 
compliance is not universal in the Layer 2 world.

gerry

Jon Forrest wrote:
> Brendan Moloney wrote:
>> Since it is a full duplex switched network, there should not be any 
>> collisions happening. 
> 
> I have a similar situation with a slightly larger cluster.
> At first I also thought it was a network performance
> problem. But then I ran the iftop program to watch
> the network in realtime and I saw that I wasn't
> even close to sending enough data to tax the switch.
> 
> I still don't know the cause of the problem but
> I'm pretty sure it's not caused by excessive
> network traffic. I suggest you try iftop to
> see what your program is really doing.
> 
> Cordially,

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


From peter.st.john at gmail.com  Wed Dec 19 08:28:50 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Wed, 19 Dec 2007 11:28:50 -0500
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <Pine.LNX.4.64.0712190856080.6542@coffee.psychology.mcmaster.ca>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<476855FF.4030902@myri.com> <20071218234117.GA29318@bx9.net>
	<47687C75.9090907@myri.com>
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>
	<Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182224o362a564dgfa490ba624866992@mail.gmail.com>
	<Pine.LNX.4.64.0712190856080.6542@coffee.psychology.mcmaster.ca>
Message-ID: <e4d4fd070712190828x6f5f424dk89317544a1b0a6ae@mail.gmail.com>

Brendan,

I'm a day late but maybe not a dollar short :-) When I read the original
question, I was going to ask, "do the compute (render) nodes push their
results when ready, or does the head (view) node pull?" and from the
subsequent discussion and clarifications it seems to be the former. And yeah
what Mark said.

So if it were me, each compute node would send the (short) message, "I'm
ready"; the head node would maintain a list of ready nodes, and pull from
them sequentially ("Ok node number 7, upload now please"). That way the only
collision is word-sized and not image-sized, and the overhead is trivial.
But that would be me using FTP and tcsh :-)  Dunno what you'd do
specifically with the software you have.
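
The ready-list idea can be sketched in a few lines. This is a toy model, not MPI code: threads stand in for compute nodes, queues stand in for the network, and every name in it is hypothetical.

```python
# Pull-style gather: nodes announce "I'm ready" with a tiny message, and the
# head node asks exactly one node at a time to upload its result.
import queue
import threading

def run_pull_gather(num_nodes, chunk_bytes=1024):
    ready = queue.Queue()                                # short "I'm ready" messages
    go = [threading.Event() for _ in range(num_nodes)]   # "upload now please"
    uploads = queue.Queue()                              # the large transfers

    def compute_node(rank):
        data = bytes([rank % 256]) * chunk_bytes   # stand-in for a rendered tile
        ready.put(rank)                            # word-sized message only
        go[rank].wait()                            # hold the big send until asked
        uploads.put((rank, data))

    threads = [threading.Thread(target=compute_node, args=(r,))
               for r in range(num_nodes)]
    for t in threads:
        t.start()

    results, order = {}, []
    for _ in range(num_nodes):
        rank = ready.get()              # next node on the ready list
        go[rank].set()                  # pull from that node only
        sender, data = uploads.get()    # no other node may be uploading now
        assert sender == rank
        results[sender] = data
        order.append(sender)

    for t in threads:
        t.join()
    return results, order

results, order = run_pull_gather(4)
print("upload order:", order)
```

Only the short ready messages can collide; the large uploads arrive one at a time, in whatever order the nodes finished.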

Peter


> (Hahn)
> it's the _timing_ of the data.  using bcast, you attempt to cause the
> render nodes to, as simultaneously as possible, saturate their own
> links, and therefore (N-1)-times oversaturate the viewer link.
>

From jiteshbdundas at gmail.com  Tue Dec 18 09:01:09 2007
From: jiteshbdundas at gmail.com (jitesh dundas)
Date: Tue, 18 Dec 2007 22:31:09 +0530
Subject: [Beowulf] DOcumentation
Message-ID: <a609fe70712180901s28ce6823ied726d3868162384@mail.gmail.com>

Dear All,

Can I get the documentation for the Beowulf project release?
I wish to actively involve myself in this project.

If any help is needed, please do tell me.

Regards,
Jitesh

From forum.san at gmail.com  Wed Dec 19 00:05:26 2007
From: forum.san at gmail.com (Sangamesh B)
Date: Wed, 19 Dec 2007 13:35:26 +0530
Subject: [Beowulf] Amber 8 Execution problem
Message-ID: <cb60cbc40712190005o59692763v192444db31001012@mail.gmail.com>

Hi All,

I installed AMBER8 on Opteron Dual core, dual processor, Rocks cluster with
the following options/libraries:

Compiler: Intel 9 Fortran and C++ compilers.
Blas Library: MKL8
MPI: MPICH2 compiled with Intel compiler.

During make serial, I got an error for the X Toolkit library (libXt.so). Then found
this lib's availability at /usr/lib64/. But the makefile of
amber8/src/leap/src/leap/ has /usr/lib -lXt .. . I changed it to /usr/lib64

After this I built both serial and parallel AMBER executables successfully.

But when I execute with the following:

mpirun -np 4  /usr/bin/numactl -c0-1 $AMBERHOME/amber8/exe/sander ...input
file parameters

Note: Numactl is used to bind the processes to particular processors for
better performance.

it's giving:

symbol lookup error:       undefined symbol: __intel_cpu_indicator

I guessed, this might relate to MKL library.

During 'make' it was taking MKL lib*.so files from $MKL_HOME/lib/32 (in
config.h file).

Changed $MKL_HOME/lib/32 to $MKL_HOME/lib/64 since the arch is AMD64
opteron. But it is giving incompatible -libvml.so file.

I'm not getting why this error is coming.

Any help for this will be appreciated.

Thanks & Regards,
Sangamesh

From peter.st.john at gmail.com  Wed Dec 19 11:06:47 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Wed, 19 Dec 2007 14:06:47 -0500
Subject: [Beowulf] DOcumentation
In-Reply-To: <a609fe70712180901s28ce6823ied726d3868162384@mail.gmail.com>
References: <a609fe70712180901s28ce6823ied726d3868162384@mail.gmail.com>
Message-ID: <e4d4fd070712191106h10f9d9f7vd8a3ba1721a99780@mail.gmail.com>

Hello Jitesh,
There is no single formal Beowulf Project, but many projects you can help.
 I'd start by reading the wiki entry,
http://en.wikipedia.org/wiki/Beowulf_%28computing%29  then skim over the 496
"clustering" projects  at SourceForge
http://sourceforge.net/softwaremap/trove_list.php?form_cat=141 . Those are
all open-source, user-contributed; you can join any of those projects.
Peter

On Dec 18, 2007 12:01 PM, jitesh dundas <jiteshbdundas at gmail.com> wrote:

> Dear All,
>
> Can i get the documentation of the Beowolf project release.
> I wish to actively involve myself in this project.
>
> If Any help needed, plz do tell me.
>
> Regrds,
> JItesh

From tom.elken at qlogic.com  Wed Dec 19 11:20:06 2007
From: tom.elken at qlogic.com (Tom Elken)
Date: Wed, 19 Dec 2007 11:20:06 -0800
Subject: [Beowulf] Amber 8 Execution problem
In-Reply-To: <cb60cbc40712190005o59692763v192444db31001012@mail.gmail.com>
References: <cb60cbc40712190005o59692763v192444db31001012@mail.gmail.com>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A018F8505@AVEXCH1.qlogic.org>

 ________________________________

	From: beowulf-bounces at beowulf.org
[mailto:beowulf-bounces at beowulf.org] On Behalf Of Sangamesh B
	Sent: Wednesday, December 19, 2007 12:05 AM
	To: Beowulf ML
	Subject: [Beowulf] Amber 8 Execution problem

Hi Sangamesh,

Sounds like a linking problem rather than execution ...	
	
	Hi All,
	
	I installed AMBER8 on Opteron Dual core, dual processor, Rocks
cluster with the following options/libraries: 
	
	Compiler: Intel 9 Fortran and C++ compilers.
	Blas Library: MKL8
	MPI: MPICH2 compiled with Intel compiler.
	
	
	its giving: 
	
	symbol lookup error:       undefined symbol:
__intel_cpu_indicator
	
	I guessed, this might relate to MKL library. 

Sounds like a safe bet.
	
	During 'make' it was taking MKL lib*.so files from
$MKL_HOME/lib/32 (in config.h file).
	
	Changed $MKL_HOME/lib/32 to $MKL_HOME/lib/64 since the arch is
AMD64 opteron. 
      But it is giving incompatible - libvml.so file.
	
	I'm not getting why this error is coming. 

The MKL library is highly targeted at Intel CPUs and you are using AMD
Opteron.  

Unlike Intel, AMD offers a free math library tuned to their processors:
ACML.
http://developer.amd.com/acml3.jsp
is the free download page.  Choose a version compatible with your Intel
Fortran compiler.  

That is probably your easiest solution and the way to the best
performance on Opteron.

-Tom

	
	Any help for this will be appreciated. 
	
	Thanks & Regards,
	Sangamesh 
	
	
From henry.gabb at intel.com  Wed Dec 19 12:23:30 2007
From: henry.gabb at intel.com (Gabb, Henry)
Date: Wed, 19 Dec 2007 12:23:30 -0800
Subject: [Beowulf] RE: Amber 8 Execution problem
In-Reply-To: <200712192000.lBJK07lu029623@bluewest.scyld.com>
References: <200712192000.lBJK07lu029623@bluewest.scyld.com>
Message-ID: <4D97B70CF7F72144881F66DFF4BD7A12032FD38B@fmsmsx413.amr.corp.intel.com>

Hi Sangamesh,
It sounds like the executable can't find some shared objects. The
$MKL_HOME/tools/environment directory contains initialization scripts to
automatically update the $MKL_ROOT, $LD_LIBRARY_PATH, etc. Since you're
running on Opteron, you should use the mklvarsem64t.{sh|csh}. Likewise,
you should be using the libraries in $MKL_HOME/lib/em64t rather than the
IA-32 (lib/32) or Itanium (lib/64) libraries.
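
A quick way to see which architecture a given shared object was built for is to read its ELF header. The sketch below uses only the standard ELF fields (class byte, byte-order byte, and the e_machine code); the helper names are made up for illustration, and the demo header is synthetic rather than taken from a real MKL file.

```python
# Identify the target architecture of an ELF shared object from its header.
import struct

ELF_CLASS = {1: "32-bit", 2: "64-bit"}
ELF_MACHINE = {3: "x86 (IA-32)", 50: "Itanium (IA-64)", 62: "x86-64 (AMD64/EM64T)"}

def describe_elf_header(header):
    """Return (class, machine) for the first 20 bytes of an ELF file, or None."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    byte_order = "<" if header[5] == 1 else ">"     # EI_DATA: 1 = little-endian
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)   # e_machine
    return (ELF_CLASS.get(header[4], "unknown class"),
            ELF_MACHINE.get(machine, "machine code %d" % machine))

def describe_elf_file(path):
    """Read just enough of a file on disk to describe it."""
    with open(path, "rb") as f:
        return describe_elf_header(f.read(20))

# Synthetic little-endian 64-bit headers, differing only in e_machine.
prefix = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
print(describe_elf_header(prefix + struct.pack("<HH", 3, 62)))   # em64t-style
print(describe_elf_header(prefix + struct.pack("<HH", 3, 50)))   # Itanium-style
```

Running describe_elf_file on the libvml.so picked up by the link step would show whether it is an IA-32, Itanium, or x86-64 build; only the last can be linked into a 64-bit Opteron executable.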

Best regards,

Henry Gabb
Intel Cluster Software and Technologies

----------------------------------------------------------------------
Message: 1
Date: Wed, 19 Dec 2007 13:35:26 +0530
From: "Sangamesh B" <forum.san at gmail.com>
Subject: [Beowulf] Amber 8 Execution problem
To: "Beowulf ML" <beowulf at beowulf.org>
Message-ID:
	<cb60cbc40712190005o59692763v192444db31001012 at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi All,

I installed AMBER8 on Opteron Dual core, dual processor, Rocks cluster
with
the following options/libraries:

Compiler: Intel 9 Fortran and C++ compilers.
Blas Library: MKL8
MPI: MPICH2 compiled with Intel compiler.

During make serial, I got an error for xterm library(libXt.so). Then
found
this lib's availability at /usr/lib64/. But the makefile of
amber8/src/leap/src/leap/ has /usr/lib -lXt .. . I changed it to
/usr/lib64

After this I built both serial and parallel AMBER executables
successfully.

But when I execute with the following:

mpirun -np 4  /usr/bin/numactl -c0-1 $AMBERHOME/amber8/exe/sander
...input
file parameters

Note: Numactl is used to bind the processes to particular processors for
better performance.

its giving:

symbol lookup error:       undefined symbol: __intel_cpu_indicator

I guessed, this might relate to MKL library.

During 'make' it was taking MKL lib*.so files from $MKL_HOME/lib/32 (in
config.h file).

Changed $MKL_HOME/lib/32 to $MKL_HOME/lib/64 since the arch is AMD64
opteron. But it is giving incompatible -libvml.so file.

I'm not getting why this error is coming.

Any help for this will be appreciated.

Thanks & Regards,
Sangamesh


From lindahl at pbm.com  Wed Dec 19 13:24:44 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 19 Dec 2007 13:24:44 -0800
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <425BE87409B6BA49954C7D375C69F9ED098B20C8@ms09.mse4.exchange.ms>
References: <425BE87409B6BA49954C7D375C69F9ED098B20C8@ms09.mse4.exchange.ms>
Message-ID: <20071219212444.GB5320@bx9.net>

On Sat, Dec 15, 2007 at 10:08:16PM -0500, Shai Fultheim (Shai at ScaleMP.com) wrote:

> -          Large memory applications can use all memory - for example
> running SANDIA CUBIT with meshes over 400GB in size.  3.08x faster than
> the customer large-scale Itanium NUMA system.

This is a classic apple-orange comparison. In order to be useful, you'd have
to mention what the other system was -- if it's an Itanium-1 at 800 MHz
compared to a 3.0 GHz modern system, that matters.

But, better yet, can you just point us at a set of SPEC OMP benchmark
results? I looked at your OEM partner pages and didn't see any
benchmark results.

-- greg


From moloney.brendan at gmail.com  Wed Dec 19 14:51:15 2007
From: moloney.brendan at gmail.com (Brendan Moloney)
Date: Wed, 19 Dec 2007 14:51:15 -0800
Subject: [Beowulf] Help with inconsistent network performance
In-Reply-To: <e4d4fd070712190828x6f5f424dk89317544a1b0a6ae@mail.gmail.com>
References: <204b3d180712171903u31802dacve0bd9c6ae6bf1e41@mail.gmail.com>
	<20071218234117.GA29318@bx9.net> <47687C75.9090907@myri.com>
	<204b3d180712182014j6df14f41l3909ad3cd99867a2@mail.gmail.com>
	<Pine.LNX.4.64.0712182340010.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182140y6bf7166ct5ae449f87b256b0f@mail.gmail.com>
	<Pine.LNX.4.64.0712190044000.6542@coffee.psychology.mcmaster.ca>
	<204b3d180712182224o362a564dgfa490ba624866992@mail.gmail.com>
	<Pine.LNX.4.64.0712190856080.6542@coffee.psychology.mcmaster.ca>
	<e4d4fd070712190828x6f5f424dk89317544a1b0a6ae@mail.gmail.com>
Message-ID: <204b3d180712191451jb3986ddr12c9484bed655be0@mail.gmail.com>

Well it turns out that flow control was disabled on the switch, and once we
enabled it the hiccups disappeared and the average RTT was cut in half.
 Even with an image size of 1920x1200 and 7 nodes sending to one, the RTTs
are the same as if there is one node sending the full image.

Thanks a lot for all the help, I really appreciate it.  I hope this info is
also helpful to others who are having similar problems.

Brendan

From hahn at mcmaster.ca  Thu Dec 20 07:14:54 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Thu, 20 Dec 2007 10:14:54 -0500 (EST)
Subject: [Beowulf] how green is that?!?
Message-ID: <Pine.LNX.4.64.0712201011150.16417@coffee.psychology.mcmaster.ca>

http://www.computerworld.com.au/index.php?id=312084283

very amusing and effective stunt for SiCortex!  though I wonder 
whether the total carbon footprint winds up being bigger than when
using conventional power (ie, that the human+foodchain system is 
itself fairly high-carbon...)


From richard.walsh at comcast.net  Thu Dec 20 07:38:09 2007
From: richard.walsh at comcast.net (richard.walsh at comcast.net)
Date: Thu, 20 Dec 2007 15:38:09 +0000
Subject: [Beowulf] how green is that?!?
Message-ID: <122020071538.12891.476A8C61000159FC0000325B2200763692089C040E99D20B9D0E080C079D@comcast.net>


-------------- Original message -------------- 
From: Mark Hahn <hahn at mcmaster.ca> 

> http://www.computerworld.com.au/index.php?id=312084283 
> 
> very amusing and effective stunt for SiCortex! though I wonder 
> whether the total carbon footprint winds up being bigger than when 
> using conventional power (ie, that the human+foodchain system is 
> itself fairly high-carbon...) 
> 
Right.  Maybe running off a bunch of photo-voltaics would have been more hydrocarbon friendly (if not market grabbing), but the machine is more peak-efficient (there is little benchmark data out there) per watt than Blue Gene by my calculation, more balanced than Opteron (though not than a Cray), and has a very cool custom interconnect.  It takes some baby steps in the direction of "many-core" with more and simpler cores than the current generation of x86-64 chips.  If it is possible for custom machines to regain traction in this space, I think SiCortex's systems are the best bet.
Regards,
rbw

--

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting-- 
5605 Alameda St. 
Shoreview, MN 55126 

Phone #: 612-382-4620

From larry.stewart at sicortex.com  Thu Dec 20 12:19:13 2007
From: larry.stewart at sicortex.com (Larry Stewart)
Date: Thu, 20 Dec 2007 15:19:13 -0500
Subject: [Beowulf] Stream numbers for SiCortex's MIPS based SOC ...
In-Reply-To: <121720072126.8135.4766E9750001CC7B00001FC72207021573089C040E99D20B9D0E080C079D@comcast.net>
References: <121720072126.8135.4766E9750001CC7B00001FC72207021573089C040E99D20B9D0E080C079D@comcast.net>
Message-ID: <476ACE41.4090901@sicortex.com>

richard.walsh at comcast.net wrote:

> All,
>  
> Anyone seen Stream numbers for one and/or more cores from SiCortex, say
> a SiCortex Catapult System.  The chip has two memory controllers, and I
> have heard provides:
>
> "more than 10 Terabytes of bandwidth"
>
> in the largest configuration, but have not seen any measured memory
> bandwidth numbers for this box.  Come to think of it, I have not seen
> measured numbers for its interconnect performance either. Sustaining a
> reasonable ratio of bytes delivered from memory to flops should be easier
> on this processor with its lower clock, but it does have 2 cores.  I am
> interested in how it looks compared to Opteron, etc. It is supposed to
> be a balanced design, but it seems there are few measured results
> available to validate this.
>  
> As always your thoughts are appreciated ...
>  
> Regards,
>  
> rbw 
> -- 

The usual caveats apply: these are microbenchmarks; delivered
application performance and scalability are what matter.  The metrics
of interest may include absolute performance, cost/performance, and
power/performance.

The SiCortex machines have a substantially different balance of
processing, memory, and communications than desktop machines.  And
don't forget they use about 600 milliwatts per core or 12 watts per
node including 4 GB memory and the interconnect.

Read on...

Regarding the interconnect, we published some results at the
2007 EuroPVM/MPI conference last October.  I've just realized that that
paper is not on our website, but I'll get that fixed.

We've measured short message latency at 1.4 microseconds half-round
trip (ping pong). This isn't as fast as some ping pong results, but
when running at scale, the HPCC Random Ring latency is under 2
microseconds when all 648 cores of an SC648 are active at once.  The
fastest machine with 512 or more cores in the current HPCC results
reports 2.3 microseconds. 

For large messages, the point to point bandwidth off-node is about 1.1
gigabytes/sec.  That aggregate capacity seems to be shared fairly
among all cores reading and writing, so HPCC random ring gets about
600 MB/sec per node on 108 nodes (1 core/node) and about 100 MB/sec
per core when all 648 cores of the SC648 are running at once.  Looking
on the HPCC Results page for machines of that scale I find that the
NEC SX-8, the Cray XT-3's, Columbia (Altix) and the new Intel Endeavor
cluster are faster.

Stream Triad gives 360 megabytes/sec when one core is active, and 340
megabytes/sec per core when all six cores are active at once.  We're
pleased that we can run all six cores at once with little
degradation. The core we are currently using supports only a single
outstanding cache miss and does not have a prefetch unit.
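
As a quick check on the "little degradation" remark, the two Triad figures quoted above can be turned into an aggregate per-node bandwidth and a per-core scaling efficiency. This is only arithmetic on the numbers in this message; the function and variable names are made up for illustration.

```python
def triad_scaling(single_core_mb_s, per_core_loaded_mb_s, cores):
    """Return (aggregate MB/s, per-core efficiency) for a fully loaded node."""
    # Aggregate bandwidth when every core runs Triad at once.
    aggregate = per_core_loaded_mb_s * cores
    # Fraction of the single-core rate each core keeps under full load.
    efficiency = per_core_loaded_mb_s / single_core_mb_s
    return aggregate, efficiency


# Figures quoted in the message: 360 MB/s alone, 340 MB/s each with 6 active.
aggregate, efficiency = triad_scaling(360.0, 340.0, 6)
print(f"aggregate: {aggregate:.0f} MB/s per node")
print(f"per-core efficiency: {efficiency:.1%}")
```

On those figures each core keeps roughly 94% of its single-core rate when all six run together.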

The memory controllers themselves have enough bandwidth to supply all
six cores, the DMA engine running the interconnect and the PCI express
on I/O nodes, all at once.  (At SC07 we measured 1100 MB/sec to a
Myricom 10G running MX.)

The main memory latency is about 104 nanoseconds, load to use, so the
number of clock cycles to main memory is quite low.

As a consequence of this balance: moderate speed cores, reasonably
low latency memory (although not extreme bandwidth), and quite fast
communications, benchmarks like HPCC Random Access run very well.
The SC5832, for example, measures around 2.25 using the Sandia Labs
version of the code, putting it sixth in current rankings behind
the big BlueGene, the big XT3s, a Cray X1, and the Intel Endeavor
cluster.  Cost and power consumption comparisons are left as
an exercise for the reader.

-Larry


From mathog at caltech.edu  Thu Dec 20 13:03:18 2007
From: mathog at caltech.edu (David Mathog)
Date: Thu, 20 Dec 2007 13:03:18 -0800
Subject: [Beowulf] Re: how green is that?!?
Message-ID: <E1J5SY2-0003bE-0u@mendel.bio.caltech.edu>

Mark Hahn <hahn at mcmaster.ca> wrote

> http://www.computerworld.com.au/index.php?id=312084283
> 
> very amusing and effective stunt for SiCortex!

Stunt being the operative word.  

It was an interesting demo of how little power it took to run that
cluster, since people are notoriously "underpowered".  However,
as green policy it is perfectly silly. The energy
that went into growing and delivering the food which "powered" the
cyclists could have been more efficiently delivered directly to the
computer.  And unless these guys pedaled for a very long time
the amount of energy consumed during this stunt would have been
nothing compared to what went into building the computer, the
bikes, their clothes, moving all of it to wherever this stunt was
performed, etc. etc.

The article talks about Guinness records - no good can come of that.

The numbers that matter in terms of "green" policy are the power
consumption at peak computing load, sustained computing load, and
when idle; where those three load levels would ideally have some
standardized measure so that different machines could be compared for
their running efficiency, and end users could choose products
accordingly.

Also of interest for "green" policy is the amount of energy that goes
into building the machines.  I don't know how to estimate that directly,
but the energy _cost_ can be estimated.  Sadly the limits are both
obvious and rather far apart, so they don't tell us much: the cost
of the energy required is more than zero and less than or equal to
the price of the machine (unless that price was subsidized or the
manufacturer is trying to go out of business).  For instance, for
a $1000 node at $0.10 per kilowatt-hour the energy used lies
somewhere between 0 and 10,000 kilowatt-hours.  As I said, a pretty
wide estimate!  If the node has a lifetime of 3 years and uses 300W
on average, that's (300 * 24 * 365 * 3)/1000 = 7884 kilowatt-hours.
(I know, no AC or room lighting included.)  Which says that
the energy required to manufacture and deliver that node
is at most comparable to the energy it consumes while running over
its lifetime, and typically less than half of the total. My gut feeling
is that it's a lot less than half, which makes the power consumption
at load numbers the place to look for green computing policy.
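
The bounds in the paragraph above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures given there ($1000 node, $0.10 per kilowatt-hour, 300 W average, 3 years); the function names are invented for illustration, and AC and lighting are ignored just as in the text.

```python
def embodied_energy_upper_bound_kwh(price_usd, usd_per_kwh):
    """Upper bound: assume the whole purchase price was spent on energy."""
    return price_usd / usd_per_kwh


def operating_energy_kwh(avg_watts, years):
    """Energy drawn while running, ignoring cooling and lighting."""
    return avg_watts * 24 * 365 * years / 1000


upper = embodied_energy_upper_bound_kwh(1000, 0.10)
running = operating_energy_kwh(300, 3)
# Worst case for manufacturing: its share of (manufacturing + running).
worst_share = upper / (upper + running)

print(f"manufacturing energy: between 0 and {upper:.0f} kWh")
print(f"operating energy:     {running:.0f} kWh")
print(f"manufacturing share of lifetime total: at most {worst_share:.0%}")
```

Even at the extreme upper bound the manufacturing share comes out only a little above half, which is consistent with the gut feeling that the real figure is well below it.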

It would be fun to see "energy content" stickers on computers, so that
green policy could include that variable, but I just don't see it happening.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From shai at scalemp.com  Thu Dec 20 20:04:35 2007
From: shai at scalemp.com (Shai Fultheim (Shai@ScaleMP.com))
Date: Thu, 20 Dec 2007 23:04:35 -0500
Subject: [Beowulf] ever heard of ScaleMP?
In-Reply-To: <20071219212444.GB5320@bx9.net>
References: <425BE87409B6BA49954C7D375C69F9ED098B20C8@ms09.mse4.exchange.ms>
	<20071219212444.GB5320@bx9.net>
Message-ID: <425BE87409B6BA49954C7D375C69F9ED09938F14@ms09.mse4.exchange.ms>

Greg,

It is a bit difficult to make a classic apples-to-apples comparison when
there is not much to compare to.  vSMPowered systems are the largest x86
systems in both memory and core count, which makes it a bit hard to
compare.  Some data below:

1. Apples-to-apples:
 - The fastest SUN 8-proc (x4600m2) AMD machine (2.6 GHz):  STREAM OMP 9 GB/s
 - The fastest IBM 8-proc (x3950) Intel machine (3.0 GHz):  STREAM OMP 4 GB/s
 - vSMPowered system has linear STREAM bandwidth...  so 8-proc STREAM
   OMP 27 GB/s
2. Apples-to-oranges:
 - Comparing Itanium-2 (not 1) and x86 when it gets to large memory is
   the right comparison.  After all, finding another x86 system that has
   more than 400GB RAM is quite difficult.

I don't have SPEC OMP results, but one application I can mention that
makes use of OMP is Gaussian, which several customers are running at
better performance than on other (AMD, Itanium) platforms, according to
the customers' own apples-to-apples comparisons.

I'll be happy to discuss our technology in detail if you have the time.
Note that we have several contacts in common at PathScale/QLogic who
can shed more light on ScaleMP's technology.

--Shai


-----Original Message-----
From: Greg Lindahl [mailto:lindahl at pbm.com] 
Sent: Wednesday, December 19, 2007 13:25
To: Shai Fultheim (Shai at ScaleMP.com)
Cc: Beowulf at beowulf.org
Subject: Re: [Beowulf] ever heard of ScaleMP?

On Sat, Dec 15, 2007 at 10:08:16PM -0500, Shai Fultheim
(Shai at ScaleMP.com) wrote:

> -          Large memory applications can use all memory - for example
> running SANDIA CUBIT with meshes over 400GB in size.  3.08x faster than
> the customer large-scale Itanium NUMA system.

This is a classic apple-orange comparison. In order to be useful, you'd
have to mention what the other system was -- if it's an Itanium-1 at
800 MHz compared to a 3.0 GHz modern system, that matters.

But, better yet, can you just point us at a set of SPEC OMP benchmark
results? I looked at your OEM partner pages and didn't see any
benchmark results.

-- greg


From tcarroll at ursinus.edu  Fri Dec 21 07:06:14 2007
From: tcarroll at ursinus.edu (Thomas Carroll)
Date: Fri, 21 Dec 2007 10:06:14 -0500
Subject: [Beowulf] Building a new cluster - seeking some advice
Message-ID: <1198249574.6128.24.camel@Loki>

Hi,
  I'm new to this list but not entirely new to building clusters.  I've
participated in setting up two clusters in the past, but I'm about to
build my own and would like some advice before I start.  I've searched
through the archives, and there's a ton of great information.  Still, I
thought I'd throw out my specific questions/info.  Sorry in advance for
the long email.

1. I'd like to go diskless.  I've never done this before (the other two
clusters are...diskful?).  I've used Fedora on both of the previous
clusters.  Is this a good choice for diskless?  Any advice on where to
start with diskless or operating system choices?

2. Given my budget (about 20K), I plan on going with GigE on about 24
nodes.  Am I right in thinking that faster network interconnects are
just too expensive for this budget?

3. I'll be spending most of my cluster's time diagonalizing large
matrices.  I plan on using ScaLAPACK eventually; currently I just use
LAPACK/ATLAS and do individual matrices on each node.  The only thing
parallel about my code right now is using the nodes for Monte Carlo.
This is what I'm looking at right now for my compute nodes:
	* Intel Core 2 Duo E6850 Conroe 3.0GHz ($280)
	* 8 GB (4 X 2 GB) DDR2 800 (~$200)
	* Case/PSU combo ($60)
	* ATX motherboard w/GigE (~$100)
This comes out to about 17k for 24 nodes + spare parts + hard drives for
the head node.  I've already purchased the switch and cables and have
more than adequate cooling and shelving for the room.
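
For reference, the parts list above can be tallied in a few lines. This is a rough sketch that treats the approximate ("~") prices as exact, which is an assumption; the dictionary keys and function name are made up for illustration.

```python
# Per-node prices in USD as listed above (approximate figures taken at face value).
NODE_PARTS_USD = {
    "cpu_e6850": 280,
    "memory_8gb": 200,
    "case_psu": 60,
    "motherboard": 100,
}


def cluster_cost(parts, nodes):
    """Return (cost per node, cost for all nodes) in USD."""
    per_node = sum(parts.values())
    return per_node, per_node * nodes


per_node, total = cluster_cost(NODE_PARTS_USD, 24)
# What remains of the ~17k estimate for spares and head-node drives.
remainder = 17_000 - total

print(f"per node: ${per_node}")
print(f"24 nodes: ${total}")
print(f"left for spares and head-node drives: ${remainder}")
```

On these numbers the 24 compute nodes account for a bit over 15k, leaving somewhat under 2k of the 17k figure for spare parts and the head node's drives.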

  The motherboard does NOT have integrated video.  Will I need video
output?  Can you even build a node without it?  Problem is, the
motherboards with adequate support for 8GB memory and 1333 FSB don't
have video.  I could spend $10-20 per node for a video card, but that
seems like a waste.  From reading around, it seems like there is no
advantage really to DDR3 memory...is that right?  Any advice on the
video issue or my potential parts list would be greatly appreciated.

Thanks so much for any advice.  Feel free to offer unsolicited advice as
well :).  And I hope everyone has, or has already had, a good holiday!

-tom

-- 
------------------------------------------------------
        Prof. Thomas Carroll   (610) 409-3000 ext. 2121
        Ursinus College Dept. of Physics and Astronomy
        Pfahler 101K            tcarroll at ursinus.edu


From bill at cse.ucdavis.edu  Fri Dec 21 16:12:14 2007
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Fri, 21 Dec 2007 16:12:14 -0800
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <1198249574.6128.24.camel@Loki>
References: <1198249574.6128.24.camel@Loki>
Message-ID: <476C565E.7020200@cse.ucdavis.edu>

Thomas Carroll wrote:
> 1. I'd like to go diskless.  I've never done this before (the other two
> clusters are...diskful?).  I've used Fedora on both of the previous
> clusters.  Is this a good choice for diskless?  Any advice on where to
> start with diskless or operating system choices?

I know RHEL has support for diskless, I've talked to people who used it.
In general, if you are familiar with PXE boot, DHCP, initrd, ram disks, and
related tools, it's relatively straightforward.  If that kind of stuff scares
you, I'd consider spending $40 per node on a cheap disk.  I've not tried
this recently, but at the time I did see less stability without local
swap.  Swap over network can be a bit tricky: sometimes a network transfer
involves allocating a page, and if you are swapping you might not have
one.  I've seen a project or two for network block layers to handle this;
no idea if any of them are current.

No idea on diskless for Fedora, I suspect someone will comment.

In any case, my strategy was a read-only shared /, then a per-machine
read/write /var.  So the head node had a /var/host1, /var/host2, ....  That
way things like ntp.drift, ssh session keys, and numerous tmp files in /tmp
and /var/tmp wouldn't conflict across the cluster.  I ended up using this
for lab machines; it was quite amusing to watch a hacker try to hack binaries
on an NFS client that the NFS server considered read-only.
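
One way to picture that layout is as NFS export entries on the head node: a single read-only root shared by every node, plus one read/write /var/<host> per node. The sketch below only generates such entries as text. The shared-root path `/export/noderoot` and the host names are hypothetical, and the option set (`ro`/`rw`, `sync`, `no_root_squash`) is one plausible choice, not necessarily what was used on the cluster described here.

```python
def export_entries(hosts, shared_root="/export/noderoot", var_base="/var"):
    """Return /etc/exports-style lines: shared read-only root, per-host RW /var."""
    # One line exporting the common root read-only to every compute node.
    ro_clients = " ".join(f"{h}(ro,sync,no_root_squash)" for h in hosts)
    lines = [f"{shared_root} {ro_clients}"]
    # One line per node exporting its private /var/<host> read/write.
    for h in hosts:
        lines.append(f"{var_base}/{h} {h}(rw,sync,no_root_squash)")
    return lines


for line in export_entries(["host1", "host2"]):
    print(line)
```

Each client would then mount the shared root as / and its own /var/<host> directory as /var during early boot.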

My overhead per client was just a few tens of MB on the head node's disk.

So the head node basically had 2 installs, one for the head node, and
one for all the compute nodes.  As well as 2 RPM databases.  I didn't use
or maintain the RPM database on the client nodes since they didn't really
have their own filesystem.  Thankfully /dev is no longer on the local disk
so that is not a problem.

You will likely need more than the default 8 NFS daemons on the head node.


> 2. Given my budget (about 20K), I plan on going with GigE on about 24
> nodes.  Am I right in thinking that faster network interconnects are
> just too expensive for this budget?

I suspect so, I'll be interested to hear if others suggest something
where the switch, cables, and nics don't eat most of the budget.  I
guess the cheapest IB cards + switch might be low enough to let you
still buy more nodes than GigE would scale to for a certain code.  Any
idea if your code is communication intensive enough so that 12 IB with
quad core CPUs might be faster than 24 dual core nodes with GigE?

> 3. I'll be spending most of my cluster's time diagonalizing large
> matrices.  I plan on using ScaLAPACK eventually; currently I just use
> LAPACK/ATLAS and do individual matrices on each node.  The only thing
> parallel about my code right now is using the nodes for Monte Carlo.
> This is what I'm looking at right now for my compute nodes:
> 	* Intel Core 2 Duo E6850 Conroe 3.0GHz ($280)

Hmm, a Q6600 quad 2.4 GHz is the same price, at least for some codes I'd
expect it to have more throughput than a dual core 3.0 GHz.  Of course
if the network is the bottleneck it won't help.

> 	* 8 GB (4 X 2 GB) DDR2 800 (~$200)

To keep the same memory per core with the Q6600 you'd have to double that; do
you need 4GB per core?  I suspect cheap motherboards will not allow 8x2GB
(for the same memory per core with a quad core).

> 	* Case/PSU combo ($60)
> 	* ATX motherboard w/GigE (~$100)
> This comes out to about 17k for 24 nodes + spare parts + hard drives for
> the head node.  I've already purchased the switch and cables and have
> more than adequate cooling and shelving for the room.
> 
>   The motherboard does NOT have integrated video.  Will I need video
> output?  Can you even build a node without it?  Problem is, the

To get a single node running diskless, most likely yes; for the next 3 years
after it runs, probably not.   I'd get two video cards for debugging.  It is
kinda nice to have a console when the kernel oopses, but there is a network
console facility in the kernel (netconsole) that can usually handle sending
an oops remotely (output that would normally only go to the console's screen).

> motherboards with adequate support for 8GB memory and 1333 FSB don't

I'm all for the faster FSB, but you might test to see if the performance
improves, from what I can see for the same $ the 1333 FSB often is the
same latency, but only somewhat higher performance.  If it adds much
cost or less flexibility I'd at least look at the 1066 FSB motherboards.
Alas, I think the date where ddr2-533 is cheaper than ddr2-667 has passed.

> have video.  I could spend $10-20 per node for a video card, but that
> seems like a waste.  From reading around, it seems like there is no
> advantage really to DDR3 memory...is that right?  Any advice on the

It's what I've read as well, not tried it myself yet.

> video issue or my potential parts list would be greatly appreciated.

Try to get motherboards without fans.

> Thanks so much for any advice.  Feel free to offer unsolicited advice as
> well :).  And I hope everyone has, or has already had, a good holiday!

Seems plausible, keep in mind even if your time might be cheap/free there's
plenty of other things to do usually.  If disk drives get your cluster in
production a week sooner is that worth $40 a node?  Is local swap of any
value? How about local disk I/O for anything disk I/O intensive?  I've done
diskless myself and not regretted it.  Yet clusters I build these days
(usually from a vendor not from a stack of parts) have disks.


From lindahl at pbm.com  Fri Dec 21 17:13:09 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 21 Dec 2007 17:13:09 -0800
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <1198249574.6128.24.camel@Loki>
References: <1198249574.6128.24.camel@Loki>
Message-ID: <20071222011309.GA3940@bx9.net>

On Fri, Dec 21, 2007 at 10:06:14AM -0500, Thomas Carroll wrote:

> 2. Given my budget (about 20K), I plan on going with GigE on about 24
> nodes.  Am I right in thinking that faster network interconnects are
> just too expensive for this budget?

Wrong question. The right question: given my application, what network
gives me the most bang for my buck?

For such a small system, the answer is likely gigE; but, if your code
doesn't scale to 24 nodes with gigE, you'll have to think about
alternatives. There are codes and data set sizes at which gigE doesn't
scale.

-- greg


From lindahl at pbm.com  Fri Dec 21 17:14:49 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Fri, 21 Dec 2007 17:14:49 -0800
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476C565E.7020200@cse.ucdavis.edu>
References: <1198249574.6128.24.camel@Loki> <476C565E.7020200@cse.ucdavis.edu>
Message-ID: <20071222011449.GB3940@bx9.net>

On Fri, Dec 21, 2007 at 04:12:14PM -0800, Bill Broadley wrote:

> I suspect cheap motherboards will not allow 8x2GB
> (for the same memory per core with a quad core).

Having just bought some single-socket Intel nodes, I hear that there
are currently no single-socket mobos which support > 8 gigs.

-- greg 


From hahn at mcmaster.ca  Fri Dec 21 21:11:38 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat, 22 Dec 2007 00:11:38 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <1198249574.6128.24.camel@Loki>
References: <1198249574.6128.24.camel@Loki>
Message-ID: <Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>

> 1. I'd like to go diskless.  I've never done this before (the other two
> clusters are...diskful?).  I've used Fedora on both of the previous
> clusters.  Is this a good choice for diskless?  Any advice on where to
> start with diskless or operating system choices?

I prefer diskless installs:
 	- NFS root: fast, can be RO, no significant server load.
 	- node-specific files on tmpfs: hardly any - pidfiles mostly.
 	- local disk for swap, /tmp: disks are cheap and fast, why not?

such an approach is really nicely scalable and very pleasant to 
maintain.  a diskful cluster, by comparison, is often annoying:
disk failures actually matter, and it's not that hard for nodes 
to get out of sync.  systemimager does a good job of reimaging nodes,
but it's still not quite as "liberating" as just resetting a node,
knowing it's ephemeral...

> 2. Given my budget (about 20K), I plan on going with GigE on about 24
> nodes.  Am I right in thinking that faster network interconnects are
> just too expensive for this budget?

Greg's right: buy the right interconnect, not just the cheapest.

> 3. I'll be spending most of my cluster's time diagonalizing large
> matrices.  I plan on using ScaLAPACK eventually; currently I just use
> LAPACK/ATLAS and do individual matrices on each node.  The only thing

my experience with scalapack and diagonalization is with monster-sized 
sparse matrices, which seem to be fairly latency-sensitive.  if your 
workload is anything like that, gigabit isn't going to scale well,
at least with a conventional mpi+tcp stack.  (I'm looking forward to 
the OpenMX stack for this reason.)

> 	* Intel Core 2 Duo E6850 Conroe 3.0GHz ($280)
> 	* 8 GB (4 X 2 GB) DDR2 800 (~$200)

did you consider AMD?  "large matrices" makes me think of memory balance
(bandwidth per flop), where AMD normally leads Intel.

>  The motherboard does NOT have integrated video.  Will I need video
> output?  Can you even build a node without it?

this is a bios issue: will the board boot without a video card?
I guess you can try configuring it with the card, then remove the card
and see if it still boots.  I would make sure you can't get integrated
video - these days, such boards are often cheaper.

> motherboards with adequate support for 8GB memory and 1333 FSB don't
> have video.

I would also consider AMD, which has lots of integrated-video options.

> seems like a waste.  From reading around, it seems like there is no
> advantage really to DDR3 memory...is that right?  Any advice on the

power savings, probably some headroom in clock, but it's really at
the early-adopter stage, I think.

regards, mark hahn.


From davidbak at gmail.com  Fri Dec 21 22:12:10 2007
From: davidbak at gmail.com (David Bakin)
Date: Fri, 21 Dec 2007 22:12:10 -0800
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
Message-ID: <957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>

Speaking of nodes w/ disks vs. nodes without - I was thinking of equipping a
small cluster (microwulf style) with each node having a single USB
thumbdrive instead of a disk.  I thought it might be easier than trying to
get nodes to boot PXE style over the network.  And it seemed to me that
thumbdrives might be easier than disk-per-node to keep in sync: I'd just
unplug them from the nodes, plug them into a USB hub on another computer
where I build my distribution, and copy files to them, then plug them back
into their nodes.  Also the USB drives would serve for any local filesystem
needs, e.g., for logging or whatever.  With a 1GB key available for about
$12 it seemed a pretty easy and cheap and low power solution.  And no moving
parts means the "disks" won't die for mechanical reasons (and they won't be
written to enough to worry about flash-wear).
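
The "plug them all into a hub and copy" step could be scripted roughly as below. This is a minimal sketch, not a tested deployment tool: the image directory and mount points are hypothetical, it needs Python 3.8+ for `dirs_exist_ok`, and `shutil.copytree` neither removes stale files nor preserves ownership, so something like rsync run as root would likely be the better choice for a real root filesystem.

```python
import shutil
from pathlib import Path


def sync_image_to_drives(image_dir, drive_mounts):
    """Copy the built image tree onto each mounted thumb drive.

    Returns the list of mount points that were actually updated.
    """
    image_dir = Path(image_dir)
    updated = []
    for mount in map(Path, drive_mounts):
        if not mount.is_dir():
            # Drive not plugged in or not mounted: skip it.
            continue
        # Overlay the image onto the drive, keeping symlinks as symlinks.
        shutil.copytree(image_dir, mount, symlinks=True, dirs_exist_ok=True)
        updated.append(mount)
    return updated


if __name__ == "__main__":
    # Hypothetical paths: one built image, four drives on a USB hub.
    drives = [f"/mnt/usb{i}" for i in range(1, 5)]
    for done in sync_image_to_drives("/srv/node-image", drives):
        print(f"updated {done}")
```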

Does anyone have any thoughts on this?  Tried it?  Knows why it won't work?

Thanks!  -- David Bakin



From laytonjb at charter.net  Sat Dec 22 04:41:02 2007
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sat, 22 Dec 2007 07:41:02 -0500
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
References: <1198249574.6128.24.camel@Loki>	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
Message-ID: <476D05DE.7080809@charter.net>

David Bakin wrote:
> Speaking of nodes w/ disks vs. nodes without - I was thinking of
> equipping a small cluster (microwulf style) with each node having a
> single USB thumbdrive instead of a disk.  I thought it might be easier
> than trying to get nodes to boot PXE style over the network.  And it
> seemed to me that thumbdrives might be easier than disk-per-node to
> keep in sync: I'd just unplug them from the nodes, plug them into a
> USB hub on another computer where I build my distribution, and copy
> files to them, then plug them back into their nodes.  Also the USB
> drives would serve for any local filesystem needs, e.g., for logging
> or whatever.  With a 1GB key available for about $12 it seemed a
> pretty easy and cheap and low power solution.  And no moving parts
> means the "disks" won't die for mechanical reasons (and they won't be
> written to enough to worry about flash-wear).
>  
> Does anyone have any thoughts on this?  Tried it?  Knows why it won't 
> work?

It should work fine. Just be sure you're not putting file systems on the
thumb drives that do lots of IO (/var/log, swap partitions) because of
rewrite issues (you can easily send the log information to the master node,
and perhaps you could put a very small swap file on the thumb drive if you
think you need it). There is work afoot in the Linux kernel to allow real
swapping over the network. It's not quite there yet (the last I looked),
but it does look like Linus will allow it once it reaches some level of
maturity.

You could also try it with SD cards or whatever flash media you like.

Jeff


From rgb at phy.duke.edu  Sat Dec 22 05:56:31 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 22 Dec 2007 08:56:31 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476D05DE.7080809@charter.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
Message-ID: <Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>

On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:

> very small swap file on the thumb drive if you think you need it). There
> is work afoot in the Linux kernel to allow real swapping over the network.
> It's not quite there yet (the last I looked), but it does look like Linus
> will allow it once it reaches some level of maturity.

I looked at this a LONG time ago (back when I was running diskless nodes
out of sheer necessity because the nodes we got as part of a giveaway
program had an unsupported SCSI controller).  There was a "move afoot"
then, too, but this was maybe 2000.  So don't hold your breath.

OTOH a) memory is dirt cheap.  Buy lots and then don't run jobs that
fill it.  In fact, since Linux uses a bunch to buffer/cache and smooth
performance, getting 2x what you need is only good.  Also b) I can't
remember what came of my efforts -- I do recall trying to set up swap
to a swapfile mounted over NFS, and recall that there was an issue with
the system calls required by swap deep in the kernel.  I KIND of
remember getting it working, though -- I don't remember if I hacked the
kernel or applied a patch or if somebody added a fix to the kernel about
then.

Swap over an NFS-mounted swapfile (as opposed to a remote mount swap
partition) still doesn't work?

     rgb

>
> You could also try it with SD cards or whatever flash media you like.
>
> Jeff

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


From laytonjb at charter.net  Sat Dec 22 06:12:43 2007
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sat, 22 Dec 2007 09:12:43 -0500
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
Message-ID: <476D1B5B.6030705@charter.net>

Robert G. Brown wrote:
> On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:
>
>> very small swap file on the thumb drive if you think you need it).
>> There is work afoot in the Linux kernel to allow real swapping over
>> the network. It's not quite there yet (the last I looked), but it does
>> look like Linus will allow it once it reaches some level of maturity.
>
> I looked at this a LONG time ago (back when I was running diskless nodes
> out of sheer necessity because the nodes we got as part of a giveaway
> program had an unsupported SCSI controller).  There was a "move afoot"
> then, too, but this was maybe 2000.  So don't hold your breath.

http://lwn.net/Articles/262379/

http://kerneltrap.org/Linux/Swap_Over_NFS

http://kerneltrap.org/Linux/Memory_Management_Improvements


From rgb at phy.duke.edu  Sat Dec 22 07:20:24 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 22 Dec 2007 10:20:24 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476D1B5B.6030705@charter.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
Message-ID: <Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>

On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:

> Robert G. Brown wrote:
>> On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:
>> 
>>> very small swap file on the thumb drive if you think you need it).
>>> There is work afoot in the Linux kernel to allow real swapping over
>>> the network. It's not quite there yet (the last I looked), but it does
>>> look like Linus will allow it once it reaches some level of maturity.
>> 
>> I looked at this a LONG time ago (back when I was running diskless nodes
>> out of sheer necessity because the nodes we got as part of a giveaway
>> program had an unsupported SCSI controller).  There was a "move afoot"
>> then, too, but this was maybe 2000.  So don't hold your breath.
>
> http://lwn.net/Articles/262379/
>
> http://kerneltrap.org/Linux/Swap_Over_NFS
>
> http://kerneltrap.org/Linux/Memory_Management_Improvements

Yeah, all that stuff.  Memory management, locking, page sizes, slow, and
if you pushed it it could lock up your system, but maybe adequate to
keep your system from locking up if you run barely over, rarely.

I personally hope they do it, although I do think that we're about to go
through yet another paradigm shift.  With flash coming down to around
$10/GB or less wholesale in sizes up to 16 GB, I think we'll start
seeing pure flash-boot systems appear any day now.  As in systems with
built-in 4 GB flash memory holding the basic OS installation, systems
with 8 GB flash memory holding the OS and several GB of userspace.  A
whole new kind of thin.

Built in would have certain advantages -- USB fobs are too easy to knock
off and are regrettably slow.  For clusters I'm not sure -- they are
quite slow compared to disk, and even slow compared to network disk.
Too slow?  For loading/reading they're not so bad, and Linux will get
the pages into memory if there is enough memory and work no better or
worse than network diskless.

The differentiation then is management.  I'm not convinced that it will be
easier to install and manage a cluster with (say) 1 to 4 GB flash drives
used as boot compared to using e.g. warewulf to manage boot images.  Or
that it will be faster.  Or (really) cheaper -- $40 is still $40 more
than a diskless system, and $40 that would buy it more real memory that
is likely to ultimately be more valuable in terms of improved
performance and stability.

   rgb


From laytonjb at charter.net  Sat Dec 22 08:44:00 2007
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Sat, 22 Dec 2007 11:44:00 -0500
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
Message-ID: <476D3ED0.8090005@charter.net>

Robert G. Brown wrote:
> On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:
>> Robert G. Brown wrote:
>>> On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:
>>>> very small swap file on the thumb drive if you think you need it).
>>>> There is work afoot in the Linux kernel to allow real swapping over
>>>> the network. It's not quite there yet (the last I looked), but it
>>>> does look like Linus will allow it once it reaches some level of
>>>> maturity.
>>>
>>> I looked at this a LONG time ago (back when I was running diskless 
>>> nodes
>>> out of sheer necessity because the nodes we got as part of a giveaway
>>> program had an unsupported SCSI controller).  There was a "move afoot"
>>> then, too, but this was maybe 2000.  So don't hold your breath.
>>
>> http://lwn.net/Articles/262379/
>>
>> http://kerneltrap.org/Linux/Swap_Over_NFS
>>
>> http://kerneltrap.org/Linux/Memory_Management_Improvements
>
> Yeah, all that stuff.  Memory management, locking, page sizes, slow, and
> if you pushed it it could lock up your system, but maybe adequate to
> keep your system from locking up if you run barely over, rarely.

I agree. I personally like the idea of stateless nodes (with or without
disks) for a large number of reasons. I know the memory footprint
of the apps I tend to run. But there are people who don't (I'm probably
in the minority actually) and there are times when you get the wrong
number of nodes in the job and it starts swapping. I can also think of
an application I've run that needed more memory on start up and shut
down than while running because the rank 0 did all of the IO for the
rest of the nodes.

The idea behind swapping over NFS is that in these cases when the
apps do swap, they don't die, they just run like molasses in winter.
Hopefully the user or the admin notices this and takes action (at one
time I wrote a swap detector that would find apps that are swapping.
This could be added to something like Ganglia, Wulfware, or other
monitoring tool to detect swapping. But to be honest, I'm not sure what
I did with it - plus it may have been expensive in terms of CPU time).
You could just as easily swap to a hard drive, or flash or whatever,
but personally, I like the idea of not having any drives in my compute
nodes if I can help it.
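
A minimal sketch of a node-level swap check, for illustration only. This
is not the detector mentioned above (which found the individual apps that
were swapping); it assumes a Linux node that exposes the pswpin and
pswpout counters in /proc/vmstat, and the function names and the
threshold are made up here.

```python
def read_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) parsed from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def is_swapping(before, after, threshold_pages=0):
    """True if more than threshold_pages pages moved to or from swap
    between two samples of the counters."""
    moved = (after[0] - before[0]) + (after[1] - before[1])
    return moved > threshold_pages


def sample(path="/proc/vmstat"):
    """Read the current counters from the running kernel."""
    with open(path) as handle:
        return read_swap_counters(handle.read())
```

A monitoring tool could call sample() twice, a few seconds apart, and
raise a flag when is_swapping() returns True for the pair.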

Also, there are users in classified environments who would love to
get rid of as many disks as they can because of administrative as well
as security issues.

> The differentiation then is management.  I'm not convinced that it will be
> easier to install and manage a cluster with (say) 1 to 4 GB flash drives
> used as boot compared to using e.g. warewulf to manage boot images.  Or
> that it will be faster.  Or (really) cheaper -- $40 is still $40 more
> than a diskless system, and $40 that would buy it more real memory that
> is likely to ultimately be more valuable in terms of improved
> performance and stability.

Good point. I agree :) Now, about teaching you Fortran90....

Jeff


From landman at scalableinformatics.com  Sat Dec 22 09:16:12 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat, 22 Dec 2007 12:16:12 -0500
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476D3ED0.8090005@charter.net>
References: <1198249574.6128.24.camel@Loki>	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>	<476D05DE.7080809@charter.net>	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>	<476D1B5B.6030705@charter.net>	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
	<476D3ED0.8090005@charter.net>
Message-ID: <476D465C.9050904@scalableinformatics.com>

Jeffrey B. Layton wrote:

> Also, there are users in classified environments who would love to
> get rid of as many disks as they can because of administrative as well
> as security issues.

We had proposed something like that about a year ago to a customer with 
those issues.  Running nodes pure diskless, and doing the air-gap bit on 
the head node (simple power on/off was not acceptable, there was an 
explicit "air-gap" requirement on the power cord and wall socket).

It seems that some other vendor has productized this as well.  Generally 
it is fairly simple to do (Tiburon has support for this, as does 
Perceus/WW, and others ... ).

>> The differentiation then is management.  I'm not convinced that it will be
>> easier to install and manage a cluster with (say) 1 to 4 GB flash drives
>> used as boot compared to using e.g. warewulf to manage boot images.  Or
>> that it will be faster.  Or (really) cheaper -- $40 is still $40 more
>> than a diskless system, and $40 that would buy it more real memory that
>> is likely to ultimately be more valuable in terms of improved
>> performance and stability.

Keeping the nodes as clean, simple, cheap as possible is a good thing. 
Cluttering them up with lots of unneeded things doesn't help much.

One area which does have a cross-over point is IPMI.  After about 10-15 
nodes, it starts getting less expensive to get a switched PDU and a 
console server.  Gives you the same bios level access and power control. 
You do lose a little going without IPMI ... it does give you more
scriptable management flexibility ...  one of the nicer aspects is being
able to force stateful installs (the disk based ones) to pxe boot on
next boot, from a script.

IPMI cards cost $60-$120 depending upon what you need them to do (kvm 
over ip or not).  Dell has DRAC cards, Sun's are integrated in, HPs and 
IBMs are extra, no idea on pricing.  At 15 of these, you are looking at
$1800, which more than pays for the console server and switchable PDU.
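
The arithmetic behind that crossover can be sketched as follows. The
$60-$120 card prices and the 15-node count come from the text above; the
$1500 price for a switched PDU plus console server is a hypothetical
placeholder, not a quoted figure, and any per-port costs are ignored.

```python
def ipmi_total_cost(nodes, card_cost):
    """Total cost of buying one IPMI card per node."""
    return nodes * card_cost


def break_even_nodes(card_cost, pdu_and_console_cost):
    """Smallest node count at which the IPMI cards cost at least as much
    as the switched PDU plus console server."""
    # Integer ceiling division, avoiding floating point.
    return -(-pdu_and_console_cost // card_cost)


print(ipmi_total_cost(15, 120))     # 1800
print(ipmi_total_cost(15, 60))      # 900
# Hypothetical $1500 for the PDU and console server together.
print(break_even_nodes(120, 1500))  # 13
```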

> Good point. I agree :) Now, about teaching you Fortran90....

F90?  We are at F200x (x==3 or 4) already ... Sheesh!


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


From hahn at mcmaster.ca  Sat Dec 22 09:27:47 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat, 22 Dec 2007 12:27:47 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
References: <1198249574.6128.24.camel@Loki> 
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712221220550.20634@coffee.psychology.mcmaster.ca>

> small cluster (microwulf style) with each node having a single USB
> thumbdrive instead of a disk.  I thought it might be easier than trying to
> get nodes to boot PXE style over the network.  And it seemed to me that

PXE is dead simple these days, since you'll probably be using an integrated
nic, and it almost certainly has PXE support.  5 years ago, integrated nics
and PXE were less common but then again boot-from-USB was too...

> thumbdrives might be easier than disk-per-node to keep in sync: I'd just
> unplug them from the nodes, plug them into to a USB hub on another computer
> where I build my distribution, and copy files to them, then plug them back
> into their nodes.

even for fully diskful (heavyweight) installs, I'd still use PXE to
give me a single point of management.  I suppose for a ~8-node cluster,
fiddling with USB sticks might be acceptable, but it's the kind of thing
that gives clusters a bad name (sorry!).  that is, it's easy to make 
clusters scale very sublinearly, so the effort to do a 200 node cluster
is only marginally more than a 100 node one.

> Also the USB drives would serve for any local filesystem
> needs, e.g., for logging or whatever.

I think people mistake how much this is used - you want your syslogs
to go to a central management server, for instance.  otherwise you'll
probably never look at them.

> $12 it seemed a pretty easy and cheap and low power solution.  And no moving
> parts means the "disks" won't die for mechanical reasons (and they won't be
> written to enough to worry about flash-wear).

the best thing about nfs-root, disk-for-swap+tmp is that you almost don't 
care whether the disk fails.  certainly such a node can still be used,
though it won't be quite as robust wrt high memory use.


From hahn at mcmaster.ca  Sat Dec 22 09:43:01 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat, 22 Dec 2007 12:43:01 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476D05DE.7080809@charter.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
Message-ID: <Pine.LNX.4.64.0712221228120.20634@coffee.psychology.mcmaster.ca>

> You could also try it with SD cards or whatever flash media you like.

it's also worth realizing that disks are relatively reliable
if handled well, given some cool airflow, etc.  my organization
has been pleasantly surprised by the very low failure rate of 
sata disks in our clusters.  every compute node has 2x80G,
and we see a failure rate of maybe .5% annual (over 2-3 years 
of service).  in fact, we intended the 2 disks to be used in raid1,
but haven't bothered until recently - in retrospect, it probably
would have been better to get 1x160 or something.
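
To put that rate in perspective, a rough expected-failure estimate can
be sketched like this. The two disks per node and the 0.5% annual rate
are from the text above; the 500-node cluster size is a made-up example,
and the model simply assumes independent failures at a constant rate.

```python
def expected_disk_failures(nodes, disks_per_node, annual_failure_rate,
                           years=1.0):
    """Expected number of failed disks over the given period, assuming
    independent failures at a constant annual rate."""
    return nodes * disks_per_node * annual_failure_rate * years


# Hypothetical 500-node cluster, 2 disks each, 0.5% annual failure rate.
print(expected_disk_failures(500, 2, 0.005))
```

For that hypothetical cluster the estimate comes to about five disk
replacements per year.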

using flash for node disks might also appeal for power savings,
but disks aren't actually that hot.  WD Caviar "green" disks are 
7.5W max, 4 idle, .3 standby.  if your IO is occasional enough to 
not wear out flash, I suspect you could safely auto-standby a disk.
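
A small sketch of what auto-standby could mean for average power draw.
The 7.5 W / 4 W / 0.3 W figures are the ones quoted above; the duty-cycle
fractions in the example are hypothetical.

```python
def average_disk_watts(active_fraction, idle_fraction,
                       active_w=7.5, idle_w=4.0, standby_w=0.3):
    """Time-weighted average power of one disk; whatever time is not
    spent active or idle is assumed to be spent in standby."""
    standby_fraction = 1.0 - active_fraction - idle_fraction
    if standby_fraction < 0:
        raise ValueError("active and idle fractions exceed 100% of the time")
    return (active_fraction * active_w
            + idle_fraction * idle_w
            + standby_fraction * standby_w)


# Hypothetical duty cycle: 5% active, 15% idle, 80% standby.
print(round(average_disk_watts(0.05, 0.15), 3))
```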

-mark


From hahn at mcmaster.ca  Sat Dec 22 10:01:21 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat, 22 Dec 2007 13:01:21 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
Message-ID: <Pine.LNX.4.64.0712221244180.20634@coffee.psychology.mcmaster.ca>

> if you pushed it it could lock up your system, but maybe adequate to
> keep your system from locking up if you run barely over, rarely.

we actually aim our cluster nodes to fail a job fast if the user code 
tries to use too much memory, rather than limp along at 50x slowdown.
I suppose that's a bit harsh, but we also provide a wide range of
GB/core configurations.
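
One way to get fail-fast behaviour, sketched under assumptions: the post
does not say how the limit is enforced on those clusters, so this simply
caps the address space of a child process with RLIMIT_AS on Linux. The
function name is made up for illustration.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(argv, max_bytes):
    """Run argv with its virtual address space capped at max_bytes and
    return the exit code. Allocations beyond the cap fail immediately
    instead of pushing the node into swap."""
    def apply_limit():
        # Executed in the child process just before the program starts.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit).returncode
```

For example, run_with_memory_cap([sys.executable, "job.py"], 1 << 30)
would run a (hypothetical) job.py with a 1 GiB cap; a job that tries to
allocate more than that exits with an error rather than limping along.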

> with 8 GB flash memory holding the OS and several GB of userspace.  A
> whole new kind of thin.

certainly appealing for laptops and thin clients.  I think we're in a funny
stage wrt general desktops and up, though: if you want serious storage,
flash isn't even on the table.  but probably your serious storage should be 
over a fast network connection, rather than on the desktop or compute node.

> Built in would have certain advantages -- USB fobs are too easy to knock
> off and are regrettably slow.  For clusters I'm not sure -- they are

PATA/SATA-interface flash is accelerating, I think.  Intel just introduced
a building block for that, and other vendors have had somewhat obscure 
products out for a long time.  a flash-based "PATA-fob" seems reasonably
secure to me for this kind of minimal case.  2.5 and 3.5" form-factors for 
larger flash-based disks are also popular.


From rgb at phy.duke.edu  Sat Dec 22 10:34:53 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 22 Dec 2007 13:34:53 -0500 (EST)
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <476D3ED0.8090005@charter.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
	<476D3ED0.8090005@charter.net>
Message-ID: <Pine.LNX.4.64.0712221333540.5940@lilith.rgb.private.net>

On Sat, 22 Dec 2007, Jeffrey B. Layton wrote:

>> The differentiation then is management.  I'm not convinced that it will be
>> easier to install and manage a cluster with (say) 1 to 4 GB flash drives
>> used as boot compared to using e.g. warewulf to manage boot images.  Or
>> that it will be faster.  Or (really) cheaper -- $40 is still $40 more
>> than a diskless system, and $40 that would buy it more real memory that
>> is likely to ultimately be more valuable in terms of improved
>> performance and stability.
>
> Good point. I agree :) Now, about teaching you Fortran90....

AaaaaaAAAaaaaaahhhh... <runs screaming from room>

<fading in the distance>...the Ghost of Cluster Past...

    rgb


From joelja at bogus.com  Sat Dec 22 14:17:00 2007
From: joelja at bogus.com (Joel Jaeggli)
Date: Sat, 22 Dec 2007 14:17:00 -0800
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <20071222011449.GB3940@bx9.net>
References: <1198249574.6128.24.camel@Loki> <476C565E.7020200@cse.ucdavis.edu>
	<20071222011449.GB3940@bx9.net>
Message-ID: <476D8CDC.3000906@bogus.com>

Greg Lindahl wrote:
> On Fri, Dec 21, 2007 at 04:12:14PM -0800, Bill Broadley wrote:
> 
>> I suspect cheap motherboards will not allow 8x2GB
>> (for the same memory per core with a quad core).
> 
> Having just bought some single-socket Intel nodes, I hear that there
> are currently no single-socket mobos which support > 8 gigs.

single socket intel chipsets don't support fb dimms (or registered for
that matter) so you don't get either huge capacity dimms or a large
number of sockets. not to say that one couldn't be built, but it isn't
(and fbdimms are needlessly expensive and power hungry for most desktop
applications).

> -- greg 


From john.hearns at streamline-computing.com  Sun Dec 23 01:11:10 2007
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun, 23 Dec 2007 09:11:10 +0000
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
Message-ID: <1198401080.5734.8.camel@Vigor13>

On Sat, 2007-12-22 at 10:20 -0500, Robert G. Brown wrote:

> 
> I personally hope they do it, although I do think that we're about to go
> through yet another paradigm shift.  With flash coming down to around
> $10/GB or less wholesale in sizes up to 16 GB, I think we'll start
> seeing pure flash-boot systems appear any day now.  As in systems with
> built-in 4 GB flash memory holding the basic OS installation, systems
> with 8 GB flash memory holding the OS and several GB of userspace.  A
> whole new kind of thin.

http://www.asus.com/products.aspx?l1=24&l2=0&l3=0&l4=0&model=1907&modelmenu=1

A friend has the all black model. He brought it forth in the pub the
other night, to the universal admiration of assembled geeks. Shiny.
Nice. No wi-fi in the Jerusalem Tavern to test it with though:
http://www.stpetersbrewery.co.uk/london/default.htm
at 300 years old the electric light is a new-fangled invention.


Maybe a bit too late for the North Pole order tracking system to get it
into the sleigh loading bill for your neck of the woods though.
And have you been a good boy this year?


ps. if anyone's interested, these are being rebadged by Research Machines
in the UK, which are a major supplier to the schools market.


From john.hearns at streamline-computing.com  Sun Dec 23 01:21:13 2007
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sun, 23 Dec 2007 09:21:13 +0000
Subject: [Beowulf] Building a new cluster - seeking some advice
In-Reply-To: <Pine.LNX.4.64.0712221244180.20634@coffee.psychology.mcmaster.ca>
References: <1198249574.6128.24.camel@Loki>
	<Pine.LNX.4.64.0712212128510.18390@coffee.psychology.mcmaster.ca>
	<957c99090712212212v3717234fwfe78ec2b015f54df@mail.gmail.com>
	<476D05DE.7080809@charter.net>
	<Pine.LNX.4.64.0712220836340.5940@lilith.rgb.private.net>
	<476D1B5B.6030705@charter.net>
	<Pine.LNX.4.64.0712220945520.5940@lilith.rgb.private.net>
	<Pine.LNX.4.64.0712221244180.20634@coffee.psychology.mcmaster.ca>
Message-ID: <1198401683.5734.15.camel@Vigor13>

On Sat, 2007-12-22 at 13:01 -0500, Mark Hahn wrote:

> PATA/SATA-interface flash is accelerating, I think.  Intel just introduced
> a building block for that, and other vendors have had somewhat obscure 
> products out for a long time.  a flash-based "PATA-fob" seems reasonably
> secure to me for this kind of minimal case.  2.5 and 3.5" form-factors for 
> larger flash-based disks are also popular.

Talking about 'PATA fobs' a project I have in mind over Christmas is to
use a compact-flash to IDE adapter to boot Damn Small Linux 
http://damnsmalllinux.org/  on a mini-ITX board.
It has xmms built in.
I would like to build a wi-fi radio from scratch, as I have all the
bits. Won't be as slick as buying one off the shelf, but will be more
fun.


From jiteshbdundas at gmail.com  Sun Dec 23 08:14:21 2007
From: jiteshbdundas at gmail.com (jitesh dundas)
Date: Sun, 23 Dec 2007 21:44:21 +0530
Subject: [Beowulf] How to handle multi and seperate processors?
Message-ID: <a609fe70712230814x3dc203sb983084086a0673c@mail.gmail.com>

Dear All,

Can you tell me how to connect different processors of different types,
running in parallel and serial modes
individually?
Each of these machines handles its own tasks and also handles a module or
part of the larger group task.

Do you think we can deploy parallel computing or use Beowulf in this case?

Thanks,
Regards,
Jitesh

From hahn at mcmaster.ca  Sun Dec 23 14:23:53 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sun, 23 Dec 2007 17:23:53 -0500 (EST)
Subject: [Beowulf] How to handle multi and seperate processors?
In-Reply-To: <a609fe70712230814x3dc203sb983084086a0673c@mail.gmail.com>
References: <a609fe70712230814x3dc203sb983084086a0673c@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0712231557250.13742@coffee.psychology.mcmaster.ca>

> Can you tell me how to connect different processors of different types,

it is unusual to connect processors of different types.  but it doesn't
really change anything if processors differ in type, model, speed.

> running in parallel and serial modes
> individually?

I'm not sure what you mean by "parallel mode" - processors are inherently
serial devices.  (processors often temporally switch between different 
threads of serial execution, and it's increasingly common to put multiple
processors on the same chip or in the same package.  even the still exotic
SMT type of processor is executing multiple serial threads, though with 
some (still temporally exclusive) sharing of processor resources.)

> Each of these machines handles its own tasks and also handles a module or
> part of the larger group task.

but that is the norm.  the closest to "parallel mode" would be when a 
processor decides to run a particular thread of execution which differs
from other threads by either a thread-id or MPI rank id.  it will normally
also be operating on at least partially different data from other threads.
the basic distinction in parallel programming is whether the threads 
assume a shared/common memory address space, or whether they only interact
by sending messages to each other.  you can implement one using the other,
but both hardware and aspects of the workload can make one or the other 
more appealing/efficient/etc.
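
The two models can be illustrated with a toy sum, for what it's worth.
This is only a sketch: both versions use Python threads inside one
process, so the "message passing" version merely imitates the discipline
of MPI (workers interact only through messages) rather than using MPI
itself. In both, a rank id selects which part of the data a worker
handles.

```python
import queue
import threading


def shared_memory_sum(data, workers):
    """Workers add their partial sums into one shared variable."""
    total = [0]
    lock = threading.Lock()

    def work(rank):
        partial = sum(data[rank::workers])  # rank picks this worker's slice
        with lock:                          # shared state needs locking
            total[0] += partial

    threads = [threading.Thread(target=work, args=(rank,))
               for rank in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(data, workers):
    """Workers get a private copy of their slice and send back a result;
    the only interaction is through the results queue."""
    results = queue.Queue()

    def work(rank, chunk):
        results.put((rank, sum(chunk)))

    threads = [
        threading.Thread(target=work, args=(rank, list(data[rank::workers])))
        for rank in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get()[1] for _ in range(workers))


numbers = list(range(100))
print(shared_memory_sum(numbers, 4))    # 4950
print(message_passing_sum(numbers, 4))  # 4950
```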

> Do you think we can deploy parallel computing or use Beowulf in this case?

beowulf just means "a message-passing cluster composed of commodity parts,
(usually Linux + PCs + MPI)".  nothing that really implies that all the 
processors have to be identical, even the same architecture.  you could 
build a multi-architecture shared-memory system as well, but it would be 
a significant challenge (consider gp-gpu as a multi-arch shm machine...)

regards, mark hahn.


From mwill at penguincomputing.com  Mon Dec 24 12:15:55 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Mon, 24 Dec 2007 12:15:55 -0800
Subject: [Beowulf] How to handle multi and seperate processors?
Message-ID: <433093DF7AD7444DA65EFAFE3987879C33DEDF@orca.penguincomputing.com>

Yes you can.

The more diverse you make it, the more involved administering and configuring the job scheduler becomes.

What did you have in mind in terms of cpu types?

It is not uncommon to have a few fat nodes with lots of RAM and cores for the SMP-style apps and more cost-effective nodes for applications that use MPI for message passing between nodes, as well as for serial jobs.

You might come across the term 'embarrassingly parallel', which means you have a serial program that you want to run many times with different input data - you can batch queue a thousand of them up and a cluster with, say, 32 dual-CPU, dual-core nodes will be able to process 128 at a time.

The only catch to look out for is that you don't want to schedule an MPI job that runs across several machines, some of which are slower and make the others wait. So you are best off grouping identical nodes together and scheduling/queueing jobs to those groups instead of the whole cluster.
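
A small sketch of both points: the slot count for the 32-node example,
and grouping nodes by hardware so that each group can back its own
queue. The node names, CPU models and clock speeds are hypothetical, and
how the groups are fed to a particular job scheduler is left out.

```python
from collections import defaultdict


def total_cores(nodes, sockets_per_node, cores_per_socket):
    """Number of serial jobs the cluster can run at once, one per core."""
    return nodes * sockets_per_node * cores_per_socket


def group_by_hardware(nodes):
    """Map (cpu_model, ghz) to the sorted list of node names with that
    hardware, so MPI jobs can be confined to one homogeneous group."""
    groups = defaultdict(list)
    for name, cpu_model, ghz in nodes:
        groups[(cpu_model, ghz)].append(name)
    return {key: sorted(names) for key, names in groups.items()}


print(total_cores(32, 2, 2))  # 128

# Hypothetical inventory: (node name, CPU model, clock in GHz).
inventory = [
    ("n01", "opteron", 2.4),
    ("n02", "opteron", 2.4),
    ("n03", "xeon", 3.0),
]
print(group_by_hardware(inventory))
```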

You can also have InfiniBand on only part of your cluster that way, if you shy away from the expense of equipping all machines.

Michael Will


 -----Original Message-----
From: 	jitesh dundas [mailto:jiteshbdundas at gmail.com]
Sent:	Sunday, December 23, 2007 12:24 PM Pacific Standard Time
To:	Beowulf at beowulf.org
Subject:	[Beowulf] How to handle multi and seperate processors?

Dear All,

Can you tell me how to connect different processors of different types,
running in parallel and serial modes
individually?
Each of these machines handles its own tasks and also handles a module or
part of the larger group task.

Do you think we can deploy parallel computing or use Beowulf in this case?

Thanks,
Regards,
Jitesh

From kalpana0611 at gmail.com  Thu Dec 27 09:33:49 2007
From: kalpana0611 at gmail.com (Kalpana Kanthasamy)
Date: Fri, 28 Dec 2007 01:33:49 +0800
Subject: [Beowulf] Building a 2 node cluster using mpich
Message-ID: <b05971d10712270933q1364bacei762993e1d02acfc6@mail.gmail.com>

Hi guys, I am a beginner with Linux and also with clusters, but I really
need to experiment with this for my project. Anyway, I have documented
what I have done so far, but I got stuck after a certain point... Let me
explain what I have done.

After searching through the internet for a few days, I decided to use

http://blizzard.rwic.und.edu/~nordlie/deuce/
http://www.mcsr.olemiss.edu/bookshelf/articles/how_to_build_a_cluster.html

1.Installed a Linux distribution (I am using openSUSE on both
computers in the cluster).

2.During the installation process, assign hostnames and of course,
unique IP addresses for each node in your cluster, gateway is the
router. Hostname - localhost, domain - localdomain


3.Cluster is private. I have used IP address 192.168.0.190 for the
master node and 192.168.0.191 for the slave node.

4.Finally, create identical user accounts on each node. In our case,
we create the user DevArticle on each node in our cluster. You can
either create the identical user accounts during installation, or you
can use the adduser command as root.


Configuration on all nodes

On all nodes
5.We now need to configure rsh on each node in our cluster. Create
.rhosts files in the user and root directories. Our .rhosts files for
the DevArticle users are as follows:
Master DevArticle
Slave DevArticle

The .rhosts files for root users are as follows:

Master root
Slave root


On all nodes
6.Next, I modified the etc/hosts.equiv file, the same thing both in
Master and Slave

192.168.0.190 Master.localhost.localdomain Master
127.0.0.1          localhost
192.168.0.191  Slave.localhost.localdomain Slave
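
Assuming the three lines above are meant as name-to-address entries in
the usual hosts(5) layout (address, canonical name, aliases), a quick
sanity check that every name maps to the intended address could look
like this. The parser is a simplified sketch, not a full implementation
of the format.

```python
def parse_hosts(text):
    """Map every hostname and alias to its address. Comments and blank
    lines are skipped; later entries override earlier ones."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping[name] = address
    return mapping


entries = """
192.168.0.190 Master.localhost.localdomain Master
127.0.0.1          localhost
192.168.0.191  Slave.localhost.localdomain Slave
"""
table = parse_hosts(entries)
print(table["Master"])  # 192.168.0.190
print(table["Slave"])   # 192.168.0.191
```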


7.Do not remove the 127.0.0.1 localhost line. The hosts.allow files on
each node was modified by adding ALL+ as the only line in the file.
This allows anyone on any node permission to connect to any other node
in our private cluster.


On all nodes
8.To allow root users to use rsh, I had to add the following lines to
the /etc/securetty file:

rsh
rlogin
rexec

pts/0
pts/1


On all nodes
9.Also, I modified the /etc/pam.d/rsh file:
#%PAM-1.0
# For root login to succeed here with pam_securetty, "rsh" must be
# listed in /etc/securetty.
auth       sufficient   /lib/security/pam_nologin.so
auth       optional     /lib/security/pam_securetty.so
auth       sufficient   /lib/security/pam_env.so
auth       sufficient   /lib/security/pam_rhosts_auth.so
account  sufficient   /lib/security/pam_stack.so service=system-auth
session   sufficient   /lib/security/pam_stack.so service=system-auth

On all nodes
Rsh, rlogin, Telnet and rexec are disabled by default. To change this,
I navigated to the /etc/xinetd.d directory and modified each of the
command files (rsh, rlogin, telnet and rexec), changing the disable =
yes line to disable = no.

Once the changes were made to each file (and saved), I closed the
editor and issued the following command:

Turn on the rsh daemon using the chkconfig command: chkconfig rsh on
1.To check the rsh daemon's status, run the chkconfig command:
chkconfig --list rsh
2.Run the /etc/rc.d/xinetd restart command.
3.Restart xinetd with /sbin/service xinetd restart


The Mounting Process

On the Master node
I edited /etc/exports using the YaST -> NFS Server tool. I then
double-checked my /etc/exports file; this is how it looks:

/home		192.168.1.190/255.255.255.0(rw,no_root_squash)
/usr/local	192.168.1.190/255.255.255.0(rw,no_root_squash)
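
As a cross-check on the export entries above: they name
192.168.1.190/255.255.255.0, whereas the nodes were given 192.168.0.190
and 192.168.0.191. The sketch below, a POSIX shell helper with
illustrative names (ip_to_int, in_subnet), tests whether a client
address lies in the network that an export entry describes. Whether
this mismatch contributes to the problem described here is not
established; it is only something worth checking.

```shell
# ip_to_int A.B.C.D
# Prints the dotted-quad address as a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_subnet CLIENT_IP NETWORK_IP NETMASK
# Succeeds if CLIENT_IP and NETWORK_IP agree on all bits set in NETMASK.
in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

if in_subnet 192.168.0.191 192.168.1.190 255.255.255.0; then
    echo "client is inside the exported network"
else
    echo "client is outside the exported network"
fi
```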


On the Slave node
I edited /etc/fstab using the YaST -> NFS Client tool. I then
double-checked my /etc/fstab file; this is how it looks:
----------------------------------------------------------------------------------------------------------
/dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part6	/	ext3	acl,user_xattr 1 1
/dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part1	/windows/C	ntfs-3g	users,gid=users,fmask=133,dmask=022,locale=en_US.UTF-8 0 0
/dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part5	swap	swap	defaults 0 0
proc	/proc	proc	defaults 0 0
sysfs	/sys	sysfs	noauto 0 0
debugfs	/sys/kernel/debug	debugfs	noauto 0 0
usbfs	/proc/bus/usb	usbfs	noauto 0 0
devpts	/dev/pts	devpts	mode=0620,gid=5 0 0
/dev/fd0	/media/floppy	auto	noauto,user,sync 0 0
Master:/home	/home	nfs	rw 0 0
Master:/usr/local	/usr/local	nfs	ro 0 0


I also changed the /etc/mtab file, following the mpich documentation:

----------------------------------------------------------------------------------------------------------
/dev/sda5 / ext3 rw,acl,user_xattr 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
debugfs /sys/kernel/debug debugfs rw 0 0
udev /dev tmpfs rw 0 0
devpts /dev/pts devpts rw,mode=0620,gid=5 0 0
/dev/sda1 /windows/C fuseblk rw,noexec,nosuid,nodev,noatime,allow_other,default_permissions,blksize=4096 0 0
securityfs /sys/kernel/security securityfs rw 0 0
nfsd /proc/fs/nfsd nfsd rw 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0

Master:/home /rmt/Master/home nfs noac 0 0

Master:/usr/local /rmt/Master/usr/local nfs noac 0 0 0
-----------------------------------------------------------------------------------------------------------

After that I did this

On each node, type ifconfig and make sure that the machine has its
appropriate interior IP address. (Such as 192.168.0.X).
On each node, go to /etc/rc.d and type ./network stop.
On the master node, also type ../nfs stop
On the master node, type ../nfs start
On each node, type ../network start.


I believe the mounts are set up properly, because I made sure I
followed the websites. I could also access the files from the slave
machine.


I could ping both machines, and if I type
Master:/ # rsh Slave
Master:/ # ls -a
or
Slave:/ # rsh Master
Slave:/ # ls -a


it works on both machines, and when I then type ls -a I can see the
files. It is only when I give rsh a full command that it fails with
permission denied. I emptied my hosts.allow and hosts.deny files on
both Master and Slave.


But when I type commands like
Master:/ # rsh Slave date
Master:/ # permission denied

or

Master:/ # rsh Master pwd
Master:/ # permission denied

OK, here is where I am stuck: I tried installing mpich, but neither
rsh nor ssh was detected during configuration (permission denied). I
think it is something to do with my NFS. Any ideas, guys?


From james.p.lux at jpl.nasa.gov  Thu Dec 27 11:31:17 2007
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 27 Dec 2007 11:31:17 -0800
Subject: [Beowulf] Building a 2 node cluster using mpich
In-Reply-To: <b05971d10712270933q1364bacei762993e1d02acfc6@mail.gmail.com>
References: <b05971d10712270933q1364bacei762993e1d02acfc6@mail.gmail.com>
Message-ID: <20071227113117.t32cnxwm808gow08@webmail.jpl.nasa.gov>

Quoting Kalpana Kanthasamy <kalpana0611 at gmail.com>, on Thu 27 Dec 2007  
09:33:49 AM PST:

>
> 3.Cluster is private. I have used IP address 192.168.0.190 for the
> master node and 192.168.0.191 for the slave node.

I prefer to avoid any .zero addresses; 192.168.1.190 would be my choice,
just because a lot of consumer equipment uses the 192.168.1.x range by
default (e.g. that firewall, wireless access point, etc.).


Although, looking over some notes, I see I use 10.0.0.x a lot, too.


Jim Lux


From reuti at staff.uni-marburg.de  Sun Dec 30 14:33:27 2007
From: reuti at staff.uni-marburg.de (Reuti)
Date: Sun, 30 Dec 2007 23:33:27 +0100
Subject: [Beowulf] Building a 2 node cluster using mpich
In-Reply-To: <b05971d10712270933q1364bacei762993e1d02acfc6@mail.gmail.com>
References: <b05971d10712270933q1364bacei762993e1d02acfc6@mail.gmail.com>
Message-ID: <2D1ECDD5-85D7-4A02-B49F-3BEE4D9CCB93@staff.uni-marburg.de>

Hi,

On 27.12.2007 at 18:33, Kalpana Kanthasamy wrote:

> Hi guys, I am a beginner in linux and also for cluster, but I really
> need to experiment this for my project. Anyway I have documented what
> I have done so far, but I got stuck after a certain point... Let me
> explain what I have done
>
> After searching through the internet for a few days, I decided to use
>
> http://blizzard.rwic.und.edu/~nordlie/deuce/
> http://www.mcsr.olemiss.edu/bookshelf/articles/how_to_build_a_cluster.html
>
> 1.Installed a Linux distribution (I am using Open Suse on each
> computer in both computers in the cluster).
>
> 2.During the installation process, assign hostnames and of course,
> unique IP addresses for each node in your cluster, gateway is the
> router. Hostname - localhost, domain - localdomain
>
>
> 3.Cluster is private. I have used IP address 192.168.0.190 for the
> master node and 192.168.0.191 for the slave node.
>
> 4.Finally, create identical user accounts on each node. In our case,
> we create the user DevArticle on each node in our cluster. You can
> either create the identical user accounts during installation, or you
> can use the adduser command as root.

Better to use NIS (or LDAP), so you only have to define the users once.

>
>
> Configuration on all nodes
>
> On all nodes
> 5.We now need to configure rsh on each node in our cluster. Create
> .rhosts files in the user and root directories. Our .rhosts files for
> the DevArticle users are as follows:
> Master DevArticle
> Slave DevArticle
>
> The .rhosts files for root users are as follows:
>
> Master root
> Slave root
>
>
>
> On all nodes
> 6.Next, I modified the etc/hosts.equiv file, the same thing both in
> Master and Slave
>
> 192.168.0.190 Master.localhost.localdomain Master
> 127.0.0.1          localhost
> 192.168.0.191  Slave.localhost.localdomain Slave

Only the hostnames belong there, hence just two lines:

Master
Slave
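
A quick way to spot the difference, sketched under the assumption of a
POSIX shell with grep -E available. The helper name
find_hosts_style_lines is illustrative. It prints any line that starts
with an IP address, i.e. a line that looks like an /etc/hosts entry
rather than a hosts.equiv entry.

```shell
# find_hosts_style_lines FILE
# Prints lines that begin with a dotted-quad IP address. A clean
# hosts.equiv of the form shown above produces no output.
find_hosts_style_lines() {
    grep -E '^[[:space:]]*[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[[:space:]]' "$1"
}

# Example:
#   find_hosts_style_lines /etc/hosts.equiv
```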

>
> 7.Do not remove the 127.0.0.1 localhost line. The hosts.allow files on
> each node was modified by adding ALL+ as the only line in the file.
> This allows anyone on any node permission to connect to any other node
> in our private cluster.
>
>
>
> On all nodes
> 8.To allow root users to use rsh, I had to add the following lines to
> the /etc/securetty file:
>
> rsh
> rlogin
> rexec
>
> pts/0
> pts/1
>
>
> On all nodes
> 9.Also, I modified the /etc/pam.d/rsh file:
> #%PAM-1.0
> # For root login to succeed here with pam_securetty, "rsh" must be
> # listed in /etc/securetty.
> auth       sufficient   /lib/security/pam_nologin.so
> auth       optional     /lib/security/pam_securetty.so

You can try commenting out the line above.
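
A minimal sketch of that change, assuming a POSIX shell; the helper
name comment_out_securetty is illustrative. It prefixes the active
pam_securetty.so line with "#" and leaves existing comments alone, so
running it twice does no harm.

```shell
# comment_out_securetty FILE
# Comments out any active line that references pam_securetty.so.
comment_out_securetty() {
    file=$1
    tmp=$(mktemp) || return 1
    status=0
    # Skip lines that are already comments, then prefix the match with "#".
    sed '/^[[:space:]]*#/!s/^\(.*pam_securetty\.so\)/#\1/' \
        "$file" > "$tmp" && cat "$tmp" > "$file" || status=1
    rm -f "$tmp"
    return $status
}

# Example (as root, after keeping a backup copy of the file):
#   comment_out_securetty /etc/pam.d/rsh
```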

> auth       sufficient   /lib/security/pam_env.so
> auth       sufficient   /lib/security/pam_rhosts_auth.so
> account  sufficient   /lib/security/pam_stack.so service=system-auth
> session   sufficient   /lib/security/pam_stack.so service=system-auth
>
> On all nodes
> Rsh, rlogin, Telnet and rexec are disabled by default. To change this,
> I navigated to the /etc/xinetd.d directory and modified each of the
> command files (rsh, rlogin, telnet and rexec), changing the disabled =
> yes line to disabled = no.
>
> Once the changes were made to each file (and saved), I closed the
> editor and issued the following command:
>
> Turn on the rsh daemon using the chkconfig command: chkconfig rsh on
> 1.To check the rsh daemon's status, run the chkconfig command:
> chkconfig --list rsh
> 2.Run the /etc/rc.d/xinetd restart command.
> 3.Restart xinetd with /sbin/service xinetd restart
>
>
>
> The Mounting Process
>
> On the Master node
> I edited the etc/exports
>
> This is how my file is, I used the YAST -> NFS server tool. I then
> double checked my etc/exports file, this is how it looks
>
> /home		192.168.1.190/255.255.255.0(rw,no_root_squash)
> /usr/local	192.168.1.190/255.255.255.0(rw,no_root_squash)
>
>
> On the Slave node
> I edited the etc/fstab
>
> This is how my file is, I used the YAST -> NFS client tool. I then
> double checked my etc/fstab file, this is how it looks
> ----------------------------------------------------------------------------------------------------------
> /dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part6	/	ext3	acl,user_xattr 1 1
> /dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part1	/windows/C	ntfs-3g	users,gid=users,fmask=133,dmask=022,locale=en_US.UTF-8 0 0
> /dev/disk/by-id/scsi-SATA_WDC_WD800VE-00H_WD-WXEZ06F66679-part5	swap	swap	defaults 0 0
> proc	/proc	proc	defaults 0 0
> sysfs	/sys	sysfs	noauto 0 0
> debugfs	/sys/kernel/debug	debugfs	noauto 0 0
> usbfs	/proc/bus/usb	usbfs	noauto 0 0
> devpts	/dev/pts	devpts	mode=0620,gid=5 0 0
> /dev/fd0	/media/floppy	auto	noauto,user,sync 0 0
> Master:/home	/home	nfs	rw 0 0
> Master:/usr/local	/usr/local	nfs	ro 0 0
>
>
>
>
> I also changed this etc/mtab file, according to the mpich documentation

I would never change /etc/mtab by hand, as it is maintained by the
system (mount and umount update it). Where does the mpich documentation
say to touch it?

> ----------------------------------------------------------------------------------------------------------
> /dev/sda5 / ext3 rw,acl,user_xattr 0 0
> proc /proc proc rw 0 0
> sysfs /sys sysfs rw 0 0
> debugfs /sys/kernel/debug debugfs rw 0 0
> udev /dev tmpfs rw 0 0
> devpts /dev/pts devpts rw,mode=0620,gid=5 0 0
> /dev/sda1 /windows/C fuseblk rw,noexec,nosuid,nodev,noatime,allow_other,default_permissions,blksize=4096 0 0
> securityfs /sys/kernel/security securityfs rw 0 0
> nfsd /proc/fs/nfsd nfsd rw 0 0
> rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
>
> Master:/home /rmt/Master/home nfs noac 0 0
>
> Master:/usr/local /rmt/Master/usr/local nfs noac 0 0 0
> -----------------------------------------------------------------------------------------------------------
>
> After that I did this
>
> On each node, type ifconfig and make sure that the machine has its
> appropriate interior IP address. (Such as 192.168.0.X).
> On each node, go to /etc/rc.d and type ./network stop.
> On the master node, also type ../nfs stop
> On the master node, type ../nfs start On each node, type ../network start.
>
>
> I guess I mounted properly rite, cause I made sure I followed the
> websites..I could access the files from the slave machines also
>
>
>  I could ping both machines, and if I type
> Master:/ # rsh Slave
> Master:/ # ls -a
> or
> Slave:/ # rsh Master
> Slave:/ # ls -a
>
>
> works on both the machine, and then when I type ls -a, I get to see
> the files, but its when I type a full command like this, it fails, and
> permission denied appears. I emptied my host. allow and host. deny
> files in both Master and Slave.
>
>
>
> But when I type commands like
> Master:/ # rsh Slave date
> Master:/ # permission denied
>
> or
>
> Master:/ # rsh Master pwd
> Master:/ # permission denied
>
> Ok, here is where I am stuck, cause I tried installing mpich but
> during both rsh and ssh were not detected during configuration,
> permission denied, I think its something to with my NFS, any idea.

There is no need to allow it for root at all. Is it working for a  
normal user? Then you can already run parallel programs.
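
A minimal sketch of that check, assuming a POSIX shell; the helper name
check_remote_shell is illustrative. It runs a non-interactive command
on each host, which is the case that failed above, and is meant to be
run as the ordinary cluster user rather than as root.

```shell
# check_remote_shell CMD HOST...
# Runs "CMD HOST date" for each host and reports whether it succeeded.
# CMD is the remote shell to test, for example rsh.
check_remote_shell() {
    cmd=$1
    shift
    failed=0
    for host in "$@"; do
        if "$cmd" "$host" date < /dev/null > /dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Example, as the DevArticle user on the master node:
#   check_remote_shell rsh Master Slave
```

If every host reports ok for the normal user, root access over rsh is
not needed for launching parallel jobs.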

-- Reuti