[Beowulf] s_update() missing from AFAPI ?

Sat Oct 16 15:15:14 PDT 2004

Hello Andrew (and the Beowulf list as well),
You ask some very good questions, and you asked the
person who should know the answers.  Hopefully my
answers below make sense.

On Sat, 16 Oct 2004 15:01:34 -0400, Andrew Piskorski <atp at piskorski.com> wrote:
> The old 1997 paper by Dietz, Mattox, and
> Krishnamurthy, "The Aggregate Function API: It's
> Not Just For PAPERS Anymore", briefly mentions
> that their AFAPI library also supports, "fully
> coherent, polyatomic, replicated shared memory". 
> It even gives a little chart showing how many
> microseconds their s_update() function takes to
> update that shared memory.
> 
> That sounds interesting (even given the extremely
> low bandwidth of the PAPERS hardware, etc.), but,
> no such function exists in the last 1999-12-22
> AFAPI release!  s_update() just isn't in there at
> all. Why?  Tim M., I know you follow the Beowulf
> list, so could you fill us in a bit on what
> happened there?

The s_update() function went away because we changed
the underlying implementation of the "asynchronous"
s_ routines.  The new approach had a hardware limit
of 3 "fast" signals, and we deemed that it was best
to not hard code any of those for this rarely used
shared memory functionality. We had intended to
supply a routine that replaced the functionality of
s_update() that you could install as one of the 3
signal handlers if you chose to.  I'm not sure why
that code wasn't released.

But over time it became a moot point: processors
got so much faster that the busy-wait/polling scheme
we were using for the s_ routines made it very
difficult to get any speedup using the equivalent of
the s_update() routine.

With the parallel port not actively causing an
interrupt, all the nodes had to poll for pending s_
operations.  Going from the 486 to the Pentium
dramatically changed the cost of this polling
operation relative to general computation.

Basically, the Pentium and later processors were
slowed down so dramatically whenever you would do a
single IO space read (the polling function) to see
if any pending shared memory operations needed to be
dealt with, that it was difficult to get any
speedup, even with only two processors.  On the
testing codes I wrote at the time, it was hard to
find the right balance for how frequently to poll. 
If you polled too frequently, the Pentium was slowed
down to a crawl on purely local operations.  We
speculated that the IO instruction caused a flush of
the Pentium's pipeline, but we didn't explore it in
great detail.  Also, if you polled too infrequently,
the shared memory operations were stalled for long
periods of time, causing the other processor(s) to
sit idle waiting to get their shared memory writes
processed.
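
To make that trade-off concrete, here is a rough
back-of-the-envelope model in Python.  It is only a
sketch, not AFAPI code: the function names, the work
size, and the poll cost are all made-up assumptions
for illustration, not measurements from our tests.

```python
def local_time(work_units, poll_interval, poll_cost, unit_cost=1.0):
    """Time one node spends on purely local work, polling once
    every `poll_interval` work units (all values hypothetical)."""
    polls = work_units // poll_interval
    return work_units * unit_cost + polls * poll_cost


def mean_remote_stall(poll_interval, unit_cost=1.0):
    """Average wait before a remote write is noticed, assuming
    requests arrive uniformly at random between two polls."""
    return poll_interval * unit_cost / 2.0


if __name__ == "__main__":
    WORK = 100_000      # hypothetical units of local computation
    POLL_COST = 50.0    # hypothetical cost of one IO-space read
    for interval in (1, 10, 100, 1000, 10000):
        slowdown = local_time(WORK, interval, POLL_COST) / WORK
        stall = mean_remote_stall(interval)
        print(f"poll every {interval:>5} units: "
              f"local slowdown {slowdown:5.2f}x, "
              f"mean remote stall {stall:7.1f} units")
```

In this toy model, shrinking the interval drives the
local slowdown up, while growing it drives the remote
stall up, which is the balance that was hard to find.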

Yes, the performance numbers in the LCPC 1997 paper
are measured on a 4 node Pentium cluster, but I
don't think we had time yet to play with "real"
codes that used the s_update routine on a Pentium
cluster.  That was a long time ago, so I might not
be remembering this part very well.  But I do
remember that once we had more time to play with it
on Pentiums, it was clear that no performance
critical codes would be using the s_update routine,
much less any of the s_ routines as far as we could
tell. So, that is why the s_update routine was
pulled from the library, to free up the signaling
slot for potentially more useful things.

>   http://aggregate.org/TechPub/lcpc97.html
>   http://aggregate.org/AFAPI/AFAPI_19991222.tgz
> 
> While I'm at it I might as well ask this too: 
> That same old PAPERS paper says "UDPAPERS", using
> Ethernet and UDP, was implemented, but it doesn't
> seem to be in the AFAPI release either.  What
> happened with that?

The UDPAPERS code was being worked on by a colleague
of mine for his parallel file system work, and
unfortunately for the rest of us, he only
implemented the minimum amount of functionality that
he needed for his project, not the full AFAPI.  Back
in 1999 I had hoped to have time to finish it off
myself, but it wasn't my top priority, and if you
have followed our work, the KLAT2 cluster in the
spring of 2000 brought in some much more interesting
new ideas with the FNN stuff.

> Did it work?

Yes, to some degree, but there were still some
important corner cases (certain packet loss
scenarios) that hadn't been dealt with, and as I
said, the full AFAPI wasn't implemented, just a few
basic routines.

>  As well as the custom PAPERS hardware?

No, not as well as the custom hardware.  Speaking of
which: The custom PAPERS hardware has had some
additional work since we last published on it.  But
due to changing priorities, it has been sitting
waiting for the next bright student or two to revive
it for more modern IO ports (USB, Firewire, ???).
You can see the last parts list and board layouts
here: http://aggregate.org/AFN/000601/
Unfortunately, the assembly documentation for that
board was never written.  It's a "small change" from
the PAPERS 960801 board, but enough that if you
don't know what each thing is intended for, you
might not get it right.  That's why we haven't
posted a public link to the 000601 board design
(until now).  We almost made a 12 port version of
the PCB, but again, the student involved on that
finished their project, and the design hasn't been
validated, so it's not been sent out to a PCB fab to
be built.  As a group we decided it would be better
to find students interested in doing a new design
that used more modern IO ports than the parallel
printer port.  Know anyone interested in a Masters
project where they have to build hardware that
actually works? ;-) Academically, it's hard to make
such a thing a Ph.D. topic, because it's mostly just
"implementation/development" at this point, with
little "academic" research.

> If so, how?  Dirt cheap 10/100 cards and UTP cable
> would certainly be a lot more convenient than
> custom PAPERS hardware for anyone wanting to
> experiment with the AFAPI stuff, but I'm confused
> about what part of the ethernet network could be
> magically made to act as the NAND gate for the
> aggregate operations.

Yep, no NAND gate in the ethernet...

>  Did it need to use some particular programmable
>  ethernet switch?  Or the aggregate operations
>  were actually done on each of the nodes?

Yeah, the aggregate operations were actually
performed within each node on local copies of the
data from all the nodes. The basic idea was to have
each node send its new data along with all the known
data from anyone else for the current (and previous)
operation with a UDP broadcast/multicast.
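
As a rough illustration of that idea, here is a small
in-memory simulation in Python.  It is a sketch under
stated assumptions, not the UDPAPERS code or its wire
format: the Node class, the aggregate() helper, the
loss model, and the retry-until-complete loop are all
hypothetical, and the "broadcast" is a plain loop
rather than real UDP sockets.

```python
import random
from functools import reduce


class Node:
    """One cluster node holding local copies of everyone's data."""

    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.n_nodes = n_nodes
        self.known = {}  # op_id -> {node_id: value}

    def contribute(self, op_id, value):
        self.known.setdefault(op_id, {})[self.node_id] = value
        # Keep only the current and the previous operation.
        for old in [k for k in self.known if k < op_id - 1]:
            del self.known[old]

    def packet(self, op_id):
        # Own data plus everything known from other nodes, copied
        # so that receivers cannot alter the sender's tables.
        return {k: dict(v) for k, v in self.known.items()
                if k >= op_id - 1}

    def receive(self, packet):
        for op_id, values in packet.items():
            self.known.setdefault(op_id, {}).update(values)

    def complete(self, op_id):
        return len(self.known.get(op_id, {})) == self.n_nodes

    def result(self, op_id, combine):
        # The aggregate is computed locally, in node order.
        values = self.known[op_id]
        return reduce(combine, (values[i] for i in range(self.n_nodes)))


def aggregate(nodes, op_id, inputs, combine,
              loss_rate=0.3, rng=None, max_rounds=1000):
    """Run one aggregate operation over a simulated lossy broadcast."""
    rng = rng or random.Random(0)
    for node, value in zip(nodes, inputs):
        node.contribute(op_id, value)
    for _ in range(max_rounds):
        if all(n.complete(op_id) for n in nodes):
            return [n.result(op_id, combine) for n in nodes]
        for sender in nodes:
            pkt = sender.packet(op_id)
            for receiver in nodes:
                # Each delivery is dropped with probability loss_rate.
                if receiver is not sender and rng.random() >= loss_rate:
                    receiver.receive(pkt)
    raise RuntimeError("aggregate did not complete")


if __name__ == "__main__":
    nodes = [Node(i, 4) for i in range(4)]
    print(aggregate(nodes, 1, [1, 2, 4, 8], lambda a, b: a | b))
```

Because every packet repeats what the sender already
knows about the other nodes, a node that missed one
broadcast can still fill in its table from a later
packet sent by somebody else.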

Just this semester we finally have a new student
working on a UDP/Multicast implementation of
AFAPI... or something like it. They are just now
getting up to speed on things, so don't hold your
breath.  Also, it's unlikely we would actually
target a new AFAPI release.  With the dominance of
MPI, it would only make sense to build such a thing
for use as a module for LAM-MPI or the new OpenMPI.

I hope this answers your questions, but if not, feel
free to ask more.  I am busy with my own FNN
dissertation work now (plus Warewulf), so I won't be
working on AFN/AFAPI/PAPERS stuff to any degree
until my Ph.D. is finished.
-- 
Tim Mattox - tmattox at gmail.com
http://homepage.mac.com/tmattox/