Network RAM for Beowulf

Sat Aug 25 16:04:27 PDT 2001

> > > > >    we as a group of four students are *also* thinking
> > > > > of implementing Network RAM for a beowulf cluster
> > > > > (assuming 100Mbps Ethernet ) whereby each node in the
> > > 
> > > I'm always puzzled why people want to keep trying this:
> > > have you considered the fairly breathtaking latency of 
> > > "sharing" pages over a net?  do you really have apps that can 
> > > tolerate that kind of latency?
> > 
> > What latency ?   ;)
> > 
> > I have 100 usec ping latencies on my network.  Bandwidth ~ 8-12 MB/sec

I don't think you have 100 us latency for any nontrivial-sized packet.
my net (FD 100bT, nothing special) shows >0.4 ms latency for 1k pings,
so you'll clearly be into multiple ms for a useful-sized page cluster,
say, 32K.
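
as a rough check on that estimate, here is a minimal sketch that adds a fixed latency to the serialization time of a 32K cluster. it generously assumes the quoted 100 us latency and the quoted 8-12 MB/s bandwidth; the function name and the simple latency-plus-size/bandwidth model are illustrative assumptions, not a measurement.

```python
def transfer_ms(size_bytes, latency_ms, bandwidth_mb_s):
    """Fixed latency plus serialization time, in milliseconds.

    Assumes 1 MB = 1e6 bytes and ignores protocol overhead and contention.
    """
    return latency_ms + size_bytes / (bandwidth_mb_s * 1e6) * 1e3


PAGE_CLUSTER = 32 * 1024  # the 32K page cluster mentioned above

# Use the optimistic 0.1 ms (100 us) latency and the quoted bandwidth range.
for bw in (8.0, 12.0):
    ms = transfer_ms(PAGE_CLUSTER, 0.1, bw)
    print(f"{bw:4.0f} MB/s: {ms:.2f} ms")
```

even with the optimistic latency, the 32K transfer comes out at roughly 2.8 to 4.2 ms under this model, because the bandwidth term dominates.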

now, I assumed from the original context that someone wanted 
to do network shared memory.  this is mostly a nutty idea:
the granularity is necessarily a 4 or 8K page, so latency
is nontrivial unless your code is somehow incredibly asynchronous.

network swapping is a somewhat different story, since there's 
no huge urgency in pushing pages out, and often not that much
in getting them back.  of course, when paging out, you're already
short of ram, and it's not exactly nice to have to allocate 
more ram to do the transmit, and really you shouldn't free
the page until you get an ack from the server.  when paging in,
the faulting process will simply sleep until its page arrives, which
is at least not positive feedback.
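
the "don't free until acked" rule can be sketched as a toy bookkeeping model. this is only an illustration under stated assumptions: the class and method names are hypothetical, no real networking or kernel interface is involved, and the actual transmit is left out.

```python
class PageOutQueue:
    """Toy model: a page stays in local RAM until the server acks it."""

    def __init__(self):
        self.resident = {}      # page_id -> bytes still held locally
        self.in_flight = set()  # pages sent but not yet acknowledged

    def add_page(self, page_id, data):
        self.resident[page_id] = data

    def send(self, page_id):
        # The transmit would happen here; the local copy is NOT freed yet.
        self.in_flight.add(page_id)
        return self.resident[page_id]

    def ack(self, page_id):
        # Only after the server's ack is it safe to drop the local copy.
        if page_id in self.in_flight:
            self.in_flight.discard(page_id)
            del self.resident[page_id]

    def resident_bytes(self):
        return sum(len(d) for d in self.resident.values())


q = PageOutQueue()
q.add_page(1, bytes(4096))
q.send(1)
held_before_ack = q.resident_bytes()  # page still occupies RAM
q.ack(1)
held_after_ack = q.resident_bytes()   # now released
```

the point of the model is that memory is still tied up for the whole round trip, exactly when the machine is short of ram.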

> > I have 8 ms seek latencies on my harddrives. Bandwidth ~ 12-16 MB/sec

that's pretty miserable bandwidth.  it's basically impossible to buy a modern
IDE disk, for instance, that sustains less than 20 MB/s on inner/slow tracks,
and most peak close to 40 MB/s.  not to mention how easy it is to 
stripe them, or just throw 15krpm scsi at the problem.

> I'm referring to network swap here.  Swapping over fast local networks
> can absolutely make sense.

networks/nics suck, relative even to ide controllers.  there's just
no getting around that.  it's routine to launch off a 128K scatter-gather,
busmaster command to a cheap UDMA controller - yes, read latency can be 
unpleasant, but the bandwidth is great, and the overhead is minimal.

you simply can't say that about networks, which is a shame.  the world
would be a far better place if I could at least send 8-9K jumbos,
preferably with a sane, zero-copy-friendly s-g interface.

yes, gigE helps, as do smarter nics, zcopy, jumbograms, even exotica
like STP.

but have you looked at how much CPU is eaten by a gigE card 
streaming, say, 50 MB/s, versus a cheap dual-channel IDE doing so?

in summary: swapping or shmem over a net is attractive on the surface,
but crunch the numbers and you find it's only a win in very special cases.
(that said, I'll go back to tuning up my 112-cpu SC40 that has ~200 MB/s
very smart interconnect ;)
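
one way to crunch the numbers is a break-even sketch using only figures from this thread: 8 ms disk seek, 20 MB/s for a slow modern IDE disk, 0.4 ms network latency, and 10 MB/s as a midpoint of the quoted 8-12 MB/s. the function names and the latency-plus-size/bandwidth model are assumptions for illustration; the model ignores CPU overhead, memory pressure on the sender, and server contention.

```python
def fetch_ms(size_bytes, latency_ms, bandwidth_mb_s):
    """Latency plus transfer time in ms (1 MB = 1e6 bytes assumed)."""
    return latency_ms + size_bytes / (bandwidth_mb_s * 1e6) * 1e3


def break_even_bytes(disk_lat_ms, disk_bw, net_lat_ms, net_bw):
    """Transfer size at which disk and network fetch times are equal.

    Returns None if the network is not slower per byte than the disk,
    in which case there is no crossover in this model.
    """
    per_byte_gap = 1.0 / (net_bw * 1e6) - 1.0 / (disk_bw * 1e6)  # s/byte
    if per_byte_gap <= 0:
        return None
    return (disk_lat_ms - net_lat_ms) / 1e3 / per_byte_gap


size = break_even_bytes(8.0, 20.0, 0.4, 10.0)
print(f"break-even transfer size: about {size / 1e3:.0f} KB")
```

under these assumptions the crossover lands near 150 KB: smaller random fetches favor the net on latency alone, while larger or streaming transfers favor the disk, before any of the costs the model leaves out are counted.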

regards, mark hahn.