diskless nodes? (was Re: Xbox clusters?)

Wed Dec 5 22:30:57 PST 2001

On Wed, Dec 05, 2001 at 10:44:19PM -0700, Art Edwards's all...
> On Wed, Dec 05, 2001 at 11:10:15PM -0500, Velocet wrote:
> > On Wed, Dec 05, 2001 at 03:40:01PM -0800, Brian LaMere's all...
> > > I spent less than $100,000 on a 48 node dual-1ghz p3 cluster with 2gb ram
> > > each, and hot-swappable 18gb 15krpm drives (all in a 1-U rackmounted
> > > footprint!).  Of course, I cheated at the time (VA linux firesale)...but I
> > > could do the same thing again for about $125,000 in today's prices with
> > > systems that someone else builds, and has a year warranty.
> > > 
> > > advancedclustering.com, for instance.  For $2537, you get dual p3- 1ghz cpu,
> > > 18gb 10krpm hotswap scsi, and 2 gb ram.  All in a 1-U package, all with a 1
> > > year warranty.  Extended warranties available.
> > 
> > Wondering why everyone gets local drives here - do you people have
> > computational software that needs to write scratch or swap faster than
> > 100Mbps (12.5MB/s) or even GBE (125MB/s theoretical, 35-50MB/s actual)? Isn't
> > that fast enough?
> > 
> > Why aren't diskless netbooting clusters more popular?
> > 
> > /kc

> 
> After working on a diskless cluster I went screaming into the night.
>
> Granted, the software did not allow for networked swap. Still, scaling could
> be a real problem if every disk write requires a network connection. We have
> found that for very large systems, it is simply impossible for the system to
> keep up with I/O demands. This architecture makes every job seem to be
> tightly coupled.
> There are many embarrassingly parallel jobs that would slow down significantly
> in a diskless architecture.
>
> Art Edwards

(nasty >80 char email)

Every disk write does require a few PACKETS to be sent, but it's connectionless
if you use NFS over UDP. NFS over TCP can be done, but there's more overhead.
Why not use NFSv3 as it is? It works great for me, even mixing FreeBSD and
Linux as either client or server.
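
For anyone who wants to try it, here is a minimal sketch of what an NFSv3-over-UDP
scratch mount entry could look like on a Linux client. The server name, export
path and mount point are made up for illustration, and the option names are the
Linux nfs(5) ones (nfsvers, proto, hard); FreeBSD clients spell these differently.

```python
def nfs_fstab_line(server, export, mountpoint, proto="udp", version=3):
    """Format one /etc/fstab entry for an NFS mount using Linux-style options."""
    # nfsvers/proto/hard are Linux nfs(5) mount options; adjust for other clients.
    options = f"nfsvers={version},proto={proto},hard"
    return f"{server}:{export} {mountpoint} nfs {options} 0 0"


# Hypothetical names: "scratchsrv" and the paths are placeholders, not our setup.
print(nfs_fstab_line("scratchsrv", "/export/scratch", "/scratch"))
```

Passing proto="tcp" gives the TCP variant if you want to compare overhead.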

Impossible to keep up with I/O demands? We have an extremely modest setup and
yet we get 58MB/s (megaBYTES per second) as per bonnie with our RAID 0 setup
on our LVD drives. For 30 nodes this rate of writing is fine. If our jobs in
G98 were writing that much scratch we might want to reconsider what kinds of
jobs were running. Each node doesn't write to scratch all the time; in fact we
find it scratches 5-10% max of its computation time, at 11-12MB/s write speed.
5-10% * 30 nodes = the equivalent of 1.5 to 3 nodes writing full time, or
roughly 16.5 to 36MB/s. Is 58MB/s enough for full-speed writing of 1.5 to 3
nodes? Yes.
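
The arithmetic above as a quick sketch, in case you want to plug in your own
numbers. The function name is just something I made up; the inputs are the
figures quoted above (30 nodes, 5-10% of the time scratching, 11-12MB/s per
writing node, 58MB/s at the server).

```python
NODES = 30
SERVER_MB_S = 58.0  # bonnie figure for the RAID 0 scratch server


def scratch_demand(nodes, duty_cycle, write_mb_s):
    """Return (equivalent full-time writers, aggregate demand in MB/s)."""
    writers = nodes * duty_cycle          # e.g. 30 nodes * 5% = 1.5 writers
    return writers, writers * write_mb_s  # each writer pushes write_mb_s


# Low and high ends of the observed range.
for duty, rate in ((0.05, 11.0), (0.10, 12.0)):
    writers, demand = scratch_demand(NODES, duty, rate)
    print(f"{duty:.0%} duty at {rate:.0f}MB/s: {writers:.1f} writers, "
          f"{demand:.1f}MB/s needed, {SERVER_MB_S - demand:.1f}MB/s to spare")
```

This treats the writes as evenly spread over time; bursts where many nodes
scratch at once would need more headroom than the average suggests.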

Are you using raid5 for scratch files? THAT will be slow, yes.

If you wanted it to be faster, just add more networks to the server
(one 100Mbps network for all 30 nodes is what we use right now since they never
go above 50Mbps at any time, but we can get it down to 100Mbps per 5 nodes
or even 100Mbps per 3 nodes - again they're not writing that often, so we don't
need full-speed bandwidth available always). LVD 160 controllers are
PRETTY damn fast. Put it all on a PCI-X board (64bit/133MHz PCI) and you can
stick 3-4 controllers on there for throughput of 250MB/s to 4
different arrays of disks.
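
A rough sketch of what splitting the nodes over more server networks buys you.
It assumes one server NIC per 100Mbps segment and uses raw line rate with no
protocol overhead, so real numbers will be lower; segment_plan is a made-up
helper name.

```python
import math


def segment_plan(nodes, nodes_per_segment, link_mbps=100.0):
    """Return (server NICs needed, per-node Mbps, per-node MB/s) in the worst case."""
    nics = math.ceil(nodes / nodes_per_segment)
    # Worst case: every node on the segment writes at the same moment.
    share_mbps = link_mbps / nodes_per_segment
    return nics, share_mbps, share_mbps / 8.0  # 8 bits per byte


for per_segment in (30, 5, 3):
    nics, mbps, mb_s = segment_plan(30, per_segment)
    print(f"{per_segment:2d} nodes/segment: {nics:2d} server NICs, "
          f"{mbps:5.1f}Mbps ({mb_s:4.2f}MB/s) per node worst case")
```

Since the nodes only scratch a small fraction of the time, the worst case is
rare, which is why one shared segment has been enough for us so far.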

And if THAT'S not enough (remember these are scratch files that don't need
to be written for other nodes to read), then just start using multiple
servers: n servers even at a modest 58MB/s = n x 58MB/s. With n = 8
that's 464MB/s, which is huge speed.

Distributing your load is what the clustering concept is all about, why
not distribute your disk accesses?

Hell, even put 2 or 3 NICs in each node and get more network speed that way
- don't even need channel bonding! How to get RAID 5 access via 2-3 different
networks without channel bonding is left as an exercise to the reader.

Granted, my application is special - we're not message passing through a file
system, which seems insane to me. I suppose there may be specific instances where
that's useful - 1 node writes 500MB of data and 2 other nodes only need
access to a specific 5MB of that 500 each - that may happen, but again,
with enough separate 100Mbps networks, you get speed. Hell, if you need
to splurge, use gigabit gear. It's expensive but not impossibly so if you
really can justify it.

We have no problems with diskless nodes. Gaussian is a pig on scratching
(curse: it WRITES lots and reads little) but 58MB/s across 30 nodes, an
average of about 2MB/s (16Mbps) each as a maximum, fits our needs.

I'd go insane configuring things if I had disk-full nodes. Diskless node
installations require me to run one cp -r command, sed 2 files with a new IP
address and MAC address and power the thing up - presto, node n+1 is online.
I couldn't do with the headaches of having to install stuff on n disks.
Been through that where n = 18; it's a big pain. Not to mention that 5
of the nodes were down for servicing when I had to do the work. Now
I gotta schedule a context-switch to do the SAME work over again and
recall how I did it for those nodes when they finally come back. If
I had a node down on a diskless system it'd boot up with all the current
changes.
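
The cp -r plus sed routine could be sketched roughly like this. Everything
named here is an assumption for illustration: add_node and swap_exact are
made-up helpers, and the sketch assumes the two files to patch live inside
the copied node tree and contain the template node's IP and MAC literally.

```python
import re
import shutil
from pathlib import Path


def swap_exact(text, old, new):
    """Replace whole occurrences of `old` only (10.0.0.1 must not hit 10.0.0.10)."""
    return re.sub(r"\b" + re.escape(old) + r"\b", lambda m: new, text)


def add_node(template_root, node_root, files, old_ip, new_ip, old_mac, new_mac):
    """Clone a template node tree, then rewrite IP and MAC in the listed files."""
    node_root = Path(node_root)
    # The "cp -r" step: copy the whole template tree, keeping symlinks as links.
    shutil.copytree(template_root, node_root, symlinks=True)
    # The "sed" step: patch only the files named by the caller.
    for rel in files:
        path = node_root / rel
        text = path.read_text()
        text = swap_exact(text, old_ip, new_ip)
        text = swap_exact(text, old_mac, new_mac)
        path.write_text(text)
    return node_root


# Hypothetical usage; the paths and file names are placeholders:
# add_node("/export/nodes/template", "/export/nodes/node31",
#          ["etc/hostconfig", "etc/netconfig"],
#          "10.0.0.1", "10.0.0.31", "00:11:22:33:44:01", "00:11:22:33:44:1f")
```

The point is that adding a node is a copy and two substitutions on the server,
with nothing to install on the node itself.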

In fact, as jobs finish we reboot nodes just for the hell of it - clears
memory leaks and loads all current configs, kernels, etc. It's just too
sweet to avoid!

Does NO ONE use diskless clusters?

/kc
-- 
Ken Chase, math at velocet.ca * Velocet Communications Inc.  * Toronto, CANADA