diskless node + g98?

Thu Jan 23 14:28:04 PST 2003

I don't know g98, but would NBD (http://www.xss.co.at/linux/NBD/) or
ENBD (http://www.it.uc3m.es/~ptb/nbd/) be of any help here?
Or for other uses in Beowulf clusters?
Has anyone had good experiences with them?

Bye,
Gianluca

----- Original Message -----
From: "Ken Chase" <math at velocet.ca>
To: <beowulf at beowulf.org>
Sent: Thursday, January 23, 2003 9:17 PM
Subject: Re: diskless node + g98?

> On Thu, Jan 23, 2003 at 10:29:31AM -0800, Martin Siegert's all...
> > On Thu, Jan 23, 2003 at 11:43:29AM -0600, lmathew at okstate.edu wrote:
> > > Beowulf list readers:
> > >
> > > I have a Beowulf cluster (12 diskless nodes, 1 fileserver/master) with
> > > 26 processors (total) that is configured to run computational
> > > simulations in both parallel and serial (pretty standard for this
> > > list).  I am interested in utilizing my cluster to run a series of
> > > serial g98 calculations on each node.  These calculations (as many of
> > > you know) require a "scratch" space.  How can this scratch space be
> > > provided to a diskless node?  Here are a few options that I have
> > > identified.
> >
> > I am running a 96 node (192 processor) cluster as a multi-purpose
> > research facility for a university. I have a lot of g98 jobs running
> > on that cluster. All of my nodes have /tmp on a local disk with 15GB of
> > scratch space.
>
> We started with no local disk on our clusters for G98, and it really
> depends on a bunch of things (as always).
>
> At the time, they weren't giving hard drives away in cereal boxes, so we
> didn't put them in every node. Our per-node cost was only 2x the cost of
> a drive at the time.
>
> So we could buy N nodes with N disks, or we could buy N*1.5 nodes and put
> everything on a big NFS server (granted, the NFS server was already
> available at no cost to the cluster, which doesn't make this a fair
> comparison).
>
> We would normally lose 10-20% of performance for the types of jobs we were
> running waiting for 'slow' (12 MB/s across 100Mbps) scratching to occur.
> (We found the jobs would want to scratch at that speed for about 5-15% of
> their runtime, so sharing 6 nodes per 100Mbps works fine for total
> throughput on 100Mbps networks).
>
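
The sharing argument above can be checked with a quick back-of-the-envelope
sketch in Python. It assumes (this is not stated in the post) that nodes
scratch independently of one another, and the function names are made up for
illustration; the 6 nodes and the 5-15% duty cycle are the figures quoted
above.

```python
def expected_scratching_nodes(nodes, duty_cycle):
    """Average number of nodes scratching at the same instant.

    Assumes each node scratches independently for `duty_cycle`
    (a fraction of its runtime). Illustrative model only.
    """
    return nodes * duty_cycle


def prob_link_contended(nodes, duty_cycle):
    """Probability that two or more nodes want the shared link at once."""
    p_none = (1 - duty_cycle) ** nodes
    p_one = nodes * duty_cycle * (1 - duty_cycle) ** (nodes - 1)
    return 1 - p_none - p_one


if __name__ == "__main__":
    for duty in (0.05, 0.15):
        print(duty,
              expected_scratching_nodes(6, duty),
              prob_link_contended(6, duty))
```

Under that assumption, fewer than one node on average wants the link at any
moment even at the 15% end, which is consistent with 6 nodes per 100Mbps
being workable.
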
> 2 things:
>
> 1) 1.5N * 0.9 = 1.35N diskless throughput vs 1N for putting drives in.
>
> 2) 1.5N * 0.9 + 1.5N * 0.1 = 1.5N vs 1N
>
> (2) stems from the fact that while processes are in NFSIO state, we can
> actually run other jobs on that same node in the meantime, and recover the
> CPU otherwise lost. (In our situation there are *ALWAYS* large low-priority
> jobs around that have to run for a long time, which we can slam onto a node
> at idle priority level 20 (FreeBSD idle priority ROCKS), so we can do this.)
> Perhaps all your jobs are at equal priority so you can't do this.  Or
> perhaps you run Linux, so you must give up 5% of your CPU for nice 19'd
> jobs and thrash your cache and get hammered on context switches. So this
> might not work for some. Idle priority solves this when you have any job
> that can use 100% of the CPU for LONG PERIODS OF TIME (anything over a
> second is a VERY LONG TIME on a CPU).
>
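
The two cases above, written out as a small Python sketch. The function name
and parameters are hypothetical; the 1.5x node ratio and the 90% efficiency
are the figures from the post, and throughput is expressed relative to N
nodes with local disks (= 1.0).

```python
def diskless_throughput(node_ratio=1.5, efficiency=0.9, backfill=False):
    """Throughput of the diskless design relative to N diskful nodes.

    node_ratio: how many diskless nodes the same money buys per diskful node.
    efficiency: fraction of CPU time the main job keeps while scratching
                over NFS.
    backfill:   True if idle-priority jobs soak up the CPU left idle during
                NFS waits (case 2 in the post).
    """
    useful = node_ratio * efficiency
    if backfill:
        useful += node_ratio * (1 - efficiency)
    return useful


if __name__ == "__main__":
    print(diskless_throughput())               # case 1
    print(diskless_throughput(backfill=True))  # case 2
```
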
> But, since hard drives are at worst 1/4 the cost of a node (regardless
> of what kinds of nodes you use) this all might require a reanalysis of
> the situation. (As always, tho, as RGB and I keep repeating, people keep
> leaving the $ signs out of their 'performance' calculations.)
>
> In fact, for a specific number of nodes on a cluster we built last year,
> the upgrade was to put drives on half the nodes. (A superlinear Moore's law
> on hard drive sizes/$ from the last year has helped a lot to underline how
> important *WAITING* to purchase and install parts of your cluster can be!
> $50USD for 40GB drives?! haha!) This allows us to run frequency scan jobs
> on g98 much faster on those nodes. It really depends on what kinds of jobs
> you run under G98 as well, and what models of theory you use. Frequency and
> scan jobs are the worst for thrashing scratch - just buy a hard drive per
> node.
>
> But use it only as scratch - manage the cluster from a shared NFS root for
> sure.
>
> > > 1).  Mount a LARGE ram drive?  (1GB in size if possible??)
> > Almost certainly not good enough: most of the g98 jobs that I see on
> > my cluster need more than 1GB of scratch space.
>
> RAM drives don't work so well because they have a fixed size. Plus, RAM
> is more expensive than disk. A better solution: swap-backed ramdisk.
>
> We swap over the network as required (g98 runs in wired core), and it's
> extremely fast (in fact, I find it faster than whatever method g98 uses to
> write its scratch files).  And you don't need to hammer the network until
> you run out of RAM. Perfect solution.
>
> However, Linux's swap-backed ramdisk stuff is far less mature than
> FreeBSD's md device. We have had a lot of success with it on FreeBSD >4.5.
>
> 512 megs of RAM on these boards when the jobs really only want (and only
> seem to take, even when forced) 128 or 256 megs means small scratch files
> can be dealt with quickly, and only when they get large do you go to the
> network.
>
> > > 2).  Install hard disk drives in each of the slave nodes? (unattractive)
> > By far the best solution.
>
> Yes, by far the FASTEST PERFORMANCE PER NODE. I have a box of extra $
> signs for your calculations here if you'd like to use them. Then we can
> all do a FASTEST AGGREGATE THROUGHPUT PER DOLLAR calculation. (Doesn't
> anyone care about this? Why? Are we all building ASCI colour superclusters
> by using backhoes to dig into our gravel pit full of money?)
>
> > > 3).  Use a drive mounted via NFS/PVNFS?  (large amount of
> > > communication)
> > Very bad. At first (because I did not know anything about g98) I had g98
> > configured such that it would write its scratch files to the user's home
> > directory over NFS. This not only drove the performance of g98 towards 0
> > but, what is worse, it made life miserable for everybody on the
> > cluster (NFS timeouts, etc.).
>
> We found no NFS timeouts. We designed things properly with g98 scratching
> in mind. The NFS server has 4 RAID 0 striped drives that are specifically
> set up to handle this scratch work. No problems at all, once everyone was
> warned "we have Nx1.5 nodes *BECAUSE* we have no disk, so do not bitch that
> you only see 90% CPU usage for your jobs! Submit another low-priority job
> and soak it up if you really care." -- worked well.
>
> Furthermore, since you are running jobs singly per node, total throughput
> of all nodes is obviously very important to you (you mentioned you have
> sequential jobs, but obviously you don't have just 1 node - you have
> parallel streams of work to perform across your # of nodes). So in this
> case, if there's any way you can split streams up into more than the 26
> CPUs you have and prioritize them differently, then you can soak up the
> extra CPU you have left over from not having disks.
>
> At least that was the philosophy behind it all when disks were a lot more
> expensive than they are now. Again, they're so cheap, it might not make
> sense anymore. As I said, get out the $ signs and do the math. It makes
> less sense with faster and more expensive nodes. (The ratio of node cost
> to disk cost is a key part of the calculation.)
>
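
To see how the node-to-disk cost ratio drives the answer, here is a minimal
Python sketch. It rests on assumptions that are only implicit in the post: a
fixed budget, a diskful node costing one diskless node plus one disk, a free
NFS server, and a flat 90% efficiency for diskless nodes. The function name
is hypothetical.

```python
def diskless_advantage(node_to_disk_cost_ratio, efficiency=0.9):
    """Throughput of a diskless cluster relative to a diskful one (= 1.0)
    bought with the same budget.

    With costs measured in disk-units, a diskless node costs `ratio` and a
    diskful node costs `ratio + 1`, so the same money buys
    (ratio + 1) / ratio times as many diskless nodes.
    """
    r = node_to_disk_cost_ratio
    extra_nodes = (r + 1) / r
    return extra_nodes * efficiency


if __name__ == "__main__":
    # ratio 2: node costs 2x a drive (the original purchase)
    # ratio 4: drive is 1/4 the cost of a node (the "at worst" case today)
    for ratio in (2, 4, 9):
        print(ratio, diskless_advantage(ratio))
```

With these assumptions the diskless edge shrinks from 1.35 at a ratio of 2 to
1.125 at a ratio of 4, and disappears at a ratio of about 9 unless the idle
CPU is backfilled with other jobs.
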
> Another solution we're actually using on that cluster that now has the
> drives: since it's been hard to hammer into people's heads to use SPECIFIC
> nodes for SPECIFIC types of jobs (following the concept of a 'cluster
> tuned specifically for the jobs it runs'), we've just mounted every odd
> node with disk onto every even diskless node. So you only have 1 NFS
> client per node, and it gets full 100Mbps performance. Yes, it uses a bit
> of CPU on the disk'ed node, but it's worth it in the long run.
>
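
The odd/even pairing described above could be generated along these lines.
This is only a sketch: the numbering scheme (nodes 1..n, each even node
mounting from the odd node just below it) and the function name are
assumptions, since the post does not say which neighbour each diskless node
uses.

```python
def pair_nodes(node_count):
    """Map each even-numbered (diskless) node to the odd-numbered node
    (with disk) whose scratch space it mounts over NFS.

    Assumes nodes are numbered 1..node_count and that even node k mounts
    from odd node k - 1. Hypothetical layout for illustration.
    """
    return {even: even - 1 for even in range(2, node_count + 1, 2)}


if __name__ == "__main__":
    for client, server in sorted(pair_nodes(6).items()):
        print(f"node{client} mounts scratch from node{server}")
```

Each server then has exactly one NFS client, which is what lets the client
use the full link.
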
> The big loss with hard drives is that we've probably instantly doubled the
> failure rate of any component on the cluster - more to go wrong now. :(
>
> > > Has anyone encountered this?  If so...what was the workaround that was
> > > implemented?  I am open to any suggestions and comments.   :)
>
> You can do 1 disk per n nodes if you want, but it makes sense to RAID 1
> them to avoid downtime/bitching. Maintenance on a RAID 1 can wait hours or
> days (weeks?) before being critical. And, with a few simple scripts that
> can chomp .log outputs from g98 and restart jobs where they left off,
> without paying the 3-10% performance hit for the 'checkpoint' feature in
> g98, which hammers the disk even harder, downtime hardly matters anymore
> anyway as long as the node comes back up eventually - you don't lose all
> work to date on that node. (This works extremely well with my feelings
> about designing clusters that *CAN* have nodes fail without major impact,
> allowing one to use very cheap parts without any throughput loss.)
>
> > I am going to stick my head out here: configuring a multi-purpose
> > cluster with diskless nodes is a misconfiguration. Only if you know
> > that you'll never run a job with significant I/O on your cluster
> > you could consider going diskless. Otherwise: stay away from that.
>
> No, just get out your $ signs and do appropriate calculations. Just be
> a bloody hardnosed cold realist and calculate the numbers. Look at total
> throughput per dollar from each design.
>
> I even did some GNUplots of disk usage vs CPU usage as well as network
> bandwidth used (MRTG) to win the cluster contract that had no local
> scratch. Yes it was slower per node, but we had far more nodes to more
> than make up for it. (And as I keep saying, the extra CPU that's idle can
> be reabsorbed.)
>
> > (you could install a high-performance file server on your cluster - we
> > actually have a Netapp NFS server - but for g98 your network becomes
> > the bottleneck. Furthermore, this is definitely more expensive than
> > installing local disks ...)
>
> It can become a bottleneck at really large numbers of jobs, yes, that's
> quite true. The bottleneck is transactions on disk per second, not the raw
> disk bandwidth. I'd suggest a disk array per 20-30 CPUs for what we do,
> but it's hard to compare what we do with what you're doing.
>
> [ Besides, even if the disk is being hammered, until the disk usage
> reaches a plateau (such that it's hammered equally all the time) *AND* the
> performance loss is not worth the extra nodes (regardless of soaking up
> CPU with other jobs that may (or not) hit disk), it may still be worth it.
> Again, it depends on your applications - if you have other non g98 jobs
> to run that hardly touch disk at all, you're laughing here - you'll always
> be able to soak up extra cpu caused by slow disk or network. ]
>
> G98 has many different job parameters and uses the disk in very different
> ways. It _REALLY DEPENDS_. Run your tests now on a few nodes and then plot
> your results vs dollars spent.
>
> /kc
>
> > Just my $0.02
> >
> > Martin
> >
> > ========================================================================
> > Martin Siegert
> > Academic Computing Services                        phone: (604) 291-4691
> > Simon Fraser University                            fax:   (604) 291-4242
> > Burnaby, British Columbia                          email: siegert at sfu.ca
> > Canada  V5A 1S6
> > ========================================================================
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto,
> CANADA