diskless node + g98?

Thu Jan 23 12:17:57 PST 2003

On Thu, Jan 23, 2003 at 10:29:31AM -0800, Martin Siegert's all...
> On Thu, Jan 23, 2003 at 11:43:29AM -0600, lmathew at okstate.edu wrote:
> > Beowulf list readers:
> > 
> > I have a Beowulf cluster (12 diskless nodes, 1 fileserver/master) with
> > 26 processors (total) that is configured to run computational simulations
> > in both parallel and serial (pretty standard for this list).  I am
> > interested in utilizing my cluster to run a series of serial g98
> > calculations on each node.  These calculations (as many of you know)
> > require a "scratch" space.  How can this scratch space be provided to a
> > diskless node?  Here are a few options that I have identified.
> 
> I am running a 96 node (192 processor) cluster as a multi-purpose
> research facility for a university. I have a lot of g98 jobs running
> on that cluster. All of my nodes have /tmp on a local disk with 15GB of
> scratch space.

We started with no local disk on our clusters for G98, and it really depends
on a bunch of things (as always).

At the time, they weren't giving hard drives away in cereal boxes, so we didn't
put them in every node. Our per-node cost was only 2x the cost of a drive
at the time.

So we could buy N nodes with N disks, or we could buy N*1.5 nodes and put
everything on a big NFS server (granted, the NFS server was already available
at no cost to the cluster, which doesn't make this a fair comparison).

We would normally lose 10-20% of performance for the types of jobs we were
running waiting for 'slow' (12 MB/s across 100Mbps) scratching to occur.
(We found the jobs would want to scratch at that speed for about 5-15% of
their runtime, so sharing 6 nodes per 100Mbps works fine for total 
throughput on 100Mbps networks).
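
As a rough check on the six-nodes-per-link figure, here is a minimal Python
sketch. It assumes each node scratches independently for a fixed fraction of
its runtime, which is a simplification and not something measured above. The
function name and the 10% figure (middle of the 5-15% range) are illustrative
only.

```python
from math import comb

def p_at_least(k, nodes, busy_fraction):
    """Probability that at least k of `nodes` are scratching at the same
    moment, assuming each scratches independently for `busy_fraction`
    of its runtime (a simplifying assumption)."""
    return sum(
        comb(nodes, i) * busy_fraction**i * (1 - busy_fraction)**(nodes - i)
        for i in range(k, nodes + 1)
    )

nodes = 6             # nodes sharing one 100Mbps link
busy_fraction = 0.10  # assumed: middle of the 5-15% range

# Average demand on the link, in units of "one node scratching flat out".
mean_load = nodes * busy_fraction

# Chance that two or more nodes contend for the link at the same time.
contention = p_at_least(2, nodes, busy_fraction)

print(f"mean link load: {mean_load:.2f}")
print(f"P(2+ nodes scratching at once): {contention:.3f}")
```

Under those assumptions the link averages about 60% of one node's full-speed
demand, and two or more nodes collide roughly 11% of the time.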

2 things:

1) 1.5N * 0.9 = 1.35N diskless throughput vs 1N for putting drives in.

2) 1.5N * 0.9 + 1.5N * 0.1 = 1.5N vs 1N

(2) stems from the fact that while processes are in NFSIO state, we can
actually run other jobs on that same node in the meantime, and recover the CPU
otherwise lost. (In our situation there are *ALWAYS* large low-priority jobs
around that have to run for a long time, for us to slam onto a node at idle
priority level 20 (FreeBSD idle priority ROCKS), so we can do this.) Perhaps
all your jobs are at equal priority so you can't do this.  Or perhaps you run
Linux, so you must give up 5% of your CPU for nice 19'd jobs and thrash your
cache and get hammered on context switches. So this might not work for some.
Idle priority solves this when you have any job that can use 100% of the CPU
for LONG PERIODS OF TIME (anything over a second is a VERY LONG TIME on a
CPU).
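
The arithmetic in (1) and (2) above can be written as a tiny Python sketch.
The function and parameter names are made up for illustration. The 1.5x node
multiplier and the 90%/10% split are the figures from this post, not general
constants.

```python
def relative_throughput(node_multiplier, cpu_efficiency, recovered_fraction=0.0):
    """Throughput of the diskless cluster relative to N disked nodes (= 1.0).

    node_multiplier    -- diskless nodes the same money buys per disked node
    cpu_efficiency     -- fraction of CPU the main jobs keep while scratching over NFS
    recovered_fraction -- fraction of CPU soaked up by idle-priority filler jobs
    """
    return node_multiplier * (cpu_efficiency + recovered_fraction)

case1 = relative_throughput(1.5, 0.9)        # (1) no filler jobs
case2 = relative_throughput(1.5, 0.9, 0.1)   # (2) idle CPU fully recovered

print(f"(1) diskless vs disked: {case1:.2f}N vs 1N")
print(f"(2) diskless vs disked: {case2:.2f}N vs 1N")
```

Plug in your own multiplier and measured efficiency. The recovered fraction is
zero if you have no low-priority work to fill the gaps.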

But, since hard drives are now at worst 1/4 the cost of a node (regardless
of what kinds of nodes you use), this all might require a reanalysis of
the situation. (As always, though, as RGB and I keep repeating, people keep
leaving the $ signs out of their 'performance' calculations.)

In fact, for a specific number of nodes on a cluster we built last year, the
upgrade was to put drives on half the nodes. (A superlinear Moore's law on
hard drive size per $ over the last year has helped a lot to underline how
important *WAITING* to purchase and install parts of your cluster can be!
$50USD for 40GB drives?! haha!) This allows us to run frequency scan jobs on
g98 much faster on those nodes. It really depends on what kinds of jobs you
run under G98 as well, and what models of theory you use. Frequency and scan
jobs are the worst for thrashing scratch - just buy a hard drive per node.

But use it only as scratch - manage the cluster from a shared NFS root for
sure.

> > 1).  Mount a LARGE ram drive?  (1GB in size if possible??) 
> Almost certainly not good enough: most of the g98 jobs that I see on
> my cluster need more than 1GB of scratch space.

RAM drives don't work so well because they have a fixed size. Plus, RAM
is more expensive than disk. A better solution: swap-backed ramdisk.

We swap over the network as required (g98 runs in wired core), and it's
extremely fast (in fact, I find it faster than whatever method g98 uses to
write its scratch files).  And you don't need to hammer the network until you
run out of RAM. Perfect solution.

However, Linux's swap-backed ramdisk support is far less mature than FreeBSD's
md device. We have had a lot of success with it on FreeBSD >4.5.

512 megs of RAM on these boards, when the jobs really only want (and only
seem to take, even when forced) 128 or 256 megs, means small scratch files
can be dealt with quickly, and only when they get large do you go to the
network.

> > 2).  Install hard disk drives in each of the slave nodes?  (unattractive)
> By far the best solution.

Yes, by far the FASTEST PERFORMANCE PER NODE. I have a box of extra $ signs
for your calculations here if you'd like to use them. Then we can all do a
FASTEST AGGREGATE THROUGHPUT PER DOLLAR calculation. (Doesn't anyone care
about this? Why? Are we all building ASCI colour superclusters by using
backhoes to dig into our gravel pit full of money?)

> > 3).  Use a drive mounted via NFS/PVNFS?  (large amount of communication)
> Very bad. I first (because I did not know anything about g98) had g98
> configured such that it would write its scratch files to the user's home
> directory over NFS. This did not only drive the performance of the g98
> towards 0, but what is worse it made life miserable for everybody on the
> cluster (NFS timeouts, etc.).

We found no NFS timeouts. We designed things properly with g98 scratching in
mind. The NFS server has 4 RAID 0 striped drives that are specifically set up
to handle this scratch work. No problems at all, once everyone was warned "we
have Nx1.5 nodes *BECAUSE* we have no disk, so do not bitch that you only see
90% cpu usage for your jobs! submit another low priority job and soak it up
if you really care." -- worked well.

Furthermore, since you are running jobs singly per node, total throughput of
all nodes is obviously very important to you (you mentioned you have
sequential jobs, but obviously you don't have just 1 node - you have parallel
streams of work to perform across your # of nodes). So in this case, if there's any
way you can split streams up into more than the 26 cpus you have and
prioritize them differently, then you can soak up the extra cpu you have
left over from not having disks.

At least that was the philosophy behind it all when disks cost a lot more
than they do now. Again, they're so cheap, it might not make sense anymore.
As I said, get out the $ signs and do the math. It makes less sense with
faster and more expensive nodes. (The ratio of node cost per disk cost is
a key part of the calculation).
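
To make the node-cost/disk-cost ratio concrete, here is a small Python sketch
of the break-even point. It is a simplified model with assumptions that are
not spelled out above: a fixed budget, a disked node costing node + disk and
running at full speed, a diskless node running at 90% efficiency, a free NFS
server, and no CPU recovered by filler jobs. The function names are
illustrative only.

```python
def diskless_advantage(node_cost, disk_cost, cpu_efficiency):
    """Ratio of diskless to disked throughput for the same budget.

    A disked node costs node_cost + disk_cost and runs at full speed;
    a diskless node costs node_cost and runs at cpu_efficiency.
    Values above 1.0 favour going diskless.
    """
    return cpu_efficiency * (node_cost + disk_cost) / node_cost

def break_even_ratio(cpu_efficiency):
    """Node-cost / disk-cost ratio at which both designs give equal throughput."""
    return cpu_efficiency / (1.0 - cpu_efficiency)

for ratio in (2, 4, 10, 20):
    adv = diskless_advantage(node_cost=ratio, disk_cost=1, cpu_efficiency=0.9)
    print(f"node costs {ratio:>2}x a disk: diskless/disked = {adv:.3f}")

print(f"break-even at node/disk cost ratio = {break_even_ratio(0.9):.1f}")
```

In this model the diskless design stops paying for itself once a node costs
more than about 9x a disk. Charging for the NFS server pushes the break-even
lower, and recovering idle CPU pushes it higher.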

Another solution we're actually using on that cluster that now has the drives:
since it's been hard to hammer into people's heads to use SPECIFIC nodes
for SPECIFIC types of jobs (following the concept of a 'cluster tuned
specifically for the jobs it runs'), we've just NFS-mounted each odd node's
disk onto an even diskless node. So you only have 1 NFS client per disk'ed
node, and it gets full 100Mbps performance. Yes, it uses a bit of CPU on the
disk'ed node, but it's worth it in the long run.
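
A minimal Python sketch of that odd/even pairing. The hostname pattern
(node01, node02, ...), the /scratch export path, the mount options, and the
rule that each even node uses the odd node just before it are all
hypothetical, chosen only to show the one-client-per-server layout.

```python
def pair_nodes(node_count):
    """Map each even-numbered diskless node to the odd-numbered disked
    node just before it: 2 -> 1, 4 -> 3, and so on."""
    return {even: even - 1 for even in range(2, node_count + 1, 2)}

def fstab_line(server_number, export="/scratch", mountpoint="/scratch"):
    """Build one NFS fstab entry; hostname pattern and paths are made up."""
    return f"node{server_number:02d}:{export}  {mountpoint}  nfs  rw  0  0"

pairs = pair_nodes(12)
for client, server in pairs.items():
    # One line per diskless client, pointing at its single disked partner.
    print(f"node{client:02d}: {fstab_line(server)}")
```

Each disked node ends up serving exactly one client, so no scratch server is
shared between several diskless nodes.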

The big loss with hard drives is that we've probably instantly doubled the
failure rate of any component on the cluster - more to go wrong now. :(

> > Has anyone encountered this?  If so...what was the workaround that was
> > implemented?  I am open to any suggestions and comments.   :)

You can do 1 disk per n nodes if you want, but it makes sense to RAID 1 them
to avoid downtime/bitching. Maintenance on a RAID 1 can wait hours or days
(weeks?) before being critical. And, with a few simple scripts that can
chomp .log outputs from g98 and restart jobs where they left off, without
paying the 3-10% performance hit for the 'checkpoint' feature in g98, which
hammers the disk even harder, downtime hardly matters anymore anyway as
long as the node comes back up eventually - you don't lose all work to date
on that node. (This works extremely well with my feelings about designing
clusters that *CAN* have nodes fail without major impact, allowing one to
use very cheap parts without any throughput loss.)
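
The scripts themselves are not shown here. As a hedged sketch of just the
first step (finding which jobs died), the Python below assumes a successful
run ends its .log with a line containing "Normal termination", which is how
Gaussian logs normally end as far as I know. The function names are made up,
and picking up the job where it left off is left out entirely.

```python
from pathlib import Path

def finished_normally(log_text):
    """True if the last non-blank line of the log reports normal termination."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return bool(lines) and "Normal termination" in lines[-1]

def jobs_to_restart(log_dir):
    """Return the .log files directly under log_dir that did not finish normally."""
    return sorted(
        path for path in Path(log_dir).glob("*.log")
        if not finished_normally(path.read_text(errors="replace"))
    )
```

Running `jobs_to_restart("/path/to/logs")` after a node comes back up gives
the list of logs to hand to whatever rebuilds the input and resubmits.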

> I am going to stick my head out here: configuring a multi-purpose
> cluster with diskless nodes is a misconfiguration. Only if you know
> that you'll never run a job with significant I/O on your cluster
> you could consider going diskless. Otherwise: stay away from that.

No, just get out your $ signs and do appropriate calculations. Just be
a bloody hardnosed cold realist and calculate the numbers. Look at total
throughput per dollar from each design.

I even did some GNUplots of disk usage vs CPU usage as well as network
bandwidth used (MRTG) to win the cluster contract that had no local scratch.
Yes it was slower per node, but we had far more nodes to more than make up for
it. (And as I keep saying, the extra cpu that's idle can be reabsorbed.)

> (you could install a high-performance file server on your cluster - we
> actually have a Netapp NFS server - but for g98 your network becomes
> the bottleneck. Furthermore, this is definitely more expensive than
> installing local disks ...)

It can become a bottleneck at really large numbers of jobs, yes, that's quite
true. The bottleneck is transactions on disk per second, not the raw disk
bandwidth. I'd suggest a disk array per 20-30 CPUs for what we do, but it's
hard to compare what we do with what you're doing.

[ Besides, even if the disk is being hammered, until the disk usage
reaches a plateau (such that it's hammered equally all the time) *AND* the
performance loss is not worth the extra nodes (regardless of soaking up
CPU with other jobs that may (or not) hit disk), it may still be worth it.
Again, it depends on your applications - if you have other non g98 jobs
to run that hardly touch disk at all, you're laughing here - you'll always
be able to soak up extra cpu caused by slow disk or network. ]

G98 has many different job parameters and uses the disk in very different
ways. It _REALLY DEPENDS_. Run your tests now on a few nodes and then plot
your results vs dollars spent.

/kc

> Just my $0.02
>
> Martin
> 
> ========================================================================
> Martin Siegert
> Academic Computing Services                        phone: (604) 291-4691
> Simon Fraser University                            fax:   (604) 291-4242
> Burnaby, British Columbia                          email: siegert at sfu.ca
> Canada  V5A 1S6
> ========================================================================
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA