[Beowulf] first cluster

Gus Correa gus at ldeo.columbia.edu
Thu Jul 15 20:01:03 PDT 2010

Hi Douglas

Douglas Guptill wrote:
> Hello Gus:
> On Mon, Jul 12, 2010 at 03:02:40PM -0400, Gus Correa wrote:
>> Hi Doug
>> Consider disk for:
>> A) swap space (say, if the user programs are large,
>> or you can't buy a lot of RAM, etc);
>> I wonder if swapping over NFS would be efficient for HPC.
>> Disk may be a simple and cost effective solution.
> We have bought enough RAM (6 GB /core) that will I hope prevent swapping.

Sure, of course swapping is a disaster for HPC, and for MPI.
Your memory configuration sounds great,
especially compared to my meager 2GB/core, the most we could afford.  :)

>> B) input/output data files that your application programs may require
>> (if they already work in stagein-stageout mode,
> Now there you have me.  What is stagein-stageout?

Old fashioned term for copying input files to the compute nodes before 
the program starts, then the output files back to the head node after 
the program ends.  You can still find this service in Torque/PBS,
maybe other resource managers, but it can also be done through scripts.
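
If you go the script route, a minimal sketch in Python could look like
the one below. The function name, the "*.out" output pattern and the
directory arguments are all made up for illustration; this is not how
Torque/PBS implements its own stagein/stageout service.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_staged(inputs, command, results_dir, output_pattern="*.out", scratch_root=None):
    """Stage in inputs to local scratch, run command there, stage out outputs.

    Every name here is an illustrative assumption, not a Torque/PBS feature.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    # Private working directory on the node's local disk (e.g. under /scratch).
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        work = Path(work)
        # Stage in: copy the input files from the shared file system.
        for src in inputs:
            shutil.copy2(src, work / Path(src).name)
        # Run the program with the local directory as its working directory.
        subprocess.run(command, cwd=work, check=True)
        # Stage out: copy results back before the scratch directory is removed.
        staged = []
        for out in sorted(work.glob(output_pattern)):
            shutil.copy2(out, results_dir / out.name)
            staged.append(out.name)
    return staged
```

The scratch directory is removed when the job step ends, so nothing
accumulates on the node's local disk.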

>> or if they do I/O so often that a NFS mounted file system
>> may get overwhelmed, hence reading/writing on local disk may be preferred).
> I am hoping to do that - write to local disk.  

Actually, we seldom do this here.
Most programs we run are ocean/atmosphere/climate, with other Earth 
Science applications also.
Since you are in oceanography (am I right?) I would guess you would
be running ocean models, and they tend to do a moderate amount of I/O,
or better, to have a moderate I/O-to-computation ratio.
Hence, they normally don't require local disk for I/O,
which can be done in a central NFS mounted directory.

We don't have a big cluster, so we use InfiniBand (in the past it was 
Myrinet) for MPI and Gigabit Ethernet for control and I/O.
We have a separate file server with a RAID array,
where the home directories and scratch file systems live,
and are NFS mounted on the nodes.
I think this setup is more or less standard for small clusters.

I mentioned local disk for scratch space because this was common
when Ethernet 100Mb/s was the interconnect, and would barely handle
MPI, so it was preferred to do I/O locally, and 'stagein/stageout' the
input and output files.
On the other hand, as per several postings in this and other mailing 
lists, some computational chemistry and genome sequencing programs
apparently do I/O so often that they cannot live without local disk,
or a more expensive parallel file system.

> Each node has a 1 TB
> disk, which I would like to split between the OS and user space.  

We have much less, 80GB or 250GB disk on compute nodes,
which is more than enough for the OS and the scratch space
(seldom used).
Somebody mentioned that you also need the local disk for /tmp,
besides possible (not desirable) swap.
And of course you can have local /scratch, if you want.

> How
> to do that is still an unsolved problem at this point.  
> The head node
> will have (6) 2 TB disks.

Have you considered a separate storage node, NAS, whatever, with RAID,
to hold home directories and scratch space, and mount them on the nodes
via NFS?
The head node can also play this role, hosting the storage.

Given your total investment, this may not be so expensive.
Since you have only a few users, you could even use the head node for 
this, to avoid extra cost.
Buy a decent Gigabit Ethernet switch (or switches), and connect this
storage to it via 10Gbit Ethernet card.  Most good switches have modules
for that.
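
A back-of-the-envelope sketch of why the faster server link matters,
using theoretical line rates only (1 Gbit/s = 125 MB/s). Protocol
overhead, NFS behavior and disk speed are ignored, so treat the numbers
as upper bounds. The node count of 16 is taken from your message; the
function names are made up.

```python
def line_rate_mb_per_s(gbit_per_s):
    """Theoretical line rate in MB/s (1 Gbit/s = 125 MB/s), ignoring overhead."""
    return gbit_per_s * 1000 / 8


def per_node_share(server_gbit, node_gbit, nodes):
    """Upper bound on MB/s per node when all nodes do I/O at the same time."""
    server = line_rate_mb_per_s(server_gbit)
    node = line_rate_mb_per_s(node_gbit)
    # A node can use neither more than its own link nor more than its share
    # of the server link.
    return min(node, server / nodes)


if __name__ == "__main__":
    for uplink in (1, 10):
        share = per_node_share(uplink, 1, 16)
        print(f"{uplink:>2} Gbit/s server link, 16 nodes: <= {share:.1f} MB/s per node")
```

With all 16 nodes writing at once, a Gigabit server link leaves each
node at most about 8 MB/s, while a 10Gbit link leaves about 78 MB/s.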

>> C) Would diskless scaling be a real big advantage for
>> a small/medium size cluster, say up to ~200 nodes?
> Good question.  The node count is 16 (not 124, as I said previously -
> brain fart - 124 is the core count), 

OK, with 16 nodes you could certainly centralize home and scratch
directories in a single server (say the head node) with RAID (say, 
RAID6), for better performance, and mount them on the nodes via NFS,
even on a Gigabit Ethernet network.
(I would suggest having one network for control & I/O, another for MPI).

I would rather put smaller disks on the nodes, save the money to buy
a decent RAID controller, a head node chassis with hot-swappable
disk bays, enterprise class SATA disks of 2TB, and you would have
a central storage in the head node with, say 16-24TB (nominal),
with RAID6, xfs file system, for /home, /scratch[1,2,3...],
all NFS mounted on the nodes.
Easier to administer than separate home directories for each user
on the nodes, and probably not noticeably
slower (from the user standpoint) than the local disks.
I suppose this is a very common setup.
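
For the capacity arithmetic, here is a small sketch. It assumes that
"nominal" means raw capacity and that all disks go into a single RAID6
set, which spends two disks' worth of space on parity. File system
overhead and the TB versus TiB difference are ignored.

```python
def raid6_usable_tb(disks, disk_tb):
    """Usable capacity of one RAID6 set: two disks' worth goes to parity."""
    if disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks - 2) * disk_tb


if __name__ == "__main__":
    for disks in (8, 12):
        nominal = disks * 2
        usable = raid6_usable_tb(disks, 2)
        print(f"{disks} x 2TB: {nominal}TB nominal, {usable}TB usable")
```

So 8 to 12 disks of 2TB give 16-24TB nominal and 12-20TB usable.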

You could still create local /scratch on the compute nodes,
for those users that like to read/write on local disk,
and perhaps have a cleanup cron script to wipe off the
excess of old local /scratch files.
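
A minimal sketch of such a cleanup script, to be called from cron on
each node. The /scratch path and the 30-day limit are assumptions, not
a recommendation. It removes only files and leaves the directories in
place.

```python
import os
import time
from pathlib import Path


def purge_old_files(root, max_age_days, now=None, dry_run=False):
    """Delete files under root that were not modified in max_age_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat judges a symlink by its own mtime, not its target's.
                if path.lstat().st_mtime < cutoff:
                    if not dry_run:
                        path.unlink()
                    removed.append(str(path))
            except FileNotFoundError:
                # The file vanished while scanning, e.g. a job cleaned up.
                pass
    return sorted(removed)


if __name__ == "__main__":
    # Hypothetical policy: list files in /scratch older than 30 days.
    for path in purge_old_files("/scratch", max_age_days=30, dry_run=True):
        print(path)
```

Run it with dry_run=True first, to see what would be deleted before
letting cron do it for real.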

> and seems to me just over the
> border of what can be easily maintained as separate, diskful installs.
> Our one user has expressed a preference for "refreshing" the nodes
> before a job runs.  By that, he means re-install the operating system.

I reinstall when I detect a problem.
Rocks (which you already declined to use :) ) reinstalls on any
hard reboot or power failure, assuming that those can lead to 
inconsistencies across the compute nodes.
This is default, but you can change that.
I think that even this is too much.

However, reinstalling before every new job starts
sounds like washing your hands before you strike any new key
on the keyboard. You can't write an email this way,
and you can't extract useful work from the cluster if you have to
reinstall the nodes so often.

Even rebooting the node before a job starts is already too much.
You can do it periodically, to refresh the system, but I have never
heard of anybody who does this before every job.

>> E) booting when the NFS root server is not reachable
>> Disks don't prevent one to keep a single image and distribute
>> it consistently across nodes, do they?
> I like that idea.

That has been working fine here and in many many places.

>> I guess there are old threads about this in the list archives.
> I looked in the beowulf archives, and only found very old (+years)
> articles.  Is there another archive I should be looking at?

In general, since many discussions in this list go astray,
the subject/title may have very little relation to the actual arguments
in the thread.
I am not criticizing this, I like it.
Some of the best discussions here started with a simple question
that was hijacked for a worthy cause,
and turned into a completely new dimension.

It is going to be hard to find anything searching the subject line.
You can try to search the message bodies with keywords like
"diskless", "ram disk", etc.
Google advanced search may help in this regard.
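
If you happen to have a local copy of an archive in mbox format (an
assumption; the file name and the function below are made up), a body
search can be sketched in Python with the standard mailbox module:

```python
import mailbox


def find_messages(mbox_path, keywords):
    """Return (subject, matched keywords) for messages whose body mentions any keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # Collect the text parts only; attachments are skipped.
            chunks = []
            for part in msg.walk():
                if part.get_content_maintype() == "text":
                    payload = part.get_payload(decode=True) or b""
                    chunks.append(payload.decode("utf-8", errors="replace"))
            body = "\n".join(chunks).lower()
            matched = [k for k in wanted if k in body]
            if matched:
                hits.append((msg["subject"], matched))
    finally:
        box.close()
    return hits
```

For example, find_messages("archive.mbox", ["diskless", "ram disk"])
lists the subjects of all messages that mention either term, whatever
the subject line says.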

It is unlikely that you will find much about diskless clusters
in the Rocks archive, as they are diskful clusters.
However, there may have been some discussions there too.

>> Just some thoughts.
> Much appreciated,
> Douglas.

Best of luck with your new cluster!
