[Beowulf] cluster building advice?

Bill Broadley bill at cse.ucdavis.edu
Mon Sep 17 23:42:58 PDT 2012

On 09/16/2012 02:52 PM, Jeffrey Rossiter wrote:
> The intention is for the system to be
> used for scientific computation.

That doesn't narrow it down much.

> I am trying to decide on a linux
> distribution to use.

I suggest doing it yourself based on whatever popular Linux distro you
have experience with.  Assuming general Linux systems administration
proficiency, it's not particularly hard.  I'd suggest starting with
Scientific Linux (especially if your applications assume it) or
Debian/Ubuntu (which seem to have larger repositories).  I'd lean
towards Ubuntu if you are running new hardware, since Sandy Bridge (new
Intel) and Bulldozer (new AMD) seem to benefit from the latest kernels.

Then add:
* Cobbler (or functionally similar software) for PXE installing, network
  configuration, DHCP, DNS, MAC addresses, IP addresses, etc.
* Puppet/Chef for configuration management (everything post-install)
* Torque/Slurm for batch queue
* Environment Modules or similar to let users easily load the
  libraries/apps/environment they need in a reproducible way.
* Ganglia/Cacti/Munin for graphing resource utilization.
* /share/apps/<application name>-<version number> for anything you
  install that's not in the repositories.
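
To make the /share/apps layout above concrete, here is a rough Python
sketch that builds the install prefix and renders a minimal Tcl
modulefile for it.  The helper names, the example application, and the
bin/ and lib/ subdirectories are my assumptions for illustration; adjust
them to whatever your applications actually install.

```python
from pathlib import PurePosixPath

APPS_ROOT = PurePosixPath("/share/apps")  # layout from the list above


def install_prefix(name: str, version: str) -> PurePosixPath:
    """Return /share/apps/<application name>-<version number>."""
    return APPS_ROOT / f"{name}-{version}"


def modulefile(name: str, version: str) -> str:
    """Render a minimal Tcl modulefile for one application version.

    Assumes the application keeps binaries in bin/ and libraries in lib/.
    """
    prefix = install_prefix(name, version)
    lines = [
        "#%Module1.0",
        "# Print name/version on load so it ends up in the job output.",
        "if { [module-info mode load] } {",
        f'    puts stderr "loading {name}/{version}"',
        "}",
        f"prepend-path PATH {prefix}/bin",
        f"prepend-path LD_LIBRARY_PATH {prefix}/lib",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "exampleapp" is a hypothetical application name.
    print(modulefile("exampleapp", "1.2.3"))
```

Keeping one directory and one modulefile per version means old versions
stay available when a user needs to reproduce an earlier run.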

Get nodes to netboot, netinstall, and mount a shared /home.  Once users
start using it listen to their needs and adapt accordingly.
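
A quick sanity check that a node really did mount the shared /home can
be sketched like this.  It assumes /home is shared over NFS (I didn't
say which filesystem above, so treat that as a placeholder) and that you
feed it text in the /proc/mounts format; the function name and the
sample server name are made up.

```python
NETWORK_FS = {"nfs", "nfs4"}  # assumption: /home is shared over NFS


def home_is_shared(mounts_text: str, mount_point: str = "/home") -> bool:
    """Return True if mount_point is mounted with a network filesystem type.

    mounts_text uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fstype = fields[2]  # a later mount on the same point wins
    return fstype in NETWORK_FS


if __name__ == "__main__":
    # Hypothetical sample; on a real node read /proc/mounts instead.
    sample = (
        "/dev/sda1 / ext4 rw 0 0\n"
        "headnode:/export/home /home nfs4 rw 0 0\n"
    )
    print("shared /home:", home_is_shared(sample))
```

Something like this fits naturally into a node health check run before
the batch system is allowed to schedule jobs on the node.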

Some suggestions:
* If your campus has a standard username for each user, use it.
* Use ssh certs for user authentication; you really don't want your
  users' passwords, nor do they want to type them often.
* Start a wiki for documentation, and allow users to edit it.
* Have Environment Modules output the name/version on module load;
  it's much easier to figure out what a user has done when you have the
  exact info to reproduce a run in the run's output.
* Set hardware physically to always netboot, then depend on the
  central server to decide if it should be from local disk or a new
* Have compute nodes use host based ssh keys for auth (not user ssh
* Have head node use user based keys for login, do not allow
* Allow exactly one ssh key per user.
* Keep your configuration files in git or similar version control.  Or
  if managed by puppet/chef, keep puppet/chef files in version control.
* Strongly encourage any users writing source code to use a distributed
  version control system like git.
* Be very very clear on the status/lack of backups.  Be clear that loss
  of files will happen and it's only a matter of time.
* Use software RAID.
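
The "exactly one ssh key per user" rule is easy to audit.  A minimal
sketch, assuming you check the text of each user's authorized_keys file
(blank lines and lines starting with '#' are ignored, as sshd does); the
function names and the sample key are placeholders, not real data.

```python
def count_keys(authorized_keys_text: str) -> int:
    """Count key entries: non-blank lines that are not comments."""
    count = 0
    for line in authorized_keys_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def has_exactly_one_key(authorized_keys_text: str) -> bool:
    """Return True if the file content holds exactly one key entry."""
    return count_keys(authorized_keys_text) == 1


if __name__ == "__main__":
    # Placeholder content; a real audit would read each user's file.
    sample = "# laptop key\nssh-rsa AAAAplaceholder user@example\n"
    print("one key:", has_exactly_one_key(sample))
```

Run from cron or from your configuration management, this can flag
accounts that have quietly grown extra keys.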

> Does it matter all that much?

Not particularly.  Random commercial software seems to assume RHEL-based
distros.  Ubuntu/Debian seem to have the largest repositories (read
that as the most likely to have a user request handled by apt-get install).

> Any advice would be
> greatly appreciated.

You didn't mention your current experience; if the above sounds daunting,
then start with Warewulf.
