[Beowulf] Introduction and question
Bill Broadley
bill at cse.ucdavis.edu
Thu Feb 28 00:41:57 PST 2019
Yes you belong! Welcome to the list.
There are many different ways to run a cluster, but my recommendations are:
* Make the clusters as identical as possible.
* Set up ansible roles for the head node, NAS, and compute nodes.
* Avoid installing/fixing things by hand with vi/apt-get/dpkg/yum/dnf; use ansible
whenever possible. Eventually you'll have to reinstall and it's painful
to manually apply months of changes.
* Use environment modules; never have users manually run
"export LD_LIBRARY_PATH=..."
* Use slurm partitions to keep significantly different hardware in different
pools so users have an easy time of knowing what to run where.
* Set ALL compute nodes to netboot, then configure cobbler to tell them to
boot from local disk normally. That way you don't have to manually power on,
wait for the BIOS, and select netboot 30 times to install 30 nodes.
* Enable and configure IPMI, at least for power on/off (if available). Write
wrapper scripts called pon and poff or similar.
* Keep working until cobbler+ansible can reinstall a compute node unattended:
it should power off, enable netboot, power on, pxe install, reboot, run ansible,
enable automount, and run slurmd. Write wrapper scripts for netboot enable
and netboot disable; I used bon and boff.
The above isn't the only way to do it, but it's a reasonable starting point.
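
To make the wrapper idea concrete, here is a minimal sketch of pon/poff/bon/boff in Python. It is not what I actually run, just one way it could look. Assumptions: each node's BMC answers at "<node>-ipmi", the IPMI user is "admin", the password sits in /root/.ipmi_pass, and cobbler already has a system record named after each node. All of those are placeholders to adapt.

```python
import shlex
import subprocess

# Assumptions (placeholders, not from any real site): BMC hostname suffix,
# IPMI user name, and the file holding the IPMI password.
BMC_SUFFIX = "-ipmi"
IPMI_USER = "admin"
IPMI_PASSWORD_FILE = "/root/.ipmi_pass"


def ipmi_power(node, state):
    """Build the ipmitool command that powers a node on or off."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    return ["ipmitool", "-I", "lanplus", "-H", node + BMC_SUFFIX,
            "-U", IPMI_USER, "-f", IPMI_PASSWORD_FILE,
            "chassis", "power", state]


def cobbler_netboot(node, enabled):
    """Build the cobbler command that turns netboot on or off for a node."""
    value = "true" if enabled else "false"
    return ["cobbler", "system", "edit",
            "--name=" + node, "--netboot-enabled=" + value]


# The four wrapper names from the list above, mapped to command builders.
WRAPPERS = {
    "pon": lambda node: ipmi_power(node, "on"),
    "poff": lambda node: ipmi_power(node, "off"),
    "bon": lambda node: cobbler_netboot(node, True),
    "boff": lambda node: cobbler_netboot(node, False),
}


def run(wrapper, nodes, dry_run=True):
    """Print (dry run) or execute the chosen wrapper for each node."""
    commands = [WRAPPERS[wrapper](node) for node in nodes]
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands
```

Calling run("pon", ["node01", "node02"]) prints the two ipmitool commands without touching any hardware; pass dry_run=False once the printed commands look right for your site.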
It's really nice for users to just be able to browse apps and say "module load
<app>". As a sysadmin it's nice to be able to reinstall any wonky nodes and not
have to play the "what X things do I need to do before it can run jobs" game.
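
One way to avoid that game is to write the reinstall sequence down as an ordered plan. The sketch below only builds the plan and does not run anything. Assumptions: the same hypothetical BMC naming and credentials file as any IPMI wrapper would need, a playbook called site.yml whose roles enable automount and start slurmd, and a cobbler setup that turns netboot back off once the install finishes.

```python
def reinstall_plan(node, playbook="site.yml", bmc_suffix="-ipmi"):
    """Return the reinstall steps for one node as (description, argv) pairs.

    argv is None for steps that are a wait rather than a command.
    The playbook name, BMC suffix, user, and password file are hypothetical.
    """
    ipmi = ["ipmitool", "-I", "lanplus", "-H", node + bmc_suffix,
            "-U", "admin", "-f", "/root/.ipmi_pass"]
    return [
        ("power off", ipmi + ["chassis", "power", "off"]),
        ("enable netboot", ["cobbler", "system", "edit",
                            "--name=" + node, "--netboot-enabled=true"]),
        ("power on", ipmi + ["chassis", "power", "on"]),
        # Assumption: cobbler disables netboot after the install, so the
        # reboot that follows comes up from local disk.
        ("wait for pxe install and reboot", None),
        # Assumption: the playbook's roles enable automount and start slurmd.
        ("run ansible", ["ansible-playbook", playbook, "--limit", node]),
    ]


if __name__ == "__main__":
    for description, argv in reinstall_plan("node07"):
        print(description + ":", " ".join(argv) if argv else "(wait)")
```

Keeping the steps as data makes it easy to review the order, print it as a checklist, or later hand each argv to subprocess.run.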
Good luck, have fun, and keep us posted.