[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)

Christopher Samuel chris at csamuel.org
Thu May 17 05:18:17 PDT 2018

On 14/05/18 21:53, Michael Di Domenico wrote:

> Can you expand on "image stored on lustre" part?  I'm pretty sure i 
> understand the gist, but i'd like to know more.

I didn't set this part of the system up, but we have a local chroot
on the management node's disk that we add, modify and remove things
in. When we're happy with it, a script syncs it out to the master
copy on a Lustre filesystem.
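
A minimal sketch of what such a sync step might look like, assuming
rsync is the tool doing the copy (the post does not say which tool the
script uses). The function name and both paths are hypothetical
placeholders, and the command defaults to a dry run so nothing is
changed until that is switched off.

```python
import shlex


def build_sync_command(chroot_dir, master_dir, dry_run=True):
    """Return an rsync argv that mirrors chroot_dir onto master_dir.

    Assumption: rsync is used for the sync; the real script may differ.
    """
    # -a archive, -H hard links, -A ACLs, -X xattrs; --numeric-ids keeps
    # uid/gid numbers as-is; --delete removes files dropped from the chroot.
    cmd = ["rsync", "-aHAX", "--numeric-ids", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on the source copies its contents, not the directory.
    cmd += [chroot_dir.rstrip("/") + "/", master_dir.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = build_sync_command("/srv/images/compute-chroot",
                              "/lustre/images/compute-master")
    print(shlex.join(argv))
```

Passing the result to subprocess.run(argv, check=True) with
dry_run=False would perform the actual sync.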

The compute nodes boot a RHEL7 kernel with a custom initrd that
includes the necessary OPA and Lustre kernel modules and config
to get the networking working and access the Lustre filesystem.
The kernel then pivots its root filesystem from the initrd to
the master copy on Lustre via overlayfs, so the compute node
sees it as read/write but cannot modify the master (the master
is the read-only lower layer in overlayfs).
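
To illustrate the overlayfs arrangement described above, here is a
rough sketch that builds the mount command for a read-only lower layer
with a writable upper layer. The function name and all four paths are
hypothetical. Putting the upper and work directories on a node-local
tmpfs is an assumption, not something the post states.

```python
def overlay_mount_command(lower, upper, work, target):
    """Return a mount argv for an overlay with a read-only lower layer.

    lower  : read-only image (here, the master copy on Lustre)
    upper  : writable layer that receives every change
    work   : overlayfs scratch dir, on the same filesystem as upper
    target : where the merged, read/write view appears
    """
    for path in (lower, upper, work, target):
        # Commas separate mount options, so reject them rather than escape.
        if "," in path:
            raise ValueError(f"comma not supported in path: {path!r}")
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(overlay_mount_command(
        "/lustre/images/compute-master",  # never written to
        "/run/overlay/upper",             # assumed to live on tmpfs
        "/run/overlay/work",
        "/sysroot")))
```

Writes on the compute node land in the upper directory, so the lower
directory on Lustre is never modified.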

It's more complicated than that, but that's the gist.

Does that help?

All the best!
  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
