[Beowulf] non-stop computing

Christopher Samuel samuel at unimelb.edu.au
Tue Oct 25 21:07:05 PDT 2016

On 26/10/16 14:45, John Hanks wrote:

> I'd suggest making NFS mounts hard, so processes can recover from an NFS
> server reboot.

...plus set the NFS fsid for each export on the server side, so the
exports come back reproducibly each time...
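
A rough sketch of how both settings could be audited, assuming the
Linux /proc/mounts layout on clients and the usual /etc/exports syntax
on the server. The function names, host names, paths and subnet are
hypothetical, and the exports parsing is deliberately simple: it
ignores line continuations and quoted paths.

```python
NFS_TYPES = {"nfs", "nfs4"}


def soft_nfs_mounts(mounts_text):
    """Return mount points of NFS mounts that carry the 'soft' option.

    Expects text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in NFS_TYPES:
            continue
        if "soft" in fields[3].split(","):
            flagged.append(fields[1])
    return flagged


def exports_without_fsid(exports_text):
    """Return exported paths where some client entry lacks an fsid= option."""
    flagged = []
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path, *clients = line.split()
        for client in clients:
            # Options sit in parentheses after the client, e.g. host(rw,fsid=1)
            opts = client.partition("(")[2].rstrip(")").split(",")
            if not any(opt.startswith("fsid=") for opt in opts):
                flagged.append(path)
                break
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real system.
    sample_mounts = (
        "fileserver:/export/home /home nfs4 rw,relatime,hard,proto=tcp 0 0\n"
        "fileserver:/export/scratch /scratch nfs rw,soft,timeo=30 0 0\n"
    )
    sample_exports = (
        "/export/home 10.0.0.0/24(rw,sync,fsid=1)\n"
        "/export/scratch 10.0.0.0/24(rw,sync)\n"
    )
    print("soft NFS mounts:", soft_nfs_mounts(sample_mounts))
    print("exports without fsid:", exports_without_fsid(sample_exports))
```

On a real client the first function could be fed the contents of
/proc/mounts, and on the server the second could be fed /etc/exports.
Anything either one reports is a candidate for the `hard` mount option
or an explicit `fsid=` export option respectively.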

PS: I endorse what John said (now that I've finished laughing). I'd also
suggest making sure you've at least got ECC memory and RAID, as memory
and disks are the two parts that tend to go bad.  When we had clusters
with disks in the compute nodes, the disks were the most frequent
failures; now that we run diskless nodes, it's memory DIMMs. :-)

All the best,
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
