[Beowulf] Heterogeneous, intermittent Beowulf cluster administration

Thu Sep 26 06:00:28 PDT 2013

Hi folks,

I have access to a bunch of machines (around 20) in our lab, each with its
own configuration, usually some combination of Core i5/i7 and
4GB/8GB/16GB RAM (the "heterogeneous" part), connected by a 24-port Cisco
switch with a reasonable backplane. They're end-user machines, but at the
current lab occupancy only a fraction of them are in use at any given time,
and which ones changes every day. They are all running Debian stable. I got
an idea: why not use the idle time to run some parallel simulations, instead
of using the university cluster?

The main problems now are:

1) System administration: for now I'm using clusterssh to
update/configure/install new software, but this gets very cumbersome: any
machine may be in use when I push a change, in which case I can't touch its
configuration, so I have to keep track of which machines have been updated
and which haven't. Maybe Puppet can help here?

2) Managing resources: knowing which machines are up and available without
having to shout across the lab, and knowing each machine's configuration so
I can allocate jobs that fit on it, etc. In extreme cases a machine needs
to be rebooted to run some Windows program.

3) Migrating jobs (the "intermittent" part): any machine can be claimed by a
user at any time, so if a parallel job is running on it I would have to
migrate that job to another machine, preferably without stopping the other
jobs. We are running mostly ROMS over MPI and some in-house simulations
that use a combination of OpenMP and MPI.
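
To make problem 2 a bit more concrete, here is a rough Python sketch of the
bookkeeping I currently do by hand. Everything in it is an assumption for
illustration: the host names, the core/RAM figures, the Host class and the
pick_hosts function are all made up, and the set of machines in use is
passed in by hand because I have no automatic way of detecting it yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str    # hypothetical host name
    cores: int   # cores as seen by the scheduler (assumed figure)
    ram_gb: int  # installed RAM in GB (assumed figure)


def pick_hosts(inventory, in_use, min_cores, min_ram_gb):
    """Return idle hosts meeting the per-node requirements, largest first."""
    fits = [
        h for h in inventory
        if h.name not in in_use
        and h.cores >= min_cores
        and h.ram_gb >= min_ram_gb
    ]
    # Prefer the machines with the most RAM, then the most cores.
    return sorted(fits, key=lambda h: (h.ram_gb, h.cores), reverse=True)


# Made-up inventory standing in for the ~20 lab machines.
INVENTORY = [
    Host("lab01", 4, 4),
    Host("lab02", 4, 8),
    Host("lab03", 8, 16),
    Host("lab04", 8, 8),
]

if __name__ == "__main__":
    # Someone is sitting at lab03; the job needs 4 cores and 8 GB per node.
    for h in pick_hosts(INVENTORY, in_use={"lab03"}, min_cores=4, min_ram_gb=8):
        print(h.name, h.cores, h.ram_gb)
```

What I would like is for the inventory and the in-use set to be filled in
automatically rather than typed in by me every morning.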

Does anyone have any experience or pointers on how to address these issues?
It seems a waste not to use those idle machines...

Ivan Marin
http://scholar.google.com.br/citations?user=faM0PCYAAAAJ