[Beowulf] Heterogeneous, intermittent beowulf cluster administration

Thu Sep 26 06:25:42 PDT 2013

Hi, Ivan.

I'm a nay-sayer in this kind of scenario.  I believe your staff time,
and the time of your lab users, is too valuable to spend on
dual-classing desktop lab machines.

If your lab is underutilized, I would spend staff time on figuring out
why and how to make the lab more effective as a destination for
prospective users.  If you need more cluster compute time, I would
invest funds into additional compute nodes, not into micromanaging the
lab machines.  Skilled sysadmin time is valuable.

Let's also consider the cost of electricity and cooling.  I doubt that
the lab machines and climate control are the most efficient in terms of
full-throttle HPC/HTC computing.  Electricity and cooling should be at
the top of your list for cost-effective and green computing.  I would
instead have the lab machines suspend/sleep until they require automated
patching or desktop login.

Also, let's consider the user experience.  Cluster users will see jobs
killed and restarted; they will not be happy.  Lab users will see slow
and/or hung machines; they will stop coming to the lab.

Don't get me wrong, this is an interesting project, but one riddled with
pitfalls.  If the job is to support a computing lab, that should be goal
number one.

Cheers.

On Thu, Sep 26, 2013 at 10:00:28AM -0300, Ivan M wrote:
> Hi folks,
> 
> I have access to a bunch (around 20) of machines in our lab, each one with
> a particular configuration, usually some combination of Core i5/i7 and
> 4GB/8GB/16GB RAM (the "heterogeneous" part), connected by a 24-port Cisco
> switch with a reasonable backplane. They're end-user machines, but at the
> current lab occupancy only a fraction of them are in use at any given
> time, and which ones are in use changes every day. They are all running
> Debian stable. I got an idea: why not use the downtime to run some
> parallel simulations, instead of using the university cluster?
> 
> The main problems now are:
> 
> 1) System administration: for now I'm using clusterssh to
> update/configure/install new software, but this can be very cumbersome:
> a machine may be in use when I need to change its configuration, so I
> have to keep track of which ones have been changed. Maybe puppet can
> help here?
> 
> 2) Managing resources: knowing which machine is up and available without
> having to shout, and knowing the available configuration to allocate jobs
> that can fit in that particular machine, etc. There are extreme cases when
> the machine needs to be rebooted to run some Windows program.
> 
> 3) Migrating jobs (the intermittent part): any machine can be requested by a
> user at any time, so if I have a parallel job running I would have to
> migrate the job to another machine, preferably without stopping the other
> jobs. We are running mostly ROMS over MPI and some in-house simulations
> that use a combination of OpenMP and MPI.
> 
> Does anyone have any experience or pointers on how to address these issues?
> It seems a waste not to use those idle machines...
> 
> Ivan Marin
> http://scholar.google.com.br/citations?user=faM0PCYAAAAJ

-- 
Gavin W. Burris
Senior IT Project Leader
Research Computing
Wharton Computing
University of Pennsylvania