[Beowulf] While the knives are out... Wulf Keepers

Tony Travis ajt at rri.sari.ac.uk
Mon Aug 21 10:43:11 PDT 2006

Mike Davis wrote:
> [...]
> For the most part, I think that if a cluster is run correctly, it is an 
> appliance for the scientists. Their job is to produce research, mine is 
> to manage clusters and smp machines.

Hello, Mike.

I think problems can occur when you enforce such a strict demarcation 
between your role and the role of the scientists you support: if 
communication between you and the scientists breaks down and you do not 
understand what they want to do, then you cannot support their work 
effectively. The bottom line for me is that it is the objective of the 
organisation as a whole to produce research, and your role within the 
organisation is to facilitate the work of the scientists who do it.

> A problem that sometimes crops up is that these days, everyone thinks 
> that they can manage a cluster (or large smp for that matter), because 
> they have a linux box or maybe a 4-1p nodes at their house. Sometimes 
> its a real issue getting these people to understand that managing a 
> machine for 1 person and managing it for 5,50,500 are entirely different.

I'm a scientist with a Linux box at my house: I also built and manage a 
small (64-1p node) openMosix Beowulf cluster for bioinformatics work at 
RRI and for the bioinformatics/mathematical work of our sister 
organisation BioSS. I don't think I'm exceptional in doing this, but I 
do think that having a Linux box at home has been very useful to me in 
gaining the experience I needed to manage our Beowulf cluster.

Not 'everyone' like me is as stupid or naive as you imply. I have the 
support of an excellent IT department and an electronics workshop who 
talk to me and understand very well what I want to do with the Beowulf. 
We have about 400 user accounts, which are registered and managed by IT 
centrally. I just enable NIS. The IT department also manage the central 
filers where precious data files are stored. I manage 3.2 TB of local 
RAID on the Beowulf. In my opinion this type of cooperation is a lot 
more effective than strict job demarcation...

> For example, on friday, one of our applications analysts wanted to 
> upgrade a piece of software on one of the clusters. He didn't know what 
> it would affect (libraries, other installed software, users already 
> using that software). After a bit of investigation it turned out that 
> the PI in question could use the version already installed (which is 
> about 6 months old).

Seems to me that it would be straightforward to know this if you use a 
package management system like apt or rpm, which keeps track of what's 
installed and what the dependencies are. However, I also think that it's 
quite right that you should know more about this than him. In an ideal 
world, you should both make the decision about what to do on a rational 
basis. I doubt that he asked you to do it for no reason at all.
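
To illustrate the kind of question a package manager's dependency 
tracking can answer ("what else is affected if I upgrade this?"), here 
is a minimal Python sketch. It is not how apt or rpm work internally: 
the DEPENDS table and the package names in it are invented for the 
example, and on a real system the same information would come from the 
package database (for instance via `apt-cache rdepends` or 
`rpm -q --whatrequires`).

```python
from collections import deque

# Hypothetical dependency table: package -> packages it depends on.
# On a real system this would come from the apt or rpm database.
DEPENDS = {
    "blast": ["libc", "zlib"],
    "clustalw": ["libc"],
    "emboss": ["libc", "zlib", "libpng"],
    "libpng": ["zlib", "libc"],
    "zlib": ["libc"],
    "libc": [],
}


def reverse_dependencies(package, depends=DEPENDS):
    """Return every package that depends, directly or indirectly, on `package`."""
    # Invert the table: package -> set of packages that require it.
    required_by = {name: set() for name in depends}
    for name, deps in depends.items():
        for dep in deps:
            required_by.setdefault(dep, set()).add(name)

    # Breadth-first walk over the inverted table.
    affected = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for dependant in required_by.get(current, ()):
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return sorted(affected)


if __name__ == "__main__":
    # Packages worth checking before upgrading the (hypothetical) zlib.
    print(reverse_dependencies("zlib"))
```

A list like this is only the starting point for the conversation: it 
says which installed software might be affected, not whether the 
users of that software would actually notice.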

> I guess that I'm rather "old school" but upgrades have to be for a 
> reason other than there's a new version. Maybe they are needed for 
> features, or security, or stability. But IMO, they are seldom needed 
> because they are new.

Most of the problems I've come across like this arise from a lack of 
communication. I believe it's quite important for you to know why he 
wanted to do the upgrade, and for you to inform him about any problems 
or conflicts of interest that would result from the upgrade. Presumably, 
that is exactly what you did. My only complaint here is the impression 
you give that scientists like me want to upgrade software just for the 
sake of doing it. Please ask yourself: why did the upstream maintainers 
release a new version? Was it just for the sake of upgrading it?

I keep our software up-to-date because I want to ensure that all known 
bug fixes and security upgrades are applied. I don't upgrade packages 
just because they are new. I rely on the package repository maintainers 
to decide 
when software should be upgraded, but I also 'pin' critical packages 
that I know are required to be held at a particular revision locally for 
some reason. I do advocate upgrading unless there is a reason *not* to 
do it. You seem to recommend the opposite: not upgrading unless there 
*is* a reason to do it. I wonder which strategy results in less work?
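
The "upgrade unless pinned" policy can be sketched in a few lines of 
Python. This is only an illustration of the decision rule, not of how 
apt or rpm implement pinning: the function name, the package names, the 
version strings and the pin reason are all hypothetical, and the sketch 
assumes the repository's candidate version is simply taken on trust 
rather than compared numerically.

```python
def plan_upgrades(installed, available, pinned):
    """Split pending upgrades into those to apply and those held back.

    installed: {package: installed version}
    available: {package: candidate version offered by the repository}
    pinned:    {package: reason it must stay at its current revision}
    Returns (plan, held): plan maps package -> (old, new); held is a
    sorted list of pinned packages that have a newer candidate.
    """
    plan = {}
    held = []
    for name, current in installed.items():
        candidate = available.get(name)
        if candidate is None or candidate == current:
            continue  # nothing newer on offer
        if name in pinned:
            held.append(name)  # there is a stated reason *not* to upgrade
        else:
            plan[name] = (current, candidate)
    return plan, sorted(held)


if __name__ == "__main__":
    # All names and versions below are made up for illustration.
    installed = {"kernel-image": "2.4.26", "blast": "2.2.13", "zlib": "1.2.3"}
    available = {"kernel-image": "2.6.17", "blast": "2.2.14", "zlib": "1.2.3"}
    pinned = {"kernel-image": "cluster patches built against this revision"}

    plan, held = plan_upgrades(installed, available, pinned)
    print("upgrade:", plan)
    print("held:", held)
```

Under this rule the only per-package decisions that need recording are 
the exceptions, each with its reason; the opposite policy needs a 
recorded reason for every upgrade instead.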

Best wishes,

Dr. A.J.Travis,                     |  mailto:ajt at rri.sari.ac.uk
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
