[Beowulf] While the knives are out... Wulf Keepers

Mike Davis jmdavis1 at vcu.edu
Mon Aug 21 13:42:16 PDT 2006

In those famous words from "Cool Hand Luke," "What we have here is a 
failure to communicate." For my role in that failure I apologize.

Tony Travis wrote:

> I think problems can occur when you enforce such a strict demarcation 
> boundary between your role and the role of the scientists you support: 
> If communication between you and the scientists breaks down and you do 
> not understand what they want to do then you cannot support their work 
> effectively. The bottom line for me is that it is the objective of the 
> organisation as a whole to produce research, and your role within the 
> organisation is to facilitate the work of the scientists who do it.

Agreed. The point that I'm making is that managing a resource for one 
individual or group is different than managing it for multiple 
individuals or groups. It is my role within the organization to support 
the work of ALL of the scientists. That means, for instance, that we 
use a batch system. It means that we have limits on the number of jobs 
anyone can submit. It does not mean that I don't listen to what other 
people need.

> I'm a scientist with a Linux box at my house: I also built and manage a 
> small (64-1p node) openMosix Beowulf cluster for bioinformatics work at 
> RRI and for the bioinformatics/mathematical work of our sister 
> organisation BioSS. I don't think I'm exceptional in doing this, but I 
> do think that having a Linux box at home has been very useful to me in 
> gaining the experience I needed to manage our Beowulf cluster.
> Not 'everyone' like me is as stupid or naive as you imply. I have the 
> support of an excellent IT department and an electronics workshop who 
> talk to me and understand very well what I want to do with the Beowulf. 
> We have about 400 user accounts, which are registered and managed by IT 
> centrally. I just enable NIS. The IT department also manage the central 
> filers where precious data files are stored. I manage 3.2 TB of local 
> RAID on the Beowulf. In my opinion this type of cooperation is a lot 
> more effective than strict job demarcation...

For the record, I implied no stupidity and no naivete. I don't manage a 
machine for an individual or even a department. I manage it for the 
institution. I work with individuals to meet their needs. Meeting these 
needs has helped us to grow our resources from tens of processors to 
hundreds. It has provided researchers the resources to win millions of 
dollars in grants that we couldn't have competed for without the 
cooperation that we've built.

> Seems to me that it would be straight-forward to know this if you use a 
> package management system like apt or rpm, which keeps track of what's 
> installed and what the dependencies are. However, I also think that it's 
> quite right that you should know more about this than him. In an ideal 
> world, you should both make the decision about what to do on a rational 
> basis. I doubt that he asked you to do it for no reason at all.

I don't work with much scientific software that is available in rpm 
form. Some of it is binary. But most is compiled specifically for our 
machines using high-performance compilers from Absoft, Intel, PathScale 
or the Portland Group. So apt and rpm don't solve the problem. After 
discussing the issue with the PI, we discovered that she didn't need the 
most recent version (as I think that I noted in my original reply). The 
version that was installed would work.

> Most of the problems I've come across like this arise from a lack of 
> communication. I believe it's quite important for you to know why he 
> wanted to do the upgrade, and for you to inform him about any problems 
> or conflicts of interest that would result from the upgrade. Presumably, 
> that is exactly what you did. My only complaint here is the impression 
> you give that scientists like me want to upgrade software just for the 
> sake of doing it. Please ask yourself why did the upstream maintainers 
> release a new version?  Was it just for the sake of upgrading it?

The person who wanted the upgrade was not the PI; it was a member of our 
staff. Once I had asked him to research the needed changes and had talked 
with the PI, it turned out that the upgrade was not necessary.

> I do advocate upgrading unless there is a reason *not* to 
> do it. You seem to recommend the opposite of not upgrading unless there 
> *is* a reason to do it. I wonder which strategy results in less work?

Finally, in general, the uptime of my clusters is measured in years. My 
original cluster, purchased from Paralogic, ran with less than 3 hours of 
downtime in 5 years. I think that works out to about 99.993% uptime or 
better.
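
A quick back-of-the-envelope check of the uptime arithmetic, as a minimal sketch. It assumes 365-day years (leap days ignored) and treats the full 3 hours as the worst case; the function name `uptime_percent` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: leap days ignored


def uptime_percent(downtime_hours: float, years: float) -> float:
    """Return uptime as a percentage of total wall-clock time."""
    total_hours = years * HOURS_PER_YEAR
    return 100.0 * (1.0 - downtime_hours / total_hours)


# Worst case for "less than 3 hours in 5 years": exactly 3 hours down.
worst_case = uptime_percent(3, 5)
print(f"{worst_case:.3f}%")  # prints 99.993%
```

Since the actual downtime was under 3 hours, the real figure is at least this value, which is roughly "four nines".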
