[Beowulf] Please help to setup Beowulf

Chris Dagdigian dag at sonsorol.org
Wed Feb 18 03:42:54 PST 2009

Hi Mark,

On Feb 18, 2009, at 1:32 AM, Mark Hahn wrote:

>> searches. Without array task scheduling this would require 500,000  
>> individual job submissions. The fact that I never met a serious PBS  
>> shop that had not
> what's wrong with 500k job submissions?  to me, the existence of
> "array jobs" is an admission that the job/queueing system is
> inefficient.  if you're saying that the issue is not per-job overhead
> of submission, but rather that jobs are too short, well, I think
> that's a user problem.  I think it's entirely reasonable to require
> user jobs to consume some minimum cpu time (say, few minutes).

Job length can sometimes be an issue but training users to make sure  
their jobs at least take a few minutes to complete is pretty easy.  
It's really an issue of having really large batch or serial workflows  
to get through. These are people who are using a cluster not because  
they are computer scientists or people interested in parallel coding  
methods - they are scientists trying to get a ton of work done in a  
reasonable amount of time and with minimal effort.

500K job submissions would put a non-trivial load on just about any  
scheduler, especially a few years ago. The act of actually submitting  
the 500K jobs can be a pain from a usage perspective. Not to mention  
that the user/system now has 500K individual jobIDs to track. With an  
array task I get a single jobID that I can use to track status of all  
the sub tasks and I can kill the job with a single bkill/qdel command.  
It's also a single bsub/qsub submission command to get the ball rolling.

From a user, usability and scheduler efficiency perspective, array  
jobs are a massive win for large sequential workflows, especially  
those that consist of running the same application over and over again  
with only minor differences in command line arguments or input files.
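
To make the "same application, different input file" pattern concrete,
here is a rough sketch of what each array task might run. It assumes the
scheduler exports the task index in an environment variable
(SGE_TASK_ID under Grid Engine, LSB_JOBINDEX under LSF). The file names
and both helper functions are made up for illustration.

```python
import os


def task_id_from_env(env=os.environ):
    """Return the 1-based array task index, or None outside an array job."""
    # Assumption: Grid Engine sets SGE_TASK_ID and LSF sets LSB_JOBINDEX.
    # Non-array jobs see a non-numeric value or 0, which we treat as "none".
    for var in ("SGE_TASK_ID", "LSB_JOBINDEX"):
        value = env.get(var, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return None


def input_for_task(task_id, inputs):
    """Map a 1-based task index to one entry of the input list."""
    if not 1 <= task_id <= len(inputs):
        raise IndexError(f"task {task_id} outside 1..{len(inputs)}")
    return inputs[task_id - 1]


if __name__ == "__main__":
    # Hypothetical input set: one query file per array task.
    queries = [f"query_{i:06d}.fasta" for i in range(1, 500001)]
    tid = task_id_from_env()
    if tid is not None:
        print(f"task {tid} would process {input_for_task(tid, queries)}")
```

Submitted once as an array (for example with "qsub -t 1-500000" under
Grid Engine, or a "name[1-500000]" job name under LSF), every task runs
this same script and picks its own input, all under a single jobID.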

Array tasks may be distasteful from a technical or elegance  
perspective but they are a big usability and throughput win in the  
real world, especially for end users interested in productivity.

>> - Policy and resource allocation features are very important to  
>> people deploying these systems
> so I'm curious what that means.  things like "dept A needs to be
> guaranteed N cpus, but dept B gets to use whatever is left over"?  or
> node choice based on amount of free disk?  I don't really see why
> these sorts of issues would be less important to more parallel
> environments.

Resource allocation policies and the tools to implement such are  
extremely important and are often a significant part of the selection  
criteria when trying to figure out what distributed resource manager  
to use. Way more important than anything involving parallel  
environments simply due to the fact that there are relatively few MPI- 
aware applications in our field.

FIFO scheduling or rewarding the dude who got to work earliest and  
submitted 500K jobs first is not the answer. People need to be able  
to let scientific or business priorities drive and influence how  
cluster resources are allocated among competing users, projects and  
departments. For some people it may be as simple as carving up the  
cluster on a percentage basis among 4 departments and for others the  
key criterion may be the ease of integration with an external flexLM  
license server.

The majority may just want simple fairshare-by-user scheduling  
behavior without having to drop in some external metascheduler or  
third party product.
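
As a toy illustration of the difference between FIFO and
fairshare-by-user, the sketch below dispatches pending jobs by always
picking the user with the least usage relative to their share. This is
not how LSF or Grid Engine actually compute priorities; the function
name, the one-unit-per-job cost and the tie-break by user name are all
assumptions made to keep the example small.

```python
from collections import deque


def fairshare_order(pending, usage, shares):
    """Return a dispatch order favoring users with the least usage per share.

    pending: dict user -> list of job names in submission order
    usage:   dict user -> resource already consumed (arbitrary units)
    shares:  dict user -> relative entitlement (missing users get 1)
    """
    queues = {u: deque(jobs) for u, jobs in pending.items() if jobs}
    used = dict(usage)
    order = []
    while queues:
        # Lowest usage-to-share ratio goes next; ties broken by user name.
        user = min(queues, key=lambda u: (used.get(u, 0) / shares.get(u, 1), u))
        order.append((user, queues[user].popleft()))
        used[user] = used.get(user, 0) + 1  # assumption: each job costs one unit
        if not queues[user]:
            del queues[user]
    return order


if __name__ == "__main__":
    pending = {"alice": ["a1", "a2", "a3"], "bob": ["b1"]}
    for user, job in fairshare_order(pending, usage={}, shares={}):
        print(user, job)
```

With equal shares, bob's single job runs second instead of waiting
behind all of alice's jobs, which is the behavior plain FIFO would give
if alice had submitted first.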

The quality and capability of the knobs for adjusting these sorts of  
behavior is important in commercial environments and in places where  
the cluster has been sold as a shared resource for groups that may  
have competing needs for resources.

Platform LSF is excellent at this sort of thing and among the freely  
available offerings Grid Engine had good flexibility and capability  
out of the box without requiring additional plugin products. Just  
another reason why there was SGE uptake in our field over the years.  
Now, since SGE 6.1 added the resource quota framework, SGE is quite  
powerful in this regard.

>> - Storage speed is often more important than network speed or  
>> latency in many cases
> which makes me wonder: do bio types consider using map-reduce-like
> frameworks?  that is, basically distributing the work to the data.

map-reduce gets added to the same bin as hardware based FPGA  
acceleration, GPU computing and other newish techniques. Modern  
algorithms and new efforts by people with real scientific software  
development and HPC skills are all looking at these techniques and  
you'll see slow uptake over time.

Real progress is being made; see, for example, Joe's efforts to get  
HMMER running on GPUs these days.

This does not quite address the older legacy codes though. You have to  
remember that our core applications were written in the early 90s by  
biologists who had to teach themselves to code simply to get their  
science done. Few if any people had real skills in HPC software  
development or high-efficiency coding. These are the people (like  
myself) who started using Perl on large-memory 64-bit systems simply  
because Perl was loose enough to let us do dumb things like read a  
full genome into a string and run regex operations on it.

If you approached a biologist and said "I re-wrote your blast  
application to use map-reduce!", most would turn around and ask you  
for the citation of your peer reviewed paper where you published and  
proved that your map-reduce version produces identical results and  
output (including reproducing known bugs) to the old inefficient code  
that it was meant to replace.

There is a huge resistance to improved/updated codes simply due to the  
fact that the scientists want to use the exact method cited in the  
paper that they are trying to reproduce. It's been a hassle to deal  
with but the block is real - just ask all of the FPGA hardware  
acceleration box makers out there (those that still exist).

