[Beowulf] Please help to setup Beowulf

Joe Landman landman at scalableinformatics.com
Wed Feb 18 05:32:30 PST 2009

I didn't respond yesterday ... been ridiculously busy ... really :(

Chris Dagdigian wrote:

briefly on the "FUD":  As one of the users/admin types who watched jobs 
go into an unkillable (without power-cycle) state, I can assure you that 
the problems with PBS (at the time) were legendary and not FUD.  More in 
a moment, but Chris's other comments are spot on.  Every lab had their 
own tweaked version.  Heck, we at the time (I was an SGI-er then) had 
our own drop of code and I/we had a few patches we used.

The black hole was a result of the design of the PBS moms, which 
opened a socket for each and every job.  If you had 1k jobs waiting in 
queue, that was a problem.  At 4k jobs you had to rebuild the 
Linux kernel.  Again, not fun in that time period.  This was a design 
flaw in PBS that persisted until at least recent times.
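
To make the socket-per-job arithmetic concrete, here is a minimal sketch. 
It assumes the limit being hit is the file descriptor limit (each open 
socket consumes one descriptor); that is my reading, not something taken 
from the PBS source.  The function name and the `reserved` headroom are 
made up for illustration.

```python
import resource


def sockets_fit(queued_jobs: int, fd_limit: int, reserved: int = 64) -> bool:
    """Return True if one descriptor per queued job fits under fd_limit.

    reserved is a hypothetical allowance for logs, pipes, and listeners.
    """
    return queued_jobs + reserved <= fd_limit


if __name__ == "__main__":
    # Soft limit on open descriptors for this process (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("no soft descriptor limit on this process")
    else:
        for jobs in (1_000, 4_000, 100_000):
            verdict = "fits" if sockets_fit(jobs, soft) else "does not fit"
            print(f"{jobs} jobs at one socket each: {verdict} under limit {soft}")
```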

CT-BLAST, the code I wrote at SGI (which became SGI-GenomeCluster for 
the short while as a product, and parts of it either became or 
influenced HTC-BLAST) to turn BLAST into a cluster-ized app took input 
query sets, broke them up, and launched jobs.  Many many jobs.  Tens to 
hundreds of thousands of jobs in some cases.
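
The "broke them up" step can be sketched roughly as follows.  This is 
not the CT-BLAST code, just an illustration of splitting a FASTA query 
set into fixed-size chunks, one chunk per job.  The function names and 
the chunk size are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], per_chunk: int) -> Iterator[List[str]]:
    """Group records into lists of at most per_chunk records."""
    if per_chunk < 1:
        raise ValueError("per_chunk must be at least 1")
    chunk: List[str] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    demo = ">q1\nACGT\n>q2\nGGCC\nTTAA\n>q3\nAAAA\n".splitlines(keepends=True)
    for i, chunk in enumerate(chunk_records(read_fasta_records(demo), 2), start=1):
        print(f"chunk {i}: {len(chunk)} record(s)")
```

Each chunk would then be written to its own file and handed to one 
BLAST job.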

At that time, the only scheduler we tried that worked with this was 
LSF.  Everything else fell over and started convulsing.  With LSF, the 
system was unresponsive during job submission, which took about 2 
minutes to light off.  After that, the scheduler node was sluggish, but 
still usable.

> 500K job submissions would put a non-trivial load on just about any 
> scheduler, especially a few years ago. The act of actually submitting 
> the 500K jobs can be a pain  from a usage perspective. Not to mention 

CT-BLAST hid the pain from users, but your point is important.  Without 
array jobs, you need a real automated way to submit jobs to, and delete 
jobs from, the queue.  Deleting a 10 job run?  Pretty easy.  Deleting a 
100k job run?  Not so easy.

The PBS users I spoke with when we had CT-BLAST up and going told me I 
was abusing the scheduler.  Odd, I thought.  With 100 machines, I ought 
to have a scheduler that can handle a throughput workflow.

> that the user/system now has 500K individual jobIDs to track. With an 
> array task I get a single jobID that I can use to track status of all 
> the sub tasks and I can kill the job with a single bkill/qdel command. 
> It's also a single bsub/qsub submission command to get the ball rolling.

Array jobs are somewhat of a misnomer; they are really one master job 
with many scheduled sub-jobs which are handled at the scheduler level. 
Which is where we need to schedule them.
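
From the job's side, each sub-job only needs to know its own index.  A 
minimal sketch, assuming the array was submitted with something like 
`qsub -t 1-N` (Grid Engine) or `bsub -J "name[1-N]"` (LSF), which export 
SGE_TASK_ID and LSB_JOBINDEX respectively.  The chunk file naming is 
hypothetical.

```python
import os
from typing import Mapping, Optional


def task_index(env: Mapping[str, str]) -> Optional[int]:
    """Return the 1-based array task index, or None outside an array job."""
    for name in ("SGE_TASK_ID", "LSB_JOBINDEX"):
        value = env.get(name, "")
        # Anything that is not a positive integer (for example "undefined"
        # or "0") is treated here as "not an array task".
        if value.isdigit() and int(value) > 0:
            return int(value)
    return None


def chunk_path(index: int, prefix: str = "chunk") -> str:
    """Hypothetical naming scheme: chunk.000001.fa, chunk.000002.fa, ..."""
    return f"{prefix}.{index:06d}.fa"


if __name__ == "__main__":
    index = task_index(os.environ)
    if index is None:
        print("not running as an array task")
    else:
        print(f"task {index} would process {chunk_path(index)}")
```

One submission and one jobID then cover all N chunks, which is the 
usability win Chris describes.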

>  From a user, usability and scheduler efficiency perspective, array jobs 
> are a massive win for large sequential workflows, especially those that 

and large parallel/throughput workflows, where we exploit embarrassingly 
parallel aspects of computing with few/no per job dependencies.

> consist of running the same application over and over again with only 
> minor differences in command line arguments or input files.
> Array tasks may be distasteful from a technical or elegance perspective 
> but they are a big usability and throughput win in the real world, 
> especially for end users interested in productivity.


>>> - Policy and resource allocation features are very important to 
>>> people deploying these systems
>> so I'm curious what that means.  things like "dept A needs to be 
>> guaranteed
>> N cpus, but dept B gets to use whatever is left over"?  or node choice 
>> based on amount of free disk?  I don't really see why these sorts of 
>> issues
>> would be less important to more parallel environments.
> Resource allocation policies and the tools to implement such are 
> extremely important and are often a significant part of the selection 
> criteria when trying to figure out what distributed resource manager to 
> use. Way more important than anything involving parallel environments 
> simply due to the fact that there are relatively few MPI-aware 
> applications in our field.
> FIFO scheduling or rewarding the dude who got to work earliest and 
> submitted 500K jobs first is not the answer. People needed to be able to 
> let scientific or business priorities drive and influence how cluster 
> resources are allocated among competing users, projects and departments. 
> For some people it may be as simple as carving up the cluster on a 
> percentage basis among 4 departments and for others the key criteria  
> may be the ease of integration with an external flexLM license server.
> The majority may just want simple fairshare-by-user scheduling behavior 
> without having to drop in some external metascheduler or third party 
> product.

Heh ... I make the case for being successful in implementing policy when 
every user is equally pissed off that they don't have more of the 
machine :)  We had one group where one user was getting 90+ % of the 
machine for months running high throughput calculations, then along came 
another user who ran long-running parallel (single node) jobs. 
Balancing this was hard.  Getting pre-emption going (at process level) 
would have been great, but they didn't want to spend more money.  And 
the pre-emption capability costs nearly an additional node, so the 
likelihood of them spending that money on software to manage capacity 
versus hardware to increase capacity was about nil.
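
The percentage carve-up Chris mentions can be sketched as a toy 
share-deficit rule: run the next job from whichever group is furthest 
below its entitled fraction of usage so far.  This is only an 
illustration, not how LSF or SGE compute fairshare, and the group names 
and numbers are made up.

```python
from typing import Dict


def next_group(shares: Dict[str, float], usage: Dict[str, float]) -> str:
    """Pick the group furthest below its entitled fraction of usage."""
    total_share = sum(shares.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive number")
    total_usage = sum(usage.get(group, 0.0) for group in shares)

    def deficit(group: str) -> float:
        entitled = shares[group] / total_share
        used = usage.get(group, 0.0) / total_usage if total_usage else 0.0
        return entitled - used

    # sorted() makes ties resolve deterministically by group name.
    return max(sorted(shares), key=deficit)


if __name__ == "__main__":
    shares = {"deptA": 50, "deptB": 25, "deptC": 25}   # percent of cluster
    usage = {"deptA": 90, "deptB": 10, "deptC": 0}     # CPU-hours so far
    print("next job goes to", next_group(shares, usage))
```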

> The quality and capability of the knobs for adjusting these sorts of 
> behavior is important in commercial environments and in places where the 
> cluster has been sold as a shared resource for groups that may have 
> competing needs for resources.
> Platform LSF is excellent at this sort of thing and among the freely 
> available offerings Grid Engine had good flexibility and capability out 
> of the box without requiring additional plugin products. Just another 
> reason why there was SGE uptake in our field over the years. Now, since 
> SGE 6.1 with the addition of the resource quota framework SGE is quite 
> powerful in this regard.

I agree the features are getting better.  What I am finding driving many 
of the wins here is the "free" as in "zero acquisition cost" aspect. 
People don't mind paying reasonable support costs.  They just want to 
avoid the acquisition costs if possible.

>>> - Storage speed is often more important than network speed or latency 
>>> in many cases
>> which makes me wonder: do bio types consider using map-reduce-like
>> frameworks?  that is, basically distributing the work to the data.

As data sets get larger, you are going to *have* to do this.  Data 
motion gets very hard on large machines.  We wrote some tools a while 
back to help with this, but at the end of the day, moving 1+TB from 100 
machines around your network is not going to make anyone very happy.

(not a commercial) JackRabbit is in part designed to help solve this 
problem ... or at least move it back until a better solution emerges. 
Start out with huge pipes to disk and huge pipes to the rest of the 
cluster.  Data motion still takes a while, but since we can pump out more 
than 1 GB/s, moving 1 TB takes about 1000 s, so a lot of data motion is 
not that hard (for a while, until we start moving 10 TB around per 
machine).
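
The arithmetic behind that estimate, as a small sketch.  It assumes 
decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and a sustained 
rate with no protocol or filesystem overhead, so real transfers will be 
slower.

```python
TB = 10**12  # bytes, decimal units assumed
GB = 10**9   # bytes, decimal units assumed


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Idealized transfer time: size divided by sustained rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


if __name__ == "__main__":
    for size_tb in (1, 10):
        secs = transfer_seconds(size_tb * TB, 1 * GB)
        print(f"{size_tb} TB at 1 GB/s: {secs:.0f} s ({secs / 3600:.2f} h)")
```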

> map-reduce gets added to the same bin as hardware based FPGA 
> acceleration, GPU computing and other newish techniques. Modern 
> algorithms and new efforts by people with real scientific software 
> development and HPC skills are all looking at these techniques and 
> you'll see slow uptake over time.

We are seeing (outside of Bio-IT) much more interest in all of these. 
In Bio-IT, we are seeing rapid uptake on the GPU side.  Less so on the 
FPGA side, but this is due to the difficulties in programming and costs 
associated with this direction.  We are seeing fast uptake on the 
cluster-ized apps side ... people have realized that fast single 
machines may not be able to crunch all their data.  This is also 
curiously what is driving the GPU side, as they want a fast single 
machine to crunch their data.

> Real progress is being made, see Joe's efforts regarding HMMER running 
> on GPUs these days etc.
> This does not quite address the older legacy codes though. You have to 
> remember that our core applications were written in the early 90s by 
> biologists who had to teach themselves to code simply to get their 
> science done. Few if any people had real skills in HPC software 
> development or high efficiency coding.  These are the people (like 
> myself) who started using Perl on large memory 64bit systems simply 
> because perl was loose enough to let us do dumb things like read a full 
> genome into a string and run regex operations on it.
> If you approached a biologist and said "I re-wrote your blast 
> application to use map-reduce!", most would turn around and ask you for 
> the citation of your peer reviewed paper where you published and proved 
> that your map-reduce version produces identical results and output 
> (including reproducing known bugs) to the old inefficient code that it 
> was meant to replace.

Hmmm.... true 10 years ago (which was part of the reason I wrote 
CT-BLAST, the md5 hashes on our output versus NCBI BLAST output were 
identical, modulo hash collisions).  End users don't mind drop in 
replacements as long as it is the same basic algorithm being used.
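
That kind of check is easy to reproduce.  A minimal sketch, not the 
script we used at the time: hash both output files and compare the 
digests.  The function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def outputs_match(reference_path: str, candidate_path: str) -> bool:
    """True if both files hash to the same MD5 digest."""
    return md5_of_file(reference_path) == md5_of_file(candidate_path)
```

Any byte-level difference (ordering, whitespace, a timestamp in a 
header) changes the digest, so this is a strict test of "identical 
results and output".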

The problem is that some of the past acceleration schemes relied upon a 
pre-filtration or other step which couldn't reproduce the results 
exactly.  This didn't burn the researchers, but did make them skeptical. 
Pure biologists want the same algorithm; the hardware differences 
largely don't matter (as long as it is better, cheaper, faster).

So mpiHMMer and GPU-HMMer both use the identical algorithm, just 
modified to run on an MPI and a GPU computing substrate respectively.

> There is a huge resistance to improved/updated codes simply due to the 
> fact that the scientists want to use the exact method cited in the paper 
> that they are trying to reproduce. It's been a hassle to deal with but 
> the block is real - just ask all of the FPGA hardware acceleration box 
> makers out there (those that still exist).

Hmmm... Every customer we spoke to over the last 5 years said "faster is 
better, but we don't want a single purpose box, or one we can't 
reprogram".  They all expressed interest in a programmable accelerator 
platform that they could write code for in C/C++/... .  We are not 
seeing resistance to GPUs using CUDA (which is C with extensions), but 
we have seen resistance to using non-open languages/platforms.  End 
users want their colleagues to be able to reproduce the results, which 
means they want the right to redistribute their software.  Which they 
can do with CUDA.  Not so much with other techniques.

The big issues with FPGAs are a) the pain of programming them 
(non-trivial), b) bit files are *not* portable (I can't get an FPGA 
board from Nallatech and run the bitfile from another FPGA board vendor 
on it), and c) the cost of boards/tools versus expected speedup.  The 
last two are what effectively killed FPGAs.  As late as 2 years ago, we 
were being pitched $70k+ "compiler" tools, with $15k boards.  Best 
speedups we could hope for were 20-30x on various apps.  Compare that 
with GPUs, where for ~$1500 entry fee plus a little free C work, we can 
get 10-100x.  I know some groups with good ideas, and am waiting for 
their tools to get to a point we can play with, but we still need the 
board prices to approach reasonable levels, *AND* bitfiles to become 
portable.  The last aspect isn't likely to happen ever though.
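
A rough way to compare those numbers is dollars of entry cost per unit 
of speedup.  The sketch below uses only the figures quoted above and 
treats "$70k+" as exactly $70k, so the FPGA figure is a lower bound.  It 
ignores programmer time, which is not free on either platform.

```python
def dollars_per_speedup(cost_dollars: float, speedup: float) -> float:
    """Entry cost divided by the speedup factor it buys."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return cost_dollars / speedup


# Figures from the text: tools plus one board for FPGA, entry fee for GPU.
FPGA_COST = 70_000 + 15_000
GPU_COST = 1_500

if __name__ == "__main__":
    for label, cost, (low, high) in (
        ("FPGA", FPGA_COST, (20, 30)),
        ("GPU", GPU_COST, (10, 100)),
    ):
        worst = dollars_per_speedup(cost, low)
        best = dollars_per_speedup(cost, high)
        print(f"{label}: ${best:,.0f} to ${worst:,.0f} per 1x of speedup")
```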

We aren't seeing resistance to acceleration; we are seeing resistance to 
single computing task systems.

FWIW: we (and I know others) are spec'ing/quoting/building GPU clusters. 
They seem to be of significant interest.  We are speaking with folks 
buying them, and folks who want to buy them.  Bio-IT is well represented 
in these groups.


ps:  driving down to Columbus OH today for a GPU seminar at OSC 
tomorrow.  Will have a 3-GPU unit with me (the same unit JP did the 
GPU-HMMer tests on, as it turns out)

> -Chris

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
