[Beowulf] Digital Image Processing via HPC/Cluster/Beowulf - Basics

Mark Hahn hahn at mcmaster.ca
Sat Nov 3 16:42:29 PDT 2012

> Thanks, informative :p
> I'll consider your advice.
> If i read correctly, it seems the answer to the question about programming
> was: yes, a program must be written to accommodate a cluster. Did i get you
> right?

it depends what you mean.  if you have a program which is written
so that it can be run from a script, then a cluster can immediately
let you run lots of them.  if you're expecting a cluster to speed
up a single instance, then you'll probably be disappointed.

in short, clustering doesn't speed up any of the computers in the cluster.
it just makes it more convenient to get multiple computers working.
if you want multiple computers to work on the same program, then 
someone has to make it happen: divide up the work so each computer
gets a share, and put together the results.

suppose you're trying to detect a particular face in all your images.
you could have one machine searching an image, then going on to the next.
basically, that one node is running a simple scheduler that runs jobs:
 	lookforface face.png image0.png
 	lookforface face.png image1.png
 	lookforface face.png image2.png

if you want, you can divide up the work - send every other image to a 
second machine.  in general, this would mean that a scheduler reads 
from that same list and dispatches one line (job) at a time to any
node that isn't already busy.  when a job completes, that node gets 
another job, and eventually all the work is done.
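
as a rough sketch of that dispatch loop (not how any real scheduler is
implemented): each "node" below is just a thread, and run_job is a
hypothetical callable standing in for whatever actually starts a job on
a machine (ssh, for example).  the only point is that idle nodes pull the
next line from one shared list.

```python
import queue
import threading

def dispatch(jobs, nodes, run_job):
    """Hand out jobs one at a time to whichever node is idle.

    jobs:    list of command lines (strings, assumed unique)
    nodes:   list of node names (hypothetical; each is simulated by a thread)
    run_job: callable(node, job) that runs one job and returns when it ends
    Returns a dict mapping each job to the node that ran it.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    assigned = {}
    lock = threading.Lock()

    def node_loop(node):
        while True:
            try:
                job = pending.get_nowait()   # next line of the job list
            except queue.Empty:
                return                       # list exhausted: this node is done
            run_job(node, job)
            with lock:
                assigned[job] = node

    threads = [threading.Thread(target=node_loop, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assigned


jobs = ["lookforface face.png image%d.png" % i for i in range(6)]
# the lambda does nothing; a real run_job would launch the command remotely
done = dispatch(jobs, ["node0", "node1"], lambda node, job: None)
```

which node ends up with which job is not predictable, and that is the
point: nobody has to decide in advance.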

"embarrassingly parallel" just means you have enough images to keep all your
machines busy this way.

if you don't have that many images, you might want to try to get more than
one machine working on the same image.  a simple way to do that would be 
to (imaginarily) divide each image into, say, quadrants, so 4 machines can
work on the same image (each getting a quarter of the image - with some
overlap so targets along the border don't get missed.)  to be specific,
your list of jobs could be like this:
 	lookforface face.png image0.png 0
 	lookforface face.png image0.png 1
 	lookforface face.png image0.png 2
 	lookforface face.png image0.png 3
 	lookforface face.png image1.png 0
 	lookforface face.png image1.png 1
where 'lookforface' only looks for the face in the specified quadrant of 
the input image.  the most obvious problem with this approach is that a
1-quadrant search may take too little time relative to the overhead of
setting up each job, which includes accessing face.png and image0.png
even if only a quadrant of the latter is used.  in general, this kind of
issue is called "load balance", and is really the single most fundamental 
issue in HPC.
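
to make the quadrant-with-overlap idea concrete, here is a minimal sketch
of how the pixel region for each quadrant number might be computed.  the
numbering (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right)
and both function names are my assumptions; 'lookforface' itself remains
imaginary.

```python
def quadrant_box(width, height, quadrant, overlap):
    """Return (left, top, right, bottom) for quadrant 0..3 of an image.

    Quadrants are numbered row by row (an assumed convention):
    0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
    Each box is grown by `overlap` pixels and clipped to the image.
    """
    if quadrant not in (0, 1, 2, 3):
        raise ValueError("quadrant must be 0, 1, 2 or 3")
    col, row = quadrant % 2, quadrant // 2
    half_w, half_h = width // 2, height // 2
    left = col * half_w
    top = row * half_h
    right = width if col == 1 else half_w
    bottom = height if row == 1 else half_h
    return (max(0, left - overlap), max(0, top - overlap),
            min(width, right + overlap), min(height, bottom + overlap))


def quadrant_jobs(images):
    """Build the job list: one command line per (image, quadrant) pair."""
    return ["lookforface face.png %s %d" % (img, q)
            for img in images for q in range(4)]


print(quadrant_box(100, 80, 0, 10))   # (0, 0, 60, 50)
print(quadrant_box(100, 80, 3, 10))   # (40, 30, 100, 80)
```

since every quadrant is grown by the same margin on both sides of a
border, an overlap of at least half the target's width (and height) is
enough for a target straddling the border to fall entirely inside one of
the neighbouring quadrants.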

if you wanted to pursue this direction, you could optimize by reducing
the cost of distributing the images.  if image0.png is quite large,
then access through a shared filesystem might be efficient (if the FS 
block size is comparable to 1/2 the width of one image row.)  if image0.png
is smaller, then you could distribute that information "manually" by running
a job which reads the image on one node and distributes quadrants to 
other nodes.  the obvious way to do this would be via MPI, which is 
pretty friendly to matrices like decompressed images.  this could even
operate on pieces smaller than a quadrant - in fact, you could divide the 
work however finely you like.  though as before, divide it too fine, and the
per-chunk overhead dominates your cost, destroying efficiency.
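
a back-of-envelope model shows the effect.  the assumptions are mine and
deliberately crude: the work divides into equal chunks, every chunk pays
the same fixed overhead, and chunks run in rounds of n_workers.

```python
import math

def parallel_time(total_work, n_chunks, n_workers, overhead):
    """Estimated wall time (seconds) when total_work seconds of compute is
    cut into n_chunks equal pieces, each costing `overhead` extra seconds
    to set up, and the pieces run in rounds of n_workers."""
    per_chunk = total_work / n_chunks + overhead
    rounds = math.ceil(n_chunks / n_workers)
    return rounds * per_chunk


def efficiency(total_work, n_chunks, n_workers, overhead):
    """Fraction of the workers' time spent on useful work (1.0 is ideal)."""
    t = parallel_time(total_work, n_chunks, n_workers, overhead)
    return total_work / (n_workers * t)


# 8 s of real work, 4 workers, 0.5 s of setup per chunk
print(efficiency(8.0, 4, 4, 0.5))    # one chunk per worker: 0.8
print(efficiency(8.0, 64, 4, 0.5))   # 64 tiny chunks: 0.2
```

with 64 chunks each one does 0.125 s of work for 0.5 s of setup, so the
cluster spends most of its time on overhead.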

note that this refinement has merely changed who divides the work and
how the data is communicated.  in the simple case, work was divided
at the command/job/scheduler level and data transmitted by file.  the more 
fine-grained approach has subsumed some scheduling into your program, and
is communicating the data explicitly over MPI.

basically, someone has to divide up work, and data has to flow to where it's
used.  you could take this further: a single MPI program that runs on all 
nodes of the cluster at once and distributes work among MPI ranks.  this
would be the most programming effort, but would quite possibly be the most 
efficient.  often, the amount of time needed to perform one unit of work
is not constant - this can cause problems if your division of labor is too
rigid.  (consider the MPI-searches-4-quadrants approach: if one quadrant 
takes very little time, then the CPU associated with that quadrant will be 
twiddling its thumbs while the other quadrants get done.)
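
a small simulation illustrates the difference between a rigid split and
handing out work on demand.  the per-tile costs are made up for the
example, and the two functions only model scheduling; no images are
searched.

```python
import heapq

def static_makespan(costs, n_workers):
    """Each worker gets a fixed, contiguous share of the units up front.
    Assumes a non-empty list of per-unit costs (seconds)."""
    share = -(-len(costs) // n_workers)          # ceiling division
    return max(sum(costs[i:i + share]) for i in range(0, len(costs), share))


def dynamic_makespan(costs, n_workers):
    """Units are handed out one at a time to whichever worker frees up first."""
    finish = [0.0] * n_workers                   # when each worker becomes idle
    heapq.heapify(finish)
    for c in costs:
        t = heapq.heappop(finish)                # earliest idle worker
        heapq.heappush(finish, t + c)
    return max(finish)


# 16 tiles, 4 per quadrant; the last quadrant is much harder than the rest
costs = [1.0] * 12 + [5.0] * 4
print(static_makespan(costs, 4))    # 20.0: three workers idle after 4 s
print(dynamic_makespan(costs, 4))   # 8.0: the hard tiles get spread out
```

the total work is 32 s in both cases; only the division of labor differs.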

I have, of course, completely fabricated this whole workflow.  it becomes 
more interesting when the work has other dimensions - for instance, if you
are searching 1M images for any of 1k faces.  or if you are really hot to 
use a convolution approach so will be fourier-transforming all the images
before performing any matching.  or if you want to use GPUs, etc.

TL;DR it's a good thing I type fast ;)

in any case, your first step should be to look at the time taken to get
inputs to a node, and then how long it takes to do the computation.
life is easy if setup is fast and compute is long.  that stuff is far more
important than choosing a particular scheduler or cluster package.
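
one hedged way to turn those two measurements into an estimate: time the
fetch and the compute separately, then see which resource runs out first.
the model below is an assumption of mine (one image at a time per node,
and a fileserver with a fixed delivery rate); the function names are
hypothetical.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def images_per_second(n_nodes, transfer_s, compute_s, server_images_per_s):
    """Rough upper bound on cluster throughput.

    Each node handles one image at a time (transfer, then compute), and
    the fileserver can deliver at most server_images_per_s in total.
    """
    node_rate = n_nodes / (transfer_s + compute_s)
    return min(node_rate, server_images_per_s)


# 0.25 s to fetch an image, 0.75 s to process it, server good for 8 images/s
print(images_per_second(4, 0.25, 0.75, 8.0))    # 4.0: nodes are the limit
print(images_per_second(16, 0.25, 0.75, 8.0))   # 8.0: fileserver is the limit
```

past the point where the fileserver saturates, adding nodes buys nothing.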

regards, mark hahn.

> On 2012-11-4 at 6:11, "Mark Hahn" <hahn at mcmaster.ca> wrote:
>>> I am currently researching the feasibility and process of establishing a
>>> relatively small HPC cluster to speed up the processing of large amounts
>>> of digital images.
>> do you mean that smallness is a goal?  or that you don't have a large
>> budget?
>>> After looking at a few HPC computing software solutions listed on the
>>> Wikipedia comparison of cluster software page
>>> (http://en.wikipedia.org/wiki/Comparison_of_cluster_software) I still have
>>> only a rough understanding of how the whole system works.
>> there are several discrete functionalities:
>> - shared filesystem (if any)
>> - scheduling
>> - intra-job communication (if any; eg MPI)
>> - management/provisioning/monitoring of nodes
>> IMO, anyone who claims to have "best practices" in this field is lying.
>> there are particular components that have certain strengths, but none of
>> them are great, and none universally appropriate.  (it's also common
>> to conflate or "integrate" the second and fourth items - for that matter,
>> monitoring is often separated from provisioning.)
>>> 1. Do programs you wish to use via HPC platforms need to be written to
>>> support HPC, and further, to support specific middleware using parallel
>>> programming or something like that?
>> "middleware" is generally a term from the enterprise computing environment.
>> it basically means "get someone else to take responsibility for hard bits",
>> and is a form of the classic commercial best practice of CYA.  from an HPC
>> perspective, there's the application and everything else.  if you really
>> want, you can call the latter "middleware", but doing so is uninformative.
>> HPC covers a lot of ground.  usually, people mean jobs will execute in a
>> batch environment (started from a commandline/script).  OTOH HPC sometimes
>> means what you might call "personal supercomputing", where an interactive
>> application runs in a usually-dedicated cluster (shared clusters tend to
>> have scheduling response times that make interactive use problematic.)
>> (shared clusters also give rise to the single most important value of
>> clusters: that they can interleave bursty demand.  if everyone in your
>> department shares a cluster, it can be larger than any one group can
>> afford, and therefore all groups will be able to burst to higher capacity.
>> this is why large, shared clusters are so successful.  and, for that
>> matter, why cloud services are successful.)
>> you can do HPC with very little overhead.  you will generally want a shared
>> filesystem - potentially just a NAS box or existing server.  you may not
>> bother with scheduling at all - let users pick which machine to run on,
>> for instance.  that sounds crazy, but if you're the only one using it, why
>> bother with a scheduler?  HPC can also be done without intra-job
>> communication - if your jobs are single-node serial or threaded, for
>> instance.  and you may not need any sort of management/provisioning,
>> depending on the stability of your nodes, environment, expected lifetime,
>> etc.
>> in short, slap linux onto a few boxes, set up ssh keys or hostbased
>> trust, have one or more of them NFS out some space, and you're cooking.
>>> OR
>>> Can you run any program on top of the HPC cluster and have its workload
>>> effectively distributed? --> How can this be done?
>> this is a common newbie question.  a naive program (probably serial or
>> perhaps multithreaded) will see no benefit from a cluster.  clusters are
>> just plain old machines.  the benefit comes if you want throughput (jobs
>> per time) or
>> specifically program for distributed computation (classically with MPI).
>> it's common to use infiniband to accelerate this kind of job (as well as
>> provide the fastest possible IO.)
>>> 2. For something like digital image processing, where a huge amount of
>>> relatively large images (14MB each) are being processed, will network
>> the main question is how much work a node will be doing per image.
>> suppose you had an infinitely fast fileserver and gigabit connected nodes:
>> transferring a 14MB image would take roughly 110-120ms (14MB is 112Mbit,
>> at about 1Gbit/s), so you would ideally spend
>> about the same amount of time processing an image.  but in this case, you
>> should probably ask whether you can simply store images on the nodes in the
>> first place.  if you haven't thought about where the inputs are and how
>> fast they can be gotten, then that will probably be your bottleneck.
>>> speed, or processing power be more of a limiting factor? Or would a
>>> gigabit network suffice?
>> how long does a prospective node take to complete one work unit,
>> and how long does it take to transfer the files for one?
>> your speedup will be limited by whatever resource saturates first
>> (possibly your fileserver.)
>>> 3. For a relatively easy HPC platform what would you recommend?
>> they are all crap.  you should try not to spend on crap you don't need,
>> but ultimately it depends on how much expertise you have and/or how much
>> you value your time.  any idiot can build a cluster from scratch using
>> fundamental open-source components, eventually.  but if said idiot has to
>> learn filesystems, scheduling, provisioning, etc from scratch, it could
>> take quite a while.  when you buy, you are buying crap, but it's crap
>> that may save you some time.
>> don't count on commercial support being more than crappy.
>> you should probably consider using a cloud service - this is just
>> commercial outsourcing - more crap, but perhaps of value if, for instance,
>> you don't want to get your hands dirty hosting machines (amazon), etc.
>> anything commercial in this space tends to be expensive.  the license to
>> cover a crappy scheduler for a few hundred nodes, for instance, will be
>> pretty close to an FTE-year.  renting a node from a cloud provider for a
>> year costs about as much as buying a new node each year, etc.
>>> Again, I hope this is an ok place to ask such a question, if not please
>> this is the place.  though there are some fringe sects of HPC who tend to
>> subsist on more and/or different crap (such as clusters running windows.)
>> beowulf tends towards the low-crap end of things (linux, open packages.)
>> regards, mark hahn.

operator may differ from spokesperson.	            hahn at mcmaster.ca
