[Beowulf] Digital Image Processing via HPC/Cluster/Beowulf - Basics
hahn at mcmaster.ca
Sat Nov 3 15:10:51 PDT 2012
> I am currently researching the feasibility and process of establishing a
> relatively small HPC cluster to speed up the processing of large amounts of
> digital images.
do you mean that smallness is a goal? or that you don't have a large budget?
> After looking at a few HPC computing software solutions listed on the
> Wikipedia comparison of cluster software page (
> http://en.wikipedia.org/wiki/Comparison_of_cluster_software ) I still have
> only a rough understanding of how the whole system works.
there are several discrete functionalities:
- shared filesystem (if any)
- scheduling (if any)
- intra-job communication (if any; eg MPI)
- management/provisioning/monitoring of nodes
IMO, anyone who claims to have "best practices" in this field is lying.
there are particular components that have certain strengths, but none
of them are great, and none universally appropriate. (it's also common
to conflate or "integrate" the second and fourth items - for that matter,
monitoring is often separated from provisioning.)
> 1. Do programs you wish to use via HPC platforms need to be written to
> support HPC, and further, to support specific middleware using parallel
> programming or something like that?
"middleware" is generally a term from the enterprise computing environment.
it basically means "get someone else to take responsibility for hard bits",
and is a form of the classic commercial best practice of CYA. from an HPC
perspective, there's the application and everything else. if you really
want, you can call the latter "middleware", but doing so is uninformative.
HPC covers a lot of ground. usually, people mean jobs will execute in a
batch environment (started from a commandline/script). OTOH HPC sometimes
means what you might call "personal supercomputing", where an interactive
application runs in a usually-dedicated cluster (shared clusters tend to
have scheduling response times that make interactive use problematic.)
(shared clusters also give rise to the single most important value of
clusters: that they can interleave bursty demand. if everyone in your
department shares a cluster, it can be larger than any one group can
afford, and therefore all groups will be able to burst to higher capacity.
this is why large, shared clusters are so successful. and, for that matter,
why cloud services are successful.)
you can do HPC with very little overhead. you will generally want a shared
filesystem - potentially just a NAS box or existing server. you may not
bother with scheduling at all - let users pick which machine to run on,
for instance. that sounds crazy, but if you're the only one using it, why
bother with a scheduler? HPC can also be done without intra-job
communication - if your jobs are single-node serial or threaded, for
instance. and you may not need any sort of management/provisioning,
depending on the stability of your nodes, environment, expected lifetime, etc.
in short: slap linux onto a few boxes, set up ssh keys or hostbased
trust, have one or more of them NFS out some space, and you're cooking.
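As a rough illustration of the scheduler-less setup described above, here is a minimal Python sketch that deals a list of image paths out to a few nodes and builds one ssh command per node. The host names, the `/shared` NFS mount and the `process_image` program are all hypothetical placeholders, and the sketch assumes passwordless ssh already works between the machines.

```python
import shlex
import subprocess

# Hypothetical names: substitute your own hosts, NFS mount and program.
HOSTS = ["node01", "node02", "node03"]
WORKER = "/shared/bin/process_image"  # assumed to live on the NFS export


def assign_round_robin(paths, hosts):
    """Deal paths out like cards: path i goes to hosts[i % len(hosts)]."""
    plan = {host: [] for host in hosts}
    for i, path in enumerate(paths):
        plan[hosts[i % len(hosts)]].append(path)
    return plan


def build_ssh_commands(plan, worker=WORKER):
    """Return one ssh argv per host that has work.

    The remote command line is shell-quoted so paths with spaces survive.
    """
    commands = []
    for host, paths in plan.items():
        if paths:
            remote = shlex.join([worker, *paths])
            commands.append(["ssh", host, remote])
    return commands


def run_all(commands):
    """Start every ssh command at once, then wait; returns the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

Calling `run_all(build_ssh_commands(assign_round_robin(paths, HOSTS)))` would start one remote process per node. Round-robin is only sensible if the images cost roughly the same to process; it does no load balancing and no retry on failure.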
> Can you run any program on top of the HPC cluster and have its workload
> effectively distributed? --> How can this be done?
this is a common newbie question. a naive program (probably serial or perhaps
multithreaded) will see no benefit from a cluster. clusters are just plain
old machines. the benefit comes if you want throughput (jobs per time) or
specifically program for distributed computation (classically with MPI).
it's common to use infiniband to accelerate this kind of job (as well as
provide the fastest possible IO.)
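The throughput route can be sketched without touching the serial program: each job runs the same code on its own share of the inputs and never talks to the others. The Python below is a minimal sketch of that idea; `process` is a hypothetical stand-in for the real per-image work, and the task-index convention is an assumption, not something any particular scheduler requires. The MPI route is not shown here.

```python
def my_share(items, task_id, num_tasks):
    """Return every num_tasks-th item starting at task_id.

    The shares for task_id = 0 .. num_tasks-1 are disjoint and together
    cover all items, so no coordination between jobs is needed.
    """
    if not 0 <= task_id < num_tasks:
        raise ValueError("task_id must be in range(num_tasks)")
    return items[task_id::num_tasks]


def process(path):
    """Placeholder for the unchanged serial per-image work (hypothetical)."""
    return f"processed {path}"


def run_task(items, task_id, num_tasks):
    """What one independent job does: process only its own share."""
    return [process(path) for path in my_share(items, task_id, num_tasks)]
```

Starting `num_tasks` copies of this, each with a different `task_id`, gives more images per hour but does not make any single image finish sooner.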
> 2. For something like digital image processing, where a huge amount of
> relatively large images (14MB each) are being processed, will network
the main question is how much work a node will be doing per image.
suppose you had an infinitely fast fileserver and gigabit connected nodes:
transferring the image would take over 100ms (14MB at ~110MB/s), so you would ideally spend about
the same amount of time processing an image. but in this case, you should
probably ask whether you can simply store images on the nodes in the first
place. if you haven't thought about where the inputs are and how fast they
can be gotten, then that will probably be your bottleneck.
> speed, or processing power be more of a limiting factor? Or would a gigabit
> network suffice?
how long does a prospective node take to complete one work unit,
and how long does it take to transfer the files for one?
your speedup will be limited by whatever resource saturates first
(possibly your fileserver.)
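The two questions above can be turned into a back-of-envelope model. The Python sketch below is illustrative only and rests on simplifying assumptions: links run at their ideal wire rate with no protocol overhead, each node overlaps the transfer of the next image with computation on the current one, and only input traffic is counted (no results written back). All parameter names are made up for this example.

```python
def transfer_seconds(size_bytes, link_bits_per_s):
    """Ideal time to move one file over a link (no protocol overhead)."""
    return size_bytes * 8 / link_bits_per_s


def images_per_second(n_nodes, size_bytes, compute_s, node_link_bps, server_bps):
    """Upper bound on aggregate throughput in images per second.

    A node is limited by the slower of computing and fetching one image
    (assuming the two overlap); the whole cluster is then limited by
    whichever saturates first: the nodes together, or the fileserver link.
    """
    per_node = 1.0 / max(compute_s, transfer_seconds(size_bytes, node_link_bps))
    nodes_bound = n_nodes * per_node
    server_bound = server_bps / (size_bytes * 8)
    return min(nodes_bound, server_bound)


# Example: 14 MB images, 1 s of compute each, gigabit everywhere.
few = images_per_second(4, 14e6, 1.0, 1e9, 1e9)    # limited by the nodes
many = images_per_second(16, 14e6, 1.0, 1e9, 1e9)  # limited by the fileserver
```

Under these assumptions, adding nodes helps only until the fileserver bound is reached; past that point the extra nodes sit idle waiting for data.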
> 3. For a relatively easy HPC platform what would you recommend?
they are all crap. you should try not to spend on crap you don't need,
but ultimately it depends on how much expertise you have and/or how much
you value your time. any idiot can build a cluster from scratch using
fundamental open-source components, eventually. but if said idiot has to
learn filesystems, scheduling, provisioning, etc from scratch, it could
take quite a while. when you buy, you are buying crap, but it's crap
that may save you some time.
don't count on commercial support being more than crappy.
you should probably consider using a cloud service - this is just commercial
outsourcing - more crap, but perhaps of value if, for instance, you don't
want to get your hands dirty hosting machines (amazon), etc.
anything commercial in this space tends to be expensive. the license to
cover a crappy scheduler for a few hundred nodes, for instance, will be pretty
close to an FTE-year. renting a node from a cloud provider for a year costs
about as much as buying a new node each year, etc.
> Again, I hope this is an ok place to ask such a question, if not please
this is the place. though there are some fringe sects of HPC who tend to
subsist on more and/or different crap (such as clusters running windows.)
beowulf tends towards the low-crap end of things (linux, open packages.)
regards, mark hahn.