[Beowulf] transcode Similar Video Processing on Beowulf?
james at talkunafraid.co.uk
Thu Apr 17 08:53:33 PDT 2014
On 16/04/14 17:54, Mark Hahn wrote:
> I'm trying to understand this from a perspective of conventional
>> cop-out but we're not keen to reinvent the wheel. It provides
>> statekeeping and job queues in one package; replacing it wouldn't
> "statekeeping" is just tracking queued/running/done jobs, right?
That and metadata around jobs - how many times a task has been
retried, which machine ran what and where, failure logs and traces,
etc. It also gives you some guarantees about message delivery and
receipt, timing, etc, which removes the need for an external process
to handle that (eg job timeouts - if I expect a task to be done in 5
hours I can say so, and after that time it will issue a failure event
so the workflow can decide what to do next, eg try again).
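For the curious, that decision logic amounts to something like the
following (a simplified sketch with made-up event shapes and decision
names - not the real SWF API and not our actual code):

```python
# Hypothetical decider sketch: retry on timeout/failure, give up after
# MAX_RETRIES, complete the workflow once the activity succeeds.
TIMEOUT_HOURS = 5   # the "I expect this done in 5 hours" declaration
MAX_RETRIES = 3     # illustrative retry budget

def decide(events):
    """Inspect the workflow's event history and return the next decision."""
    failures = [e for e in events
                if e["type"] in ("ActivityTimedOut", "ActivityFailed")]
    completed = [e for e in events if e["type"] == "ActivityCompleted"]
    if completed:
        # The activity finished; its output rides along on the event.
        return {"decision": "CompleteWorkflow",
                "result": completed[-1]["output"]}
    if len(failures) >= MAX_RETRIES:
        return {"decision": "FailWorkflow", "reason": "retries exhausted"}
    # Either nothing has been scheduled yet, or a failure/timeout event
    # arrived and we choose to try again.
    return {"decision": "ScheduleActivity",
            "attempt": len(failures) + 1,
            "start_to_close_timeout_hours": TIMEOUT_HOURS}
```

The point is that the decider never polls or keeps its own clock; the
service delivers the timeout as just another event in the history and
the decider reacts to it.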
>> trivial but wouldn't be a massive task; the cost of using it is
>> tiny, though, and it made our life a lot easier. It's all written
>> in terms of deciders, which make decisions based on a list of
>> events associated with an execution (eg a "finished activity" event
>> will have the details about the activity starting, being
>> scheduled, and being completed, output status etc),
> is the workflow complicated - a directed graph with complicated
> structure, rather than a series of discrete jobs, each a simple
> chain/pipeline in structure?
It's a simple pipeline structure _at present_, but making it more
complex (ie a directed graph with parallel processing and joins/locks
etc) would be relatively trivial; SWF does provide signalling and lock
'primitives', so to speak. We've not found the need for this just yet.
You can easily take whole pipelines and run them as children of a
master pipeline - eg "run these 5 things against this input" - then
combine the results to track overall success/failure. But we have a
layer above this which already provides that sort of batch management,
so it wasn't needed for us.
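A parent pipeline of that sort boils down to a fan-out/join decider,
roughly like this (again a sketch with hypothetical event and decision
shapes, not real SWF calls):

```python
# Hypothetical parent decider: launch one child pipeline per job, then
# join - complete or fail the parent once every child has reported back.
def fan_out_decide(events, jobs):
    """Return the list of decisions for a parent workflow over `jobs`."""
    started = {e["name"] for e in events
               if e["type"] == "ChildWorkflowStarted"}
    finished = {e["name"]: e["status"] for e in events
                if e["type"] == "ChildWorkflowCompleted"}
    if len(finished) == len(jobs):
        # Join point: every child has completed one way or the other.
        ok = all(status == "OK" for status in finished.values())
        return [{"decision": "CompleteWorkflow" if ok else "FailWorkflow"}]
    # Fan out: start any children not yet launched; leave running ones alone.
    return [{"decision": "StartChildWorkflow", "name": name}
            for name in jobs if name not in started]
```

Each child is itself just an ordinary pipeline; the parent only sees
their start/complete events, which is what makes the overall
success/failure tracking cheap.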
>> maintained by passing JSON blobs around as messages; there'll be
>> a blog post or two explaining things on our website soonish and
>> I'll post them across if there's interest.
> a reference would be interesting.
Soonish, though we're not talking academic papers!
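In the meantime, to give a flavour of those JSON blobs: a task message
is just a small document along these lines (field names here are
illustrative, not our real schema):

```python
import json

# A hypothetical task message as it might be passed between decider and
# worker; every field shown is made up for illustration.
message = json.dumps({
    "task": "transcode",
    "input": "s3://archive/source/master.mxf",
    "output_prefix": "s3://archive/derived/",
    "profile": "h264-1080p",
    "attempt": 1,
})
```

Because everything on the wire is plain JSON, adding a new tool mostly
means agreeing on a few field names rather than writing glue code.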
>> It's being used in production on a regular basis and has had
>> quite a lot of content processed through it so far; these tasks
>> on average run for 2-6 hours and involve ~1GB of data going in
>> and a few megabytes out.
> that's unexceptional from an HPC perspective.
Absolutely; I wouldn't claim we're playing in the same ballpark as
traditional HPC in terms of data volume, throughput or timing/latency
requirements! We're only a humble R&D department playing with small
datasets, since we're at quite an early stage of deploying this. For
starters we have ~15PB of data to process with a number of tools once
we've kicked the tyres a bit.
>> The APIs are all simple HTTPS RESTful ones, storage can be cloud
>> provider storage or local shared drive storage.
> one premise usually found in HPC is that the job, at least the main
> part, should be compute-bound. how do you ensure that your compute
> resources are not idle or starved by external IO bottlenecks?
Generally we're loading in a few GB of data, which takes a few
minutes; beyond that it's hours of compute-bound work. We've got
monitoring on those machines to ensure they aren't stuck. Dedicated
machines load content from remote sources and preload it into a
(network-local) cache, so the IO bottleneck is limited to local
network bandwidth. Generally we're running 4 or 8 workers per machine,
so a machine is only ever fully starved for a few minutes at the start
of each piece of work, which is an acceptable loss for us. There is no
remote data access after the initial chunk of time spent fetching the
content to the local store. Machines are automagically killed by a
very small script if they're idle for any significant amount of time,
so failure conditions where machines end up idle aren't really a
concern - it costs us nothing but time, and minor delays are
acceptable in our processing.
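That idle-kill script really is tiny; the decision part is essentially
the following (a sketch with a made-up 30-minute threshold, not our
actual script):

```python
# Hypothetical idle-kill decision logic for a worker machine.
IDLE_LIMIT = 30 * 60        # illustrative: 30 minutes of sustained idleness
IDLE_CPU_FRACTION = 0.95    # "idle" means >95% of CPU time unused

def should_terminate(idle_fraction, idle_since, now, limit=IDLE_LIMIT):
    """Decide whether an idle worker should kill itself.

    idle_since marks when idleness began (None while busy); returns
    (terminate, idle_since) so the caller can carry the marker forward
    between samples.
    """
    if idle_fraction < IDLE_CPU_FRACTION:
        return False, None          # machine is doing real work: reset marker
    start = idle_since if idle_since is not None else now
    return (now - start) >= limit, start
```

In practice a small loop samples CPU idleness (eg from /proc/stat),
carries idle_since between samples, and invokes the cloud provider's
terminate-instance call once this returns True.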
>> interprocess communication performance is less important and
>> robustness and dynamic scalability plays a major role.
> well, I think that's a bit disingenuous, since HPC is highly tuned
> for robustness and dynamic scalability...
In a typical HPC setting you have n nodes and n does not necessarily
change frequently - unless I've got wholly the wrong end of the stick
- though that's not -always- the case, eg Condor. I know HPC is
focused on robustness; it's the same problem space, in the sense that
lots of machines means lots of failures. But where people get twitchy
about OpenMPI taking a few more µs with setting A versus setting B, we
don't have that concern - every unit of work is isolated, stand-alone
and locally CPU-bound, with no external dependencies until it
completes, and reporting results is a tiny amount of network load.
This is a very lightweight system by most people's standards here, I'm
sure. More interesting to us than the 'HPC' elements is that this is a
generic system for our tasks: we've got a quite complex image/build
system that lets us drop in new code - even quite complex projects -
with nearly no work, and run it all at more or less arbitrary scale.
The generic nature of the system and the low barrier to entry is the
fun bit.