[Beowulf] cluster for doing real time video panoramas?

Thu Dec 22 20:40:52 PST 2005

I'll take a stab at some of this... the parts that appear intuitively
obvious to me.

On 12/21/05, Bogdan Costescu <Bogdan.Costescu at iwr.uni-heidelberg.de> wrote:
>
> [ I think that most of what I write below is quite OT for this list...
> Apologies to those that don't enjoy the subject! ]
>
> On Wed, 21 Dec 2005, Jim Lux wrote:
>
> > I've got a robot on which I want to put a half dozen or so video
> > cameras (video in that they capture a stream of images, but not
> > necessarily that put out analog video..)
>
> It's not entirely clear to me what you want to say above... How is the
> video coming to the computer ? You are later mentioning 1394 cameras,
> so I assume something similar to the DV output from common camcorders.

Streaming video from 'n' cameras, or streams of interleaved images
(full frames).

> > I've also got some telemetry that tells me what the orientation of
> > the robot is.
>
> Does it also give info about the orientation of the assembly of
> cameras ? I'm thinking of the usual problem: is the horizon line going
> down or is my camera tilted ? Although if you really mean spherical
> (read below), this might not be a problem anymore as you might not
> care about what horizon is anyway.

The only sane, rational way to do this I can see is if the camera
reference frame is appropriately mapped and each camera's orientation
is well-known.  Telemetry then provides the camera frame orientation
as delta-x, -y, -z from the neutral reference.  It's also possible to
incorporate a delta-D (distance) from origin which, when coupled with
camera and lens info, will start to yield information on target
distance and can apply to synthetic vision results.
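
A minimal sketch of that composition, assuming yaw-only rotation to keep
it short (a full version would carry all three axes).  The mount angle,
the telemetry value and the helper names are made up for illustration:

```python
import math

def rot_z(deg):
    """3x3 rotation about the z axis (yaw), angle in degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Rotate 3-vector v by matrix m."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

# Hypothetical numbers: the camera is mounted 60 degrees off the frame's
# forward axis, and telemetry says the frame has yawed 30 degrees.
mount = rot_z(60.0)   # camera -> frame, measured once on the bench
frame = rot_z(30.0)   # frame -> world, from telemetry at each time step
cam_to_world = matmul(frame, mount)

# Assume the camera looks along its own +x axis.
boresight = apply(cam_to_world, [1.0, 0.0, 0.0])
heading = math.degrees(math.atan2(boresight[1], boresight[0]))
```

Only the telemetry matrix changes from frame to frame; the mount matrix
is part of the fixed model of the camera frame.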

> > I want to take the video streams and stitch them (in near real time)
> > into a spherical panorama
>
> Do you really mean spherical or only circular (the example that you
> gave being what I call circular) ? IOW: are the focal axes of the
> cameras placed only in a plane (or approximately, given alignment
> precision) ?

Er... with that many cameras, I'd not planar-align 'em.  I'd also have
some off-axis.  Makes the immersive result better/less granular.

> Given that I have a personal interest in video, I thought about
> something similar: not a circular or spherical coverage, but at least
> a large field of view from which I can choose when editing the
> "window" that is then given to the viewer - this comes from a very
> mundane need: I'm not such a good camera operator, so I always miss
> things or make a wrong framing. So this would be similar to how
> Electronic (as opposed to optical) Image Stabilization (EIS) works,
> although EIS does not need any transformation or calibration as the
> whole big image comes from one source.

With the combination of on-plane and off-axis image origination, one
has the potential for a stereoscopic and thus distance effect.  A
circular coverage wouldn't provide this.  Remapping a spherical
coverage into an immersive planar or cave coverage could accomplish
this.
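
For a feel of the distance effect, here is the textbook relation for a
rectified pair of parallel cameras, Z = f * B / d.  This is a sketch
under that assumption only; off-axis cameras would need rectification
first, and all the numbers are hypothetical:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Distance to a point seen by two parallel, rectified cameras.

    focal_px     focal length expressed in pixels
    baseline_m   separation between the two optical centres, in metres
    disparity_px horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 800 px focal length, cameras 10 cm apart,
# feature shifted by 16 px between the two views.
z = depth_from_disparity(800.0, 0.10, 16.0)   # roughly 5 metres
```

Depth resolution falls off quickly as disparity shrinks, so a short
baseline only helps for nearby targets.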

> All that I write below starts from the assumption that the cameras are
> mounted on an assembly in a "permanent" position, such that their
> relative positions (one camera with respect to another) do not change.
> Also that you don't zoom or that you can control the zoom on all
> cameras simultaneously; otherwise putting all the movies together is
> probably hard (in the circular setup; but doable probably with motion
> vectors or related stuff that is already used in MPEG4 compression) or
> impossible (in the spherical setup where you'd miss parts of the
> space).

Makes sense to me.

> First step should be the calibration of the cameras with respect to
> each other. In the COTS world, I don't think that you'd be able to get
> cameras to be fixed such that they equally split the space between
> them (so that the overlap between any 2 cameras would be the same);
> then you also need color calibration, sound level calibration (with
> directional mics, otherwise it makes no sense) and so on -
> multi-camera setups are rather difficult to master for an amateur
> (like me, at least :-)) Another problem that you might face is the
> frame synchronization between the cameras which might come into play
> for moving objects.

First step is to rigidly characterize each camera's optical path: the
lens.  Once its characteristics are known and correlated with those of
its cohort, the math for the reconstruction and projection becomes
somewhat simpler (took a lot to not say "trivial", having been down
this road before!).  THEN one characterizes the physical frame and
identifies the optical image overlap of adjacent cameras.  Evenly
splitting the overlap might not necessarily help here, if I understand
the process.
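
As an illustration of what a lens characterization can buy you, here is
a sketch using a two-coefficient radial distortion model.  The model
choice and the coefficients k1, k2 are assumptions for the example, not
something measured from any real lens:

```python
def distort(xu, yu, k1, k2):
    """Map ideal (undistorted) normalized coordinates to distorted ones."""
    r2 = xu * xu + yu * yu
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return xu * scale, yu * scale

def undistort(xd, yd, k1, k2, iterations=20):
    """Invert distort() by fixed-point iteration.

    Starts from the distorted point and repeatedly divides by the
    scale factor evaluated at the current estimate.
    """
    xu, yu = xd, yd
    for _ in range(iterations):
        r2 = xu * xu + yu * yu
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        xu, yu = xd / scale, yd / scale
    return xu, yu

# Hypothetical coefficients for a mildly barrel-distorting lens.
K1, K2 = -0.2, 0.05
xd, yd = distort(0.3, 0.2, K1, K2)
xu, yu = undistort(xd, yd, K1, K2)
```

The iteration converges quickly for mild distortion; strongly
distorting (fisheye) lenses would want a different model.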

> Talking about moving, IMHO you need to have progressive output from
> the cameras. Interlaced movies would probably create artifacts when
> joining together; deinterlacing several video streams at once might be
> a nice application (but very coarse grained - f.e. one stream per CPU)
> for a cluster, but the results might still not be "perfect", as with
> progressive output, as the deinterlacing results for the same part of
> the scene taken from several cameras might be different.

Interlacing would have to be almost camera-frame sequential video and
at high frame rates.  I agree that deinterlaced streams would offer
better results.  One stream per CPU might be taxing:  This might
require more'n one CPU per stream at 1394 image speeds!
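
A back-of-envelope check on the data rates.  The resolution, frame rate
and pixel format below are assumptions (uncompressed 640x480 at 30
frames/s, 2 bytes per pixel); the only hard number is the nominal
400 Mbit/s of a 1394a bus:

```python
def stream_mbit_per_s(width, height, bytes_per_pixel, fps):
    """Raw bit rate of one uncompressed video stream, in Mbit/s."""
    return width * height * bytes_per_pixel * fps * 8 / 1e6

BUS_MBIT = 400.0   # nominal 1394a rate; usable payload is lower

per_camera = stream_mbit_per_s(640, 480, 2, 30)
six_cameras = 6 * per_camera
cameras_per_bus = int(BUS_MBIT // per_camera)
```

Under these assumptions one bus carries two such cameras at best, so
six of them already implies several buses (or several nodes) before any
pixel has been processed.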

> If the position of the cameras can be finely modified, it might be a
> good idea to try to get them close to the ideal equal splitting of the
> space by just looking at their output. But in any case, if the cameras
> are fixed with respect to each other, you don't need to calculate the
> overlapping regions for every frame - which means that the final frame
> size can also be known at this time; knowing the overlapping also
> makes it easy to arrange the blending parameters. If you want a spherical
> setup, it's quite likely to have more than 2 cameras that overlap in a
> certain place so the calibration will likely be more difficult, but
> once it's done you don't need to do it again...

I'm not so sure this is beneficial if you can appropriately model the
lens systems of each camera and the optical qualities of the imager
(CMOS or CCD...).  I think you can apply the numerical mixing and
projection in near real time if you have resolved the models early.
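
One way to picture "resolve the models early, mix in real time" is a
lookup table built once and reused for every frame.  This sketch
assumes two cameras whose rows overlap by a known number of columns and
a simple linear feather across the overlap; real geometry would come
from the lens and frame models:

```python
def build_table(cam_width, overlap):
    """For each output column, list (camera, source column, weight)."""
    shift = cam_width - overlap          # where camera 1 starts in the output
    table = []
    for x in range(2 * cam_width - overlap):
        if x < shift:                    # seen by camera 0 only
            table.append([(0, x, 1.0)])
        elif x >= cam_width:             # seen by camera 1 only
            table.append([(1, x - shift, 1.0)])
        else:                            # overlap: feather from 0 to 1
            t = (x - shift + 0.5) / overlap
            table.append([(0, x, 1.0 - t), (1, x - shift, t)])
    return table

def stitch_row(table, rows):
    """Apply the precomputed table to one row from each camera."""
    return [sum(w * rows[cam][col] for cam, col, w in entry)
            for entry in table]

TABLE = build_table(cam_width=4, overlap=2)          # done once
row = stitch_row(TABLE, [[10, 10, 10, 10],           # done per frame
                         [20, 20, 20, 20]])
```

The per-frame work is then just table lookups and multiply-adds, which
is the part worth spreading across CPUs.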

> What makes sense to me as a next step would be to map the "camera
> space" to the "real world" - for example for the circular setup by
> making a circle around the cameras assembly with degrees marked on it.
> In a spherical setup, you probably have to use 3 marked circles, one
> for each axis. This way you can find a correspondence between a pixel
> (let's say in the middle of the frame) and its real world angle, such
> that when you are looking later for a certain angle you know what
> pixel to put in the center of the image. If the whole camera assembly
> rotates by a certain angle (and that's the reason for my second
> question up in this message), you simply add (or subtract) this to
> (or from) the angle that you're looking for.

If you make the camera frame a fixed element and fix the camera
positions, again, you can model the system numerically.  A dynamic
calibration system is a nice check after you've determined the
reference frame in physical space (OK, in a sense you're doing that
here but modeling the physical system is really mandatory.)
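
The pixel-to-angle correspondence discussed above can be sketched for
the circular case as follows.  It assumes an ideal pinhole camera
(distortion already removed), yaw increasing with column number, and a
hypothetical 60 degree horizontal field of view:

```python
import math

def pixel_to_bearing(px, width, hfov_deg, camera_yaw_deg,
                     assembly_yaw_deg=0.0):
    """World bearing (degrees, 0..360) seen at image column px.

    camera_yaw_deg   where this camera points within the assembly
    assembly_yaw_deg rotation of the whole assembly, from telemetry
    """
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    offset = math.degrees(math.atan((px - width / 2.0) / focal_px))
    return (camera_yaw_deg + assembly_yaw_deg + offset) % 360.0

# Centre column of a camera mounted at 120 degrees, assembly turned 15.
centre = pixel_to_bearing(320, 640, 60.0, 120.0, 15.0)
# Right edge of the same image: half the field of view further round.
edge = pixel_to_bearing(640, 640, 60.0, 120.0, 15.0)
```

Rotating the assembly only changes the added offset; the per-pixel part
of the mapping stays fixed and can be tabulated.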

> To come back to cluster usage, I think that you can treat the whole
> thing by doing something like a spatial decomposition, such that
> either:
> - each computer gets the same amount of video data (to avoid
> overloading). This way it's clear which computer takes video from
> which camera, but the amount of "real world" space that each of them
> gives is not equal, so putting them together might be difficult.
> - each computer takes care of the same amount of the "real world" space,
> so each computer provides the same amount of output data. However the
> video streams splitting between the CPUs might be a problem as each
> frame might need to be distributed to several CPUs.

I would vote for decomposition by hardware device (camera).  And, I'd
have some degree of preprocessing with consideration that the cluster
might not necessarily be our convenient little flat NUMA cluster we're
all so used to.  If I had a cluster of 4-way nodes I'd be considering
reordering the code to have preprocessing of the image-stream on one
core, making it in effect a 'head' core, and letting it decompose the
process to the remaining cores.  I'm not convinced the typical CPU can
appropriately handle a feature-rich environment imaged using a decent
DV video stream.
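
A toy sketch of decomposition by camera, one worker per stream.  The
preprocess() function is a stand-in for the real per-camera work, and
threads are used only to keep the example self-contained; on a real
cluster this would be processes or MPI ranks, since Python threads do
not run pure-Python pixel math in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(camera_id, frame):
    """Placeholder for undistort / colour-correct / reproject."""
    return camera_id, [2 * p for p in frame]

def process_tick(frames, workers):
    """Hand each camera's frame for one time step to its own worker.

    frames: dict mapping camera id -> list of pixel values.
    Returns a dict mapping camera id -> processed pixel list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(preprocess, cam, frame)
                   for cam, frame in sorted(frames.items())]
        return dict(f.result() for f in futures)

# Six hypothetical cameras, two "pixels" each.
frames = {cam: [cam, cam + 1] for cam in range(6)}
result = process_tick(frames, workers=6)
```

A longer-lived pool would avoid paying the startup cost on every frame;
it is created per call here only for brevity.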

> > But, then, how do you do the real work... should the camera
> > recalibration be done all on one processor?
>
> It's not clear to me what you call "recalibration". You mean color
> correction, perspective change and so on ? These could be done in
> parallel, but if the cameras don't move relative to each other, the
> transformations are always the same so it's probably easy to partition
> them on CPUs even as much as to get a balanced load.

Agreed.

> > Should each camera (or pair) gets its own cpu, which builds that
> > part of the overall spherical image, and hands them off to yet
> > another processor which "looks" at the appropriate part of the video
> > image and sends that to the user?
>
> Well, first of all, your words suggest to me that you are talking
> about a circular setup. In a real spherical one, you should have some
> parts that overlap from at least 3 cameras (where their edges look
> like a T) so you can't talk about pairs.

Asymmetrical offset can occasionally drop this to pairs, but your
solution is less costly in computational terms.

> Secondly, all my thoughts above try to cover the case where you want
> to get at each moment a complete image out of the system, like when
> different people are watching maybe different parts of the output. If
> there's only one "window" that should be seen, then the image would
> probably come from at most 2-3 cameras; the transformation could
> probably be done on different CPUs (like in the case for full output),
> but putting them together (blending) would be easy enough to do even
> on one CPU, so not much use for a cluster there... unless the
> transformations are so CPU intensive that they can't be done in realtime,
> in which case you could send each frame to a different CPU and get the
> output with a small delay (equal to the time needed to transform one
> frame).
>
> That's it, I hope that I made sense... given that it's well past
> midnight ;-)

Made pretty good sense to me!  I'm just hoping my ramblings made sense to you!

Gerry
--
Gerry Creager N5JXS
Texas A&M University AATLT
SCOOP/Texas Mesonet