2 Related Work
A tremendous amount of progress has been made in static image mosaicing. For example, strip panorama techniques [2][3] continuously capture horizontal outdoor scenes and stitch them into a long panoramic picture, which can be used for digital tourism and the like. Many techniques, such as plane sweep [5] and multi-view projection [6], have been developed to remove ghosting and blurring artifacts.
As for panoramic video, however, the technology is still not mature. One of the main difficulties is the real-time requirement. Common frame rates are 25~30 FPS, so to create a video panorama each panoramic frame must be produced within roughly 0.03~0.04 seconds. This means that the stitching algorithms designed for static image mosaicing cannot be applied directly to real-time frames, and because of the time-consuming computation involved, existing methods for improving static panoramas can hardly be used for stitching videos. To sidestep these difficulties, some researchers resort to hardware. For example, a carefully designed camera cluster that guarantees an approximately common virtual COP (center of projection) [7] can register the inputs easily and avoid parallax to some extent. From another perspective, however, this kind of approach is undesirable because it relies heavily on the capturing device.
Our approach does not need special hardware. Instead, it makes use of ordinary webcams, which makes the system inexpensive and easy to deploy. Besides this, the positions and directions of the webcams are flexible as long as they share some overlapped field-of-view. We design a two-stage solution to tackle this challenging situation. The whole system is discussed in Section 3, and the implementation of each stage is discussed in detail in Sections 4 and 5.
3 System Framework
As shown in Fig. 1, the inputs of our system are independent frame sequences from two common webcams and the output is the stitched video. To achieve real time, we separate the processing into two stages. The first one, called the initialization stage, only needs to be run once after the webcams are fixed. This stage includes several time-consuming procedures that are responsible for calculating the geometric relationship between the adjacent webcams. We first detect robust features in the initial frame of each webcam and then match them between the adjacent views. The correct matches are then employed to estimate the perspective matrix. The second stage runs in real time. In this stage, we use the matrix from the first stage to register the frames of the different webcams on the same plane and blend the overlapped region using a nonlinear weight mask. The implementation of the two stages is discussed in detail below.
[Fig. 1 block diagram: the initialization stage (feature detection, feature matching, RANSAC, projective matrix) runs once on the initial frames; the real-time stage projects and blends the frames of the narrow views into a wide view for display.]
Fig. 1. Framework of our system. The initialization stage estimates the geometric relationship
between the webcams based on the initial frames. The real-time stage registers and blends the
frame sequences in real time.
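To make the two-stage split concrete, the following Python/OpenCV sketch outlines one possible realization of the framework in Fig. 1. The helpers estimate_homography and blend_frames are hypothetical placeholders for the procedures detailed in Sections 4 and 5; the OpenCV calls are standard, but the structure is only an illustrative skeleton, not the authors' implementation.

```python
import cv2

def initialization_stage(cap_base, cap_other):
    # Run once after the webcams are fixed: grab one frame from each webcam
    # and estimate the projective matrix between the two views (Section 4).
    _, frame_base = cap_base.read()
    _, frame_other = cap_other.read()
    H = estimate_homography(frame_other, frame_base)  # hypothetical helper, see Section 4
    return H

def realtime_stage(cap_base, cap_other, H, canvas_size, alpha_mask):
    # Runs for every frame pair: warp the second view onto the base plane
    # and blend the overlap with a nonlinear weight mask (Section 5).
    while True:
        ok1, frame_base = cap_base.read()
        ok2, frame_other = cap_other.read()
        if not (ok1 and ok2):
            break
        warped = cv2.warpPerspective(frame_other, H, canvas_size)
        stitched = blend_frames(frame_base, warped, alpha_mask)  # hypothetical helper, see Section 5
        cv2.imshow("wide view", stitched)
        if cv2.waitKey(1) == 27:  # Esc to quit
            break
```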
4 Initialization Stage
Since the location and orientation of the webcams are flexible, the geometric relation-
ship between the adjacent views is unknown before registration. We choose one web-
cam as a base and use the full planar perspective motion model [8] to register the
other view on the same plane. The planar perspective transform warps an image into
another using 8 parameters:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = u' \sim Hu = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{1}$$
where $u = (x, y, 1)^T$ and $u' = (x', y', 1)^T$ are homogeneous coordinates in the two views, and $\sim$ indicates equality up to scale since $H$ is itself homogeneous. The perspective transform is a superset of the translation, rigid, similarity, and affine transforms. We seek an optimized matrix $H$ between the views so that they can be aligned well in the same plane.
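As a small worked example of Eq. (1), the snippet below maps a pixel from one view into the base view by multiplying its homogeneous coordinate by H and dividing by the third component. The matrix values here are made up purely for illustration.

```python
import numpy as np

H = np.array([[1.02, 0.01, 35.0],   # made-up homography, for illustration only
              [0.00, 1.01,  4.0],
              [1e-4, 2e-5,  1.0]])

u = np.array([120.0, 80.0, 1.0])     # homogeneous coordinate (x, y, 1)
w = H @ u                            # Hu, defined only up to scale
x_prime, y_prime = w[0] / w[2], w[1] / w[2]  # normalize to obtain (x', y')
print(x_prime, y_prime)
```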
To recover the 8 parameters, we first extract keypoints in each input frame and then match them between the adjacent views. Many classic detectors such as Canny [9] and Harris [10] can be employed to extract interest points. However, they are not robust enough for matching in our case, which involves rotation and some perspective distortion between the adjacent views. In this paper, we compute SIFT features [11][12], which were originally used for object recognition.
Simply put, there are 4 extraction steps. In the first step, we filter the frame with a
Gaussian kernel:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{2}$$

where $*$ denotes convolution, $I(x, y)$ is the initial frame, and $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$. Then we construct a DoG (Difference of Gaussians) space as follows:

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{3}$$

where $k$ is the scaling factor between adjacent scales. The extrema in the DoG space are taken as keypoints.
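As a minimal illustration of Eqs. (2) and (3), the snippet below builds one DoG level with OpenCV; the variable names and the choice k = √2 are ours, not taken from the paper.

```python
import cv2
import numpy as np

def dog_level(I, sigma, k=np.sqrt(2.0)):
    # I: grayscale frame as a float32 array.
    L1 = cv2.GaussianBlur(I, (0, 0), sigma)      # L(x, y, sigma), Eq. (2)
    L2 = cv2.GaussianBlur(I, (0, 0), k * sigma)  # L(x, y, k*sigma)
    return L2 - L1                               # D(x, y, sigma), Eq. (3)
```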
In the second step, we refine the keypoint locations through a Taylor expansion of the DoG function:

$$D(v) = D + \frac{\partial D^T}{\partial v} v + \frac{1}{2} v^T \frac{\partial^2 D}{\partial v^2} v \tag{4}$$

where $v = (x, y, \sigma)^T$. From formula (4), we get the sub-pixel and sub-scale coordinates as follows:

$$\hat{v} = -\left(\frac{\partial^2 D}{\partial v^2}\right)^{-1} \frac{\partial D}{\partial v}. \tag{5}$$
A threshold on the value of $D(\hat{v})$ is applied to discard unstable points. We also make use of the Hessian matrix to eliminate edge responses:

$$\frac{\mathrm{Tr}(M_{Hes})^2}{\mathrm{Det}(M_{Hes})} < \frac{(r+1)^2}{r} \tag{6}$$

where $M_{Hes} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}$ is the Hessian matrix, and $\mathrm{Tr}(M_{Hes})$ and $\mathrm{Det}(M_{Hes})$ are its trace and determinant. $r$ is an empirical threshold, and we set $r = 10$ in this study.
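The following sketch shows how Eqs. (5) and (6) can be evaluated at a candidate keypoint, assuming the gradient and Hessian of D have already been estimated (e.g. by finite differences); the function names are ours, not from the paper.

```python
import numpy as np

def subpixel_offset(grad, hessian):
    # Eq. (5): offset of the true extremum from the sampled point.
    # grad = dD/dv (3-vector), hessian = d^2D/dv^2 (3x3 matrix).
    return -np.linalg.solve(hessian, grad)

def passes_edge_test(Dxx, Dyy, Dxy, r=10.0):
    # Eq. (6): reject keypoints lying on edges using the 2x2 spatial Hessian of D.
    tr = Dxx + Dyy
    det = Dxx * Dyy - Dxy * Dxy
    if det <= 0:                       # curvatures of opposite sign: always reject
        return False
    return tr * tr / det < (r + 1.0) ** 2 / r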
In the third step, the gradient orientations and magnitudes of the sample pixels within a Gaussian window are used to build a histogram that assigns an orientation to the keypoint. Finally, a 128-dimensional descriptor of every keypoint is obtained by concatenating the orientation histograms over a 16 × 16 region.
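In practice, keypoint detection, orientation assignment, and descriptor computation are available off the shelf. A minimal sketch with OpenCV is shown below (assuming an opencv-python build that includes SIFT; this is not the code used in the paper).

```python
import cv2

img = cv2.imread("frame_base.png", cv2.IMREAD_GRAYSCALE)  # hypothetical initial frame
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)  # one 128-D descriptor per keypoint
print(len(keypoints), descriptors.shape)
```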
By comparing the Euclidean distances of the descriptors, we get an initial set of corresponding keypoints (Fig. 2(a)). The feature descriptors are invariant to translation, rotation, and scaling. However, they are only partially affine-invariant, so the initial matched pairs often contain outliers in our case. We prune the outliers
by fitting the candidate correspondences to a perspective motion model using RANSAC [13] iterations. Specifically, we randomly choose 4 pairs of matched points in each iteration and compute an initial projective matrix, then use the formula below to check whether the matrix fits the remaining points:
$$\left\| \begin{pmatrix} x'_n \\ y'_n \\ 1 \end{pmatrix} - H \begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} \right\| < \theta. \tag{7}$$
Here $H$ is the initial projective matrix and $\theta$ is the outlier threshold. In order to better tolerate parallax, a loose inlier threshold is used. The matrix consistent with the most initial matched pairs is taken as the best initial matrix, and the pairs that fit it are considered correct matches (Fig. 2(b)).
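A minimal sketch of this RANSAC loop is given below, assuming pts_warp and pts_base are (N, 2) float32 arrays of matched coordinates; the projected points are normalized by their third homogeneous component before the test of Eq. (7) is applied, and degenerate 4-point samples are not handled for brevity. cv2.findHomography with the cv2.RANSAC flag provides equivalent functionality in a single call.

```python
import numpy as np
import cv2

def ransac_homography(pts_warp, pts_base, n_iters=500, theta=3.0):
    # pts_warp, pts_base: (N, 2) float32 arrays of matched keypoint coordinates.
    n = len(pts_base)
    src_h = np.hstack([pts_warp, np.ones((n, 1), np.float32)])  # homogeneous (x, y, 1)
    best_H, best_inliers = None, np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        idx = np.random.choice(n, 4, replace=False)              # 4 random matched pairs
        H = cv2.getPerspectiveTransform(pts_warp[idx], pts_base[idx])
        proj = (H @ src_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]                        # normalize homogeneous scale
        err = np.linalg.norm(proj - pts_base, axis=1)
        inliers = err < theta                                    # inlier test, Eq. (7)
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```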
Fig. 2. (a) Two frames with large misregistration and the initial matched features between them. Note that there are mismatched pairs besides the correct ones. (b) Correct matches after RANSAC filtering.
After purifying the matched pairs, the final perspective matrix $H$ is estimated using a least-squares method. In detail, we construct the error function below and minimize the sum of the squared distances between the coordinates of the corresponding features:

$$F_{error} = \sum_{n=1}^{N} \left\| H u_{warp,n} - u_{base,n} \right\|^2 = \sum_{n=1}^{N} \left\| u'_{warp,n} - u_{base,n} \right\|^2 \tag{8}$$

where $u_{base,n}$ is the homogeneous coordinate of the $n$th feature in the image being projected onto, and $u_{warp,n}$ is the correspondence of $u_{base,n}$ in the other view.
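Continuing the earlier sketch (and reusing its pts_warp, pts_base, and best_inliers), this refinement can be realized, for example, by re-fitting over the inlier correspondences only; passing method 0 to cv2.findHomography performs a least-squares fit in the spirit of Eq. (8). This is an assumption-laden sketch, not the authors' code.

```python
# Re-estimate H over the RANSAC inliers only; method=0 is the plain least-squares fit.
H_refined, _ = cv2.findHomography(pts_warp[best_inliers], pts_base[best_inliers], 0)
```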
5 Real-Time Stage
After obtaining the perspective matrix between the adjacent webcams, we project the frames of one webcam onto the other and blend them in real time. Since the webcams are placed relatively freely, they may not share a common center of projection and are therefore likely to produce parallax. In other words, the frames of the different webcams cannot be registered exactly. We therefore design a nonlinear blending strategy to minimize the ghosting and blurring in the overlapped region. Essentially, this is a kind of alpha blending. The synthesized frames $F_{syn}$ can be expressed as follows:

$$F_{syn}(x, y) = \alpha(x, y) \cdot F_{base}(x, y) + \big(1 - \alpha(x, y)\big) \cdot F_{proj}(x, y) \tag{9}$$

where $F_{base}$ are the frames of the base webcam, $F_{proj}$ are the frames projected from the adjacent webcam, and $\alpha(x, y)$ is the weight at pixel $(x, y)$.
In the conventional blending method, the weight of a pixel is a linear function of its distance to the image boundaries. This method treats the different views equally and performs well in normal cases. However, in the case of severe parallax, the linear combination results in blurring and ghosting over the whole overlapped region, as is the case in Fig. 3(b). We therefore use a special $\alpha$ function that gives priority to one view to avoid the conflict in the overlapped region of the two webcams. Simply put, we construct a nonlinear $\alpha$ mask as follows:

$$\alpha(x, y) = \begin{cases} 1, & \text{if } \min(x,\, y,\, W - x,\, H - y) > T \\[4pt] \dfrac{\sin\!\big(\pi \cdot (\min(x,\, y,\, W - x,\, H - y)/T - 0.5)\big) + 1}{2}, & \text{otherwise} \end{cases} \tag{10}$$
where $W$ and $H$ are the width and height of the frame, and $T$ is the width of the nonlinearly decreasing border. The mask is registered with the frames and clipped according to the region to be blended. The $\alpha$ value stays constant in the central part of the base frame and begins to drop sharply once a pixel comes close enough to the boundary of the other layer. The width of this transition is controlled by $T$: the larger $T$ is, the smoother and more natural the transition between frames, but the smaller the clear central region, and vice versa. We refer to this method as nonlinear mask blending. By this nonlinear synthesis, we keep a balance between the smooth transition at the boundaries and the uniqueness and clarity of the interior.
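A minimal sketch of Eqs. (9) and (10) is given below, assuming float32 frames that have already been registered on the same canvas; clipping the mask to the actual overlap region is omitted for brevity, and the helper names are ours rather than the paper's.

```python
import numpy as np

def nonlinear_alpha_mask(width, height, T):
    # Eq. (10): weight 1 in the interior, dropping nonlinearly within a border of width T.
    xv, yv = np.meshgrid(np.arange(width, dtype=np.float32),
                         np.arange(height, dtype=np.float32))
    d = np.minimum.reduce([xv, yv, width - 1 - xv, height - 1 - yv])  # distance to nearest border
    alpha = (np.sin(np.pi * (d / T - 0.5)) + 1.0) / 2.0
    alpha[d > T] = 1.0
    return alpha

def blend_frames(frame_base, frame_proj, alpha):
    # Eq. (9): per-pixel alpha blending of the base frame and the projected frame.
    a = alpha[..., None]              # broadcast the mask over the colour channels
    return a * frame_base + (1.0 - a) * frame_proj
```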
(a) A typical pair of scenes with strong parallax. (b) Linear blending. (c) Our blending.
Fig. 3. Comparison between linear blending and our blending strategy on typical scenes with
severe parallax
6 Results
In this section, we show the results of our method on different scenes. We built a prototype with two common webcams, as shown in Fig. 4. The webcams are placed together on a simple support, and the lenses are flexible and can be rotated and directed to different orientations freely. Each webcam has a resolution of QVGA (320 × 240 pixels) and a frame rate of 30 FPS.
Fig. 4. Two common webcams fixed on a simple support. The lenses are flexible and can be
rotated and adjusted to different orientations freely.
Table 1. Processing time of the main procedures of the system

Stage            Procedure                  Time (seconds)
Initialization   Feature detection          0.320 ~ 0.450
                 Feature matching           0.040 ~ 0.050
                 RANSAC filtering           0.000 ~ 0.015
                 Matrix computation         0.000 ~ 0.001
Real time        Projection and blending    0.000 ~ 0.020
The processing time of the main procedures is listed in Table 1. The system runs on a PC with an E4500 2.2 GHz CPU and 2 GB of memory. The initialization stage usually takes about 0.7~1 second, depending on the content of the scene. Projection and blending usually take less than 0.02 seconds for a pair of frames and thus run in real time. Note that whenever the webcams are moved or changed, the initialization stage must be run again to re-compute the geometric relationship between the webcams. Currently, this re-initialization is started by the user. After initialization, the system can process the video at a rate of 30 FPS.
In our system, the positions and directions of the webcams are adjustable as long as they share some overlapped field-of-view. Typically, the overlapped region should be at least 20% of the original view, otherwise there may not be enough robust features to match between the webcams. Fig. 5 shows the stitching results for some typical frames. In these cases, the webcams were intentionally rotated to a certain angle or even turned upside down. As can be seen in the figures, the system can still register and blend the frames into a natural whole scene. Fig. 6 shows some typical stitched scenes from a real-time video. In (a), two static indoor views are stitched into a wide view. In (b) and (c), moving objects show up in the scene, either far away or close to the lens. As illustrated in the figures, the stitched views are as clear and natural as the original narrow views.
(a) A pair of frames with 15° rotation and the stitching result
(b) A pair of frames with 90° rotation and the stitching result
(c) A pair of frames with 180° rotation and the stitching result
Fig. 5. Stitching frames of some typical scenes. The webcams are intentionally rotated to a
certain angle or turned upside down.
(a) A static scene
(b) A far away object moving in the scene
(c) A close object moving in the scene
Fig. 6. Stitching results from a real-time video. Moving objects in the stitched scene are as clear as in the original narrow view.
Although our system is flexible and robust under normal conditions, the quality of the mosaiced video drops severely in two cases: first, when the scene lacks salient features, as in the case of a white wall, the geometric relationship between the webcams cannot be estimated correctly; second, when the parallax is too strong, there may be noticeable stitching traces at the frame border. These problems can be avoided by aiming the lenses at scenes with salient features and adjusting the orientation of the webcams.
7 Conclusions and Future Work
In this paper, we have presented a technique for stitching videos from webcams. The system receives frame sequences from common webcams and outputs a synthesized video with a wide field-of-view in real time. The positions and directions of the webcams are flexible as long as they share some overlapped field-of-view. There are two stages in the system. The initialization stage calculates the geometric relationship between frames from adjacent webcams. A nonlinear mask blending method, which avoids ghosting and blurring in the main part of the overlapped region, is proposed for synthesizing the frames in real time. As illustrated by the experimental results, this is an effective and inexpensive way to construct video with a wide field-of-view.
Currently, we have focused on using only two webcams. As a natural extension of our work, we would like to scale the system up to more webcams. We also plan to explore the hard and interesting issues of eliminating the exposure differences between webcams in real time and solving the problems mentioned at the end of the last section.
Acknowledgment
The financial support provided by the National Natural Science Foundation of China (Project ID: 60772032) and Microsoft (China) Co., Ltd. is gratefully acknowledged.
References
1. Szeliski, R., Shum, H.Y.: Creating Full View Panoramic Mosaics and Environment Maps. In: Proc. of SIGGRAPH 1997, Computer Graphics Proceedings, Annual Conference Series, pp. 251–258 (1997)
2. Agarwala, A., Agrawala, M., Chen, M., Salesin, D., Szeliski, R.: Photographing Long
Scenes with Multi-Viewpoint Panoramas. In: Proc. of SIGGRAPH, pp. 853–861 (2006)
3. Zheng, J.Y.: Digital route panoramas. IEEE MultiMedia 10(3), 57–67 (2003)
4. Hsu, C.-T., Cheng, T.-H., Beukers, R.A., Horng, J.-K.: Feature-based Video Mosaic. Image Processing, pp. 887–890 (2000)
5. Kang, S.B., Szeliski, R., Uyttendaele, M.: Seamless Stitching Using Multi-Perspective
Plane Sweep. Microsoft Research, Tech. Rep. MSR-TR-2004-48 (2004)
6. Zelnik-Manor, L., Peters, G., Perona, P.: Squaring the Circle in Panoramas. In: Proc. 10th
IEEE Conf. on Computer Vision (ICCV 2005), pp. 1292–1299 (2005)
7. Majumder, A., Gopi, M., Seales, W.B., Fuchs, H.: Immersive Teleconferencing: A New Algorithm to Generate Seamless Panoramic Imagery. In: Proc. of ACM Multimedia, pp. 169–178 (1999)
8. Szeliski, R.: Video Mosaics for Virtual Environments. IEEE Computer Graphics and Applications, 22–23 (1996)
9. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–698 (1986)
10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. of the 4th
Alvey Vision Conference, pp. 147–151 (1988)
11. Lowe, D.G.: Distinctive Image Features From Scale-invariant Keypoints. International
Journal of Computer Vision 60(2), 91–110 (2004)
12. Winder, S., Brown, M.: Learning Local Image Descriptors. In: Proc. of the International
Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8 (2007)
13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall, Englewood Cliffs (2003)