2 Related Work
A tremendous amount of progress has been made in static image mosaicing. For example, strip panorama techniques [2][3] continuously capture horizontal outdoor scenes and stitch them into a long panoramic picture, which can be used for digital tourism and the like. Many techniques, such as plane sweep [5] and multi-view projection [6], have been developed to remove ghosting and blurring artifacts.
As for panoramic video, however, the technology is still not mature. One of the main difficulties is the real-time requirement. Common frame rates are 25~30 FPS, so to create a video panorama each panoramic frame must be produced within roughly 0.03~0.04 seconds. This means that the stitching algorithms designed for static image mosaicing cannot be applied directly to real-time frames, and because of the time-consuming computation involved, existing methods for improving static panoramas can hardly be used for stitching videos. To sidestep these difficulties, some researchers resort to hardware. For example, a carefully designed camera cluster that guarantees an approximately common virtual COP (center of projection) [7] can register the inputs easily and avoid parallax to some extent. From another perspective, however, this kind of approach is undesirable because it relies heavily on the capturing device.
Our approach does not need special hardware. Instead, it makes use of ordinary webcams, which makes the system inexpensive and easy to deploy. Besides this, the positions and directions of the webcams are flexible as long as they share some overlapped field-of-view. We design a two-stage solution to tackle this challenging situation. The whole system is discussed in Section 3, and the implementation of each stage is discussed in detail in Sections 4 and 5.
3 System Framework
As shown in Fig. 1, the inputs of our system are independent frame sequences from two common webcams and the output is the stitched video. To achieve real time, we separate the processing into two stages. The first one, called the initialization stage, only needs to be run once after the webcams are fixed. This stage includes several time-consuming procedures that are responsible for calculating the geometric relationship between the adjacent webcams. We first detect robust features in the initial frame of each webcam and then match them between the adjacent views. The correct matches are then employed to estimate the perspective matrix. The second stage runs in real time. In this stage, we use the matrix from the first stage to register the frames of the different webcams on the same plane and blend the overlapped region using a nonlinear weight mask. The implementation of the two stages is discussed in detail below.
[Fig. 1 block diagram: the initialization stage (feature detection, feature matching, RANSAC, projective matrix) runs once on the initial frames; the real-time stage projects and blends the frames of the narrow views into a wide view for display.]
Fig. 1. Framework of our system. The initialization stage estimates the geometric relationship
between the webcams based on the initial frames. The real-time stage registers and blends the
frame sequences in real time.
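To make the two-stage split concrete, the following Python/OpenCV sketch outlines one possible realization of the framework in Fig. 1. The helpers estimate_homography and blend_frames are hypothetical placeholders for the procedures detailed in Sections 4 and 5; the OpenCV calls are standard, but the structure is only an illustrative skeleton, not the authors' implementation.

```python
import cv2

def initialization_stage(cap_base, cap_other):
    # Run once after the webcams are fixed: grab one frame from each webcam
    # and estimate the projective matrix between the two views (Section 4).
    _, frame_base = cap_base.read()
    _, frame_other = cap_other.read()
    H = estimate_homography(frame_other, frame_base)  # hypothetical helper, see Section 4
    return H

def realtime_stage(cap_base, cap_other, H, canvas_size, alpha_mask):
    # Runs for every frame pair: warp the second view onto the base plane
    # and blend the overlap with a nonlinear weight mask (Section 5).
    while True:
        ok1, frame_base = cap_base.read()
        ok2, frame_other = cap_other.read()
        if not (ok1 and ok2):
            break
        warped = cv2.warpPerspective(frame_other, H, canvas_size)
        stitched = blend_frames(frame_base, warped, alpha_mask)  # hypothetical helper, see Section 5
        cv2.imshow("wide view", stitched)
        if cv2.waitKey(1) == 27:  # Esc to quit
            break
```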
4 Initialization Stage
Since the location and orientation of the webcams are flexible, the geometric relation-
ship between the adjacent views is unknown before registration. We choose one web-
cam as a base and use the full planar perspective motion model [8] to register the
other view on the same plane. The planar perspective transform warps an image into
another using 8 parameters:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = u' \sim Hu = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{1}$$
where $u = (x, y, 1)^T$ and $u' = (x', y', 1)^T$ are homogeneous coordinates in the two views, and $\sim$ indicates equality up to scale since $H$ is itself homogeneous. The perspective transform is a superset of the translation, rigid, similarity, and affine transforms. We seek an optimized matrix $H$ between the views so that they can be aligned well in the same plane.
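As a small worked example of Eq. (1), the snippet below maps a pixel from one view into the base view by multiplying its homogeneous coordinate by H and dividing by the third component. The matrix values here are made up purely for illustration.

```python
import numpy as np

H = np.array([[1.02, 0.01, 35.0],   # made-up homography, for illustration only
              [0.00, 1.01,  4.0],
              [1e-4, 2e-5,  1.0]])

u = np.array([120.0, 80.0, 1.0])     # homogeneous coordinate (x, y, 1)
w = H @ u                            # Hu, defined only up to scale
x_prime, y_prime = w[0] / w[2], w[1] / w[2]  # normalize to obtain (x', y')
print(x_prime, y_prime)
```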
To recover the 8 parameters, we first extract keypoints in each input frame and then match them between the adjacent views. Many classic detectors such as Canny [9] and Harris [10] can be employed to extract interest points. However, they are not robust enough for matching in our case, which involves rotation and some perspective distortion between the adjacent views. In this paper, we compute SIFT features [11][12], which were originally used for object recognition.
Simply put, there are 4 extraction steps. In the first step, we filter the frame with a
Gaussian kernel:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{2}$$

where $*$ denotes convolution, $I(x, y)$ is the initial frame, and $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$. Then we construct a DoG (Difference of Gaussians) space as follows:

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{3}$$

where $k$ is the scaling factor between adjacent scales. The extrema in the DoG space are taken as keypoints.
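As a minimal illustration of Eqs. (2) and (3), the snippet below builds one DoG level with OpenCV; the variable names and the choice k = √2 are ours, not taken from the paper.

```python
import cv2
import numpy as np

def dog_level(I, sigma, k=np.sqrt(2.0)):
    # I: grayscale frame as a float32 array.
    L1 = cv2.GaussianBlur(I, (0, 0), sigma)      # L(x, y, sigma), Eq. (2)
    L2 = cv2.GaussianBlur(I, (0, 0), k * sigma)  # L(x, y, k*sigma)
    return L2 - L1                               # D(x, y, sigma), Eq. (3)
```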
In the second step, we refine the keypoint locations through a Taylor expansion of the DoG function:

$$D(v) = D + \frac{\partial D^T}{\partial v} v + \frac{1}{2} v^T \frac{\partial^2 D}{\partial v^2} v \tag{4}$$

where $v = (x, y, \sigma)^T$. From formula (4), we get the sub-pixel and sub-scale coordinates as follows:

$$\hat{v} = -\left(\frac{\partial^2 D}{\partial v^2}\right)^{-1} \frac{\partial D}{\partial v}. \tag{5}$$
A threshold on the value of $D(\hat{v})$ is applied to discard unstable points. We also make use of the Hessian matrix to eliminate edge responses:

$$\frac{\mathrm{Tr}(M_{Hes})^2}{\mathrm{Det}(M_{Hes})} < \frac{(r+1)^2}{r} \tag{6}$$

where $M_{Hes} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}$ is the Hessian matrix, and $\mathrm{Tr}(M_{Hes})$ and $\mathrm{Det}(M_{Hes})$ are its trace and determinant. $r$ is an empirical threshold, and we set $r = 10$ in this study.
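The following sketch shows how Eqs. (5) and (6) can be evaluated at a candidate keypoint, assuming the gradient and Hessian of D have already been estimated (e.g. by finite differences); the function names are ours, not from the paper.

```python
import numpy as np

def subpixel_offset(grad, hessian):
    # Eq. (5): offset of the true extremum from the sampled point.
    # grad = dD/dv (3-vector), hessian = d^2D/dv^2 (3x3 matrix).
    return -np.linalg.solve(hessian, grad)

def passes_edge_test(Dxx, Dyy, Dxy, r=10.0):
    # Eq. (6): reject keypoints lying on edges using the 2x2 spatial Hessian of D.
    tr = Dxx + Dyy
    det = Dxx * Dyy - Dxy * Dxy
    if det <= 0:                       # curvatures of opposite sign: always reject
        return False
    return tr * tr / det < (r + 1.0) ** 2 / r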
In the third step, the gradient orientations and magnitudes of the sample pixels within a Gaussian window are used to build a histogram that assigns an orientation to the keypoint. Finally, a 128-dimensional descriptor of every keypoint is obtained by concatenating the orientation histograms over a 16 × 16 region.
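In practice, keypoint detection, orientation assignment, and descriptor computation are available off the shelf. A minimal sketch with OpenCV is shown below (assuming an opencv-python build that includes SIFT; this is not the code used in the paper).

```python
import cv2

img = cv2.imread("frame_base.png", cv2.IMREAD_GRAYSCALE)  # hypothetical initial frame
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)  # one 128-D descriptor per keypoint
print(len(keypoints), descriptors.shape)
```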
By comparing the Euclidean distances of the descriptors, we get an initial set of corresponding keypoints (Fig. 2(a)). The feature descriptors are invariant to translation, rotation, and scaling. However, they are only partially affine-invariant, so the initial matched pairs often contain outliers in our case. We prune the outliers
by fitting the candidate correspondences to a perspective motion model using RANSAC [13] iterations. Specifically, we randomly choose 4 pairs of matched points in each iteration and compute an initial projective matrix, then use the formula below to check whether the matrix fits the remaining points:
$$\left\| \begin{pmatrix} x'_n \\ y'_n \\ 1 \end{pmatrix} - H \begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} \right\| < \theta. \tag{7}$$
Here $H$ is the initial projective matrix and $\theta$ is the outlier threshold. In order to better tolerate parallax, a loose inlier threshold is used. The matrix consistent with the most initial matched pairs is taken as the best initial matrix, and the pairs that fit it are considered correct matches (Fig. 2(b)).
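A minimal sketch of this RANSAC loop is given below, assuming pts_warp and pts_base are (N, 2) float32 arrays of matched coordinates; the projected points are normalized by their third homogeneous component before the test of Eq. (7) is applied, and degenerate 4-point samples are not handled for brevity. cv2.findHomography with the cv2.RANSAC flag provides equivalent functionality in a single call.

```python
import numpy as np
import cv2

def ransac_homography(pts_warp, pts_base, n_iters=500, theta=3.0):
    # pts_warp, pts_base: (N, 2) float32 arrays of matched keypoint coordinates.
    n = len(pts_base)
    src_h = np.hstack([pts_warp, np.ones((n, 1), np.float32)])  # homogeneous (x, y, 1)
    best_H, best_inliers = None, np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        idx = np.random.choice(n, 4, replace=False)              # 4 random matched pairs
        H = cv2.getPerspectiveTransform(pts_warp[idx], pts_base[idx])
        proj = (H @ src_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]                        # normalize homogeneous scale
        err = np.linalg.norm(proj - pts_base, axis=1)
        inliers = err < theta                                    # inlier test, Eq. (7)
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```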
Fig. 2. (a) Two frames with large misregistration and the initial matched features between them. Note that there are mismatched pairs besides the correct ones. (b) Correct matches after RANSAC filtering.
After purifying the matched pairs, the final perspective matrix $H$ is estimated using a least-squares method. In detail, we construct the error function below and minimize the sum of the squared distances between the coordinates of the corresponding features:

$$F_{error} = \sum_{n=1}^{N} \left\| H u_{warp,n} - u_{base,n} \right\|^2 = \sum_{n=1}^{N} \left\| u'_{warp,n} - u_{base,n} \right\|^2 \tag{8}$$

where $u_{base,n}$ is the homogeneous coordinate of the $n$th feature in the image being projected onto, and $u_{warp,n}$ is the correspondence of $u_{base,n}$ in the other view.
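Continuing the earlier sketch (and reusing its pts_warp, pts_base, and best_inliers), this refinement can be realized, for example, by re-fitting over the inlier correspondences only; passing method 0 to cv2.findHomography performs a least-squares fit in the spirit of Eq. (8). This is an assumption-laden sketch, not the authors' code.

```python
# Re-estimate H over the RANSAC inliers only; method=0 is the plain least-squares fit.
H_refined, _ = cv2.findHomography(pts_warp[best_inliers], pts_base[best_inliers], 0)
```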
5 Real-Time Stage
After obtaining the perspective matrix between the adjacent webcams, we project the frames of one webcam onto the other and blend them in real time. Since the webcams are placed relatively freely, they may not share a common center of projection and are therefore likely to produce parallax. In other words, the frames of the different webcams cannot be registered exactly. We therefore design a nonlinear blending strategy to minimize the ghosting and blurring in the overlapped region. Essentially, this is a kind of alpha blending. The synthesized frames $F_{syn}$ can be expressed as follows:

$$F_{syn}(x, y) = \alpha(x, y) \cdot F_{base}(x, y) + \big(1 - \alpha(x, y)\big) \cdot F_{proj}(x, y) \tag{9}$$

where $F_{base}$ are the frames of the base webcam, $F_{proj}$ are the frames projected from the adjacent webcam, and $\alpha(x, y)$ is the weight at pixel $(x, y)$.
In the conventional blending method, the weight of a pixel is a linear function of its distance to the image boundaries. This method treats the different views equally and performs well in normal cases. However, in the case of severe parallax, the linear combination results in blurring and ghosting over the whole overlapped region, as is the case in Fig. 3(b). We therefore use a special $\alpha$ function that gives priority to one view to avoid the conflict in the overlapped region of the two webcams. Simply put, we construct a nonlinear $\alpha$ mask as follows:

$$\alpha(x, y) = \begin{cases} 1, & \text{if } \min(x,\, y,\, W - x,\, H - y) > T \\[4pt] \dfrac{\sin\!\big(\pi \cdot (\min(x,\, y,\, W - x,\, H - y)/T - 0.5)\big) + 1}{2}, & \text{otherwise} \end{cases} \tag{10}$$
where $W$ and $H$ are the width and height of the frame, and $T$ is the width of the nonlinearly decreasing border. The mask is registered with the frames and clipped according to the region to be blended. The $\alpha$ value stays constant in the central part of the base frame and begins to drop sharply once a pixel comes close enough to the boundary of the other layer. The width of this transition is controlled by $T$: the larger $T$ is, the smoother and more natural the transition between frames, but the smaller the clear central region, and vice versa. We refer to this method as nonlinear mask blending. By this nonlinear synthesis, we keep a balance between the smooth transition at the boundaries and the uniqueness and clarity of the interior.
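A minimal sketch of Eqs. (9) and (10) is given below, assuming float32 frames that have already been registered on the same canvas; clipping the mask to the actual overlap region is omitted for brevity, and the helper names are ours rather than the paper's.

```python
import numpy as np

def nonlinear_alpha_mask(width, height, T):
    # Eq. (10): weight 1 in the interior, dropping nonlinearly within a border of width T.
    xv, yv = np.meshgrid(np.arange(width, dtype=np.float32),
                         np.arange(height, dtype=np.float32))
    d = np.minimum.reduce([xv, yv, width - 1 - xv, height - 1 - yv])  # distance to nearest border
    alpha = (np.sin(np.pi * (d / T - 0.5)) + 1.0) / 2.0
    alpha[d > T] = 1.0
    return alpha

def blend_frames(frame_base, frame_proj, alpha):
    # Eq. (9): per-pixel alpha blending of the base frame and the projected frame.
    a = alpha[..., None]              # broadcast the mask over the colour channels
    return a * frame_base + (1.0 - a) * frame_proj
```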
(a) A typical pair of scenes with strong parallax. (b) Linear blending. (c) Our blending.
Fig. 3. Comparison between linear blending and our blending strategy on typical scenes with
severe parallax
6 Results
In this section, we show the results of our method on different scenes. We built a prototype with two common webcams, as shown in Fig. 4. The webcams are placed together on a simple support, and the lenses are flexible and can be rotated and directed to different orientations freely. Each webcam has a resolution of QVGA (320 × 240 pixels) and a frame rate of 30 FPS.
Fig. 4. Two common webcams fixed on a simple support. The lenses are flexible and can be
rotated and adjusted to different orientations freely.
Table 1. Processing time of the main procedures of the system

Stage            Procedure                  Time (seconds)
Initialization   Feature detection          0.320 ~ 0.450
                 Feature matching           0.040 ~ 0.050
                 RANSAC filtering           0.000 ~ 0.015
                 Matrix computation         0.000 ~ 0.001
Real time        Projection and blending    0.000 ~ 0.020
The processing time of the main procedures is listed in Table 1. The system runs on a PC with an E4500 2.2 GHz CPU and 2 GB of memory. The initialization stage usually takes about 0.7~1 second, depending on the content of the scene. Projection and blending usually take less than 0.02 seconds for a pair of frames and thus run in real time. Note that whenever the webcams are moved or changed, the initialization stage must be run again to re-compute the geometric relationship between the webcams. Currently, this re-initialization is started by the user. After initialization, the system can process the video at a rate of 30 FPS.
In our system, the positions and directions of the webcams are adjustable as long as they share some overlapped field-of-view. Typically, the overlapped region should be at least 20% of the original view, otherwise there may not be enough robust features to match between the webcams. Fig. 5 shows the stitching results for some typical frames. In these cases, the webcams were intentionally rotated to a certain angle or even turned upside down. As can be seen in the figures, the system can still register and blend the frames into a natural whole scene. Fig. 6 shows some typical stitched scenes from a real-time video. In (a), two static indoor views are stitched into a wide view. In (b) and (c), moving objects show up in the scene, either far away or close to the lens. As illustrated in the figures, the stitched views are as clear and natural as the original narrow views.
(a) A pair of frames with 15° rotation and the stitching result
(b) A pair of frames with 90° rotation and the stitching result
(c) A pair of frames with 180° rotation and the stitching result
Fig. 5. Stitching frames of some typical scenes. The webcams are intentionally rotated to a
certain angle or turned upside down.
(a) A static scene
(b) A far away object moving in the scene
(c) A close object moving in the scene
Fig. 6. Stitching results from a real-time video. Moving objects in the stitched scene are as clear as in the original narrow view.
Although our system is flexible and robust under normal conditions, the quality of the mosaiced video drops severely in two cases: first, when the scene lacks salient features, as in the case of a white wall, the geometric relationship between the webcams cannot be estimated correctly; second, when the parallax is too strong, there may be noticeable stitching traces at the frame border. These problems can be avoided by aiming the lenses at scenes with salient features and adjusting the orientation of the webcams.
7 Conclusions and Future Work
In this paper, we have presented a technique for stitching videos from webcams. The system receives frame sequences from common webcams and outputs a synthesized video with a wide field-of-view in real time. The positions and directions of the webcams are flexible as long as they share some overlapped field-of-view. There are two stages in the system. The initialization stage calculates the geometric relationship between frames from adjacent webcams. A nonlinear mask blending method, which avoids ghosting and blurring in the main part of the overlapped region, is proposed for synthesizing the frames in real time. As illustrated by the experimental results, this is an effective and inexpensive way to construct video with a wide field-of-view.
Currently, we have focused on using only two webcams. As a natural extension of our work, we would like to scale the system up to more webcams. We also plan to explore the hard and interesting issues of eliminating the exposure differences between webcams in real time and solving the problems mentioned at the end of the last section.
Acknowledgment
The financial support provided by the National Natural Science Foundation of China (Project ID: 60772032) and Microsoft (China) Co., Ltd. is gratefully acknowledged.
References
1. Szeliski, R., Shum, H.Y.: Creating Full View Panoramic Mosaics and Environment Maps. In: Proc. of SIGGRAPH 1997, Computer Graphics Proceedings, Annual Conference Series, pp. 251–258 (1997)
2. Agarwala, A., Agrawala, M., Chen, M., Salesin, D., Szeliski, R.: Photographing Long
Scenes with Multi-Viewpoint Panoramas. In: Proc. of SIGGRAPH, pp. 853–861 (2006)
3. Zheng, J.Y.: Digital route panoramas. IEEE MultiMedia 10(3), 57–67 (2003)
4. Hsu, C.-T., Cheng, T.-H., Beukers, R.A., Horng, J.-K.: Feature-based Video Mosaic. Image Processing, pp. 887–890 (2000)
5. Kang, S.B., Szeliski, R., Uyttendaele, M.: Seamless Stitching Using Multi-Perspective
Plane Sweep. Microsoft Research, Tech. Rep. MSR-TR-2004-48 (2004)
6. Zelnik-Manor, L., Peters, G., Perona, P.: Squaring the Circle in Panoramas. In: Proc. 10th
IEEE Conf. on Computer Vision (ICCV 2005), pp. 1292–1299 (2005)
7. Majumder, A., Gopi, M., Seales, W.B., Fuchs, H.: Immersive Teleconferencing: A New Algorithm to Generate Seamless Panoramic Imagery. In: Proc. of ACM Multimedia, pp. 169–178 (1999)
8. Szeliski, R.: Video Mosaics for Virtual Environments. IEEE Computer Graphics and Applications, 22–23 (1996)
9. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–698 (1986)
10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. of the 4th
Alvey Vision Conference, pp. 147–151 (1988)
11. Lowe, D.G.: Distinctive Image Features From Scale-invariant Keypoints. International
Journal of Computer Vision 60(2), 91–110 (2004)
12. Winder, S., Brown, M.: Learning Local Image Descriptors. In: Proc. of the International
Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8 (2007)
13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall, Englewood Cliffs (2003)