Mosaicing Video Sequences
Arnon Netzer
Craig Gotsman
Computer Science Dept.
Technion - Israel Institute of Technology
Haifa 32000, Israel
Abstract
With the advent of cheap, but relatively low-resolution, video sensors, the importance of automatically mosaicing
many small video images to one large image has increased. This is due mainly to the many useful applications that
may be based on such technology, such as hand-held mobile scanning, multi-resolution imaging, panoramic
spreads and video compression. The process of creating a mosaiced image consists of first finding the geometric
registration of each small image from the sequence to a global image plane, and then combining all the registered
images into one smooth, pleasing image. We review two existing methods for finding the registration: sequential
and canvas mosaicing, and discuss the advantages and disadvantages of each of those algorithms.
After suggesting two methods to improve the reliability and the accuracy of the basic sequential algorithm, we
present our algorithm, which combines and enhances the sequential and canvas methods. We demonstrate how our
algorithm overcomes most of the pitfalls of each of the two simpler algorithms. Finally, we present a novel approach to combining the registered images utilizing both geometrical and content information in order to improve
the visual quality of the mosaic.
Keywords: Image mosaicing, image registration.
Contact Author:
Craig Gotsman
Computer Science Dept.
Technion – Israel Institute of Technology
Haifa 32000, Israel
Email: gotsman@cs.technion.ac.il
Phone: +972-4-8294336
Fax: +972-4-8294353
1. Introduction
With the advent of cheap, but relatively low-resolution, video sensors, the importance of
automatically mosaicing many small video images to one large image has increased. This is
due mainly to the many useful applications that may be based on such technology, such as
hand-held mobile scanning [1], multi-resolution imaging [2,3,17], creating high resolution
stills from videos [4], panoramic spreads [5,6,13] and video compression [7]. In the general
case, mosaicing two arbitrary images of a three dimensional scene is not simple. Green and
Heckbert [8] showed how to do this when the camera movement is given. Several works
explore finding this 3D camera movement from the images themselves [14,15]. Jaillon and
Montavert [16] show how to register when knowledge of the three dimensional structure of
the scene is available.
Since a true 3D solution is difficult, in many cases a 2D registration between pixels is
sought. Finding this may be achieved by optical flow methods [18,23], but this does not produce registrations accurate enough for video mosaicing. Aiger and Cohen [19] used an iterative algorithm to improve the registration accuracy of the optical flow solution. Such solutions are common in three dimensional imaging for medical applications, where the quality of
the source images is poor to begin with and the accuracy of the registration is less important.
Peleg and Herman [9] suggested registering each image in the sequence to its predecessor
without reconstructing depths or transforming the images to the same plane. This can produce
in some cases a pleasing result, but usually results in a distorted image of the scene.
In the cases where the registration between the images can be described as one global 2D
projective transformation, the problem can be solved much more accurately. Among those
cases are all the scenes consisting of planar objects (e.g. documents, black boards, satellite
pictures etc.) or panoramic spreads.
Mann [10] showed how to mosaic two images taken from a planar scene using a transformation with eight parameters. Most works to date deal with the mosaicing of a small number of
large images. In this paper we consider the case of a large number of small images, a scenario
in which accumulated error may be significant, and an arbitrary camera trajectory may result
in overlaps between arbitrary images of the sequence.
There are two basic approaches for generalizing a two-image registration algorithm to a
longer sequence of images. One is sequential mosaicing [11], in which each image is registered to the one before it in the sequence. Those registrations are then accumulated to produce a transformation between each image and the global image plane. The second method,
called canvas mosaicing [12], registers each image directly to the canvas - the “big” image
which is gradually being constructed from the smaller images on the global image plane. The
sequential method encounters several difficulties. If a single image in the sequence is corrupted, or lacks sufficient information for a successful registration, the sequence is broken. In
this case, the best result possible is an estimate based on previous trends. Furthermore, if a
single registration is inaccurate, the error is propagated through the rest of the sequence. This
problem is amplified when the family of allowed transformations is rich. Other problems
arise from the accuracy of the registration algorithm. Since registration involves solving a
non-linear optimization problem, even sophisticated algorithms such as the Levenberg-Marquardt procedure [22] are susceptible to the classical local minima and plateau pitfalls.
Moreover, in a situation in which the image sequence is the result of a simple sensor translation, numerical errors may disguise it as a more complex transformation (e.g. scaling),
confusing the algorithm.
The advantage of the canvas method is that each new image is registered in a way that produces the best “global” fit, thus resulting in an image which is smoothest to the eye. However, since errors may have been accumulated in the registration of previous images, it might
become impossible, after a while, to register properly a new image which overlaps a region
previously imaged to the canvas. This is because that canvas area may no longer be a clean
consistent image, rather a mosaic of many (possibly incorrectly) registered images. Encountering such a scenario can mislead the algorithm totally.
Experience shows that neither of the two mosaicing methods works very well in practice. In
order to devise a more robust algorithm, we enhance each of the methods and then combine
the two, thus minimizing their respective difficulties and capitalizing on their individual
strengths. We then suggest ways for combining the registered images into a “big” image in a
visually pleasing manner.
The rest of this paper is organized as follows: Section 2 describes the basic registration procedure for two images. Section 3 describes the problems encountered by these simple methods when extending them to image sequences, and our solutions to these problems. In Section
4 we propose a mosaicing method based on the image content to minimize seams. We conclude in Section 5.
2. Registering Two Images
When registering two images I and I’ taken from a scene containing planar objects, the registration may be approximated well by a 3 x 3 projective transform matrix:
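$$
\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} =
\begin{pmatrix} m_0 & m_1 & m_2 \\ m_3 & m_4 & m_5 \\ m_6 & m_7 & m_8 \end{pmatrix}
\begin{pmatrix} x \\ y \\ w \end{pmatrix}
$$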
Since the projective transform is unique up to a scaling factor, the matrix can be reduced to
an eight parameter transformation:
$$
x'(x_i, y_i) = \frac{m_0 x_i + m_1 y_i + m_2}{m_6 x_i + m_7 y_i + 1}, \qquad
y'(x_i, y_i) = \frac{m_3 x_i + m_4 y_i + m_5}{m_6 x_i + m_7 y_i + 1}
\tag{2.1}
$$
The transformation parameter vector m is found by minimizing the intensity error function
between the two images:
$$
E(I, I') = \sum_i e_i^2 = \sum_i \left[ I'(x_i', y_i') - I(x_i, y_i) \right]^2 \tag{2.2}
$$
where $x_i', y_i'$ are as in (2.1). This is based on the assumption that the minimum is
achieved when the images are best registered [18].
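As a minimal sketch of Eqs. (2.1) and (2.2), the following Python code maps pixel coordinates through the eight-parameter transformation and sums the squared intensity differences. The names `project` and `intensity_error`, the nearest-neighbour sampling, and the decision to skip pixels that fall outside the second image are assumptions made for illustration; the text does not specify them.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # row-major grayscale image: img[y][x]


def project(m: Sequence[float], x: float, y: float) -> Tuple[float, float]:
    """Map (x, y) through the eight-parameter transform of Eq. (2.1)."""
    d = m[6] * x + m[7] * y + 1.0
    return ((m[0] * x + m[1] * y + m[2]) / d,
            (m[3] * x + m[4] * y + m[5]) / d)


def intensity_error(img: Image, img_p: Image, m: Sequence[float]) -> float:
    """Sum of squared intensity differences, as in Eq. (2.2).

    Assumptions (not from the paper): nearest-neighbour sampling, and
    pixels that map outside img_p are skipped.
    """
    h_p, w_p = len(img_p), len(img_p[0])
    total = 0.0
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            xp, yp = project(m, x, y)
            xi, yi = round(xp), round(yp)
            if 0 <= xi < w_p and 0 <= yi < h_p:
                e = img_p[yi][xi] - value
                total += e * e
    return total


# Usage: img_p is img shifted one pixel to the right.
img = [[0, 1, 2, 3], [4, 5, 6, 7]]
img_p = [[9, 0, 1, 2], [9, 4, 5, 6]]
shift_right = [1, 0, 1, 0, 1, 0, 0, 0]   # m_2 = 1: translation by one pixel in x
identity = [1, 0, 0, 0, 1, 0, 0, 0]
print(intensity_error(img, img_p, shift_right), intensity_error(img, img_p, identity))
```

In this toy case the correct translation gives zero error while the identity does not, which is the property the minimization relies on.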
The error function can be minimized using the Levenberg-Marquardt method [22]. To use
this method, derivatives of $e_i$ are calculated for each of the projective matrix parameters
$m_0 \ldots m_7$. From these derivatives a weighted gradient vector $b$ is calculated, as well as an
approximation of the Hessian matrix $A$. In order to avoid calculating second order derivatives, the product of first order derivatives is used as an approximation of $A$.
At each iteration, the projective matrix parameters are updated by $\Delta m = (A + \lambda I)^{-1} b$, with
$\lambda$ decreasing near the minimum, giving more weight to $A$. The advantage of the Levenberg-Marquardt
method is its ability to advance in the gradient direction when far
from the minimum, and to take the curvature into account when approaching the minimum. Like
all non-linear minimization techniques, this method converges to a local minimum, hence
must be initialized with a “reasonable” initial guess. It is common to use a transformation
consisting of translation only as this initial guess (only $m_2$ and $m_5$ differ from the identity transformation), implicitly
assuming that the translation parameters in the transformation matrix are significantly larger
than the others. This translation may be found efficiently using the multi-pyramid method
[23].
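The update rule can be illustrated with a small, self-contained sketch. It is not the registration code itself: the residuals come from a toy line-fitting problem rather than image intensities, the sign convention for $b$ (the negated sum of $e_i \, \partial e_i / \partial m_k$) is an assumption, and the schedule that divides or multiplies $\lambda$ by ten is one common choice rather than something stated in the text. The helper names `solve`, `lm_step` and `fit_line` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def lm_step(e: Vector, jac: Matrix, lam: float) -> Vector:
    """Return delta_m = (A + lam * I)^-1 b.

    A[k][l] = sum_i de_i/dm_k * de_i/dm_l  (first-order Hessian approximation)
    b[k]    = -sum_i e_i * de_i/dm_k       (sign convention assumed here)
    """
    n = len(jac[0])
    a = [[sum(row[k] * row[l] for row in jac) for l in range(n)] for k in range(n)]
    b = [-sum(ei * row[k] for ei, row in zip(e, jac)) for k in range(n)]
    for k in range(n):
        a[k][k] += lam
    return solve(a, b)


def fit_line(xs: Sequence[float], ys: Sequence[float], iterations: int = 30) -> Vector:
    """Toy problem: fit y = m0 * x + m1, with residual e_i = m0 * x_i + m1 - y_i."""
    def residuals(m: Vector) -> Vector:
        return [m[0] * x + m[1] - y for x, y in zip(xs, ys)]

    def sq_error(m: Vector) -> float:
        return sum(e * e for e in residuals(m))

    jac = [[float(x), 1.0] for x in xs]  # de_i/dm_0 = x_i, de_i/dm_1 = 1
    m, lam = [0.0, 0.0], 1.0
    for _ in range(iterations):
        delta = lm_step(residuals(m), jac, lam)
        trial = [mk + dk for mk, dk in zip(m, delta)]
        if sq_error(trial) < sq_error(m):
            m, lam = trial, lam * 0.1   # accept: trust the curvature more
        else:
            lam *= 10.0                 # reject: fall back towards gradient descent
    return m
```

For image registration the same `lm_step` would be fed the per-pixel residuals and their derivatives with respect to $m_0 \ldots m_7$.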
3. Registering an Image Sequence
This section elaborates on how the basic two-image registration method is extended to handle
a long image sequence.
3.1. The “Sequential” Algorithm
In the sequential extension of the basic two-image registration procedure, each image $I_i$ is
registered to its predecessor $I_{i-1}$ only. Denote this transformation by $T_i$. These are accumulated
to produce the transformation of each image to the canvas:
$TC_0 = I$ (the identity) and $TC_i = TC_{i-1} * T_i$ for $i \ge 1$ (see Figure 3.1).
Figure 3.1: The sequential algorithm: Each image $I_i$ is registered to its predecessor only. The registrations
are accumulated to produce the transformation of each image to the canvas $TC_i = TC_{i-1} * T_i$.
The drawback of this method is that in cases when an inaccuracy is introduced during the
registration process, this inaccuracy or “noise” affects not only the current registration, but
all the following ones (see Figure 3.4(a)). Furthermore, if one registration fails due to a “bad”
image or an image that lacks sufficient information for a successful registration, the sequence is
broken. In this case, the best result that can be hoped for is an estimate based on previous
trends.
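The accumulation step of the sequential algorithm can be sketched as a running product of 3 x 3 matrices. The function names and the list-of-lists matrix representation below are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]

IDENTITY: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3 x 3 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def translation(dx: float, dy: float) -> Matrix:
    """Homogeneous matrix of a pure translation."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def accumulate(pairwise: List[Matrix]) -> List[Matrix]:
    """Given [T_1, T_2, ...], return [TC_0, TC_1, ...] with TC_i = TC_{i-1} * T_i."""
    canvas = [IDENTITY]
    for t in pairwise:
        canvas.append(matmul(canvas[-1], t))
    return canvas


# Usage: three successive five-pixel translations place the last image 15 pixels away.
tcs = accumulate([translation(5.0, 0.0)] * 3)
print(tcs[-1][0][2])
```

Because every $TC_i$ contains all earlier $T_j$ as factors, an error in any single pairwise registration is carried into every later canvas transformation.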
3.2. The “Window” Algorithm
To improve the sequential algorithm, we propose to broaden the base of the registration.
Since the source images originate in a video sequence, in all probability each image overlaps
more than one predecessor. At any given time $i$, consider a window of $2n+1$ images, where
$n$ is the history “depth”. The window is centered on image $I_i$, so the window contains images $I_{i-n} \ldots I_i \ldots I_{i+n}$. When treating image $I_i$, the registrations of $I_{i-n} \ldots I_{i-1}$ to the
canvas have already been determined. For each new image $I_{i+n}$ entering the window, $n$ registrations $T_{i+n}^{j}$ are calculated to the images $I_j$, $j = i, \ldots, i+n-1$ (see Figure 3.2).
Figure 3.2: The window algorithm: A window of $2n+1$ images for $n = 3$. For the images $I_{i-n} \ldots I_{i-1}$ (the
dotted squares) the registration has already been determined. For each new image $I_{i+n}$ entering the window, $n$
registrations are calculated.
In addition, for each such transformation an error measure $E$ is calculated - the mean
of the squared intensity differences between the two images:
$$
E_i^j = \frac{1}{m} \sum_{k=1}^{m} \left[ I_i(x_k, y_k) - I_j\!\left(T_i^j(x_k, y_k)\right) \right]^2 \tag{3.1}
$$
where $k$ iterates through all the $m$ pixels in the overlap of the two images, and the transformed coordinates are
calculated using (2.1). Each registration with an error measure $E$ greater than a given
threshold is considered to be a failure.
Once all the registrations for the image $I_{i+n}$ against each of the images $I_i \ldots I_{i+n-1}$ have been
computed, transformations to the canvas $TC_i \ldots TC_{i+n}$ are calculated. Each of these is calculated by weighting all the relevant registrations, namely, all the registrations whose $E$ is under
a fixed threshold (see Figure 3.3):
$$
TC_i = \frac{\displaystyle\sum_{j=i-n}^{i+n} \frac{TC_j * T_i^j}{E_i^j}}{\displaystyle\sum_{j=i-n}^{i+n} \frac{1}{E_i^j}} \tag{3.2}
$$
Note that $T_i^j = (T_j^i)^{-1}$, namely that a transformation to the future is the inverse of the
transformation to the past of the same images, hence there is no need to calculate it again. At
this point we have all the possible relevant registrations for the image $I_i$, and a new image
entering the window will not affect the transformation $TC_i$. The image $I_i$ may then be
merged into the canvas and the window advanced by one image.
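A minimal sketch of the weighting in Eq. (3.2) follows. It assumes that each candidate matrix is the already-computed product $TC_j * T_i^j$, that the average is taken element by element, and that a small `eps` guards against division by a zero error; these choices, and the function name, are illustrative and not taken from the text.

```python
from typing import List, Optional

Matrix = List[List[float]]


def weighted_canvas_transform(candidates: List[Matrix],
                              errors: List[float],
                              threshold: float,
                              eps: float = 1e-12) -> Optional[Matrix]:
    """Inverse-error weighted average of candidate canvas transformations.

    candidates[j] stands for TC_j * T_i^j and errors[j] for E_i^j.
    Registrations whose error exceeds the threshold count as failures and
    are ignored. Returns None if every registration failed.
    """
    num = [[0.0] * 3 for _ in range(3)]
    den = 0.0
    for mat, err in zip(candidates, errors):
        if err > threshold:
            continue  # failed registration
        w = 1.0 / max(err, eps)  # eps is an assumption to avoid dividing by zero
        den += w
        for r in range(3):
            for c in range(3):
                num[r][c] += w * mat[r][c]
    if den == 0.0:
        return None
    return [[v / den for v in row] for row in num]


def translation(dx: float, dy: float) -> Matrix:
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


# Usage: two consistent estimates and one failed registration that is ignored.
tc = weighted_canvas_transform(
    [translation(10.0, 0.0), translation(12.0, 0.0), translation(90.0, 40.0)],
    [1.0, 3.0, 500.0],
    threshold=50.0,
)
```

With weights 1 and 1/3, the combined x translation in this toy case is (10 + 4) / (4/3) = 10.5, closer to the estimate with the smaller error.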
Figure 3.3: When $n = 3$, the relevant registrations for the image $I_{i+1}$ are three transformations to the past and
two to the future.
3.2.1. Experimental Results
The window algorithm proved to be especially effective in its ability to “skip” over “bad”
images. When one image in the sequence cannot be registered at all, this image is not merged
into the canvas, and the next image relies on a registration deeper into the history for calculating its transformation. This leads to an overall improvement in the accuracy of the mosaic
(see Figure 3.4).
Figure 3.4: The window mosaicing algorithm applied to a sequence of 200 images (70x100 pixels each). The
order of the sequence is from the top-left clockwise. Towards the end of the sequence the images return to an area
in the canvas containing earlier information. (a) 1-image window. Notice the error accumulated during the mosaicing process. (b) 3-image window. The improvement in the registration accuracy relative to (a) is evident.
Experiments with windows of different sizes showed a significant improvement in the overall
registration accuracy as n was increased from 1 to 3, and almost no improvement at all when
increased beyond 5. This supports the theory that part of the inaccuracy in the algorithm is
white noise introduced due to numerical errors, so when averaged it converges rapidly to
zero. Hence the gain is significant for small $n$ and marginal for larger $n$. The drawback of
the window algorithm is the increase in time complexity. The computation time is n times
that of the basic sequential algorithm.
3.3. The “Multi-Stage” Algorithm
One of the problems encountered when using a non-linear optimization procedure is the plateau problem. In the vicinity of the minimum the derivatives become small, and inaccuracies
may be introduced. Since the projective transformation has eight parameters, each of them
can have an error of $\Delta m_i$ introduced to it due to inaccuracies in the minimization procedure.
This error may not be significant when mosaicing two images, but it may become crucial
when mosaicing long sequences of images. Furthermore, the long-term consequence of error
is different for each parameter. While a small error in translation stays constant regardless of
the number of registrations following it (it may even average to zero assuming such errors
have the characteristics of white noise), an error in rotation is amplified by the number of
images following (see Figure 3.5).
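The difference between the two kinds of error can be checked numerically. The sketch below assumes a sequence whose true pairwise transformation is always a five-pixel translation, injects an error into the first registration only, and measures how far the origin of each later image lands from its true canvas position. The error magnitudes (0.5 pixel, 0.01 radian) and the function names are arbitrary choices for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def translation(dx: float, dy: float) -> Matrix:
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def rotation(theta: float) -> Matrix:
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def drift(first: Matrix, steps: int) -> List[float]:
    """Distance of each image origin from its true canvas position.

    `first` is the (possibly erroneous) first registration T_1; all later
    registrations are the exact five-pixel translation.
    """
    true_step = translation(5.0, 0.0)
    tc = first
    out = []
    for k in range(1, steps + 1):
        x, y = tc[0][2] / tc[2][2], tc[1][2] / tc[2][2]
        out.append(math.hypot(x - 5.0 * k, y))  # true origin of image k is (5k, 0)
        tc = matmul(tc, true_step)
    return out


translation_drift = drift(translation(5.5, 0.0), 10)
rotation_drift = drift(matmul(translation(5.0, 0.0), rotation(0.01)), 10)
print(translation_drift[-1], rotation_drift[-1])
```

The translation error stays at half a pixel for every image, while the displacement caused by the rotation error grows linearly with the number of images that follow.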
Another related problem is the “disguise” problem. Sometimes two images may be registered in more than one way. Consider an image containing a horizontal line starting at the left
end and continuing for one hundred pixels, and another image containing the same line, only
150 pixels in length. There is no way to know if the “correct” registration between the two is
translation of fifty pixels to the left, or scaling along the x axis (see Figure 3.6).
The problems described above can result in a situation in which even though the input image
sequence is the result of a simple sensor translation, numerical errors and disguise issues will
mislead the algorithm to yield a more complex transformation. This results in an erroneous
output even for simple inputs.
To deal with these problems, we suggest a multi-stage algorithm for computing the
transformation. We categorize the possible transformations into five families with
decreasing priorities: Translation, Rigid, Similarity, Affine and Projective.
Figure 3.5: The effect of an error in transformation parameter space. (a) Translation parameter error. The error is
propagated to all the following images. The distance between each image and its correct position remains constant. (b) Rotation parameter error. The error is propagated to all the following images. The distance between
each image and its correct position increases with time.
Figure 3.6: The “disguise” problem. There is no way of knowing whether the “correct” registration transformation
between (a) and (b) is translation or scaling along the x axis.
Given an image $I$ to be registered to an image $I'$, five transformations are calculated, each
under the constraints of its respective family. For each of these transformations, an error
measure ($E_{Translation}$, $E_{Rigid}$, $E_{Similarity}$, $E_{Affine}$ and $E_{Projective}$) is calculated using
(3.1). $E_{final}$ is calculated as in (3.3), and the transformation corresponding to $E_{final}$ is chosen.
$$
E_{final} = \min(\min(\min(\min(E_{Translation},\, E_{Rigid} * c),\, E_{Similarity} * c),\, E_{Affine} * c),\, E_{Projective} * c) \tag{3.3}
$$
When $c$ is a constant larger than one, a higher level family will be chosen only if it yields a
significant improvement in the error measure.
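The selection rule can be sketched as follows. This is a literal reading of Eq. (3.3): the translation error is compared unpenalized, and the error of every other family is multiplied by $c$ before the comparison. The dictionary keys, the function name and the default $c = 1.08$ are assumptions for illustration; the per-family errors would come from five separate constrained registrations.

```python
from typing import Dict

FAMILIES = ("translation", "rigid", "similarity", "affine", "projective")


def choose_family(errors: Dict[str, float], c: float = 1.08) -> str:
    """Return the family whose (penalized) error attains E_final in Eq. (3.3)."""
    best_name = FAMILIES[0]
    best = errors[best_name]          # E_Translation enters unpenalized
    for name in FAMILIES[1:]:
        penalised = errors[name] * c  # higher families pay the factor c
        if penalised < best:
            best, best_name = penalised, name
    return best_name


# Usage: richer families fit only marginally better, so translation is kept.
nearly_equal = {"translation": 10.0, "rigid": 9.8, "similarity": 9.7,
                "affine": 9.6, "projective": 9.5}
# Here the rigid family gives a clear improvement and is selected.
rigid_wins = {"translation": 10.0, "rigid": 6.0, "similarity": 6.5,
              "affine": 7.0, "projective": 6.2}
print(choose_family(nearly_equal), choose_family(rigid_wins))
```

Note that under this reading the factor $c$ separates translation from the other four families, but does not rank those four against each other.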
3.3.1. Experimental Results
The multi-stage algorithm showed vast improvement in the cases when the correct transformation belonged to a simple family such as translation only (see Figure 3.7). On the other
hand we retained the ability to handle transformation from more complex families (see Figure
3.8).
Figure 3.7: The multi-stage algorithm applied to a sequence of fifty 70x100 pixel images, where the actual transformation between each two successive images is approximately a five pixel translation. (a) Single-stage algorithm
results. Note the shear that has crept in. (b) Multi-stage algorithm results.
Figure 3.8: The multi-stage algorithm applied to a “general” input - a sequence of forty 70x100 pixel images,
where the actual transformation between each two successive images contains translation, rotation, shear and perspective elements.
The weakness of this algorithm is in the cases where there is no significant difference between the error measures of transformations from different families, but the correct transformation is indeed from a higher family. In these cases, it is very important to choose the right
factor $c$. If $c$ is too big, it will not permit choosing from a higher family even when it is necessary. On the other hand, too small a factor will not filter out the noise. Experimenting with
different values of $c$ showed best results when $1.05 < c < 1.10$. As before, the drawback of
this algorithm is the increase in complexity, which is basically multiplied by the number of
families used.
3.4. The “Combined” Algorithm
As mentioned above, there are two approaches to generalizing the basic two-image registration procedure to a sequence of images. One is sequential mosaicing in which each image is
registered to its predecessor in the sequence. The other is the canvas method which registers
each image directly to the canvas.
Consider the case in which the image sequence creates a closed loop, and a new image $I_i$ is
to be registered to a place in the canvas where the image $I_{i-k}$ was mapped to in the past (see
Figure 3.9).
Figure 3.9: The image $I_i$ should be registered to a place in the canvas where the image $I_{i-k}$ was mapped in the
past.
In sequential mosaicing, the algorithm continues to use only the information from the near
history. The information from the image $I_{i-k}$ will not be taken into account. On the other
hand, canvas mosaicing might incur a deadlock if the information from the near history and
that on the canvas conflict. In this case there might be no transformation for $I_i$ consistent
with both $I_{i-1}$ and $I_{i-k}$.
Our “combined” algorithm combines sequential and canvas mosaicing, utilizing their respective advantages and overcoming their disadvantages. First we calculate for each new image
$I_i$ the transformation to the canvas $TC_i$ using the sequential algorithm. Based on this
transformation an error measure $EC_i$ is calculated, much as in (3.1), only
here the error is calculated between the image and the canvas:
$$
EC = \sum_i \left[ I(x_i, y_i) - C(x_i', y_i') \right]^2 \tag{3.4}
$$
where $i$ traverses all the pixels in $I$, $x_i', y_i'$ are calculated as in Eq. (2.1) and $C$ is the
canvas. The transformation $TC_i$ is then used as the initial guess for finding a registration
$\widetilde{TC}_i$ directly to the canvas. An error measure $\widetilde{EC}_i$ is calculated for $\widetilde{TC}_i$ too. If this shows
significant improvement over $TC_i$ ($\widetilde{EC}_i * c < EC_i$, where $c$ is a factor greater than 1) it is
adopted. If not, $TC_i$ is adopted. This method, however, does not take into account the fact
that in video sequences the overlap between consecutive images is relatively large, thus most
of the information in the canvas where $I_i$ is to be registered comes from its recent history.
Hence, the combined algorithm will show little improvement over the sequential algorithm.
This problem may be rectified by the following: At any given time $i$, two instances of the
canvas are considered, one at time $i$, denoted by $C_i$, the other at time $i - n$, denoted by
$C_{i-n}$. While the error measurements are calculated with respect to $C_i$, the canvas registration
$\widetilde{TC}_i$ is calculated with respect to $C_{i-n}$, ensuring the utilization of “older” information.
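The decision logic of the combined algorithm can be sketched as a single step that takes the registration and error routines as parameters. The callables `register_to_canvas` and `canvas_error` are hypothetical placeholders for the Levenberg-Marquardt registration and the error of Eq. (3.4); only the control flow is meant to follow the description above.

```python
from typing import Callable, TypeVar

Img = TypeVar("Img")        # image or canvas type
Transform = TypeVar("Transform")


def combined_step(image: Img,
                  tc_seq: Transform,
                  canvas_now: Img,
                  canvas_old: Img,
                  register_to_canvas: Callable[[Img, Img, Transform], Transform],
                  canvas_error: Callable[[Img, Img, Transform], float],
                  c: float = 1.1) -> Transform:
    """Choose between the sequential and the canvas-refined transformation.

    tc_seq     : TC_i obtained from the sequential algorithm
    canvas_now : C_i, used for both error measurements
    canvas_old : C_{i-n}, used for the direct canvas registration
    """
    ec_seq = canvas_error(image, canvas_now, tc_seq)
    # The sequential result serves as the initial guess for the canvas registration.
    tc_canvas = register_to_canvas(image, canvas_old, tc_seq)
    ec_canvas = canvas_error(image, canvas_now, tc_canvas)
    # Adopt the canvas registration only if it is better by the factor c.
    if ec_canvas * c < ec_seq:
        return tc_canvas
    return tc_seq
```

Passing the two canvases separately keeps the distinction between $C_i$ (where errors are measured) and $C_{i-n}$ (where the refinement is computed) explicit.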
3.4.1. Experimental Results
The combined algorithm uses the parameter $c$ to decide whether $TC_i$ or $\widetilde{TC}_i$ should be
used. A large $c$ will ensure that the combined algorithm will not “harm” the sequential result.
On the other hand, too large a $c$ will not allow using the canvas information. We found that
useful values for $c$ are $1.0 < c < 1.15$.
The parameter $n$ depends on the characteristics of the image sequences, and should be chosen in a manner that will leave $C_{i-n}$ with the relevant information. We found that values of $n$ between 5 and 25 produce good results (see Figure 3.10). Note that $n$ can be changed dynamically based on the transformation being calculated.
Figure 3.10: The combined algorithm applied to a sequence of fifty 70x100 pixel images, where the actual
transformation between each two successive images is approximately a five pixel translation. The order that
the image stream was taken is from the bottom right corner moving left, up to the top left corner, and then
moving right to the top right corner. (a) Sequential algorithm results. (b) Combined algorithm results.
4. Placing the Seams in Low Activity Regions
Our final contribution was triggered by the observation that registration errors in the mosaic
are much more visible when they occur in regions containing significant image activity. The
same error, occurring in a low activity region, may be practically invisible. It is not easy to
devise a general algorithm that “tucks” registration errors into low activity regions. However,
in some cases, a simple utilization of this principle may yield significant improvements. Such
a case is the mosaicing of video-scanned text. In this case, all that is needed is to find the gap
between the text lines, and to place the seams within that gap. Our experimental results show
that the success rate of a typical OCR algorithm increased from approximately 85% to 99%
when presented with an input generated by this improved mosaicing procedure (see Fig. 4.1).
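One way to locate such a gap is sketched below. The text does not say how the gap between text lines is found, so everything here is an assumption: activity is measured per row as the sum of absolute differences between horizontally adjacent pixels, and the seam is placed at the centre of the widest run of rows whose activity does not exceed a threshold. The names `row_activity` and `seam_row` are hypothetical.

```python
from typing import List, Optional

Image = List[List[float]]  # row-major grayscale image: img[y][x]


def row_activity(img: Image) -> List[float]:
    """Sum of absolute horizontal intensity differences for each row."""
    return [sum(abs(row[x + 1] - row[x]) for x in range(len(row) - 1))
            for row in img]


def seam_row(img: Image, threshold: float = 0.0) -> Optional[int]:
    """Centre row of the widest run of low-activity rows, or None if there is none."""
    best_start, best_len = None, 0
    start = None
    # A sentinel of infinite activity closes a run that reaches the last row.
    for y, a in enumerate(row_activity(img) + [float("inf")]):
        if a <= threshold:
            if start is None:
                start = y
        elif start is not None:
            if y - start > best_len:
                best_start, best_len = start, y - start
            start = None
    if best_start is None:
        return None
    return best_start + best_len // 2


# Usage: two "text" rows, a three-row blank gap, one text row, one blank row.
text = [0, 255, 0, 255]
blank = [255, 255, 255, 255]
page = [text, text, blank, blank, blank, text, blank]
print(seam_row(page))
```

On real scans a positive threshold would be needed, since blank paper is rarely perfectly uniform.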
5. Conclusion
The importance of video image mosaicing will continue to increase as cheaper and smaller
sensors become available. In order for this technology to be really useful, it must yield good
robust results, preferably at real-time rates. While the first requirement depends on algorithmic quality, the second will be addressed somewhat by the expected rapid increase in
computing power over the next few years. It seems, therefore, that superior, but somewhat
slow, algorithms are to be preferred over fast inferior ones. This is the reason we propose
quality algorithms, even at the price of them being somewhat complex.
Our algorithms register based on intensity information. Better results might possibly be obtained if the individual color components are considered.
6. References
[1] Toshiba, VideoBrush Corporation , http://www.videobrush.com/
[2] M. Elad and A. Feuer, "Super-resolution restoration of continuous image sequence - adaptive filtering approach" IEEE Trans. Image Proc., December 1995.
[3] M. Irani and S. Peleg “Improving resolution by image registration” GMIP(53), pp. 231-239, May 1991.
[4] S. Mann and R. Picard. “Constructing high quality stills from video” IEEE Trans. Image
Proc., pp. 13-16 November 1994.
[5] E. Chen, "QuickTime VR - An image-based approach to virtual environment navigation".
Proc. of SIGGRAPH, pp. 29-38, August 1995.
[6] L. McMillan and G. Bishop. “Plenoptic modeling: An image based rendering system”,
Proc. of SIGGRAPH, pp. 39-46, August 1995.
[7] M. Irani, P. Anandan and S. Hsu, "Mosaic based representations of video sequences and
their applications". Proc. of IEEE ICCV pp. 605-611, 1995.
[8] N. Green and P. Heckbert. “Creating raster omnimax images from multiple perspective
views using the elliptical weighted average filter” IEEE CG&A pp. 21-27, June 1986.
[9] S. Peleg and J. Herman, “Panoramic mosaic by manifold projection”, Proc. of CVPR,
June 1997, pp. 338-343.
[10] S. Mann “Composing multiple pictures of the same scene: Generalized large displacement 8-parameters motion” IS&T Cambridge, May 1993.
[11] R. Szeliski, “Video mosaics for virtual environments,” IEEE CG&A 13, pp. 22-30,
1996.
[12] US Patent No. 5649032, ”System for automatically aligning images to form a mosaic
image.” David Sarnoff Research Center, Inc., Princeton, NJ.
[13] A. Krishnan and A. Ahuja “Panoramic image acquisition” Proc. of IEEE CVPR. pp.
379-384 June 1996.
[14] M. Gleicher and A. Witkin “Through-the-lens camera control” Proc. of SIGGRAPH pp.
331-340 July 1992.
[15] M. Irani, B. Rousso, S. Peleg “Recovery of ego-motion using image stabilization” Proc.
of CVPR-94, pp. 39-45 June 1994.
[16] P. Jaillon and A. Montavert “Image mosaicing applied to three-dimensional surfaces”
Proc. of IEEE CVPR pp. 253-257 October 1994.
[17] P.J. Burt and E.H Adelson “A multiresolution spline with application to image mosaics”
ACM Transactions on Graphics 2(4):217-236, 1983.
[18] D.C. Barber “Registration of low resolution medical images” Phys. Med. Biol. 27(3), pp.
87-96 1992.
[19] D. Aiger and D. Cohen “Mosaicing ultrasonic volumes for visual simulation” Tel Aviv
University, Computer Science Dept. technical report.
[20] J. D. Foley, A. Van Dam, S. K. Feiner, and J. F. Hughes. “Computer graphics: principles
and practice”. Addison-Wesley, Reading, MA, 2nd Edition, 1990.
[21] P.S. Heckbert, "Fundamentals of texture mapping and image warping," Masters Thesis,
Dept. of EECS, UCB, Technical Report No. UCB/CSD 89/516, June 1989.
[22] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press, Cambridge, England,
second edition, 1992.
[23] L.H. Quam. “Hierarchical warp stereo.” In Image Understanding Workshop, pp. 149-155, December 1984.
Figure 4.1: Sequence of approximately 1000 images at resolution of 70x100 pixels. (a) Registration using the
multi-stage algorithm with a 1-image window. The seams of the canvas are very evident in the text lines. Applying
OCR to this resulted in recognition of only 85% of the characters. (b) Registration such that the seams are placed
between the lines. Applying OCR to this resulted in recognition of 98% of the characters.