Computer Vision - ECCV 2008
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
David Forsyth, Philip Torr, and Andrew Zisserman (Eds.)
Computer Vision – ECCV 2008
Volume Editors
David Forsyth
University of Illinois at Urbana-Champaign, Computer Science Department
3310 Siebel Hall, Urbana, IL 61801, USA
E-mail: daf@cs.uiuc.edu
Philip Torr
Oxford Brookes University, Department of Computing
Wheatley, Oxford OX33 1HX, UK
E-mail: philiptorr@brookes.ac.uk
Andrew Zisserman
University of Oxford, Department of Engineering Science
Parks Road, Oxford OX1 3PJ, UK
E-mail: az@robots.ox.ac.uk
ISSN 0302-9743
ISBN-10 3-540-88692-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-88692-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12553631 06/3180 543210
Preface
there had been a successful attempt to download all submissions shortly after
the deadline. We warned all authors that this had happened to ward off dangers
to intellectual property rights, and to minimize the chances that an attempt at
plagiarism would be successful. We were able to identify the responsible party,
discussed this matter with their institutional management, and believe we re-
solved the issue as well as we could have. Still, it is important to be aware that
no security or software system is completely safe, and papers can leak from
conference submission.
We felt the review process worked well, and recommend it to the community.
The process would not have worked without the efforts of many people. We thank
Lyndsey Pickup, who managed the software system, author queries, Area Chair
queries and general correspondence (most people associated with the conference
will have exchanged e-mails with her at some point). We thank Simon Baker,
Ramin Zabih and especially Jiří Matas for their wise advice on how to organize and run these meetings; the process we have described is largely their model from CVPR 2007. We thank Jiří Matas and Dan Večerka for extensive help with,
and support of, the software system. We thank C. J. Taylor for the 3-from-5
optimization code. We thank the reviewers for their hard work. We thank the
Area Chairs for their very hard work, and for the time and attention each gave
to reading papers, reviews and summaries, and writing summaries.
We thank the Organization Chairs Peter Sturm and Edmond Boyer, and
the General Chair, Jean Ponce, for their help and support and their sharing of
the load. Finally, we thank Nathalie Abiola, Nasser Bacha, Jacques Beigbeder,
Jerome Bertsch, Joëlle Isnard and Ludovic Ricardou of ENS for administra-
tive support during the Area Chair meeting, and Danièle Herzog and Laetitia
Libralato of INRIA Rhône-Alpes for administrative support after the meeting.
Conference Chair
Jean Ponce Ecole Normale Supérieure, France
Honorary Chair
Jan Koenderink EEMCS, Delft University of Technology,
The Netherlands
Program Chairs
David Forsyth University of Illinois, USA
Philip Torr Oxford Brookes University, UK
Andrew Zisserman University of Oxford, UK
Organization Chairs
Edmond Boyer LJK/UJF/INRIA Grenoble–Rhône-Alpes, France
Peter Sturm INRIA Grenoble–Rhône-Alpes, France
Specialized Chairs
Frédéric Jurie Workshops Université de Caen, France
Frédéric Devernay Demos INRIA Grenoble–Rhône-Alpes,
France
Edmond Boyer Video Proc. LJK/UJF/INRIA
Grenoble–Rhône-Alpes, France
James Crowley Video Proc. INPG, France
Nikos Paragios Tutorials Ecole Centrale, France
Emmanuel Prados Tutorials INRIA Grenoble–Rhône-Alpes,
France
Christophe Garcia Industrial Liaison France Telecom Research, France
Théo Papadopoulo Industrial Liaison INRIA Sophia, France
Jiří Matas Conference Software CTU Prague, Czech Republic
Dan Večerka Conference Software CTU Prague, Czech Republic
Administration
Danièle Herzog INRIA Grenoble–Rhône-Alpes, France
Laetitia Libralato INRIA Grenoble–Rhône-Alpes, France
Conference Website
Elisabeth Beaujard INRIA Grenoble–Rhône-Alpes, France
Amaël Delaunoy INRIA Grenoble–Rhône-Alpes, France
Mauricio Diaz INRIA Grenoble–Rhône-Alpes, France
Benjamin Petit INRIA Grenoble–Rhône-Alpes, France
Printed Materials
Ingrid Mattioni INRIA Grenoble–Rhône-Alpes, France
Vanessa Peregrin INRIA Grenoble–Rhône-Alpes, France
Isabelle Rey INRIA Grenoble–Rhône-Alpes, France
Area Chairs
Horst Bischof Graz University of Technology, Austria
Michael Black Brown University, USA
Andrew Blake Microsoft Research Cambridge, UK
Stefan Carlsson NADA/KTH, Sweden
Tim Cootes University of Manchester, UK
Alyosha Efros CMU, USA
Jan-Olof Eklundh KTH, Sweden
Mark Everingham University of Leeds, UK
Pedro Felzenszwalb University of Chicago, USA
Richard Hartley Australian National University, Australia
Martial Hebert CMU, USA
Aaron Hertzmann University of Toronto, Canada
Dan Huttenlocher Cornell University, USA
Michael Isard Microsoft Research Silicon Valley, USA
Aleš Leonardis University of Ljubljana, Slovenia
David Lowe University of British Columbia, Canada
Jiří Matas CTU Prague, Czech Republic
Joe Mundy Brown University, USA
David Nistér Microsoft Live Labs/Microsoft Research, USA
Tomáš Pajdla CTU Prague, Czech Republic
Patrick Pérez IRISA/INRIA Rennes, France
Marc Pollefeys ETH Zürich, Switzerland
Ian Reid University of Oxford, UK
Cordelia Schmid INRIA Grenoble–Rhône-Alpes, France
Bernt Schiele Darmstadt University of Technology, Germany
Christoph Schnörr University of Mannheim, Germany
Steve Seitz University of Washington, USA
Conference Board
Horst Bischof Graz University of Technology, Austria
Hans Burkhardt University of Freiburg, Germany
Bernard Buxton University College London, UK
Roberto Cipolla University of Cambridge, UK
Jan-Olof Eklundh Royal Institute of Technology, Sweden
Olivier Faugeras INRIA, Sophia Antipolis, France
Anders Heyden Lund University, Sweden
Aleš Leonardis University of Ljubljana, Slovenia
Bernd Neumann University of Hamburg, Germany
Mads Nielsen IT University of Copenhagen, Denmark
Tomáš Pajdla CTU Prague, Czech Republic
Giulio Sandini University of Genoa, Italy
David Vernon Trinity College, Ireland
Program Committee
Sameer Agarwal Tamara Berg Thomas Brox
Aseem Agarwala James Bergen Andrés Bruhn
Jörgen Ahlberg Marcelo Bertalmio Antoni Buades
Narendra Ahuja Bir Bhanu Joachim Buhmann
Yiannis Aloimonos Stan Bileschi Hans Burkhardt
Tal Arbel Stan Birchfield Andrew Calway
Kalle Åström Volker Blanz Rodrigo Carceroni
Peter Auer Aaron Bobick Gustavo Carneiro
Jonas August Endre Boros M. Carreira-Perpinan
Shai Avidan Terrance Boult Tat-Jen Cham
Simon Baker Richard Bowden Rama Chellappa
Kobus Barnard Edmond Boyer German Cheung
Adrien Bartoli Yuri Boykov Ondřej Chum
Benedicte Bascle Gary Bradski James Clark
Csaba Beleznai Chris Bregler Isaac Cohen
Peter Belhumeur Thomas Breuel Laurent Cohen
Serge Belongie Gabriel Brostow Michael Cohen
Moshe Ben-Ezra Matthew Brown Robert Collins
Alexander Berg Michael Brown Dorin Comaniciu
Additional Reviewers
Lourdes Agapito Ross Beveridge Yixin Chen
Daniel Alexander V. Bhagavatula Dmitry Chetverikov
Elli Angelopoulou Edwin Bonilla Sharat Chikkerur
Alexandru Balan Aeron Buchanan Albert Chung
Adrian Barbu Michael Burl Nicholas Costen
Nick Barnes Tiberio Caetano Gabriela Oana Cula
João Barreto Octavia Camps Goksel Dedeoglu
Marian Bartlett Sharat Chandran Hervé Delingette
Herbert Bay François Chaumette Michael Donoser
Sponsoring Institutions
Table of Contents – Part IV
Segmentation
Image Segmentation in the Presence of Shadows and Highlights ..... 1
Eduard Vazquez, Joost van de Weijer, and Ramon Baldrich
Computational Photography
Light-Efficient Photography ..... 45
Samuel W. Hasinoff and Kiriakos N. Kutulakos
Priors for Large Photo Collections and What They Reveal about Cameras ..... 74
Sujit Kuthirummal, Aseem Agarwala, Dan B Goldman, and Shree K. Nayar
Poster Session IV
CenSurE: Center Surround Extremas for Realtime Feature Detection and Matching ..... 102
Motilal Agrawal, Kurt Konolige, and Morten Rufus Blas
Sample Sufficiency and PCA Dimension for Statistical Shape Models ..... 492
Lin Mei, Michael Figl, Ara Darzi, Daniel Rueckert, and Philip Edwards
Active Reconstruction
Temporal Dithering of Illumination for Fast Active Vision ..... 830
Srinivasa G. Narasimhan, Sanjeev J. Koppal, and Shuntaro Yamazaki
Compressive Structured Light for Recovering Inhomogeneous Participating Media ..... 845
Jinwei Gu, Shree Nayar, Eitan Grinspun, Peter Belhumeur, and Ravi Ramamoorthi
Passive Reflectometry ..... 859
Fabiano Romeiro, Yuriy Vasilyev, and Todd Zickler
Fusion of Feature- and Area-Based Information for Urban Buildings Modeling from Aerial Imagery ..... 873
Lukas Zebedin, Joachim Bauer, Konrad Karner, and Horst Bischof
1 Introduction
surface reflectance forms a branch which points in the direction of the reflected
illuminant. In conclusion, the distribution of a single DC forms a ridge-like struc-
ture in histogram space.
Fig. 1. (a) An image from [14] and (b) its histogram. The effects of shading and
highlights are clearly visible in the red colours of the histogram. (c) Segmented images
using RAD. (d) Ridges found with RAD. Note that the three branches of the red
pepper are correctly connected in a single ridge.
Fig. 2. (a) An image and (b) its 3D RGB histogram. (c) A patch of a) and its RGB
histogram. (d) 2D histogram of c) to illustrate the discontinuities appearing on a DC.
objects described by this equation will trace connected ridges in histogram space.
This makes the method more robust to deviations from the dichromatic model.
This paper is organized as follows: in section 2 RAD is presented as a feature
space analysis method. Afterwards, in section 3 RAD is introduced as a segmen-
tation technique. The results obtained and a comparison with Mean Shift and various other state-of-the-art methods on the Berkeley dataset are presented in
section 4. Finally, conclusions of the current work are given in section 5.
The creaseness measure of Ω(x) for a given point x, named k(x, σ), is computed with the divergence between the dominant gradient orientation and the normal vectors, namely n_k, on the r-connected neighbourhood of size proportional to σ_i. That is:

\[
k(\mathbf{x}, \sigma) = -\mathrm{Div}\big(w(\mathbf{x}, \sigma)\big) = -\frac{d}{r}\sum_{k=1}^{r} w^{t}(k, \sigma)\cdot \mathbf{n}_k \tag{4}
\]
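A minimal sketch of a creaseness computation in this spirit is given below. It only approximates Eq. (4): the dominant orientation is taken from a smoothed structure tensor and −Div(w) is estimated with finite differences rather than with the r-connected neighbourhood sum, so it should be read as an illustration, not the exact operator used by RAD; all function and variable names are ours.

import numpy as np
from scipy.ndimage import gaussian_filter

def creaseness(hist2d, sigma_d=1.5, sigma_i=0.5):
    """Approximate creaseness k(x, sigma) of a 2D histogram (cf. Eq. (4)).

    The dominant gradient orientation w is taken from the smoothed structure
    tensor and -Div(w) is estimated with finite differences instead of the
    r-connected neighbourhood sum used in the paper.
    """
    h = gaussian_filter(hist2d.astype(float), sigma_d)
    gy, gx = np.gradient(h)                      # derivatives along rows / columns
    # structure tensor entries, smoothed at the integration scale
    jxx = gaussian_filter(gx * gx, sigma_i)
    jxy = gaussian_filter(gx * gy, sigma_i)
    jyy = gaussian_filter(gy * gy, sigma_i)
    # orientation of the eigenvector with the largest eigenvalue
    theta = 0.5 * np.arctan2(2.0 * jxy, jxx - jyy)
    wx, wy = np.cos(theta), np.sin(theta)
    # resolve the 180-degree ambiguity so that w points uphill
    flip = (wx * gx + wy * gy) < 0
    wx[flip], wy[flip] = -wx[flip], -wy[flip]
    # k = -Div(w): large on the ridges of the histogram
    return -(np.gradient(wx, axis=1) + np.gradient(wy, axis=0))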
Fig. 3. A graphical example of the whole process. (a) Opponent Red-Green and Blue-Yellow histogram Ω(x) of g). (b) Creaseness representation of a). (c) Ridges found in b). (d) Ridges fitted on the original distribution. (e) Top view of d). (f) Dominant structures of a). (g) Original image. (h) Segmented image.
Points (SP): when there is a local maximum in one direction and a local minimum in another one. Third, Local Maximum Points (LMP). Formally, let Ω(x, y) be a continuous 2D surface and ∇Ω(x, y) be the gradient vector of the function Ω(x, y). We define ω_1 and ω_2 as the unit eigenvectors of the Hessian matrix and λ_1 and λ_2 as their corresponding eigenvalues, with |λ_1| ≤ |λ_2|. Then, for the 2D case:
Basically, the flooding process begins at the local minima and, iteratively, the landscape sinks into the water. The points where water coming from different local minima meets form the watershed lines. To avoid potential problems with irregularities [16], we force the flooding process to begin at the same time in all DS descriptors, on the Ω(x) distribution smoothed with a Gaussian kernel of standard deviation σd (already computed for the ST calculation). Then, we define RAD as the operator returning the set of DSs of Ω_σ using the RPs as markers:

RAD(Ω(x)) = W(Ω_σ, RP(Ω_σ))   (8)
Following this procedure, Figure 3f depicts the 2D projection of the DSs found
on 3a.
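As a rough illustration of this pipeline (and not the exact RAD operator), the sketch below builds an opponent-colour histogram, smooths it with a Gaussian of standard deviation σd, and floods the inverted histogram with a marker-based watershed; markers placed at the histogram peaks stand in for the ridge representative points, so the resulting basins only approximate the dominant structures. All names and the bin count are our own choices.

import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def rad_like_segmentation(img, bins=64, sigma_d=1.5):
    """Marker-based watershed on an opponent-colour histogram (RAD-flavoured).

    Markers at the histogram peaks stand in for the ridge representative
    points, so the basins only approximate the dominant structures (DSs).
    """
    r, g, b = (img[..., c].astype(float) for c in range(3))
    o1 = (r - g) / np.sqrt(2.0)                 # opponent red-green
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)       # opponent blue-yellow

    def to_bin(c):
        c = (c - c.min()) / (np.ptp(c) + 1e-9)
        return np.minimum((c * bins).astype(int), bins - 1)

    i1, i2 = to_bin(o1), to_bin(o2)
    hist = np.zeros((bins, bins))
    np.add.at(hist, (i1.ravel(), i2.ravel()), 1)
    hist = gaussian_filter(hist, sigma_d)        # smoothing with sigma_d

    peaks = peak_local_max(hist, min_distance=3)
    markers = np.zeros_like(hist, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)

    basins = watershed(-hist, markers)           # flood the inverted histogram
    return basins[i1, i2]                        # dominant colour label per pixel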
in one unique mode. Finally, the basin of attraction of these modes composes a dominant colour of the image. Mean Shift has two basic parameters to adapt the segmentation to a specific problem, namely hs, which controls a smoothing process, and hr, which is related to the size of the kernel used to determine the modes and their basins of attraction. To test the method, we have selected the set of parameters (hs, hr) = {(7,3), (7,15), (7,19), (7,23), (13,7), (13,19), (17,23)} given in [24] and [5]. The average times for this set of parameters, expressed in seconds, are 3.17, 4.15, 3.99, 4.07, 9.72, 9.69, and 13.96, respectively. Nevertheless, these parameters do not cover the complete spectrum of possibilities of MS. Here we want to compare RAD and MS from a soft oversegmentation to a soft undersegmentation. Hence, in order to reach an undersegmentation with MS, we add the parameter settings (hs, hr) = {(20,25), (25,30), (30,35)}. For these settings, the average times are 18.05, 24.95, and 33.09 seconds, respectively.
The parameters used for RAD based segmentation are (σd ,σi )={ (0.8,0.05),
(0.8,0.5), (0.8,1), (0.8,1.5), (1.5,0.05), (1.5,0.5), (1.5,1.5), (2.5,0.05), (2.5,0.5),
(2.5,1.5) }. These parameters vary from a soft oversegmentation to an underseg-
mentation, and have been selected experimentally. The average times for RAD
are 6.04, 5.99, 6.11, 6.36, 6.11, 5.75, 6.44, 5.86, 5.74, and 6.35 seconds. These average times show that the running time of RAD does not depend on the parameters used.
In conclusion, whereas the execution time of Mean Shift increases significantly
with increasing spatial scale, the execution time of RAD remains constant from
an oversegmentation to an undersegmentation.
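As a rough way to reproduce this kind of timing comparison, the sketch below runs OpenCV's pyramid mean-shift filtering over a similar range of (hs, hr) settings. Note that this is only the filtering stage of MS (the EDISON-based implementation used in [24] and [5] also clusters the filtered result), so absolute timings and outputs will differ, and the image path is a placeholder.

import time
import cv2

# (hs, hr) pairs as above; hs plays the role of the spatial bandwidth sp,
# hr of the range bandwidth sr in OpenCV's pyrMeanShiftFiltering.
settings = [(7, 3), (7, 15), (7, 19), (7, 23), (13, 7), (13, 19), (17, 23),
            (20, 25), (25, 30), (30, 35)]

img = cv2.imread("berkeley_image.jpg")  # placeholder path to a Berkeley image
for hs, hr in settings:
    t0 = time.time()
    filtered = cv2.pyrMeanShiftFiltering(img, sp=hs, sr=hr)
    print(f"(hs={hs}, hr={hr}): {time.time() - t0:.2f} s")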
The experiments have been performed on the publicly available Berkeley image segmentation dataset and benchmark [12]. We use the Global Consistency Error (GCE) as an error measure. This measure was also proposed in [12] and accounts for refinement between different segmentations. For a given pixel pi ,
consider the segments (sets of connected pixels), S1 from the benchmark and S2
from the segmented image that contain this pixel. If one segment is a proper
subset of the other, then pi lies in an area of refinement and the error measure
should be zero. If there is no subset relationship, then S1 and S2 overlap in an
inconsistent manner and the error is higher than zero, (up to one in the worst
possible case). MS segmentation has been done on the CIE Luv space since
this is the space used in [24] and [5]. RAD based segmentation has been done
on the RGB colour space for two reasons. First, the Berkeley image dataset does not have calibrated images and, consequently, we cannot ensure a good transformation from sRGB to CIE Luv. Second, the ranges of L, u, and v are not the same, so the method would require six parameters instead of two, that is, σL, σu, and σv. Nonetheless, for the sake of clarity, we also present some results of RAD on CIE Luv to directly compare with MS.
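For reference, the GCE of [12] can be computed directly from two label maps. The sketch below follows the standard definition (the local refinement error at each pixel, averaged over the image, taking the smaller of the two directions, so that one segmentation refining the other costs nothing); the function name and implementation details are ours.

import numpy as np

def gce(seg1, seg2):
    """Global Consistency Error between two label maps (Martin et al. [12]).

    Both inputs are non-negative integer label images of the same shape. The
    local refinement error, the fraction of R(S1,p) that falls outside
    R(S2,p), is averaged over pixels, and the smaller of the two directions
    is taken so that refinement in either direction gives zero error.
    """
    s1, s2 = seg1.ravel(), seg2.ravel()
    n = s1.size
    joint = np.zeros((s1.max() + 1, s2.max() + 1))
    np.add.at(joint, (s1, s2), 1)                       # contingency table
    size1 = np.maximum(joint.sum(axis=1, keepdims=True), 1)
    size2 = np.maximum(joint.sum(axis=0, keepdims=True), 1)
    e12 = (((size1 - joint) / size1) * joint).sum()     # S1 refined by S2
    e21 = (((size2 - joint) / size2) * joint).sum()     # S2 refined by S1
    return min(e12, e21) / n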
Figure 4 depicts a set of examples for RAD on RGB. From left to right: original
image, RAD for (σd ,σi )={ (0.8,0.05) , (1.5,0.05) , (2.5,0.05) , (2.5,1.5) } and hu-
man segmentation. Figure 5 shows some results for the mean shift segmentation,
corresponding to (hs , hr ) = {(7, 15), (13, 19), (17, 23), (20, 25), (25, 30), (30, 35)}.
These results point out the main advantage of RAD over MS, namely the capability of RAD to capture the DSs of a histogram, whereas MS is agnostic to the physical processes underlying the structure of the DSs, as Abd-Almageed and Davis explain in [10]. Graphically, the set of images depicted in the first row of Figure 5 shows this behavior in a practical case. In the last column, MS joins rocks with the mountain, and the mountain with the sky, but is not able to find one unique structure for a rock or for the mountain, whereas RAD, as shown in Figure 4, is able to.
A danger of RAD is that for some parameter settings it is prone to underseg-
menting. Consequently it finds only one dominant colour for the whole image.
This happens in some cases for (σd ,σi )={(2.5,1),(2.5,1.5)}, as Figure 6 illus-
trates. In the first example the aircraft has a bluish colour similar to the sky; similarly, in the second example the fish has a colour similar to its environment.
Additional examples related to the presence of physical effects, such as shad-
ows, shading and highlights are shown in Figure 7. The good performance of
RAD in these conditions can be clearly observed for the skin of the people, the
elephants and buffalos, as well as for the clothes of the people.
Fig. 8. (a,b) Mean GCE values for each set of parameters. (c,d) Standard deviation of GCE along maximum and minimum values for each set of parameters. (e) Mean GCE values for each image sorted from lower to higher. (f) Values higher than zero: images where MS performs better than RAD. (g,h) The same as f) but for MS versus RAD Luv and for RAD RGB versus RAD Luv.
Table 1. Global Consistency Error for several state-of-the-art methods: seed [27], fow [28], MS, and nCuts [29]. Values taken from [27] and [5].
Histograms of the mean GCE values (the percentage of images for each GCE value) are shown in Figures 8a,b for RAD on RGB and MS, respectively. The more mass is accumulated on the left, the better the method. Figures 8c,d show the standard deviation along the maximum and minimum GCE values (red lines) for each of the 10 sets of parameters for RAD on RGB and MS. Note that the behaviour of both methods in this sense is almost the same. A low and similar standard deviation across all parameters means that the method has a stable behaviour. Figure 8e depicts the mean GCE index for each image, ordered by increasing index, for MS (green), RAD on RGB (black), and RAD on Luv (red). This plot shows not only the good performance of RAD, but also that RAD behaves similarly on the RGB and CIE Luv spaces, even with the aforementioned potential problems in Luv. Figure 8f plots the GCE index differences for each image between RAD on RGB and MS. Values lower than zero correspond to images where RAD performs better than MS. The same, but for RAD on Luv versus MS and for RAD on RGB versus RAD on Luv, is depicted in Figures 8g,h.
Additionally, Table 1 shows GCE values for several state-of-the-art methods. These values are taken from [27] and [5]. These experiments have been performed using the train set of 200 images. For both RAD and MS we present the results obtained with the best parameter settings. For our method the best results were obtained with (σd, σi) = (2.5, 0.05). The mean number of dominant colours found using RAD was 5, but this does not translate directly into 5 segments in the segmented images. Often, segments of a few pixels appear due to the chromaticity of surfaces, as can be seen in Figure 3h. GCE evaluation favors oversegmentation [12]. Hence, to make a comparison with other methods using GCE feasible, we have performed the segmentation without considering segments with an area smaller than 2% of the image area. In this case, the mean number of segments for the 200 test images is 6.98 (7 segments). The number of segments for the other methods varies from 5 to 12.
As can be seen, our method obtains the best results. Furthermore, it should be noted that the method is substantially faster than the seed [27] and nCuts [29] methods. In addition, the results obtained with MS need an additional step: a final combination step, which requires a new threshold value, is used to fuse adjacent segments in the segmented image if their chromatic difference is lower than the threshold (without pre- and postprocessing MS obtains a score of 0.2972). For our RAD method we do not apply any pre- or postprocessing steps.
5 Conclusions
This paper introduces a new feature space segmentation method that extracts the ridges formed by a dominant colour in an image histogram. The method is robust against discontinuities appearing in image histograms due to compression and acquisition conditions. Furthermore, the strong discontinuities related to physical illumination effects are treated correctly thanks to the topological treatment of the histogram. As a consequence, the presented method yields better results than Mean Shift on a widely used image dataset and error measure. Additionally, even without any preprocessing or postprocessing steps, RAD performs better than the state-of-the-art methods. This suggests that chromatic information is an important cue in human segmentation. Additionally, the elapsed time for RAD is not affected by its parameters, which makes it faster than Mean Shift and the other state-of-the-art methods.
The next step is to add spatial coherence to help the method in those areas
which are not well-represented by a dominant colour. Furthermore, improvement
is expected by looking for dominant colours only in interesting regions in the
image instead of in the whole image at once.
Acknowledgements
This work has been partially supported by projects TIN2004-02970, TIN2007-
64577 and Consolider-Ingenio 2010 CSD2007-00018 of the Spanish MEC (Ministry of Science) and the Ramon y Cajal Program.
References
1. Skarbek, W., Koschan, A.: Colour image segmentation — a survey. Technical re-
port, Institute for Technical Informatics, Technical University of Berlin (October
1994)
2. Cheng, H., Jiang, X., Sun, Y., Wang, J.: Color image segmentation: advances and prospects. Pattern Recognition 34(6), 2259–2281 (2001)
3. Lucchese, L., Mitra, S.: Color image segmentation: A state-of-the-art survey. INSA-
A: Proceedings of the Indian National Science Academy, 207–221 (2001)
4. Agarwal, S., Madasu, S., Hanmandlu, M., Vasikarla, S.: A comparison of some
clustering techniques via color segmentation. In: ITCC 2005: Proceedings of the In-
ternational Conference on Information Technology: Coding and Computing (ITCC
2005), vol. II, pp. 147–153. IEEE Computer Society Press, Washington (2005)
5. Yang, Y., Wright, J., Sastry, S., Ma, Y.: Unsupervised segmentation of natural
images via lossy data compression (2007)
6. Freixenet, J., Munoz, X., Raba, D., Mart, J., Cuf, X.: Yet another survey on image
segmentation: Region and boundary information integration. In: Heyden, A., Sparr,
G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 408–422.
Springer, Heidelberg (2002)
7. Sezgin, M., Sankur, B.: Survey over image thresholding techniques and quantitative
performance evaluation. J. Electron. Imaging 13(1), 146–165 (2004)
8. Fukunaga, K., Hostetler, L.D.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21(1), 32–40 (1975)
9. Verma, D., Meila, M.: A comparison of spectral clustering algorithms. Technical Report UW-CSE-03-05-01, University of Washington
10. Abd-Almageed, W., Davis, L.: Density Estimation Using Mixtures of Mixtures of
Gaussians. In: 9th European Conference on Computer Vision (2006)
11. Bilmes, J.: A Gentle Tutorial of the EM Algorithm and its Application to Param-
eter Estimation for Gaussian Mixture and Hidden Markov Models. International
Computer Science Institute 4 (1998)
12. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented
Natural Images and its Application to Evaluating Segmentation Algorithms and
Measuring Ecological Statistics. In: Proc. Eighth Int’l Conf. Computer Vision,
vol. 2, pp. 416–423 (2001)
13. Shafer, S.A.: Using color to separate reflection components. Color Research and Application 10(4), 210–218 (1985)
14. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam library
of object images. Int. J. Comput. Vision 61(1), 103–112 (2005)
15. Klinker, G., Shafer, S.: A physical approach to color image understanding. Int.
Journal of Computer Vision 4, 7–38 (1990)
16. López, A.M., Lumbreras, F., Serrat, J., Villanueva, J.J.: Evaluation of methods for
ridge and valley detection. IEEE Trans. Pattern Anal. Mach. Intell. 21(4), 327–335
(1999)
17. Wang, L., Pavlidis, T.: Direct gray-scale extraction of features for character recog-
nition. IEEE Trans. Pattern Anal. Mach. Intell. 15(10), 1053–1067 (1993)
18. Bishnu, A., Bhowmick, P., Dey, S., Bhattacharya, B.B., Kundu, M.K., Murthy,
C.A., Acharya, T.: Combinatorial classification of pixels for ridge extraction in a
gray-scale fingerprint image. In: ICVGIP (2002)
19. Vazquez, E., Baldrich, R., Vazquez, J., Vanrell, M.: Topological histogram reduc-
tion towards colour segmentation. In: Martı́, J., Benedı́, J.M., Mendonça, A.M.,
Serrat, J. (eds.) IbPRIA 2007. LNCS, vol. 4477, pp. 55–62. Springer, Heidelberg
(2007)
20. Vincent, L., Soille, P.: Watersheds in digital spaces: an efficient algorithm based
on immersion simulations. IEEE Transactions on Pattern Analysis and Machine
Intelligence 13(6), 583–598 (1991)
21. Gauch, J.M., Pizer, S.M.: Multiresolution analysis of ridges and valleys in grey-
scale images. IEEE Trans. Pattern Anal. Mach. Intell. 15(6), 635–646 (1993)
22. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space
analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
23. Christoudias, C., Georgescu, B., Meer, P.: Synergism in low level vision. Interna-
tional Conference on Pattern Recognition 4, 150–155 (2002)
24. Pantofaru, C., Hebert, M.: A comparison of image segmentation algorithms. Tech-
nical Report CMU-RI-TR-05-40, Robotics Institute, Carnegie Mellon University,
Pittsburgh, PA (September 2005)
25. Ge, F., Wang, S., Liu, T.: New benchmark for image segmentation evaluation.
Journal of Electronic Imaging 16, 033011 (2007)
26. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. Intl.
Journal of Computer Vision 59(2) (2004)
27. Micusık, B., Hanbury, A.: Automatic image segmentation by positioning a seed. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952. Springer,
Heidelberg (2006)
28. Fowlkes, C., Martin, D., Malik, J.: Learning affinity functions for image segmen-
tation: combining patch-based and gradient-based approaches. In: Proceedings of
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2003, vol. 2 (2003)
29. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
Image Segmentation by Branch-and-Mincut
1 Introduction
Binary image segmentation is often posed as a graph partition problem. This is
because efficient graph algorithms such as mincut permit fast global optimiza-
tion of the functionals measuring the quality of the segmentation. As a result,
difficult image segmentation problems can be solved efficiently, robustly, and
independently of initialization. Yet, while graphs can represent energies based
on localized low-level cues, they are much less suitable for representing non-local
cues and priors describing the foreground or the background segment as a whole.
Consider, for example, the situation when the shape of the foreground segment
is known a priori to be similar to a particular template (segmentation with shape
priors). Graph methods can incorporate such a prior for a single pre-defined and
pre-located shape template [13,20]. However, once the pose of the template is al-
lowed to change, the relative position of each graph edge with respect to the tem-
plate becomes unknown, and the non-local property of shape similarity becomes
hard to express with local edge weights. Another example would be the segmen-
tation with non-local color priors, when the color of the foreground and/or back-
ground is known a priori to be described by some parametric distribution (e.g.
a mixture of Gaussians as in the case of GrabCut [25]). If the parameters of
these distributions are allowed to change, such a non-local prior depending on the
segment as a whole becomes very hard to express with the local edge weights.
2 Related Work
Our framework employs the fact that a submodular quadratic function of boolean
variables can be efficiently minimized via minimum cut computation in the as-
sociated graph [2,11,18]. This idea has been successfully applied to binary image
segmentation [3] and quickly gained popularity. As discussed above, the approach
[3] still has significant limitations, as high-level knowledge such as shape or color priors is hard to express with fixed local edge weights. These limitations
are overcome in our framework, which allows the edge weights to vary.
In the restricted case, when unary energy potentials are allowed to vary and
depend on a single scalar non-local parameter monotonically, efficient algorithms
known as parametric maxflow have been suggested (see e.g. [19]). Our framework
is, however, much more general than these methods (at the price of higher
worst-case complexity), as we allow both unary and pairwise energy terms to
depend non-monotonically on a single or multiple non-local parameters. Such
generality gives our framework flexibility in incorporating various high-level pri-
ors while retaining the globality of the optimization.
Image segmentation with non-local shape and color priors has attracted a lot of interest in recent years. As discussed above, most approaches use either
local continuous optimization [27,7,24,9] or iterated minimization alternating
graph cut and search over non-local parameter space [25,6,16]. Unfortunately,
both groups of methods are prone to getting stuck in poor local minima. Global-
optimization algorithms have also been suggested [12,26]. In particular, simulta-
neous work [10] presented a framework that also utilizes branch-and-bound ideas
(paired with continuous optimization in their case). While all these global opti-
mization methods are based on elegant ideas, the variety of shapes, invariances,
and cues that each of them can handle is limited compared to our method.
Finally, our framework may be related to branch-and-bound search methods
in computer vision (e.g. [1,21]). In particular, it should be noted that the way our
framework handles shape priors is related to previous approaches like [14] that
used tree search over shape hierarchies. However, neither of those approaches accomplishes pixel-wise image segmentation.
3 Optimization Framework
In this section, we discuss our global energy optimization framework for obtain-
ing image segmentations under non-local priors.¹ In the next sections, we detail
how it can be used for the segmentation with non-local shape priors (Section 4)
and non-local intensity/color priors (Section 5).
Firstly, we introduce notation and give the general form of the energy that can
be optimized in our framework. Below, we consider the pixel-wise segmentation
of the image. We denote the pixel set as V and use letters p and q to denote
individual pixels. We also denote the set of edges connecting adjacent pixels
as E and refer to individual edges as to the pairs of pixels (e.g. p, q). In our
experiments, the set of edges consisted of all 8-connected pixel pairs in the raster.
The segmentation of the image is given by its 0−1 labeling x ∈ 2V , where
individual pixel labels xp take the values 1 for the pixels classified as the fore-
ground and 0 for the pixels classified as the background. Finally, we denote the
non-local parameter as ω and allow it to vary over a discrete, possibly very large,
set Ω. The general form of the energy function that can be handled within our
framework is then given by:
\[
E(x, \omega) = C(\omega) + \sum_{p\in V} F^{p}(\omega)\, x_p + \sum_{p\in V} B^{p}(\omega)\,(1-x_p) + \sum_{(p,q)\in E} P^{pq}(\omega)\,|x_p - x_q| \,. \tag{1}
\]
Here, C(ω) is a constant potential, which does not depend directly on the segmentation x; F^p(ω) and B^p(ω) are the unary potentials defining the cost for assigning pixel p to the foreground and to the background, respectively.
¹ The C++ code for this framework is available at the webpage of the first author.
\[
\begin{aligned}
\min_{x\in 2^{V},\,\omega\in\Omega} E(x,\omega)
&= \min_{x\in 2^{V}}\,\min_{\omega\in\Omega}\Big[ C(\omega) + \sum_{p\in V} F^{p}(\omega)\,x_p + \sum_{p\in V} B^{p}(\omega)\,(1-x_p) + \sum_{(p,q)\in E} P^{pq}(\omega)\,|x_p-x_q| \Big] \\
&\ge \min_{x\in 2^{V}}\Big[ \min_{\omega\in\Omega} C(\omega) + \sum_{p\in V} \min_{\omega\in\Omega} F^{p}(\omega)\,x_p + \sum_{p\in V} \min_{\omega\in\Omega} B^{p}(\omega)\,(1-x_p) + \sum_{(p,q)\in E} \min_{\omega\in\Omega} P^{pq}(\omega)\,|x_p-x_q| \Big] \\
&= \min_{x\in 2^{V}}\Big[ C_{\Omega} + \sum_{p\in V} F^{p}_{\Omega}\,x_p + \sum_{p\in V} B^{p}_{\Omega}\,(1-x_p) + \sum_{(p,q)\in E} P^{pq}_{\Omega}\,|x_p-x_q| \Big] = L(\Omega)\,. 
\end{aligned}
\tag{2}
\]
Here, C_Ω, F^p_Ω, B^p_Ω, and P^{pq}_Ω denote the minima of C(ω), F^p(ω), B^p(ω), and P^{pq}(ω) over ω ∈ Ω, referred to below as aggregated potentials. L(Ω) denotes the derived lower bound for E(x, ω) over 2^V ⊗ Ω. The inequality in (2) is essentially the Jensen inequality for the minimum operation.
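Evaluating this bound for a given subregion thus amounts to a single st-mincut over the aggregated potentials. The sketch below illustrates this with the PyMaxflow library (our choice, not the authors' C++ implementation); the grid is 4-connected for brevity where the paper uses 8-connectivity, P_agg is taken as a single scalar, and all array names are ours.

import numpy as np
import maxflow  # PyMaxflow; any s-t mincut solver would do

def lower_bound(C_agg, F_agg, B_agg, P_agg):
    """Evaluate L(Omega) of Eq. (2) from the aggregated potentials.

    F_agg and B_agg are HxW arrays of aggregated unary potentials, P_agg a
    single aggregated pairwise weight (the paper allows one value per edge),
    and C_agg the aggregated constant term.
    """
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(F_agg.shape)
    # pairwise terms P * |x_p - x_q| (4-connected here; the paper uses 8)
    g.add_grid_edges(nodes, weights=P_agg, symmetric=True)
    # t-links: with this assignment, labelling p as foreground cuts F_agg[p]
    # and labelling it as background cuts B_agg[p]
    g.add_grid_tedges(nodes, B_agg, F_agg)
    flow = g.maxflow()
    # the minimising segmentation itself is available from
    # g.get_grid_segments(nodes) (mind which terminal you call foreground)
    return C_agg + flow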
The proposed lower bound possesses three properties crucial to the Branch-
and-Mincut framework:
Tightness. For a singleton Ω the bound is tight: L({ω}) = min_{x∈2^V} E(x, ω). In this case, the minimal st-cut also yields the segmentation x optimal for this ω (x_p = 0 iff the respective vertex belongs to the s-component of the cut).
Note that the fact that the lower bound (2) may be evaluated via st-mincut
gives rise to a whole family of looser, but cheaper, lower bounds. Indeed, the
minimal cut on a network graph is often found by pushing flows until the flow
becomes maximal (and equal to the weight of the mincut) [5]. Thus, the se-
quence of intermediate flows provides a sequence of the increasing lower bounds
on (1) converging to the bound (2) (flow bounds). If some upper bound on
the minimum value is imposed, the process may be terminated earlier without
computing the full maxflow/mincut. This happens when the new flow bound
exceeds the given upper bound. In this case it may be concluded that the value
of the global minimum is greater than the imposed upper bound.
Finding the global minimum of (1) is, in general, a very difficult problem. Indeed,
since the potentials can depend arbitrarily on the non-local parameter spanning an arbitrary discrete set Ω, in the worst case any optimization has to search ex-
haustively over Ω. In practice, however, any segmentation problem has some
specifically-structured space Ω. This structure can be efficiently exploited by
the branch-and-bound search detailed below.
We assume that the discrete domain Ω can be hierarchically clustered and
the binary tree of its subregions TΩ = {Ω = Ω0 , Ω1 , . . . ΩN } can be constructed
(binarity of the tree is not essential). Each non-leaf node corresponding to the
subregion Ωk then has two children corresponding to the subregions Ωch1(k) and
Ωch2(k) such that Ωch1(k) ⊂ Ωk , Ωch2(k) ⊂ Ωk . Here, ch1(·) and ch2(·) map the
index of the node to the indices of its children. Also, leaf nodes of the tree are
in one-to-one correspondence with singleton subsets Ωl = {ωt }.
Given such a tree, the global minimum of (1) can be efficiently found using
the best-first branch-and-bound search [8]. This algorithm propagates a front of
nodes in the top-down direction (Fig. 1). During the search, the front contains a
set of tree nodes, such that each top-down path from the root to a leaf contains
exactly one active vertex. In the beginning, the front contains the tree root
Ω0 . At each step the active node with the smallest lower bound (2) is removed
from the active front, while two of its children are added to the active front (by
the monotonicity property they have higher or equal lower bounds). Thus, the active front moves towards the leaves, making local steps that increase the lowest lower bound of all active nodes. Note that, at each moment, this lowest lower bound
of the front constitutes a lower bound on the global optimum of (1) over the
whole domain.
At some moment, the active node with the smallest lower bound turns out to be a leaf {ω′}. Let x′ be the optimal segmentation for ω′ (found via minimum st-cut). Then E(x′, ω′) = L({ω′}) (tightness property) is by assumption the lowest bound of the front and hence a lower bound on the global optimum over the whole domain. Consequently, (x′, ω′) is a global minimum of (1) and the search terminates without traversing the whole tree. In our experiments, the number of traversed nodes was typically very small (two to three orders of magnitude smaller than the size of the full tree). Therefore, the algorithm performed global optimization much faster than exhaustive search over Ω.
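A schematic rendering of this best-first search (ours, not the authors' released code) is given below; `root`, `split`, `is_leaf`, `lower_bound`, and `segment_at` are placeholder callables for the problem-specific pieces, with `lower_bound` being the mincut evaluation sketched above.

import heapq

def branch_and_mincut(root, lower_bound, split, is_leaf, segment_at):
    """Best-first branch-and-bound over the tree of parameter subregions.

    `root` is the full domain Omega; `split` returns the two children of a
    node; `is_leaf` tests for a singleton {omega}; `lower_bound` evaluates the
    bound of Eq. (2) via st-mincut; `segment_at` returns the optimal x for a
    fixed omega. All five are problem-specific callables.
    """
    front = [(lower_bound(root), 0, root)]   # (bound, tie-breaker, region)
    counter = 1
    while front:
        bound, _, region = heapq.heappop(front)
        if is_leaf(region):
            # tightness: a leaf's bound equals its true minimum, and it is not
            # larger than any other active bound, so it is the global optimum
            return segment_at(region), region, bound
        for child in split(region):
            heapq.heappush(front, (lower_bound(child), counter, child))
            counter += 1
    raise ValueError("empty search tree")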
In order to further accelerate the search, we exploit the coherency between
the mincut problems solved at different nodes. Indeed, the maximum flow as
well as auxiliary structures such as shortest path trees computed for one graph
may be “reused” in order to accelerate the computation of the minimal st-cut
on another similar graph [3,17]. For some applications, this trick may give an
order of magnitude speed-up for the evaluation of lower bounds.
In addition to the best-first branch-and-bound search we also tried the depth-
first branch-and-bound [8]. When problem-specific heuristics are available that
give good initial solutions, this variant may lead to moderate (up to a factor
of 2) time savings. Interestingly, the depth-first variant of the search, which
maintains upper bounds on the global optimum, may benefit significantly from
the use of flow bounds discussed above. Nevertheless, we stick with the best-first
branch-and-bound for the final experiments due to its generality (no need for
initialization heuristics).
In the rest of the paper we detail how the general framework developed above
may be used within different segmentation scenarios.
where ρ denotes the Hamming distance between segmentations. This term clearly
has the form (1) and therefore its combinations with other terms of this form can
be optimized within our framework. Being optimized over the domain 2V ⊗Ω, this
term would encourage the segmentation x to be close in the Hamming distance
to some of the exemplar shapes (note that some other shape distances can be
used in a similar way).
The full segmentation energy then may be defined by adding a standard
contrast-sensitive edge term [3]:
\[
E_{\mathrm{shape}}(x, \omega) = E_{\mathrm{prior}}(x, \omega) + \lambda \sum_{(p,q)\in E} \frac{e^{-\frac{\|K_p-K_q\|}{\sigma}}}{|p-q|}\,|x_p - x_q|\,, \tag{4}
\]
where ||K_p − K_q|| denotes the SAD (L1) distance between the RGB colors of pixels p and q in the image (λ and σ were fixed throughout the experiments described in this section), and |p − q| denotes the distance between the centers of pixels p and q (either 1 or √2 for the 8-connected grid). The functional (4) thus incorporates the shape prior with edge-contrast cues.
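Because the prior is the Hamming distance to the exemplar selected by ω, it decomposes exactly into unary terms, F^p(ω) = 1 − y_{ω,p} and B^p(ω) = y_{ω,p}, so the aggregated potentials for any subset of exemplars reduce to per-pixel minima and maxima over the stacked masks. A small sketch of this aggregation (our own, assuming the exemplars of one tree node are given as a stack of binary arrays):

import numpy as np

def aggregated_shape_potentials(exemplars):
    """Aggregated unary potentials for the Hamming-distance shape prior.

    `exemplars` is a (K, H, W) stack of binary masks y_omega belonging to one
    subregion of the shape space. Since rho(x, y) = sum_p (1 - y_p) x_p +
    sum_p y_p (1 - x_p), the per-exemplar potentials are F^p = 1 - y_p and
    B^p = y_p, and their minima over the subregion are:
    """
    F_agg = 1.0 - exemplars.max(axis=0)          # min over omega of (1 - y_omega,p)
    B_agg = exemplars.min(axis=0).astype(float)  # min over omega of y_omega,p
    return F_agg, B_agg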
In practice, the set Ωshape could be huge, e.g. tens of millions exemplars,
which poses a problem for hierarchical clustering as well as pre-computing and
storing aggregated potentials. Fortunately, for many scenarios all these tasks
can be accomplished with reasonable amount of time and memory provided that
Fig. 2. Using the shape prior constructed from the set of exemplars (left column) our
approach can accomplish segmentation of an object undergoing general 3D pose changes
within two differently illuminated sequences (two middle columns). Note the varying
topology of the segmentations. For comparison, we give the results of a standard graph
cut segmentation (right column): even with parameters tuned specifically to the test
images, separation is entirely inaccurate.
4.2 Experiments
Single object+3D pose changes. In our first experiment, we constructed
a shape prior for a single object (a coffee cup) undergoing 3D pose changes.
We obtained a set of outlines using “blue-screening”. We then normalized these
outlines (by centering at the origin, resizing to a unit scale, and aligning the principal axes with the coordinate axes). After that we clustered the normalized
outlines using k-means. A representative of each cluster was then taken into the
exemplar set. After that we added scale variations, in-plane rotations, and trans-
lations. As a result, we got a set {yω |ω ∈ Ωshape } containing about 30,000,000
exemplar shapes.
The results of the global optimization of the functional (4) for the frames from
the two sequences containing clutter and camouflage are shown in Fig. 2. On
Fig. 3. Results of the global optimization of (5) on some of the 170 UIUC car images
including 1 of the 2 cases where localization failed (bottom left). In the case of the
bottom right image, the global minimum of (4) (yellow) and the result of our feature-
based car detector (blue) gave erroneous localization, while the global minimum of
their combination (5) (red) represented an accurate segmentation.
under consideration. At the same time, there exists a large number of algorithms
working with image appearance cues and performing object detection based
on these cues (see e.g. [22] and references therein). Typically, such algorithms
produce the likelihood of object presence either as a function of a bounding box or even in the form of per-pixel “soft segmentation” masks. Both types of output can be added into the functional (1) either via the constant potential C(ω) or via the unary potentials. In this way, such appearance-based detectors can
be integrated with shape prior and edge-contrast cues.
As an example of such integration, we devised a simple detector similar in
spirit to [22]. The detector looked for the appearance features typical for cars
(wheels) using normalized cross-correlation. Each pixel in the image then “voted”
for the location of the car center depending on the strength of the response to
the detector and the relative position of the wheels with respect to the car center
observed on the training dataset. We then added an additional term Cvote (ω) in
our energy (1) that for each ω equaled minus the accumulated strength of the
votes for the center of yω :
\[
E_{\mathrm{shape\&detect}}(x, \omega) = C_{\mathrm{vote}}(\omega) + E_{\mathrm{prior}}(x, \omega) + \lambda \sum_{(p,q)\in E} \frac{e^{-\frac{\|K_p-K_q\|}{\sigma}}}{|p-q|}\,|x_p - x_q|\,, \tag{5}
\]
where S denotes the foreground segment, and I(p) is a grayscale image. The
first two terms measure the length of the boundary and the area, and the third and fourth terms are the integrals over the fore- and background of the difference
between image intensity and the two intensity values cf and cb , which correspond
to the average intensities of the respective regions. Traditionally, this functional
is optimized using level set framework converging to one of its local minima.
Below, we show that the discretized version of this functional can be optimized
globally within our framework. Indeed, the discrete version of (6) can be written
as (using notation as before):
\[
E\big(x,(c_f,c_b)\big) = \sum_{(p,q)\in E} \frac{\mu}{|p-q|}\,|x_p-x_q| + \sum_{p\in V}\Big(\nu + \lambda_1\big(I(p)-c_f\big)^2\Big)\,x_p + \sum_{p\in V} \lambda_2\big(I(p)-c_b\big)^2\,(1-x_p)\,. \tag{7}
\]
Here, the first term approximates the first term of (6) (the accuracy of the
approximation depends on the size of the pixel neighborhood [4]), and the last
two terms express the last three terms of (6) in a discrete setting.
The functional (7) clearly has the form (1) with non-local
parameter ω = {cf , cb }. Discretizing intensities cf and cb into
255 levels and building a quad-tree over their joint domain,
we can apply our framework to find the global minima of
(6). An example of a global minimum of (7) is shown to the
right (this 183x162 image was segmented in 3 seconds, the
proportion of the tree traversed was 1:115). More examples
are given in [23].
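The aggregated unary potentials needed at a quad-tree cell of the (cf, cb) domain have a simple closed form, because the minimum of (I(p) − c)² over an interval is the squared distance from I(p) to that interval. A sketch of this computation (ours; ν, λ1, λ2 are the constants of (7), and the interval arguments are hypothetical names):

import numpy as np

def aggregated_chanvese_potentials(I, cf_range, cb_range, nu=0.0, lam1=1.0, lam2=1.0):
    """Aggregated unary potentials of Eq. (7) for one quad-tree cell of (cf, cb).

    cf_range and cb_range are (low, high) intensity intervals; the minimum of
    (I - c)^2 over such an interval is the squared distance from I to it.
    """
    def min_sq_dist(I, lo, hi):
        below = np.clip(lo - I, 0, None)     # how far I lies below the interval
        above = np.clip(I - hi, 0, None)     # how far I lies above it
        return np.maximum(below, above) ** 2

    I = I.astype(float)
    F_agg = nu + lam1 * min_sq_dist(I, *cf_range)   # min over cf of nu + lam1 (I - cf)^2
    B_agg = lam2 * min_sq_dist(I, *cb_range)        # min over cb of lam2 (I - cb)^2
    return F_agg, B_agg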
In [25], the GrabCut framework for the interactive color image segmentation
based on Gaussian mixtures was proposed. In GrabCut, the segmentation is
driven by the following energy:
\[
E_{\mathrm{GrabCut}}\big(x,(GM^f, GM^b)\big) = \sum_{p\in V} -\log \mathrm{P}\big(K_p \mid GM^f\big)\,x_p + \sum_{p\in V} -\log \mathrm{P}\big(K_p \mid GM^b\big)\,(1-x_p) + \sum_{(p,q)\in E} \frac{\lambda_1 + \lambda_2\, e^{-\frac{\|K_p-K_q\|^2}{\beta}}}{|p-q|}\,|x_p-x_q|\,. \tag{8}
\]
Here, GM^f and GM^b are Gaussian mixtures in RGB color space and the first
two terms of the energy measure how well these mixtures explain colors Kp of
pixels attributed to fore- and background respectively. The third term is the con-
trast sensitive edge term, ensuring that the segmentation boundary is compact
and tends to stick to color region boundaries in the image. In addition to this
energy, the user provides supervision in the form of a bounding rectangle and
brush strokes, specifying which parts of the image should be attributed to the
foreground and to the background.
The original method [25] minimizes the energy within an EM-style process, al-
ternating between (i) the minimization of (8) over x given GM f and GM b and
(ii) refitting the mixtures GM f and GM b given x. Despite the use of the global
graph cut optimization within the segmentation update step, the whole process
yields only a local minimum of (8). In [25], the segmentation is initialized to the
provided bounding box and then typically shrinks to one of the local minima.
The energy (8) has the form (1) and therefore can be optimized within Branch-
and-Mincut framework, provided that the space of non-local parameters (which
in this case is the joint space of the Gaussian mixtures for the foreground and
for the background) is discretized and the tree of the subregions is built. In this
scenario, however, the dense discretization of the non-local parameter space is
infeasible (if the mixtures contain n Gaussians then the space is described by
20n − 2 continuous parameters). It is possible, nevertheless, to choose a much
smaller discrete subset Ω that is still likely to contain a good approximation to
the globally-optimal mixtures.
To construct such an Ω, we fit a mixture of M = 8 Gaussians G_1, G_2, ..., G_M with support areas a_1, a_2, ..., a_M to the whole image. The support area a_i here counts the number of pixels p such that P(K_p|G_i) ≥ P(K_p|G_j) for all j. We assume that the components are ordered such that the support areas decrease (a_i > a_{i+1}). Then, the Gaussian mixtures we consider are defined by the binary vector β = {β_1, β_2, ..., β_M} ∈ {0,1}^M specifying which Gaussians should be included in the mixture:
\[
\mathrm{P}\big(K \mid GM(\beta)\big) = \frac{\sum_i \beta_i\, a_i\, \mathrm{P}(K \mid G_i)}{\sum_i \beta_i\, a_i}\,.
\]
The overall set Ω is then defined as {0, 1}^{2M}, where odd bits correspond to the foreground mixture vector β^f and even bits correspond to the background mixture vector β^b. Vectors with all even bits and/or all odd bits equal to zero
do not correspond to meaningful mixtures and are therefore assigned an infinite
cost. The hierarchy tree is naturally defined by the bit-ordering (the first bit
corresponding to subdivision into the first two branches etc.).
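The construction of this restricted mixture family is straightforward to reproduce with off-the-shelf EM fitting; the sketch below uses sklearn and scipy (our choice — the paper does not prescribe an implementation) and returns a closure that evaluates log P(K_p | GM(β)) for any bit vector β, which is what the unary potentials in (8) require.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def mixture_family(pixels_rgb, M=8):
    """Fit M Gaussians to all pixels and return log P(K | GM(beta)).

    `pixels_rgb` is an (N, 3) float array. Components are sorted by their
    support areas a_i (number of pixels whose most likely component is G_i);
    the returned closure evaluates the subset mixture for a binary vector
    beta (all-zero vectors are the caller's responsibility).
    """
    gmm = GaussianMixture(n_components=M, covariance_type="full").fit(pixels_rgb)
    dens = np.stack([multivariate_normal.pdf(pixels_rgb, gmm.means_[i], gmm.covariances_[i])
                     for i in range(M)], axis=1)              # (N, M) component densities
    support = np.bincount(dens.argmax(axis=1), minlength=M)   # a_i
    order = np.argsort(-support)                              # a_1 >= a_2 >= ...
    dens, support = dens[:, order], support[order]

    def log_prob(beta):
        w = np.asarray(beta, dtype=float) * support
        return np.log(dens @ w / (w.sum() + 1e-12) + 1e-30)
    return log_prob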
Depending on the image and the value of M , the solutions found by Branch-
and-Mincut framework may have larger or smaller energy (8) than the solutions
found by the original EM-style method [25]. This is because Branch-and-Mincut
here finds the global optimum over the subset of the domain of (8) while [25]
searches locally but within the continuous domain. However, for all 15 images in
our experiments, improving Branch-and-Mincut solutions with a few EM-style
iterations [25] gave lower energy than the original solution of [25]. In most cases,
these additional iterations simply refit the Gaussians properly and change very
few pixels near boundary (see Fig. 4).
In terms of performance, for M = 8 the segmentation takes on average a few
dozen seconds (10s and 40s for the images in Fig. 4) for a 300×225 image. The
proportion of the tree traversed by an active front is one to several hundred
(1:963 and 1:283 for the images in Fig. 4).
Fig. 4. Being initialized with the user-provided bounding rectangle (shown in green
in the first column) as suggested in [25], EM-style process [25] converges to a local
minimum (the second column). The Branch-and-Mincut result (the third column) escapes that local minimum and, after EM-style improvement, leads to a solution with much smaller energy and better segmentation accuracy (the fourth column). Energy values
are shown in brackets.
6 Conclusion
The Branch-and-Mincut framework presented in this paper finds global optima
of a wide class of energies dependent on the image segmentation mask and non-
local parameters. The joint use of branch-and-bound and graph cut allows effi-
cient traversal of the solution space. The developed framework is useful within a
variety of image segmentation scenarios, including segmentation with non-local
shape priors and non-local color/intensity priors.
Future work includes the extension of Branch-and-Mincut to other problems,
such as simultaneous stitching and registration of images, as well as deriving
analogous branch-and-bound frameworks for combinatorial methods other than
binary graph cut, such as minimum ratio cycles and multilabel MRF inference.
Acknowledgements
We would like to acknowledge discussions and feedback from Vladimir Kol-
mogorov and Pushmeet Kohli. Vladimir also kindly made several modifications of his code of [5] that allowed us to reuse network flows more efficiently.
References
1. Agarwal, S., Chandaker, M., Kahl, F., Kriegman, D., Belongie, S.: Practical Global
Optimization for Multiview Geometry. In: Leonardis, A., Bischof, H., Pinz, A.
(eds.) ECCV 2006. LNCS, vol. 3951. Springer, Heidelberg (2006)
2. Boros, E., Hammer, P.: Pseudo-boolean optimization. Discrete Applied Mathemat-
ics 123(1-3) (2002)
3. Boykov, Y., Jolly, M.-P.: Interactive Graph Cuts for Optimal Boundary and Region
Segmentation of Objects in N-D Images. In: ICCV 2001 (2001)
4. Boykov, Y., Kolmogorov, V.: Computing Geodesics and Minimal Surfaces via
Graph Cuts. In: ICCV 2003 (2003)
5. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow
Algorithms for Energy Minimization in Vision. PAMI 26(9) (2004)
6. Bray, M., Kohli, P., Torr, P.: PoseCut: Simultaneous Segmentation and 3D Pose
Estimation of Humans Using Dynamic Graph-Cuts. In: Leonardis, A., Bischof, H.,
Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952. Springer, Heidelberg (2006)
7. Chan, T., Vese, L.: Active contours without edges. Trans. Image Process 10(2)
(2001)
8. Clausen, J.: Branch and Bound Algorithms - Principles and Examples. Parallel
Computing in Optimization (1997)
9. Cremers, D., Osher, S., Soatto, S.: Kernel Density Estimation and Intrinsic Align-
ment for Shape Priors in Level Set Segmentation. IJCV 69(3) (2006)
10. Cremers, D., Schmidt, F., Barthel, F.: Shape Priors in Variational Image Segmen-
tation: Convexity, Lipschitz Continuity and Globally Optimal Solutions. In: CVPR
2008 (2008)
11. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for
binary images. Journal of the Royal Statistical Society 51(2) (1989)
12. Felzenszwalb, P.: Representation and Detection of Deformable Shapes. PAMI 27(2)
(2005)
13. Freedman, D., Zhang, T.: Interactive Graph Cut Based Segmentation with Shape
Priors. In: CVPR 2005 (2005)
14. Gavrila, D., Philomin, V.: Real-Time Object Detection for “Smart” Vehicles. In:
ICCV 1999 (1999)
15. Huang, R., Pavlovic, V., Metaxas, D.: A graphical model framework for coupling
MRFs and deformable models. In: CVPR 2004 (2004)
16. Kim, J., Zabih, R.: A Segmentation Algorithm for Contrast-Enhanced Images.
ICCV 2003 (2003)
17. Kohli, P., Torr, P.: Efficiently Solving Dynamic Markov Random Fields Using Graph
Cuts. In: ICCV 2005 (2005)
18. Kolmogorov, V., Zabih, R.: What Energy Functions Can Be Minimized via Graph
Cuts. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002.
LNCS, vol. 2352. Springer, Heidelberg (2002)
19. Kolmogorov, V., Boykov, Y., Rother, C.: Applications of Parametric Maxflow in
Computer Vision. In: ICCV 2007 (2007)
20. Pawan Kumar, M., Torr, P., Zisserman, A.: OBJ CUT. In: CVPR 2005 (2005)
21. Lampert, C., Blaschko, M., Hofman, T.: Beyond Sliding Windows: Object Local-
ization by Efficient Subwindow Search. In: CVPR 2008 (2008)
22. Leibe, B., Leonardis, A., Schiele, B.: Robust Object Detection with Interleaved
Categorization and Segmentation. IJCV 77(3) (2008)
23. Lempitsky, V., Blake, A., Rother, C.: Image Segmentation by Branch-and-Mincut.
Microsoft Technical Report MSR-TR-2008-100 (July 2008)
24. Leventon, M., Grimson, E., Faugeras, O.: Statistical Shape Influence in Geodesic
Active Contours. In: CVPR 2000 (2000)
25. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extrac-
tion using iterated graph cuts. ACM Trans. Graph. 23(3) (2004)
26. Schoenemann, T., Cremers, D.: Globally Optimal Image Segmentation with an
Elastic Shape Prior. In: ICCV 2007 (2007)
27. Wang, Y., Staib, L.: Boundary Finding with Correspondence Using Statistical
Shape Models. In: CVPR 1998 (1998)
What Is a Good Image Segment?
A Unified Approach to Segment Extraction
1 Introduction
One of the most fundamental vision tasks is image segmentation; the attempt to
group image pixels into visually meaningful segments. However, the notion of a
“visually meaningful” image segment is quite complex. There is a huge diversity
in possible definitions of what is a good image segment, as illustrated in Fig. 1.
In the simplest case, a uniform colored region may be a good image segment
(e.g., the flower in Fig. 1.a). In other cases, a good segment might be a textured
region (Fig. 1.b, 1.c) or semantically meaningful layers composed of disconnected
regions (Fig. 1.c) and all the way to complex objects (Fig. 1.e, 1.f).
The diversity in segment types has led to a wide range of approaches for
image segmentation: Algorithms for extracting uniformly colored regions (e.g.,
[1,2]), algorithms for extracting textured regions (e.g., [3,4]), algorithms for ex-
tracting regions with a distinct empirical color distribution (e.g., [5,6,7]). Some
algorithms employ symmetry cues for image segmentation (e.g., [8]), while others
use high-level semantic cues provided by object classes (i.e., class-based segmen-
tation, see [9,10,11]). Some algorithms are unsupervised (e.g., [2]), while others
require user interaction (e.g., [7]). There are also variants in the segmentation
Author names are ordered alphabetically due to equal contribution.
Fig. 2. Segmentation by composition: A good segment $S$ (e.g., the butterfly or the dome) can be easily composed of other regions in the segment. Regions $R_1, R_2$ are composed from other corresponding regions in $S$ (using transformations $T_1, T_2$, respectively).
Fig. 3. Notations: $Seg = \langle S, \bar{S}, \partial S \rangle$ denotes a figure-ground segmentation. $S$ is the foreground segment, $\bar{S}$ (its complement) is the background, and $\partial S$ is the boundary of the segment.
a unified non-parametric score for segment quality. Our unified score captures
a wide range of segment types: uniformly colored segments, through textured
segments, and even complex objects. We further present a simple interactive
segment extraction algorithm, which optimizes our score – i.e., given a single
point marked by the user, the algorithm extracts the “best” image segment con-
taining that point. This in turn induces a figure-ground segmentation of the
image. We provide results demonstrating the applicability of our score and al-
gorithm to a diversity of segment types and segmentation tasks. The rest of
this paper is organized as follows: In Sec. 2 we explain the basic concept behind
our “Segmentation-by-Composition” approach for evaluating the visual quality
of image segments. Sec. 3 provides the theoretical formulation of our unified
segment quality score. Sec. 4 then describes our figure-ground segmentation
algorithm, and experimental results are provided in Sec. 5.
Examining the image segments of Fig. 1, we note that good segments of signifi-
cantly different types share a common property: Given any point within a good
image segment, it is easy to compose (“describe”) its surrounding region using
other chunks of the same segment (like a ‘jigsaw puzzle’), whereas it is difficult to
compose it using chunks from the remaining parts of the image. This is trivially
true for uniformly colored and textured segments (Fig. 1.a, 1.b, 1.c), since each
portion of the segment (e.g., the dome) can be easily synthesized using other
portions of the same segment (the dome), but difficult to compose using chunks
from the remaining parts of the image (the sky). The same property carries over to
more complex structured segments, such as the compound puffin segment in
Fig. 1.f. The surrounding region of each point in the puffin segment is easy to
“describe” using portions of other puffins. The existence of several puffins in the
image provides ‘visual evidence’ that the co-occurrence of different parts (orange
beak, black neck, white body, etc.) is not coincidental, and all belong to a single
compound segment. Similarly, one half of a complex symmetric object (e.g., the
butterfly of Fig. 1.d, the man of Fig. 1.e) can be easily composed using its other
half, providing visual evidence that these parts go together. Moreover, the simpler
the segment composition (i.e., the larger the puzzle pieces), the stronger the
evidence that all these parts together form a single segment. Thus, the entire
man of Fig. 1.e forms a better single segment than his pants or shirt alone.
The ease of describing (composing) an image in terms of pieces of another
image was defined by [14], and used there in the context of image similarity.
The pieces used for composition are structured image regions (as opposed to un-
structured ‘bags’/distributions of pointwise features/descriptors, e.g., as in [5,7]).
Those structured regions, of arbitrary shape and size, can undergo a global geo-
metric transformation (e.g., translation, rotation, scaling) with additional small
local non-rigid deformations. We employ the composition framework of [14] for
the purpose of image segmentation. We define a “good image segment” S as
one that is easy to compose (non-trivially) using its own pieces, while difficult to
compose from the remaining parts of the image $\bar{S} = I \setminus S$. An "easy" composition
consists of a few large image regions, whereas a "difficult" composition consists
of many small fragments. A segment composition induces a description of the
segment, with a corresponding "description length". The easier the composition,
the shorter the description length. The ease of composing $S$ from its own pieces
is formulated in Sec. 3 in terms of the description length $\mathrm{DL}(S \mid S)$. This is
contrasted with the ease of composing $S$ from pieces of the remaining image parts
$\bar{S}$, which is captured by $\mathrm{DL}(S \mid \bar{S})$. This gives rise to a "segment quality score"
$\mathrm{Score}(S)$, which is measured by the difference between these two description
lengths: $\mathrm{Score}(S) = \mathrm{DL}(S \mid \bar{S}) - \mathrm{DL}(S \mid S)$.
Our definition of a “good image segment” will maximize this difference in de-
scription lengths. Any deviation from the optimal segment S will reduce this differ-
ence, and accordingly decrease Score (S). For example, the entire dome in Fig. 1.b
is an optimal image segment S; it is easy to describe non-trivially in terms of its own
pieces (see Fig. 2), and difficult to describe in terms of the background sky. If, however,
we were to define the segment $S$ to be only a smaller part of the dome, then the
background $\bar{S}$ would contain the sky along with the parts of the dome excluded from
$S$. Consequently, this would decrease $\mathrm{DL}(S \mid \bar{S})$, and therefore $\mathrm{Score}(S)$ would
decrease. It can similarly be shown that $\mathrm{Score}(S)$ would decrease if we were to define
an $S$ that is larger than the dome and also contains parts of the sky. Note that unlike
previous simplistic formulations of segment description length (e.g., entropy of simple
color distributions [5]), our composition-based description length can also capture
complex structured segments.
A good figure-ground segmentation $Seg = \langle S, \bar{S}, \partial S \rangle$ (see Fig. 3) partitions
the image into a foreground segment $S$ and a background segment $\bar{S}$, where at
least one of these two segments (and hopefully both) is a 'good image segment'
according to the definition above. Moreover, we expect the segment boundary ∂S
of a good figure-ground segmentation to coincide with meaningful image edges.
Boiman and Irani [14] further employed the composition framework for coarse
grouping of repeating patterns. Our work builds on top of [14], providing a gen-
eral segment quality score and a corresponding image segmentation algorithm
that handle a large diversity of segment types and can be used for various
segmentation tasks. Although general, our unified segmentation framework
does not require any pre-definition or modelling of segment types (in contrast to
the unified framework of [13]).
3 Theoretical Formulation
Because there are many possible partitions of Q into regions, the right-hand side
of (2) is marginalized over all possible partitions in [14].
$p(Q \mid Ref)\,/\,p(Q \mid H_0)$ is the likelihood ratio between the 'ease' of generating
Q from Ref and the ease of generating Q using a "random process" $H_0$ (e.g.,
a default image distribution). Noting that the optimal (Shannon) description
length of a random variable x is $\mathrm{DL}(x) \equiv -\log p(x)$ [15], Boiman and Irani [14]
defined their compositional similarity score as $\log\!\big(p(Q \mid Ref)/p(Q \mid H_0)\big) =
\mathrm{DL}(Q \mid H_0) - \mathrm{DL}(Q \mid Ref)$, i.e., the "savings" in the number of bits obtained
by describing Q as composed from regions in Ref vs. the 'default' number of
bits required to describe Q using $H_0$. The larger the regions $R_i$ composing Q,
the higher the savings in description length. High savings in description length
provide strong statistical evidence for the similarity of Q to Ref.
In order to avoid the computationally-intractable marginalization over all pos-
sible query partitions, the following approximation was derived in [14]:
$$\mathrm{DL}(Q \mid H_0) - \mathrm{DL}(Q \mid Ref) \;\approx\; \sum_{i \in Q} PES(i \mid Ref) \qquad (3)$$
We next show that our segment quality score, Score (S), has an interesting
information-theoretic interpretation, which reduces in special sub-cases to com-
monly used information-theoretic measures. Let us first examine the simple case
where the composition of a segment S is restricted to degenerate one-pixel sized
regions Ri . In this case, p (Ri |Ref = S) in (1) reduces to the frequency of the
color of the pixel Ri inside S (given by the color histogram of S). Using (2) with
one-pixel sized regions Ri , the description length DL (S|Ref = S) reduces to:
$$\mathrm{DL}(S \mid Ref{=}S) = -\log p(S \mid Ref{=}S) = -\log \prod_{i \in S} p(R_i \mid Ref{=}S) = -\sum_{i \in S} \log p(R_i \mid Ref{=}S) = |S| \cdot \hat{H}(S)$$
where $\hat{H}(S)$ is the empirical entropy¹ of the regions $\{R_i\}$ composing $S$, which is
the color entropy of $S$ in case of one-pixel sized $R_i$. Similarly, $\mathrm{DL}(S \mid Ref{=}\bar{S}) =
-\sum_{i \in S} \log p(R_i \mid Ref{=}\bar{S}) = |S| \cdot \hat{H}(S, \bar{S})$, where $\hat{H}(S, \bar{S})$ is the empirical
cross-entropy of regions $R_i \subset S$ in $\bar{S}$ (which reduces to the color cross-entropy
in case of one-pixel sized $R_i$). Using these observations, $\mathrm{Score}(S)$ of (5) reduces
to the empirical KL divergence between the region distributions of $S$ and $\bar{S}$:
$$\mathrm{Score}(S) = \mathrm{DL}(S \mid \bar{S}) - \mathrm{DL}(S \mid S) = |S| \cdot \big(\hat{H}(S, \bar{S}) - \hat{H}(S)\big) = |S| \cdot \mathrm{KL}(S, \bar{S})$$
¹ The empirical entropy of the sample $x_1, \ldots, x_n$ is $\hat{H}(x) = -\frac{1}{n}\sum_i \log p(x_i)$, which approaches the statistical entropy $H(x)$ as $n \to \infty$.
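For concreteness, the degenerate one-pixel case just derived can be written as a few lines of code. The snippet below is only an illustrative sketch (not the authors' implementation); it assumes the image has been quantized into n_colors color bins and that a boolean foreground mask for S is available, and it computes |S| times the empirical KL divergence between the color histograms of S and its complement.

import numpy as np

def one_pixel_score(quantized_image, fg_mask, n_colors, eps=1e-12):
    """Score(S) = |S| * KL(hist(S) || hist(S_bar)) for one-pixel regions R_i."""
    fg = quantized_image[fg_mask]         # color bins of pixels inside S
    bg = quantized_image[~fg_mask]        # color bins of pixels inside S_bar
    p = np.bincount(fg, minlength=n_colors) / max(len(fg), 1)   # empirical distribution of S
    q = np.bincount(bg, minlength=n_colors) / max(len(bg), 1)   # empirical distribution of S_bar
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)))        # KL(S, S_bar)
    return len(fg) * kl                                         # |S| * KL(S, S_bar)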
Fig. 5. Our result vs. GrabCut [7]. GrabCut fails to segment the butterfly (fore-
ground) due to the similar colors of the flowers in the background. Using composition
with arbitrarily shaped regions, our algorithm accurately segments the butterfly. We
used the GrabCut implementation of www.cs.cmu.edu/~mohitg/segmentation.htm
union of these regions, along with their corresponding reference regions, is used
as a crude initialization, S0 , of the segment S (see Fig. 7.c for an example).
During the iterative process, similar regions may have conflicting labels. Due
to the EM-like iterations, such regions may simultaneously flip their labels, and
fail to converge (since each such region provides “evidence” for the other to flip
its label). Therefore, in each iteration, we perform two types of steps successively:
(i) an “expansion” step, in which only background pixels in St are allowed to flip
their label to foreground pixels. (ii) a “shrinking” step, in which only foreground
pixels in St are allowed to flip their label to background pixels. Fig. 7 shows a
few steps in the iterative process, from initialization to convergence.
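The structure of this iterative process can be summarized by the following schematic sketch. The functions expansion_step and shrinking_step are hypothetical placeholders for the paper's label-flip criteria, which are not reproduced in this excerpt; the sketch only illustrates the alternation of the two step types and the convergence test, assuming numpy boolean masks.

def extract_segment(image, s0, expansion_step, shrinking_step, max_iters=50):
    """Alternate expansion and shrinking of the foreground mask until no pixel flips."""
    s_t = s0.copy()                                  # S_0: crude initial foreground mask
    for _ in range(max_iters):
        # (i) expansion: only background pixels may flip their label to foreground
        s_expanded = s_t | expansion_step(image, s_t)
        # (ii) shrinking: only foreground pixels may flip their label to background
        s_next = s_expanded & ~shrinking_step(image, s_expanded)
        if (s_next == s_t).all():                    # converged: no label changed
            break
        s_t = s_next
    return s_t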
5 Results
We applied our segment extraction algorithm to a variety of segment types and
segmentation tasks, using images from several segmentation databases [19,20,7].
In each case, a single point-of-interest was marked (a green cross in the fig-
ures). The algorithm extracted the “best” image segment containing that point
(highlighted in red). Higher resolution images and many more results can be
found in www.wisdom.weizmann.ac.il/∼vision/GoodSegment.html.
Single-Image Segmentation. Fig. 10 demonstrates the capability of our approach
to handle a variety of different segment types: uniformly colored segments
(Fig. 10.f), complex textured segments (Fig. 10.h), and complex symmetric
objects (e.g., the butterfly in Fig. 5, the man in Fig. 1.e). More complex objects
can also be segmented (e.g., the non-symmetric person in Fig. 10.b, or the puffins
in Fig. 10.g), resulting from combinations of different types of transformations $T_i$
for different regions $R_i$ within the segment, and different types of descriptors.
We further evaluated our algorithm on the benchmark database of [19], which
consists of 100 images depicting a single object in front of a background, with
ground-truth human segmentation. The total F-measure score of our algorithm
was $0.87 \pm 0.01$ ($F = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$), which is state-of-the-art on this database.
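For reference, the F-measure used above can be computed from a predicted mask and a ground-truth mask as follows; this is a small helper of ours, not part of the benchmark code of [19].

import numpy as np

def f_measure(pred_mask, gt_mask):
    """F = 2 * Recall * Precision / (Recall + Precision) for binary masks."""
    tp = np.logical_and(pred_mask, gt_mask).sum()   # true-positive pixel count
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)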
References
1. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space anal-
ysis. PAMI (2002)
2. Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
3. Malik, J., Belongie, S., Shi, J., Leung, T.K.: Textons, contours and regions: Cue
integration in image segmentation. In: ICCV (1999)
4. Galun, M., Sharon, E., Basri, R., Brandt, A.: Texture segmentation by multiscale
aggregation of filter responses and shape elements. In: ICCV (2003)
5. Kadir, T., Brady, M.: Unsupervised non-parametric region segmentation using level
sets. In: ICCV (2003)
6. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM TOG (2004)
7. Rother, C., Kolmogorov, V., Blake, A.: “grabcut”: Interactive foreground extrac-
tion using iterated graph cuts. In: SIGGRAPH (2004)
8. Riklin-Raviv, T., Kiryati, N., Sochen, N.: Segmentation by level sets and symmetry.
In: CVPR (2006)
9. Borenstein, E., Ullman, S.: Class-specific, top-down segmentation. In: Heyden, A.,
Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351. Springer,
Heidelberg (2002)
10. Leibe, B., Schiele, B.: Interleaved object categorization and segmentation. In:
BMVC (2003)
11. Levin, A., Weiss, Y.: Learning to combine bottom-up and top-down segmenta-
tion. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954.
Springer, Heidelberg (2006)
12. Rother, C., Minka, T., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs
by histogram matching - incorporating a global constraint into mrfs. In: CVPR
(2006)
13. Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: Unifying segmentation,
detection, and recognition. IJCV (2005)
14. Boiman, O., Irani, M.: Similarity by composition. In: NIPS (2006)
15. Cover, T.M., Thomas, J.A.: Elements of information theory. Wiley, Chichester
(1991)
16. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to detect natural image bound-
aries using local brightness, color, and texture cues. PAMI (2004)
17. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via
graph cuts. PAMI (2001)
18. Boiman, O., Irani, M.: Detecting irregularities in images and in video. IJCV (2007)
19. Alpert, S., Galun, M., Basri, R., Brandt, A.: Image segmentation by probabilistic
bottom-up aggregation and cue integration. In: CVPR (2007)
20. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In: ICCV (2001)
Light-Efficient Photography
1 Introduction
Two of the most important choices when taking a photo are the photo’s exposure level
and its depth of field. Ideally, these choices will result in a photo whose subject is
free of noise or pixel saturation [1,2], and appears in-focus. These choices, however,
come with a severe time constraint: in order to take a photo that has both a specific
exposure level and a specific depth of field, we must expose the camera’s sensor for
a length of time dictated by the optics of the lens. Moreover, the larger the depth of
field, the longer we must wait for the sensor to reach the chosen exposure level. In
practice, this makes it impossible to efficiently take sharp and well-exposed photos of
a poorly-illuminated subject that spans a wide range of distances from the camera. To
get a good exposure level, we must compromise something – accepting either a smaller
depth of field (incurring defocus blur [3,4,5,6]) or a longer exposure (incurring motion
blur [7,8,9]).
In this paper we seek to overcome the time constraint imposed by lens optics, by
capturing a sequence of photos rather than just one. We show that if the aperture, ex-
posure time, and focus setting of each photo is selected appropriately, we can span a
given depth of field with a given exposure level in less total time than it takes to expose
a single photo (Fig. 1). This novel observation is based on a simple fact: even though
wide apertures have a narrow depth of field (DOF), they are much more efficient than
narrow apertures in gathering light from within their depth of field. Hence, even though
This work was supported in part by the Natural Sciences and Engineering Research Council of
Canada under the RGPIN program and by an Ontario Premier’s Research Excellence Award.
Fig. 1. Left: Traditional single-shot photography. The desired depth of field is shaded (red). Right:
Light-efficient photography. Two wide-aperture photos span the same DOF as a single-shot
narrow-aperture photo. Each wide-aperture photo requires 1/4 the time to reach the exposure
level of the single-shot photo, resulting in a 2× net speedup for the total exposure time.
it is not possible to span a wide DOF with a single wide-aperture photo, it is possible to
span it with several of them, and to do so very efficiently.
Using this observation as a starting point, we develop a general theory of light-
efficient photography that addresses four questions: (1) Under what conditions is
capturing photo sequences with “synthetic” DOFs more efficient than single-shot pho-
tography? (2) How can we characterize the set of sequences that are globally optimal for
a given DOF and exposure level, i.e. whose total exposure time is the shortest possible?
(3) How can we compute such sequences automatically for a specific camera, depth of
field, and exposure level? (4) Finally, how do we convert the captured sequence into a
single photo with the specified depth of field and exposure level?
Little is known about how to gather light efficiently from a specified DOF. Research
on computational photography has not investigated the light-gathering ability of ex-
isting methods, and has not considered the problem of optimizing exposure time for
a desired DOF and exposure level. For example, even though there has been great
interest in manipulating a camera’s DOF through optical [10,11,12,13] or computa-
tional [5,14,15,16,17,18,2] means, current approaches do so without regard to exposure
time – they simply assume that the shutter remains open as long as necessary to reach
the desired exposure level. This assumption is also used for high-dynamic range pho-
tography [19,2], where the shutter must remain open for long periods in order to capture
low-radiance regions in a scene. In contrast, here we capture photos with camera set-
tings that are carefully chosen to minimize total exposure time for the desired DOF and
exposure level.
Since shorter total exposure times reduce motion blur, our work can be thought of
as complementary to recent synthetic shutter approaches whose goal is to reduce such
blur. Instead of controlling aperture and focus, these techniques divide a given exposure
interval into several shorter ones, with the same total exposure (e.g., n photos, each with
1/n the exposure time [9]; two photos, one with long and one with short exposure [8];
or one photo where the shutter opens and closes intermittently during the exposure [7]).
These techniques do not increase light-efficiency but can be readily combined with our
work, to confer the advantages of both methods.
[Fig. 2 plot: constant-exposure curves of pairs (τ, D); horizontal axis: exposure time (s), with curves for scenes ranging from very bright to very dark and the DOF corresponding to each aperture diameter D marked.]
Fig. 2. Each curve represents all pairs (τ, D) for which τ D2 = L∗ in a specific scene. Shaded
zones correspond to pairs outside the camera limits (valid settings were τ ∈ [1/8000 s, 30 s] and
D ∈ [f /16, f /1.2] with f = 85 mm). Also shown is the DOF corresponding to each diameter D.
The maximum acceptable blur was set to c = 25 µm, or about 3 pixels in our camera. Different
curves represent scenes with different average radiance (relative units shown in brackets).
other hand, affects the photo’s depth of field (DOF), i.e., the range of distances where
scene points do not appear out of focus. These side-effects lead to an important tradeoff
between a photo’s exposure time and its depth of field (Fig. 2):
Exposure Time vs. Depth of field Tradeoff: We can either achieve a desired
exposure level L∗ with short exposure times and a narrow DOF, or with long
exposure times and a wide DOF.
In practice, the exposure time vs. DOF tradeoff limits the range of scenes that can be
photographed at a given exposure level (Fig. 2). This range depends on scene radiance,
the physical limits of the camera (i.e., range of possible apertures and shutter speeds),
as well as subjective factors (i.e., acceptable levels of motion blur and defocus blur).
Our goal is to “break” this tradeoff by seeking novel photo acquisition strategies that
capture a given depth of field at the desired exposure level L∗ much faster than tradi-
tional optics would predict. We briefly describe below the basic geometry and relations
governing a photo’s depth of field, as they are particularly important for our analysis.
Fig. 3. (a) Blur geometry for a thin lens. (b) Blur diameter as a function of distance to a scene
point. The plot is for a lens with f = 85 mm, focused at 117 cm with an aperture diameter
of 5.31 mm (i.e., an f /16 aperture in photography terminology). (c) Blur diameter and DOF
represented in the space of focus settings.
Table 1. Eqs. (A)–(F): Basic equations governing focus and DOFs for the thin-lens model
(A) Thin lens law: $\frac{1}{v} + \frac{1}{d} = \frac{1}{f}$
(B) Focus setting for distance $d$: $v = \frac{fd}{d-f}$
(C) Blur diameter for distance $d'$ (focus at $d$): $b = D\,\frac{f\,|d'-d|}{d'\,(d-f)}$
(D) Aperture diameter for DOF $[\alpha,\beta]$: $D = c\,\frac{\beta+\alpha}{\beta-\alpha}$
(E) Focus setting for DOF $[\alpha,\beta]$: $v = \frac{2\alpha\beta}{\alpha+\beta}$
(F) DOF for aperture diameter $D$, focus $v$: $\alpha,\beta = \frac{Dv}{D \pm c}$
For a given aperture and focus setting, the depth of field is the interval of distances
in the scene whose blur diameter is below a maximum acceptable size c (Fig. 3b).
Since every distance in the scene corresponds to a unique focus setting (Eq. (B)),
every DOF can also be expressed as an interval [α, β] in the space of focus settings.
This alternate DOF representation gives us especially simple relations for the aperture
and focus setting that produce a given DOF (Eqs. (D) and (E)) and, conversely, for
the DOF produced by a given aperture and focus setting (Eq. (F)). We adopt this DOF
representation for the rest of the paper (Fig. 3c).
A key property of the depth of field is that it shrinks when the aperture diameter
increases: from Eq. (C) it follows that for a given out-of-focus distance, larger apertures
always produce larger blur diameters. This equation is the root cause of the exposure
time vs. depth of field tradeoff.
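As a numerical companion to Eqs. (B) and (D)–(F), as reconstructed in Table 1, the following sketch converts between a desired DOF [α, β] in focus-setting space and the aperture/focus pair that produces it. It is illustrative only (function names are ours), and all lengths are assumed to be in the same units (e.g., mm).

def focus_for_distance(d, f):
    """Eq. (B): focus setting v that brings a scene distance d into perfect focus."""
    return f * d / (d - f)

def aperture_and_focus_for_dof(alpha, beta, c):
    """Eqs. (D)-(E): aperture diameter D and focus setting v spanning DOF [alpha, beta]."""
    D = c * (beta + alpha) / (beta - alpha)
    v = 2 * alpha * beta / (alpha + beta)
    return D, v

def dof_for_aperture_and_focus(D, v, c):
    """Eq. (F): DOF endpoints [alpha, beta] produced by aperture D focused at v."""
    return D * v / (D + c), D * v / (D - c)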
$$\tau_{\mathrm{one}} = L^* \cdot \left(\frac{\beta - \alpha}{c\,(\beta + \alpha)}\right)^{2}. \qquad (2)$$
The key idea of our approach is that while lens optics do not allow us to reduce this
time without compromising the DOF or the exposure level, we can reduce it by taking
more photos. This is based on a simple observation that takes advantage of the different
rates at which exposure time and DOF change: if we increase the aperture diameter
and adjust exposure time to maintain a constant exposure level, its DOF shrinks (at a
rate of about 1/D), but the exposure time shrinks much faster (at a rate of 1/D2 ). This
opens the possibility of “breaking” the exposure time vs. DOF tradeoff by capturing a
sequence of photos that jointly span the DOF in less total time than τ one (Fig. 1).
Our goal is to study this idea in its full generality, by finding capture strategies that
are provably time-optimal. We therefore start from first principles, by formally defining
the notion of a capture sequence and of its synthetic depth of field:
Definition 1 (Photo Tuple). A tuple $\langle D, \tau, v \rangle$ that specifies a photo's aperture diameter, exposure time, and focus setting, respectively.
Definition 2 (Capture Sequence). A finite ordered sequence of photo tuples.
Definition 3 (Synthetic Depth of Field). The union of DOFs of all photo tuples in a
capture sequence.
We will use two efficiency measures: the total exposure time of a sequence is the sum
of the exposure times of all its photos; the total capture time, on the other hand, is the
actual time it takes to capture the photos with a specific camera. This time is equal to
the total exposure time, plus any overhead caused by camera internals (computational
and mechanical). We now consider the following general problem:
Light-Efficient Photography: Given a set D of available aperture diameters,
construct a capture sequence such that: (1) its synthetic DOF is equal to [α, β];
(2) all its photos have exposure level L∗ ; (3) the total exposure time (or capture
time) is smaller than τ one ; and (4) this time is a global minimum over all finite
capture sequences.
Intuitively, whenever such a capture sequence exists, it can be thought of as being opti-
mally more efficient than single-shot photography in gathering light. Below we analyze
three instances of the light-efficient photography problem. In all cases, we assume that
the exposure level L∗ , depth of field [α, β], and aperture set D are known and fixed.
Noise Properties. All photos we consider have similar noise, because most noise
sources (photon, sensor, and quantization noise) depend only on exposure level, which
we hold constant. The only exception is thermal noise, which increases with exposure
time [1], and so will be lower for light-efficient sequences with shorter exposures.
capture sequence has an especially simple form – it is unique, it uses the same aperture
diameter for all tuples, and this diameter is either the maximum possible or a diameter
close to that maximum.
More specifically, consider the following special class of capture sequences:
Definition 4 (Sequences with Sequential DOFs). A capture sequence has sequential
DOFs if for every pair of adjacent photo tuples, the right endpoint of the first tuple’s
DOF is the left endpoint of the second.
The following theorem states that the solution to the light-efficient photography prob-
lem is a specific sequence from this class:
Theorem 1 (Optimal Capture Sequence for Continuous Apertures). (1) If the DOF
endpoints satisfy $\beta < (7 + 4\sqrt{3})\,\alpha$, the sequence that globally minimizes total exposure
time is a sequence with sequential DOFs whose tuples all have the same aperture. (2)
Define $D(k)$ and $n$ as follows:
$$D(k) = c\,\frac{\sqrt[k]{\beta} + \sqrt[k]{\alpha}}{\sqrt[k]{\beta} - \sqrt[k]{\alpha}}\,, \qquad n = \left\lfloor \frac{\log\frac{\alpha}{\beta}}{\log\frac{D_{\max} - c}{D_{\max} + c}} \right\rfloor. \qquad (3)$$
Theorem 1 specifies the optimal sequence indirectly, via a “recipe” for calculating the
optimal length and the optimal aperture diameter (Eqs. (3) and (4)). Informally, this
calculation involves three steps. The first step defines the quantity D(k); in our proof of
Theorem 1 (see Appendix A), we show that this quantity represents the only aperture
diameter that can be used to “tile” the interval [α, β] with exactly k photo tuples of the
same aperture. The second step defines the quantity n; in our proof, we show that this
represents the largest number of photos we can use to tile the interval [α, β] with photo
tuples of the same aperture. The third step involves choosing between two “candidates”
for the optimal solution – one with n tuples and one with n + 1.
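This recipe can be sketched in a few lines (illustrative only). D(k) and n follow Eq. (3) as reconstructed above; since Eq. (4) is not reproduced in this excerpt, the final comparison between the n- and (n+1)-photo candidates below simply uses the constant-exposure relation τ·D² = L* from Fig. 2, which is an assumption on our part.

import math

def tile_aperture(alpha, beta, c, k):
    """D(k): aperture diameter that tiles [alpha, beta] with k equal-aperture DOFs."""
    rb, ra = beta ** (1.0 / k), alpha ** (1.0 / k)
    return c * (rb + ra) / (rb - ra)

def optimal_sequence(alpha, beta, c, d_max, l_star):
    # n: largest number of same-aperture photos usable without exceeding d_max
    n = math.floor(math.log(alpha / beta) /
                   math.log((d_max - c) / (d_max + c)))
    best = None
    for k in (n, n + 1):                            # the two candidate sequence lengths
        if k < 1:
            continue
        d = min(tile_aperture(alpha, beta, c, k), d_max)
        total_time = k * l_star / d ** 2            # per-photo time tau = L* / D^2 (assumed)
        if best is None or total_time < best[0]:
            best = (total_time, k, d)
    return best   # (total exposure time, number of photos, aperture diameter)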
Theorem 1 makes explicit the somewhat counter-intuitive fact that the most light-
efficient way to span a given DOF [α, β] is to use images whose DOFs are very narrow.
This fact applies broadly, because Theorem 1’s inequality condition for α and β is
satisfied for all lenses for consumer photography that we are aware of (e.g., see [22]).2
See Fig. 4 for an application of this theorem to a practical example.
Note that Theorem 1 specifies the number of tuples in the optimal sequence and
their aperture diameter, but does not specify their exposure times or focus settings.
The following lemma shows that specifying those quantities is not necessary because
they are determined uniquely. Importantly, Lemma 1 gives us a recursive formula for
computing the exposure time and focus setting of each tuple in the sequence:
² To violate the condition, the minimum focusing distance must be under 1.077f, measured from the lens center.
Fig. 5. Optimal light-efficient photography with discrete apertures, shown for a Canon EF85mm
1.2L lens (23 apertures, illustrated in different colors). (a) For a depth of field whose left endpoint
is α, we show optimal capture sequences for a range of relative DOF sizes α/β. These sequences
can be read horizontally, with subintervals corresponding to the apertures determined by Theo-
rem 2. Note that when the DOF is large, the optimal sequence approximates the continuous case.
The diagonal dotted line indicates the DOF to be spanned. (b) Visualizing the optimal capture se-
quence as a function of the camera overhead for the DOF [α, β]. Note that with higher overhead,
the optimal sequence involves fewer photos with larger DOFs (i.e., smaller apertures).
See [24] for a proof. As with Theorem 1, Theorem 2 does not specify the focus settings
in the optimal capture sequence. We use Lemma 1 for this purpose, which explicitly
constructs it from the apertures and their multiplicities.
While it is not possible to obtain a closed-form expression for the optimal sequence,
solving the integer program for any desired DOF is straightforward. We use a simple
branch-and-bound method based on successive relaxations to linear programming [23].
Moreover, since the optimal sequence depends only on the relative DOF size α/β, we
pre-compute it for all possible DOFs and store the results in a lookup table (Fig. 5a).
Clearly, a non-negligible overhead penalizes long capture sequences and reduces the
synthetic DOF advantage. Despite this, Fig. 5b shows that synthetic DOFs offer
significant speedups even for current off-the-shelf cameras. These speedups will be
amplified further as camera manufacturers continue to improve frame-per-second rates.
Synthesizing Photos for Novel Focus Settings and Aperture Diameters. To synthe-
size novel photos, we generalize DOF compositing and take advantage of the different
levels of defocus throughout the capture sequence. We proceed in four basic steps. First,
given a specific focus and aperture setting, we use Eq. (C) and the coarse depth map
to assign a blur diameter to each pixel in the final composite. Second, we use Eq. (C)
again to determine, for each pixel in the composite, the input photo whose blur diameter
at that pixel's depth most closely matches this target.³ Third, for each depth layer,
we synthesize a photo under the assumption that the entire scene is at that depth, and is
observed with the novel focus and aperture setting. To do this, we use the blur diameter
for this depth to define an interpolation between two of the input photos. We currently
interpolate using simple linear cross-fading, which we found to be adequate when the
DOF is sampled densely enough (i.e., with 5 or more images). Fourth, we generate the
final composite by merging all these synthesized images into one photo using the same
gradient-domain blending as in DOF compositing, with the same depth labels.
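A simplified sketch of these four steps is given below. It is not the authors' implementation: the per-depth blur assignment uses Eq. (C) together with the focus-distance relation of Eq. (B), the per-layer blur values of the input photos (input_blurs_per_depth) are assumed to be precomputed, and the gradient-domain blending of step four is replaced by simple per-pixel selection.

import numpy as np

def blur_diameter(scene_d, focus_v, aperture_D, f):
    """Eq. (C): blur diameter of a point at distance scene_d for focus setting focus_v."""
    d_focus = f * focus_v / (focus_v - f)            # in-focus distance (inverse of Eq. (B))
    return aperture_D * f * abs(scene_d - d_focus) / (scene_d * (d_focus - f))

def synthesize(inputs, input_blurs_per_depth, depth_map, depths, focus_v, aperture_D, f):
    """inputs: aligned photos; depth_map: per-pixel depth-layer index into depths."""
    out = np.zeros_like(inputs[0], dtype=float)
    for k, d in enumerate(depths):
        target_b = blur_diameter(d, focus_v, aperture_D, f)          # step 1 (per depth layer)
        blurs = np.array([input_blurs_per_depth[i][k] for i in range(len(inputs))])
        i0, i1 = np.argsort(np.abs(blurs - target_b))[:2]            # step 2: two closest inputs
        denom = blurs[i1] - blurs[i0]
        w = 0.0 if denom == 0 else float(np.clip((target_b - blurs[i0]) / denom, 0.0, 1.0))
        layer = (1 - w) * inputs[i0] + w * inputs[i1]                # step 3: linear cross-fade
        out[depth_map == k] = layer[depth_map == k]                  # step 4 (simplified blending)
    return out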
6 Experimental Results
Figure 6 shows results and timings for two experiments, performed with two differ-
ent cameras – a high-end digital SLR and a compact digital camera (see [24] for more
results and videos). All photos were captured at the same exposure level for each ex-
periment. In each case, we captured (1) a narrow-aperture photo and (2) the optimal
capture sequence for the equivalent DOF and the particular camera. To compensate for
the distortions that occur with changes in focus setting, we align the photos according
³ Note each blur diameter is consistent with two depths (Fig. 3b). We resolve the ambiguity by choosing the matching input photo whose focus setting is closest to the synthetic focus setting.
[Fig. 6 panel label: Canon S3 IS (6MP)]
Fig. 6. Light-efficient photography timings and synthesis, for several real scenes, captured using
a compact digital camera and a digital SLR. (a,d) Sample wide-aperture photo from the synthetic
DOF sequence. (b,e) DOF composites synthesized from this sequence. (c,f) Narrow-aperture
photos spanning an equivalent DOF, but with much longer exposure time. (g) Coarse depth map,
computed from the labeling we used to compute (e). (h) Synthetically changing aperture size,
focused at the same setting as (d). (i) Synthetically changing the focus setting as well.
to a one-time calibration method that fits a radial magnification model to focus setting
[25]. To determine the maximum acceptable blur diameter c for each camera, we eval-
uated focus using a resolution chart. The values we found, 5 µm (1.4 pixels) and 25 µm
(3.5 pixels) respectively, agree with standard values [21].
DOF Compositing. Figures 6b and 6e show that despite the availability of just a coarse
depth map, our compositing scheme is able to reproduce high-frequency detail over the
whole DOF without noticeable artifacts, even in the vicinity of depth discontinuities.
Note that while the synthesized photos satisfy our goal of spanning a specific DOF,
objects outside that DOF will appear more defocused than in the corresponding narrow-
aperture photo (e.g., see the background in Figs. 6e–f). While increased background
defocus may be desirable (e.g., for portrait or macro photography), it is also possible to
capture sequences of photos to reproduce arbitrary levels of defocus outside the DOF.
Depth Maps and DOF Compositing. Despite being more efficient to capture, se-
quences with synthetic DOFs provide 3D shape information at no extra acquisition cost
(Fig. 6g). Figures 6h–i show results of using this depth map to compute novel images
whose aperture and focus setting was changed synthetically according to Sect. 5.
Implementation Details. Neither of our cameras provides the ability to control focus
remotely. For our compact camera we used modified firmware that enables scripting
[26], while for our SLR we used a computer-controlled motor to drive the focusing ring
mechanically. Both methods incur high overhead and limit us to about 1 fps.
While light-efficient photography is not practical in this context, it will become in-
creasingly so, as newer cameras begin to provide focus control and to increase frame-
per-second rates. For example, the Canon EOS-1Ds Mark III provides remote focus
control for all Canon EF lenses, and the Casio EX-F1 can capture 60 fps at 6MP.
7 Concluding Remarks
In this paper we studied the use of dense, wide-aperture photo sequences as a light-
efficient alternative to single-shot, narrow-aperture photography. While our emphasis
has been on the underlying theory, we believe our method has great practical potential.
We are currently investigating several extensions to the basic approach. These include
(1) designing light-efficient strategies for spanning arbitrary defocus profiles,
rather than just the DOF; (2) improving efficiency by taking advantage of the camera's
auto-focus sensor; and (3) operating under a highly restricted time budget, for which it
becomes important to weigh the tradeoff between noise and defocus.
References
1. Healey, G.E., Kondepudy, R.: Radiometric CCD camera calibration and noise estimation.
TPAMI 16(3), 267–276 (1994)
2. Hasinoff, S.W., Kutulakos, K.N.: A layer-based restoration framework for variable-aperture
photography. In: Proc. ICCV (2007)
3. Pentland, A.P.: A new sense for depth of field. TPAMI 9(4), 523–531 (1987)
4. Krotkov, E.: Focusing. IJCV 1(3), 223–237 (1987)
5. Hiura, S., Matsuyama, T.: Depth measurement by the multi-focus camera. In: CVPR, pp.
953–959 (1998)
6. Watanabe, M., Nayar, S.K.: Rational filters for passive depth from defocus. IJCV 27(3), 203–
225 (1998)
7. Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: motion deblurring using
fluttered shutter. In: SIGGRAPH, pp. 795–804 (2006)
8. Yuan, L., Sun, J., Quan, L., Shum, H.Y.: Image deblurring with blurred/noisy image pairs.
In: SIGGRAPH (2007)
9. Telleen, J., Sullivan, A., Yee, J., Gunawardane, P., Wang, O., Collins, I., Davis, J.: Synthetic
shutter speed imaging. In: Proc. Eurographics, pp. 591–598 (2007)
10. Farid, H., Simoncelli, E.P.: Range estimation by optical differentiation. JOSA A 15(7), 1777–
1786 (1998)
11. Cathey, W.T., Dowski, E.R.: New paradigm for imaging systems. Applied Optics 41(29),
6080–6092 (2002)
12. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional
camera with a coded aperture. In: SIGGRAPH (2007)
13. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography:
Mask enhanced cameras for heterodyned light fields and coded aperture refocusing. In: SIG-
GRAPH (2007)
14. Aizawa, K., Kodama, K., Kubota, A.: Producing object-based special effects by fusing mul-
tiple differently focused images. In: TCSVT 10(2) (2000)
15. Chaudhuri, S.: Defocus morphing in real aperture images. JOSA A 22(11), 2357–2365
(2005)
16. Hasinoff, S.W., Kutulakos, K.N.: Confocal stereo. In: Leonardis, A., Bischof, H., Pinz, A.
(eds.) ECCV 2006. LNCS, vol. 3951, pp. 620–634. Springer, Heidelberg (2006)
17. Ng, R.: Fourier slice photography. In: SIGGRAPH, pp. 735–744 (2005)
18. Levoy, M., Ng, R., Adams, A., Footer, M., Horowitz, M.: Light field microscopy. In: SIG-
GRAPH, pp. 924–934 (2006)
19. Debevec, P., Malik, J.: Recovering high dynamic range radiance maps from photographs. In:
SIGGRAPH, pp. 369–378 (1997)
20. Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin,
D., Cohen, M.: Interactive digital photomontage. In: SIGGRAPH, pp. 294–302 (2004)
21. Smith, W.J.: Modern Optical Engineering, 3rd edn. McGraw-Hill, New York (2000)
22. Canon lens chart, http://www.usa.canon.com/app/pdf/lens/
23. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Heidelberg (1999)
24. http://www.cs.toronto.edu/∼kyros/research/lightefficient/
25. Willson, R., Shafer, S.: What is the center of the image? JOSA A 11(11), 2946–2955 (1994)
26. CHDK, http://chdk.wikia.com/
A Proof of Theorem 1
Theorem 1 follows as a consequence of Lemma 1 and four additional lemmas. We first
state Lemmas 2–5 below and then prove a subset of them, along with a proof sketch of the
theorem. All missing proofs can be found in [24].
Lemma 3 (Permutation of Sequential DOFs). Given the left endpoint, α, every per-
mutation of D1 , . . . , Dn defines a capture sequence with sequential DOFs that has the
same synthetic DOF and the same total exposure time.
It therefore suffices to show that placing n′ − 1 points in [α, β] is most efficient when
n′ = n. To do this, we show that splitting a sub-interval always produces a more efficient
capture sequence.
Consider the case n = 2, where the sub-interval to be split is actually equal to [α, β].
Let x ∈ [α, β] be a splitting point. The exposure time for the sub-intervals [α, x] and
[x, β] can be obtained by combining Eqs. (D) and (1):
$$\tau(x) = \frac{L^*}{c^2}\left(\frac{x - \alpha}{x + \alpha}\right)^{2} + \frac{L^*}{c^2}\left(\frac{\beta - x}{\beta + x}\right)^{2}, \qquad (14)$$
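As a small numerical check of Eq. (14) as reconstructed above, the snippet below compares the single-photo time τ_one of Eq. (2) against the best two-photo split; the parameter values are illustrative (c = 25 µm as in Fig. 2), not taken from the paper.

import numpy as np

def tau_one(alpha, beta, c, l_star):
    """Eq. (2): single-photo exposure time for DOF [alpha, beta]."""
    return l_star * ((beta - alpha) / (c * (beta + alpha))) ** 2

def tau_split(x, alpha, beta, c, l_star):
    """Eq. (14): total exposure time when [alpha, beta] is split at x."""
    return (l_star / c ** 2) * (((x - alpha) / (x + alpha)) ** 2 +
                                ((beta - x) / (beta + x)) ** 2)

alpha, beta, c, l_star = 86.0, 92.0, 0.025, 1.0     # focus settings in mm (illustrative)
xs = np.linspace(alpha, beta, 1001)
print(tau_one(alpha, beta, c, l_star), tau_split(xs, alpha, beta, c, l_star).min())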
Flexible Depth of Field Photography

1 Depth of Field
The depth of field (DOF) of an imaging system is the range of scene depths that
appear focused in an image. In virtually all applications of imaging, ranging
from consumer photography to optical microscopy, it is desirable to control the
DOF. Of particular interest is the ability to capture scenes with very large DOFs.
DOF can be increased by making the aperture smaller. However, this reduces the
amount of light received by the detector, resulting in greater image noise (lower
Parts of this work were supported by grants from the National Science Foundation
(IIS-04-12759) and the Office of Naval Research (N00014-08-1-0329 and N00014-06-
1-0032.)
SNR). This trade-off gets worse with increase in spatial resolution (decrease in
pixel size). As pixels get smaller, DOF decreases since the defocus blur occupies
a greater number of pixels. At the same time, each pixel receives less light and
hence SNR falls as well. This trade-off between DOF and SNR is one of the
fundamental, long-standing limitations of imaging.
In a conventional camera, for any location of the image detector, there is
one scene plane – the focal plane – that is perfectly focused. In this paper, we
propose varying the position and/or orientation of the image detector during the
integration time of a photograph. As a result, the focal plane is swept through a
volume of the scene causing all points within it to come into and go out of focus,
while the detector collects photons.
We demonstrate that such an imaging system enables one to control the DOF
in new and powerful ways:
2 Related Work
A promising approach to extended DOF imaging is wavefront coding, where
phase plates placed at the aperture of the lens cause scene objects within a cer-
tain depth range to be defocused in the same way [5,6,7]. Thus, by deconvolving
the captured image with a single blur kernel, one can obtain an all-focused im-
age. In this case, the effective DOF is determined by the phase plate used and is
fixed. On the other hand, in our system, the DOF can be chosen by controlling
the motion of the detector. Our approach has greater flexibility as it can even
be used to achieve discontinuous or tilted DOFs.
Recently, Levin et al. [8] and Veeraraghavan et al. [9] have used masks at
the lens aperture to control the properties of the defocus blur kernel. From a
single captured photograph, they aim to estimate the structure of the scene
and then use the corresponding depth-dependent blur kernels to deconvolve the
image and get an all-focused image. However, they assume simple layered scenes
and their depth recovery is not robust. In contrast, our approach is not geared
towards depth recovery, but can significantly extend DOF irrespective of scene
complexity. Also, the masks used in both these previous works attenuate some
of the light entering the lens, while our system operates with a clear and wide
aperture. All-focused images can also be computed from an image captured
using integral photography [10,11,12]. However, since these cameras make spatio-
angular resolution trade-offs to capture 4D lightfields in a single image, the
computed images have much lower spatial resolutions when compared to our
approach.
Fig. 1. (a) A scene point M , at a distance u from the lens, is imaged in perfect focus
by a detector at a distance v from the lens. If the detector is shifted to a distance p
from the lens, M is imaged as a blurred circle with diameter b centered around m′. (b)
Our flexible DOF camera translates the detector along the optical axis during the inte-
gration time of an image. By controlling the starting position, speed, and acceleration
of the detector, we can manipulate the DOF in powerful ways.
$$\frac{1}{f} = \frac{1}{u} + \frac{1}{v}. \qquad (1)$$
As shown in the figure, if the detector is shifted to a distance p from the lens
(dotted line), M is imaged as a blurred circle (the circle of confusion) centered
around m′. The diameter b of this circle is given by
$$b = \frac{a}{v}\,\lvert v - p \rvert. \qquad (2)$$
The distribution of light energy within the blur circle is referred to as the
point spread function (PSF). The PSF can be denoted as $P(r, u, p)$, where $r$ is
the distance of an image point from the center m′ of the blur circle. An idealized
model for characterizing the PSF is the pillbox function:
$$P(r, u, p) = \frac{4}{\pi b^2}\,\Pi\!\left(\frac{r}{b}\right), \qquad (3)$$
where $\Pi(x)$ is the rectangle function, which has value 1 if $|x| < 1/2$ and
0 otherwise. In the presence of optical aberrations, the PSF deviates from the
pillbox function and is then often modeled as a Gaussian function:
$$P(r, u, p) = \frac{2}{\pi (gb)^2}\,\exp\!\left(-\frac{2r^2}{(gb)^2}\right), \qquad (4)$$
where g is a constant.
We now analyze the effect of moving the detector during an image’s integration
time. For simplicity, consider the case where the detector is translated along the
optical axis, as in Figure 1(b). Let p(t) denote the detector’s distance from the
lens as a function of time. Then the aggregate PSF for a scene point at a distance
u from the lens, referred to as the integrated PSF (IPSF), is given by
$$IP(r, u) = \int_0^T P(r, u, p(t))\,dt, \qquad (5)$$
where T is the total integration time. By programming the detector motion p(t)–
its starting position, speed, and acceleration – we can change the properties of
the resulting IPSF. This corresponds to sweeping the focal plane through the
scene in different ways. The above analysis only considers the translation of
the detector along the optical axis (as implemented in our prototype camera).
However, this analysis can be easily extended to more general detector motions,
where both its position and orientation are varied during image integration.
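To make Eqs. (1)–(5) concrete, the following sketch numerically integrates the pillbox PSF over a constant-speed detector translation p(t) = p(0) + s·t. The default parameter values mirror the simulation settings quoted later in the text (f = 12.5 mm, f/1.4, p(0) = 12.5 mm, s = 1 mm/s, T = 0.36 s); the code itself is only an illustration, not the authors' implementation.

import numpy as np

def pillbox_psf(r, b):
    """Eq. (3): pillbox PSF with blur-circle diameter b, evaluated at radius r."""
    b = max(b, 1e-9)                                 # guard against b = 0 at perfect focus
    return (4.0 / (np.pi * b ** 2)) * (abs(r) < b / 2.0)

def ipsf(r, u, f=12.5, f_number=1.4, p0=12.5, s=1.0, T=0.36, n_steps=2000):
    """Eq. (5): integrate the lens PSF over the detector motion p(t) = p0 + s*t."""
    a = f / f_number                                 # aperture diameter
    v = f * u / (u - f)                              # Eq. (1): in-focus detector distance
    dt = T / n_steps
    total = 0.0
    for t in np.linspace(0.0, T, n_steps, endpoint=False):
        p = p0 + s * t                               # detector position at time t
        b = (a / v) * abs(v - p)                     # Eq. (2): blur-circle diameter
        total += pillbox_psf(r, b) * dt
    return total

# IPSF values at r = 0.01 mm for several scene depths (cf. Fig. 3(b)):
print([round(ipsf(0.01, u), 4) for u in (450, 550, 750, 1100, 2000)])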
Figure 2(a) shows our flexible DOF camera. It consists of a 1/3” Sony CCD
(with 1024x768 pixels) mounted on a Physik Instrumente M-111.1DG transla-
tion stage. This stage has a DC motor actuator that can translate the detector
through a 15 mm range at a top speed of 2.7 mm/sec and can position it with
an accuracy of 0.05 microns. The translation direction is along the optical axis
of the lens. The CCD shown has a global shutter and was used to implement ex-
tended DOF and discontinuous DOF. For realizing tilted DOFs, we used a 1/2.5”
Micron CMOS detector (with 2592x1944 pixels) which has a rolling shutter.
Fig. 2. (a) Prototype system with flexible DOF. (b) Translation of the detector re-
quired for sweeping the focal plane through different scene depth ranges. The maxi-
mum change in the image position of a scene point that results from this translation,
when a 1024x768 pixel detector is used, is also shown.
The table in Figure 2(b) shows detector translations (third column) required
to sweep the focal plane through various depth ranges (second column), using
lenses with two different focal lengths (first column). As we can see, the detector
has to be moved by very small distances to sweep very large depth ranges. Us-
ing commercially available micro-actuators, such translations are easily achieved
within typical image integration times (a few milliseconds to a few seconds).
It must be noted that when the detector is translated, the magnification of
the imaging system changes. The fourth column of the table in Figure 2(b) lists
the maximum change in the image position of a scene point for different trans-
lations of a 1024x768 pixel detector. For the detector motions we require, these
changes in magnification are very small. This does result in the images not being
perspectively correct, but the distortions are imperceptible. More importantly,
the IPSFs are not significantly affected by such a magnification change, since a
scene point will be in high focus only for a small fraction of this change and will
be highly blurred over the rest of it. We verify this in the next section.
[Fig. 3 plots: (a) normal camera PSF (pillbox model), (b) EDOF camera IPSF, (c) normal camera PSF (Gaussian model), (d) EDOF camera IPSF, each shown for scene depths between 450 mm and 2000 mm.]
Fig. 3. Simulated (a,c) normal camera PSFs and (b,d) EDOF camera IPSFs, obtained
using pillbox and Gaussian lens PSF models for 5 scene depths. Note that the IPSFs
are almost invariant to scene depth.
$$IP(r, u) = \frac{uf}{(u - f)\,\pi a s T}\left(\frac{\lambda_0 + \lambda_T}{r} - \frac{2\lambda_0}{b(0)} - \frac{2\lambda_T}{b(T)}\right), \qquad (6)$$
where $b(t)$ is the blur circle diameter at time $t$, and $\lambda_t = 1$ if $b(t) \ge 2r$ and 0
otherwise. On the other hand, if we use the Gaussian function in Equation 4 for
the lens PSF, we get
$$IP(r, u) = \frac{uf}{(u - f)\,\sqrt{2\pi}\,r a s T}\left(\operatorname{erfc}\!\left(\frac{r}{\sqrt{2}\,g\,b(0)}\right) + \operatorname{erfc}\!\left(\frac{r}{\sqrt{2}\,g\,b(T)}\right)\right). \qquad (7)$$
Figures 3(a) and (c) show 1D profiles of a normal camera’s PSFs for 5 scene
points with depths between 450 and 2000 mm from a lens with focal length
f = 12.5 mm and f /# = 1.4, computed using Equations 3 and 4 (with g = 1),
respectively. In this simulation, the normal camera was focused at a distance
of 750 mm. Figures 3(b) and (d) show the corresponding IPSFs of an EDOF
camera with the same lens, p(0) = 12.5 mm, s = 1 mm/sec, and T = 360
msec, computed using Equations 6 and 7, respectively. As expected, the normal
camera’s PSF varies dramatically with scene depth. In contrast, the IPSFs of
the EDOF camera derived using both pillbox and Gaussian PSF models look
almost identical for all 5 scene depths, i.e., the IPSFs are depth invariant.
To verify this empirical observation, we measured a normal camera’s PSFs and
the EDOF camera’s IPSFs for several scene depths, by capturing images of small
dots placed at different depths. Both cameras have f = 12.5 mm, f /# = 1.4,
and T = 360 msec. The detector motion parameters for the EDOF camera are
p(0) = 12.5 mm and s = 1 mm/sec. The first column of Figure 4 shows the
measured PSF at the center pixel of the normal camera for 5 different scene
depths; the camera was focused at a distance of 750 mm. (Note that the scale
of the plot in the center row is 50 times that of the other plots.) Columns 2-4
of the figure show the IPSFs of the EDOF camera for 5 different scene depths
and 3 different image locations. We can see that, while the normal camera’s
PSFs vary widely with scene depth, the EDOF camera’s IPSFs appear almost
invariant to both spatial location and scene depth. This also validates our claim
that the small magnification changes that arise due to detector motion (discussed
in Section 3) do not have a significant impact on the IPSFs.
[Fig. 4 plots: measured PSFs/IPSFs for scene depths of 450, 550, 750, 1100, and 2000 mm (vertical axis) and image locations (0,0), (212,0), and (424,0) pixels (horizontal axis).]
Fig. 4. (Left column) The measured PSF of a normal camera shown for 5 different
scene depths. Note that the scale of the plot in the center row is 50 times that of
the other plots. (Right columns) The measured IPSF of our EDOF camera shown for
different scene depths (vertical axis) and image locations (horizontal axis). The EDOF
camera’s IPSFs are almost invariant to scene depth and image location.
Since the EDOF camera’s IPSF is invariant to scene depth and image location,
we can deconvolve a captured image with a single IPSF to get an image with
greater DOF. A number of techniques have been proposed for deconvolution,
Richardson-Lucy and Wiener [19] being two popular ones. For our results, we
have used the approach of Dabov et al. [20], which combines Wiener deconvolu-
tion and block-based denoising. In all our experiments, we used the IPSF shown
in the first row and second column of Figure 4 for deconvolution.
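As a simpler stand-in for this deconvolution step, the sketch below applies plain frequency-domain Wiener deconvolution with a single measured IPSF kernel. The paper uses the method of Dabov et al. [20], which additionally performs block-based denoising; that refinement is not reproduced here.

import numpy as np

def wiener_deconvolve(image, ipsf_kernel, nsr=1e-3):
    """Wiener deconvolution of a 2D image by a (small) 2D IPSF kernel."""
    kh, kw = ipsf_kernel.shape
    pad = np.zeros(image.shape, dtype=float)
    pad[:kh, :kw] = ipsf_kernel
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))   # center the kernel at the origin
    H = np.fft.fft2(pad)                                        # kernel spectrum
    G = np.fft.fft2(image)                                      # captured-image spectrum
    W = np.conj(H) / (np.abs(H) ** 2 + nsr)                     # Wiener filter (nsr = noise-to-signal ratio)
    return np.real(np.fft.ifft2(W * G))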
Figure 5(a) shows an image captured by our EDOF camera. It was captured
with a 12.5 mm Fujinon lens with f /1.4 and 0.36 second exposure. Notice that
the captured image looks slightly blurry, but high frequencies of all scene ele-
ments are captured. This scene spans a depth range of approximately 450 mm
to 2000 mm – 10 times larger than the DOF of a normal camera with identical
lens settings. Figure 5(b) shows the EDOF image computed from the captured
image, in which the entire scene appears focused¹. Figure 5(c) shows the image
¹ Mild ringing artifacts in the computed EDOF images are due to deconvolution.
captured by a normal camera with identical f /# and exposure time. The near-
est scene elements are in focus, while the farther scene elements are severely
blurred. The image captured by a normal camera with the same exposure time,
but with a smaller aperture (f /8) is shown in Figure 5(d). The intensities of this
image were scaled up so that its dynamic range matches that of the correspond-
ing computed EDOF image. All scene elements look reasonably sharp, but the
image is very noisy as can be seen in the inset (zoomed). The computed EDOF
image has much less noise, while having comparable sharpness. Figures 5(e-h)
show another example, of a scene captured outdoors at night. As we can see, in a
normal camera, the tradeoff between DOF and SNR is extreme for such dimly lit
scenes. In short, our EDOF camera can capture scenes with large DOFs as well
as high SNR. High resolution versions of these images as well as other examples
can be seen at [21].
(c) Image from Normal Camera (f/1.4, T = 0.36 sec, Near Focus); (d) Image from Normal Camera (f/8, T = 0.36 sec, Near Focus) with Scaling; (g) Image from Normal Camera (f/1.4, T = 0.72 sec, Near Focus); (h) Image from Normal Camera (f/8, T = 0.72 sec, Near Focus) with Scaling
Fig. 5. (a,e) Images captured by the EDOF camera. (b,f) EDOF images computed
from images in (a) and (e), respectively. Note that the entire scene appears focused.
(c,g) Images captured by a normal camera with identical settings, with the nearest
object in focus. (d,h) Images captured by a normal camera at f /8.
[Figure residue: (a) plot of MTF for two IPSF curves; (b) table of camera f/#, noise standard deviation, and DOF (mm), including a normal camera at f/1.4 (DOF 140.98 mm) and at f/2.8 (DOF 289.57 mm).]
In the above analysis, the SNR was averaged over all frequencies. However,
it must be noted that SNR is frequency dependent - SNR is greater for lower
frequencies than for higher frequencies in the deconvolved EDOF images. Hence,
high frequencies in an EDOF image would be degraded, compared to the high
frequencies in a perfectly focused image. However, in our experiments this degra-
dation is not strong, as can be seen in the full resolution images at [21].
Consider the image in Figure 7(a), which shows two toys (cow and hen) in front
of a scenic backdrop with a wire mesh in between. A normal camera with a small
DOF can capture either the toys or the backdrop in focus, while eliminating the
mesh via defocusing. However, since its DOF is a single continuous volume, it
cannot capture both the toys and the backdrop in focus and at the same time
eliminate the mesh. If we use a large aperture and program our camera’s detector
motion such that it first focuses on the toys for a part of the integration time,
and then moves quickly to another location to focus on the backdrop for the
remaining integration time, we obtain the image in Figure 7(b). While this image
includes some blurring, it captures the high frequencies in two disconnected
DOFs - the foreground and the background - but almost completely eliminates
the wire mesh in between. This is achieved without any post-processing. Note
that we are not limited to two disconnected DOFs; by pausing the detector at
several locations during image integration, more complex DOFs can be realized.
Normal cameras can focus on only fronto-parallel scene planes. On the other
hand, view cameras [2,3] can be made to focus on tilted scene planes by adjusting
the orientation of the lens with respect to the detector. We show that our flexible
(a) Image from Normal Camera (f /11) (b) Image from Our Camera (f /1.4)
Fig. 7. (a) An image captured by a normal camera with a large DOF. (b) An image
captured by our flexible DOF camera, where the toy cow and hen in the foreground
and the landscape in the background appear focused, while the wire mesh in between
is optically erased via defocusing.
(a) Image from Normal Camera (f/1.4, T = 0.03 sec); (b) Image from Our Camera (f/1.4, T = 0.03 sec)
Fig. 8. (a) An image captured by a normal camera of a table top inclined at 53◦ with
respect to the lens plane. (b) An image captured by our flexible DOF camera, where
the DOF is tilted by 53◦ . The entire table top (with the newspaper and keys) appears
focused. Observe that the top of the mug is defocused, but the bottom appears focused,
illustrating that the focal plane is aligned with the table top. Three scene regions of
both the images are shown at a higher resolution to highlight the defocus effects.
$$\theta = \tan^{-1}\!\left(\frac{sT}{H}\right) \quad \text{and} \quad \phi = \tan^{-1}\!\left(\frac{2f \tan\theta}{2p(0) + H \tan\theta - 2f}\right). \qquad (8)$$
Here, H is the height of the detector. Therefore, by controlling the speed s of
the detector, we can vary the tilt angle of the image detector, and hence the tilt
of the focal plane and its associated DOF.
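A small numeric sketch of Equation (8), as reconstructed above, is given below; the parameter values (sensor height, starting position) are illustrative guesses, not taken from the paper.

import math

def tilt_angles(s, T, H, f, p0):
    """Eq. (8): emulated detector tilt theta and resulting focal-plane tilt phi (degrees)."""
    theta = math.atan((s * T) / H)                               # detector tilt
    phi = math.atan(2 * f * math.tan(theta) /
                    (2 * p0 + H * math.tan(theta) - 2 * f))      # focal-plane (DOF) tilt
    return math.degrees(theta), math.degrees(phi)

# e.g. s = 2.7 mm/s, T = 0.07 s rolling-shutter lag, H = 4.3 mm, f = 12.5 mm, p0 = 12.8 mm
print(tilt_angles(2.7, 0.07, 4.3, 12.5, 12.8))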
Figure 8 shows a scene where the dominant scene plane – a table top with
a newspaper, keys and a mug on it – is inclined at an angle of approximately
53◦ with the lens plane. As a result, a normal camera is unable to focus on the
entire plane, as seen from Figure 8(a). By translating a rolling-shutter detector
(1/2.5” CMOS sensor with a 70msec exposure lag between the first and last row
of pixels) at 2.7 mm/sec, we emulate a detector tilt of 2.6◦ . This enables us to
achieve the desired DOF tilt of 53◦ (from Equation 8) and capture the table top
(with the newspaper and keys) in focus, as shown in Figure 8(b). Observe that
the top of the mug is not in focus, but the bottom appears focused, illustrating
the fact that the DOF is tilted to be aligned with the table top. It is interesting
to note that, by translating the detector with varying speed, we can emulate
non-planar detectors that can focus on curved scene surfaces.
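As a small numerical companion to Equation 8, the sketch below evaluates θ and φ. Only the detector speed, the rolling-shutter lag, and the sensor format loosely follow the values quoted above; the focal length and p(0) used in the example call are placeholders, so the resulting φ is purely illustrative.

import numpy as np

def dof_tilt(s, T, H, f, p0):
    """Detector tilt theta and DOF tilt phi from Equation 8.
    s: detector speed, T: integration (or readout lag) time, H: detector
    height, f: focal length, p0: p(0), the lens-to-detector distance.
    All lengths in consistent units (e.g., mm); angles returned in degrees."""
    theta = np.arctan(s * T / H)
    phi = np.arctan(2.0 * f * np.tan(theta) /
                    (2.0 * p0 + H * np.tan(theta) - 2.0 * f))
    return np.degrees(theta), np.degrees(phi)

# 2.7 mm/s over a 70 ms rolling-shutter lag on a roughly 4.3 mm tall 1/2.5"
# sensor gives theta of about 2.5 degrees; f and p0 below are placeholders.
theta_deg, phi_deg = dof_tilt(s=2.7, T=0.07, H=4.29, f=12.5, p0=12.7)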
7 Discussion
In this paper we have proposed a camera with a flexible DOF. DOF is manip-
ulated in various ways by changing the position of the detector during image
integration. We have shown how such a system can capture arbitrarily com-
plex scenes with extended DOF and high SNR. We have also shown that we
can create DOFs that span multiple disconnected volumes. In addition, we have
demonstrated that our camera can focus on tilted scene planes. All of these func-
tionalities are achieved by simply controlling the motion of the detector during
the exposure of a single image.
While computing images with extended DOF, we have not explicitly modeled
occlusions at depth discontinuities or motion blur caused by object/camera mo-
tion. Due to defocus blur, image points that lie close to occlusion boundaries
can receive light from scene points at very different depths. However, since the
IPSF of the EDOF camera is nearly depth invariant, the aggregate IPSF for
such an image point can be expected to be similar to the IPSF of points far
from occlusion boundaries. With respect to motion blur, we have not observed
any visible artifacts in EDOF images computed for scenes with typical object
motion (see Figure 5). However, motion blur due to high-speed objects can be
expected to cause problems. In this case, a single pixel sees multiple objects with
possibly different depths. It is possible that neither of the objects are imaged
in perfect focus during detector translation. This scenario is an interesting one
that warrants further study.
In addition to the DOF manipulations shown in this paper, we have (a) cap-
tured extended DOF video by moving the detector forward one frame, backward
the next, and so on (the IPSF is invariant to the direction of motion), (b) cap-
tured scenes with non-planar DOFs, and (c) exploited the camera’s focusing
References
1. Hausler, G.: A Method to Increase the Depth of Focus by Two Step Image Pro-
cessing. Optics Communications, 38–42 (1972)
2. Merklinger, H.: Focusing the View Camera (1996)
3. Krishnan, A., Ahuja, N.: Range estimation from focus using a non-frontal imaging
camera. IJCV, 169–185 (1996)
4. Scheimpflug, T.: Improved Method and Apparatus for the Systematic Alteration
or Distortion of Plane Pictures and Images by Means of Lenses and Mirrors for
Photography and for other purposes. GB Patent (1904)
5. Dowski, E.R., Cathey, W.T.: Extended Depth of Field Through Wavefront Coding.
Applied Optics, 1859–1866 (1995)
6. George, N., Chi, W.: Extended depth of field using a logarithmic asphere. Journal
of Optics A: Pure and Applied Optics, 157–163 (2003)
7. Castro, A., Ojeda-Castaneda, J.: Asymmetric Phase Masks for Extended Depth of
Field. Applied Optics, 3474–3479 (2004)
8. Levin, A., Fergus, R., Durand, F., Freeman, B.: Image and depth from a conven-
tional camera with a coded aperture. SIGGRAPH (2007)
9. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled pho-
tography: mask enhanced cameras for heterodyned light fields and coded aperture.
SIGGRAPH (2007)
10. Adelson, E., Wang, J.: Single lens stereo with a plenoptic camera. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 99–106 (1992)
11. Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field
photography with a hand-held plenoptic camera. Technical Report Stanford Uni-
versity (2005)
12. Georgiev, T., Zheng, C., Curless, B., Salesin, D., Nayar, S.K., Intwala, C.: Spatio-
angular resolution tradeoff in integral photography. In: Eurographics Symposium
on Rendering, pp. 263–272 (2006)
13. Darrell, T., Wohn, K.: Pyramid based depth from focus. CVPR, 504–509 (1988)
14. Nayar, S.K.: Shape from Focus System. CVPR, 302–308 (1992)
15. Subbarao, M., Choi, T.: Accurate Recovery of Three-Dimensional Shape from Im-
age Focus. PAMI, 266–274 (1995)
16. Levin, A., Sand, P., Cho, T.S., Durand, F., Freeman, W.T.: Motion-Invariant Pho-
tography. SIGGRAPH, ACM Transaction on Graphics (2008)
17. Ben-Ezra, M., Zomet, A., Nayar, S.: Jitter Camera: High Resolution Video from a
Low Resolution Detector. CVPR, 135–142 (2004)
18. Ait-Aider, O., Andreff, N., Lavest, J.M., Martinet, P.: Simultaneous Object Pose
and Velocity Computation Using a Single View from a Rolling Shutter Camera.
In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp.
56–68. Springer, Heidelberg (2006)
19. Jansson, P.A.: Deconvolution of Images and Spectra. Academic Press, London
(1997)
20. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image restoration by sparse 3D
transform-domain collaborative filtering. SPIE Electronic Imaging (2008)
21. www.cs.columbia.edu/CAVE/projects/flexible_dof
Priors for Large Photo Collections and
What They Reveal about Cameras
1 Introduction
2 Related Work
A number of image priors have been proposed to describe the statistics of in-
dividual photographs, such as the sparsity of outputs of band-pass filters (e.g.
derivative filters) [6,7], biases in the distribution of gradient orientations [8,9],
and 1/f fall-off of the amplitude spectrum [10,11]. These priors have been ex-
ploited for applications such as deriving intrinsic images from image sequences
[15], super-resolution and image demosaicing [16], removing effects of camera
shake [17], and classifying images as belonging to different scene categories [9,18].
We focus on the aggregate statistics of large photo collections, which tend to have
less variability than the statistics of a single image. We thus propose two new
priors for aggregate statistics of large photo collections and describe how they
can be exploited to recover radiometric properties of cameras.
The most popular method for estimating the camera response function in-
volves taking multiple registered images of a static scene with varying camera
exposures [19,20]. Grossberg and Nayar [21] relax the need for spatial corre-
spondences by using histograms of images at different exposures. If the exposure
cannot be varied, but can be locked, the response can be estimated by capturing
multiple registered images of a static scene illuminated by different combina-
tions of light sources [22]. All these methods require significant user effort and
physical access to the camera. Farid [23] assumes that the response function
has the form of a gamma curve and estimates it from a single image. However,
in practice response functions can differ significantly from gamma curves. Lin
et al. [24] also estimate the response from a single image by exploiting intensity
statistics at edges. Their results depend on the kinds of edges detected, and their
method employs a non-linear optimization which needs multiple initial guesses
for robustness. In contrast, we automatically and robustly estimate the response
function using numerous existing photographs.
Vignetting can be estimated by imaging a uniformly illuminated flat texture-
less Lambertian surface, and comparing the intensity of every pixel with that
of the center pixel (which is assumed to have no vignetting) [25,26]. Unfortu-
nately, realizing such capture conditions is difficult. One approach is to use a
device called an “integrating sphere,” but this specialized hardware is expen-
sive. Stumpfel et al. [27] capture many images of a known illuminant at different
locations in the image and fit a polynomial to the measured irradiances. The
same principle has been used to estimate vignetting from overlapping images
of an arbitrary scene [28,29,30] using measured irradiances of the same scene
point at different image locations. All these methods require the user to acquire
new images under controlled conditions. Some of the above approaches [28,29]
can be used to simultaneously estimate the vignetting and the response function
of a camera, but there are typically ambiguities in recovering this information.
Since we recover both properties independently, we do not have any ambigu-
ities. Recently, Zheng et al. [31] have proposed estimating vignetting from a
single image by assuming that a vignette-corrected image will yield an image
segmentation with larger segments. Their optimization algorithm, which con-
sists of many alternating image segmentation and vignetting estimation steps,
is highly non-linear and hence is likely to have local minima issues. In contrast,
we estimate vignetting linearly and efficiently.
During manufacturing, bad pixels are typically identified by exposing image
detectors to uniform illuminations. However, some pixels develop defects later
and it is difficult for consumers to create uniform environments to detect them.
Dudas et al. [32] detect such pixels by analyzing a set of images in a Bayesian
framework. However, they only show simulation results. We propose a simple
technique that is able to detect bad pixels, albeit using many images.
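As a rough illustration of the idea (not the paper's algorithm), the sketch below averages many photographs from one camera and flags pixels that deviate strongly from their local neighborhood; the neighborhood size and threshold are assumptions.

import numpy as np
from scipy.ndimage import median_filter

def detect_bad_pixels(images, rel_thresh=0.2):
    """Flag pixels whose value in the average image differs strongly from
    the median of their neighborhood. The 3x3 neighborhood and the
    relative threshold are illustrative choices, not the paper's parameters."""
    avg = np.mean(np.stack([im.astype(np.float64) for im in images]), axis=0)
    local = median_filter(avg, size=3)
    return np.abs(avg - local) > rel_thresh * (local + 1e-6)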
(e) Log-luminance along the rows of the average image and (f) along the columns, for focal length 5.8 mm and f/# 2.8.
(i) The average images have a vertical log-luminance gradient. This is possibly because illumination sources are typically above
– outdoors, from the sun and sky, while indoors, from ceiling-mounted light fix-
tures. (ii) The average images do not have a horizontal gradient, illustrated by
Figure 1 (f) which shows log-luminances along a row. We have found that these
two observations are general and they hold true for all camera models and lens
settings. In summary, in the absence of vignetting, average log-luminance images
have a vertical gradient, but no horizontal gradient. This observation serves as
the prior, which we exploit to recover vignetting in Section 4.2.
(a) Red Joint Histogram (b) Green Joint Histogram (c) Blue Joint Histogram
Fig. 2. Log of the joint histograms of (a) red, (b) green, and (c) blue irradiances
computed from 15,550 photographs captured by Canon S1IS cameras with the extreme
lens setting – smallest focal length (5.8 mm) and largest f-number (4.5). The inverse
camera response functions used were normalized so that irradiance values were in the
range (0,255). When computing the histograms we ignored irradiances less than 5 and
greater than 250 to avoid the effects of under-exposure and saturation, respectively.
function, R, for that channel, where R(i) is the irradiance value corresponding
to intensity i. Using R we linearize that channel in photographs from that model
and compute a joint histogram, JH, where JH(i, j) gives the number of times
irradiances R(i) and R(j) occur in neighboring pixels in a desired pixel block. We
interpret the joint histogram as the joint probability distribution of irradiances
by assuming that the distribution is piecewise uniform within each bin. However,
since the values of R are typically non-uniformly spaced, the bins have different
areas. Therefore, to convert the joint histogram to a probability distribution, we
divide the value of each bin by its area. Note that the values of R determine the
sampling lattice, so to enable comparisons between joint histograms for different
response functions we resample the histogram on a regular grid in irradiance
space. Finally, we normalize the resampled distribution so that it sums to one.
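The sketch below illustrates this conversion for one channel. The choice of horizontally adjacent pixel pairs, the under/over-exposure thresholds, and the use of SciPy's grid interpolator for the resampling step are our assumptions for illustration; the paper's exact implementation is not reproduced.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

def joint_irradiance_distribution(images, R, block=31, grid_pts=256):
    """images: list of 2D uint8 single-channel intensity images.
    R: inverse response, R[i] = irradiance for intensity i (length 256,
    strictly increasing). Returns the joint distribution of neighboring
    irradiances, resampled on a regular irradiance grid and normalized."""
    JH = np.zeros((256, 256))
    for img in images:
        h, w = img.shape
        cy, cx = h // 2, w // 2
        patch = img[cy - block // 2:cy + block // 2 + 1,
                    cx - block // 2:cx + block // 2 + 1]
        a = patch[:, :-1].ravel()                      # horizontal neighbor pairs
        b = patch[:, 1:].ravel()
        keep = (a > 5) & (a < 250) & (b > 5) & (b < 250)
        np.add.at(JH, (a[keep], b[keep]), 1)           # count intensity pairs
    # Each bin covers an irradiance cell whose area depends on the
    # non-uniform spacing of R; divide by that area.
    dR = np.gradient(R)
    JH = JH / np.outer(dR, dR)
    # Resample the piecewise-uniform distribution on a regular grid.
    interp = RegularGridInterpolator((R, R), JH, bounds_error=False, fill_value=0.0)
    g = np.linspace(R.min(), R.max(), grid_pts)
    P = interp(np.stack(np.meshgrid(g, g, indexing='ij'), axis=-1))
    return P / P.sum()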
We computed joint histograms of red, green, and blue irradiances for several
camera models using 31 × 31 pixel blocks at the center of photographs. Figure 2
shows the joint histograms for the Canon S1IS camera model computed from
photographs with the smallest focal length and largest f-number. These his-
tograms show that the probability of any two irradiances being incident on
neighboring pixels varies depending on the values of the irradiances. Also, the
probability of the same irradiance occurring at neighboring pixels is greater for
low irradiance values and decreases slowly as the irradiance value increases.
Finally, note that the histograms for different color channels differ slightly, il-
lustrating that the visual world has different distributions for different colors.
We have empirically observed that for any particular color channel, the joint
histogram looks very similar across camera models, especially when computed
for the extreme lens setting – smallest focal length and largest f-number. This is
not surprising, because the extreme setting is chosen by different camera models
for similar types of scenes. We quantified this similarity using the symmetric
Kullback-Leibler (KL) divergence between corresponding histograms. The sym-
metric KL divergence between distributions p and q is defined as
KLDivSym(p, q) = Σ_i q(i) log(q(i)/p(i)) + Σ_i p(i) log(p(i)/q(i)),    (1)
where p(i) and q(i) are the samples. For the Canon S1IS and Sony W1 camera
models, the symmetric KL divergence between corresponding joint histograms
for the extreme lens setting were 0.059 (red channel), 0.081 (green channel),
and 0.068 (blue channel). These small numbers illustrate that the histograms are
very similar across camera models. Therefore, we can use the joint histograms
computed for any one camera model as non-parametric priors on these statistics.
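A direct transcription of Equation 1 follows; the small epsilon that guards against empty histogram bins is our addition.

import numpy as np

def kl_div_sym(p, q, eps=1e-12):
    """Symmetric KL divergence between two discretized distributions
    (each non-negative and summing to one)."""
    p = p.ravel() + eps
    q = q.ravel() + eps
    return np.sum(q * np.log(q / p)) + np.sum(p * np.log(p / q))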
(a) Red Channel of Sony W1   (b) Green Channel of Canon G5   (c) Blue Channel of Casio Z120   (d) Red Channel of Minolta Z2
(e)
                  Sony W1             Canon G5            Casio Z120          Minolta Z2
                  Proposed    [24]    Proposed    [24]    Proposed    [24]    Proposed    [24]
Red Channel       1.344%    2.587%    1.759%    2.553%    2.269%    1.518%    2.226%    4.914%
Green Channel     1.993%    1.243%    0.865%    3.396%    2.521%    1.155%    2.743%    3.237%
Blue Channel      1.164%    1.783%    2.523%    2.154%    2.051%    3.053%    2.653%    3.292%
Fig. 3. Estimated and ground truth inverse response functions of one channel for four
camera models – (a) Sony W1, (b) Canon G5, (c) Casio Z120, and (d) Minolta Z2. For
these estimates we used 17,819, 9,529, 1,315, and 3,600 photographs, respectively. (a)
also shows the initial guess used by our optimization. (e) RMS percentage errors of the
estimated inverse response functions for camera models from four different manufac-
turers obtained using our proposed method and the method of [24].
camera models. Due to space constraints, we only show the inverse responses of
one of their channels in Figures 3(a-d). For comparison we also show the ground
truth inverse response functions obtained using HDRShop [19].² As we can see,
the estimated curves are very close to the ground truth curves. The difference
between the two sets of curves is greater at higher image intensities, for which
HDRShop typically provides very noisy estimates.
² Inverse response functions can only be estimated up to scale. To compare the inverse responses produced by our technique and HDRShop, we scaled the results from HDRShop by a factor that minimizes the RMS error between the two curves.
The RMS estimation errors are shown in Figure 3(e). Even though our estimation process uses a non-linear optimization, we have found it to be robust to choices of the initial guess. For all our results we used the mean inverse response
from the EMoR database [34], shown in Figure 3(a), as the initial guess. For
comparison, Figure 3(e) also shows the estimation errors obtained when using
the method of Lin et al. [24] on large image sets (the same ones used by our
method) for robustness; the overall mean RMS error of their estimates is 28%
greater than ours. An interesting question to ask is: How many photographs
does our technique need to get a good estimate? We have found that only around
200 photographs are required to get an estimate with RMS error of about 2%.
In some cases, as few as 25 photographs are required. (See [35] for details.)
Note that M is known, while V and L are unknown. We assume that vignetting
is radially symmetric about the center of the image. Therefore, vignetting at
pixel (x, y) can be expressed as a function of the distance, r, of the pixel from
the image center. We model the log of the vignetting as a polynomial in r:
V(x, y) = Σ_{k=1}^{N} β_k r^k, where β_k are the coefficients and N is the degree of the polynomial.
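As one hedged illustration of a linear fit consistent with the prior stated earlier (the average log-luminance image has no horizontal gradient in the absence of vignetting), the sketch below removes each row's mean from the average log-luminance image and solves for the β_k by least squares. It is an assumption-laden reconstruction, not necessarily the paper's exact formulation.

import numpy as np

def fit_log_vignetting(M, degree=6):
    """M: average log-luminance image (vignetting plus scene term).
    Model: M(x, y) ~= L(y) + V(r), with V(r) = sum_k beta_k r^k and r the
    normalized distance from the image center. Subtracting each row's
    mean removes L(y); what remains of each row is attributed to
    vignetting alone (an illustrative formulation). Returns beta_1..beta_degree."""
    h, w = M.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
    r = r / r.max()
    M_res = M - M.mean(axis=1, keepdims=True)          # remove per-row scene term
    B = np.stack([r ** k for k in range(1, degree + 1)], axis=-1)
    B_res = B - B.mean(axis=1, keepdims=True)          # same row-demeaning on the basis
    beta, *_ = np.linalg.lstsq(B_res.reshape(-1, degree), M_res.ravel(), rcond=None)
    return beta

The fitted log-vignetting is then V(r) = Σ_k β_k r^k, and exp(V) gives a relative illuminance falloff of the kind plotted in Figure 5 (V = 0, i.e., no falloff, at the image center).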
(Relative illuminance, ground truth and estimated, plotted against radial distance for six lens settings:)
(a) f: 5.8 mm, N: 4.5    (b) f: 7.9 mm, N: 5.6    (c) f: 7.2 mm, N: 4.0
(d) f: 5.8 mm, N: 2.8    (e) f: 7.9 mm, N: 2.8    (f) f: 7.2 mm, N: 2.0
Fig. 5. (a-f) Vignetting estimated for two lens settings each of Canon S1IS, Sony W1,
and Canon G5 cameras, using the bottom half of their respective average log-luminance
images. 15,550, 13,874, 17,819, 15,434, 12,153, and 6,324 photographs, respectively were
used for these estimates. (f and N stand for focal length and f-number respectively.)
(g) RMS and mean percentage errors of the estimated vignetting for two lens settings
each of three camera models; estimation errors are typically less than 2%.
Fig. 6. (a) Contrast enhanced luminance of the average of 1,186 photographs from a
particular Canon S1IS camera. (b) Zoomed in portions of the image in (a) in which
we can clearly see bad pixels that have very different intensities from their neighbors.
(c) A comparative study of the number of bad detector pixels in a particular camera
instance for three different camera models.
5 Conclusion
In this paper, we have presented priors on two aggregate statistics of large photo
collections, and exploited these statistics to recover the radiometric properties
of camera models entirely from publicly available photographs, without physi-
cal access to the cameras themselves. In future work, we would like to develop
statistics that reveal other camera properties such as radial distortion, chromatic aberration, and spatially varying lens softness. There are, of course, a number of powerful and accurate approaches to camera calibration, and these existing techniques have both advantages and disadvantages relative to ours.
References
1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections
in 3D. ACM Transactions on Graphics (SIGGRAPH), 835–846 (2006)
2. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-View Stereo
for Community Photo Collections. In: ICCV (2007)
3. Hays, J., Efros, A.A.: Scene Completion Using Millions of Photographs. ACM
Transactions on Graphics (SIGGRAPH) (2007)
4. Lalonde, J.-F., Hoiem, D., Efros, A.A., Rother, C., Winn, J., Criminisi, A.: Photo
Clip Art. ACM Transactions on Graphics (SIGGRAPH) (2007)
5. Torralba, A., Fergus, R., Freeman, W.: Tiny Images. MIT Tech Report (2007)
6. Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 607–609 (1996)
7. Simoncelli, E.: Statistical Models for Images: Compression, Restoration and Syn-
thesis. In: Asilomar Conference on Signals, Systems and Computers, pp. 673–678
(1997)
8. Switkes, E., Mayer, M.J., Sloan, J.A.: Spatial frequency analysis of the visual envi-
ronment: anisotropy and the carpentered environment hypothesis. Vision Research,
1393–1399 (1978)
9. Baddeley, R.: The Correlational Structure of Natural Images and the Calibration
of Spatial Representations. Cognitive Science, 351–372 (1997)
10. Burton, G.J., Moorhead, I.R.: Color and spatial structure in natural scenes. Ap-
plied Optics, 157–170 (1987)
11. Field, D.: Relations between the statistics of natural images and the response
properties of cortical cells. J. of the Optical Society of America, 2379–2394 (1987)
12. Wackrow, R., Chandler, J.H., Bryan, P.: Geometric consistency and stability of
consumer-grade digital cameras for accurate spatial measurement. The Photogram-
metric Record, 121–134 (2007)
13. DxO Labs: www.dxo.com
14. PTLens: www.epaperpress.com/ptlens
15. Weiss, Y.: Deriving intrinsic images from image sequences. In: ICCV, pp. 68–75
(2001)
16. Tappen, M.F., Russell, B.C., Freeman, W.T.: Exploiting the sparse derivative prior
for super-resolution and image demosaicing. In: Workshop on Statistical and Com-
putational Theories of Vision (2003)
17. Fergus, R., Singh, B., Hertzmann, A., Roweis, S.T., Freeman, W.T.: Removing
Camera Shake From A Single Photograph. SIGGRAPH, 787–794 (2006)
18. Torralba, A., Oliva, A.: Statistics of Natural Images Categories. Network: Compu-
tation in Neural Systems 14, 391–412 (2003)
19. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from pho-
tographs. SIGGRAPH, 369–378 (1997)
20. Mitsunaga, T., Nayar, S.K.: Radiometric self calibration. CVPR, 1374–1380 (1999)
21. Grossberg, M.D., Nayar, S.K.: Determining the Camera Response from Images:
What is Knowable?. PAMI, 1455–1467 (2003)
22. Manders, C., Aimone, C., Mann, S.: Camera response function recovery from dif-
ferent illuminations of identical subject matter. ICIP, 2965–2968 (2004)
23. Farid, H.: Blind Inverse Gamma Correction. IEEE Transactions on Image Process-
ing, 1428–1433 (2001)
24. Lin, S., Gu, J., Yamazaki, S., Shum, H.-Y.: Radiometric Calibration Using a Single
Image. CVPR, 938–945 (2004)
25. Sawchuk, A.: Real-time correction of intensity nonlinearities in imaging systems.
IEEE Transactions on Computers, 34–39 (1977)
26. Kang, S.B., Weiss, R.: Can we calibrate a camera using an image of a flat textureless
lambertian surface? In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 640–
653. Springer, Heidelberg (2000)
27. Stumpfel, J., Jones, A., Wenger, A., Debevec, P.: Direct HDR capture of the sun
and sky. Afrigraph, 145–149 (2004)
28. Goldman, D.B., Chen, J.H.: Vignette and exposure calibration and compensation.
In: ICCV, pp. 899–906 (2005)
29. Litvinov, A., Schechner, Y.Y.: Addressing radiometric nonidealities: A unified
framework. CVPR, 52–59 (2005)
30. Jia, J., Tang, C.K.: Tensor voting for image correction by global and local intensity
alignment. IEEE Transactions PAMI 27(1), 36–50 (2005)
31. Zheng, Y., Lin, S., Kang, S.B.: Single-Image Vignetting Correction. CVPR (2006)
32. Dudas, J., Jung, C., Wu, L., Chapman, G.H., Koren, I., Koren, Z.: On-Line Map-
ping of In-Field Defects in Image Sensor Arrays. In: International Symposium on
Defect and Fault-Tolerance in VLSI Systems, pp. 439–447 (2006)
33. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C:
The Art of Scientific Computing (1992)
34. Grossberg, M.D., Nayar, S.K.: What is the Space of Camera Response Functions?.
CVPR, 602–609 (2003)
35. http://www.cs.columbia.edu/CAVE/projects/photo_priors/
Understanding Camera Trade-Offs
through a Bayesian Analysis of Light Field Projections
1 Introduction
The flexibility of computational imaging has led to a range of unconventional cam-
era designs. Cameras with coded apertures [1,2], plenoptic cameras [3,4], phase
plates [5,6], and multi-view systems [7] record different combinations of light rays. Re-
construction algorithms then convert the data to viewable images, estimate depth and
other quantites. These cameras involves tradeoffs among various quantites–spatial and
depth resolution, depth of focus or noise. This paper describes a theoretical framework
that will help to compare computational camera designs and understand their tradeoffs.
Computation is changing imaging in three ways. First, the information recorded at
the sensor may not be the final image, and the need for a decoding algorithm must be
taken into account to assess camera quality. Second, beyond 2D images, the new designs
enable the extraction of 4D light fields and depth information. Finally, new priors
can capture regularities of natural scenes to complement the sensor measurements and
amplify decoding algorithms. The traditional evaluation tools based on the image point
spread function (PSF) [8,9] are not able to fully model these effects. We seek tools for
comparing camera designs, taking into account those three aspects. We want to evaluate
the ability to recover a 2D image as well as depth or other information and we want to
model the decoding step and use natural-scene priors.
A useful common denominator, across camera designs and scene information, is the
light field [7], which encodes the atomic entities (light rays) reaching the camera. Light
fields naturally capture some of the more common photography goals such as high spa-
tial image resolution, and are tightly coupled with the targets of mid-level computer
vision: surface depth, texture, and illumination information. Therefore, we cast the re-
construction performed in computational imaging as light field inference. We then need
to extend prior models, traditionally studied for 2D images, to 4D light fields.
Camera sensors sum over sets of light rays, with the optics specifying the mapping
between rays and sensor elements. Thus, a camera provides a linear projection of the
4D light field where each projected coordinate corresponds to the measurement of one
pixel. The goal of decoding is to infer from such projections as much information as
possible about the 4D light field. Since the number of sensor elements is significantly
smaller than the dimensionality of the light field signal, prior knowledge about light
fields is essential. We analyze the limitations of traditional signal processing assump-
tions [10,11,12] and suggest a new prior on light field signals which explicitly accounts
for their structure. We then define a new metric of camera performance as follows:
Given a light field prior, how well can the light field be reconstructed from the data
measured by the camera? The number of sensor elements is of course a critical vari-
able, and we chose to standardize our comparisons by imposing a fixed budget of N
sensor elements to all cameras.
We focus on the information captured by each camera, and wish to avoid the con-
founding effect of camera-specific inference algorithms or the decoding complexity.
For clarity and computational efficiency we focus on the 2D version of the problem
(1D image/2D light field). We use simplified optical models and do not model lens
aberrations or diffraction (these effects would still follow a linear projection model and
can be accounted for with modifications to the light field projection function.)
Our framework captures the three major elements of the computational imaging
pipeline – optical setup, decoding algorithm, and priors – and enables a systematic
comparison on a common baseline.
(g) Plenoptic camera (h) Coded aperture lens (i) Wavefront coding
Fig. 1. (a) Flat-world scene with 3 objects. (b) The light field, and (c)-(i) cameras and the light
rays integrated by each sensor element (distinguished by color).
Stereo [17] facilitates depth inference by recording 2 views (fig 1(f); to keep a constant sensor budget, the resolution of each image is halved).
Plenoptic cameras capture multiple viewpoints using a microlens array [3,4]. If each
microlens covers k sensor elements one achieves k different views of the scene, but the
spatial resolution is reduced by a factor of k (k = 3 is shown in fig 1(g)).
Coded aperture cameras [1,2] place a binary mask in the lens aperture (fig 1(h)). As with con-
ventional lenses, objects deviating from the focus depth are blurred, but according to
the aperture code. Since the blur scale is a function of depth, by searching for the code
scale which best explains the local image window, depth can be inferred. The blur can
also be inverted, increasing the depth of field.
Wavefront coding introduces an optical element with an unconventional shape so that
rays from any world point do not converge; each sensor element thus integrates over a curve in light field space (fig 1(i)), instead of the straight line integrated by a lens. This is designed to make
defocus at different depths almost identical, enabling deconvolution without depth in-
formation, thereby extending depth of field. To achieve this, a cubic lens shape (or phase
plate) is used. The light field integration curve, which is a function of the lens normal,
can be shown to be a parabola (fig 1(i)), which is slope invariant (see [18] for a deriva-
tion, also independently shown by M. Levoy and Z. Zhu, personal communication).
y = Tx + n (1)
where x is the light field, y is the captured image, n is an iid Gaussian noise n ∼
N (0, η 2 I) and T is the projection matrix, describing how light rays are mapped to
sensor elements. Referring to figure 1, T includes one row for each sensor element, and
this row has non-zero elements for the light field entries marked by the corresponding
color (e.g. a pinhole T matrix has a single non-zero element per row).
The set of realizable T matrices is limited by physical constraints. In particular,
the entries of T are all non-negative. To ensure equal noise conditions, we assume a
maximal integration time, and the maximal value for each entry of T is 1. The amount
of light reaching each sensor element is the sum of the entries in the corresponding T
row. It is usually better to collect more light to increase the SNR (a pinhole is noisier
because it has a single non-zero entry per row, while a lens has multiple ones).
To simplify notation, most of the following derivation will address a 2D slice in the
4D light field, but the 4D case is similar. While the light field is naturally continuous,
for simplicity we use a discrete representation.
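To make the notation concrete, here is a toy discrete construction of T for a pinhole and for a lens focused at the slope-zero plane, together with a simulated measurement y = Tx + n; the light field discretization and the noise level are arbitrary choices for illustration.

import numpy as np

def pinhole_T(n_space, n_angle, view=0):
    """One sensor row per spatial sample; each row has a single non-zero
    entry (one ray from one view), as described in the text."""
    T = np.zeros((n_space, n_angle * n_space))
    for i in range(n_space):
        T[i, view * n_space + i] = 1.0
    return T

def lens_T(n_space, n_angle):
    """Lens focused at the slope-0 plane: each sensor element sums the
    rays of one spatial sample over all views, so it gathers n_angle
    times more light than a pinhole row (entries capped at 1)."""
    T = np.zeros((n_space, n_angle * n_space))
    for i in range(n_space):
        for a in range(n_angle):
            T[i, a * n_space + i] = 1.0
    return T

# Simulate a noisy measurement of a random (flattened) 2D light field.
rng = np.random.default_rng(0)
n_space, n_angle, eta = 64, 8, 0.01
x = rng.random(n_angle * n_space)          # x.reshape(n_angle, n_space) is the light field
y = lens_T(n_space, n_angle) @ x + eta * rng.standard_normal(n_space)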
Our goal is to understand how well we can recover the light field x from the noisy
projection y, and which T matrices (among the camera projections described in the
previous section) allow better reconstructions. That is, if one is allowed to take N mea-
surements (T can have N rows), which set of projections leads to better light field re-
construction? Our evaluation methodology can be adapted to a weight w which specifies
how much we care about reconstructing different parts of the light field. For example, if
the goal is an all-focused, high quality image from a single view point (as in wavefront
coding), we can assign zero weight to all but one light field row.
The number of measurements taken by most optical systems is significantly smaller
than the light field data, i.e. T contains many fewer rows than columns. As a result,
it is impossible to recover the light field without prior knowledge on light fields. We
therefore start by modeling a light field prior.
where fk,i denotes the kth high pass filter centered at the ith light field entry. In sec 5,
we will show that band limited assumptions and Gaussian priors indeed lead to equiva-
lent sampling conclusions.
More sophisticated prior choices replace the Gaussian prior of eq 2 with a heavy-
tailed prior [19]. However, as will be illustrated in section 3.4, such generic priors ignore
the very strong elongated structure of light fields, or the fact that the variance along the
disparity slope is significantly smaller than the spatial variance.
For a given slope field, our prior assumes that the light field is Gaussian, but has a
variance in the disparity direction that is significantly smaller than the spatial variance.
The covariance ΨS corresponding to a slope field S is then:
x^T Ψ_S^{-1} x = Σ_i [ (1/σ_s) |g_{S(i),i}^T x|^2 + (1/σ_0) |g_{0,i}^T x|^2 ]    (3)
where gs,i is a derivative filter in orientation s centered at the ith light field entry (g0,i
is the derivative in the horizontal/spatial direction), and σs << σ0 , especially for non-
specular objects (in practice, we consider diffuse scenes and set σs = 0). Conditioning
on depth we have P (x|S) ∼ N (0, ΨS ).
We also need a prior P (S) on the slope field S. Given that depth is usually piecewise
smooth, our prior encourages piecewise smooth slope fields (like the regularization of
stereo algorithms). Note however that S and its prior are expressed in light-field space,
not image or object space. The resulting unconditional light field prior is an infinite
mixture of Gaussians (MOG) that sums over slope fields
P(x) = Σ_S P(S) P(x|S)    (4)
We note that while each mixture component is a Gaussian which can be evaluated in
closed form, marginalizing over the infinite set of slope fields S is intractable, and
approximation strategies are described below.
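The sketch below makes the structure of this prior concrete by assembling the precision matrix Ψ_S^{-1} of Equation 3 for one slope-field hypothesis, using simple finite differences as the oriented and spatial derivative filters and integer-valued slopes; the authors' exact filters and slope parameterization are not reproduced.

import numpy as np

def slope_precision(n_space, n_angle, S, sigma_s=1e-3, sigma_0=1.0):
    """Precision matrix Psi_S^{-1} for a light field flattened as
    x.reshape(n_angle, n_space), given an integer slope S[i] per spatial
    column. Disparity-direction differences are weighted by 1/sigma_s and
    spatial differences by 1/sigma_0, with sigma_s << sigma_0."""
    N = n_angle * n_space
    idx = lambda a, i: a * n_space + i
    P = np.zeros((N, N))
    for a in range(n_angle):
        for i in range(n_space):
            if a + 1 < n_angle and 0 <= i + S[i] < n_space:
                g = np.zeros(N)                       # derivative along the slope
                g[idx(a + 1, i + S[i])] = 1.0
                g[idx(a, i)] = -1.0
                P += np.outer(g, g) / sigma_s
            if i + 1 < n_space:
                g = np.zeros(N)                       # horizontal/spatial derivative
                g[idx(a, i + 1)] = 1.0
                g[idx(a, i)] = -1.0
                P += np.outer(g, g) / sigma_0
    return P

# Example: a fronto-parallel (slope 0) scene hypothesis.
Psi_inv = slope_precision(16, 4, S=np.zeros(16, dtype=int))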
Now that we have modeled the probability of a light field x, we turn to the imaging
problem: Given a camera T and a noisy projection y we want to find a Bayesian estimate
for the light field x. For this, we need to define P (x|y; T ), the probability that x is the
explanation of the measurement y. Using Bayes’ rule:
P(x|y; T) = Σ_S P(x, S|y; T) = Σ_S P(S|y; T) P(x|y, S; T)    (5)
Inference. Given a camera T and an observation y we seek to recover the light field
x. In this section we consider MAP estimation, while in section 4 we approximate the
variance as well in an attempt to compare cameras. Even MAP estimation for x is hard,
Fig. 2. (a) Test image. (b) Light field and slope field. (c) SSD error in light field reconstruction for each camera (pinhole, lens, stereo, plenoptic, coded aperture, wavefront coding) under the isotropic Gaussian prior, isotropic sparse prior, light field prior, and band-pass assumption.
as the integral in eq 5 is intractable. We approximate the MAP estimate for the slope
field S, and conditioning on this estimate, solve for the MAP light field x.
The slope field inference is essentially inferring the scene depth. Our inference gener-
alizes MRF stereo algorithms [17] or the depth regularization of the coded aperture [1].
Details regarding slope inference are provided in [18], but as a brief summary, we model
slope in local windows as constant or having one single discontinuity, and we then reg-
ularize the estimate using an MRF.
Given the estimated slope field S, our light field prior is Gaussian, and thus the
MAP estimate for the light field is the mean of the conditional Gaussian μS in eq 6.
This mean minimizes the projection error up to noise, and regularizes the estimate by
minimizing the oriented variance ΨS . Note that in traditional stereo formulations the
multiple views are used only for depth estimation. In contrast, we seek a light field
that satisfies the projection in all views. Thus, if each view includes aliasing, we obtain
“super resolution”.
Fig. 3. Reconstructing a light field from projections. Top row: reconstruction with our MOG light
field prior. Middle row: slope field (estimated with MOG prior), plotted over ground truth. Note
slope changes at depth discontinuities. Bottom row: reconstruction with isotropic Gaussian prior.
resolution under an MOG prior. Thus, our goal in the next section is to analytically
evaluate the reconstruction accuracy of different cameras, and to understand how it is
affected by the choice of prior.
where W = diag(w) is a diagonal matrix specifying how much we care about different
light field entries, as discussed in sec 3.1.
Uncertainty computation. To simplify eq 7, recall that the expected squared distance between x0 and a sample from a Gaussian is the squared distance from the center plus the variance:
E(|W(x − x0)|^2 | S; T) = |W(μ_S − x0)|^2 + Σ diag(W^2 Σ_S)    (8)
Since the integral in eq 9 can not be computed explicitly, we evaluate cameras using
synthetic light fields whose ground truth slope field is known, and evaluate an approxi-
mate uncertainty in the vicinity of the true solution. We use a discrete set of slope field
samples {S1 , ..., SK } obtained as perturbations around the ground truth slope field. We
approximate eq 9 using a discrete average:
E(|W(x − x0)|^2; T) ≈ (1/K) Σ_k P(S_k|y) E(|W(x − x0)|^2 | S_k; T)    (10)
Finally, we use a set of typical light fields x0t (generated using ray tracing) and eval-
uate the quality of a camera T as the expected squared error over these examples
E(T) = Σ_t E(|W(x − x0t)|^2; T)    (11)
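Gathering Equations 8, 10, and 11, a camera score can be approximated as in the sketch below. The Gaussian posterior mean and covariance follow standard conditioning (the form the text assigns to eq 6, which is not reproduced here), a small ridge term is added for numerical stability, and the slope samples with their weights P(S_k|y) are assumed to come from the slope-inference step.

import numpy as np

def conditional_error(T, Psi_inv, y, x0, eta, w=None):
    """Equation 8 for one slope hypothesis: posterior mean/covariance of
    the light field given y, then weighted squared distance to the ground
    truth x0 plus the weighted posterior variance."""
    N = Psi_inv.shape[0]
    w = np.ones(N) if w is None else w
    A = T.T @ T / eta**2 + Psi_inv + 1e-8 * np.eye(N)   # ridge: our addition
    Sigma = np.linalg.inv(A)
    mu = Sigma @ (T.T @ y) / eta**2
    return np.sum((w * (mu - x0))**2) + np.sum(w**2 * np.diag(Sigma))

def camera_score(T, slope_precisions, slope_weights, light_fields, measurements, eta):
    """Equations 10 and 11: average the conditional error over slope
    samples (weighted by approximate P(S_k|y)) and sum over test light fields."""
    total = 0.0
    for x0, y in zip(light_fields, measurements):
        K = len(slope_precisions)
        total += sum(p * conditional_error(T, Pinv, y, x0, eta)
                     for Pinv, p in zip(slope_precisions, slope_weights)) / K
    return total

With pinhole_T, lens_T, and slope_precision from the earlier sketches, this score can be evaluated on small synthetic light fields.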
Note that this solely measures information captured by the optics together with the prior,
and omits the confounding effect of specific inference algorithms (like in sec 3.4).
Fig 4 shows light fields estimated by several cameras, assuming the true depth (and
therefore slope field), was successfully estimated. We also display the variance of the
estimated light field - the diagonal of ΣS (eq 6).
In the right part of the light field, the lens reconstruction is sharp, since it averages
rays emerging from a single object point. On the left, uncertainty is high, since it av-
erages light rays from multiple points. In contrast, integrating over a parabolic curve
(wavefront coding) achieves low uncertainties for both slopes, since a parabola “cov-
ers” all slopes (see [18,20] for derivation). A pinhole also behaves identically at all
depths, but it collects only a small amount of light and the uncertainty is high due to the
small SNR. Finally, the uncertainty increases in stereo and plenoptic cameras due to the
smaller number of spatial samples.
The central region of the light field demonstrates the utility of multiple viewpoints in
the presence of occlusion boundaries. Occluded parts which are not measured properly lead to higher variance. The variance in the occluded part is minimized by the plenoptic camera, the only one that spends measurements in this region of the light field.
Fig. 4. Evaluating conditional uncertainty in the light field estimate for (top to bottom) pinhole, lens, wavefront coding, stereo, and plenoptic cameras. Left: projection model. Middle: estimated light field. Right: variance in estimate (equal intensity scale used for all cameras). Note that while for visual clarity we plot perfect square samples, in our implementation samples were convolved with low pass filters to simulate realistic optics blur.
Since we deal only with spatial resolution, our conclusions correspond to common
sense, which is a good sanity check. However, they cannot be derived from a naive Gaus-
sian model, which emphasizes the need for a prior such as our new mixture model.
by [21], with the same physical size (stereo baseline shift doesn’t exceed aperture width)
both designs perform similarly, with DFD achieving ⟨P(S0|y)⟩ = 0.92.
Our probabilistic treatment of depth estimation goes beyond linear subspace con-
straints. For example, the average slope estimation score of a lens was ⟨P(S0|y)⟩ = 0.74, indicating that, while weaker than stereo, a single monocular image captured with
a standard lens contains some depth-from-defocus information as well. This result can-
not be derived using a disjoint-subspace argument, but if the full probability is consid-
ered, the Occam’s razor principle applies and the simpler explanation is preferred.
Finally, a pinhole camera-projection just slices a row out of the light field, and this
slice is invariant to the light field slope. The parabola filter of a wavefront coding lens
is also designed to be invariant to depth. Indeed, for these two cameras, the evaluated
distribution P (S|y) in our model is uniform over slopes.
Again, these results are not surprising but they are obtained within a general frame-
work that can qualitatively and quantitatively compare a variety of camera designs.
While comparisons such as DFD vs. stereo have been conducted in the past [21], our
framework encompasses a much broader family of cameras.
Fig. 5. Expected light field reconstruction error for each camera (pinhole, lens, coded aperture, plenoptic, wavefront coding, DFD, stereo), for scenes with no, modest, and many depth discontinuities: (a) full light field reconstruction, (b) single-row reconstruction.
aperture are much more sensitive than others. While the depth discrimination of DFD
is similar to that of stereo (as discussed in sec 5.2), its overall error is slightly higher
since the wide apertures blur high frequencies.
The ranking in fig 5(a) agrees with the empirical prediction in fig 2(c). However, while fig 5(a) measures inherent optics information, fig 2(c) folds in inference errors as well.
Single-image reconstruction. For single row reconstruction (fig 5(b)) one still has to
account for issues like defocus, depth of field, signal to noise ratio and spatial resolution.
A pinhole camera (recording this single row alone) is not ideal, and there is an advantage
for wide apertures collecting more light (recording multiple light field rows) despite not
being invariant to depth.
The parabola (wavefront coding) does not capture depth information and thus per-
forms very poorly for light field estimation. However, fig 5(b) suggests that for recov-
ering a single light field row, this filter outperforms all other cameras. The reason is
that since the filter is invariant to slope, a single central light field row can be recov-
ered without knowledge of depth. For this central row, it actually achieves high signal
to noise ratios for all depths, as demonstrated in figure 4. To validate this observation,
we have searched over a large set of lens curvatures, or light field integration curves,
parameterized as splines fitted to 6 key points. This family includes both slope sensitive
curves (in the spirit of [6] or a coded aperture), which identify slope and use it in the
estimation, and slope invariant curves (like the parabola [5]), which estimate the cen-
tral row regardless of slope. Our results show that, for the goal of recovering a single
light field row, the wavefront-coding parabola outperforms all other configurations. This
extends the arguments in previous wavefront coding publications which were derived
using optics reasoning and focus on depth-invariant approaches. It also agrees with the
motion domain analysis of [20], predicting that a parabolic integration curve provides
an optimal signal to noise ratio.
Suppose we use a camera with a fixed resolution of N pixels: how many different views (N pixels each) do we actually need for a good ‘virtual reality’? Figure 6 plots the expected reconstruction error as a function of the number of views.
6 Discussion
The growing variety of computational camera designs calls for a unified way to analyze
their tradeoffs. We show that all cameras can be analytically modeled by a linear mapping
of light rays to sensor elements. Thus, interpreting sensor measurements is the Bayesian
inference problem of inverting the ray mapping. We show that a proper prior on light
fields is critical for the successes of camera decoding. We analyze the limitations of tra-
ditional band-pass assumptions and suggest that a prior which explicitly accounts for the
elongated light field structure can significantly reduce sampling requirements.
Our Bayesian framework estimates both depth and image information, accounting
for noise and decoding uncertainty. This provides a tool to compare computational cam-
eras on a common baseline and provides a foundation for computational imaging. We
conclude that for diffuse scenes, the wavefront coding cubic lens (and the parabola light
field curve) is the optimal way to capture a scene from a single view point. For capturing
a full light field, a stereo camera outperformed other tested configurations.
We have focused on providing a common ground for all designs, at the cost of sim-
plifying optical and decoding aspects. This differs from traditional optics optimization
tools such as Zemax that provide fine-grain comparisons between subtly-different de-
signs (e.g. what if this spherical lens element is replaced by an aspherical one?). In
contrast, we are interested in the comparison between families of imaging designs (e.g.
stereo vs. plenoptic vs. coded aperture). We concentrate on measuring inherent informa-
tion captured by the optics, and do not evaluate camera-specific decoding algorithms.
The conclusions from our analysis are well connected to reality. For example, it
can predict the expected tradeoffs (which can not be derived using more naive light
field models) between aperture size, noise and spatial resolution discussed in sec 5.1. It
justifies the exact wavefront coding lens design derived using optics tools, and confirms
the prediction of [21] relating stereo to depth from defocus.
Analytic camera evaluation tools may also permit the study of unexplored camera
designs. One might develop new cameras by searching for linear projections that yield
optimal light field inference, subject to physical implementation constraints. While the
camera score is a very non-convex function of its physical characteristics, defining cam-
era evaluation functions opens up these research directions.
References
1. Levin, A., Fergus, R., Durand, F., Freeman, W.: Image and depth from a conventional camera
with a coded aperture. SIGGRAPH (2007)
2. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography:
Mask-enhanced cameras for heterodyned light fields and coded aperture refocusing. SIG-
GRAPH (2007)
3. Adelson, E.H., Wang, J.Y.A.: Single lens stereo with a plenoptic camera. PAMI (1992)
4. Ng, R., Levoy, M., Bredif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photogra-
phy with a hand-held plenoptic camera. Stanford U. Tech. Rep. CSTR 2005-02 (2005)
5. Bradburn, S., Dowski, E., Cathey, W.: Realizations of focus invariance in optical-digital sys-
tems with wavefront coding. Applied optics 36, 9157–9166 (1997)
6. Dowski, E., Cathey, W.: Single-lens single-image incoherent passive-ranging systems. App.
Opt. (1994)
7. Levoy, M., Hanrahan, P.M.: Light field rendering. SIGGRAPH (1996)
8. Goodman, J.W.: Introduction to Fourier Optics. McGraw-Hill Book Company, New York
(1968)
9. Zemax: http://www.zemax.com
10. Chai, J., Tong, X., Chan, S., Shum, H.: Plenoptic sampling. SIGGRAPH (2000)
11. Isaksen, A., McMillan, L., Gortler, S.J.: Dynamically reparameterized light fields. SIG-
GRAPH (2000)
12. Ng, R.: Fourier slice photography. SIGGRAPH (2005)
13. Seitz, S., Kim, J.: The space of all stereo images. In: ICCV (2001)
14. Grossberg, M., Nayar, S.K.: The raxel imaging model and ray-based calibration. In: IJCV
(2005)
15. Kak, A.C., Slaney, M.: Principles of Computerized Tomographic Imaging
16. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. PAMI (2002)
17. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo corre-
spondence algorithms. Intl. J. Computer Vision 47(1), 7–42 (2002)
18. Levin, A., Freeman, W., Durand, F.: Understanding camera trade-offs through a bayesian
analysis of light field projections. MIT CSAIL TR 2008-049 (2008)
19. Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In: CVPR
(2005)
20. Levin, A., Sand, P., Cho, T.S., Durand, F., Freeman, W.T.: Motion invariant photography.
SIGGRAPH (2008)
21. Schechner, Y., Kiryati, N.: Depth from defocus vs. stereo: How different really are they. IJCV
(2000)
CenSurE: Center Surround Extremas for
Realtime Feature Detection and Matching
1 Introduction
Image matching is the task of establishing correspondences between two images
of the same scene. This is an important problem in Computer Vision with appli-
cations in object recognition, image indexing, structure from motion and visual
localization – to name a few. Many of these applications have real-time constraints
and would benefit immensely from being able to match images in real time.
While the problem of image matching has been studied extensively for various
applications, our interest in it has been to be able to reliably match two images
in real time for camera motion estimation, especially in difficult off-road environ-
ments where there is large image motion between frames [1,2]. Vehicle dynamics
and outdoor scenery can make the problem of matching images very challenging.
The choice of a feature detector can have a large impact in the performance of
such systems.
We have identified two criteria that affect performance.
– Stability: the persistence of features across viewpoint change
– Accuracy: the consistent localization of a feature across viewpoint change
This material is based upon work supported by the United States Air Force under
Contract No. FA8650-04-C-7136. Any opinions, findings and conclusions or recom-
mendations expressed in this material are those of the author(s) and do not neces-
sarily reflect the views of the United States Air Force.
The two scale-space detectors that are closest to our work, in technique and prac-
ticality, are SIFT [11] and SURF [13]. The main differences between approaches
are summarized in the table below.

                             CenSurE            SIFT         SURF
Spatial resolution at scale  full               subsampled   subsampled
Scale-space operator         Laplace            Laplace      Hessian
Approximation                (Center-surround)  (DOG)        (DOB)
Edge filter                  Harris             Hessian      Hessian
Rotational invariance        approximate        yes          no
The key difference is the full spatial resolution achieved by CenSurE at ev-
ery scale. Neither SIFT nor SURF computes responses at all pixels for larger
scales, and consequently do not detect extrema across all scales. Instead, they consider each scale octave independently. Within an octave, they subsample
the responses, and find extrema only at the subsampled pixels. At each suc-
cessive octave, the subsampling is increased, so that almost all computation is
spent on the first octave. Consequently, the accuracy of features at larger scales
is sacrificed, in the same way that it is for pyramid systems. While it would be
possible for SIFT and SURF to forego subsampling, it would then be inefficient,
with compute times growing much larger.
CenSurE also benefits from using an approximation to the Laplacian, which
has been shown to be better for scale selection [12]. The center-surround ap-
proximation is fast to compute, while being insensitive to rotation (unlike the
DOB Hessian approximation). Also, CenSurE uses a Harris edge filter, which
gives better edge rejection than the Hessian.
Several simple center-surround filters exist in the literature. The bi-level Lapla-
cian of Gaussian (BLoG) approximates the LoG filter using two levels. [14] de-
scribes circular BLoG filters and optimizes for the inner and outer radius to best
approximate the LoG filter. The drawback is that the cost of BLoG depends on
the size of the filter. Closer to our approach is that of Grabner et al. [15], who
describe a difference-of-boxes (DOB) filter that approximates the SIFT detector,
and is readily computed at all scales with integral images [16,17]. Contrary to the
results presented in [15], we demonstrate that our DOB filters outperform SIFT
in repeatability. This can be attributed to careful selection of filter sizes and using
the second moment matrix instead of the Hessian to filter out responses along a
line. In addition, the DOB filter is not invariant to rotation, and in this paper we
propose filters that have better properties.
The rest of the paper is organized as follows. We describe our CenSurE features
in detail in Section 2. We then discuss our modified upright SURF (MU-SURF)
in Section 3. We compare the performance of CenSurE against several other
feature detectors. Results of this comparison for image matching are presented
in Section 4.1 followed by results for visual odometry in Section 4.2. Finally,
Section 5 concludes this paper.
We replace the two circles in the circular BLoG with squares to form our
CenSurE-DOB. This results in a basic center-surround Haar wavelet. Figure
1(d) shows our generic center-surround wavelet of block size n. The inner box
is of size (2n + 1) × (2n + 1) and the outer box is of size (4n + 1) × (4n + 1).
Convolution is done by multiplication and summing. If In is the inner weight
and On is the weight in the outer box, then in order for the DC response of this
filter to be zero, we must have
We must also normalize for the difference in area of each wavelet across scale.
We use a set of seven scales for the center-surround Haar wavelet, with block
size n = [1, 2, 3, 4, 5, 6, 7]. Since the block sizes 1 and 7 are the boundary, the
lowest scale at which a feature is detected corresponds to a block size of 2. This
roughly corresponds to a LoG with a sigma of 1.885. These five scales cover 2.5 octaves, although the scales are linear. It is easy to add more filters with block sizes 8, 9, and so on.
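A sketch of the CenSurE-DOB kernel at block size n follows. Weighting the inner box and the surround by the inverse of their areas is one choice that satisfies the zero-DC and area-normalization requirements stated above; the paper's exact In and On values are not reproduced.

import numpy as np

def censure_dob_kernel(n):
    """Bi-level center-surround (difference-of-boxes) kernel of block size n:
    inner box (2n+1)x(2n+1) centered inside an outer box (4n+1)x(4n+1).
    Inner and surround weights are the inverse of each region's area, so
    the DC response is zero and the response is area-normalized."""
    outer, inner = 4 * n + 1, 2 * n + 1
    k = np.full((outer, outer), -1.0 / (outer**2 - inner**2))
    k[n:n + inner, n:n + inner] = 1.0 / inner**2
    return k

# Responses at block sizes n = 1..7; features are extrema over the
# (x, y, scale) response stack, with the boundary scales 1 and 7 excluded.
kernels = [censure_dob_kernel(n) for n in range(1, 8)]

In practice these box responses are computed with integral images rather than explicit convolution, as discussed below.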
Table 1. CenSurE-OCT: inner and outer octagon sizes for various scales
Fig. 2. Using slanted integral images to construct trapezoidal areas. Left is a slanted integral image, where the pixel x, y is the sum of the shaded areas; α is 1. Right is a half-trapezoid, from subtracting two slanted integral image pixels.
Fig. 3. Regions and subregions for MU-SURF descriptor. Each subregion (in blue) is 9x9 with an overlap of 2 pixels at each boundary. All sizes are relative to the scale of the feature s.
I is an intermediate representation for the image and contains the sum of gray
scale pixel values of image N with height y and width x, i.e.,
I(x, y) = Σ_{x'=0}^{x} Σ_{y'=0}^{y} N(x', y')    (4)
The integral image is computed recursively, requiring only one scan over the
image. Once the integral image is computed, it takes only four additions to
calculate the sum of the intensities over any upright, rectangular area, indepen-
dent of its size.
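A minimal sketch of Equation 4 and the constant-time box sum (arrays are indexed as (row, column) = (y, x), and the query rectangle is assumed not to touch the top or left border, or the integral image can be zero-padded):

import numpy as np

def integral_image(N):
    """I(x, y) = sum of N over the rectangle from (0, 0) to (x, y),
    computed in one pass with cumulative sums (Equation 4)."""
    return N.cumsum(axis=0).cumsum(axis=1)

def box_sum(I, x0, y0, x1, y1):
    """Sum of the image over the inclusive rectangle [x0, x1] x [y0, y1],
    using four integral-image lookups (requires x0 > 0 and y0 > 0)."""
    return I[y1, x1] - I[y0 - 1, x1] - I[y1, x0 - 1] + I[y0 - 1, x0 - 1]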
Modified versions of integral images can be exploited to compute the other
polygonal filters. The idea here is that any trapezoidal area can be computed
in constant time using a combination of two different slanted integral images,
where the sum at a pixel represents an angled area sum. The degree of slant is
controlled by a parameter α:
I_α(x, y) = Σ_{y'=0}^{y} Σ_{x'=0}^{x+α(y−y')} N(x', y').    (5)
When α = 0, this is just the standard rectangular integral image. For α < 0, the
summed area slants to the left; for α > 0, it slants to the right (Figure 2, left).
Slanted integral images can be computed in the same time as rectangular ones,
using incremental techniques.
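For an integer slant α, the recurrence I_α(x, y) = I_α(x + α, y − 1) + rowsum(y, 0..x) builds the slanted integral image of Equation 5 in a single pass; the sketch below is only an illustration of this incremental idea, with off-image column indices clipped.

import numpy as np

def slanted_integral_image(N, alpha):
    """Equation 5 for an integer slant alpha, via the recurrence
    I_a(x, y) = I_a(x + alpha, y - 1) + rowsum(y, 0..x),
    so the cost matches the rectangular integral image."""
    h, w = N.shape
    row_prefix = N.cumsum(axis=1)
    I = np.zeros((h, w))
    I[0] = row_prefix[0]
    for y in range(1, h):
        for x in range(w):
            xs = x + alpha
            prev = 0.0 if xs < 0 else I[y - 1, min(xs, w - 1)]
            I[y, x] = prev + row_prefix[y, x]
    return I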
Adding two areas together with the same slant determines one end of a trape-
zoid with parallel horizontal sides (Figure 2, right); the other end is done sim-
ilarly, using a different slant. Each trapezoid requires three additions, just as
in the rectangular case. Finally, the polygonal filters can be decomposed into
1 (box), 2 (hexagon), and 3 (octagon) trapezoids, which is the relative cost of
computing these filters.
4 Experimental Results
We compare CenSurE-DOB and CenSurE-OCT to Harris, FAST, SIFT, and
SURF feature detectors for both image matching and visual odometry. Results
for image matching are presented in Section 4.1 and VO in Section 4.2.
¹ Available from http://www.robots.ox.ac.uk/~vgg/research/affine/
Fig. 4. Repeatability and number of correspondences for different detectors for the
graffiti and boat sequences. The number of features is the same for each detector. (a)
& (b) graffiti sequence. (c) & (d) boat sequence.
Fig. 5. Percentage of correct matches (percentage of inliers) as a function of search
radius, with the number of features fixed to 800, for Harris, FAST, SIFT, CenSurE-DOB,
SURF and CenSurE-OCT.
Fig. 6. Relative performance of features (FAST, Harris, SIFT, SURF, SURF+, DOB,
OCT) in terms of percent inliers and mean track length.
4. If the motion estimate is small and the percentage of inliers is large enough,
we discard the frame, since composing such small motions increases error. A
kept frame is called a key frame. The larger the distance between key frames,
the better the estimate will be.
5. The pose estimate is refined further in a sparse bundle adjustment (SBA)
framework [20,21].
The dataset for this experiment consists of 19K frames taken over the course of
a 3 km autonomous, rough-terrain run. The images have resolution 512×384, and
were taken at a 10 Hz rate; the mean motion between frames was about 0.1 m.
The dataset also contains RTK GPS readings synchronized with the frames, so
ground truth to within about 10 cm is available for gauging accuracy.
We ran each of the operators under the same conditions and parameters for
visual odometry, and compared the results. Since the performance of an operator
is strongly dependent on the number of features found, we set a threshold of 400
features per image, and considered the highest-ranking 400 features for each
operator. We also tried hard to choose the best parameters for each operator.
For example, for SURF we used doubled images and a subsampling factor of 1,
since this gave the best performance (labeled “SURF+” in the figures).
The first set of statistics shows the raw performance of the detector on two
of the most important performance measures for VO: the average percentage of
inliers to the motion estimate, and the mean track length for a feature (Figure 6).
In general, the scale-space operators performed much better than the simple
corner detectors. CenSurE-OCT did the best, beating out SURF by a small
margin. CenSurE-DOB is also a good performer, but suffers from lack of radial
symmetry. Surprisingly, SIFT did not do very well, barely beating Harris corners.
Note that the performance of the scale-space operators is sensitive to the
sampling density. For standard SURF settings (no doubled image, subsampling
of 2) the performance is worse than the corner operators. Only when sampling
densely for 2 octaves, by using doubled images and setting subsampling to 1, does
performance approach that of CenSurE-OCT. Of course, this mode is much more
expensive to compute for SURF (see Section 4.3).
Fig. 7. Accuracy statistics. Left: number of frames with inliers less than a certain
amount, out of 19K frames. For example, FAST and Harris both have around 50
frames with fewer than 30 inliers. Right: standard deviation from ground truth, over
trajectories of varying length.
The question to ask is: do these performance results translate into actual gains
in accuracy of the VO trajectory? We look at two measures of accuracy, the
number of frames with low inlier counts, and the deviation of the VO trajectory
from ground truth (Figure 7). The graph at the left of the figure can be used
to show how many frames are not matched, given a threshold for inliers. For
example, we typically use 30 inliers as a cutoff: any frames with fewer matches
are considered to have bad motion estimates. With this cutoff, SIFT, SURF+,
OCT, and DOB all have fewer than 10 missed frames, while Harris and FAST have
around 50. Showing the influence of its low-resolution localization, standard SURF
does very poorly here, as we expect from the previous performance graph.
Finally, we looked at the deviation of the VO estimates from ground truth,
for different trajectory lengths. At every 10 key frames along the VO trajectory,
we compared a trajectory of length N against the corresponding ground truth,
to give a dense sampling (about 1000 for each trajectory length). The standard
deviation is a measure of the goodness of the VO trajectory. Here, OCT, DOB
and Harris were all about equivalent, and gave the best estimates. Although
Harris does not do well in getting large numbers of inliers for difficult motions,
it is very well localized, and so gives good motion estimates. SIFT and SURF+
give equivalent results, and are penalized by their localization error.
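The interval statistics can be gathered roughly as sketched below. This is our own illustrative reading of the evaluation; the exact alignment and error metric used for the deviation from ground truth may differ, and the array names are hypothetical.

```python
import numpy as np

def interval_std(vo_xyz, gps_xyz, interval_m, stride=10):
    """Standard deviation of VO drift over sub-trajectories of roughly interval_m
    meters of ground-truth travel, started at every stride-th key frame.
    vo_xyz, gps_xyz: (K, 3) arrays of key-frame positions (VO estimate and RTK GPS).
    Illustrative metric only."""
    step = np.linalg.norm(np.diff(gps_xyz, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(step)])        # ground-truth path length
    errs = []
    for i in range(0, len(cum), stride):
        j = int(np.searchsorted(cum, cum[i] + interval_m))
        if j >= len(cum):
            break
        drift = (vo_xyz[j] - vo_xyz[i]) - (gps_xyz[j] - gps_xyz[i])
        errs.append(np.linalg.norm(drift))
    return float(np.std(errs)) if errs else float("nan")
```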
Overall, CenSurE-OCT gives the best results in terms of accurate motion esti-
mates, and misses very few frames. Harris does very well in accuracy of motion,
but misses a large number of frames. SURF+ is a reasonable performer in terms
of missed frames, but is not as accurate as the CenSurE or Harris features.
detector:    SURF+ 3408   SURF-1 292   SIFT 304   SURF 75   OCT 23   DOB 17   Harris 10
descriptor:  U-SURF 308   MU-SURF 16
SURF has default parameters (no doubled image, subsampling of 2), whereas
SURF-1 has subsampling set to 1, and SURF+ is SURF-1 with a doubled image.
For the descriptor, both U-SURF and MU-SURF are given the same features
(about 1000 in number).
For VO the best performance is with SURF+. In this case, CenSurE-OCT
yields more than a hundred-fold improvement in timing. Our MU-SURF is also
more than twenty times faster than U-SURF. It is clear that feature detection
using CenSurE features and matching using MU-SURF descriptors can be easily
accomplished in real time.
5 Conclusion
We have presented two variants of center-surround feature detectors (CenSurE)
that outperform other state-of-the-art feature detectors for image registration in
general and visual odometry in particular. CenSurE features are computed at
the extrema of the center-surround filters over multiple scales, using the original
image resolution for each scale. They are an approximation to the scale-space
Laplacian of Gaussian and can be computed in real time using integral images.
Not only are CenSurE features efficient, but they are distinctive, stable and
repeatable in changes of viewpoint. For visual odometry, CenSurE features result
in longer track lengths, fewer frames where images fail to match, and better
motion estimates.
We have also presented a modified version of the upright SURF descriptor
(MU-SURF). Although the basic idea is the same as the original SURF descriptor,
we have modified it so as to handle the boundaries better, and it is also faster. It
has been our experience that MU-SURF is well suited for visual odometry and
performs much better than normalized cross-correlation without much compu-
tational overhead.
CenSurE is in constant use on our outdoor robots for localization; our goal
is to ultimately be able to do visual SLAM in real time. Toward this end, we
are exploiting CenSurE features to recognize landmarks and previously visited
places in order to perform loop closure.
References
1. Konolige, K., Agrawal, M., Solà, J.: Large scale visual odometry for rough terrain.
In: Proc. International Symposium on Robotics Research (November 2007)
2. Agrawal, M., Konolige, K.: Real-time localization in outdoor environments using
stereo vision and inexpensive GPS. In: ICPR (August 2006)
3. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision
Conference, pp. 147–151 (1988)
4. Shi, J., Tomasi, C.: Good features to track. In: Proc. Computer Vision and Pattern
Recognition (CVPR) (1994)
5. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In:
European Conference on Computer Vision, vol. 1 (2006)
6. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking.
In: International Conference on Computer Vision (ICCV), vol. 2, pp. 1508–1515 (2005)
7. Mouragnon, E., Lhuillier, M., Dhome, M., Dekeyser, F., Sayd, P.: Real time local-
ization and 3D reconstruction. In: CVPR, vol. 1, pp. 363–370 (June 2006)
8. Nister, D., Naroditsky, O., Bergen, J.: Visual odometry. In: Proc. IEEE Conference
on Computer Vision and Pattern Recognition (June 2004)
9. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden,
A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350.
Springer, Heidelberg (2002)
10. Lindeberg, T.: Feature detection with automatic scale selection. International Jour-
nal of Computer Vision 30(2) (1998)
11. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision 60(2), 91–110 (2004)
12. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffal-
itzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. IJCV,
43–72 (2005)
13. Bay, H., Tuytelaars, T., Gool, L.V.: SURF: Speeded up robust features. In: European
Conference on Computer Vision (May 2006)
14. Pei, S.C., Horng, J.H.: Design of FIR bilevel Laplacian-of-Gaussian filter. Signal
Processing 82, 677–691 (2002)
15. Grabner, M., Grabner, H., Bischof, H.: Fast approximated SIFT. In: Narayanan,
P.J., Nayar, S.K., Shum, H.-Y. (eds.) ACCV 2006. LNCS, vol. 3851, pp. 918–927.
Springer, Heidelberg (2006)
16. Viola, P., Jones, M.: Robust real-time face detection. In: ICCV 2001 (2001)
17. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object
detection. In: IEEE Conference on Image Processing (ICIP) (2002)
18. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In:
International Conference on Computer Vision (ICCV) (2001)
19. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting
with application to image analysis and automated cartography. Commun. ACM 24,
381–395 (1981)
20. Engels, C., Stewénius, H., Nister, D.: Bundle adjustment rules. Photogrammetric
Computer Vision (September 2006)
21. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment
- a modern synthesis. In: Vision Algorithms: Theory and Practice. LNCS, pp. 298–
375. Springer, Heidelberg (2000)
Searching the World’s Herbaria: A System for
Visual Identification of Plant Species
1 Introduction
We have built a hand-held botanical identification system for use by botanists at
the Smithsonian Institution. Employing customized computer vision algorithms,
our system significantly speeds up the process of plant species identification.
The system requires only that the user photograph a leaf specimen, returning
within seconds images of the top matching species, along with supporting data
such as textual descriptions and high resolution type specimen images. By using
our system, a botanist in the field can quickly search entire collections of plant
species—a process that previously took hours can now be done in seconds.
To date, we have created three datasets for the system: one that provides
complete coverage of the flora of Plummers Island (an island in the Potomac
River owned by the National Park Service); a second that covers all woody
plants in published flora of the Baltimore-Washington, DC area; and a nearly
complete third dataset that covers all the trees of Central Park in NYC. The
Fig. 1. Left: A computer vision system for identifying temperate plants on the botani-
cally well-studied Plummers Island, Maryland, USA. Right: Congressman John Tanner
tries an augmented reality version of the system.
1.1 Motivation
Botanists in the field are racing to capture the complexity of the Earth’s flora
before climate change and development erase their living record. To greatly
speed up the process of plant species identification, collection, and monitoring,
botanists need to have the world’s herbaria at their fingertips. Tools are needed
to make the botanical information from the world’s herbaria accessible to anyone
with a laptop or cell phone, whether in a remote jungle or in NYC’s Central Park.
Only recently has the data required to produce these tools been made avail-
able. Volumes of biological information are just now going on-line: natural history
museums have recently provided on-line access to hundreds of thousands of im-
ages of specimens, including our own work in helping to digitize the complete
Type Specimen Collection of the US National Herbarium. These massive digiti-
zation efforts could make species data accessible to all sorts of people including
non-specialists, anywhere in the world.
Yet there is a critical shortfall in all these types of natural databases: finding
a species quickly requires that the searcher know in advance the name of the
species. Computer vision algorithms can remove this obstacle, allowing a user to
search through this data using algorithms that match images of newly collected
specimens with images of those previously discovered and described. Without
such tools, a dichotomous key must be painfully navigated to search the many
Fig. 2. A flow diagram of our plant identification system. A leaf from an unknown
species of plant is photographed by the user. The system then segments the leaf image
from its background, computes the IDSC shape representation used for matching, and
then displays the top matches, as they are computed.
branches and seemingly endless nodes of the taxonomic tree. The process of
identifying a single species using keys may take hours or days, even for specialists,
and is exceedingly difficult, if not impossible, for non-scientists.
2 Related Work
2.1 Massive Digitization Efforts
The amount of digital information available on-line has recently increased dra-
matically. For example, our group has digitally photographed (at high
resolution) each of the 90,000 type specimens of vascular plants in the US
National Herbarium at the Smithsonian, where the images are now available
at http://botany.si.edu/types/. Complementary efforts include those of the
New York Botanical Garden (120,000 high resolution images), the Royal Botan-
ical Gardens, Kew (50,000 images, including 35,000 images of type specimens),
and the Missouri Botanical Garden (35,000 images of plants). Recently, a con-
sortium of museums and research institutions announced the creation of the
Encyclopedia of Life (http://www.eol.org) to someday house a webpage for
each species of organism on Earth.
3 Datasets
An important objective of our project is the development of standard, compre-
hensive datasets of images of individual leaves. Currently, the only large leaf
image dataset available to vision researchers is a collection of 15 species with 75
leaf images per species (Söderkvist [20]). This dataset is useful, but insufficient
for testing large-scale recognition algorithms needed for species identification.
The datasets that we have collected have an order of magnitude more species
and are well suited for testing the scalability of recognition algorithms. They also
provide complete coverage of species in a geographical area. We have made them
available for research use at http://herbarium.cs.columbia.edu/data.php.
Leaves were collected by field botanists covering all plant species native to a
particular region, and entered in the collections of the US National Herbarium.
The number of leaves per species varied with availability, but averaged about
30. After collection, each leaf was flattened by pressing and photographed with
a ruler and a color chart for calibration. Each side of each leaf was photographed
with top and bottom lighting. The leaf images were then automatically resized to
a maximum side dimension of 512 pixels. Because manual processing of multiple,
Finally, it is often critical for botanists to access more complete type specimens
when identifying species. When a new species is discovered, a cutting of branches,
leaves, and possibly flowers and fruit is collected. This specimen becomes the type
specimen that is then used as the definitive representative of the species. Type
specimens are stored in herbaria around the world. As part of this work, we have
helped to complete the digitization of the complete Type Specimen collection of
vascular plants at the US National Herbarium:
4 Segmentation
In our automatic identification system, a user photographs a leaf so that its
shape may be matched to known species. To extract leaf shape, we must begin
by segmenting the leaf from its background. While segmentation is a well-studied
and difficult problem, we can simplify it in our system by requiring the user to
photograph an isolated leaf on a plain white background. However, while we can
require users to avoid complex backgrounds and extreme lighting conditions, a
useful segmentation algorithm must still be robust to some lighting variations
across the image and to some shadows cast by leaves.
Unfortunately, there is no single segmentation algorithm that is universally
robust and effective for off-the-shelf use. We have experimented with a number
of approaches and achieved good performance using a color-based EM algorithm
Fig. 3. The first and third images show input to the system, to the right of each
are segmentation results. We first show a typical, clean image, and then show that
segmentation also works with more complex backgrounds.
(see, e.g., Forsyth and Ponce [9]). To begin, we map each pixel to HSV color
space. Interestingly, we find that it is best to discard the hue, and represent
each pixel with saturation and value only. This is because in field tests in the
forest, we find that the light has a greenish hue that dominates the hue of an
otherwise white background. We experimented with other representations, and
colored paper backgrounds of different hues, but found that they presented some
problems in separating leaves from small shadows they cast.
Once we map each pixel to a 2D saturation-value space, we use EM to separate
pixels into two groups. First, during clustering we discard all pixels near the
boundary of the image, which can be noisy. We initialize EM using K-means
clustering with k = 2. We initialize K-means by setting the background cluster
to the median of pixels near the boundary, and setting the foreground cluster
to the mean of the central pixels. Then, in order to make the segmentation
real-time, we perform EM using 5% of the image pixels. Finally, we classify all
pixels using the two resulting Gaussian distributions. The leaf was identified as
the largest connected component of the foreground pixels, excluding components
that significantly overlap all sides of the image (sometimes, due to lighting effects,
the foreground pixels consist of the leaf and a separate connected component that
forms a band around the image). In sum, getting effective results with an EM-
based approach has required careful feature selection, initialization, sampling,
and segment classification. Figure 3 shows sample results.
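A compact sketch of this pipeline is given below, using scikit-learn's GaussianMixture in place of the hand-rolled K-means-initialized EM described above. It is an approximation for illustration, not the authors' code, and the function and parameter names are ours.

```python
import numpy as np
from skimage import color, measure
from sklearn.mixture import GaussianMixture

def segment_leaf(rgb, border=10, sample_frac=0.05, seed=0):
    """Two-component EM segmentation in saturation-value space (sketch)."""
    sv = color.rgb2hsv(rgb)[..., 1:3].reshape(-1, 2)   # discard hue; keep saturation, value
    h, w = rgb.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    interior = ((yy >= border) & (yy < h - border) &
                (xx >= border) & (xx < w - border)).ravel()
    bg_init = np.median(sv[~interior], axis=0)         # background: median of border pixels
    fg_init = sv[interior].mean(axis=0)                # foreground: mean of central pixels
    rng = np.random.default_rng(seed)
    sample = rng.choice(np.flatnonzero(interior),
                        int(sample_frac * interior.sum()), replace=False)
    gmm = GaussianMixture(n_components=2, means_init=np.vstack([bg_init, fg_init]))
    gmm.fit(sv[sample])                                # EM on ~5% of the pixels
    labels = gmm.predict(sv).reshape(h, w)             # then classify every pixel
    fg_comp = np.argmin(np.linalg.norm(gmm.means_ - fg_init, axis=1))
    fg = labels == fg_comp
    # Keep the largest connected foreground component as the leaf; the exclusion of
    # components that hug the image border is omitted here for brevity.
    cc = measure.label(fg)
    sizes = np.bincount(cc.ravel())
    sizes[0] = 0
    return cc == sizes.argmax() if sizes.max() > 0 else fg
```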
Although we did not rigorously evaluate competing segmentation algorithms,
we would like to informally mention that we did encounter problems when at-
tempting to apply graph-based segmentation algorithms to these images (e.g.,
Shi and Malik [19], Galun et al. [10]). One reason for this is that these algo-
rithms have a strong bias to produce compact image segments. While this is
beneficial in many situations, it can create problems with leaves, in which the
stems and small leaflets or branches are often highly non-compact. The seg-
mentation algorithm that we use goes to the other extreme, and classifies every
pixel independently, with no shape prior, followed by the extraction of a single
connected component. It is an interesting question for future research to devise
segmentation algorithms that have shape models appropriate for objects such as
leaves that combine compact and thin, wiry structures with a great diversity of
shape.
5 Shape Matching
Our system produces an ordered list of species that are most likely to match
the shape of a query leaf. It must be able to produce comparisons quickly for
a dataset containing about 8,000 leaves from approximately 250 species. It is
useful if we can show the user some initial results within a few seconds, and the
top ten matches within a few seconds more. It is also important that we produce
the correct species within the top ten matches as often as possible, since we are
limited by screen size in displaying matches.
To perform matching, we make use of the Inner Distance Shape Context
(IDSC, Ling and Jacobs [13]), which has produced close to the best published re-
sults for leaf recognition, and the best results among those methods quick enough
to support real-time performance. IDSC samples points along the boundary of
a shape, and builds a 2D histogram descriptor at each point. This histogram
represents the distance and angle from each point to all other points, along a
path restricted to lie entirely inside the leaf shape. Given n sample points, this
produces n 2D descriptors, which can be computed in O(n3 ) time, using an all
pairs shortest path algorithm. Note that this can be done off-line for all leaves
in the dataset, and must be done on-line only for the query. Consequently, this
run-time is not significant.
To compare two leaves, each sample point in each shape is compared to all
points in the other shape, and matched to the most similar sample point. A
shape distance is obtained by summing the χ2 distance of this match over all
sample points in both shapes, which requires O(n2 ) time.
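The comparison step can be sketched as follows, assuming each shape is already represented by an n x bins array of IDSC histograms; the descriptor computation itself is omitted, and this is not the authors' code.

```python
import numpy as np

def shape_distance(desc_a, desc_b, eps=1e-10):
    """IDSC-style shape distance (sketch): each sample point is matched to the most
    similar point on the other shape, and the chi-squared distances of these matches
    are summed over the sample points of both shapes (O(n^2) in the sample count)."""
    a, b = desc_a[:, None, :], desc_b[None, :, :]
    d = 0.5 * np.sum((a - b) ** 2 / (a + b + eps), axis=2)   # pairwise chi^2, (n_a, n_b)
    return d.min(axis=1).sum() + d.min(axis=0).sum()

# Example with random descriptors: 64 sample points, 60 histogram bins each
da, db = np.random.rand(64, 60), np.random.rand(64, 60)
print(shape_distance(da, db))
```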
Since IDSC comparison is quadratic in the number of sample points, we would
like to use as few sample points as possible. However, IDSC performance de-
creases due to aliasing if the shape is under-sampled. We can reduce aliasing ef-
fects and boost performance by smoothing the IDSC histograms. To do this, we
compute m histograms by beginning sampling at m different, uniformly spaced
locations, and average the results. This increases the computation of IDSC for a
single shape by a factor of m. However, it does not increase the size of the final
IDSC, and so does not affect the time required to compare two shapes, which is
our dominant cost.
We use a nearest neighbor classifier in which the species containing the most
similar leaf is ranked first. Because the shape comparison algorithm does not
imbed each shape into a vector space, we use a nearest neighbor algorithm de-
signed for non-Euclidean metric spaces. Our distance does not actually obey the
triangle inequality because it allows many-to-one matching, and so it is not really
a metric (e.g., all of shape A might match part of C, while B matches a different
part of C, so A and B are both similar to C, but completely different from each
other). However, in a set of 1161 leaves, we find that the triangle inequality is
violated in only 0.025% of leaf triples, and these violations cause no errors in
the nearest neighbor algorithm we use, the AESA algorithm (Ruiz [17]; Vidal
[22]). In this method, we pre-compute and store the distance between all pairs
of leaves in the dataset. This requires O(N 2 ) space and time, for a dataset of N
leaves, which is manageable for our datasets. At run time, a query is compared
to one leaf, called a pivot. Based on the distance to the pivot, we can use the
triangle inequality to place upper and lower bounds on the distance to all leaves
and all species in the dataset. We select each pivot by choosing the leaf with
the lowest current upper bound. When one species has an upper bound distance
that is less than the lower bound to any other species, we can select this as the
best match and show it to the user. Continuing this process provides an ordered
list of matching species. In comparison to a brute force search, which takes nine
seconds with a dataset of 2004 leaves from 139 species, this nearest-neighbor
algorithm reduces the time required to find the ten best matching species by
a factor of 3, and reduces the time required to find the top three species by a
factor of 4.4.
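A sketch of the pivot-based search is given below. It is an illustrative rendering of this AESA-style pruning, not the authors' implementation; it assumes a precomputed leaf-to-leaf distance matrix D, an integer species label per leaf, and a lazily evaluated query-to-leaf distance, and it treats the distance as (approximately) metric, as discussed above.

```python
import numpy as np

def aesa_best_species(query_dist, D, species):
    """Return the species whose distance upper bound beats every other species'
    lower bound, using triangle-inequality pruning against the pivots."""
    n = D.shape[0]
    lo, hi = np.zeros(n), np.full(n, np.inf)
    done = np.zeros(n, dtype=bool)
    while not done.all():
        cand = np.flatnonzero(~done)
        p = cand[np.argmin(hi[cand])]               # pivot: lowest current upper bound
        d = query_dist(p)
        done[p] = True
        lo = np.maximum(lo, np.abs(D[p] - d))       # |d(q,p) - d(p,i)| <= d(q,i)
        hi = np.minimum(hi, D[p] + d)               # d(q,i) <= d(q,p) + d(p,i)
        lo[p] = hi[p] = d
        best = species[np.argmin(hi)]
        others = species != best
        if not others.any() or hi[species == best].min() < lo[others].min():
            return best                             # this species provably ranks first
    return species[np.argmin(hi)]

# Toy usage: 6 leaves from 3 species embedded on a line, query at 2.2
pts = np.array([0.0, 0.1, 2.0, 2.1, 5.0, 5.1])
species = np.array([0, 0, 1, 1, 2, 2])
D = np.abs(pts[:, None] - pts[None, :])
print(aesa_best_species(lambda i: abs(pts[i] - 2.2), D, species))   # -> 1
```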
We have tested our algorithm using both the Plummers Island and Baltimore-
Washington Woody Plants datasets. We perform a leave-one-out test, in which
each leaf is removed from the dataset and used as a query. Figure 4 shows per-
formance curves that indicate how often the correct species for a query is placed
among the top k matches, as k varies. In this experiment, we achieve best per-
formance using n = 256 sample points for IDSC. We reach nearly the same
performance by computing the histograms using n = 64 sample points averaged
over m = 16 starting points. The figure also shows that using n = 64 points
without smoothing significantly degrades performance. Using 64 sample points
is approximately 16 times faster than using 256 sample points. The correct an-
swer appears in the top ten about 95%–97% of the time for woody plants of
Baltimore-Washington and somewhat less (about 90% of the time) for the flora
of Plummers Island. This is in part because shape matching is not very effec-
tive at discriminating between different species of grass (which are not woody
plants). Overall, these results demonstrate effective performance. It seems that
most errors occur for species in which the overall leaf shape is not sufficiently
distinctive. We plan to address these issues by using additional cues, such as
small scale features of the leaf margin (e.g., toothed or smooth) and the shape
of the venation (vascular structure).
original image in the search results pane. Each species result provides access to
the matched leaf, type specimens, voucher images and information about the
species in a ZUI to support detailed visual inspection and comparison, which
is necessary when matching is imperfect. Selecting a match button associates a
given species with the newly collected specimen in the collection database. The
history pane displays a visual history of each collected leaf, along with access to
previous search results, also in a ZUI. This represents the collection trip, which
can be exported for botanical research, and provides a reference for previously
collected specimens. Making this data available improves the long term use of
the system by aiding botanists in their research.
LeafView was built with C#, MatLab, and Piccolo (Bederson, et al. [4]).
Our first versions of the hardware used a Tablet PC with a separate Wi-Fi
or Bluetooth camera and a Bluetooth WAAS GPS. However, feedback from
botanists during field trials made it clear that it would be necessary to trade
off the greater display area/processing power of the Tablet PC for the smaller
size/weight of an Ultra-Mobile PC (UMPC) to make possible regular use in the
field. We currently use a Sony VAIO VGN-UX390N, a UMPC with an integrated
camera and small touch-sensitive screen, and an external GPS.
[12]) and ARTag (Fiala [7]), with a Creative Labs Notebook USB 2.0 camera
attached to the head-worn display.
Our prototypes have been evaluated in several ways during the course of the
project. These include user studies of the AR system, field tests on Plummers
Island, and expert feedback, building on previous work (White et al. [24]). In
May 2007, both LeafView and a Tangible AR prototype were demonstrated
and used to identify plants during the National Geographic BioBlitz in Rock
Creek Park, Washington, DC, a 24-hour species inventory. Hundreds of people,
from professional botanists to amateur naturalists, school children to congress-
men, have tried both systems. While we have focused on supporting professional
botanists, people from a diversity of backgrounds and interests have provided
valuable feedback for the design of future versions.
One goal of our project is to provide datasets that can serve as a challenge
problem for computer vision. While the immediate application of such datasets
is the identification of plant species, the datasets also provide a rich source of
data for a number of general 2D and silhouette recognition algorithms.
In particular, our website includes three image datasets covering more than
500 plant species, with more than 30 leaves per species on average. Algorithms
for recognition can be tested in a controlled fashion via leave-one-out tests, where
the algorithms can train on all but one of the leaf images for each species and test
on the one that has been removed. The web site also contains separate training
and test datasets in order to make fair comparisons. Our IDSC code can also be
obtained there, and other researchers can submit code and performance curves,
which we will post. We hope this will pose a challenge for the community, to
find the best algorithms for recognition in this domain.
Note that our system architecture for the electronic field guide is modular,
so that we can (and will, if given permission) directly use the best performing
methods for identification, broadening the impact of that work.
8 Future Plans
To date, we have focused on three regional floras. Yet, our goal is to expand the
coverage of our system in temperate climates to include all vascular plants of
the continental U.S. Other than the efforts involved in collecting the single leaf
datasets, there is nothing that would prevent us from building a system for the
U.S. flora. The visual search component of the system scales well: search can
always be limited to consider only those species likely to be found in the current
location, as directed by GPS.
Acknowledgements
This work was funded in part by National Science Foundation Grant IIS-03-
25867, An Electronic Field Guide: Plant Exploration and Discovery in the 21st
Century, and a gift from Microsoft Research.
References
1. Abbasi, S., Mokhtarian, F., Kittler, J.: Reliable classification of chrysanthemum
leaves through curvature scale space. In: ter Haar Romeny, B.M., Florack, L.M.J.,
Viergever, M.A. (eds.) Scale-Space 1997. LNCS, vol. 1252, pp. 284–295. Springer,
Heidelberg (1997)
2. Agarwal, G., Belhumeur, P., Feiner, S., Jacobs, D., Kress, W.J., Ramamoorthi,
R., Bourg, N., Dixit, N., Ling, H., Mahajan, D., Russell, R., Shirdhonkar, S.,
Sunkavalli, K., White, S.: First steps towards an electronic field guide for plants.
Taxon 55, 597–610 (2006)
3. Bederson, B.: PhotoMesa: A zoomable image browser using quantum treemaps and
bubblemaps. In: Proc. ACM UIST 2001, pp. 71–80 (2001)
4. Bederson, B., Grosjean, J., Meyer, J.: Toolkit design for interactive structured
graphics. IEEE Trans. on Soft. Eng. 30(8), 535–546 (2004)
5. Belongie, S., Malik, J., Puzicha, J.: Shape Matching and Object Recognition Using
Shape Context. IEEE Trans. on Patt. Anal. and Mach. Intell. 24(4), 509–522 (2002)
6. Edwards, M., Morse, D.R.: The potential for computer-aided identification in bio-
diversity research. Trends in Ecology and Evolution 10, 153–158 (1995)
7. Fiala, M.: ARTag, a fiducial marker system using digital techniques. In: Proc.
CVPR 2005, pp. 590–596 (2005)
8. Felzenszwalb, P., Schwartz, J.: Hierarchical matching of deformable shapes. In:
Proc. CVPR 2007, pp. 1–8 (2007)
9. Forsyth, D., Ponce, J.: Computer vision: A modern approach. Prentice Hall, Upper
Saddle River (2003)
10. Galun, M., Sharon, E., Basri, R., Brandt, A.: Texture segmentation by multiscale
aggregation of filter responses and shape elements. In: Proc. CVPR, pp. 716–723
(2003)
11. Heidorn, P.B.: A tool for multipurpose use of online flora and fauna: The Biological
Information Browsing Environment (BIBE). First Monday 6(2) (2001),
http://firstmonday.org/issues/issue6 2/heidorn/index.html
12. Kato, H., Billinghurst, M., Poupyrev, I., Imamoto, K., Tachibana, K.: Virtual ob-
ject manipulation of a table-top AR environment. In: Proc. IEEE and ACM ISAR,
pp. 111–119 (2000)
13. Ling, H., Jacobs, D.: Shape Classification Using the Inner-Distance. IEEE Trans.
on Patt. Anal. and Mach. Intell. 29(2), 286–299 (2007)
14. Mokhtarian, F., Abbasi, S.: Matching shapes with self-intersections: Application
to leaf classification. IEEE Trans. on Image Processing 13(5), 653–661 (2004)
15. Nilsback, M., Zisserman, A.: A visual vocabulary for flower classification. In: Proc.
CVPR, pp. 1447–1454 (2006)
16. Pankhurst, R.J.: Practical taxonomic computing. Cambridge University Press,
Cambridge (1991)
17. Ruiz, E.: An algorithm for finding nearest neighbours in (approximately) constant
average time. Patt. Rec. Lett. 4(3), 145–157 (1986)
18. Saitoh, T., Kaneko, T.: Automatic recognition of wild flowers. Proc. ICPR 2, 2507–
2510 (2000)
19. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Trans. on Patt.
Anal. and Mach. Intell. 22(8), 888–905 (2000)
20. Söderkvist, O.: Computer vision classification of leaves from Swedish trees. Master
Thesis, Linköping Univ. (2001)
21. Stevenson, R.D., Haber, W.A., Morris, R.A.: Electronic field guides and user
communities in the eco-informatics revolution. Conservation Ecology 7(3) (2003),
http://www.consecol.org/vol7/iss1/art3
22. Vidal, E.: New formulation and improvements of the nearest-neighbour approx-
imating and eliminating search algorithm (AESA). Patt. Rec. Lett. 15(1), 1–7
(1994)
23. Wang, Z., Chi, W., Feng, D.: Shape based leaf image retrieval. IEE Proc. Vision,
Image and Signal Processing 150(1), 34–43 (2003)
24. White, S., Feiner, S., Kopylec, J.: Virtual vouchers: Prototyping a mobile aug-
mented reality user interface for botanical species identification. In: Proc. IEEE
Symp. on 3DUI, pp. 119–126 (2006)
25. White, S., Marino, D., Feiner, S.: Designing a mobile user interface for automated
species identification. In: Proc. CHI 2007, pp. 291–294 (2007)
A Column-Pivoting Based Strategy for
Monomial Ordering in Numerical Gröbner Basis
Calculations
1 Introduction
A large number of geometric computer vision problems can be formulated in
terms of a system of polynomial equations in one or more variables. A typical
example of this is minimal problems of structure from motion [1,2]. This refers to
solving a specific problem with a minimal number of point correspondences. Fur-
ther examples of minimal problems are relative motion for cameras with radial
distortion [3] or for omnidirectional cameras [4]. Solvers for minimal problems
are often used in the inner loop of a RANSAC engine to find inliers in noisy data,
which means that they are run repeatedly a large number of times. There is thus
a need for fast and stable algorithms to solve systems of polynomial equations.
Another promising, but difficult pursuit in computer vision (and other fields)
is global optimization for e.g. optimal triangulation, resectioning and funda-
mental matrix estimation. See [5] and references therein. In some cases these
1 This work has been funded by the Swedish Research Council through grant no. 2005-
3230 ’Geometry of multi-camera systems’ and grant no. 2004-4579 ’Image-Based
Localization and Recognition of Scenes’.
where the hk ∈ C[x] are any polynomials. The reason for studying the ideal I is
that it has the same set of zeros as (1).
Consider now the space of equivalence classes modulo I. This space is denoted
C[x]/I and referred to as the quotient space. Two polynomials f and g are said
to be equivalent modulo I if f = g + h, where h ∈ I. The logic behind this
definition is that we get true equality, f (x) = g(x) on zeros of (1).
To do calculations in C[x]/I it will be necessary to compute unique repre-
sentatives of the equivalence classes in C[x]/I. Let [·] : C[x] → C[x]/I denote
the function that takes a polynomial f and returns the associated equivalence
class [f ]. We would now like to compose [·] with a mapping C[x]/I → C[x] that
associates to each equivalence class a unique representative in C[x]. The com-
posed map C[x] → C[x] should in other words take a polynomial f and return
the unique representative $\overline{f}$ for the equivalence class [f] associated with f. As-
sume for now that we can compute such a mapping. This operation will here be
referred to as reduction modulo I.
A well known result from algebraic geometry now states that if the set of
equations (1) has r zeros, then C[x]/I will be a finite-dimensional linear space
with dimension r [8]. Moreover, an elegant trick based on calculations in C[x]/I
yields the complete set of zeros of (1) in the following way: Consider multipli-
cation by one of the variables xk . This is a linear mapping from C[x]/I to itself
and since we are in a finite-dimensional space, by selecting an appropriate basis,
this mapping can be represented as a matrix mxk . This matrix is known as the
action matrix and the eigenvalues of mxk correspond to xk evaluated at the
zeros of (1) [8]. Moreover, the eigenvectors of mxk correspond the vector of basis
monomials/polynomials evaluated at the same zeros and thus the complete set
of solutions can be directly read off from these eigenvectors. The action matrix
can be seen as a generalization of the companion matrix to the multivariate case.
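To make the companion-matrix analogy concrete, here is a minimal univariate illustration (our own example, not taken from the paper): for a single polynomial p, the action matrix of multiplication by x in C[x]/<p> is just the companion matrix, its eigenvalues are the zeros of p, and its eigenvectors are the basis monomials evaluated at those zeros.

```python
import numpy as np

# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3); linear basis {1, x, x^2}.
# Multiplication by x sends 1 -> x, x -> x^2, and x^2 -> x^3 = 6 - 11x + 6x^2
# (reduced modulo p), so the action matrix on the monomial vector (1, x, x^2) is:
m_x = np.array([[0.0,   1.0, 0.0],
                [0.0,   0.0, 1.0],
                [6.0, -11.0, 6.0]])

vals, vecs = np.linalg.eig(m_x)
print(np.sort(vals.real))   # -> [1. 2. 3.], the zeros of p
# Each eigenvector is proportional to (1, x, x^2) evaluated at the corresponding zero,
# e.g. the eigenvector for eigenvalue 2 is parallel to (1, 2, 4).
```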
Given a linear basis $B = \{[e_i]\}_{i=1}^{r}$ spanning C[x]/I, the action matrix $m_{x_k}$ is
computed by calculating xk ei for each of the basis elements ei . Performing this
operation is the difficult part in the process. Traditionally, the reduction has
been done by fixing a monomial ordering and then computing a Gröbner basis
G for I, which is a canonical set of polynomials that generate I. Computing $\overline{f}$
is then done by polynomial division by G (usually written $\overline{f}^{G}$).
We now make two important observations: (i) We are not interested in finding
the Gröbner basis per se; it is enough to get a well defined mapping $\overline{f}$ and (ii)
it suffices to calculate reduction modulo I on the elements $x_k e_i$, i.e. we do not
need to know what $\overline{f}$ is on all of C[x]. Note that if for some i, $x_k e_i \in B$ then
nothing needs to be done for that element. With this in mind, we denote by
$R = x_k B \setminus B$ the set of elements f for which we need to calculate representatives
$\overline{f}$ of their corresponding equivalence classes [f] in C[x]/I.
Calculating the Gröbner basis of I is typically accomplished by Buchberger’s
algorithm. This works well in exact arithmetic. However, in floating point arith-
metic Buchberger’s algorithm very easily becomes unstable. There exist some
attempts to remedy this [15,16], but for more difficult cases it is necessary to
study a particular class of equations (e.g. relative orientation for omnidirectional
cameras [4], optimal three view triangulation [6], etc.) and use knowledge of what
the structure of the Gröbner basis should be to design a special purpose Gröbner
basis solver [9].
In this paper we move away from the goal of computing a Gröbner basis for
I and focus on computing $\overline{f}$ for f ∈ R as mentioned above. However, it should
be noted that the computations we do much resemble those necessary to get a
Gröbner basis.
$$C X = 0, \qquad (3)$$
where $X = \begin{bmatrix} x^{\alpha_1} & \dots & x^{\alpha_n} \end{bmatrix}^t$ is a vector of monomials with the notation
$x^{\alpha_k} = x_1^{\alpha_{k1}} \cdots x_s^{\alpha_{ks}}$ and C is a matrix of coefficients. Elimination of leading terms now
translates to matrix operations and we then have access to a whole battery of
techniques from numerical linear algebra allowing us to perform many elimina-
tions at the same time with control on pivoting etc.
By combining this approach with knowledge about a specific problem obtained
in advance with a computer algebra system such as Macaulay2 [17] it is possible
to write down a fixed number of expansion/elimination steps that will generate
the necessary polynomials.
In this paper, we use a linear basis of monomials $B = \{x^{\alpha_1}, \dots, x^{\alpha_r}\}$ for
C[x]/I. Recall now that we need to compute $\overline{x_k x^{\alpha_i}}$ for $x_k x^{\alpha_i} \notin B$, i.e. for R.
This is the aim of the following calculations.
The E-monomials are not in the basis and do not need to be reduced so we
eliminate them by an LU decomposition on Cexp yielding
$$\begin{bmatrix} U_{E1} & C_{R1} & C_{B1} \\ 0 & U_{R2} & C_{B2} \end{bmatrix} \begin{bmatrix} X_E \\ X_R \\ X_B \end{bmatrix} = 0, \qquad (6)$$
where $U_{E1}$ and $U_{R2}$ are upper triangular. We can now discard the top rows of
the coefficient matrix, producing
$$\begin{bmatrix} U_{R2} & C_{B2} \end{bmatrix} \begin{bmatrix} X_R \\ X_B \end{bmatrix} = 0, \qquad (7)$$
from which we get the elements of the ideal I we need since equivalently, if the
submatrix $U_{R2}$ is of full rank, we have
$$X_R = -U_{R2}^{-1} C_{B2} X_B \qquad (8)$$
and then the R-monomials can be expressed uniquely in terms of the B-monomials.
As previously mentioned, this is precisely what we need to compute the action ma-
trix mxk in C[x]/I. In other words, the property of UR2 as being of full rank is
sufficient to get the operation $\overline{f}$ on the relevant part of C[x]. Thus, in designing
the set of monomials to multiply with (the first step in the procedure) we can use
the rank of UR2 as a criterion for whether the set is large enough or not. How-
ever, the main problem in these computations is that even if UR2 is in principle
invertible, it can be very ill conditioned.
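A schematic sketch of the elimination in (6)-(8) using a dense LU factorization is shown below. It is illustrative only (function and argument names are ours), and a practical solver must in addition verify the rank and conditioning of $U_{R2}$ and exploit the sparsity of the expanded coefficient matrix.

```python
import numpy as np
import scipy.linalg

def reduce_R_monomials(C_exp, n_E, n_R):
    """Express the R-monomials in terms of the B-monomials (sketch of Eqs. 6-8).

    C_exp has its columns ordered as [E-monomials | R-monomials | B-monomials];
    returns T such that X_R = T @ X_B on the zero set."""
    P, L, U = scipy.linalg.lu(C_exp)          # row-pivoted LU; U is upper trapezoidal
    U_R2 = U[n_E:n_E + n_R, n_E:n_E + n_R]    # upper triangular block of (7)
    C_B2 = U[n_E:n_E + n_R, n_E + n_R:]
    return -scipy.linalg.solve_triangular(U_R2, C_B2)   # Eq. (8)
```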
A technique introduced in [12], which alleviates much of these problems uses
basis selection for C[x]/I. The observation is that the right linear basis for C[x]/I
induces a reordering of the monomials, which has the potential to drastically
improve the conditioning of UR2 . Since Cexp depends on the data, the choice of
linear basis cannot be made on beforehand, but has to be computed adaptively
each time the algorithm is run. This leads to the difficult optimisation problem
of selecting a linear basis so as to minimize the condition number of UR2 . In [12]
this problem was addressed by making use of SVD providing a numerically stable,
but computationally expensive solution.
The advantage of the above exposition is that it makes explicit the dependence
on the matrix UR2 , both in terms of rank and conditioning. In particular, the
above observations leads to the new fast strategy for basis selection which is the
topic of the next section and a major contribution of this paper.
$$\begin{bmatrix} U_{R2} & C_{P2}\Pi \\ 0 & U \end{bmatrix} \begin{bmatrix} X_R \\ \Pi^t X_P \end{bmatrix} = 0, \qquad (12)$$
We observe that U is not square, and emphasize this by writing $U = \begin{bmatrix} U_{P3} & C_{B2} \end{bmatrix}$,
where $U_{P3}$ is square upper triangular. We also write $C_{P2}\Pi = \begin{bmatrix} C_{P4} & C_{B1} \end{bmatrix}$ and
$\Pi^t X_P = \begin{bmatrix} X_P & X_B \end{bmatrix}^t$, yielding
$$\begin{bmatrix} U_{R2} & C_{P4} & C_{B1} \\ 0 & U_{P3} & C_{B2} \end{bmatrix} \begin{bmatrix} X_R \\ X_P \\ X_B \end{bmatrix} = 0 \qquad (13)$$
and finally
$$\begin{bmatrix} X_R \\ X_P \end{bmatrix} = - \begin{bmatrix} U_{R2} & C_{P4} \\ 0 & U_{P3} \end{bmatrix}^{-1} \begin{bmatrix} C_{B1} \\ C_{B2} \end{bmatrix} X_B \qquad (14)$$
is the equivalent of (8) and amounts to solving r upper triangular equation
systems which can be efficiently done by back substitution.
The reason why QR factorization fits so nicely within this framework is that
it simultaneously solves the two tasks of reduction to upper triangular form and
numerically sound column permutation and with comparable effort to normal
Gaussian elimination.
Furthermore, QR factorization with column pivoting is a widely used and
well studied algorithm and there exist free, highly optimized implementations,
making this an accessible approach.
Standard QR factorization successively eliminates elements below the main
diagonal by multiplying from the left with a sequence of orthogonal matrices
(usually Householder transformations). For matrices with more columns than
rows (under-determined systems) this algorithm can produce a rank-deficient
U which would then cause the computations in this section to break down.
QR with column pivoting solves this problem by, at iteration k, moving the
column with greatest 2-norm on the last m − k + 1 elements to position k and
then eliminating the last m − k elements of this column by multiplication with
an orthogonal matrix Qk .
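The sketch below shows how an off-the-shelf column-pivoted QR provides both the triangular factor and the permutation used in (12)-(14). It is our schematic reading for illustration only, not the paper's full pipeline, which also handles the R-block separately and applies adaptive truncation.

```python
import numpy as np
import scipy.linalg

def qr_basis_selection(C, n_basis):
    """Sketch: simultaneous reduction to triangular form and basis selection.

    C holds the coefficient columns of the candidate monomials (after the E-monomials
    have been eliminated).  Column pivoting reorders the columns by numerical
    suitability; the n_basis columns left at the end are kept as the linear basis of
    C[x]/I, and the remaining monomials are expressed in that basis by back
    substitution, cf. (14)."""
    Q, U, perm = scipy.linalg.qr(C, pivoting=True)        # C[:, perm] = Q @ U
    k = C.shape[1] - n_basis
    U_lead, C_basis = U[:k, :k], U[:k, k:]                # triangular part | basis columns
    T = -scipy.linalg.solve_triangular(U_lead, C_basis)
    return T, perm[k:]                                    # coefficients and basis column indices
```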
4 Experiments
The purpose of this section is to verify the speed and accuracy of the QR-
method. To this end, three different applications are studied. The first example is
relative pose for generalised cameras, first solved by Stewénius et al. in 2005 [19].
The second one is the previously unsolved minimal problem of pose estimation
with unknown focal length. The problem was formulated by Josephson et al.
in [20], but not solved in floating point arithmetic. The last problem is optimal
triangulation from three views [6].
Since the techniques described in this paper improve the numerical stability of
the solver itself, but do not affect the conditioning of the actual problem, there is
no point in considering the behavior under noise. Hence we will use synthetically
generated examples without noise to compare the intrinsic numerical stability of
the different methods.
In all three examples we compare with the “standard” method, by which we
mean to fix a monomial order (typically grevlex) and use the basis dictated
by that order together with straightforward Gauss-Jordan elimination to express
monomials in terms of the basis. Previous works have often used several ex-
pansion / elimination rounds. We have found this to have a negative effect on
numerical stability so to make the comparison fair, we have implemented the
standard method using a single elimination step in all cases.
For the adaptive truncation method, the threshold τ for the ratio between the
k-th diagonal element and the first was set to $10^{-8}$.
A generalised camera is a camera with no common focal point. This e.g. serves
as a useful model for several ordinary cameras together with fixed relative loca-
tions [21]. For generalised cameras there is a minimal case for relative pose with
two cameras and six points. This problem was solved in [19] and has 64 solu-
tions. In [12] this problem was used to show how the SVD-method improved the
numerics. We follow the methods of the later paper to get a single elimination
step. This gives an expanded coefficient matrix of size 101×165 with the columns
representing monomials up to degree eight in three variables. For details see [19]
and [12].
The examples for this experiment were generated by picking six points from
a normal distribution centered at the origin. Then six randomly chosen lines
through these points were associated with each camera. This made up two gener-
alised cameras with a relative orientation and translation.
Following this recipe, 10000 examples were generated and solved with the
standard, QR- and SVD-method. The angular errors between true and estimated
motion were measured. The results are shown in Figure 1.
The method with variable basis size was also implemented, but for this exam-
ple the UR2 (see Equation 7) part of the coefficient matrix was always reasonably
conditioned and hence the basis size was 64 in all 10000 test examples. There
were no large errors for either the SVD or the QR method.
Fig. 1. Error distributions (frequency vs. log10 of the angular error in degrees) for the
problem of relative pose with generalised cameras, for the standard, SVD- and QR-
methods. The SVD-method yields the best results but the faster QR-method is not far
behind and also eliminates all large errors.
This problem was introduced in [20]. The problem is to find the pose of a cali-
brated camera with unknown focal length. One minimal setup for this problem
is three point-correspondences with known world points and one correspondence
to a world line. The last feature is equivalent to having a point correspondence
with another camera. These types of mixed features are called hybrid features.
In [20], the authors propose a parameterisation of the problem but no solution
was given apart from showing that the problem has 36 solutions.
The parameterisation in [20] gives four equations in four unknowns. The un-
knowns are three quaternion parameters and the focal length. The equation
derived from the line correspondence is of degree 6 and those obtained from the
3D points are of degree 3. The coefficient matrix Cexp is then constructed by
expanding all equations up to degree 10. This means that the equation derived
from the line is multiplied with all monomials up to degree 4, but no single
variable in the monomials is of higher degree than 2. In the same manner the
point correspondence equations are multiplied with monomials up to degree 7
but no single variable of degree more than 5. The described expansion gives 980
equations in 873 monomials.
The next step is to reorder the monomials according to (5). In this problem
$C_P$ corresponds to all monomials up to degree 4 except $f^4$, where f is the focal
length; this gives 69 columns in $C_P$. The part $C_R$ corresponds to the 5th degree
monomials that appear when the monomials in B are multiplied with the first
of the unknown quaternion parameters.
For this problem, we were not able to obtain a standard numerical solver. The
reason for this was that even going to significantly higher degrees than mentioned
above, we did not obtain an invertible UR2 . In fact, with an exact linear basis
(same number of basis elements as solutions), even the QR and SVD methods
failed and truncation had to be used.
In this example we found that increasing the linear basis of C[x]/I by a few
elements over what was produced by the adaptive criterion was beneficial for the
stability. In this experiment, we added three basis elements to the automatically
produced basis. To get a working version of the SVD solver we had to adapt the
truncation method to the SVD case as well. We did this by looking at the ratio
of the singular values.
The synthetic experiments for this problem were generated by randomly draw-
ing four points from a cube with side length 1000 centered at the origin and two
cameras with a distance of approximately 1000 to the origin. One of these cam-
eras was treated as unknown and one was used to get the camera to camera
point correspondence. This gives one unknown camera with three point corre-
spondences and one line correspondence. The experiment was run 10000 times.
In Figure 2 (right) the distribution of basis sizes is shown for the QR-method.
For the SVD-method the basis size was identical to the QR-method in over 97%
of the cases and never differed by more than one element.
Figure 2 (left) gives the distribution of relative errors in the estimated focal
length. It can be seen that both the SVD-method and the faster QR-method
Fig. 2. Left: Relative error in focal length for pose estimation with unknown focal
length. Both the SVD- and QR-methods use adaptive truncation. Right: The size of
the adaptively chosen basis for the QR-method. For the SVD-method the size differs
from this in less than 3% of the cases and by at most one element.
give useful results. We emphasize that we were not able to construct a solver
with the standard method and hence no error distribution for that method is
available.
distance 1000 from origin and the focal lengths were also set to around 1000. The
error in 3D placement over 10000 iterations is shown in Figure 3. It can be seen
that the QR-method is almost as accurate as the SVD-method.
One important property of a solver is that the number of large errors is small.
Thus in Table 1 the number of large errors are shown. The results show that the
QR-method is better at suppressing large errors, probably due to the variable
size of the basis.
Fig. 3. The distribution of the error in 3D placement of the unknown point using op-
timal three view triangulation. The experiment was run 10000 times. The QR-method
gives nearly identical results compared to the SVD-method.
Table 1. Number of errors larger than some levels. This shows that the QR-method
gives fewer large errors probably due to the variable size of the basis.
Table 2. Number of times a certain basis size appears in 10000 iterations. The largest
basis size obtained in the experiment was 66.
Basis size 50 51 52 53 54 55 ≥ 56
# 9471 327 62 34 26 17 58
In the problem of optimal three view triangulation the execution times for the
three different algorithms were measured. Since the implementations were done
in Matlab it was necessary to take care to eliminate the effect of Matlab being an
interpreted language. To do this only the time after construction of the coefficient
matrix was taken into account. This is because the construction of the coefficient
matrix essentially amounts to copying coefficients to the right places which can
be done extremely fast in e.g. a C language implementation.
In the routines that were measured no subroutines were called that were not
built-in functions in Matlab. The measurements were done with Matlab’s profiler.
The time measurements were done on an Intel Core 2 2.13 GHz machine
with 2 GB memory. Each algorithm was executed with 1000 different coefficient
matrices, these were constructed from the same type of scene setup as in the
previous section. The same set of coefficient matrices was used for each method.
The result is given in Table 3. Our results show that the QR-method with adap-
tive truncation is approximately four times faster than the SVD-method but
40% slower than the standard method. It should however be noted that here,
the standard method is by far too inaccurate to be of any practical value.
Table 3. Time consumed in the solver part for the three different methods. The time
is an average over 1000 calls.
5 Conclusions
In this paper we have presented a new fast strategy for improving numerical
stability of Gröbner basis polynomial equation solvers. The key contribution is a
clarification of the exact matrix operations involved in computing an action ma-
trix for C[x]/I and the use of numerically sound QR factorization with column
pivoting to obtain a simultaneous basis selection for C[x]/I and reduction to up-
per triangular form. We demonstrate a nearly fourfold decrease in computation
time compared to the previous SVD based method while retaining good nu-
merical stability. Moreover, since the method is based on the well studied, freely
available QR algorithm it is reasonably simple to implement and not much slower
than using no basis selection at all.
The conclusion is thus that whenever polynomial systems arise and numerical
stability is a concern, this method should be of interest.
References
1 Introduction
Establishing correspondences between image pairs is one of the fundamental and
crucial issues for many vision problems. Although the development of various
kinds of local invariant features [1,2,3] have brought about notable progress
in this area, their local ambiguities remain hard to resolve. Thus, domain
specific knowledge or human supervision has been generally required for accurate
matching. Obviously, the best promising strategy to eliminate the ambiguities
from local feature correspondences is to go beyond locality [4,5,6,7]. The larger
image regions we exploit, the more reliable correspondences we can obtain. In
this work we propose a novel data-driven Monte Carlo framework to augment
naive local region correspondences to reliable object-level correspondences in an
arbitrary image pair. Our method establishes multiple coherent clusters of dense
correspondences to achieve recognition and segmentation of multiple common
objects without any prior knowledge of specific objects.
For the purpose, we introduce a perceptually meaningful entity, which can be
interpreted as a common object or visual pattern. We will refer to the entity in an
image pair as a Maximal Common Saliency (MCS) and define it as follows: (1) An
MCS is a semi-global region pair, composed of local region matches between the
image pair. (2) The region pair should be mutually consistent in geometry and
photometry. (3) Each region of the pair should be maximal in size. Now, the goal
of our work is defined to obtain the set of MCSs from an image pair. According
to the naming conventions of some related works [5,8], we term it co-recognition.
Fig. 1. Result of co-recognition on our dataset Mickey’s. Given an image pair, co-
recognition detects all Maximal Common Saliencies without any supervision or prior
knowledge. Each color represents the identity of an MCS, which corresponds to an object in
this case. Note that the book (blue) is separated by occlusion but identified as one
object. See the text for details.
(a) Overview of our approach (b) Initial matching and latent regions
Fig. 2. (a) Given two images, data-driven Monte Carlo image exploration solves co-
recognition problem of the image pair. See the text for details. (b) Top: Several different
types of local features can be used for initial matches. Bottom: Overlapping circular
regions are generated covering the whole reference image for latent regions.
Our method has been inspired by the image exploration method for object
recognition and segmentation, proposed by Ferrari et al [6]. The method is based
on propagating initial local matches to neighboring regions by their affine homog-
raphy. Even with few true initial matches, their iterative algorithm expands in-
liers and contracts outliers so that the recognition can be highly improved.2 The
similar correspondence growing approaches were proposed also in [4,7] for non-
rigid image registration. Our new exploration scheme leads the image exploration
strategy of [6] to unsupervised multi-object image matching by the Bayesian for-
mulation and the DDMCMC framework [9]. Therefore, the co-recognition prob-
lem addressed by this paper can be viewed as a generalization of several other
problems reported in the literature [5,6,8,11].
For latent regions, we generate a grid of overlapping circular regions covering the whole reference image. All
the overlapping regions are placed into a latent region set Λ, in which each
element region waits to be included in one of the existing clusters (Fig. 2(b)).
After these initialization steps, our data-driven Monte Carlo image exploration
algorithm starts to search for the set of MCSs by two pairs of reversible moves;
expansion/contraction and merge/split. In the expansion/contraction moves, a
cluster obtains a new match or loses one. In merge/split moves, two clusters are
combined into one cluster, or one cluster is divided into two clusters. Utilizing all
these moves in a stochastic manner, our algorithm traverses the solution space
efficiently to find the set of MCSs. The final solution is obtained by eliminating
trivial MCSs from the result.
Fig. 3. (a) Region 1 should be on the same side of the directed line from 2 to 3 in both images.
(b) Regions 4, 5, and 1 satisfy the sidedness constraint although they do not lie on the same object.
We can filter out this outlier triplet by checking whether the orientation changes (red arrows) in
the triplet are mutually consistent.
means that the side of c_j w.r.t. the directed line (c_k × c_l) should be the same
as the side of c_j' w.r.t. the directed line (c_k' × c_l') (Fig. 3(a)). This constraint
holds for all correctly matching triplets of coplanar regions. Since the sidedness
constraint is valid even for most non-planar regions, it is useful for sorting out
triplets on a common surface. As illustrated in Fig. 3(b), we reinforce it with
orientation consistency to deal with multiple common surfaces for our problem
as follows:
∀(m, n) ∈ {(j, k), (j, l), (k, l)}:  |angle(angle(o_m, o_m'), angle(o_n, o_n'))| < δ_ori,    (5)
where o_m denotes the dominant orientation of R_m in radians, o_m' that of its matched region, and angle(·, ·) is the function that computes the clockwise angle difference in radians. Hence, the
reinforced sidedness error with orientation consistency is defined by
err_side(R_j, R_k, R_l) = 0 if (4) and (5) hold, and 1 otherwise.    (6)
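To make the test concrete, the following sketch evaluates the reinforced sidedness check for a single triplet of matches. The data layout, the angle wrapping, and the threshold delta_ori are illustrative assumptions, not the authors' implementation.

# Hedged sketch of err_side for one triplet of region matches (eqs. (4)-(6)).
import numpy as np

def cross2(u, v):
    # z-component of the 2D cross product
    return u[0] * v[1] - u[1] * v[0]

def side(cj, ck, cl):
    # Which side of the directed line ck -> cl the point cj lies on.
    return np.sign(cross2(cl - ck, cj - ck))

def angle_diff(a, b):
    # Clockwise angle difference wrapped to (-pi, pi].
    d = (a - b) % (2.0 * np.pi)
    return d - 2.0 * np.pi if d > np.pi else d

def err_side(c, c2, o, o2, delta_ori=np.pi / 6.0):
    # c, c2: dicts of 2D centers (np.array) in the two images, keys 'j','k','l'.
    # o, o2: dominant orientations (radians) of the regions and their matches.
    # delta_ori is an illustrative threshold, not a value from the paper.
    if side(c['j'], c['k'], c['l']) != side(c2['j'], c2['k'], c2['l']):
        return 1                                   # sidedness (4) violated
    for m, n in [('j', 'k'), ('j', 'l'), ('k', 'l')]:
        dm = angle_diff(o[m], o2[m])               # orientation change of R_m
        dn = angle_diff(o[n], o2[n])               # orientation change of R_n
        if abs(angle_diff(dm, dn)) >= delta_ori:
            return 1                               # orientation consistency (5) violated
    return 0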
A triplet violating the reinforced sidedness constraint has a higher chance of containing
one or more mismatches. The geometric error of R_j (∈ Γ_i) is defined by the
share of violations in its own cluster:
err_geo(R_j) = (1/v) Σ_{R_k, R_l ∈ Γ_i∖R_j, k>l} err_side(R_j, R_k, R_l),    (7)
where v = (L_i − 1)(L_i − 2)/2 is the normalization factor that counts the maximum
number of violations. When L_i < 3, err_geo(R_j) is defined as 1 if the cluster
Γ_i (∋ R_j) violates the orientation consistency, and 0 otherwise.
The geometric error of a cluster is then defined as the sum of the errors of all its members:
err_geo(Γ_i) = Σ_{j=1}^{L_i} err_geo(R_j).    (8)
matches in each cluster, since all the latent regions have the same area and their
number is constant after initialization. The maximality error is formulated as
err_maxi(θ) = Σ_{i=1}^{K} [ (L_i/N)^{0.8} − L_i/N ],    (9)
where N is the initial size of the latent region set Λ. The first term encourages the
clusters of θ to merge, and the second term makes each cluster of θ expand.
dissim(R_1, R_2) = 1 − NCC(R_1, R_2) + dRGB(R_1, R_2)/100,    (10)
where NCC is the normalized cross-correlation between the gray-level patterns, while
dRGB is the average pixel-wise Euclidean distance in RGB color space after in-
dependent normalization of the 3 color bands for photometric invariance [6]. R_1
and R2 are normalized to unit circles with the same orientation before compu-
tation. Since a cluster of matches should have low dissimilarity in each match,
the overall photometric error of a cluster is defined as
err_photo(Γ_i) = Σ_{j=1}^{L_i} dissim(R_j, R_j')².    (11)
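A minimal sketch of the per-match dissimilarity of (10) and the cluster error of (11) is given below; the two patches are assumed to be already warped to a common circular frame with the same orientation, and the normalization details are illustrative.

# Hedged sketch of dissim (eq. (10)) and err_photo (eq. (11)).
import numpy as np

def ncc(a, b):
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return float((a * b).mean())

def dissim(patch1, patch2):
    # patch1, patch2: HxWx3 RGB arrays sampled from the two matched regions.
    gray1, gray2 = patch1.mean(axis=2), patch2.mean(axis=2)
    def norm_bands(p):
        p = p.astype(float)
        # Independently normalize each color band for photometric invariance.
        return (p - p.mean(axis=(0, 1))) / (p.std(axis=(0, 1)) + 1e-9)
    d_rgb = np.linalg.norm(norm_bands(patch1) - norm_bands(patch2), axis=2).mean()
    return 1.0 - ncc(gray1, gray2) + d_rgb / 100.0

def err_photo(cluster_patch_pairs):
    # cluster_patch_pairs: list of (patch, matched_patch) for one cluster.
    return sum(dissim(p, q) ** 2 for p, q in cluster_patch_pairs)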
From (8), (9), and (12), the MCSs in a given image pair I can be obtained by
maximizing the following posterior probability:
p(θ|I) ∝ exp( −λ_geo Σ_{i=1}^{K} err_geo(Γ_i) − λ_maxi err_maxi(θ) − λ_photo Σ_{i=1}^{K} err_photo(Γ_i) ).    (13)
This posterior probability reflects how well the solution generates the set of
MCSs from the given image pair.
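In other words, the solution quality is scored by a weighted sum of the three error terms. The sketch below evaluates the unnormalized log-posterior of (13); the weights lam_* are placeholders, not values used by the authors.

# Hedged sketch of the unnormalized log-posterior of eq. (13).
import numpy as np

def log_posterior(err_geo_per_cluster, err_photo_per_cluster, err_maxi,
                  lam_geo=1.0, lam_maxi=1.0, lam_photo=1.0):
    # err_geo_per_cluster, err_photo_per_cluster: arrays of per-cluster errors.
    return (-lam_geo * np.sum(err_geo_per_cluster)
            - lam_maxi * err_maxi
            - lam_photo * np.sum(err_photo_per_cluster))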
Fig. 4. (a) At the top, a support match (red dotted) propagates one of the latent regions
(blue solid) by affine homography F. At the bottom, by adjusting the parameter of the
ellipse, the initially propagated region (blue dotted) is refined into the more accurate
region (green solid). (b) Each of the present clusters has its own mergence tree, which
stores hierarchical information about its preceding clusters. This makes it possible to propose
simple and reversible merge/split moves at low cost.
Σ_{R∈Λ} exp( −dist(R_j, R) / 2σ_expand² ), where dist(·, ·) denotes the Euclidean distance between
the region centers. In this stochastic selection, the supports that have more la-
tent regions at nearer distances are favored. Finally, a latent region R_k to propagate
by the support is chosen with the probability
q(R_k | R_j, Γ_i, expand) ∝ exp( −dist(R_k, R_j)² / 2σ_expand² ),
which means a preference for closer regions. In the contraction move, one match R_k in
the cluster is selected with the probability
q(R_k | Γ_i, contract) ∝ exp( (err_geo(R_k)² + err_photo(R_k)²) / 2σ_contract² ),
favoring the matches with higher error in geometry and photometry.
At each sampling step, the algorithm chooses a move m with probability q(m),
then the sub-kernel of the move m is performed. The proposed move along its
pathway is accepted with the acceptance probability (14). If the move is accepted,
the current state jumps from θ to θ . Otherwise, the current state is retained.
In the early stage of sampling, we perform only expansion/contraction moves
without merge/split moves, because the unexpanded clusters in the early stage
are prone to unreliable merge/split moves. After enough iterations, merge/split
moves are combined with expansion/contraction moves, helping the Markov chains
to have better chances of proposing reliable expansion/contraction moves and
estimating correct MCSs.
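Since the acceptance probability (14) is not reproduced here, the sketch below shows only a generic Metropolis-Hastings acceptance step of the kind used in DDMCMC samplers; the proposal log-densities are hypothetical inputs, not the paper's exact sub-kernel ratios.

# Generic Metropolis-Hastings acceptance step (schematic stand-in for (14)).
import math, random

def accept(log_post_new, log_post_old, log_q_forward, log_q_backward):
    # log_q_forward: log q(theta -> theta'); log_q_backward: log q(theta' -> theta).
    log_alpha = (log_post_new - log_post_old) + (log_q_backward - log_q_forward)
    return math.log(random.random() + 1e-300) < min(0.0, log_alpha)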
To evaluate the reliability of MCSs in the best sample θ∗ , we define the ex-
pansion ratio of an MCS as the expanded area of the MCS divided by the entire
image area. Since a reliable MCS is likely to expand enough, we determine the
reliable MCSs as those expanded more than the threshold ratio in both of two
images. This criterion of our method eliminates the trivial or false correspon-
dences effectively.
5 Experiments
We have conducted two experiments: (i) unsupervised recognition and segmen-
tation of multiple common objects and (ii) image retrieval for place recognition.
and complex clutter. The ground-truth segmentation of the common objects was
obtained manually5. Figures 5 and 1 show some of the co-recognition results on
this dataset. Each boundary color represents the identity of an MCS. The inferred
MCSs, their segmentations (the 2nd column), and their dense correspondences
(the 3rd column) are of good quality in all pairs of the dataset. On average,
the correct match ratio started at less than 5% for naive NN matches, grew to
42.2% after the initial matching step, and finally reached 92.8% in the final reliable
MCSs. The number of correct matches increased to 651%.
We evaluated segmentation accuracy by the hit ratio h_r and the background ratio
b_r.6 The results are summarized in Table 1, which also shows high accuracy in
segmentation. For example, the dataset Bulletins is borrowed from [11], and our
result of hr = 0.91, br = 0.17 is much better than their result of hr = 0.76, br =
0.29 in [11]. Moreover, note that our method provides object-level identities and
5 The dataset with ground truth is available at http://cv.snu.ac.kr/~corecognition.
6 h_r = |GroundTruth ∩ Result| / |GroundTruth|,  b_r = (|Result| − |Result ∩ GroundTruth|) / |Result|.
Fig. 6. Co-recognition on all combination pairs of 5 test images from the ETHZ Toys
dataset. Both the detection rate and the precision are 93%.
dense correspondences, which are not provided by the method of [11]. Most of
the over-expanded regions increasing the background ratio result from mutually
similar background regions.
To demonstrate the unsupervised detection performance of co-recognition in
view changes or deformation, we tested on all combination pairs of 5 complex
images from the ETHZ toys dataset7 . None of the model images in the dataset
are included in this experiment. As shown in Fig. 6, although this task is very
challenging even for human eyes, our method detected 13 true ones and 1 false
one among 14 common object correspondences in the combination pairs. The de-
tection rate and the precision are both 93%. Note that our method can recognize
the separate regions as one MCS if mutual geometry of the regions is consistent
according to the reinforced sidedness constraint (6). Thus, it can deal with com-
plex partial occlusion which separates the objects into fragments. This allows
us to estimate the correct number of identical entities from separate regions, as in the
results of Fig. 1 and Fig. 6.
7 http://www.robots.ox.ac.uk/~ferrari/datasets.html
6 Conclusion
We have presented the novel notion of co-recognition and an algorithm that
recognizes and segments all the common salient region pairs at their maximal
sizes in an arbitrary image pair. The problem is formulated as a Bayesian MAP
problem and the solution is obtained by our stochastic image exploration al-
gorithm using the DDMCMC paradigm. Experiments on challenging datasets show
promising results on the problem, some of which even humans cannot achieve
easily. The proposed co-recognition has various applications for high-level image
matching such as object-driven image retrieval.
8 http://research.microsoft.com/iccv2005/Contest/
Acknowledgements
This research was supported in part by the Defense Acquisition Program Admin-
istration and Agency for Defense Development, Korea, through the Image Infor-
mation Research Center under the contract UD070007AD, and in part by the
MKE (Ministry of Knowledge Economy), Korea under the ITRC (Information
Technology Research Center) Support Program supervised by the IITA (Institute
of Information Technology Advancement) (IITA-2008-C1090-0801-0018).
References
1. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp.
1150–1157 (1999)
2. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from
maximally stable extremal regions. In: BMVC (2002)
3. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden,
A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp.
128–142. Springer, Heidelberg (2002)
4. Vedaldi, A., Soatto, S.: Local features, all grown up. In: CVPR, pp. 1753–1760
(2006)
5. Toshev, A., Shi, J., Daniilidis, K.: Image matching via saliency region correspon-
dences. In: CVPR (2007)
6. Ferrari, V., Tuytelaars, T., Van Gool, L.: Simultaneous object recognition and segmen-
tation from single or multiple model views. IJCV 67(2), 159–188 (2006)
7. Yang, G., Stewart, C.V., Sofka, M., Tsai, C.-L.: Registration of challenging image
pairs: initialization, estimation, and decision. PAMI 29(11), 1973–1989 (2007)
8. Rother, C., Minka, T.P., Blake, A., Kolmogorov, V.: Cosegmentation of image pairs
by histogram matching - incorporating a global constraint into MRFs. In: CVPR,
pp. 993–1000 (2006)
9. Tu, Z., Chen, X., Yuille, A.L., Zhu, S.C.: Image parsing: unifying segmentation,
detection, and recognition. In: ICCV, vol. 1, pp. 18–25 (2003)
10. Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian
model determination. Biometrika 82, 711–732 (1995)
11. Yuan, J., Wu, Y.: Spatial random partition for common visual pattern discovery.
In: ICCV, pp. 1–8 (2007)
12. Simon, I., Seitz, S.M.: A probabilistic model for object recognition, segmentation,
and non-rigid correspondence. In: CVPR (2007)
13. Cho, M., Lee, K.M.: Partially occluded object-specific segmentation in view-based
recognition. In: CVPR (2007)
Movie/Script: Alignment and Parsing
of Video and Text Transcription
Abstract. Movies and TV are a rich source of diverse and complex video of peo-
ple, objects, actions and locales “in the wild”. Harvesting automatically labeled
sequences of actions from video would enable creation of large-scale and highly-
varied datasets. To enable such collection, we focus on the task of recovering
scene structure in movies and TV series for object tracking and action retrieval.
We present a weakly supervised algorithm that uses the screenplay and closed
captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries
in the movie are aligned with screenplay scene labels and shots are reordered
into a sequence of long continuous tracks or threads which allow for more ac-
curate tracking of people, actions and objects. Scene segmentation, alignment,
and shot threading are formulated as inference in a unified generative model and
a novel hierarchical dynamic programming algorithm that can handle alignment
and jump-limited reorderings in linear time is presented. We present quantitative
and qualitative results on movie alignment and parsing, and use the recovered
structure to improve character naming and retrieval of common actions in several
episodes of popular TV series.
1 Introduction
Hand-labeling images of people and objects is a laborious task that is difficult to scale
up. Several recent papers [1,2] have successfully collected very large-scale, diverse
datasets of faces “in the wild” using weakly supervised techniques. These datasets
contain a wide variation in subject, pose, lighting, expression, and occlusions which
is not matched by any previous hand-built dataset. Labeling and segmenting actions is
perhaps an even more painstaking endeavor, where curated datasets are more limited.
Automatically extracting large collections of actions is of paramount importance. In
this paper, we argue that using movies and TV shows precisely aligned with easily ob-
tainable screenplays can pave a way to building such large-scale collections. Figure 1
illustrates this goal, showing the top 6 retrieved video snippets for 2 actions (walk,
turn) in TV series LOST using our system. The screenplay is parsed into a temporally
aligned sequence of action frames (subject-verb-object), and matched to detected and
named characters in the video sequence. Simultaneous work [3] explores similar goals
in a more supervised fashion. In order to enable accurately localized action retrieval,
we propose a much deeper analysis of the structure and syntax of both movies and
transcriptions.
Fig. 1. Action retrieval using alignment between video and parsed screenplay. For each ac-
tion verb (top: walk, bottom: turn), we display the top 6 retrieved video snippets in TV se-
ries LOST using our system. The screenplay and closed captions are parsed into a tempo-
rally aligned sequence of verb frames (subject-verb-object), and then matched to detected and
named characters in the video sequence. The third retrieval, second row (“Jack turns”) is
counted as an error, since the face shows Boone instead of Jack. Additional results appear under
www.seas.upenn.edu/~timothee.
Movies, TV series, news clips, and nowadays plentiful amateur videos, are designed
to effectively communicate events and stories. A visual narrative is conveyed from mul-
tiple camera angles that are carefully composed and interleaved to create seamless ac-
tion. Strong coherence cues and continuity editing rules are (typically) used to orient
the viewer, guide attention and help follow the action and geometry of the scene. Video
shots, much like words in sentences and paragraphs, must fit together to minimize per-
ceptual discontinuity across cuts and produce a meaningful scene. We attempt to un-
cover elements of the inherent structure of scenes and shots in video narratives. This
uncovered structure can be used to analyze the content of the video for tracking objects
across cuts, action retrieval, as well as enriching browsing and editing interfaces.
We present a framework for automatic parsing of a movie or video into a hierarchy
of shots and scenes and recovery of the shot interconnection structure. Our algorithm
makes use of both the input image sequence, closed captions and the screenplay of
the movie. We assume a hierarchical organization of movies into shots, threads and
scenes, where each scene is composed of a set of interlaced threads of shots with smooth
transitions of camera viewpoint inside each thread. To model the scene structure, we
propose a unified generative model for joint scene segmentation and shot threading.
We show that inference in the model to recover latent structure amounts to finding
a Hamiltonian path in the sequence of shots that maximizes the “head to tail” shot
similarity along the path, given the scene boundaries. Finding the maximum weight
Hamiltonian path (reducible to the Traveling Salesman Problem or TSP) is intractable
in general, but in our case, limited memory constraints on the paths make it tractable.
In fact we show how to jointly optimize scene boundaries and shot threading in linear
time in the number of shots using a novel hierarchical dynamic program.
We introduce textual features to inform the model with scene segmentation, via tem-
poral alignment with screenplay and closed captions, see figure 2. Such text data has
been used for character naming [4,5] and is widely available, which makes our approach
applicable to a large number of movies and TV series. In order to retrieve temporally-
aligned actions, we delve deeper into resolving textual ambiguities with pronoun reso-
lution (determining whom or what ‘he’, ‘she’, ‘it’, etc. refer to in the screenplay) and
extraction of verb frames. By detecting and naming characters, and resolving pronouns,
we show promising results for more accurate action retrieval for several common verbs.
We present quantitative and qualitative results for scene segmentation/alignment, shot
segmentation/threading, tracking and character naming across shots and action retrieval
in numerous episodes of popular TV series, and illustrate that shot reordering provides
much improved character naming.
The main contributions of the paper are: 1) novel probabilistic model and inference
procedure for shot threading and scene alignment driven by text, 2) extraction of verb
frames and pronoun resolution from screenplay, and 3) retrieval of the corresponding
actions informed by scene structure and character naming.
The paper is organized as follows. Section 2 proposes a hierarchical organization of
movies into shots, threads and scenes. Sections 3 and 4 introduce a generative model
for joint scene segmentation and shot threading, and a hierarchical dynamic program to
solve it as a restricted TSP variant. Section 5 addresses the textual features used in our
model. We report results in section 6 and conclude in section 7.
(a) (b)
Fig. 2. (a) Alignment between video, screenplay and closed captions; (b) Deconstruction pipeline
Shot boundaries. The aim of shot segmentation is to segment the input frames into a
sequence of shots (single unbroken video recordings) by detecting camera viewpoint
discontinuities. A popular technique is to compute a set of localized color histograms
for each image and use a histogram distance function to detect boundaries [6,7].
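A minimal sketch of such a detector follows; the block layout, bin count, and threshold are illustrative choices rather than the settings of [6,7] or of this paper.

# Hedged sketch: shot boundary detection from localized color histograms.
import numpy as np

def frame_hist(frame, bins=8, grid=3):
    # frame: HxWx3 uint8 image; returns concatenated per-block RGB histograms.
    h, w, _ = frame.shape
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = frame[by*h//grid:(by+1)*h//grid, bx*w//grid:(bx+1)*w//grid]
            hist, _ = np.histogramdd(block.reshape(-1, 3),
                                     bins=(bins, bins, bins),
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / max(block.size // 3, 1))
    return np.concatenate(feats)

def chi2(h1, h2):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-9))

def shot_boundaries(frames, thresh=0.4):
    hists = [frame_hist(f) for f in frames]
    return [t for t in range(1, len(hists)) if chi2(hists[t-1], hists[t]) > thresh]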
Shot threads. Scenes are often modeled as a sequence of shots represented as letters:
ABABAB represents a typical dialogue scene alternating between two camera points
of view A and B. More complex patterns are usually observed and in practice, the clus-
tering of the shots into letters (camera angles/poses) is not always a very well defined
problem, as smooth transitions between shots occur. Nevertheless we assume in our
case that each shot in a scene is either a novel camera viewpoint or is generated from
(similar to) a previous shot in the scene. This makes weaker assumptions about the
scene construction and doesn’t require reasoning about the number of clusters. In the
example above, the first A and B are novel viewpoints, and each subsequent A and B is
generated by the previous A or B. Figure 5 shows a more complex structure.
Fig. 3. Graphical model for joint scene segmentation and shot reordering, see text for details
each scene. The shot appearance model P (si |spj [i] ) is treated next (we set it to uni-
form for the root of scene j where pj [i] = NULL). This model encourages (1) smooth
shot transitions within a scene and (2) scene breaks between shots with low similarity,
since the model doesn’t penalize transitions across scenes.
Shot appearance model (P(s_i' | s_i)). In order to obtain smooth transitions and al-
low tracking of objects throughout reordered shots, we require that P(s_i' | s_i) depends
on the similarity between the last frame of shot s_i (I = s_i^last) and the first frame of
shot s_i' (I' = s_i'^first). Treating each shot as a word in a finite set, we parameterize the
shot similarity term as P(s_i' | s_i) = exp(−d_shot(s_i, s_i')) / Σ_{i''} exp(−d_shot(s_i, s_i'')),
where d_shot(s_i, s_i') = d_frame(I, I') is the chi-squared distance in color histograms be-
tween frames I and I'. Note that d_shot(s_i, s_i') is not symmetric, even though d_frame(I, I') is.
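The asymmetric similarity can be turned into transition probabilities by a softmax over candidate successor shots, as in the hedged sketch below; frame_dist stands for any frame distance (for instance, the chi-squared histogram distance sketched earlier), and shots are assumed to be lists of frames.

# Hedged sketch of the asymmetric shot transition model P(s_i' | s_i).
import numpy as np

def d_shot(shot_i, shot_j, frame_dist):
    # Asymmetric: last frame of shot_i vs. first frame of shot_j.
    return frame_dist(shot_i[-1], shot_j[0])

def transition_probs(shots, i, frame_dist):
    d = np.array([d_shot(shots[i], s, frame_dist) for s in shots])
    d[i] = np.inf                       # no self-transition
    w = np.exp(-d)
    return w / w.sum()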
where W_{i,i'} = log P(s_i' | s_i) and π ∈ P_{[1,n]} denotes a permutation of [1, n] defined
recursively from the parent variable p as p[π_t] = π_{t−1}, with π_1 indicating
the root. This amounts to finding a maximum weight Hamiltonian Path or Traveling
Salesman Problem (TSP), with πt indicating which shot is visited at time t on a virtual
tour. TSPs are intractable in general, so we make one additional assumption restricting
the set of feasible permutations.
Fig. 4. Top: a feasible solution for the restricted TSP with k = 2. Bottom: an infeasible solution,
violating the precedence constraint (shaded cities). Middle: the constraint limits the range of the
permutation: π_t ∈ [t − (k − 1), t + (k − 1)]. Right: the constraint implies a banded structure on
the similarity matrix W = (W_{i,i'}): i − (2k − 3) ≤ i' ≤ i + 2k − 1.
Because of the precedence constraint, the pair (S, i') can take at most (k + 1)2^{k−2} pos-
sible values at any given time t (instead of (n−1 choose t−1) · n without the constraint). The idea
is to construct a directed weighted graph G_n^k with n layers of nodes, one layer per
position in the path, with paths in the graph joining layer 1 to layer n corresponding
to feasible Hamiltonian paths, and shortest paths joining layer 1 to n corresponding to
optimal Hamiltonian paths. Since there are at most k incoming edges per node (corre-
sponding to valid transitions π_{t−1} → π_t), the total complexity of the dynamic program
is O(k(k + 1)2^{k−2} · n), exponential in k (fixed) but linear in n; see [11] for details.
Naive solution. One can solve (6) as follows: for each interval I ⊂ [1, n], pre-compute
the optimal path π_I^* ∈ P_I^k using (4), and then use a straightforward dynamic program-
ming algorithm to compute the optimal concatenation of m such paths to form the
optimal solution. Letting f(k) = k(k + 1)2^{k−2}, the complexity of this algorithm is
O(Σ_{1≤i≤i'≤n} f(k) · (i' − i + 1)) = O(f(k) n(n + 1)(n + 2)/6) for the precomputation
and O(m n(n + 1)/2) for the dynamic program, which totals O(f(k) n³/6). The next
paragraph introduces our joint dynamic programming over scene segmentation and shot
threading, which reduces the computational complexity by a factor of n (the number of shots).
Joint dynamic program over scene breaks and shot threading. We exploit the pres-
ence of overlapping subproblems. We construct a single tour π, walking over the joint
space of shots and scene labels. Our approach is based on the (categorical) product
graph G_n^k × C_m, where G_n^k is the graph from Section 4.2 and C_m is the chain graph of order m.
A node (u, j) ∈ G_n^k × C_m represents the node u ∈ G_n^k in the j-th scene. Given two
connected nodes u = (S, i, t) and u' = (S', i', t + 1) in G_n^k, there are two types of
connections in the product graph. The first type of connection corresponds to shots i and i' both
being in the j-th scene. The second type of connection corresponds to a transition from scene j to scene j + 1;
these transitions only happen when u = (S, i, t) satisfies max(i, max(S)) = t, to make sure the
tour decomposes into a tour of each scene (we can switch to the next scene when the
set of shots visited up to time t is exactly {1, ..., t}).
The solution to (6) similarly uses a dynamic program to find the shortest path in G_n^k ×
C_m (and backtracking to recover the arg max). Since there are m times as many nodes
in the graph as in G_n^k and at most twice as many incoming connections per node (nodes
from the previous scene or from the same scene), the total complexity is
O(2k(k + 1)2^{k−2} m n) = O(2 f(k) m n).
Comparison. We manually labeled shot and scene breaks for a number of movies
and TV series and found that a typical scene contains on average about 11 shots,
i.e., m ≈ n/11. So the reduction in complexity between the naive algorithm and our
joint dynamic program is O( (f(k) n³/6) / (2 f(k) m n) ) = O(n²/(12 m)) ≈ n, which is a huge gain,
especially given typical values of n = 600. The resulting complexity is linear in n and
m and in practice takes about 1 minute as opposed to 11 hours for an entire episode,
given pre-computed shot similarity.
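For concreteness, plugging the typical values quoted above into the two complexity expressions gives (f(k) n³/6) / (2 f(k) m n) = n²/(12 m) ≈ 600²/(12 · 55) ≈ 545 ≈ n, using n = 600 shots and m ≈ n/11 ≈ 55 scenes.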
Fig. 6. Left: pronoun resolution and verb frames obtained from the parsed screenplay narrations.
Right: statistics collected from 24 parsed screenplays (1 season of LOST).
Output verb frames: (Sun - watches - something) (Jin - reaches out - ) (Jin - touches
- chin) (Sun - looks - at Jin) . (Sun - pulls away - ) (Jin - puts down - hand).
We report pronoun resolution accuracy on screenplay narrations of 3 different TV
series (about half a screenplay for each), see table 1.
6 Results
We experimented with our framework on a significant amount of data, composed of TV
series (19 episodes from one season of LOST, several episodes of CSI), one feature-length
movie “The Fifth Element”, and one animation movie “Aladdin”, representing about 20
hours of video at DVD resolution. We report results on scene segmentation/alignment,
character naming and tracking, as well as retrieval of query action verbs.
Shot segmentation. We obtain 97% F-score (harmonic mean of precision and recall)
for shot segmentation, using standard color histogram based methods.
Scene segmentation and alignment. We hand labeled scene boundaries in one episode
of LOST and one episode of CSI based on manual alignment of the frames with the
screenplay. The accuracy for predicting the scene label of each shot was 97% for LOST
and 91% for CSI. The F-score for scene boundary detection was 86% for LOST and
75% for CSI, see figure 7. We used k = 9 for the memory width, a value similar to the
buffer size used in [10] for computing shot coherence. We also analyzed the effect on
performance of the memory width k, and report results with and without alignment to
screenplay in table 2. In comparison, we obtained an F-score of 43% for scene bound-
ary detection using a model based on backward shot coherence [10] uninformed by
screenplay, but optimized over buffer size and non-maximum suppression window size.
Scene content analysis. We manually labeled the scene layout in the same episodes
of LOST and CSI, providing for each shot in a scene its generating shot (including
Table 2. % F-score (first number) for scene boundary detection and % accuracy (second number)
for predicting scene label of shots (on 1 episode of LOST) as a function of the memory width k
used in the TSP, and the prior P (b). The case k = 1 corresponds to no reordering at all. Line 1:
P (b) informed by screenplay; line 2: P (b) uniform; line 3: total computation time.
Fig. 7. Movie at a glance: scene segmentation-alignment and shot reordering for an episode of
LOST (only a portion shown for readability). Scene boundaries are in red, together with the set
of characters appearing in each scene, in blue.
Fig. 8. Character naming using screenplay alignment and shot threading. Top 3 rows: correctly
named faces; bottom row: incorrectly named faces. We detect face tracks in each shot and reorder
them according to the shot threading permutation. Some face tracks are assigned a name prior
based on the alignment between dialogues and mouth motion. We compute a joint assignment of
names to face tracks using an HMM on the reordered face tracks.
the special case when this is a new viewpoint). We obtain a precision/recall of 75%
for predicting the generating parent shot. See figure 5 for a sample of the results on 3
scenes. Note, to obtain longer tracks in figure 5, we recursively applied the memory
limited TSP until convergence (typically a few iterations).
Fig. 9. Top 10 retrieved video snippets for 15 query action verbs: close eyes, grab, kiss, kneel,
open, stand, cry, open door, phone, point, shout, sit, sleep, smile, take breath. Please zoom in to
see screenplay annotation (and its parsing into verb frames for the first 6 verbs).
corresponding to assignments of face tracks to character names. The face tracks are
ordered according to the shot threading permutation, and as a result there are much
fewer changes of character name along this ordering. Following [14], we detect on-
screen speakers as follows: 1) locate mouth for each face track using a mouth detector
based on Viola-Jones, 2) compute a mouth motion score based on the normalized cross
correlation between consecutive windows of the mouth track, averaged over temporal
segments corresponding to speech portions of the screenplay. Finally we label the face
tracks using Viterbi decoding for the Maximum a Posteriori (MAP) assignment (see
website for more details). We computed groundtruth face names for one episode of
LOST and compared our method against the following baseline that does not use shot
reordering: each unlabeled face track (without a detected speaking character on screen)
is labeled using the closest labeled face track in feature space (position of face track and
color histogram). The accuracy over an episode of LOST is 76% for mainly dialogue
scenes and 66% for the entire episode, as evaluated against groundtruth. The baseline
model based on nearest neighbor performs at 43% and 39%, respectively.
Retrieval of actions in videos. We consider a query-by-action verb retrieval task for
15 query verbs across 10 episodes of LOST, see figure 9. The screenplay is parsed into
verb frames (subject-verb-object) with pronoun resolution, as discussed earlier. Each
verb frame is assigned a temporal interval based on time-stamped intervening dialogues
and tightened with nearby shot/scene boundaries. Queries are further refined to match
the subject of the verb frame with a named character face. We report retrieval results
as follows: for each of the following action verbs, we measure the number of times
(out of 10) the retrieved video snippet correctly shows the actor on screen performing
the action (we penalize for wrong naming): close eyes (9/10), grab (9/10), kiss (8/10),
kneel (9/10), open (9/10), stand (9/10), cry (9/10), open door (10/10), phone (10/10),
point (10/10), shout (7/10), sit (10/10), sleep (8/10), smile (9/10), take breath (9/10).
The average is 90/100. Two additional queries are shown in figure 1 along with the de-
tected and identified characters. We created a large dataset of retrieved action sequences
combined with character naming for improved temporal and spatial localization, see
www.seas.upenn.edu/~timothee for results and Matlab code.
7 Conclusion
In this work we have addressed basic elements of movie structure: hierarchy of scenes
and shots and continuity of shot threads. We believe that this structure can be useful
for many intelligent movie manipulation tasks, such as semantic retrieval and indexing,
browsing by character or object, re-editing and many more. We plan to extend our work
to provide more fine-grained alignment of movies and screenplay, using coarse scene
geometry, gaze and pose estimation.
References
1. Huang, G., Jain, V., Learned-Miller, E.: Unsupervised joint alignment of complex images.
In: International Conference on Computer Vision, pp. 1–8 (2007)
2. Ramanan, D., Baker, S., Kakade, S.: Leveraging archival video for building face datasets. In:
International Conference on Computer Vision, pp. 1–8 (2007)
3. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from
movies. In: IEEE Conference on Computer Vision and Pattern Recognition (2008),
http://lear.inrialpes.fr/pubs/2008/LMSR08
4. Sivic, J., Everingham, M., Zisserman, A.: Person spotting: video shot retrieval for face sets.
In: Leow, W.-K., Lew, M., Chua, T.-S., Ma, W.-Y., Chaisorn, L., Bakker, E.M. (eds.) CIVR
2005. LNCS, vol. 3568, Springer, Heidelberg (2005)
5. Everingham, M., Sivic, J., Zisserman, A.: Hello! My name is... Buffy – automatic naming of
characters in TV video. In: Proceedings of the British Machine Vision Conference (2006)
6. Lienhart, R.: Reliable transition detection in videos: A survey and practitioner’s guide. Int.
Journal of Image and Graphics (2001)
7. Ngo, C.-W., Pong, T.C., Zhang, H.J.: Recent advances in content-based video analysis. In-
ternational Journal of Image and Graphics 1, 445–468 (2001)
8. Zhai, Y., Shah, M.: Video scene segmentation using Markov chain Monte Carlo. IEEE Trans-
actions on Multimedia 8, 686–697 (2006)
9. Yeung, M., Yeo, B.L., Liu, B.: Segmentation of video by clustering and graph analysis.
Comp. Vision Image Understanding (1998)
10. Kender, J., Yeo, B.: Video scene segmentation via continuous video coherence. In: IEEE
Conference on Computer Vision and Pattern Recognition (1998)
11. Balas, E., Simonetti, N.: Linear time dynamic programming algorithms for new classes of
restricted tsps: A computational study. INFORMS Journal on Computing 13, 56–75 (2001)
12. Myers, C.S., Rabiner, L.R.: A comparative study of several dynamic time-warping algorithms
for connected word recognition. The Bell System Technical Journal (1981)
13. Viola, P.A., Jones, M.J.: Robust real-time face detection. International Journal of Computer
Vision 57, 137–154 (2004)
14. Everingham, M.R., Sivic, J., Zisserman, A.: Hello! My name is... Buffy: automatic naming of
characters in TV video. In: BMVC, vol. III, p. 899 (2006)
Using 3D Line Segments for Robust and Efficient
Change Detection from Multiple Noisy Images
I. Eden and D.B. Cooper
Division of Engineering
Brown University
Providence, RI, USA
{ieden,cooper}@lems.brown.edu
1 Introduction
Fig. 1. Our line segment based change detection result after training on a sequence of 5 images.
(A) A sample training image. (B) The test image. (C) Hand-marked ground truth for change
where the new object is shown in “red” and the disappeared object is shown in “blue”. (D) Result
of our method. Lines associated with the new object are shown in “red” and lines associated with
the disappeared object are shown in “blue”. Two major change regions are detected with only a
few false alarms due to specular highlights and object shadows. (This is a color image)
is that images can be taken at arbitrary times, under arbitrary lighting conditions, and
from arbitrary viewpoints. Furthermore, they are usually single images and not video.
For example, if they are taken from a flying aircraft, a 3D point in the scene is usually
seen in one image and not in the immediately preceding or succeeding images, and is
not seen again until the aircraft returns at some later time or until some other aircraft or
satellite or moving camera on the ground sees the point at some later time.
In this paper, we assume n images are taken of a scene, and we then look for a change
in the n + 1st image, and if one has occurred we try to explain its type (resulting from
the arrival or from the departure of a 3D object). The learning is done in an unsupervised
mode. We do not restrict ourselves to the case of buildings where the 3D lines are long,
easy to detect, easy to estimate and are modest in number. Rather, we are interested
in the case of many short lines where the lines can be portions of long curves or can
be short straight line segments associated with complicated 3D objects, e.g., vehicles,
scenes of damaged urban-scapes, natural structures, people, etc.
Why do we restrict this study to straight lines? We could deal with curves, but since
curves can be decomposed into straight lines, and since straight lines – especially short
line segments - appear extensively in 3D scenes and in images, we decided to start with
those. The important thing is that estimating 3D structure and the associated BRDF
can often be done in theory, but this is usually difficult to do computationally. On the
other hand, estimating 3D line segments is much more tractable and can be considered
as a system in its own right or as contributing to applications that require efficient 3D
structure estimation.
Our paper consists of the following. Given n images, we estimate all 3D lines that
appear in three or more images. Our approach to 3D line estimation emphasizes com-
putational speed and accuracy. For very short lines, accuracy is greatly improved by
making use of incidence relations among the lines. For change detection we look for
the appearance or disappearance of one or more line segments in the n + 1st image.
This procedure depends on the camera position of the new image and the set of re-
constructed 3D line segments from the learning period, and therefore requires an interpretation of
whether a line is not seen because of self-occlusion within the scene or because of a 3D
change. Usually, but not always, if an existing 3D line should be visible in the n + 1st
image and is not, the reason is because of occlusion by the arrival of a new object or
departure of an existing object. If a new object arrives, there will usually be new lines
that appear because of it, but it is possible that no new straight lines appear. Hence,
detecting and interpreting change, if it occurs, based on straight line segments is not
clear cut, and we deal with that problem in this paper.
2 Related Work
Some of the earlier work on change detection focuses on image sequences taken from
stationary cameras. The main drawback of these methods is their tendency to cre-
ate false alarms in cases where pixel values are affected by viewpoint, illumination,
seasonal and atmospheric changes. This is the reason why pixel (intensity) and block
(histogram) based change detection algorithms such as image differencing [1,2] and
background modeling methods [3] fail in some applications.
Meanwhile, there exist change detection methods designed for non-stationary image
sequences. There has been a lot of work in the literature on methods based on detect-
ing moving objects [4,5], but these methods assume one or more moving objects in
a continuous video sequence. On the other hand, 3D voxel based methods [6] where
distributions of surface occupancy and associated BRDF are stored in each voxel can
manage complex and changing surfaces, but these methods suffer from sudden illumi-
nation changes and perform poorly around specular highlights and object boundaries.
To our knowledge, line segment based change detection methods have rarely been
studied in the computer vision literature. Rowe and Grewe [7] make use of 2D line seg-
ments in their algorithm, but their method is specifically designed for aerial images
where the images can be registered using an affine transformation. Li et al. [8] provided
a method of detecting urban changes from a pair of satellite images by identifying
changed line segments over time. Their method does not estimate the 3D geometry
associated with the line segments and takes a pair of satellite (aerial) images as input
where line matching can be done by estimating the homography between the two im-
ages. The change detection method we propose in this work is more generic; it can
work on non-sequential image sequences where the viewpoint can change drastically
between pairs of images and it is not based on any prior assumptions on the set of
training images.
Line segment matching over multiple images is known to be a difficult problem due to
its exponential complexity requirement and challenging inputs. As a result of imper-
fections in edge detection and line fitting algorithms, lines are fragmented into small
segments that diverge from the original line segments. When unreliable endpoints and
topological relationships are given as inputs, exponential complexity search algorithms
may fail to produce exact segment matching.
In this section, we present a generic, reliable and efficient method for multi-view
line matching and reconstruction. Although our method is also suitable for small base-
line problems (e.g. aerial images, continuous video sequences), such cases are not our
primary focus as their line ordering along the epipolar direction does not change much
and they can be solved efficiently by using planar homographies. In this paper, we fo-
cus on large baseline matching and reconstruction problems, where sudden illumination
changes and specular highlights make it more difficult to obtain consistent line segments
in images of the same scene. These problems are more challenging as the line ordering
in different images changes due to differences in viewing angles. The following subsec-
tions describe three steps of our 3D line segment reconstruction method: an efficient
line segment matching algorithm, reconstruction of single 3D line segments, and re-
construction of free form wire-frame structures.
In general, the line matching problem is known to be exponential in the number of im-
ages. That is to say, given there are n images of the same scene and approximately m
lines in each image, the total complexity of the line matching problem (the size of the
search space) is O(m^n). One way to reduce the combinatorial expansion of the match-
ing problem is to use the epipolar beam [9,10]. Given the line l = (x_1, x_2) in I, the
corresponding line in I' should lie between l_1' = F x_1 and l_2' = F x_2, where F is the
fundamental matrix between I and I' (see figure 2). While the epipolar beam reduces
the combinatorial expansion of the matching algorithm, this reduction highly depends
on the general alignment of line segments relative to epipolar lines. Plane sweep meth-
ods are also used to avoid the combinatorial expansion of the matching problem [11],
but these methods do not perform well when the endpoints of 2D line segments are not
consistent in different images of the same scene. Another way to increase the matching
efficiency is to use color histogram based feature descriptors for 2D line segments [12],
but these methods assume that colors only undergo slight changes and the data does not
contain specular highlights. Our work focuses on more challenging real world problems
where the above assumptions do not hold.
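The beam test itself reduces to a sign check in homogeneous coordinates: a point p lies on the epipolar line of some point of the segment (x_1, x_2) exactly when (F x_1) · p and (F x_2) · p have opposite signs. The sketch below is an illustrative implementation that, as a simplification, only tests the endpoints of each candidate segment.

# Hedged sketch of epipolar-beam candidate filtering.
import numpy as np

def to_h(p):
    return np.array([p[0], p[1], 1.0])

def in_beam(p, x1, x2, F):
    l1, l2 = F @ to_h(x1), F @ to_h(x2)      # epipolar lines of the endpoints
    return (l1 @ to_h(p)) * (l2 @ to_h(p)) <= 0.0

def candidates_in_beam(segments, x1, x2, F):
    # segments: list of (p, q) endpoint pairs in the second image.
    return [s for s in segments
            if in_beam(s[0], x1, x2, F) or in_beam(s[1], x1, x2, F)]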
In this paper, we propose a new method that improves the multi-view line matching
efficiency. Our method is based on the assumption that the 3D region of interest (ROI)
Fig. 2. An example of epipolar beam. (A) I1 : “selected line” for matching is shown in “red”.
(B) I2 : the epipolar beam associated with the selected line in I1 is marked by “blue” and line
segments that lie inside the epipolar beam (i.e., candidates for matching) are shown in “red”. The
epipolar beam in image (B) reduces the search space by a factor of 5.55. (This is a color image)
is approximately known, however this assumption is not a limitation for most multi-
view applications, since the 3D ROI can be obtained by intersecting the viewing cones
of the input images. The basic idea of our approach is to divide the 3D ROI into smaller
cubes, and solve the matching problem for the line segments that lie inside each cube.
The matching algorithm iteratively projects each cube into the set of training images,
and extracts the set of 2D line segments in each image that lie (completely or partially)
inside the convex polygon associated with the 3D cube.
Assuming that lines are distributed over the cubes homogeneously, the estimated num-
ber of lines inside each cube is m/C, where C is the total number of cubes, and the
algorithmic complexity of the introduced matching problem is O(m^n / C^n). It must be noted
that, under the assumption of a homogeneous distribution of line segments in the 3D ROI,
the total matching complexity is reduced by a factor of 1/C^n, so the efficiency gain
is exponential. On the other hand, even in the presence of some dispersion over multi-
ple cubes, our proposed algorithm substantially reduces the computational complexity
of the matching algorithm. Figure 3 illustrates the quantitative comparison of different
matching algorithms for 4 images of the same scene. The matching method we use in
this work is a mixture of the algorithm described above and the epipolar beam method
(EB+Cubes).
Fig. 3. Quantitative comparison of four different line segment matching algorithms using four im-
ages of the same scene. Brute-force: the simplest matching method, all combinations are checked
over all images. EB: the epipolar beam is used to reduce the search space. Cubes: the 3D space is
split into smaller sections and matching is done for each section separately. EB+Cubes: (the
method we use in this work) a combination of “Cubes” and “EB”. It is shown that “Cubes+EB”
method outperforms individual “EB” and “Cubes” methods and Brute-Force search. Notice that
the size of the search space is given in logarithmic scale.
where l_i = (M_i X_1) × (M_i X_2) is the projection of L into the i-th image as an infinite line,
l̄_i = (M_i X_1, M_i X_2) is the corresponding projection as a line segment, M_i is the projection matrix
for the i-th image, d_l is the distance metric between an infinite line and a line segment, and d_s
is the distance metric between two line segments. The distance metrics d_l and d_s are
defined as
d_l(l, l') = (1/|l|) Σ_{p∈l} d_p(p, l')²
d_s(l, l') = (1/|l|) Σ_{p∈l} d_ps(p, l')² + (1/|l'|) Σ_{p'∈l'} d_ps(p', l)²
where d_p(p, l) is the perpendicular distance of a point p to an infinite 2D line and
d_ps(p, l) is the distance of a point to a line segment.
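A hedged sketch of d_l and d_s follows, approximating the per-pixel sums by uniform samples along each segment; the sampling density is an arbitrary choice.

# Hedged sketch of the line/segment distances d_l and d_s.
import numpy as np

def sample(seg, n=20):
    p, q = np.asarray(seg[0], float), np.asarray(seg[1], float)
    t = np.linspace(0.0, 1.0, n)[:, None]
    return p + t * (q - p)

def d_point_line(p, line):
    a, b, c = line                           # infinite line a*x + b*y + c = 0
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def d_point_segment(p, seg):
    a, b = np.asarray(seg[0], float), np.asarray(seg[1], float)
    t = np.clip(np.dot(p - a, b - a) / (np.dot(b - a, b - a) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * (b - a)))

def d_l(seg, line):
    return float(np.mean([d_point_line(p, line) ** 2 for p in sample(seg)]))

def d_s(seg1, seg2):
    return (float(np.mean([d_point_segment(p, seg2) ** 2 for p in sample(seg1)]))
            + float(np.mean([d_point_segment(p, seg1) ** 2 for p in sample(seg2)])))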
Note that β in equation 1 is used to control the convergence of the local search
algorithm. β is typically selected to be a number close to zero (0 < β ≪ 1), so
the first part of the objective function dominates the local search until the algorithm
converges to the correct infinite line. Later, the second part (weighted by β) of the
objective function starts to dominate the local search algorithm in order to find the
optimal end points for the 3D line segment.
where N_V is the number of vertices in a wire-frame model and l_i^e is the line in the i-th
image associated with edge e ∈ E in the graph G = (V, E).
Note that the wire-frame model introduces more constraints; therefore the sum-of-
squares error of the individual line segment reconstruction is always smaller than the
error of the wire-frame model reconstruction. On the other hand, additional constraints
provide better estimation of 3D line segments in terms of the actual 3D geometry. The
result of 3D line segment reconstruction using wire-frame models is given in figure 4.
4 Change Detection
Our change detection method is based on the appearance and disappearance of line
segments throughout an image sequence. We compare the geometry of lines rather than
the gray levels of image pixels to avoid the computationally intensive, and sometimes
impossible, tasks of estimating 3D surfaces and their associated BRDFs in the model-
building stage. Estimating 3D lines is computationally much less costly and is more
Fig. 4. Reconstruction results for 3D line segments of a Jeep. (A)-(B) The Jeep is shown from two
different views. (C) Reconstruction results of single line segment reconstruction as explained in
section 3.2. (D) Reconstruction results of the wire-frame model reconstruction as proposed in
section 3.3.
The definition of the general change detection problem is the following: A 3D sur-
face model and BRDF are estimated for a region from a number of images taken by
calibrated cameras in various positions at distinctly different times. This permits the
prediction of the appearance of the entire region from a camera in an arbitrary new
position. When a new image is taken by a calibrated camera from a new position, the
computer must make a decision as to whether a new object has appeared in the region or
the image is of the same region [17]. In our change detection problem, which differs
from the preceding, the algorithm must decide whether a new object has appeared in
the region or whether an object in the region has left. We predict the 3D model which
consists only of long and short straight lines, since estimating the complete 3D sur-
face under varying illumination conditions and in the existence of specular highlights
is often impractical. Moreover, image edges resulting from reflectance edges or from
3D surface ridges are less sensitive to image view direction and surface illumination,
hence so are the 3D curves associated with these edges. For man-made objects and for
general 3D curves, straight line approximations are usually appropriate and effective.
Our method detects changes by interpreting reconstructed 3D line segments and 2D line
segments detected in training and test images.
Our change detection method assigns a “state” to each 2D line segment in the test image
and each reconstructed 3D line segment from the training images. These states are:
Change Detection for 2D Lines in the Test Image. We estimate the “state” of each 2D
line segment in the new image using two statistical tests T1 and T2 . The classification
scheme for the 2D case is given in figure 5.
Fig. 5. General scheme for line segment based change detection. (Left) Change detection for 2D
line segment in the test image. (Right) Change detection for reconstructed 3D line segments from
training images. Threshold values t1 , t2 , t3 and t4 are selected to produce the desired change
detection rate.
First we apply T_1 to test how well a 2D line segment in I_{n+1} fits the 3D model
W_n. Basically, T_1 is the distance of the 2D line segment in the new image to the closest
projection into In+1 of the 3D line segment in the 3D model.
T_1 = min_{L∈W_n} d_s(l, l_{n+1})
The second step is, if necessary, to apply T_2 to test whether there exists a 2D line segment
in one of the past images that was taken from a similar viewing direction. Let us define
the index of the image in the training set that has the closest camera projection matrix
compared to the test image as c∗ .
Here we assume that, for large training sets and normalized camera matrices, there exists
a camera close to the one that took the test image.
Here l_{n+1} is the projection of L into the (n + 1)st image.
The second step is, if necessary, to apply T4 to test if there is an existing 3D line
segment in Wn that occludes the current 3D line L.
T_4 = min_{G∈W_n∖{L}} d_s(g_{n+1}, l_{n+1}) · Z(M_{n+1}, L, G)
Here g_{n+1} is the projection of G into I_{n+1} as a line segment, and Z(M_{n+1}, L, G) returns
1 if both endpoints of G are closer to the camera center of M_{n+1} than both endpoints
of L; otherwise it returns ∞.
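Since the full list of states is not reproduced above, the sketch below uses placeholder state names; it only illustrates how the thresholded tests chain together for a 2D segment of the test image, with the segment distance d_s, the projection function, and the thresholds supplied as assumptions.

# Hedged sketch of the 2D-segment classification of Fig. 5 (left).
def classify_2d_segment(l, model_lines_3d, project, d_s, past_segments, t1, t2):
    # model_lines_3d: reconstructed 3D segments W_n; project(L) projects L into
    # I_{n+1}; past_segments: 2D segments of the training image with the
    # closest camera (index c*). State names are placeholders.
    T1 = min(d_s(l, project(L)) for L in model_lines_3d)
    if T1 < t1:
        return 'explained_by_model'
    T2 = min(d_s(l, g) for g in past_segments)
    return 'previously_seen' if T2 < t2 else 'new_line'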
5 Experimental Results
In this section, we present the results of our change detection algorithm in three dif-
ferent image sequences (experiments). It is assumed that the scene geometry does not
change during the training period. The reconstruction of the 3D model and the change
detection is done using the methods explained in section 3 and section 4. The aim of
each experiment is to show that our change detection method successfully detects the
changes and their types in test images in which the scene geometry is significantly
different from that of the training images.
The first sequence is a collection of 5 training images and 1 test image, all of which
are taken in a two hour interval. The result of this experiment is shown in figure 1. The
test image is taken at a different time of the day and from a different camera position;
hence the illumination and viewing direction are significantly different from those of
the training images. There are two important changes that have taken place in this test
image. The first is the disappearance of a vehicle that was previously parked in the
training images. The second change is the appearance of a vehicle close to the empty
parking spot. Both major changes (and their types) are detected accurately with a low
false alarm rate, and the main regions of change have been successfully predicted.
Notice that small “new lines” (shown in “red”) in the front row of the parking place are
due to the specular highlights that did not exist in the set of training images. There is
significant illumination difference between the test and training images, since the test
image is taken a few hours after the training images. The red line on the ground of the
empty parking spot is due to the shadow of another car. Geometrically, that line was
occluded by the car that has left the scene, so it can be thought of as an existing line which
did not show up in the training images due to self-occlusion. The result of this experiment
shows that our method is robust to severe illumination and viewpoint changes.
Fig. 6. Change detection results for an urban area after training on a sequence of 20 images. (A)
A sample training image. (B) The test image. (C) Hand-marked ground truth for change where
the new objects are labeled with “red” and the removed objects are labeled with “blue”. (D) Line
segment based change detection results in which “new” lines are shown in “red” and “removed”
lines are shown in “blue”. Our method detects the permanent change regions successfully and
also recognizes a moving vehicle as an instance of temporal change. (This is a color image)
Unlike the first image sequence, the second and third image sequences do not have
significant viewpoint and illumination changes between the training and test images.
The result of the second experiment is shown in figure 6. To test our method, we man-
ually created a few changes using image manipulation tools. We removed the town bell
and added another chimney to a building. These change areas are marked with blue and
red respectively in figure 6-C. Also the building at the bottom right corner of the test
image is removed from the scene. This change is also marked with blue in figure 6-C.
These major changes are successfully detected in this experiment. Also, despite the
relatively small sizes of cars in this dataset, the change related to a moving vehicle
(temporal change) is detected successfully.
The results of the third experiment are shown in figure 7. In this experiment, two new vehicles appear on the road, and these changes are detected successfully. Similarly,
we created manual changes in the scene geometry using image manipulation tools. We
removed a few objects that existed on the terrace of the building in the bottom right part
of the test image. These changes are also detected successfully by our change detection
algorithm.
We also applied the Grimson change detection algorithm [3] to ground-registered images for all sequences (see figure 8). Our 3D-geometry-based change detection method
performs well for the first image sequence under significant viewpoint and illumination
Fig. 7. Change detection results for an urban area after training on a sequence of 10 images. (A)
A sample training image. (B) The test image. (C) Hand-marked ground truth for change where
the new objects are labeled with “red” and the removed objects are labeled with “blue”. (D) Line
segment based change detection results in which “new” lines are shown in “red” and “removed”
lines are shown in “blue”. Our method detects the permanent change regions successfully and
also recognizes two moving vehicles as instances of temporal change. (This is a color image)
Fig. 8. Results of Grimson change detection algorithm applied to ground registered images. (A-B)
Ground truth change and the result of the Grimson algorithm for the first sequence. (C-D) Ground
truth change and the result of the Grimson algorithm for the second sequence. (E-F) Ground truth
change and the result of the Grimson algorithm for the third sequence. (This is a color image)
differences between the training and test images. On the other hand, the Grimson method fails to detect changes reliably, due to the viewpoint and illumination changes and the presence of specular highlights. For the second and third image sequences, the
Grimson method successfully detects changes, except for minor false alarms caused by the small viewpoint change between the training and test images.
To summarize our experimental results, we have shown that the significant change regions are successfully detected in all three experiments. The first experimental setup shows that our method can detect changes regardless of changing viewpoint and illumination conditions. However, there are a few cases of minor false alarms, possibly caused by shadows, specular highlights, and the lack of 3D geometry for newly exposed line segments in the test image. Also, it must be noted that, unlike other change detection methods, our method successfully identifies the type of each major change in almost all experiments.
Acknowledgments. This work was funded, in part, by the Lockheed Martin Corpora-
tion. We are grateful to Joseph L. Mundy for many insightful discussions on this work.
We also thank Ece Kamar for her helpful comments on earlier drafts of this paper.
References
1. Bruzzone, L., Prieto, D.F.: Automatic analysis of the difference image for unsupervised
change detection. IEEE Transactions on Geoscience and Remote Sensing 38(3), 1171–1182
(2000)
2. Bruzzone, L., Prieto, D.F.: An adaptive semiparametric and context-based approach to un-
supervised change detection in multitemporal remote-sensing images. IEEE Transactions on
Image Processing 11(4), 452–466 (2002)
3. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking.
In: CVPR, pp. 246–252 (1999)
4. Yalcin, H., Hebert, M., Collins, R.T., Black, M.J.: A flow-based approach to vehicle detection
and background mosaicking in airborne video. In: CVPR, vol. II, p. 1202 (2005)
5. Broadhurst, A., Drummond, T., Cipolla, R.: A probabilistic framework for space carving. In:
ICCV, pp. 388–393 (2001)
6. Pollard, T., Mundy, J.L.: Change detection in a 3-d world. In: CVPR, pp. 1–6 (2007)
7. Rowe, N.C., Grewe, L.L.: Change detection for linear features in aerial photographs us-
ing edge-finding. IEEE Transactions on Geoscience and Remote Sensing 39(7), 1608–1612
(2001)
8. Li, W., Li, X., Wu, Y., Hu, Z.: A novel framework for urban change detection using VHR
satellite images. In: ICPR, pp. 312–315 (2006)
9. Schmid, C., Zisserman, A.: The geometry and matching of lines and curves over multiple
views. International Journal of Computer Vision 40(3), 199–233 (2000)
10. Heuel, S., Förstner, W.: Matching, reconstructing and grouping 3D lines from multiple views
using uncertain projective geometry. In: CVPR, pp. 517–524 (2001)
11. Taillandier, F., Deriche, R.: Reconstruction of 3D linear primitives from multiple views for
urban areas modelisation. In: Photogrammetric Computer Vision, vol. B, p. 267 (2002)
12. Bay, H., Ferrari, V., Gool, L.J.V.: Wide-baseline stereo matching with line segments. In:
CVPR, vol. I, pp. 329–336 (2005)
13. Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer Journal 7,
308–313 (1965)
14. Taylor, C.J., Kriegman, D.J.: Structure and motion from line segments in multiple images.
IEEE Transactions on Pattern Analysis and Machine Intelligence 17(11), 1021–1032 (1995)
15. Moons, T., Frère, D., Vandekerckhove, J., Gool, L.V.: Automatic modelling and 3d recon-
struction of urban house roofs from high resolution aerial imagery. In: Burkhardt, H.-J., Neu-
mann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 410–425. Springer, Heidelberg (1998)
16. Baillard, C., Schmid, C., Zisserman, A., Fitzgibbon, A.W.: Automatic line matching and 3D
reconstruction of buildings from multiple views. In: ISPRS Congress, pp. 69–80 (1999)
17. Radke, R.J., Andra, S., Al-Kofahi, O., Roysam, B.: Image change detection algorithms: a
systematic survey. IEEE Transactions on Image Processing 14(3), 294–307 (2005)
Action Recognition with a Bio-inspired
Feedforward Motion Processing Model: The
Richness of Center-Surround Interactions
1 Introduction
brain areas (e.g., the form pathway) and mechanisms (e.g., top-down attentional
mechanisms) are also involved to analyze complex general scenes.
Among recent bio-inspired approaches for AR, [13] proposed a model for the
visual processing in the dorsal (motion) and ventral (form) pathways. They
validated their model in the AR task using stick figures constructed from real
sequences. More recently, [14] proposed a feedforward architecture, which can
be seen as an extension of [15]. In [14], the authors mapped their model to the
cortical architecture, essentially V1 (with simple and complex cells). The only clearly bio-inspired parts are one of the models for S1 units and the pooling aspect. The use of spatio-temporal chunks also seems to be supported, but the authors never claim any biological relevance for the corresponding subsequent processing stages (from S2 to C3). The max operator is also controversial and not well supported by neurophysiology, mainly because it does not allow feedback.
In this article, we follow the same objective as in [14], which is to propose a
bio-inspired model of motion processing for AR in real sequences. Our model
will be a connection-based network, in which a large number of neuron-like processing units operate in parallel. Each unit will have an 'activation level' (membrane potential) that represents the strength of a particular feature in the environment. Here, our main contribution will be to better account for
the visual system properties, and in particular, at MT layer level: We repro-
duce part of the variety of center-surround interactions [16,17]. Then, in order
to prove the relevance of this extended motion description, we will show its ben-
efits on the AR application, and compare our results with the ones obtained
by [14].
This article presents the model described in Fig. 1 and it is organized as fol-
lows. Section 2 presents the core of the approach which is a biologically-inspired
model of motion estimation, based on a feedforward architecture. As we pre-
viously mentioned, the aim of this article is to show how a bio-inspired model
can be used in a real application such as AR. Note that we also studied some
low-level properties of the model concerning motion processing [18] but those
studies are out of the scope of this article. The first stage (Section 2.1) is the
local motion extraction corresponding to the V1 layer, with a discrete foveated
organization. The output of this layer is fed to the MT layer (Section 2.2), which
is composed of a set of neurons whose dynamics are defined by a conductance-
based neuron model. We define the connectivity between V1 and MT layers
according to neurophysiology, which defines the center-surround interactions of
an MT neuron. The output of the MT layer is a set of neuron membrane po-
tentials, whose values indicate the presence of a certain velocity or contrasts of
velocities. Then, in Section 3, we consider the problem of AR based on the MT
layer activity. In this section we also present the experimental protocol, some
validations and a comparison with the approach presented by [14]. Interestingly,
we show how the variety of surround-interactions in MT cells found in physi-
ology allows the improvement of the recognition performances. We conclude in
Section 4.
Fig. 1. Block diagram showing the different steps of our approach from the input
image sequence as stimulus until the motion map encoding the motion pattern. (a)
We use a real video sequence as input; the input sequences are preprocessed for contrast normalization and centered moving stimuli. To compute the motion
map representing the input image we consider a sliding temporal window of length Δt.
(b) Directional-selectivity filters are applied over each frame of the input sequence in
a log-polar distribution grid obtaining the activity of each V1 cell. (c) V1 outputs feed
the MT cells which integrate the information in space and time. (d) The motion map
is constructed calculating the mean activation of MT cells inside the sliding temporal
window. The motion map has a length of NL × Nc elements, where NL is the number
of MT layers of cells and Nc is the number of MT cells per layer. This motion map
characterizes and codes the action stimulus.
Simple Cells are characterized by linear receptive fields where the neuron re-
sponse is a weighted linear combination of the input stimulus inside its receptive
field. By combining two simple cells in a linear manner it is possible to get
direction-selective neurons.
Direction selectivity (DS) refers to the property of a neuron to respond to the direction of motion of a stimulus. The way to model this selectivity is to
obtain receptive fields oriented in space and time (Fig. 1 (b.1)). Let us consider
two spatio-temporally oriented simple cells, F^a_{θ,f} and F^b_{θ,f}, spatially oriented in the direction θ and tuned to the spatio-temporal frequency f = (ξ̄, ω̄), where ξ̄ and ω̄ are the spatial and temporal maximal responses, respectively:

F^a_{θ,f}(x, y, t) = F^odd_θ(x, y) H_fast(t) − F^even_θ(x, y) H_slow(t),
F^b_{θ,f}(x, y, t) = F^odd_θ(x, y) H_slow(t) + F^even_θ(x, y) H_fast(t).   (1)
The spatial parts F^odd_θ(x, y) and F^even_θ(x, y) of each constituent simple cell are formed using the first and second derivatives of a Gabor function spatially oriented in θ. The temporal contributions H_fast(t) and H_slow(t) are defined by

H_fast(t) = T_{3,τ}(t) − T_{5,τ}(t)  and  H_slow(t) = T_{5,τ}(t) − T_{7,τ}(t),   (2)

where T_{η,τ}(t) is a Gamma function defined by T_{η,τ}(t) = (t^η / (τ^{η+1} η!)) exp(−t/τ), which models the series of synaptic and cellular delays in signal transmission from retinal photoreceptors to V1 afferents, serving as a plausible approximation of biological findings [28].
Note that the causality of H_fast(t) and H_slow(t) yields a more realistic model than the one proposed by [22] (see also [14]), where the Gaussian used as temporal profile is non-causal and inconsistent with V1 physiology.
A frequency analysis is required for a proper design of our filter bank. For a given speed, each filter covers a specified region of the spatio-temporal frequency
domain. The quotient between the highest temporal frequency activation (ω̄) and the highest spatial frequency (ξ̄) is the speed of the filter, so the filter will be able to detect the motion of a stimulus whose spatial frequency lies inside the energy spectrum of the filter. To tile all the space in a homogeneous way, it is necessary to take more than one filter for the same spatio-temporal frequency orientation (Fig. 1 (b.2)).
Complex Cells are also direction-selective neurons; however, they exhibit characteristics that cannot be explained by a linear combination of the input stimulus. The complex-cell property that we want to keep in this model is the invariance to contrast polarity.
Based on [26], we define the ith V1 complex cell, located at x_i = (x_i, y_i), with spatial orientation θ_i and spatio-temporal orientation f_i = (ξ̄_i, ω̄_i), as

C_{x_i,θ_i,f_i}(t) = ((F^a_{θ_i,f_i} ∗ I)(x_i, t))^2 + ((F^b_{θ_i,f_i} ∗ I)(x_i, t))^2,   (3)
where the symbol ∗ represents the spatio-temporal convolution between the sim-
ple cells defined in (1) and the input sequence I(x, t). With this definition, the
cell response is independent of stimulus contrast sign and constant in time for a
drifting grating as input stimulus.
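The motion-energy computation of (1)-(3) can be sketched as follows, assuming grayscale frames are stacked into a 3D array and that the odd/even spatial filters and fast/slow temporal profiles have already been sampled on a grid; the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def complex_cell_response(I, F_odd, F_even, H_fast, H_slow):
    """Motion-energy sketch of Eqs. (1)-(3): build the two spatio-temporal
    simple cells by combining spatial Gabor derivatives with fast/slow
    temporal profiles, then sum their squared responses."""
    # Outer products give separable spatio-temporal kernels of shape (h, w, T).
    Fa = F_odd[..., None] * H_fast[None, None, :] - F_even[..., None] * H_slow[None, None, :]
    Fb = F_odd[..., None] * H_slow[None, None, :] + F_even[..., None] * H_fast[None, None, :]
    ra = convolve(I, Fa, mode='nearest')   # (F^a * I)(x, t)
    rb = convolve(I, Fb, mode='nearest')   # (F^b * I)(x, t)
    return ra ** 2 + rb ** 2               # contrast-polarity-invariant energy
```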
Finally, it is well known in biology that the V1 output shows several nonlinear-
ities due to: response saturation, response rectification, or contrast gain control
[29]. In order to obtain a nonlinear saturation in the V1 response, the V1 output
is passed through a sigmoid function S(·), where the respective parameters were
tuned to have a suitable response in the case of drifting gratings as inputs. So,
finally the V1 output will be given by

r^{V1}_{x_i,θ_i,f_i}(t) = S(C_{x_i,θ_i,f_i}(t)).   (4)

The V1 responses feed a layer of MT cells whose dynamics are defined by a conductance-based neuron model: the membrane potential u^{MT}_i(t) of the ith MT cell evolves according to

du^{MT}_i(t)/dt = G^{exc}_i(t)(E^{exc} − u^{MT}_i(t)) + G^{inh}_i(t)(E^{inh} − u^{MT}_i(t)) + g^L(E^L − u^{MT}_i(t)),   (5)

where E^{exc}, E^{inh} and E^L are constants with typical values of 70 mV, −10 mV and 0 mV, respectively. According to (5), u^{MT}_i(t) will belong to the interval [E^{inh}, E^{exc}] and will be driven by several influences. The first term refers to input pre-synaptic neurons and pushes the membrane potential u^{MT}_i(t) towards E^{exc}, with a strength defined by G^{exc}_i(t). Similarly, the second term, also coming from pre-synaptic neurons, drives u^{MT}_i(t) towards E^{inh} with a strength G^{inh}_i(t). Finally, the last term drives u^{MT}_i(t) towards the resting potential E^L with a constant strength given by g^L.
The conductances of each MT cell are computed from the responses of the V1 neurons connected to it (Fig. 1). Each MT cell has a receptive field built from
the convergence of pre-synaptic afferent V1 complex cells (Fig. 1 (c.1)). The excitatory inputs forming G^{exc}_i(t) are related to the activation of the classical receptive field (CRF) of the MT cell, whereas the G^{inh}_i(t) afferents are the cells forming the surround interactions that may or may not modulate the response of the CRF [16,17] (Fig. 1 (c.2)). The surround does not elicit responses by itself; it needs the CRF activation to be considered. Accordingly, the total input conductances G^{exc}_i(t) and G^{inh}_i(t) of the post-synaptic neuron i are defined by

G^{exc}_i(t) = max(0, Σ_{j∈Ω_i} w_{ij} r^{V1}_j − Σ_{j∈Ω̄_i} w_{ij} r^{V1}_j),   G^{inh}_i(t) = Σ_{j∈Φ_i} w_{ij} r^{V1}_j,   (6)

where Ω_i = {j ∈ CRF | φ_ij < π/2}, Ω̄_i = {j ∈ CRF | φ_ij > π/2} and Φ_i = {j ∈ Surround | φ_ij < π/2}, and where the connection weight w_{ij} is the efficacy of the synapse from neuron j to neuron i, which is proportional to the angle φ_ij between the preferred motion directions of the V1 and MT cells. It is important to remark that the values of the conductances are always greater than or equal to zero, and that their positive or negative contribution to u^{MT}_i(t) is due to the values of E^{exc} and E^{inh}.
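The MT dynamics of (5)-(6) can be integrated with a simple explicit Euler scheme. The sketch below assumes precomputed V1 responses and connection weights; the step size and default constants are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def step_mt_potentials(u, r_v1, W_crf_pos, W_crf_neg, W_surround,
                       E_exc=70.0, E_inh=-10.0, E_L=0.0, gL=0.25, dt=1.0):
    """One explicit Euler step of the conductance-based MT model (5)-(6).

    u          : current membrane potentials of the MT cells, shape (n_mt,)
    r_v1       : V1 complex-cell responses, shape (n_v1,)
    W_crf_pos  : weights of CRF afferents with phi_ij <  pi/2, shape (n_mt, n_v1)
    W_crf_neg  : weights of CRF afferents with phi_ij >  pi/2, shape (n_mt, n_v1)
    W_surround : weights of surround afferents with phi_ij < pi/2, shape (n_mt, n_v1)
    """
    G_exc = np.maximum(0.0, W_crf_pos @ r_v1 - W_crf_neg @ r_v1)  # Eq. (6), excitatory
    G_inh = W_surround @ r_v1                                     # Eq. (6), inhibitory
    du = G_exc * (E_exc - u) + G_inh * (E_inh - u) + gL * (E_L - u)
    return u + dt * du
```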
The connection weight w_{ij} combines an amplification factor k_c, the absolute angle φ_ij between the preferred direction of the MT cell i and the preferred direction of the V1 cell j, and a weight w_cs(·) associated to the distance between the MT cell positioned at x_i = (x_i, y_i) and the V1 cell positioned at x_j = (x_j, y_j), which also depends on whether the connection belongs to the CRF or to the surround of the MT cell.
The CRF response of an MT cell can also be modulated by its surround, a property which is usually ignored in most MT-like models. In most cases this modulation is inhibitory, but Huang et al. [35] showed that, depending on the input stimulus, this interaction can also be integrative. The direction tuning of the surround
compared with the center tends to be either the same or opposite, but rarely
orthogonal.
Half of MT neurons have asymmetric receptive fields, introducing anisotropies in the processing of spatial information [16]. Neurons with asymmetric receptive fields seem to be involved in the encoding of important surface features, such as slant, tilt or curvature. Their geometry is mainly responsible for the direction tuning of the MT cell and changes over time.
Considering this, we included four types of MT cells (Fig. 2): one basic type activated only by its CRF, and three other types with inhibitory
surrounds. We claim that inhibitory surrounds contain key information about
the motion characterization (such as motion contrasts), as we will illustrate in
Section 3. The tuning direction of the surround is always the same as the CRF's, but its spatial geometry changes, from symmetric to asymmetric-unilateral
and asymmetric-bilateral surround interactions. It is important to mention that
this approach is a coarse approximation of the real receptive field shapes.
Concerning our problem, we define feature vectors as motion maps, which represent the averaged MT cell activity within a temporal window:

γ_j(t) = (1/Δt) ∫_{t−Δt}^{t} u^{MT}_j(s) ds,   j = 1, …, N_L × N_c,   (8)

where N_L is the number of MT layers and N_c is the number of MT cells per layer.
The motion map defined in (8) is invariant to the sequence length and to its starting point (for Δt large enough, depending on the scene). It also includes information regarding the temporal evolution of the activation of MT cells, respecting the causality in the order of events. The use of a sliding window allows us to include motion changes inside the sequence.
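In discrete time, the motion map of (8) is essentially a temporal average of the MT potentials over a sliding window. A minimal sketch, assuming the MT activity has been sampled into an array of shape (n_cells, n_steps):

```python
import numpy as np

def motion_map(u_mt, window):
    """Average MT activity over the last `window` samples (sliding temporal
    window of length Delta t), yielding one value per MT cell (N_L x N_c)."""
    return u_mt[:, -window:].mean(axis=1)
```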
3.2 Experiments
Implementation Details. We considered luminosity and contrast normalized
videos of size 210×210 pixels, centered on the action to recognize. Given V1
cells modeled by (3), we consider 9 layers of V1 cells. Each layer is built with V1
cells tuned with the same spatio-temporal frequency and 8 different orientations.
The 9 layers of V1 cells are distributed in the frequency space in order to tile
the whole space of interest (maximal spatial frequency of 0.5 cycles/pixel and a
maximal temporal frequency of 12 cycles/sec). The centers of the receptive fields
are distributed according to a radial log-polar scheme with a foveal uniform zone.
The limit between the two regions is given by the radius of the V1 fovea R0 (80
pixels). The cells with an eccentricity less than R0 have a homogeneous density and receptive field size. The cells with an eccentricity greater than R0 have a density and a receptive field size that depend on their eccentricity, giving a total of
4473 cells per layer.
The MT cells are also distributed in a log-polar architecture, but in this case
R0 is 40 pixels, giving a total of 144 cells per layer. Different layers of MT cells make up our model. Four different surround interactions were used in the MT
construction (see Fig. 2). Each layer, with a certain surround interaction, has 8
different directions.
Fig. 3. Recognition error rate obtained for the Weizmann database using the four different cells described in Fig. 2. We took all possible combinations considering 4 or 6 subjects in the training set (TS). For both cases, we ran the experiments with g^L = 0 and g^L = 0.25, and three surround-interaction configurations: just the CRF (black bars), CRF plus isotropic surround suppression (gray bars), and CRF plus isotropic and anisotropic surround suppression (red bars).
Fig. 4. Histograms obtained from the recognition error rates of our approach using all the cells defined in Fig. 2 for the Weizmann database and the same experimental protocol used in [14]. The gray bars are our histogram obtained for g^L = 0.25. (a) Mean recognition error rate obtained by [14] (GrC2, dense C2 features): 8.9% ± 5.9. (b) Mean recognition error rate obtained by [14] (GrC2, sparse C2 features): 3.0% ± 3.0. (c) Mean recognition error rate obtained with our approach: 1.1% ± 2.1.
Fig. 5. Results obtained for the robustness experiments carried out on the three input sequences; the snapshots show the normal walker (1), the noisy sequence (2), the legs-occluded sequence (3) and the moving-background sequence (4). In all cases the action was correctly recognized as walk, and the second closest class was side. The red bars indicate the ratio between the distance to the walk class and the distance to the side class (d_walk/d_side). The experiments were done for the three configurations of surround suppression: (a) just CRF, (b) CRF with isotropic surround and (c) CRF with isotropic/anisotropic surround (g^L = 0.25).
A motion map is computed for each test sequence and compared using (9) to all motion maps stored in the training set. The class of the sequence with the shortest distance is assigned as the match
class. The experiments were done considering every possible selection of 4 or
6 subjects, giving a total of 126 or 84 experiments. As output we obtained
histograms showing the frequency of the recognition error rates.
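Recognition thus amounts to a nearest-neighbor search over the stored motion maps. Since the distance of (9) is not reproduced in this excerpt, the sketch below takes the distance function as a parameter; the Euclidean default is purely illustrative.

```python
import numpy as np

def classify_motion_map(test_map, train_maps, train_labels, dist=None):
    """Assign the class of the training motion map closest to the test map."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)   # illustrative default distance
    distances = [dist(test_map, m) for m in train_maps]
    return train_labels[int(np.argmin(distances))]
```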
4 Conclusion
Acknowledgements
References
1. Gavrila, D.: The visual analysis of human movement: A survey. Computer Vision
and Image Understanding 73(1), 82–98 (1999)
2. Goncalves, L., DiBernardo, E., Ursella, E., Perona, P.: Monocular tracking of the
human arm in 3D. In: Proceedings of the 5th International Conference on Computer
Vision, June 1995, pp. 764–770 (1995)
3. Mokhber, A., Achard, C., Milgram, M.: Recognition of human behavior by space-
time silhouette characterization. Pattern Recognition Letters 29(1), 81–89 (2008)
4. Seitz, S., Dyer, C.: View-invariant analysis of cyclic motion. The International
Journal of Computer Vision 25(3), 231–251 (1997)
5. Collins, R., Gross, R., Shi, J.: Silhouette-based human identification from body
shape and gait. In: 5th Intl. Conf. on Automatic Face and Gesture Recognition, p.
366 (2002)
26. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion.
Journal of the Optical Society of America A 2, 284–299 (1985)
27. Carandini, M., Demb, J.B., Mante, V., Tollhurst, D.J., Dan, Y., Olshausen, B.A.,
Gallant, J.L., Rust, N.C.: Do we know what the early visual system does? Journal
of Neuroscience 25(46), 10577–10597 (2005)
28. Robson, J.: Spatial and temporal contrast-sensitivity functions of the visual system.
J. Opt. Soc. Am. 69, 1141–1142 (1966)
29. Albrecht, D., Geisler, W., Crane, A.: Nonlinear properties of visual cortex neurons:
Temporal dynamics, stimulus selectivity, neural performance, pp. 747–764. MIT
Press, Cambridge (2003)
30. Destexhe, A., Rudolph, M., Paré, D.: The high-conductance state of neocortical
neurons in vivo. Nature Reviews Neuroscience 4, 739–751 (2003)
31. Priebe, N., Cassanello, C., Lisberger, S.: The neural representation of speed in
macaque area MT/V5. Journal of Neuroscience 23(13), 5650–5661 (2003)
32. Perrone, J., Thiele, A.: Speed skills: measuring the visual speed analyzing proper-
ties of primate mt neurons. Nature Neuroscience 4(5), 526–532 (2001)
33. Liu, J., Newsome, W.T.: Functional organization of speed tuned neurons in visual
area MT. Journal of Neurophysiology 89, 246–256 (2003)
34. Perrone, J.: A visual motion sensor based on the properties of V1 and MT neurons.
Vision Research 44, 1733–1755 (2004)
35. Huang, X., Albright, T.D., Stoner, G.R.: Adaptive surround modulation in cortical
area MT. Neuron. 53, 761–770 (2007)
36. Topsoe, F.: Some inequalities for information divergence and related measures of
discrimination. IEEE Transactions on information theory 46(4), 1602–1609 (2000)
37. Zelnik-Manor, L., Irani, M.: Statistical analysis of dynamic actions. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 28(9), 1530–1535 (2006)
38. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time
shapes. Proceedings of the 10th International Conference on Computer Vision 2,
1395–1402 (2005)
Linking Pose and Motion
1 Introduction
Temporal consistency is a key ingredient in many 3D pose estimation algorithms
that work on video sequences. However, the vast majority of methods we know of
neglect an important source of information: The direction in which most objects
travel is directly related to their attitude. This is just as true of the fighter plane
of Fig. 1(a) that tends to move in the direction in which its nose points as of the
pedestrian of Fig. 1(b) who is most likely to walk in the direction he is facing.
The relationship, though not absolute—the plane can slip and the pedestrian
can move sideways—provides nevertheless useful constraints.
There are many Computer Vision papers on rigid, deformable, and articulated motion tracking, as recent surveys attest [1,2]. In most of these,
temporal consistency is enforced by regularizing the motion parameters, by re-
lating parameters in an individual frame to those estimated in earlier ones, or
by imposing a global motion model. However, we are not aware of any that explicitly take the kind of constraints we propose into account without implicitly learning them from training data, as is done in [3].
In this paper, we use the examples of the plane and the pedestrian to show
that such constraints, while simple to enforce, effectively increase pose estimation
reliability and accuracy for both rigid and articulated motion. In both cases, we
use challenging and long video sequences that are shot by a single moving camera
This work has been funded in part by the Swiss National Science Foundation and
in part by the VISIONTRAIN RTN-CT-2004-005439 Marie Curie Action within the
EC’s Sixth Framework Programme. The text reflects only the authors’ views and the
Community is not liable for any use that may be made of the information contained
therein.
Fig. 1. Airplanes and people are examples of objects that exhibit a favored direction
of motion. (a) We project the 3D aircraft model using the recovered pose to produce
the white overlay. The original images are shown in the upper right corner. (b) We
overlay the 3D skeleton in the recovered pose, which is correct even when the person
is occluded.
that can zoom to keep the target object in the field of view, rendering the use
of simple techniques such as background subtraction impractical.
When the image quality is high enough, simple dynamic models that penalize excessive
speed or acceleration or more sophisticated Kalman filtering techniques [7] are
sufficient to enforce temporal consistency. However, with lower quality data such
as the plane videos of Fig. 1(a), the simple quadratic regularization constraints [8]
that are used most often yield unrealistic results, as shown in Fig. 2.
Fig. 2. The first 50 frames of the first airplane sequence. The 3D airplane model is
magnified and plotted once every 5 frames in the orientation recovered by the algorithm:
(a) Frame by Frame tracking without regularization. (b) Imposing standard quadratic
regularization constraints. (c) Linking pose to motion produces a much more plausible
set of poses. Note for example the recovered depth of the brightest airplane: In (a) and
(b) it appears to be the frontmost one, which is incorrect. In (c) the relative depth is
correctly retrieved.
Fig. 3. Recovered 2D trajectory of the subject of Fig. 1(b). The arrows represent the
direction he is facing. (a) When pose and motion are not linked, he appears to walk
sideways. (b) When they are, he walks naturally. The underlying grid is made of 1
meter squares.
The normalized dot product (Ṗ_t · Λ_t) / (||Ṗ_t|| ||Λ_t||) between the velocity Ṗ_t and the orientation Λ_t should be close to 1. To enforce this, we can approximate the derivative of the
locations using finite differences between estimated locations P̂ at different time
instants. This approximation is appropriate when we can estimate the location
at a sufficiently high frequency (e.g. 25 Hz).
Fig. 4. The continuous curve represents the real trajectory of the object, while the
dashed lines show its approximation by finite differences
Our constraint then reduces to minimizing the angle between the finite-difference approximation of the derivative of the trajectory at time t, given by P̂_{t+1} − P̂_t, and the object's estimated orientation Λ̂_t. We write this angle, which is depicted as filled both at time t − 1 and t in Fig. 4, as

φ_{t→t+1} = arccos( (P̂_{t+1} − P̂_t) · Λ̂_t / (||P̂_{t+1} − P̂_t|| ||Λ̂_t||) ).
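A minimal sketch of this consistency measure, assuming 3D positions and an orientation vector expressed in the same coordinate frame; the function name is illustrative.

```python
import numpy as np

def pose_motion_angle(P_t, P_next, Lambda_t, eps=1e-9):
    """Angle between the finite-difference motion direction (P_{t+1} - P_t)
    and the estimated orientation Lambda_t; small angles mean the object
    moves in the direction it is facing."""
    v = np.asarray(P_next, dtype=float) - np.asarray(P_t, dtype=float)
    w = np.asarray(Lambda_t, dtype=float)
    cos_phi = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w) + eps)
    return np.arccos(np.clip(cos_phi, -1.0, 1.0))
```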
3 Rigid Motion
In the case of a rigid motion, we demonstrate our approach using video sequences
of a fighter plane performing aerobatic maneuvers such as the one depicted by
Fig. 5. In each frame of the sequences, we retrieve the pose, which includes the position expressed in Cartesian coordinates and the orientation defined by the roll, pitch and yaw angles. We show that these angles can be recovered from single
viewpoint sequences with a precision down to a few degrees, and that linking
pose and motion estimation contributes substantially to achieving this level of
accuracy. This is extremely encouraging considering the fact that the videos we
have been working with were acquired under rather unfavorable conditions: As
can be seen in Fig. 5, the weather was poor, the sky gray, and the clouds many, all
of which make the plane less visible and therefore harder to track. The airplane
is largely occluded by smoke and clouds in some frames, which obviously has an
adverse impact on accuracy but does not result in tracking failure.
The video sequences were acquired using a fully calibrated camera that could
rotate around two axes and zoom on the airplane. Using a couple of encoders, it
could keep track of the corresponding values of the pan and tilt angles, as well
as the focal length. We can therefore consider that the intrinsic and extrinsic
camera parameters are known in each frame. In the remainder of this section,
we present our approach first to computing poses in individual frames and then
imposing temporal consistency, as depicted by Fig. 4, to substantially improve
the accuracy and the realism of the results.
Fig. 5. Airplane video and reprojected model. First and third rows: Frames from
the input video. Note that the plane is partially hidden by clouds in some frames, which
makes the task more difficult. Second and fourth rows: The 3D model of the plane
is reprojected into the images using the recovered pose parameters. The corresponding
videos are submitted as supplemental material.
– The edge term is designed to favor poses such that projected model edges
correspond to actual image edges and plays an important role in ensuring
accuracy.
In each frame t, the objective function Lr is optimized using a particle-based
stochastic optimization algorithm [18] that returns the pose corresponding to
the best sample. The resulting estimated pose is a six-dimensional vector Ŝt =
(P̂t , Λ̂t ) = argminS Lr (S) where P̂t = (X̂t , Ŷt , Ẑt ) is the estimated position of
the plane in an absolute world coordinate system and Λ̂t = (ρ̂t , θ̂t , γ̂t ) is the
estimated orientation expressed in terms of roll, pitch and yaw angles. The esti-
mated pose Ŝt at time t is used to initialize the algorithm in the following frame
t + 1, thus assuming that the motion of the airplane between two consecutive
frames is relatively small, which is true in practice.
Independently optimizing Lr in each frame yields poses that are only roughly
correct. As a result, the reconstructed motion is extremely jerky. To enforce
temporal consistency, we introduce a regularization term M defined over frames t − 1, t, and t + 1, and minimize

f_r(S_1, …, S_N) = Σ_{t=1}^{N} L_r(S_t) + Σ_{t=2}^{N−1} M(S_t)   (4)
with respect to the poses in individual images. In practice, for long video
sequences, this represents a very large optimization problem. Therefore, in our
current implementation, we perform this minimization in sliding temporal 3-frame
windows using a standard simplex algorithm that does not require the computa-
tion of derivatives. We start with the first set of 3 frames, retain the resulting pose
in the first frame, slide the window by one frame, and iterate the process using the
previously refined poses to initialize each optimization step.
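A sketch of this sliding 3-frame refinement, using SciPy's Nelder-Mead simplex as a stand-in for the derivative-free optimizer; `objective` stands for the sum in (4) restricted to the window and is an assumed callable, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def refine_poses(poses, objective, window=3):
    """Slide a 3-frame window over the sequence, re-optimize the poses in the
    window with a simplex method, keep the refined poses, and move on."""
    poses = [np.asarray(p, dtype=float) for p in poses]
    for t in range(len(poses) - window + 1):
        x0 = np.concatenate(poses[t:t + window])
        res = minimize(lambda x: objective(x, t), x0, method='Nelder-Mead')
        poses[t:t + window] = np.split(res.x, window)  # refined poses seed the next window
    return poses
```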
The first sequence we use for the evaluation of our approach is shown in Fig. 5
and contains 1000 frames shot over 40 seconds, a time during which the plane
performs rolls, spins and loops and undergoes large accelerations.
In Fig. 6(a) we plot the locations obtained in each frame independently. In
Fig. 6(b) we imposed motion smoothness by using only the first two terms of (1).
In Fig. 6(c) we link pose to motion by using all three terms of (1). The trajectories
are roughly similar in all cases. However, using the full set of constraints yields
a trajectory that is both smoother and more plausible.
In Fig. 2, we zoom in on a portion of these 3 trajectories and project the 3D
plane model in the orientation recovered every fifth frame. Note how much more
consistent the poses are when we use our full regularization term.
The plane was equipped with sophisticated gyroscopes which gave us mean-
ingful estimates of roll, pitch, and yaw angles, synchronized with the camera
Fig. 6. Recovered 3D trajectory of the airplane for the 40s sequence of Fig. 5: (a) Frame
by Frame tracking. (b) Imposing motion smoothness. (c) Linking pose to motion. The
coordinates are expressed in meters.
and available every third frame. We therefore use them as ground truth. Table 1
summarizes the deviations between those angles and the ones our algorithm
produces for the whole sequence. Our approach yields an accuracy improvement over frame-by-frame tracking as well as over tracking with a simple smoothness constraint. The latter improvement is on the order of 5%, which is significant if one considers that the telemetry data itself is somewhat noisy and that we are there-
fore getting down to the same level of precision. Most importantly, the resulting
sequence does not suffer from jitter, which plagues the other two approaches, as
can be clearly seen in the videos given as supplemental material.
Table 1. Comparing the recovered pose angles against gyroscopic data for the sequence
of Fig. 5. Mean and standard deviation of the absolute error in the 3 angles, in degrees.
In Fig. 7 we show the retrieved trajectory for a second sequence, which lasts
20 seconds. As before, in Table 2, we compare the angles we recover against
gyroscopic data. Again, linking pose to motion yields a substantial improvement.
4 Articulated Motion
To demonstrate the effectiveness of the constraint we propose in the case of
articulated motion, we start from the body tracking framework proposed in [19].
In this work, it was shown that human motion could be reconstructed in 3D
Fig. 7. Recovered 3D trajectory of the airplane for a 20 s sequence: (a) Frame
by Frame tracking. (b) Imposing motion smoothness. (c) Linking pose to motion. The
coordinates are expressed in meters.
Table 2. Second sequence: Mean and standard deviation of the absolute error in the
3 angles, in degrees
We rely on a coarse body model in which individual limbs are modeled as cylin-
ders. Let St = (Pt , Θt ) be the state vector that defines its pose at time t, where
Θt is a set of joint angles and Pt a 3D vector that defines the position and orien-
tation of the root of the body in a 2D reference system attached to the ground
plane.
In the original approach [19], a specific color was associated to each limb by
averaging pixel intensities in the projected area of the limb in the frames where
a canonical pose was detected. Then St was recovered as follows: A rough initial
state was predicted by the motion model. Then the sum-of-squared-differences
between the synthetic image, obtained by reprojecting the model, and the actual
one was minimized using a simple stochastic optimization algorithm.
Here, we replace the single color value associated to each limb by a histogram, thereby increasing generality. As in Sect. 3.1, we define an objective function
La that measures the quality of the pose using the Bhattacharyya distance to
express the similarity between the histogram associated to a limb and that of
the image portion that corresponds to its reprojection. Optimizing La in each
frame independently leads, as could be expected, to a jittery reconstruction as
can be seen in the video given as supplemental material.
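A minimal sketch of the histogram comparison underlying La, based on the Bhattacharyya coefficient [17]; the normalization and the distance form below are common choices and are assumptions here, not the authors' exact definition.

```python
import numpy as np

def bhattacharyya_distance(hist_model, hist_obs, eps=1e-12):
    """Distance between a limb's reference color histogram and the histogram
    of the pixels covered by its reprojection (both normalized to sum to 1)."""
    p = np.asarray(hist_model, dtype=float); p /= p.sum() + eps
    q = np.asarray(hist_obs, dtype=float);   q /= q.sum() + eps
    bc = np.sum(np.sqrt(p * q))              # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))       # common distance form
```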
To enforce temporal consistency, we therefore minimize

f_a(S_1, …, S_N) = Σ_{t=1}^{N} L_a(S_t) + β Σ_{t=2}^{N} φ²_{t−1→t}   (5)
with respect to (α1 , . . . , αn , Pstart , Pend , η), where the second term is defined the
same way as in the airplane case and β is as before a constant weight that
relates incommensurate quantities. The only difference is that in this case both
the estimated orientation and the expected motion, that define the angle φ, are
2-dimensional vectors lying on the ground plane. This term is the one that links
pose to motion. Note that we do not need quadratic regularization terms such as
the first two of (1) because our parameters control the entire trajectory, which
is guaranteed to be smooth.
Fig. 8. Pedestrian tracking and reprojected 3D model for the sequence of Fig. 1 First
and third rows: Frames from the input video. The recovered body pose has been
reprojected on the input image. Second and fourth rows: The 3D skeleton of the
person is seen from a different viewpoint, to highlight the 3D nature of the results. The
numbers in the bottom right corner are the instantaneous speeds derived from the re-
covered motion parameters. The corresponding videos are submitted as supplementary
material.
The images clearly show that, without temporal consistency constraints, the subject appears to slide sideways, whereas when the constraints are enforced the motion is perfectly consistent with the pose. This can best be evaluated from the videos
given as supplemental material.
To validate our results, we manually marked the subject’s feet every 10 frames
in the sequence of Fig. 8 and used their position with respect to the tiles on the
ground plane to estimate their 3D coordinates. We then treated the vector joining the feet as an estimate of the body orientation and their midpoint as an estimate of its location. As can be seen in Table 3, linking pose to motion produces a small
improvement in the position estimate and a much more substantial one in the
orientation estimate, which is consistent with what can be observed in Fig. 3.
In the sequence of Fig. 9 the subject is walking along a curvilinear path and
the camera follows him, so that the viewpoint undergoes large variations. We
are nevertheless able to recover pose and motion in a consistent way, as shown
in Fig. 10 which represents the corresponding recovered trajectory.
Table 3. Comparing the recovered pose angles against manually recovered ground
truth data for the sequence of Fig. 8. It provides the mean and standard deviation
of the absolute error in the X and Y coordinates, in centimeters, and the mean and
standard deviation of the recovered orientation, in degrees.
Fig. 10. Recovered 2D trajectory of the subject of Fig. 9. As in Fig. 3, when orientation and motion are not linked, he appears to walk sideways (a), but not when they are (b).
5 Conclusion
In this paper, we have used two very different applications to demonstrate that
jointly optimizing pose and direction of travel substantially improves the quality
of the 3D reconstructions that can be obtained from video sequences. We have
also shown that we can obtain accurate and realistic results using a single moving
camera.
This can be done very simply by imposing an explicit constraint that forces
the angular pose of the object or person being tracked to be consistent with their
direction of travel. This could be naturally extended to more complex interac-
tions between pose and motion. For example, when a person changes orientation,
the motion of his limbs is not independent of the turn radius. Similarly, the di-
rection of travel of a ball will be affected by its spin. Explicitly modeling these
subtle but important dependencies will therefore be a topic for future research.
References
1. Lepetit, V., Fua, P.: Monocular model-based 3d tracking of rigid objects: A survey.
Foundations and Trends in Computer Graphics and Vision (2005)
2. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based hu-
man motion capture and analysis. CVIU 104(2), 90–126 (2006)
3. Sidenbladh, H., Black, M.J., Sigal, L.: Implicit Probabilistic Models of Human Mo-
tion for Synthesis and Tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen,
P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 784–800. Springer, Heidelberg (2002)
4. Bar-Shalom, Y., Kirubarajan, T., Li, X.R.: Estimation with Applications to Track-
ing and Navigation. John Wiley & Sons, Inc., Chichester (2002)
5. Zexiang, L., Canny, J.: Nonholonomic Motion Planning. Springer, Heidelberg
(1993)
6. Ren, L., Patrick, A., Efros, A.A., Hodgins, J.K., Rehg, J.M.: A data-driven ap-
proach to quantifying natural human motion. ACM Trans. Graph. 24(3) (2005)
7. Koller, D., Daniilidis, K., Nagel, H.H.: Model-Based Object Tracking in Monocular
Image Sequences of Road Traffic Scenes. IJCV 10(3), 257–281 (1993)
8. Poggio, T., Torre, V., Koch, C.: Computational Vision and Regularization Theory.
Nature 317 (1985)
9. Brubaker, M., Fleet, D., Hertzmann, A.: Physics-based person tracking using sim-
plified lower-body dynamics. In: CVPR (2007)
10. Urtasun, R., Fleet, D., Fua, P.: 3D People Tracking with Gaussian Process Dy-
namical Models. In: CVPR (2006)
11. Ormoneit, D., Sidenbladh, H., Black, M.J., Hastie, T.: Learning and tracking cyclic
human motion. In: NIPS (2001)
12. Agarwal, A., Triggs, B.: Tracking articulated motion with piecewise learned dy-
namical models. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023,
pp. 54–65. Springer, Heidelberg (2004)
13. Taycher, L., Shakhnarovich, G., Demirdjian, D., Darrell, T.: Conditional Random
People: Tracking Humans with CRFs and Grid Filters. In: CVPR (2006)
14. Rosenhahn, B., Brox, T., Seidel, H.: Scaled motion dynamics for markerless motion
capture. In: CVPR (2007)
15. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.: Nonparametric density estima-
tion with adaptive, anisotropic kernels for human motion tracking. In: Workshop
on HUMAN MOTION Understanding, Modeling, Capture and Animation (2007)
16. Howe, N.R., Leventon, M.E., Freeman, W.T.: Bayesian reconstructions of 3D hu-
man motion from single-camera video. In: NIPS (1999)
17. Djouadi, A., Snorrason, O., Garber, F.: The quality of training sample estimates
of the bhattacharyya coefficient. PAMI 12(1), 92–97 (1990)
18. Isard, M., Blake, A.: CONDENSATION - conditional density propagation for visual
tracking. IJCV 29(1), 5–28 (1998)
19. Fossati, A., Dimitrijevic, M., Lepetit, V., Fua, P.: Bridging the Gap between De-
tection and Tracking for 3D Monocular Video-Based Motion Capture. In: CVPR
(2007)
20. Urtasun, R., Fleet, D., Fua, P.: Temporal Motion Models for Monocular and Mul-
tiview 3–D Human Body Tracking. CVIU 104(2-3), 157–177 (2006)
Automated Delineation of Dendritic Networks in Noisy
Image Stacks
1 Introduction
Full reconstruction of neuron morphology is essential for the analysis and understand-
ing of their functioning. In its most basic form, the problem involves processing stacks
of images produced by a microscope, each one showing a slice of the same piece of
tissue at a different depth.
Currently available commercial products such as Neurolucida¹, Imaris², or Metamorph³ provide sophisticated interfaces to reconstruct dendritic trees and rely heavily
on manual operations for initialization and re-initialization of the delineation proce-
dures. As a result, tracing dendritic trees in noisy images remains a tedious process. It
can take an expert up to 10 hours for each one. This limits the amount of data that can
be processed and represents a significant bottleneck in neuroscience research on neuron
morphology.
Automated techniques have been proposed but are designed to work on very high
quality images in which the dendrites can be modeled as tubular structures [1,2]. In
Supported by the Swiss National Science Foundation under the National Centre of Compe-
tence in Research (NCCR) on Interactive Multimodal Information Management (IM2).
¹ http://www.microbrightfield.com/prod-nl.htm
² http://www.bitplane.com/go/products/imaris
³ http://www.moleculardevices.com/pages/software/metamorph.html
Fig. 1. (a) Minimum intensity projection of an image stack. Each pixel value is the minimum in-
tensity value of the voxels that are touched by the ray cast from the camera through the pixel. (b)
3D tree reconstructed by our algorithm, which is best viewed in color. (c) Detail of the data vol-
ume showing the non-tubular aspect of a dendrite with the corresponding automatically generated
delineation.
practice, however, due to the underlying neuron structure, irregularities in the dyeing
process, and other sources of noise, the filaments often appear as an irregular series of
blobs surrounded by other non-neuron structures, as is the case of the brightfield image
stacks depicted by Fig. 1. Yet, such images are particularly useful for analyzing large
samples. More generally, very high resolution images take a long time to acquire and
require extremely expensive equipment, such as confocal microscopes. The ability to
automatically handle lower resolution and noisier ones is therefore required to make
these techniques more accessible. Ideally, the painstaking and data-specific tuning that
many existing methods require should also be eliminated.
In this paper, we therefore propose an approach to handling the difficulties that are
inherent to this imaging process. We do not assume an a priori dendrite model but
rely instead on supervised and unsupervised statistical learning techniques to construct
models as we go, which is more robust to unpredictable appearance changes. More
specifically, we first train a classifier that can distinguish dendrite voxels from others
using a very limited amount of expert-labeled ground truth. At run-time, it lets us detect
such voxels, some of which should be connected by edges to represent the dendritic
tree. To this end, we first find the minimum spanning tree connecting dendrite-like
voxels. We then use an Expectation-Maximization approach to learn an appearance
model for the edges that correspond to dendrites and those that do not. Finally, given
these appearance models, we re-build and prune the tree to obtain the final delineation,
such as the one depicted by Fig. 1(b), which is beyond what state-of-the-art techniques
can produce automatically.
To demonstrate the versatility of our approach, we also ran our algorithm on retinal
images, which we were able to do by simply training our classifier to recognize 2D
blood vessel pixels instead of 3D dendrite voxels.
2 Related Work
Reconstructing networks of 3D filaments, be they blood vessels or dendrites, is an im-
portant topic in Biomedical Imaging and Computer Vision [3,4]. This typically involves
measuring how filament-like voxels are and an algorithm connecting those that appear
to be. We briefly review these two aspects below.
The second class requires optimizing the path between seed points, often provided by the operator, to maximize the overall dendriteness [8,11,17]. In these examples, the authors use active contour models, geometrical constraints and the live-wire algorithm to connect the seeds.
By contrast to these methods that postulate an a priori cost function for connecting
voxels, our approach learns a model at run-time, which lets it deal with the potentially
changing appearance of the filaments depending on experimental conditions. Further-
more, we do this fully automatically, which is not the case for any of the methods
discussed above.
3 Methodology
Our goal is to devise an algorithm that is fully automatic and can adapt to noisy data in
which the appearance of the dendrites is not entirely predictable. Ideally we would like
to find the tree maximizing the probability of the image under a consistent generative
model. Because such an optimization is intractable, we propose an approximation that
involves the three following steps:
1. We use a hand-labeled training image stack to train once and for all a classifier that
computes a voxel’s probability to belong to a dendrite from its neighbors intensities.
2. We run this classifier on our stacks of test images, use a very permissive threshold to
select potential dendrite voxels, apply non-maximum suppression, and connect all
the surviving voxels with a minimum spanning tree. Some of its edges will correspond to actual dendritic filaments and others will be spurious. We use both the correct and spurious edges to learn filament appearance models in an EM framework.
3. Under a Markovian assumption, we combine these edge appearance models to
jointly model the image appearance and the true presence of filaments. We then
optimize the probability of the latter given the former and prune spurious branches.
As far as detecting dendrite voxels is concerned, our approach is related to the Hessian-
based approach of [13]. However, dropping the Hessian and training our classifier di-
rectly on the intensity data lets us relax the cylindrical assumption and allows us to
handle structures that are less visibly tubular. As shown in Fig. 2, this yields a marked
improvement over competing approaches.
In terms of linking, our approach can be compared to those that attempt to find opti-
mal paths between seeds [11,8] using a dendrite appearance model, but with two major
improvements: First our seed points are detected automatically instead of being manu-
ally supplied, which means that some of them may be spurious and that the connectivity
has to be inferred from the data. Second we do not assume an a priori filament model but
learn one from the data as we go. This is much more robust to unpredictable appearance
changes. Furthermore, unlike techniques that model filaments as tubular structures [1,2],
we do not have to postulate regularities that may not be present in our images.
3.1 Notations
Given the three-step algorithm outlined above, we now introduce the notation we will use to describe it in more detail.
Fig. 2. (a) Training data. On top: image stack representing one neuron. Below: manually delineated filaments overlaid in white. (b,c,d) Voxels labeled as potentially belonging to a dendrite. (b)
By thresholding the grayscale images. (c) By using the Hessian. (d) By using our classifier. Note
that the seed points obtained with our method describe better the underlying neuron structure.
As discussed in Section 2, the standard approach to deciding whether voxels are inside
a dendrite or not is to compute the Hessian of the intensities and look at its eigenvalues.
This however implicitly makes strong assumptions on the expected intensity patterns.
Instead of using such a hand-designed model, we train a classifier from a small quantity
of hand-labeled neuron data with AdaBoost [18], which yields superior classification
performance as shown in Fig. 2.
More specifically, the resulting classifier f is a linear combination of weak
learners hi :
f(x, y, z) = Σ_{i=1}^{N} α_i h_i(x, y, z),   (1)
where the hi represent differences of the integrals of the image intensity over two cubes
in the vicinity of (x, y, z) and Ti is the weak classifier threshold. We write
h_i(x, y, z) = σ( Σ_{(x′,y′,z′)∈V_i^1} I(x′, y′, z′) − Σ_{(x′,y′,z′)∈V_i^2} I(x′, y′, z′) − T_i ),   (2)

where σ is the sign function and V_i^1, V_i^2 are respectively the two volumes defining h_i, trans-
lated according to (x, y, z). These weak classifiers can be calculated with just sixteen
memory accesses by using precomputed integral cubes, which are natural extensions of
integral images.
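The integral-cube trick is the 3D analogue of integral images: after one cumulative sum per axis, the intensity integral over any axis-aligned cube reduces to eight lookups, so the difference of two cubes in (2) costs sixteen memory accesses. A sketch, with illustrative names:

```python
import numpy as np

def integral_cube(volume):
    """Cumulative sums along the three axes, zero-padded so that
    S[x, y, z] = sum of volume[:x, :y, :z]."""
    S = np.zeros(tuple(d + 1 for d in volume.shape))
    S[1:, 1:, 1:] = volume.cumsum(0).cumsum(1).cumsum(2)
    return S

def box_sum(S, lo, hi):
    """Sum of the volume over the cube [lo, hi) using eight accesses."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (S[x1, y1, z1] - S[x0, y1, z1] - S[x1, y0, z1] - S[x1, y1, z0]
            + S[x0, y0, z1] + S[x0, y1, z0] + S[x1, y0, z0] - S[x0, y0, z0])
```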
During training, we build at each iteration 10³ candidate weak learners h_i by randomly picking volume pairs and finding an optimal threshold T_i for each. After running AdaBoost, N = 1000 weak learners are retained in the f classifier of (1). The training samples are
taken from the manual reconstruction of Fig. 2. They consist of filaments at different
orientations and of a certain width. The final classifier responds to filaments of the pre-
defined width, independently of the orientation.
At run time, we apply f on the whole data volume and perform non-maximum suppression by retaining only voxels that maximize it within an 8 × 8 × 20 neighborhood,
such as those shown in Fig. 2. The anisotropy on the neighborhood is due to the low
resolution of the images in the z axis, produced by the point spread function of the
microscope.
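A sketch of this anisotropic non-maximum suppression step using a standard maximum filter; the threshold parameter stands in for the "very permissive threshold" mentioned in Section 3 and its value is an assumption.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def candidate_voxels(score_volume, size=(8, 8, 20), threshold=0.0):
    """Keep voxels whose classifier score is above a permissive threshold and
    maximal within an 8 x 8 x 20 neighborhood (anisotropic in z)."""
    local_max = maximum_filter(score_volume, size=size)
    mask = (score_volume >= local_max) & (score_volume > threshold)
    return np.argwhere(mask)              # (N, 3) array of voxel coordinates
```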
Fig. 3. (a) First two dimensions of the PCA space of the edge appearance models. The Gaussian
models are shown as contour lines. The two small figures at the top represent the projection of
the means in the original lattice. The top-left one represents the model μ1 for filaments, which
appear as a continuous structure. The top-right one represents the non-filament model μ0 . Since,
by construction the endpoints of the edges are local maxima, the intensity there is higher than
elsewhere. (b) Hidden Markov Model used to estimate the probability of a vertex to belong to the
dendritic tree.
The candidate vertices are first connected by a minimum spanning tree whose edges are weighted by their likelihood to be part of a dendrite. Nevertheless, the tree obtained with this procedure is over-complete, spanning vertices that are not part of the dendrites (Fig. 4(b)). In order
to eliminate the spurious branches, we use the tree to evaluate the probability that in-
dividual vertices belong to a dendrite, removing those with low probability. We iterate
between the tree reconstruction and vertex elimination until convergence, Fig. 4(c).
We assume that the relationship between the hidden state of the vertices and the
edge appearance vectors can be represented in terms of a hidden Markov model such
Fig. 4. Building and pruning the tree. (a) Image stack (b) Initial maximum spanning tree. (c) After
convergence of the iterative process. (d) Manually delineated ground truth. Red solid lines denote
edges that are likely to be dendrites due to their appearance. Blue dashed lines represent edges
retained by the minimum spanning tree algorithm to guarantee connectivity.The main filaments
are correctly recovered. Note that our filament detector is sensitive to filaments thinner than the
ones in the ground truth data. This produces the structures in the right part of the images that are
not part of the ground truth data.
as the one depicted by Fig. 3(b). More precisely, we take N (G, i) to be the neighboring
vertices of i in G and assume that
P(X_i | X_{\i}, (A_{k,l})_{(k,l)∈G}) = P(X_i | (X_k)_{k∈N(G,i)}, (A_{i,k})_{k∈N(G,i)}),   (3)
P(A_{i,j} | X, (A_{k,l})_{(k,l)∈G\(i,j)}) = P(A_{i,j} | X_i, X_j).   (4)
Under these assumptions, we are looking for a tree consistent with the edge appearance model of Section 3.3. This means that the labels of its vector of maximum posterior
probabilities x are all 1s. To do so we alternate the building of a tree spanning the
vertices currently labeled 1 and the re-labeling of the vertices to maximize the posterior
probability. The tree we are looking for is a fixed point of this procedure.
Building the Tree. We are looking for maximum likelihood tree that spans all vertices.
Formally:
To this end, we use a slightly modified version of the minimum spanning tree algorithm.
Starting with an empty graph, we add to it at every iteration the edge (i, j) that does not
create a cycle and maximizes
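The quantity maximized at each step is not reproduced above; as a hedged sketch, the greedy construction can be written with an off-the-shelf maximum spanning tree, assuming each edge carries a scalar score (for instance, a log-likelihood ratio under the edge appearance models):

```python
# Hedged sketch of the greedy tree construction (Kruskal-style maximum spanning tree);
# `edge_scores` maps vertex pairs to the score being maximized, whatever its exact form.
import networkx as nx

def build_spanning_tree(vertices, edge_scores):
    G = nx.Graph()
    G.add_nodes_from(vertices)
    for (i, j), score in edge_scores.items():
        G.add_edge(i, j, weight=score)
    # Repeatedly add the highest-scoring edge that does not create a cycle.
    return nx.maximum_spanning_tree(G, weight="weight")
```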
Eliminating Unlikely Vertices. From the appearance models μ0 and μ1 learned in Section 3.3, and the Markovian assumption of Section 3.3, we can estimate for any graph G the most probable subset of nodes truly on filaments. More specifically, we are looking for the labeling x of maximum posterior probability given the appearance, defined as follows.
4 Results
In this section we first describe the images we are using. We then compare the dis-
criminative power of our dendrite model against simple grayscale thresholding and the
baseline Hessian based method [6]. Finally, we validate our automated tree reconstruc-
tion results by comparing them against a manual delineation.
Our image database consists of six neuron image stacks, in two of which the dendritic
tree has been manually delineated. We use one of those trees for training and the other
for validation purposes.
The neurons are taken from the somatosensory cortex of Wistar-han rats. The image
stacks are obtained with a standard brightfield microscope. Each image of the stack
shows a slice of the same piece of tissue at a different depth. The tissue is transparent
enough so that these pictures can be acquired by simply changing the focal plane.
Each image stack has an approximate size of 5 × 10^9 voxels, and is downsampled to a size of 10^8 voxels to make the evaluation of the image functional at every voxel computationally tractable. After downsampling, each voxel has the same width, height, and depth of 0.8 μm.
The classifier f of Eq. (1) is trained using the manual delineation of Fig. 2. As positive sam-
ples, we retain 500 voxels belonging to filaments of width ranging from two to six
voxels and different orientations. As negative samples, we randomly pick 1000 voxels
that are no closer to a neuron than three times the neuron width and are representative of
the image noise. Since the training set contains filaments of many different orientations,
Adaboost produces a classifier that is orientation independent.
Fig. 2 depicts the candidate dendrite voxels obtained by performing non-maximum suppression on images calculated by simply thresholding the original images, computing
Fig. 5. (a) ROC curve for all three measures using the validation data of figure 4(d). The boosting
classifier outperforms the baseline Hessian method of [6] in noisy brightfield images. (b) Defining
a metric to compare our results against a manual delineation. Top: portion of a manual delineation
in which the vertices are close to each other and the tolerance width painted in red. Middle:
Portion of the tree found by our algorithm at the same location. Bottom: The fully-connected
graph we use to evaluate our edge appearance model and plot the corresponding ROC curves.
(c) ROC curve for the detection of edges on filaments, obtained by thresholding the individual
estimated likelihood of the edges of the graph of (b). The individual points represent the iterations
of the tree reconstruction algorithm. Two of them are depicted by Fig. 4(b,c). After five iterations
we reach a fixed point, which is our final result.
a Hessian-based measure [6], or computing the output of our classifier at each voxel.
The same procedure is applied to the validation data of Fig. 4(d). Considering as correct the vertices that are within 5 μm (6 voxels) of the neuron, we can plot the three ROC
curves of Fig. 5(a) that show that our classifier outperforms the other two.
Fig. 6. Three additional reconstructions without annotations. Top row: Image stacks. Bottom row:
3D dendritic tree built by our algorithm. As in Fig. 4, the edges drawn with solid red lines are
those likely to belong to a dendrite given their appearance. The edges depicted with dashed blue
lines are kept to enforce the tree structure through all the vertices. This figure is best viewed in
color.
[Fig. 7(b) legend: boosted (ours), 2nd observer, Staal, Niemeijer, Zana, Jiang, Martinez-Perez, Chaudhuri.]
Fig. 7. (a) Top: image of the retina. Bottom: response of our boosting classifier in this image. (b)
Comparison of our classifier against other algorithms evaluated in the DRIVE database [19]. It
performs similarly to most of them, but worse than algorithms designed specifically to trace blood
vessels in images of the retina. This can be attributed to the fact that our boosted classifier operates
at a single scale and is optimized to detect large vessels, whereas the others are multiscale.
Fig. 8. Retinal trees reconstructed with our method. Top row: original image with the recon-
structed tree overlay. As in Fig. 6, edges likely to belong to filaments are drawn in red, while
edges kept to enforce the tree structure are colored in blue. Bottom row: manually obtained
ground truth. Note that thick filaments are correctly delineated, whereas thin filaments are prone
to errors because our classifier is trained only for the thick ones.
5 Conclusion
We have shown how to learn the dendrite measure using discriminative machine learning techniques. We model the edges with a Gaussian mixture model, whose parameters are learned using EM on neuron-specific samples.
To demonstrate the generality of the approach, we showed that it also works for
blood vessels in retinal images, without any parameter tuning.
Our current implementation approximates the maximum likelihood dendritic tree
under the previous models by means of minimum spanning trees and Markov random fields. These techniques are computationally cheap, but tend to produce artifacts. In
future work we will replace them by more general graph optimization techniques.
References
1. Al-Kofahi, K., Lasek, S., Szarowski, D., Pace, C., Nagy, G., Turner, J., Roysam, B.: Rapid
automated three-dimensional tracing of neurons from confocal image stacks. IEEE Transac-
tions on Information Technology in Biomedicine (2002)
2. Tyrrell, J., di Tomaso, E., Fuja, D., Tong, R., Kozak, K., Jain, R., Roysam, B.: Robust 3-
d modeling of vasculature imagery using superellipsoids. Medical Imaging 26(2), 223–237
(2007)
3. Kirbas, C., Quek, F.: Vessel extraction techniques and algorithms: A survey. In: Proceedings
of the Third IEEE Symposium on BioInformatics and BioEngineering, p. 238 (2003)
4. Krissian, K., Kikinis, R., Westin, C.F.: Algorithms for extracting vessel centerlines. Technical
Report 0003, Department of Radiology, Brigham and Women’s Hospital, Harvard Medical
School, Laboratory of Mathematics in Imaging (September 2004)
5. Sato, Y., Nakajima, S., Atsumi, H., Koller, T., Gerig, G., Yoshida, S., Kikinis, R.: 3d multi-
scale line filter for segmentation and visualization of curvilinear structures in medical images.
Medical Image Analysis 2, 143–168 (1998)
6. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale vessel enhance-
ment filtering. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS,
vol. 1496, pp. 130–137. Springer, Heidelberg (1998)
7. Streekstra, G., van Pelt, J.: Analysis of tubular structures in three-dimensional confocal im-
ages. Network: Computation in Neural Systems 13(3), 381–395 (2002)
8. Meijering, E., Jacob, M., Sarria, J.C.F., Steiner, P., Hirling, H., Unser, M.: Design and valida-
tion of a tool for neurite tracing and analysis in fluorescence microscopy images. Cytometry
Part A 58A(2), 167–176 (2004)
9. Aguet, F., Jacob, M., Unser, M.: Three-dimensional feature detection using optimal steerable
filters. In: Proceedings of the 2005 IEEE International Conference on Image Processing (ICIP
2005), Genova, Italy, September 11-14, 2005, vol. II, pp. 1158–1161 (2005)
10. Dima, A., Scholz, M., Obermayer, K.: Automatic segmentation and skeletonization of neu-
rons from confocal microscopy images based on the 3-d wavelet transform. IEEE Transaction
on Image Processing 7, 790–801 (2002)
11. Schmitt, S., Evers, J.F., Duch, C., Scholz, M., Obermayer, K.: New methods for the
computer-assisted 3d reconstruction of neurons from confocal image stacks. NeuroImage 23,
1283–1298 (2004)
12. Agam, G., Wu, C.: Probabilistic modeling-based vessel enhancement in thoracic ct scans.
In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 684–689. IEEE Computer Society,
Washington (2005)
13. Santamarı́a-Pang, A., Colbert, C.M., Saggau, P., Kakadiaris, I.A.: Automatic centerline ex-
traction of irregular tubular structures using probability volumes from multiphoton imaging.
In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp.
486–494. Springer, Heidelberg (2007)
14. Al-Kofahi, K.A., Can, A., Lasek, S., Szarowski, D.H., Dowell-Mesfin, N., Shain, W., Turner,
J.N., et al.: Median-based robust algorithms for tracing neurons from noisy confocal micro-
scope images (December 2003)
15. Flasque, N., Desvignes, M., Constans, J., Revenu, M.: Acquisition, segmentation and track-
ing of the cerebral vascular tree on 3d magnetic resonance angiography images. Medical
Image Analysis 5(3), 173–183 (2001)
16. McIntosh, C., Hamarneh, G.: Vessel crawlers: 3d physically-based deformable organisms
for vasculature segmentation and analysis. In: CVPR 2006: Proceedings of the 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1084–1091.
IEEE Computer Society Press, Washington (2006)
17. Szymczak, A., Stillman, A., Tannenbaum, A., Mischaikow, K.: Coronary vessel trees from
3D imagery: a topological approach. Medical Image Analysis (August 2006)
18. Freund, Y., Schapire, R.: Experiments with a New Boosting Algorithm. In: International
Conference on Machine Learning, pp. 148–156. Morgan Kaufmann, San Francisco (1996)
19. Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., van Ginneken, B.: Ridge based vessel
segmentation in color images of the retina. IEEE Transactions on Medical Imaging 23, 501–
509 (2004)
Calibration from Statistical Properties of the Visual World
E. Grossmann, J.A. Gaspar, and F. Orabona
Abstract. What does a blind entity need in order to determine the geometry of
the set of photocells that it carries through a changing lightfield? In this paper, we
show that very crude knowledge of some statistical properties of the environment
is sufficient for this task.
We show that some dissimilarity measures between pairs of signals produced
by photocells are strongly related to the angular separation between the photo-
cells. Based on real-world data, we model this relation quantitatively, using dis-
similarity measures based on the correlation and conditional entropy. We show
that this model allows us to estimate the angular separation from the dissimilarity.
Although the resulting estimators are not very accurate, they maintain their per-
formance throughout different visual environments, suggesting that the model
encodes a very general property of our visual world.
Finally, leveraging this method to estimate angles from signal pairs, we show how distance geometry techniques allow us to recover the complete sensor geometry.
1 Introduction
This paper departs from traditional computer vision by not considering images or image
features as input. Instead, we take signals generated by photocells with unknown ori-
entation and a common center of projection, and explore the information these signals
can shed on the sensor and its surrounding world.
We are particularly interested in whether the signals allow us to determine the geometry of the sensor, that is, to calibrate a sensor like the one shown in Figure 1.
Psychological experiments [1] showed that a person wearing distorting glasses for a
few days, after a very confusing and disturbing period, could learn the necessary image
correction to restart interacting effectively with the environment. Can a computer do the
same when, rather than distorted images, it is given the signals produced by individual
photocells? In this situation, it is clear that traditional calibration techniques [2,3] are
out of the question.
Less traditional non-parametric methods that assume a smooth image mapping and
smooth motion [4] can obviously not be applied either. Using controlled-light stimuli
This work was partially supported by TYZX, Inc, by the Portuguese FCT POS_C program that
includes FEDER funds, and by the EU-project URUS FP6-EU-IST-045 062.
Fig. 1. A discrete camera consists of a number of photocells (pixels) that measure the light trav-
eling along a pencil of lines.
Moreover, these statistics are about planar images, which is a hindrance in our case:
first, we do not want to exclude the case of visual sensor elements that are separated
by more than 180 degrees, such as the increasingly popular omnidirectional cameras.
Also, the local statistical properties of perspective images depend on the orientation of
the image plane with respect to the scene, except in special constrained cases such as the
fronto-parallel “leaf world” of Wu et al. [14]. Defining images on the unit sphere thus
appears as a natural way to render image statistics independent of the sensor orientation,
at least with proper assumptions on the surrounding world and/or the motion of the
sensor.
The present article elaborates and improves over our previous work [10]. We in-
novate by showing that the correlation, like the information distance, can be used to
provide geometric information about a sensor. Also, we use a simpler method to model the relation between angles and signal statistics.
More importantly, we go well beyond [15] in showing that this model generalizes well to diverse visual environments, and can thus be considered a reliable characteristic of our visual world. In addition, we show that the presented calibration method performs much better, for example by allowing us to calibrate sensors that cover more than one
hemisphere.
The present work relies on statistical properties of the data streams produced by pairs
of sensor elements that depend only on the angular separation between the photocells.
For example, if the sampled lightfield is a homogeneous random field defined on the
sphere [16], then the covariance between observations depends only on the angular
separation between the sampled points.
This assumption does not hold in general in our anisotropic world, but it does hold,
e.g. if the orientation of the sensor is uniformly distributed amongst all unitary transfor-
mations of the sphere, that is, if the sensor is randomly oriented, so that each photocell
is just as likely to sample the light-field in any direction.
Perhaps more importantly, we are only interested in statistics whose expectation is a strictly monotonic function of the angular separation of the pair of photocells. That is, if x, y are two signals (random variables) generated by two photocells separated by an angle θ, and d(x, y) is the considered statistic, then the expectation of d(x, y) is a strictly monotonic function of θ, for 0 ≤ θ ≤ π. The importance of this last point is that this function can be inverted, resulting in a functional model that links the value of the statistic to the angle.
The statistic-to-angle graph of such statistics is the a priori knowledge about the
world that we leverage to estimate the geometry of discrete cameras. In the present
work, we use discrepancy measures based on the correlation or conditional entropy,
defined in Section 3. In Section 4, we show how to build the considered graph.
Having obtained angle estimates, we recover the sensor geometry, in Section 5.1, by
embedding the angles in a sphere. This is done using simple techniques from distance
geometry [17]. Experimental results are presented in Section 5.2. Finally, Section 6
presents some conclusions and possible directions for future research. The calibration
process considered in the present work is outlined in Figure 2. The statistic-to-angle
modeling produces the crucial functional relation used in the third-from right element
of Figure 2.
where W, H are the image width and height, K is the intrinsic parameter matrix, % represents the integer modulo operation and ⌊·⌋ is the lower-rounding (floor) operation. Cameras equipped with fisheye lenses, or having log-polar sensors, can also be modeled
again by setting Xi to represent the directions of the light-rays associated to the image
pixels. In the same vein, omnidirectional cameras having a single projection center,
as the ones represented by the unified projection model [18], also fit in the proposed
model. In this paper we use a calibrated omnidirectional camera to simulate various
discrete cameras.
Fig. 3. Left: The camera used to sample omnidirectional images (image mirrored). Right: A
calibrated omnidirectional image mapped to a sphere.
where C (x, y) is the correlation between the signals. It is easy to verify that dc (., .) is
a distance.
For the task considered in this paper, it is natural to prefer the correlation distance
over the variance or the (squared) Euclidean distance ‖x − y‖², because both vary with signal amplitude (and offset, for the latter), whereas dc(·, ·) is offset- and scale-invariant.
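For illustration, a correlation-based dissimilarity with the invariance properties discussed above can be computed as follows; the paper's exact definition of dc is not reproduced here, and the square-root normalization below is our own choice:

```python
# Hedged sketch of a correlation-based dissimilarity between two pixel signals.
import numpy as np

def correlation_distance(x, y):
    c = np.corrcoef(x, y)[0, 1]  # Pearson correlation: offset- and scale-invariant
    return float(np.sqrt(max(0.0, (1.0 - c) / 2.0)))
```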
Given two random variables x and y (in our case, the values produced by individual
pixels of a discrete camera) taking values in a discrete set {1, . . . , Q}, the information
distance between x and y is [9]:
where H (x, y) is the Shannon entropy of the paired random variable (x, y), and H (x)
and H (y) are the entropies of x and y, respectively. It is easy to show that Eq. (1) de-
fines a distance over random variables. This distance is bounded by H (x, y) ≤ log2 Q,
and is conveniently replaced thereafter by the normalized information distance :
This expression shows the slow convergence rate and strong bias of Ĥ(x). We somewhat alleviate these problems, first, by correcting for the leading bias term (Q − 1)/(2T), i.e., applying the Miller-Madow correction, and second, by re-quantizing the signal to a much smaller number of bins, Q = 4. Extensive benchmarking in [15] has shown these choices to be beneficial.
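Equation (1) and its normalized variant are not reproduced above; the sketch below uses the information metric of [9], d_I(x, y) = 2 H(x, y) − H(x) − H(y), normalized by the joint entropy H(x, y) (our reconstruction), together with the Miller-Madow correction and the Q = 4 re-quantization just described. The function names and the equal-population binning are ours:

```python
# Hedged sketch: entropy estimation and the (normalized) information distance from
# quantized pixel streams.
import numpy as np

def requantize(x, q=4):
    """Map raw intensities to q approximately equally populated bins."""
    edges = np.quantile(x, np.linspace(0, 1, q + 1)[1:-1])
    return np.digitize(x, edges)               # labels in {0, ..., q-1}

def entropy_mm(labels, n_states):
    """Plug-in entropy (bits) with the Miller-Madow bias correction."""
    t = len(labels)
    p = np.bincount(labels, minlength=n_states) / t
    p = p[p > 0]
    return -(p * np.log2(p)).sum() + (n_states - 1) / (2 * t * np.log(2))

def normalized_info_distance(x, y, q=4):
    xq, yq = requantize(x, q), requantize(y, q)
    hx, hy = entropy_mm(xq, q), entropy_mm(yq, q)
    hxy = entropy_mm(xq * q + yq, q * q)       # entropy of the paired variable (x, y)
    return (2 * hxy - hx - hy) / hxy           # information distance normalized by H(x, y)
```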
As explained earlier, our a-priori knowledge of the world will be encoded in a graph
mapping a measure of discrepancy between two signals, to the angular separation be-
tween the photocells that generated the signals. We now show how to build this graph,
and assess its effectiveness at estimating angles.
For this purpose, we use the 31-pixel planar discrete camera (or “probe”) shown in
Fig. 4, left. This probe design allows us to study the effect of angular separations ranging from 0.5 to 180 degrees, and each sample provides 465 = 31(31 − 1)/2 pixel pairs. In
Fig. 4. Left: Geometry of a discrete camera consisting of a planar array of thirty one (31) pixels,
spanning 180◦ in the plane. The first two pixels are separated by 0.5◦, and the separation between consecutive photocells increases geometrically (ratio ≈ 1.14), so that the 31st photocell is antipodal with respect to the first. Right: Two instances of the linear discrete camera, inserted in an omnidirectional image. Pixel locations are indicated by small crosses connected by white lines.
the “tighter” part of the discrete camera layout, there exists a slight linear dependence
between the values of consecutive pixels due to aliasing.
The camera is hand-held and undergoes “random” general rotation and translation, according to the author's whim, while remaining near the middle of the room, at 1.0 to 1.8 meters from the ground. We acquired three sequences consecutively, in very similar conditions, and joined them into a single sequence totaling 1359 images, i.e. approximately 5 minutes of video at ~4.5 frames per second.
To simulate the discrete camera, we randomly choose an orientation (i.e. half a great
circle) such that all pixels of the discrete camera fall in the field of view of the panoramic
camera. Figure 4 shows two such choices of orientations. For each choice of orientation,
we produce a sequence of 31 samples x (i, t), 1 ≤ i ≤ 31, 1 ≤ t ≤ 1359, where each
x (i, t) ∈ {0, . . . , 255}. Choosing 100 different orientations, we obtain 100 discrete
sensors and 100 arrays of data xn(i, t), 1 ≤ n ≤ 100. Appending these arrays, we obtain 31 signals x(i, t) of length 135900.
We then compute, for each pair of pixels (indices) 1 ≤ i, j ≤ 31, the correlation
and information distances, dc (i, j) and dI (i, j). Joining to these the known angular
separations θi,j , we obtain a set of pairs (θi,j , d (i, j)), 1 ≤ i, j ≤ 31.
From this dataset, we build a piecewise-constant model of the expectation of the distance given the angle. For the correlation distance, we limit the abscissa to values in [0, 1/2]. After verifying and, if needed, enforcing the monotonicity of this model, we invert it, obtaining a graph of angles as a function of (correlation or information) distances. Strict monotonicity has to be enforced for the correlation-based data, owing to the relatively small number of data points used for each quantized angle.
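A minimal sketch of this model-building and inversion step (binning by angle, averaging the distance, enforcing monotonicity, and inverting by interpolation), with our own variable names and under the assumption that every angle bin receives samples:

```python
# Hedged sketch: piecewise-constant distance-vs-angle model, inverted into a
# statistic-to-angle lookup.
import numpy as np

def build_angle_lookup(angles, dists, n_bins=60):
    edges = np.linspace(angles.min(), angles.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean_d = np.array([dists[(angles >= lo) & (angles < hi)].mean()
                       for lo, hi in zip(edges[:-1], edges[1:])])
    mean_d = np.maximum.accumulate(mean_d)       # enforce monotonicity
    mean_d += 1e-9 * np.arange(n_bins)           # make it strictly increasing for interpolation

    def angle_from_distance(d):
        return np.interp(d, mean_d, centers)     # inverted model: distance -> angle
    return angle_from_distance
```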
Figure 5 shows the resulting graphs. This figure shows one of the major issues that
appear when estimating the angular separation between pixels from the correlation or
information distance: the graphs become very steep for large values of the distance,
indicating that small changes of the distance result in large changes in the estimated an-
gle. On the other hand, for small distance values, the curves are much flatter, suggesting
Fig. 5. Models relating correlation (left) or information distance (right) to angular separation
between photocells. These models were built from simulated signals produced by the linear probe of Fig. 4, left. Signals of length T = 135900, acquired indoors, were used.
that small angles can be determined with greater accuracy. Both trends are particularly
true for the information distance.
Fig. 6. Precision and accuracy of angles estimated from correlation (left) or information distance
(right). The boxplots at the top show the 5th percentile, first quartile, median, third quartile and
95th percentile of the estimated angles, plotted against the true angles. The bottom curves show
the mean absolute error in the estimated angles. These statistics were generated from 100 planar
probes (Fig. 4, left) and signals of length T = 1359. The angles were estimated using the models
of Fig. 5. The signals were acquired in the same conditions as those used to build the models.
Fig. 7. Four images from a sequence of 2349 images acquired indoors and outdoors at approxi-
mately 4.5 FPS
Having seen the qualities and shortcomings of the proposed angle estimators, we now
show how to use them to calibrate a discrete camera.
To stress the generalization ability of the angle estimators, all the reconstructions
produced by the above method are obtained from the in- and outdoors sequence of
Fig. 7, rather than from the indoors sequence used to build the distance-to-angle models.
Fig. 8. Precision and accuracy of angles estimated in the same conditions as in Fig. 6, except that
signals extracted from an indoors-and-outdoors sequence (Fig. 7) were used. These figures show
that the models in Fig. 5 generalize fairly well to signals produced in conditions different from
those in which the models were built. In particular, the angles estimated from the correlation
distance are improved w.r.t. those of Fig. 6 (see text).
This problem can be reduced to the classical problem of distance geometry [17]:
Fig. 9. Precision and accuracy of angles estimated in the same conditions as in Fig. 8, except that
the planar probes are constrained to remain approximately horizontal. These figures show that
the models in Fig. 5 are usable even if the isotropy assumption of the moving entity is not valid.
the unit (r − 1)-dimensional sphere such that X_i^T X_j = C_ij for all i, j. This result directly suggests the following method for embedding points in the 2-sphere:
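The paper's own algorithm listing is not reproduced above; the following is a minimal sketch of one plausible instantiation, based on the eigendecomposition of the cosine matrix suggested by the classical result:

```python
# Hedged sketch: embed photocells on the unit 2-sphere from pairwise angle estimates.
import numpy as np

def embed_on_sphere(theta):
    """theta: k x k symmetric matrix of estimated angular separations (radians)."""
    C = np.cos(theta)                              # target Gram matrix X_i^T X_j
    w, V = np.linalg.eigh(C)                       # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:3]                  # keep the three largest components
    X = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # project each point back onto the sphere
    return X                                       # k x 3 unit vectors, up to a global rotation
```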
One should note that this very simple algorithm is not optimal in many ways. In par-
ticular, it does not take into account that the error in the angles θij is greater in some
cases than in others. It is easy to verify that the problem is not directly tractable by the variable-error factorization methods used in computer vision.
Noting that the error in the estimated angles is approximately proportional to the actual angle suggests an embedding method that weighs large angular estimates less heavily. One such method is Sammon's algorithm [20], which we adapt and modify for the purpose of spherical embedding from our noisy data. In this paper, we minimize the sum
\sum_{i,j} w_{i,j} \left( X_i^{\top} X_j - C_{ij} \right)^2, \quad \text{where } w_{ij} = \begin{cases} \max\!\left(0,\; \frac{1}{1 - C_{ij}} - \frac{1}{1 - C_0}\right) & \text{if } C_{ij} \neq 1, \\ \eta & \text{otherwise.} \end{cases}
Fig. 10. Calibrations of two different sensors covering more than one hemisphere. On the left,
a band-like sensor consisting of 85 photocells, calibrated from correlations (estimated: smaller,
true: bigger). On the right, a discrete camera covering more than 180×360◦ , of 168 photocells,
calibrated from the information distance (estimated: smaller, true: bigger). Each ball represents a
photocell except the big black balls, representing the optical center.
To reflect the fact that large angles are less well estimated, we set C0 = 0.9, so that estimates greater than acos(0.9) ≈ 25◦ are ignored. The other parameter, η, is set to 1, allowing the points Xi to stray a little away from the unit sphere. Our
implementation is inspired by the second-order iterative method of Cawley and
Talbot (http://theoval.sys.uea.ac.uk/~gcc/matlab/default.html). For
initialization, we use an adaptation of [21] to the spherical metric embedding problem,
which will be described in detail elsewhere.
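A sketch of this weighted refinement, using the weights of the formula above with C0 = 0.9 and η = 1; we use a generic unconstrained minimizer with a numerical gradient rather than the second-order implementation cited above:

```python
# Hedged sketch: Sammon-like weighted least-squares refinement of the spherical embedding.
import numpy as np
from scipy.optimize import minimize

def refine_embedding(X0, C, C0=0.9, eta=1.0):
    """X0: k x 3 initial embedding; C: k x k matrix of cos(estimated angles)."""
    k = len(C)
    with np.errstate(divide="ignore"):
        raw = np.maximum(0.0, 1.0 / (1.0 - C) - 1.0 / (1.0 - C0))
    W = np.where(np.isclose(C, 1.0), eta, raw)     # down-weight (ignore) large angles

    def cost(x):
        X = x.reshape(k, 3)
        return np.sum(W * (X @ X.T - C) ** 2)

    res = minimize(cost, X0.ravel(), method="L-BFGS-B")
    return res.x.reshape(k, 3)
```

The cited second-order method, or an analytic gradient, would be considerably faster; the sketch only illustrates the objective being minimized.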
We now evaluate the results of this embedding algorithm on data produced by the angle-
estimating method of Sec. 4. For this purpose, we produce sequences of pixel signals
in the same conditions as previously, using the outdoors and indoors sequence shown
in Figure 7, except that the sensor shape is different. The information and correlation
distances between pixels is then estimated from these signals, the angular separation
between the pixels is estimated using Sec. 4, and the embedding method of Sec. 5.1 is
applied to these angle estimates.
Figure 10 shows the results of our calibration method on sensors covering more than
a hemisphere, which thus cannot be embedded in a plane without significant distortion.
It should be noted that, although the true sensor is each time more than hemispheric,
the estimated calibration is in both cases smaller. This shrinkage is a known effect of
some embedding algorithms, which we could attempt to correct.
Figure 11 shows how our method applies to signals produced by a different sensor
from the one used to build the distance-to-angle models, namely an Olympus Stylus 300
camera. An 8-by-8 square grid of pixels spanning 34 degrees was sampled along a 22822-image sequence taken indoors and outdoors. From this sequence, the estimated angles
were generally greater than the true angles, which explains the absence of shrinkage.
The higher angle estimates were possibly due to higher texture contents of the sequence.
The estimated angles were also fairly noisy, possibly due to the sequence length, and
we surmise that longer sequences would yield better results.
Fig. 11. Reconstructed and true pixel layouts of a discrete camera consisting of photocells lying
on a rectangular grid. The sensor used differs from that with which the models of Fig. 5 were
built. The reconstructions are obtained by first estimating the pairwise angular distances, then
embedding the angles in the sphere (see text). For visualization, the reconstructions are aligned
by the usual procrustes method, mapped to the plane by projective mapping with unit focal length.
Added line segments show the true pixel neighborhood relations. The left plot is obtained from
the correlation distance, and the right from the information distance.
These results represent typical results that researchers reproducing our method may
encounter. Results from other experiments will be presented elsewhere.
6 Discussion
In this paper, we have shown that simple models exist that relate signal discrepancy
to angular separation, and are valid in indoors and outdoors scenes. This suggests the
existence of near-universal properties of our visual world, in line with other work show-
ing statistical properties of natural images. Contrary to previous work, we consider
statistics of the lightfield taken as a function defined on the sphere, rather than the plane,
a choice that allows us to consider fields of view greater than 180 degrees.
We addressed the problem of determining the geometry of a set of photocells in a
very general setting. We have confirmed that a discrete camera can be calibrated to a
large extent, using just two pieces of data: a table relating signal distances to angles;
and a long enough signal produced by the camera.
The presented results are both superior to and of a much wider scope than those of [15]: we have shown that it is not necessary to strictly enforce the assumption that the camera directs each pixel uniformly in all directions, nor that statistically similar environments be used to build the statistic-to-angle table and to calibrate the discrete
camera. This flexibility reinforces the impression that models such as those shown in
Figure 5 have a more general validity than the context of calibration.
We showed also that angle estimators based on correlation and information distance
(entropy) have different performance characteristics. It would be very interesting to
apply machine learning techniques to leverage the power of many such weak estimators.
Finally a more curious question is worth asking in the future: can the problem of
angle estimation be altogether bypassed in a geometrically meaningful calibration pro-
cedure? Embedding methods based on rank or connectivity [17,22], e.g. correlation or
information distance, suggest that this is possible.
References
1. Kohler, I.: Experiments with goggles. Scientific American 206, 62–72 (1962)
2. Tsai, R.: An efficient and accurate camera calibration technique for 3D machine vision. In:
IEEE Conf. on Computer Vision and Pattern Recognition (1986)
3. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge Uni-
versity Press, Cambridge (2000)
4. Nistér, D., Stewenius, H., Grossmann, E.: Non-parametric self-calibration. In: Proc. ICCV
(2005)
5. Ramalingam, S., Sturm, P., Lodha, S.: Towards complete generic camera calibration. In:
Proc. CVPR, vol. 1, pp. 1093–1098 (2005)
6. Pierce, D., Kuipers, B.: Map learning with uninterpreted sensors and effectors. Artificial
Intelligence Journal 92(169–229) (1997)
7. Krzanowski, W.J.: Principles of Multivariate Analysis: A User’s Perspective. Statistical Sci-
ence Series. Clarendon Press (1988)
8. Olsson, L., Nehaniv, C.L., Polani, D.: Sensory channel grouping and structure from uninter-
preted sensor data. In: NASA/DoD Conference on Evolvable Hardware (2004)
9. Crutchfield, J.P.: Information and its metric. In: Lam, L., Morris, H.C. (eds.) Nonlinear Struc-
tures in Physical Systems–Pattern Formation, Chaos and Waves, pp. 119–130. Springer, Hei-
delberg (1990)
10. Grossmann, E., Orabona, F., Gaspar, J.A.: Discrete camera calibration from the informa-
tion distance between pixel streams. In: Proc. Workshop on Omnidirectional Vision, Camera
Networks and Non-classical Cameras, OMNIVIS (2007)
11. Torralba, A., Oliva, A.: Statistics of natural image categories. Network: Computation in Neu-
ral Systems 14, 391–412 (2003)
12. Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. International
Journal of Computer Vision 40(1), 25–47 (2000)
13. Potetz, B., Lee, T.S.: Scaling laws in natural scenes and the inference of 3d shape. In: NIPS –
Advances in Neural Information Processing Systems, pp. 1089–1096. MIT Press, Cambridge
(2006)
14. Wu, Y.N., Zhu, S.C., Guo, C.E.: From information scaling of natural images to regimes of
statistical models. Technical Report 2004010111, Department of Statistics, UCLA (2004)
15. Grossmann, E., Gaspar, J.A., Orabona, F.: Discrete camera calibration from pixel streams.
In: Computer Vision and Image Understanding (submitted, 2008)
16. Roy, R.: Spectral analysis for a random process on the sphere. Annals of the institute of
statistical mathematics 28(1) (1976)
17. Dattorro, J.: Convex Optimization & Euclidean Distance Geometry. Meboo Publishing
(2005)
18. Geyer, C., Daniilidis, K.: A unifying theory for central panoramic systems and practical
applications. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 445–461. Springer,
Heidelberg (2000)
19. Schoenberg, I.J.: Remarks to Maurice Fréchet’s article “Sur la définition axiomatique d’une
classe d’espaces distanciés vectoriellement applicable sur l’espace de Hilbert”. Annals of
Mathematics 36(3), 724–732 (1935)
20. Sammon, J.W.J.: A nonlinear mapping for data structure analysis. IEEE Transactions on
Computers C-18, 401–409 (1969)
21. Lee, R.C.T., Slagle, J.R., Blum, H.: A triangulation method for the sequential mapping of
points from n-space to two-space. IEEE Trans. Computers 26(3), 288–292 (1977)
22. Shang, Y., Ruml, W., Zhang, Y., Fromherz, M.P.J.: Localization from mere connectivity. In:
MobiHoc 2003: Proc. ACM Intl. Symp. on Mobile Ad Hoc Networking & Computing, pp.
201–212. ACM Press, New York (2003)
Regular Texture Analysis as Statistical Model Selection
J. Han, S.J. McKenna, and R. Wang
1 Introduction
select a list of dominant peaks. The translation vectors were estimated based on
these dominant peaks. However, the important problem of how to determine the
number of dominant peaks was not addressed. Whilst it is usually relatively easy
for a human to select an appropriate subset of peaks, automating this process is
difficult. Fig. 1 shows three different texels obtained similarly to Lin et al. [12]
from the same image by using different numbers of peaks. The peaks were ob-
tained using the region of dominance method [1]. Whilst using only the first ten
peaks can result in success, the method is rather sensitive to this choice.
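For illustration, the autocorrelation function on which such peak selection operates can be computed with the FFT; the region-of-dominance peak picking of [1] is not reproduced here, and the function name is ours:

```python
# Hedged sketch: normalized autocorrelation of a grayscale texture via the FFT.
import numpy as np

def autocorrelation(img):
    f = np.fft.fft2(img - img.mean())
    ac = np.fft.ifft2(f * np.conj(f)).real
    return np.fft.fftshift(ac / ac.flat[0])     # normalized, zero lag moved to the centre
```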
Fig. 1. Texels obtained using (a) ten, (b) forty, and (c) seventy dominant peaks in the
autocorrelation function. The peak locations are marked with white dots.
Fig. 2. Two examples of a local feature-based method [5] extracting incorrect lattices
for which a fixed value that works on a wide range of images can often not be
found. Methods based on finding peaks in an AC function often yield many unreliable peaks, and the number of reliable peaks can vary dramatically between
images. This serious drawback currently makes these methods difficult to apply
to large image collections.
1.2 Contributions
most probable texel hypothesis given the image. According to Bayes’ theorem,
the posterior probability is proportional to the likelihood of the hypothesis times
a prior:
p(H_k | I) = \frac{p(I | H_k)\, p(H_k)}{p(I)} \propto p(I | H_k)\, p(H_k)   (1)
In the absence of prior knowledge favouring any of the texel hypotheses, the
(improper) prior is taken to be uniform. For each Hk , we define a unique Mk
deterministically so p(Mk |Hk ) is a delta function. Hence,
p(H_k | I) \propto p(I | M_k) = \int p(I | \theta_k, M_k)\, p(\theta_k | M_k)\, d\theta_k   (2)
The first term can be interpreted as an error of fit to the data while the second
term penalises model complexity.
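For reference, the standard BIC (Laplace) approximation underlying this comparison can be written as (our notation, with d_k the number of free parameters of M_k and N the number of pixels):

\log p(I \mid M_k) \;\approx\; \log p(I \mid \hat{\theta}_k, M_k) \;-\; \frac{d_k}{2} \log N .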
The proposed approach to regular texture analysis involves (i) generation of
multiple texel hypotheses, and (ii) comparison of hypotheses based on statistical
models. The hypothesis with the model that has the largest marginal likelihood
is selected. Using the BIC approximation, hypothesis Hk̂ is selected where,
where MR refers to the model corresponding to the reference lattice and Mk̂ is
the best lattice hypothesis selected by Equation (5).
3 Lattice Models
The lattice model should be able to account for both regularity from periodic
arrangement and statistical photometric and geometric variability. Let us first consider a regular texture image I with N pixels x1, x2, . . . , xN, and a hypothesis
H with Q pixels per texel. Based on H, each pixel of the image is assigned to
one of Q positions on the texel according to the lattice structure. Thus, the N
pixels are partitioned into Q disjoint sets, or clusters. If we choose to assume
that the N pixels are independent given the model, we have,
p(I | M) = \prod_{n=1}^{N} p(x_n | M) = \prod_{q=1}^{Q} \prod_{n : f(n,H)=q} p(x_n | M)   (7)
where f(n, H) ∈ {1, . . . , Q} maps n to its corresponding index in the texel. Fig. 3 illustrates this assignment of pixels to clusters.
\mathrm{BIC}(M) = (Q/2)\log N - \sum_{q=1}^{Q} \sum_{n : f(n,H)=q} \log p(x_n | \hat{\mu}_q, \sigma^2)   (8)
            = (Q/2)\log N + C_1 + \frac{1}{2\sigma^2} \sum_{q=1}^{Q} \sum_{n : f(n,H)=q} (x_n - \hat{\mu}_q)^2   (9)
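A minimal sketch of evaluating this score for one lattice hypothesis, assuming a label array that encodes the mapping f(n, H) (names are ours):

```python
# Hedged sketch: Gaussian-cluster BIC score of Eqs. (8)-(9), up to the constant C1.
import numpy as np

def bic_gaussian(pixels, labels, n_clusters, sigma2):
    """pixels, labels: flat arrays of equal length; labels[n] = f(n, H) - 1."""
    score = 0.5 * n_clusters * np.log(pixels.size)      # (Q/2) log N complexity penalty
    for q in range(n_clusters):
        cluster = pixels[labels == q]
        if cluster.size:
            mu = cluster.mean()                          # ML estimate of the cluster mean
            score += 0.5 * np.sum((cluster - mu) ** 2) / sigma2
    return score
```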
\mathrm{BIC}(M) = -\sum_{q=1}^{Q} \sum_{n : f(n,H)=q} \log p(x_n | \hat{\mu}_q, \sigma_1^2, \sigma_2^2, \pi_1) + (Q/2)\log N   (10)
where C2 is a constant.
5 Experiments
A dataset of 103 regular texture images was used for evaluation, comprising 68
images of printed textiles from a commercial archive and 35 images taken from
three public domain databases (the Wikipedia Wallpaper Groups page, a Corel
database, and the CMU near regular texture database). These images ranged in
size from 352 × 302 pixels to 2648 × 1372 pixels. The number of texel repeats per image ranged from 5 to a few hundred. This data set includes images that
are challenging because of (i) appearance variations among texels, (ii) small
geometric deformations, (iii) texels that are not distinctive from the background
and are large non-homogeneous regions, (iv) occluding labels, and (v) stains,
wear and tear in some of the textile images.
Systematic evaluations of lattice extraction are lacking in the literature. We
compared the proposed method with two previously published algorithms. Two
volunteers (one male and one female) qualitatively scored and rank ordered
the algorithms. In cases of disagreement, they were forced to reach agreement
through discussion. (Disagreement happened in very few cases).
When the proposed method used Gaussians to model clusters, the only free
parameter was the variance, σ 2 . A suitable value for σ 2 was estimated from a
set of 20 images as follows. Many texel hypotheses were automatically generated
using different numbers of AC peaks and a user then selected from them the
best translation vectors, t1 , t2 . Pixels were allocated to clusters according to
shared the same rank if they yielded equally good results. For example, if three
of the algorithms gave good lattices of equal quality and the fourth algorithm
gave a poor lattice then three algorithms shared rank 1 and the other algorithm
was assigned rank 4. Table 2 summarizes the rankings. For the Gaussian model, we set σ2 = 264, which yielded the worst accuracy of the variance values tried.
For the algorithm of Liu et al. [1], we set the number of dominant peaks to
40, which achieved the best performance of the values tried. Even with these
parameter settings which disadvantage the proposed method, Table 2 shows
that it is superior to the other algorithms.
The method was also used to classify texture images as regular or irregular as
described in Equation (6). A set of 62 images was selected randomly from a mu-
seum fine art database and from the same commercial textile archive as used ear-
lier. Figure 5 shows some examples of these images. A classification experiment
[Fig. 6 plots false negative rate against false positive rate.]
Fig. 6. Classification of texture as regular or irregular. The curve was plotted by vary-
ing the value of σ 2 and characterises the trade-off between the two types of error.
was performed using these images as negative examples and the 103 regular tex-
ture images as positive examples. Figure 6 shows the ROC curve obtained by vary-
ing the value of σ 2 in the Gaussian model (σ 2 ∈ {49, 64, 81, 100, 144}). The equal
error rate was approximately 0.22.
The computational speed depends on the number of lattice hypotheses (and
many different subsets of peaks lead to the same lattice hypothesis). A Matlab
implementation typically takes a few minutes per image on a 2.4 GHz PC with 3.5 GB of RAM, which is adequate for off-line processing.
Fig. 7. Comparisons of two thumbnail generation methods. In each set, the first image
is the original image, the second image is the thumbnail generated by our method, and
the third image is the thumbnail generated by the standard method.
the resolution. Thumbnails extracted using knowledge of the texels can convey
more detailed information about the pattern design.
7 Conclusions
A fully automatic lattice extraction method for regular texture images has been
proposed using a framework of statistical model selection. Texel hypotheses were
generated based on finding peaks in the AC function of the image. BIC was
adopted to compare various hypotheses and to select a ‘best’ lattice. The exper-
iments and comparisons with previous work have demonstrated the promise of
the approach. Various extensions to this work would be interesting to investigate
in future work. Alternative methods for generating hypotheses could be explored
in the context of this approach. Further work is needed to explore the relative
merits of non-Gaussian models. This should enable better performance on im-
ages of damaged textiles, for example. BIC can give poor approximations to the
marginal likelihood and it would be worth exploring alternative approximations
based on sampling methods, for example. Finally, it should be possible in princi-
ple to extend the approach to analysis of near-regular textures on deformed 3D
surfaces by allowing relative deformation between texels. This could be formu-
lated as a Markov random field over texels, for example. Indeed, Markov random
field models have recently been applied to regular texture tracking [6].
Acknowledgments. The authors thank J. Hays for providing his source code,
and Chengjin Du and Wei Jia for helping to evaluate the algorithm. This re-
search was supported by the UK Technology Strategy Board grant “FABRIC:
Fashion and Apparel Browsing for Inspirational Content” in collaboration with
Liberty Fabrics Ltd., System Simulation Ltd. and Calico Jack Ltd. The Technol-
ogy Strategy Board is a business-led executive non-departmental public body,
established by the government. Its mission is to promote and support research
into, and development and exploitation of, technology and innovation for the
benefit of UK business, in order to increase economic growth and improve the
quality of life. It is sponsored by the Department for Innovation, Universities
and Skills (DIUS). Please visit www.innovateuk.org for further information.
References
1. Liu, Y., Collins, R.T., Tsin, Y.: A computational model for periodic pattern percep-
tion based on frieze and wallpaper groups. IEEE Transactions on Pattern Analysis
and Machine Intelligence 26, 354–371 (2004)
2. Leung, T., Malik, J.: Recognizing surfaces using three-dimensional textons. In:
IEEE International Conference on Computer Vision, Corfu, Greece, pp. 1010–1017
(1999)
3. Liu, Y., Tsing, Y., Lin, W.: The promise and perils of near-regular texture. Inter-
national Journal of Computer Vision 62, 145–159 (2005)
4. Malik, J., Belongie, S., Shi, J., Leung, T.: Textons, contours and regions: cue
integration in image segmentation. In: IEEE International Conference of Computer
Vision, Corfu, Greece, pp. 918–925 (1999)
5. Hays, J., Leordeanu, M., Efros, A., Liu, Y.: Discovering texture regularity as a
higher-order correspondence problem. In: European Conference on Computer Vi-
sion, Graz, Austria, pp. 533–535 (2006)
6. Lin, W., Liu, Y.: A lattice-based MRF model for dynamic near-regular texture
tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29,
777–792 (2007)
7. Leung, T., Malik, J.: Detecting, localizing and grouping repeated scene elements
from an image. In: European Conference on Computer Vision, Cambridge, UK,
pp. 546–555 (1996)
8. Tuytelaars, T., Turina, A., Gool, L.: Noncombinational detection of regular repeti-
tions under perspective skew. IEEE Transactions on Pattern Analysis and Machine
Intelligence 25, 418–432 (2003)
9. Schaffalitzky, F., Zisserman, A.: Geometric grouping of repeated elements within
images. In: Shape, Contour and Grouping in Computer Vision. Lecture Notes In
Computer Science, pp. 165–181. Springer, Heidelberg (1999)
10. Forsyth, D.A.: Shape from texture without boundaries. In: European Conference on
Computer Vision, Copenhagen, Denmark, pp. 225–239 (2002)
11. Lobay, A., Forsyth, D.A.: Recovering shape and irradiance maps from rich dense
texton fields. In: Computer Vision and Pattern Recognition, Washington, USA,
pp. 400–406 (2004)
12. Lin, H., Wang, L., Yang, S.: Extracting periodicity of a regular texture based on
autocorrelation functions. Pattern Recognition Letters 18, 433–443 (1997)
13. Chetverikov, D.: Pattern regularity as a visual key. Image and Vision Comput-
ing 18, 975–985 (2000)
14. Leu, J.: On indexing the periodicity of image textures. Image and Vision Comput-
ing 19, 987–1000 (2001)
15. Charalampidis, D.: Texture synthesis: Textons revisited. IEEE Transactions on
Image Processing 15, 777–787 (2006)
16. Starovoitov, V., Jeong, S.Y., Park, R.: Texture periodicity detection: features, prop-
erties, and comparisons. IEEE Transactions on Systems, Man, and Cybernetics-
A 28, 839–849 (1998)
17. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461–
464 (1978)
18. Raftery, A.E.: Bayesian model selection in social research. Sociological Methodol-
ogy 25, 111–163 (1995)
19. Suh, B., Ling, H., Bederson, B.B., Jacobs, D.W.: Automatic thumbnail cropping
and its effectiveness. In: ACM Symposium on User Interface Software and Tech-
nology, pp. 95–104 (2003)
Higher Dimensional Affine Registration
and Vision Applications
Yu-Tseh Chi1 , S.M. Nejhum Shahed1 , Jeffrey Ho1 , and Ming-Hsuan Yang2
1
CISE Department, University of Florida, Gainesville, 32607
{ychi,smshahed,jho}@cise.ufl.edu
2
EECS, University of California, Merced, CA 95344
mhyang@ucmerced.edu
Abstract. Affine registration has a long and venerable history in computer vi-
sion literature, and extensive work has been done on affine registrations in IR2
and IR3 . In this paper, we study affine registrations in IRm for m > 3, and to
justify breaking this dimension barrier, we show two interesting types of match-
ing problems that can be formulated and solved as affine registration problems
in dimensions higher than three: stereo correspondence under motion and image
set matching. More specifically, for an object undergoing non-rigid motion that
can be linearly modelled using a small number of shape basis vectors, the stereo
correspondence problem can be solved by affine registering points in IR3n . And
given two collections of images related by an unknown linear transformation of
the image space, the correspondences between images in the two collections can
be recovered by solving an affine registration problem in IRm , where m is the
dimension of a PCA subspace. The algorithm proposed in this paper estimates
the affine transformation between two point sets in IRm . It does not require con-
tinuous optimization, and our analysis shows that, in the absence of data noise,
the algorithm will recover the exact affine transformation for almost all point sets
with a worst-case time complexity of O(mk^2), where k is the size of the point set. We
validate the proposed algorithm on a variety of synthetic point sets in different
dimensions with varying degrees of deformation and noise, and we also show
experimentally that the two types of matching problems can indeed be solved
satisfactorily using the proposed affine registration algorithm.
1 Introduction
Matching points, particularly in low-dimensional settings such as 2D and 3D, has been
a classical problem in computer vision. The problem can be formulated in a variety of
ways depending on the allowable and desired deformations. For instance, the orthogonal
and affine cases were studied a while ago, e.g., [1][2], and recent research
activities have been focused on non-rigid deformations, particularly those that can be
locally modelled by a family of well-known basis functions such as splines, e.g., [3]. In
this paper, we study the more classical problem of matching point sets1 related by affine
transformations. The novel viewpoint taken here is the emphasis on affine registrations
1
In this paper, the two point sets are assumed to have the same size.
in IRm for m > 3, and it differs substantially from the past literature on this subject,
which has been overwhelmingly devoted to registration problems in IR2 and IR3 .
To justify breaking this dimension barrier, we will demonstrate that two important
and interesting types of matching problems can be formulated and solved as affine reg-
istration problems in IRm with m > 3: stereo correspondence under motion and image
set matching (See Figure 1). In the stereo correspondence problem, two video cameras
are observing an object undergoing some motion (rigid or non-rigid), and a set of k
points on the object are tracked consistently in each view. The problem is to match
the tracking results across two views so that the k feature points can be located and
identified correctly. In the image set matching problem, two collections of images are
given such that the unknown transformation between corresponding pairs of images
can be approximated by some linear transformation F : IRm → IRm between two
(high-dimensional) image spaces. The task is to compute the correspondences directly
from the images. Both problems admit quick solutions. For example, for stereo corre-
spondence under motion, one quick solution would be to select a pair of corresponding
frames and compute the correspondences directly between these two frames. This ap-
proach is clearly unsatisfactory, since there is no way to know a priori which pair of frames is optimal for computing the correspondences. Furthermore, if the baseline between the cameras is large, direct stereo matching using image features does not always produce good results, even when very precise tracking results are available. Therefore, there is a need
tracking results simultaneously instead of just a pair of frames.
Fig. 1. Left: Stereo Correspondence under Motion. A talking head is observed by two (affine)
cameras. Feature points are tracked separately on each camera and the problem is to compute
the correspondences between observed feature points across views. Center and Right: Image
Set Matching. Two collections (432 images each) of images are given. Each image on the right
is obtained by rotating and down-sizing an image on the left. The problem is to recover the
correspondences. These two problems can be formulated as affine registration problems in IRm
with m > 3.
An important point to realize is that in each problem there are two linear subspaces
that parameterize the input data. For nonrigid motions that can be modelled using linear
shape basis vectors, this follows immediately from the work of [4][5]. For image set
matching, each set of images can usually be approximated by a linear subspace with
dimension that is considerably smaller than that of the ambient image space. We will
show that the correspondences can be computed (or be approximated) by affine regis-
tering point sets in these two linear subspaces. Therefore, instead of using quantities
derived from image intensities, our solution to these two matching problems is to first
formulate them as affine point set matching problems in IRm , with m > 3, and solve
the resulting affine registration problems.
Let P = {p1 , · · · , pk } and Q = {q1 , · · · , qk } denote two point sets in IRm with equal
number of points. The affine registration problem is typically formulated as an opti-
mization problem of finding an affine transformation A and a correspondence map π be-
tween points in P, Q such that the following registration error function is
minimized
E(A, \pi) = \sum_{i=1}^{k} d^2(A p_i, q_{\pi(i)}),   (1)
where d(Api , qπ(i) ) denotes the usual L2 -distance between Api and qπ(i) . The vener-
able iterative closest point (ICP) algorithm [6][7] can be easily generalized to handle
high-dimensional point sets, and it gives an algorithm that iteratively solves for corre-
spondences and affine transformation. However, the main challenge is to produce good
initial correspondences and affine transformation that will guarantee the algorithm’s
convergence and the quality of the solution. For dimensions two and three, this is al-
ready a major problem and the difficulty increases exponentially with dimension. In this
paper, we propose an algorithm that can estimate the affine transformation (and hence
the correspondences π) directly from the point sets P, Q. The algorithm is algebraic in
nature and does not require any optimization, which is its main strength. Furthermore, it
allows for a very precise analysis showing that for generic point sets and in the absence
of noise, it will recover the exact affine transformation and the correspondences. For
noisy data, the algorithm’s output can serve as a good initialization for the affine-ICP
algorithm. While the algorithm is indeed quite straightforward, to the best of our knowledge no published algorithm is similar to ours in its entirety. In this paper, we will provide experimental results that validate the proposed
affine registration algorithm and show that both the stereo correspondence problem un-
der motion and image set matching problem can be solved quite satisfactorily using the
proposed affine registration algorithm.
each camera, the tracker provides the correspondences (x_{ij}^t, y_{ij}^t) ↔ (x_{ij}^{t'}, y_{ij}^{t'}) across different frames t and t'. Our problem is to compute correspondences across the two views, so that the corresponding points (x_{1j}^t, y_{1j}^t) ↔ (x_{2j}^t, y_{2j}^t) are the projections of the scene point X_j in the images. We show next that it is possible to compute the correspondences directly, using only the high-dimensional geometry of the point sets (x_{ij}^t, y_{ij}^t), without referencing image features such as intensities.
For each view, we can stack the image coordinates of one tracked point over $T$ frames vertically into a $2T$-dimensional vector:
$$p_j = \big(x_{1j}^1\ y_{1j}^1\ \cdots\ x_{1j}^T\ y_{1j}^T\big)^t, \qquad q_j = \big(x_{2j}^1\ y_{2j}^1\ \cdots\ x_{2j}^T\ y_{2j}^T\big)^t \qquad (2)$$
In motion segmentation (e.g., [8]), the main objects of interest are the 4-dimensional
subspaces Lp , Lq spanned by these 2T -dimensional vectors
P = {p1 , · · · , pk }, Q = {q1 , · · · , qk },
and the goal is to cluster motions by determining the subspaces Lp , Lq given the set of
vectors P ∪ Q. Our problem, on the other hand, is to determine the correspondences between
points in P and Q. It is straightforward to show that there exists an affine transformation
L : Lp → Lq that produces the correct correspondences, i.e., L(pi ) = qi for all i. To
see this, we fix an arbitrary world frame with respect to which we can write down the
camera matrices for C1 and C2 . In addition, we also fix an object coordinates system
with orthonormal basis {i, j, k} centered at some point o ∈ O. Since O is undergoing
a rigid motion, we denote by ot , it , jt , kt , the world coordinates of o, i, j, k at frame t.
The point Xj , at frame t, with respect to the fixed world frame is given by
$$X_j^t = o^t + \alpha_j i^t + \beta_j j^t + \gamma_j k^t, \qquad (3)$$
for some real coefficients αj , βj , γj that are independent of time t. The corresponding
image point is then given as
$$\big(x_{ij}^t,\ y_{ij}^t\big)^t = \tilde{o}_i^t + \alpha_j \tilde{i}_i^t + \beta_j \tilde{j}_i^t + \gamma_j \tilde{k}_i^t,$$
where õit , ĩit , j̃it , k̃it are the projections of the vectors ot , it , jt , kt onto camera i. In par-
ticular, if we define the 2T -dimensional vectors Oi , Ii , Ji , Ki by stacking the vectors
õit , ĩit , j̃it , k̃it vertically as before, we have immediately,
$$p_j = O_1 + \alpha_j I_1 + \beta_j J_1 + \gamma_j K_1, \qquad q_j = O_2 + \alpha_j I_2 + \beta_j J_2 + \gamma_j K_2. \qquad (4)$$
The two linear subspaces Lp , Lq are spanned by the basis vectors {O1 , I1 , J1 , K1 },
{O2 , I2 , J2 , K2 }, respectively. The linear map that produces the correct correspon-
dences is given by the linear map L such that L(O1 ) = O2 , L(I1 ) = I2 , L(J1 ) = J2
and L(K1 ) = K2 . A further reduction is possible by noticing that the vectors pj , qj be-
long to two three-dimensional affine linear subspaces Lp , Lq in IR2T , affine subspaces
that pass through the points O1 , O2 with bases {I1 , J1 , K1 } and {I2 , J2 , K2 }, respec-
tively. These two subspaces can be obtained by computing the principal components
for the collections of vectors P, Q. By projecting points in P, Q onto Lp , Lq , respec-
tively, it is clear that the two sets of projected points are now related by an affine map
A : Lp → Lq . In other words, the correspondence problem can now be solved by
solving the equivalent affine registration problem for these two sets of projected points
(in IR3 ).
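To make this reduction concrete, the following sketch (our own; the array layout with frames along the first axis is an assumption) builds the 2T-dimensional trajectory vectors of Eq. (2) and projects them onto the 3-dimensional principal subspace, after which the registration can be carried out in IR^3, e.g. with the affine-ICP sketch above.

```python
import numpy as np

def trajectory_vectors(x, y):
    """x, y: (T, k) image coordinates of k tracked points over T frames for
    one camera.  Row j of the result is (x_j^1, y_j^1, ..., x_j^T, y_j^T),
    i.e. the 2T-dimensional vector of Eq. (2)."""
    T, k = x.shape
    return np.stack([x, y], axis=1).reshape(2 * T, k).T

def project_to_principal_subspace(V, dim=3):
    """Project the 2T-dimensional vectors onto the dim-dimensional affine
    subspace through their centroid, computed by PCA as described above."""
    centered = V - V.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:dim].T      # coordinates in L_p (or L_q)
```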
Non-Rigid Motions. The above discussion generalizes immediately to the types of non-
rigid motions that can be modelled (or approximated) using linear shape basis [2,5,9].
In this model, for k feature points, a shape basis element Bl is a 3 × k matrix. For a
model that employs m linear shape basis elements, the 3D world coordinates of the k
feature points at tth frame can be written as a linear combination of these shape basis
elements:
$$\big[X_1^t\ \cdots\ X_k^t\big] = \sum_{l=1}^{m} a_l^t B_l, \qquad (5)$$
for some real numbers $a_l^t$. Using an affine camera model, the imaged points (disregarding the global translation) are given by the following equation [9]:
$$\big[x_1^t\ \cdots\ x_k^t\big] = (a^t \otimes P)B, \qquad (6)$$
where $a^t = (a_1^t, \cdots, a_m^t)$, $P$ is the first $2 \times 3$ block of the affine camera matrix and $B$ is
the 3m × k matrix formed by vertically stacking the shape basis matrices Bl . The right
factor in the above factorization is independent of the camera (and the images), and we
have the following equations similar to Equations 4:
$$p_j = O_1 + \sum_{l=1}^{m}\big(\alpha_{jl} I_{1l} + \beta_{jl} J_{1l} + \gamma_{jl} K_{1l}\big), \qquad q_j = O_2 + \sum_{l=1}^{m}\big(\alpha_{jl} I_{2l} + \beta_{jl} J_{2l} + \gamma_{jl} K_{2l}\big), \qquad (7)$$
where $I_{il}$, $J_{il}$, $K_{il}$ are the projections of the three basis vectors in the $l$th shape basis
element Bl onto camera i. The numbers αjl , βjl and γjl are in fact entries in the matrix
Bl . These two equations then imply, using the same argument as before, that we can re-
cover the correspondences directly using a 3m-dimensional affine registration provided
that the vectors Oi , Iil , Jil , Kil are linearly independent for each i, which is typically
the case when the number of frames is sufficiently large.
In the image set matching problem, we are given two sets of images $P = \{I_1, \cdots, I_k\} \subset \mathbb{R}^m$, $Q = \{I'_1, \cdots, I'_k\} \subset \mathbb{R}^{m'}$, and the corresponding pairs of images $I_i, I'_i$ are related by a linear transformation $F : \mathbb{R}^m \rightarrow \mathbb{R}^{m'}$ between two high-dimensional image spaces:
$$I'_k \approx F(I_k).$$
Examples of such sets of images are quite easy to come by, and Figure 1 gives an
example in which $I'_i$ is obtained by rotating and downsizing $I_i$. It is easy to see that
many standard image processing operations such as image rotation and down-sampling
can be modelled as (or approximated by) a linear map F between two image spaces.
The problem here is to recover the correspondences Ii ↔ Ii without actually computing
the linear transformation F , which will be prohibitively expensive since the dimensions
of the image spaces are usually very high.
Many interesting sets of images can in fact be approximated well by low-dimensional
linear subspaces in the image space. Typically, such linear subspaces can be computed
readily using principal component analysis (PCA). Let Lp , Lq denote two such low-
dimensional linear subspaces approximating P, Q, respectively and we will use the
same notation P, Q to denote their projections onto the subspaces $L_p$, $L_q$. A natural question to ask is: how are the (projected) point sets P, Q related? Suppose that F is orthogonal and $L_p$, $L_q$ are the principal subspaces of the same dimension. If the data is “noiseless”, i.e., $I'_k = F(I_k)$, it is easy to show that P, Q are then related by an orthogonal transformation. In general, when F is not orthogonal and the data points are noisy, the point sets P, Q are related by a transformation $T = A + r$, which is a sum
of an affine transformation A and a nonrigid transformation r. If the nonrigid part is
small, we can recover the correspondences by affine registering the two point sets P, Q.
Note that this gives an algorithm for computing the correspondences without explicitly
using the image contents, i.e., there is no feature extraction. Instead, it works directly
with the geometry of the point sets.
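A sketch of this feature-free pipeline is given below (our own illustration; each image is assumed to be vectorized into a row, the 8-dimensional subspace follows the experiments reported later, and `affine_icp` refers to the earlier sketch).

```python
import numpy as np

def pca_coords(images, dim=8):
    """images: (k, n_pixels) array, one vectorized image per row.  Returns
    the (k, dim) coordinates in the PCA subspace approximating the set."""
    centered = images - images.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:dim].T

# P = pca_coords(set_A); Q = pca_coords(set_B)
# A, pi = affine_icp(P, Q)     # image set_A[i]  <->  image set_B[pi[i]]
```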
$$E(A, t, \pi) = \sum_{i=1}^{k} d^2(Ap_i + t, q_{\pi(i)}). \qquad (9)$$
Solving A, t separately while holding π fixed, the above registration error function
gives a quadratic programming problem in the entries of A, and the optimal solution
can be computed readily by solving a linear system. With a fixed A, t can be solved
immediately. On the other hand, given an affine transformation, a new assignment π can be
defined using closest points:
² We will call this algorithm affine-ICP.
to finish off the problem by exploiting invariants of the orthogonal matrices, namely,
distances. Let SP and SQ denote the covariance matrices for P and Q, respectively:
$$S_P = \sum_{i=1}^{k} p_i p_i^t, \qquad S_Q = \sum_{i=1}^{k} q_i q_i^t.$$
We will use the same notations to denote the transformed points and point sets. If the original point sets are related by A, the transformed point sets are then related by $\bar{A} = S_Q^{-\frac{1}{2}} A S_P^{\frac{1}{2}}$. The matrix $\bar{A}$ can be easily shown to be orthogonal:
Proposition 1. Let P and Q denote two point sets (of size k) in IRm , and they are
related by an unknown linear transformation A. Then, the transformed point sets (using
Equation 10) are related by a matrix Ā, whose rows are orthonormal vectors in IRm .
The proof follows easily from the facts that 1) the covariance matrices SP and SQ
are now identity matrices for the transformed point sets, and 2) SQ = ĀSP Āt . They
together imply that the rows of Ā must be orthonormal.
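The whitening behind Proposition 1 amounts to a few lines of code; the sketch below (our own) forms the second-moment matrices exactly as in the displayed expressions for S_P and S_Q and assumes generic, full-rank point sets.

```python
import numpy as np

def whiten(P):
    """Transform the points so that S_P = sum_i p_i p_i^t becomes the
    identity; once both sets are whitened, the unknown map relating them
    has orthonormal rows (Proposition 1)."""
    S = P.T @ P                                     # (m, m) second moments
    evals, evecs = np.linalg.eigh(S)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T    # S_P^{-1/2}
    return P @ W, W                                 # whitened points, S^{-1/2}
```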
of −1.
There are many ways to construct the matrices LP and LQ . Let f (x) be any function.
We can construct a k × k symmetric matrix LP (f ) from pairwise distances using the
formula
$$L_P(f) = I_k - \mu \begin{pmatrix} f(d(p_1, p_1)) & \cdots & f(d(p_1, p_k)) \\ \vdots & \ddots & \vdots \\ f(d(p_k, p_1)) & \cdots & f(d(p_k, p_k)) \end{pmatrix}, \qquad (11)$$
where Ik is the identity matrix and μ some real constant. One common choice of f
that we will use here is the Gaussian exponential $f(x) = \exp(-x^2/\sigma^2)$, and the result-
ing symmetric matrix LP is related to the well-known (unnormalized) discrete Lapla-
cian associated with the point set P [15]. Denote by $U_p D_p U_p^t = L_P$ and $U_q D_q U_q^t = L_Q$ the eigen-decompositions of $L_P$ and $L_Q$. When the eigenvalues are all distinct, up to
sign differences, Up and Uq differ only by some unknown row permutation if we order
the columns according to the eigenvalues. This unknown row permutation is exactly
the desired correspondence π. In particular, we can determine m correspondences by
matching m rows of Up and Uq , and from these m correspondences, we can recover the
orthogonal transformation $\bar{A}$. The complexity of this operation is $O(mk^2)$ and we have
the following result:
Proposition 2. For a generic pair of point sets P, Q with equal number of points in
IRm related by some orthogonal transformation L and correspondences π such that
qπ(i) = Lpi , the above method will recover L and π exactly for some choice of σ.
The proof (omitted here) is an application of Sard’s theorem and transversality in dif-
ferential topology [16]. The main idea is to show that for almost all point sets P, the
symmetric matrix LP will not have repeating eigenvalues for some σ. This will guaran-
tee that the row-matching procedure described above will find the m needed correspon-
dences after examining all rows of Up m times. Since the time complexity for matching
one row is O(k), the total time complexity is no worse than O(mk 2 ).
in IRm . In stereo matching and homography estimation [2], they are computed using
image features such as image gradients and intensity values.
More precisely, let $\lambda_1^p < \lambda_2^p < \cdots < \lambda_l^p$ ($l \leq k$) be $l$ non-repeating eigenvalues of $L_P$ and likewise, $\lambda_1^q < \lambda_2^q < \cdots < \lambda_l^q$ the $l$ eigenvalues of $L_Q$ such that $|\lambda_i^q - \lambda_i^p| < \epsilon$ for some threshold value $\epsilon$. Let $v_{\lambda_1}^P, v_{\lambda_2}^P, \cdots, v_{\lambda_l}^P$ and $v_{\lambda_1}^Q, v_{\lambda_2}^Q, \cdots, v_{\lambda_l}^Q$ denote the
corresponding eigenvectors. We stack these eigenvectors horizontally to form two k × l
matrices VP and VQ:
Denote the $(i, j)$-entry of $V_P$ (and also $V_Q$) by $V_P(i, j)$. We define the matching measure $M$ as
$$M(p_i, q_j) = \sum_{h=1}^{l} \min\big\{(V_P(i, h) - V_Q(j, h))^2,\ (V_P(i, h) + V_Q(j, h))^2\big\}.$$
Note that if l = k, M is comparing the ith row of LP with j th row of LQ . For ef-
ficiency, one does not want to compare the entire row; instead, only a small fragment
of it. This would require us to use those eigenvectors that are most discriminating for
picking the right correspondences. For discrete Laplacian, eigenvectors associated with
smaller eigenvalues can be considered as smooth functions on the point sets, while those
associated with larger eigenvalues are the non-smooth ones since they usually exhibit
greater oscillations. Typically, the latter eigenvectors provide more reliable matching
measures than the former ones and in many cases, using one or two such eigenvectors
(l = 2) is already sufficient to produce good results.
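The construction of Eq. (11), the eigen-decompositions, and the matching measure M can be put together as in the sketch below (our own code; for simplicity it takes the l eigenvectors with the largest eigenvalues instead of matching eigenvalues within the threshold ε, and σ, μ are free parameters).

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplacian_like(P, sigma=1.0, mu=1.0):
    """Symmetric matrix L_P of Eq. (11) with f(x) = exp(-x^2 / sigma^2)."""
    d = cdist(P, P)
    return np.eye(len(P)) - mu * np.exp(-d ** 2 / sigma ** 2)

def spectral_match(P, Q, sigma=1.0, l=2):
    """Correspondences from the sign-invariant measure M, using the l
    eigenvectors of L_P, L_Q associated with the largest eigenvalues."""
    _, VP = np.linalg.eigh(laplacian_like(P, sigma))
    _, VQ = np.linalg.eigh(laplacian_like(Q, sigma))
    VP, VQ = VP[:, -l:], VQ[:, -l:]          # the most oscillatory eigenvectors
    diff = (VP[:, None, :] - VQ[None, :, :]) ** 2
    summ = (VP[:, None, :] + VQ[None, :, :]) ** 2
    M = np.minimum(diff, summ).sum(axis=2)   # M(p_i, q_j)
    return M.argmin(axis=1)                  # pi(i) = argmin_j M(p_i, q_j)
```

The recovered correspondences determine the orthogonal map Ā, from which the original affine transformation A follows by undoing the whitening.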
Table 1. Experimental Results I. For each dimension and each noise setting, one hundred trials,
each with different point sets and matrix A, were performed. The averaged relative error and
percentage of mismatched points as well as standard deviations (in parenthesis) are shown.
Noise ↓   Dim 3: Matrix Error   Dim 3: Matching Error   Dim 5: Matrix Error   Dim 5: Matching Error   Dim 10: Matrix Error   Dim 10: Matching Error
0%        0 (0)                 0 (0)                   0 (0)                 0 (0)                   0 (0)                  0 (0)
1%        0.001 (0.0005)        0 (0)                   0.002 (0.0006)        0 (0)                   0.004 (0.0008)         0 (0)
2%        0.003 (0.001)         0 (0)                   0.004 (0.001)         0 (0)                   0.008 (0.001)          0 (0)
5%        0.008 (0.003)         0 (0)                   0.01 (0.003)          0 (0)                   0.02 (0.003)           0 (0)
10%       0.017 (0.01)          0.008 (0.009)           0.05 (0.05)           0.009 (0.04)            0.04 (0.009)           0 (0)
4 Experiments
In this section, we report four sets of experimental results. First, with synthetic point
sets, we show that the proposed affine registration algorithm does indeed recover exact
affine transformations and correspondences for noiseless data. Second, we show that
the proposed algorithm also works well for 2D point sets. Third, we provide two se-
quences of nonrigid motions and show that the feature point correspondences can be
Table 2. Experimental Results II. Experiments with point sets of different sizes with 5% noise
added. All trials match point sets in IR10 with settings similar to Table 1. Average errors for one
hundred trials are reported with standard deviations in parenthesis.
satisfactorily solved using affine registration in IR^9. And finally, we use images from the COIL database to show that the image set matching problem can also be solved using affine registration in IR^8. We have implemented the algorithm in MATLAB without any optimization. The sizes of the point sets range from 20 to 432, and on a DELL desktop with a single 3.1 GHz processor, no experiment runs longer than one minute.
Fig. 2. 2D Image Registration. 1st column: Source images (taken from COIL database) with
feature points marked in red. 2nd and 4th column: Target images with feature points marked in
blue. 3rd and 5th column: Target images with corresponding feature points marked in blue. The
affine transformed points from the source images are marked in red. Images are taken with 15◦
and 30◦ differences in viewpoint. The RMS errors for these four experiments (from left to right) are 2.6646, 3.0260, 2.0632, and 0.7060, respectively.
Fig. 3. Top: Sample frames from two video sequences of two objects undergoing nonrigid mo-
tions. Bottom: Sample frames from another camera observing the same motions.
point. The registration results for four pairs of images are shown in Figure 2. Notice the
small RMS registration errors for all these results given that the image size is 128×128.
proposed algorithm recovers all the correspondences correctly, while for the tattoo se-
quence, among the recovered sixty feature point correspondences, nine are incorrect.
This can be explained by the fact that in several frames, some of the tracked feature
points are occluded and missing and the subsequent factorizations produce relatively
noisy point sets in IR9 . On the other hand, affine-ICP with closest point initialization
fails badly for both sequences. In particular, more than three quarters of the estimated
correspondences are incorrect.
Fig. 4. Image Set Matching. The original image set A is shown in Figure 1. Image sets B, C are
shown above. The plots on the right show the L2 -registration error for each of the fifty iterations
of running affine-ICP algorithm using different initializations. Using the output of the proposed
affine registration as the initial guess, the affine-ICP algorithm converges quickly to the desired
transformation (blue curves) and yields correct correspondences. Using closest points for initial
correspondences, the affine-ICP algorithm converges (red curves) to incorrect solutions in both
experiments.
References
1. Scott, G., Longuet-Higgins, C.: An algorithm for associating the features of two images.
Proc. of Royal Society of London B244, 21–26 (1991)
2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge Uni-
versity Press, Cambridge (2003)
3. Chui, H., Rangarajan, A.: A new algorithm for non-rigid point matching. In: Proc. IEEE
Conf. on Comp. Vision and Patt. Recog, vol. 2, pp. 44–51 (2000)
4. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography—a factor-
ization method. Int. J. Computer Vision 9(2), 137–154 (1992)
5. Bregler, C., Hertzmann, A., Biermann, H.: Recovering non-rigid 3d shape from image
streams. In: Proc. IEEE Conf. on Comp. Vision and Patt. Recog., pp. 2690–2696 (2000)
6. Besl, P.J., McKay, N.D.: A method for registration of 3-d shapes. PAMI 14, 239–256 (1992)
7. Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. Int. J.
Computer Vision 13, 119–152 (1994)
8. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: Proc.
Int. Conf. on Computer Vision, vol. 2, pp. 586–591 (2001)
9. Brand, M.: Morphable 3d models from video. In: Proc. IEEE Conf. on Comp. Vision and
Patt. Recog., vol. 2, pp. 456–463 (2001)
10. Fitzgibbon, A.W.: Robust registration of 2d and 3d point sets. Image and Vision Computing 21, 1145–1153 (2003)
11. Sharp, G.C., Lee, S.W., Wehe, D.K.: Icp registration using invariant features. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 24, 90–102 (2002)
12. Rusinkiewicz, S., Levoy, M.: Efficient variants of the icp algorithm. In: Proc. Third Interna-
tional Conference on 3D Digital Imaging and Modeling (3DIM), pp. 145–152 (2001)
13. Granger, S., Pennec, X.: Multi-scale em-icp: A fast and robust approach for surface registra-
tion. In: Proc. European Conf. on Computer Vision, vol. 3, pp. 418–432 (2002)
14. Makadia, A., Patterson, A.I., Daniilidis, K.: Fully automatic registration of 3d point clouds.
In: Proc. IEEE Conf. on Comp. Vision and Patt. Recog., vol. 1, pp. 1297–1304 (2006)
15. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society (1997)
16. Hirsch, M.: Differential Topology. Springer, Heidelberg (1976)
17. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with ap-
plications to image analysis and automated cartography. Communications of the ACM 24,
381–395 (1981)
Semantic Concept Classification by Joint
Semi-supervised Learning of Feature Subspaces
and Support Vector Machines
1 Introduction
detection. We compare our algorithm with several state-of-the-art methods, including
the standard SVM [3], semi-supervised LapSVM and LapRLS [10], and the naive
approach of first learning a feature subspace (unsupervised) and then solving an
SVM (supervised) in the learned feature subspace. Experimental results demon-
strate the effectiveness of our LPSSVM algorithm.
2 Related Work
The SVM classifier [3] has been a popular approach to learn a classifier based on
the labeled subset XL for classifying the unlabeled set XU and new unseen test
samples. The primary goal of an SVM is to find an optimal separating hyperplane
that gives a low generalization error while separating the positive and negative
training samples. Given a data vector x, SVMs determine the corresponding
label by the sign of a linear decision function f (x) = wT x+b. For learning non-
linear classification boundaries, a kernel mapping φ is introduced to project data
vector x into a high dimensional feature space as φ(x), and the corresponding
class label is given by the sign of f (x) = wT φ(x) + b. In SVMs, this optimal
hyperplane is determined by giving the largest margin of separation between
different classes, i.e. by solving the following problem:
$$\min_{w,b,\xi} Q_d = \min_{w,b,\xi} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n_L}\xi_i, \quad \text{s.t. } y_i(w^T\phi(x_i)+b) \geq 1-\xi_i,\ \xi_i \geq 0,\ \forall\, x_i \in X_L. \qquad (1)$$
where $\xi = \xi_1, \ldots, \xi_{n_L}$ are the slack variables assigned to training samples, and $C$ controls the scale of the empirical error loss the classifier can tolerate.
To exploit the unlabeled data, the idea of graph Laplacian [6] has been shown
promising for both subspace learning and classification. We briefly review the
ideas and formulations in the next two subsections. Given the set of data points
X, a weighted undirected graph G = (V, E, W ) can be used to characterize the
pairwise similarities among data points, where V is the vertices set and each
node vi corresponds to a data point xi ; E is the set of edges; W is the set of
weights measuring the strength of the pairwise similarity.
Regularization for feature subspace learning. In feature subspace learning,
the objective of graph Laplacian [6] is to embed original data graph into an m-
dimensional Euclidean subspace which preserves the locality property of original
data. After embedding, connected points in original G should stay close. Let X̂
be the m×n dimensional embedding, X̂= [x̂1 , . . . , x̂n ], the cost function is:
$$\min_{\hat{X}} \Big\{\sum_{i,j=1}^{n} \|\hat{x}_i - \hat{x}_j\|_2^2\, W_{ij}\Big\},\ \text{s.t. } \hat{X}D\hat{X}^T = I \ \Rightarrow\ \min_{\hat{X}} \big\{\mathrm{tr}(\hat{X}L\hat{X}^T)\big\},\ \text{s.t. } \hat{X}D\hat{X}^T = I. \qquad (2)$$
where $L$ is the Laplacian matrix, $L = D - W$, and $D$ is the diagonal weight matrix whose entries are defined as $D_{ii} = \sum_j W_{ij}$. The condition $\hat{X}D\hat{X}^T = I$ removes
an arbitrary scaling factor in the embedding [6]. The optimal embedding can be
obtained as the matrix of eigenvectors corresponding to the lowest eigenvalues
of the generalized eigenvalue problem: Lx̂=λDx̂. One major issue of this graph
embedding approach is that when a novel unseen sample is added, it is hard
to locate the new sample in the embedding graph. To solve this problem, the
Locality Preserving Projection (LPP ) is proposed [9] which tries to find a linear
projection matrix a that maps data points xi to aT xi , so that aT xi can best
approximate graph embedding x̂i . Similar to Eq(2), the cost function of LPP is:
$$\min_a Q_s = \min_a\, \mathrm{tr}(a^T X L X^T a), \quad \text{s.t. } a^T X D X^T a = I. \qquad (3)$$
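For concreteness, a compact sketch of Eq. (3) follows (our own; the fully connected Gaussian-weighted graph and the small ridge term are assumed choices, since the excerpt does not fix W or the numerical details).

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, dim=2, sigma=1.0):
    """Locality Preserving Projection.  X: (d, n) data matrix, one sample
    per column.  Returns the (d, dim) projection matrix a of Eq. (3)."""
    W = np.exp(-cdist(X.T, X.T) ** 2 / sigma ** 2)    # pairwise similarities
    D = np.diag(W.sum(axis=1))                        # degree matrix
    L = D - W                                         # graph Laplacian
    # generalized eigenproblem  (X L X^T) a = lambda (X D X^T) a
    lhs = X @ L @ X.T
    rhs = X @ D @ X.T + 1e-8 * np.eye(X.shape[0])     # ridge for stability
    _, evecs = eigh(lhs, rhs)
    return evecs[:, :dim]      # eigenvectors of the smallest eigenvalues
```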
$$\min_f\ \frac{1}{n_L}\sum_{i=1}^{n_L} V(x_i, y_i, f) + \gamma_A \|f\|_2^2 + \gamma_I\, f^T L f. \qquad (4)$$
where V(xi ,yi ,f ) is the loss function, e.g., the square loss V(xi ,yi ,f )=(yi−f(xi ))2
for LapRLS and the hinge loss V(xi ,yi ,f ) = max(0, 1 − yi f (xi )) for LapSVM;
f is the vector of discriminative functions over the entire data set X, i.e., f =
[f (x1 ), . . . , f (xnU +nL )]T . Parameters γA and γI control the relative importance
of the complexity of f in the ambient space and the smoothness of f according
to the feature manifold, respectively.
2.3 Motivation
In this paper, we pursue a new semi-supervised approach for feature subspace
discovery as well as classifier learning. We propose a novel algorithm, Locality
Preserving Semi-supervised SVM (LPSSVM ), aiming at joint learning of both an
optimal feature subspace and a large margin SVM classifier in a semi-supervised
manner. Specifically, the graph Laplacian regularization condition in Eq(3) is
adopted to maintain the smoothness of the neighborhoods over both labeled
and unlabeled data. At the same time, the discriminative constraint in Eq(1)
3.1 LPSSVM
The smooth regularization term Qs in Eq(3) and discriminative cost function Qd
in Eq(1) can be combined synergistically to generate the following cost function:
$$\min_{a,w,b,\xi} Q = \min_{a,w,b,\xi}\{Q_s + \gamma Q_d\} = \min_{a,w,b,\xi}\Big\{\mathrm{tr}(a^T X L X^T a) + \gamma\Big[\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n_L}\xi_i\Big]\Big\} \qquad (5)$$
Through optimizing Eq(5) we can obtain the optimal linear projection a and
classifier w, b simultaneously. In the following, we develop an iterative algorithm
to minimize over a and w, b, which will monotonically reduce the cost Q by
coordinate descent towards a local minimum. First, using the method of Lagrange
multipliers, Eq(5) can be rewritten as the following:
$$\min_{a,w,b,\xi} Q = \min_{a,w,b,\xi}\max_{\alpha,\mu}\Big\{\mathrm{tr}(a^T X L X^T a) + \gamma\Big[\frac{1}{2}\|w\|_2^2 - F^T(X_L^T a\,w - B) + M\Big]\Big\}, \ \text{s.t. } a^T X D X^T a = I,$$
where we have defined the quantities $F = [\alpha_1 y_1, \ldots, \alpha_{n_L} y_{n_L}]^T$, $B = [b, \ldots, b]^T$, $M = C\sum_{i=1}^{n_L}\xi_i + \sum_{i=1}^{n_L}\alpha_i(1-\xi_i) - \sum_{i=1}^{n_L}\mu_i\xi_i$, and non-negative Lagrange multipliers $\alpha = \alpha_1, \ldots, \alpha_{n_L}$, $\mu = \mu_1, \ldots, \mu_{n_L}$. By differentiating $Q$ with respect to $w, b, \xi_i$ we get:
$$\frac{\partial Q}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n_L}\alpha_i y_i\, a^T x_i = a^T X_L F. \qquad (6)$$
$$\frac{\partial Q}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n_L}\alpha_i y_i = 0, \qquad \frac{\partial Q}{\partial \xi_i} = 0 \Rightarrow C - \alpha_i - \mu_i = 0. \qquad (7)$$
Note Eq(6) and Eq(7) are the same as those seen in SVM optimization [3], with
the only difference that the data points are now transformed by a as x̃i = aT xi .
That is, given a known a, the optimal w can be obtained through the standard
SVM optimization process. Secondly, by substituting Eq(6) into Eq(5), we get:
$$\min_a Q = \min_a\Big\{\mathrm{tr}(a^T X L X^T a) + \frac{\gamma}{2}F^T X_L^T a\, a^T X_L F\Big\}, \quad \text{s.t. } a^T X D X^T a = I. \qquad (8)$$
$$\frac{\partial Q}{\partial a} = 0 \Rightarrow \Big(X L X^T + \frac{\gamma}{2}X_L F F^T X_L^T\Big)a = \lambda X D X^T a. \qquad (9)$$
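The two coordinate steps implied by Eqs. (6)–(9) can be sketched as follows; scikit-learn's SVC is used as a stand-in SVM solver and the PCA initialization of a is an assumed choice, so this illustrates the alternation rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.svm import SVC

def lpssvm_linear(X, XL, yL, L, D, dim, gamma=1.0, C=1.0, n_iters=5):
    """Alternating minimization of Eq. (5).  X: (d, n) all training data,
    XL: (d, nL) labeled data with labels yL in {-1, +1}."""
    a = np.linalg.svd(X - X.mean(1, keepdims=True))[0][:, :dim]   # PCA init
    for _ in range(n_iters):
        # Step 1: with a fixed, train an SVM on the projected labeled data
        svm = SVC(kernel='linear', C=C).fit(XL.T @ a, yL)
        F = np.zeros(XL.shape[1])
        F[svm.support_] = svm.dual_coef_.ravel()      # entries alpha_i * y_i
        # Step 2: with F fixed, update a via the generalized eigenproblem (9)
        u = XL @ F                                    # the vector X_L F
        lhs = X @ L @ X.T + 0.5 * gamma * np.outer(u, u)
        rhs = X @ D @ X.T + 1e-8 * np.eye(X.shape[0])
        a = eigh(lhs, rhs)[1][:, :dim]
    return a, svm
```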
In this section, we show that the LPSSVM method proposed above can be ex-
tended to a nonlinear kernel version. Assume that φ(xi ) is the projection function
which maps the original data point xi into a high-dimension feature space. Sim-
ilar to the approach used in Kernel PCA [14] or Kernel LPP [9], we pursue the
projection matrix a in the span of existing data points, i.e.,
n
a= φ(xi )vi = φ(X)v . (10)
i=1
where $v = [v_1, \ldots, v_n]^T$. Let $K$ denote the kernel matrix over the entire data set $X = [X_L, X_U]$, where $K_{ij} = \phi(x_i)\cdot\phi(x_j)$. $K$ can be written as
$$K = \begin{pmatrix} K_L & K_{LU} \\ K_{UL} & K_U \end{pmatrix},$$
where $K_L$ and $K_U$ are the kernel matrices over the labeled subset $X_L$ and the unlabeled subset $X_U$ respectively; $K_{LU}$ is the kernel matrix between the labeled data set and the unlabeled data set and $K_{UL}$ is the kernel matrix between the unlabeled data and the labeled data ($K_{LU} = K_{UL}^T$).
In the kernel space, the projection updating equation (i.e., Eq(8)) turns into:
$$\min_a Q = \min_a\Big\{\mathrm{tr}(a^T\phi(X)L\phi^T(X)a) + \frac{\gamma}{2}F^T\phi^T(X_L)a\, a^T\phi(X_L)F\Big\}, \ \text{s.t. } a^T\phi(X)D\phi^T(X)a = I.$$
By differentiating $Q$ with respect to $a$, we can get:
$$\phi(X)L\phi^T(X)a + \frac{\gamma}{2}\phi(X_L)FF^T\phi^T(X_L)a = \lambda\phi(X)D\phi^T(X)a$$
$$\Rightarrow \Big(KLK + \frac{\gamma}{2}K^{LU|L}FF^T(K^{LU|L})^T\Big)v = \lambda KDKv. \qquad (11)$$
where $K^{L|LU} = [K_L, K_{LU}]$. This is the same as the original SVM dual problem
[3], except that the kernel matrix is changed from original K to:
$$\hat{K} = \big(K^{L|LU}v\big)\big(v^T K^{LU|L}\big). \qquad (12)$$
Combining the above two components, we can obtain the kernel-based two-step
optimization process as follows:
Step-1: With the current projection matrix vt at iteration t, train an SVM to
get wt and α1,t , . . . , αnL ,t with the new kernel described in Eq(12).
Step-2: With the current wt , α1,t , . . . , αnL ,t , update vt+1 by solving Eq(11).
In the testing stage, given a test example xj (xj can be an unlabeled training
sample, i.e., xj ∈ XU or xj can be an unseen test sample), the SVM classifier
gives classification prediction based on the discriminative function:
$$f(x_j) = w^T a^T\phi(x_j) = \sum_{i=1}^{n_L}\alpha_i y_i\, \phi(x_i)^T a\, a^T\phi(x_j) = \sum_{i=1}^{n_L}\alpha_i y_i \Big(\sum_{g=1}^{n} K^{L|LU}_{ig}v_g\Big)\Big(\sum_{g=1}^{n} K(x_g, x_j)v_g\Big).$$
Thus the SVM classification process is also similar to that of standard SVM [3],
with the difference that the kernel function between labeled training data and test data is changed from $K^{L|test}$ to $\hat{K}^{L|test} = \big(K^{L|LU}v\big)\big(v^T K^{LU|test}\big)$. Here $v$ plays the role of modeling the kernel-based projection $a$ before computing the SVM.
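For concreteness, the rank-one modified kernels of Eq. (12) and the test-time kernel above can be computed as in the sketch below (our own; it treats v as a single projection direction, as in the derivation).

```python
import numpy as np

def modified_kernels(K_L_LU, K_LU_test, v):
    """K_L_LU:    (nL, n) kernel between labeled and all training data.
    K_LU_test: (n, nT) kernel between all training data and test data.
    v:         (n,)    kernel-space projection coefficients (Eq. 10)."""
    u = K_L_LU @ v                            # projected labeled data
    K_hat = np.outer(u, u)                    # Eq. (12)
    K_hat_test = np.outer(u, v @ K_LU_test)   # kernel w.r.t. test samples
    return K_hat, K_hat_test
```

Both matrices can then be handed to an SVM solver that accepts a precomputed kernel, which is all that Step-1 of the kernel-based optimization requires.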
In terms of speed, LPSSVM is very fast in the testing stage, with complexity
similar to that of standard SVM classification. In the training stage, both steps of LPSSVM are fast. The generalized eigenvalue problem in Eq(11) has a time complexity of $O(n^3)$ ($n = n_L + n_U$). It can be further reduced by exploiting the sparse implementation of [17]. For step 1, the standard quadratic programming optimization for SVM is $O(n_L^3)$, which can be further reduced to linear complexity (about $O(n_L)$) by using efficient solvers like [18].
4 Experiments
We conduct experiments over 4 data sets: a toy set, two UCI sets [11], Caltech
101 for image classification [12], and Kodak’s consumer video set for concept de-
tection [13]. We compare with some state-of-the-art methods, including supervised SVM
[3], semi-supervised LapSVM and LapRLS [10]. We also compare with a naive
LPP+SVM: first apply kernel-based LPP to get projection and then learn SVM
in projected space. For fair comparison, all SVMs in different algorithms use
RBF kernels for classifying UCI data, Kodak’s consumer videos, and toy data,
and use the Spatial Pyramid Match (SPM) kernel [16] for classifying Caltech
101 (see Sec.4.3 for details). This is motivated by the promising performance in
classifying Caltech 101 in [16] by using SPM kernels. In LPSSVM, γ = 1 in Eq(5)
to balance the consideration on discrimination and smoothness, and θ = 1/d in
RBF kernel where d is feature dimension. This follows the suggestion of the pop-
ular toolkit LibSVM [15]. For all algorithms, the error control parameter C = 1
for SVM. This parameter setting is found robust for many real applications [15].
Other parameters: γA , γI in LapSVM, LapRLS [10] and kn for graph construc-
tion, are determined through cross validation. LibSVM [15] is used for SVM,
and source code from [17] is used for LPP.
Fig. 2. Performance over toy data. Compared with others, LPSSVM effectively dis-
criminates 3 categories. Above results are generated by using the SVM Gram matrix
directly for constructing Laplacian graph. With more deliberate tuning of the Lapla-
cian graph, LapSVM, LapRLS, and LPSSVM can give better results. Note that the
ability of LPSSVM to maintain good performance without graph tuning is important.
the three categories. This data set is hard since data points around the class
boundaries from different categories (red and cyan, and blue and cyan) are close
to each other. This adds great difficulty to manifold learning. The one-vs.-all
classifier is used to classify each category from others, and each test data is
assigned the label of the classifier with the highest classification score. Fig. 2
gives an example of the classification results using different methods with 10%
samples from each category as labeled data (17 labeled samples in total). The
averaged classification error rates (over 20 randomization runs) when varying the
number of labeled data are also shown. The results clearly show the advantage
of our LPSSVM in discriminative manifold learning and classifier learning.
Fig. 3. Classification rates over UCI data sets. The vertical dotted line over each point
shows the standard deviation over 20 randomization runs.
of SPM, only the labeled data is fed to the kernel matrix for standard SVM.
For other methods, the SPM-based measure is used to construct kernel matrices
for both labeled and unlabeled data (i.e., KL , KU , KLU ) before various semi-
supervised learning methods are applied. Specifically, for each image category,
5 images are randomly sampled as labeled data and 25 images are randomly
sampled as unlabeled data for training. The remaining images are used as novel
test data for evaluation (we limit the maximum number of novel test images
in each category to be 30). Following the procedure of [16], a set of local SIFT
features of 16 × 16 pixel patches are uniformly sampled from these images over
a grid with spacing of 8 pixels. Then for each image category, a visual codebook
is constructed by clustering all SIFT features from 5 labeled training images
into 50 clusters (codewords). Local features in each image block are mapped
to the codewords to compute codeword histograms. Histogram intersections are
calculated at various locations and resolutions (2 levels), and are combined to
estimate similarity between image pairs. One-vs.-all classifiers are built for classi-
fying each image category from the other categories, and a test image is assigned
the label of the classifier with the highest classification score.
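A simplified sketch of the two ingredients just described, codeword histograms and histogram intersection, is given below (our own; the per-level pyramid weighting of [16] is omitted for brevity).

```python
import numpy as np

def codeword_histogram(codeword_ids, n_words=50):
    """Normalized histogram of codeword assignments for the SIFT features
    falling in one image (or in one spatial cell of the pyramid)."""
    h = np.bincount(codeword_ids, minlength=n_words).astype(float)
    return h / max(h.sum(), 1.0)

def histogram_intersection(h1, h2):
    """Similarity of two codeword histograms; summing such intersections
    over the cells of the 2-level pyramid (with the weights of [16]) gives
    the SPM similarity used to fill the kernel matrices."""
    return float(np.minimum(h1, h2).sum())
```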
Table 1 (a) and (b) give the average recognition rates of different algorithms
over 101 image categories for the unlabeled data and the novel test data, respec-
tively. From the table, over the unlabeled training data LPSSVM can improve
baseline SVM by about 11.5% (on a relative basis). Over the novel test data,
LPSSVM performs quite similarly to baseline SVM¹.
It is interesting to notice that all other competing semi-supervised meth-
ods, i.e., LapSVM, LapRLS, and naive LPP+SVM, get worse performance than
LPSSVM and SVM. Please note that extensive research has been conducted for
supervised classification of Caltech 101, among which SVM with SPM kernels
gives one of the top performances. To the best of our knowledge, there is no
report showing that the previous semi-supervised approaches can compete with this
state-of-the-art SPM-based SVM in classifying Caltech 101. The fact that our
LPSSVM can outperform this SVM, to us, is very encouraging.
¹ Note the performance of SPM-based SVM here is lower than that reported in [16]. This is due to the much smaller training set than that in [16]. We focus on scenarios of scarce training data to assess the power of different semi-supervised approaches.
Table 1. Recognition rates for Caltech 101. All methods use SPM to compute image
similarity and kernel matrices. Numbers shown in parentheses are standard deviations.
We also use the challenging Kodak’s consumer video data set provided in [13],
[21] for evaluation. Unlike the Caltech images, content in this raw video source
involves more variations in imaging conditions (view, scale, lighting) and scene
complexity (background and number of objects). The data set contains 1358
video clips, with lengths ranging from a few seconds to a few minutes. To avoid
shot segmentation errors, keyframes are sampled from video sequences at a 10-
second interval. These keyframes are manually labeled to 21 semantic concepts.
Each clip may be assigned to multiple concepts; thus it represents a multi-label
corpus. The concepts are selected based on actual user studies, and cover several
categories like activity, occasion, scene, and object.
To explore complementary features from both audio and visual channels, we
extract similar features as [21]: visual features, e.g., grid color moments, Gabor
texture, edge direction histogram, from keyframes, resulting in 346-dimension
visual feature vectors; Mel-Frequency Cepstral Coefficients (MFCCs) from each
audio frame (10ms) and delta MFCCs from neighboring frames. Over the video
interval associated with each keyframe, the mean and covariance of the audio
frame features are computed to generate a 2550-dimension audio feature vector. Then the visual and audio feature vectors are concatenated to form a 2896-
dimension multi-modal feature vector. 136 videos (10%) are randomly sampled
as training data, and the rest are used as unlabeled data (also for evaluation). No
videos are reserved as novel unseen data due to the scarcity of positive samples
for some concepts. One-vs.-all classifiers are used to detect each concept, and
average precision (AP) and mean of APs (MAP) are used as performance metrics,
which are official metrics for video concept detection [22].
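For reference, non-interpolated average precision can be computed as in the sketch below (our own; the exact TRECVID evaluation protocol in [22] may differ in details such as tie handling).

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: the mean of the precision values measured at the
    rank of each positive sample, after sorting by classifier score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    precisions = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float(precisions[labels == 1].mean()) if labels.any() else 0.0

# MAP is simply the mean of the per-concept AP values.
```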
Fig. 4. Performance over consumer videos: per-concept AP and MAP. LPSSVM gets
good performance over most concepts with strong cues from both visual and audio chan-
nels, where LPSSVM can find discriminative feature subspaces from multi-modalities.
Fig. 4 gives the per-concept AP and the overall MAP performance of different
algorithms2 . On average, the MAP of LPSSVM significantly outperforms other
methods - 45% better than the standard SVM (on a relative basis), 42%, 41% and
92% better than LapSVM, LapRLS and LPP+SVM, respectively. From Fig. 4,
we notice that our LPSSVM performs very well for the “parade” concept, with
a 17-fold performance gain over the 2nd best result. Nonetheless, even if we
exclude “parade” and calculate MAP over the other 20 concepts, our LPSSVM
still does much better than standard SVM, LapSVM, LapRLS, and LPP+SVM
by 22%, 15%, 18%, and 68%, respectively.
Unlike results for Caltech 101, here semi-supervised LapSVM and LapRLS
also slightly outperform standard SVM. However, the naive LPP+SVM still per-
forms poorly - confirming the importance of considering subspace learning and
discriminative learning simultaneously, especially in real image/video classifica-
tion. Examining individual concepts, LPSSVM achieves the best performance
for a large number of concepts (14 out of 21), with a huge gain (more than
100% over the 2nd best result) for several concepts like “boat”, “wedding”, and
“parade”. All these concepts generally have strong cues from both visual and
the audio channels, and in such cases LPSSVM takes good advantage of finding
a discriminative feature subspace from multiple modalities, while successfully
harnessing the challenge of the high dimensionality associated with the multi-
modal feature space. As for the remaining concepts, LPSSVM is 2nd best for 4
additional concepts. LPSSVM does not perform as well as LapSVM or LapRLS
for the remaining 3 concepts (i.e., “ski”, “park”, and “playground”), since there are
no consistent audio cues associated with videos in these classes, and thus it is
difficult to learn an effective feature subspace. Note although for “ski” visual
² Note the SVM performance reported here is lower than that in [21]. Again, this is due to the much smaller training set than that used in [21].
Fig. 5. Effect of varying energy ratio (subspace dimensionality) on the detection per-
formance. There exists a reasonable range of energy ratio that LPSSVM performs well.
5 Conclusion
way to optimize the proposed joint cost function in Eq(5). With the relaxation $a^T X D X^T a - I \preceq 0$ instead of $a^T X D X^T a - I = 0$, the problem can be solved via SDP (Semidefinite Programming), where all parameters can be recovered without resorting to iterative processes. In such a case, we can avoid local minima, although the solution may be different from that of the original problem.
References
1. Joachims, T.: Transductive inference for text classification using support vector
machines. In: ICML, pp. 200–209 (1999)
2. Chapelle, O., et al.: Semi-supervised learning. MIT Press, Cambridge (2006)
3. Vapnik, V.: Statistical learning theory. Wiley-Interscience, New York (1998)
4. Zhu, X.: Semi-supervised learning literature survey. Computer Sciences Technique
Report 1530. University of Wisconsin-Madison (2005)
5. Bengio, Y., Delalleau, O., Roux, N.: Efficient non-parametric function induction in
semi-supervised learning. Technique Report 1247, DIRO. Univ. of Montreal (2004)
6. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15, 1373–1396 (2003)
7. Cai, D., et al.: Spectral regression: a unified subspace learning framework for
content-based image retrieval. ACM Multimedia (2007)
8. Duda, R.O., et al.: Pattern classification, 2nd edn. John Wiley and Sons, Chichester
(2001)
9. He, X., Niyogi, P.: Locality preserving projections. Advances in NIPS (2003)
10. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric frame-
work for learning from labeled and unlabeled examples. Journal of Machine Learn-
ing Research 7, 2399–2434 (2006)
11. Blake, C., Merz, C.: UCI repository of machine learning databases (1998),
http://www.ics.uci.edu/~mlearn/MLRepository.html
12. Li, F., Fergus, R., Perona, P.: Learning generative visual models from few training
examples: An incremental bayesian approach tested on 101 object categories. In:
CVPR Workshop on Generative-Model Based Vision (2004)
13. Loui, A., et al.: Kodak’s consumer video benchmark data set: concept definition
and annotation. In: ACM Int’l Workshop on Multimedia Information Retrieval
(2007)
14. Schölkopf, B., Smola, A., Müller, K.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
15. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification,
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
16. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid match-
ing for recognizing natural scene categories. In: CVPR, vol. 2, pp. 2169–2178 (2006)
17. Cai, D., et al.: http://www.cs.uiuc.edu/homes/dengcai2/Data/data.html
18. Joachims, T.: Training linear svms in linear time. ACM KDD, 217–226 (2006)
19. Fergus, R., et al.: Object class recognition by unsupervised scale-invariant learning.
In: CVPR, pp. 264–271 (2003)
20. Lowe, D.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision 60, 91–110 (2004)
21. Chang, S., et al.: Large-scale multimodal semantic concept detection for consumer
video. In: ACM Int’l Workshop on Multimedia Information Retrieval (2007)
22. NIST TRECVID (2001 – 2007),
http://www-nlpir.nist.gov/projects/trecvid/
Learning from Real Images to Model Lighting
Variations for Face Images
1 Introduction
Face recognition is difficult due to variations caused by pose, expression, occlu-
sion and lighting (or illumination), which make the distribution of face object
highly nonlinear. Lighting is regarded as one of the most critical factors for robust
face recognition. Current attempts handle lighting variation by either finding invariant features or modeling the variation. The edge-based algorithm [1] and the algorithms based on the quotient image [2,3,4] belong to the first type. But
these methods cannot extract sufficient features for accurate recognition.
Early work on modeling lighting variation [5,6] showed that a 3D linear
subspace can represent the variation of a Lambertian object under a fixed
pose when there is no shadow. With the same Lambertian assumption, Bel-
humeur and Kriegman [7] showed that images illuminated by an arbitrary num-
ber of point light sources formed a convex polyhedral cone, i.e. the illumination
cone. In theory, the dimensionality of the cone is finite. They also pointed out
that the illumination cone can be approximated by a few properly chosen im-
ages. Good recognition results of the illumination cone in [8] demonstrated its
representation for lighting variation. [9] indicated that the lighting subspace of a Lam-
bertian object can be approximated by a linear subspace with dimension be-
tween three and seven. Recent research is mainly focused on the application
of low-dimensional subspace to lighting variation modeling. With the assump-
tion of Lambertian surface and non-concavity, Ramamoorthi and Hanrahan [10]
and Basri and Jacobs[11] independently introduced the spherical harmonic (SH)
subspace to approximate the illumination cone. However, the harmonic images
(basis images of SH subspace) are computed from the geometric and albedo in-
formation of the subject’s surface. In order to use the SH subspace theory, a
lot of algorithms applied the 3D model of faces to handling lighting variations
[12,13,14,15,16]. However, recovering the 3D shape from images is still an open
problem in computer vision.
Lee et al.[19] built up a subspace that is nearest to the SH subspace and has
the largest intersection with the illumination cone, called the nine points of light
(9PL) subspace. It has a universal configuration for different subjects, i.e. the
subspace is spanned by images under the same lighting conditions for different
subjects. In addition, the basis images of 9PL subspace can be duplicated in
real environments, while those of the SH subspace cannot because its basis images contain negative values. Therefore the 9PL subspace can overcome the inherent limitation of the SH subspace. Since the human face is neither completely Lambertian nor entirely convex, the SH subspace can hardly represent the specularities or cast shadows (not to mention inter-reflection). Since the basis images of the 9PL subspace are taken from a real environment, they already contain all the compli-
cated reflections of the objects. Therefore the 9PL subspace can give a more
detailed and accurate description of lighting variation.
In practice, the requirement of these nine real images cannot always be ful-
filled. Usually there are fewer gallery images (e.g. one gallery image) per subject,
which can be taken under arbitrary lighting conditions. In this paper, we pro-
pose a statistical model for recovering the 9 basis images of the 9PL subspace
from only one gallery image. Zhang and Samaras [12] presented a statistical
method for recovering the basis images of SH subspace instead. In their training
procedure, geometric and albedo information is still required for synthesizing
the harmonic images. In contrast, the proposed method requires only some real
images that can be easily obtained in real environment. Since the recovered ba-
sis images of the 9PL subspace contain all the reflections caused by the shape
of faces, such as cast shadows, specularities, and inter-reflections, better recog-
nition results are obtained, even under extreme lighting conditions. Compared
with other algorithms based on 3D model [12,15,16], the proposed algorithm is
entirely a 2D algorithm, which has much lower computational complexity. The
where $\tilde{s}_{ij} = \tilde{b}_i \times \tilde{b}_j$ and $\tilde{B} \in \mathbb{R}^{n\times 3}$. Every row $\tilde{b}_i$ of $\tilde{B}$ is a three-element row
vector determined by the product of the albedo with the inward pointing unit
normal vector of a point on the surface. There are at most q(q − 1) extreme rays
for q ≤ n distinct surface normal vectors. Therefore the cone can be constructed
with finite extreme rays and the dimensionality of the lighting subspace is finite.
However, building the full illumination cone is tedious, and the low dimensional
approximation of the illumination cone is applied in practice.
From the view of signal processing, the reflection equation can be considered
as the rotational convolution of incident lighting with the albedo of the surface
[10]. The spherical harmonic functions Ylm (θ, φ) are a set of orthogonal basis
functions defined in the unit sphere, given as follows,
9) can span a subspace for representing the variability of lighting. This subspace
is called the spherical harmonic (SH) subspace .
Good recognition results reported in [11] indicates that the SH subspace H
is a good approximation to the illumination cone C. Given the geometric infor-
mation of a face, its spherical harmonic functions can be calculated with Eq.(2).
These spherical harmonic functions are synthesized images, also called harmonic
images. Except the first harmonic image, all the others have negative values,
which cannot be obtained in reality. To avoid the requirement of geometric in-
formation, Lee et al.[19] found a set of real images which can also serve as a low
dimensional approximation to illumination cone based on linear algebra theory.
Since the SH subspace H is good for face recognition, it is reasonable to
assume that a subspace R close to H would be likewise good for recognition. R
should also intersect with the illumination cone C as much as possible. Hence a
linear subspace R which is meant to provide a basis for good face recognition
will also be a low dimensional linear approximation to the illumination cone C.
Thus the subspace R should satisfy the following two conditions [19]:
1. The distance between R and H should be minimized.
2. The unit volume vol(C ∩ R) of C ∩ R should be maximized (the unit volume is defined as the volume of the intersection of C ∩ R with the unit ball).
Note that C ∩R is always a subcone of C; therefore maximizing its unit volume
is equivalent to maximizing the solid angle subtended by the subcone C ∩ R. If $\{\tilde{I}_1, \tilde{I}_2, \cdots, \tilde{I}_k\}$ are the basis images of R, the cone $R_c \subset R$ is defined by the $\tilde{I}_k$:
$$R_c = \Big\{I \ \Big|\ I \in R,\ I = \sum_{k=1}^{M}\alpha_k \tilde{I}_k,\ \alpha_k \geq 0\Big\} \qquad (3)$$
where I p denotes the image of subject p taken under a single light source. H p
is the SH subspace of subject p. Rpk−1 denotes the linear subspace spanned by
images $\{\tilde{I}_1^p, \cdots, \tilde{I}_k^p\}$ of subject p. The universal configuration of nine light source directions is obtained. They are (0, 0), (68, −90), (74, 108), (80, 52), (85, −42), (85, −137), (85, 146), (85, −4), (51, 67) [14]. The directions are expressed in spherical coordinates as pairs of (φ, θ). Figure 1(a) illustrates the nine basis images
of a person from the Yale Face Database B [8].
Fig. 1. The basis images of the 9PL subspace. (a) Images taken under certain lighting conditions can serve as the basis images of the object. (b) The mean images of the basis images estimated from the bootstrap data set.
of lighting coefficients which denotes the lighting conditions of the image. Error
term $e(s) \in \mathbb{R}^{d\times 1}$ is related to the pixels’ position and lighting conditions.
For a novel image, we estimate its basis images through the maximum a
posteriori (MAP) estimation. That is,
$$B_{MAP} = \arg\max_B P(B|I) \qquad (7)$$
In order to recover basis images from an image with Eq.(9), one should know
the pdf of the basis images, i.e. P (B), and the pdf of the likelihood, i.e. P (I|B).
Assuming the error term of Eq.(6) is normally distributed with mean μe (s) and
variance σe2 (s), we can deduce that the pdf of the likelihood P (I|B) is also
Gaussian with mean Bs + μe (s) and variance σe2 (s) according to Eq.(6).
We assume that the pdf of the basis images B are Gaussians of means μB
and covariances CB as in [12,20]. The probability P (B) can be estimated from
the basis images in the training set. In our experiments, the basis images of 20
different subjects from the extented Yale face database B [8] are introduced to
the bootstrap set. Note that, the basis images of every subject are real images
which were taken under certain lighting conditions. The lighting conditions are
determined by the universal configurations of the 9PL subspace. The sample
mean μB and sample covariance matrix CB are computed. Figure 1(b) shows
the mean basis images, i.e. μB .
The error term e(s) = I − Bs models the divergence between the real image
and the estimated image which is reconstructed by the low dimensional subspace.
The error term is related to the lighting coefficients. Hence, we need to know
the lighting coefficients of different lighting conditions. In the training set, there
are 64 different images taken under different lighting conditions for every
subject. Under a certain lighting condition, we calculate the lighting coefficients
of every subject’s image, i.e. $s_k^p$ (the lighting coefficients of the $p$th subject’s image under the lighting condition $s_k$). For a training image, its lighting coefficients can be estimated by solving the linear equation $I = Bs$. The mean value of different subjects’ lighting coefficients is taken as the estimated coefficients ($\bar{s}_k$) for that lighting condition, i.e. $\bar{s}_k = \sum_{p=1}^{N} s_k^p / N$. Then, under a certain lighting condition, the error term of the $p$th subject’s image is
As described in the previous section, the basis images of a novel image can
be recovered by using the MAP estimation. Since the error term is related to
lighting condition, we need to estimate the lighting condition, i.e. the lighting
coefficients, of every image before calculating its basis images.
The error term denotes the difference between the reconstructed image and the
real image. This divergence is caused by the fact that the 9PL subspace is the
low-dimensional approximation to the lighting subspace, and it only accounts for
the low frequency parts of the lighting variance. The statistics of the error under
a new lighting condition can be estimated from those of the error under known
illumination, i.e. μe (s̄k ), σe2 (s̄k ), also via the kernel regression method [20].
$$\mu_e(s) = \frac{\sum_{k=1}^{M} w_k\,\mu_e(\bar{s}_k)}{\sum_{k=1}^{M} w_k} \qquad (13)$$
$$\sigma_e^2(s) = \frac{\sum_{k=1}^{M} w_k\,\sigma_e^2(\bar{s}_k)}{\sum_{k=1}^{M} w_k} \qquad (14)$$
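The estimates of Eqs. (13)–(14) reduce to a weighted average; in the sketch below the weights w_k are taken to be Gaussian in the distance between lighting-coefficient vectors, which is an assumption on our part (the excerpt defers the weight definition to [20]).

```python
import numpy as np

def regress_error_stats(s, s_bar, mu_e, var_e, h=1.0):
    """s:     (9,)   lighting coefficients of the probe image.
    s_bar: (M, 9) estimated coefficients of the M training conditions.
    mu_e:  (M, d) error means mu_e(s_bar_k) per lighting condition.
    var_e: (M, d) error variances sigma_e^2(s_bar_k).
    Returns the regressed mu_e(s) and sigma_e^2(s) of Eqs. (13)-(14)."""
    w = np.exp(-np.sum((s_bar - s) ** 2, axis=1) / (2.0 * h ** 2))  # assumed w_k
    w = w / w.sum()
    return w @ mu_e, w @ var_e
```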
$$AB = b \qquad (18)$$
where $A = ss^T \cdots$
From Eq.(19), the estimated basis image is composed of the term of characteristics, $\frac{I-\mu_B s-\mu_e}{\sigma_e^2 + s^T C_B s}\, C_B s$, and the term of mean, $\mu_B$. In the term of characteristics, $(I - \mu_B s - \mu_e)$ is the difference between the probe image and the image reconstructed by the mean basis images.
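Since only the structure of Eq. (19) is spelled out here (a mean term plus a characteristic term), the sketch below implements exactly that description; treating C_B as a 9 × 9 coefficient covariance shared across pixels and σ_e² as a scalar are simplifying assumptions made for the sketch, so it should be checked against the full derivation in the paper.

```python
import numpy as np

def recover_basis_images(I, s, mu_B, C_B, mu_e, var_e):
    """MAP basis images following the description of Eq. (19):
        B = mu_B + C_B s (I - mu_B s - mu_e) / (sigma_e^2 + s^T C_B s).
    I: (d,) probe image, s: (9,) its lighting coefficients, mu_B: (d, 9)
    mean basis images, C_B: (9, 9) coefficient covariance (an assumed
    simplification), mu_e: (d,) error mean, var_e: scalar error variance."""
    residual = I - mu_B @ s - mu_e                # term of characteristics
    gain = (C_B @ s) / (var_e + s @ C_B @ s)      # (9,)
    return mu_B + np.outer(residual, gain)        # (d, 9) estimated basis
```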
4.4 Recognition
The most direct way to perform recognition is to measure the distance between
probe images and the subspace spanned by the recovered basis images. Every
column of B is one basis image. However, the basis images are not orthonormal
5 Experiments
The statistical model is trained by images from the extended Yale Face Database
B. With the trained statistical model, we can reconstruct the lighting subspace
from only one gallery image. This estimation is insensitive to lighting variation.
Thus, recognition can be achieved across illumination conditions.
Fig. 2. Recovered basis images. (a)∼(d) are images in subset 1∼4 of Yale Face Database
B respectively. (e)∼(h) are recovered basis images from image (a)∼(d) respectively. (i)
are the reconstruction results: from left to right, the columns are the original im-
ages, the reconstruction results from the real basis images and the estimated basis
images (e)∼(h), respectively.
Although the images of the same object are under different lighting condi-
tions, the recovered basis images should be similar. The probe images are from
the Yale face database B. There are 10 subjects and 45 probe images per subject.
According to the lighting conditions of the probe images, they can be grouped
into 4 subsets as in [8]. The details can be found in Table 1. From subset1 to
subset4, the lighting conditions become extreme. For every subject, we recover
its basis images from only one of its probe images each time. Then we can ob-
tain 45 sets of basis images for every subject. Fig.2(e)∼(h) are the basis images
recovered from an image of each subset. σ̄basis (the mean standard deviation of
the 45 sets of basis images of 10 subjects) is 7.76 intensity levels per pixel, while
σ̄image (the mean standard deviation of the original 45 probe images of 10 sub-
jects) is 44.12 intensity levels per pixel. From the results, we can see that the
recovered basis images are insensitive to the variability of lighting. Thus we can
recover the basis images of a subject from its images under arbitrary lighting
conditions. Fig.2(i) are the reconstruction results from different basis images. The
reconstructed images also contain shadows and inter-reflections because the re-
covered basis images contain detailed reflection information. As a result, good
recognition results can be obtained.
5.2 Recognition
Recognition is performed on the Yale Face Database B [8] first. We take the
frontal images (pose 0) as the probe set, which is composed of 450 images (10
subjects, 45 images per subject). For every subject, one image is used for recov-
ering its lighting subspace and the 44 remaining images are used for recognition.
The comparison of our algorithm with the reported results is shown in Table 2.
Our algorithm reconstructed the 9PL subspace for every subject. The recov-
ered basis images also contained complicated reflections on faces, such as cast
shadows, specularities, and inter-reflection. Therefore the recovered 9PL sub-
space can give a more detailed and accurate description for images under differ-
ent lighting conditions. As a result, we can get good recognition results on images
with different lighting conditions. Also, the reported results of ’cone-cast’, ’har-
monic images-cast’ and ’9PL-real’ showed that better results can be obtained
when cast shadows were considered. Although [15,16] also use only one image to
adjust lighting conditions, they need to recover the 3D model of the face first.
The performance of our algorithm is comparable to that of these algorithms,
which are based on high-resolution rendering [15,16] and better than that of
those algorithms based on normal rendering [14]. Our algorithm is a completely
2D-based approach. Computationally, it is much less expensive compared with
those 3D based methods. The basis images of a subject can be directly com-
puted with Eq.(19), while the recognition results are comparable to those from the
3D-based methods.
Fig. 3. Recovered basis images. (a) and (b) are images in PIE database, (e) and (f)
are estimated basis images from image (a) and (b), respectively. (c) and (d) are im-
ages in AR database, (g) and (h) are estimated basis images from image (c) and (d),
respectively.
6 Conclusion
The 9PL provides a subspace which is useful for recognition and is spanned by real
images. Based on this framework, we built a statistical model for these basis im-
ages. With the MAP estimation, we can recover the basis images from one gallery
image under arbitrary lighting conditions, which could be a single light source or multiple light sources. The experimental results based on the recovered subspace are comparable to those from other algorithms that require many gallery images or the geometric information of the subjects. Even in extreme lighting con-
ditions, the recovered subspace can still appropriately represent lighting variation.
The recovered subspace retains the main characteristics of the 9PL subspace.
Based on our statistical model, we can build the lighting subspace of a sub-
ject from only one gallery image. It avoids the limitation of requiring tedious
training or complex training data, such as many gallery images or the geometric
information of the subject. Once the model has been trained, the computation for recovering the basis images is quite simple and requires no 3D
models. The proposed framework can also potentially be used to deal with pose
and lighting variations together, with training images in different poses taken
under different lighting for building the statistical model.
Acknowledgement
This work was funded by the China Postdoctoral Science Foundation (No. 20070421129).
References
1. Guo, X., Leung, M.: Face recognition using line edge map. IEEE Trans. Pattern
Analysis and Machine Intelligence 24(6), 764–799 (2002)
2. Shashua, A., Riklin-Raviv, T.: The quotient image: class-based re-rendering and recognition with varying illuminations. IEEE Trans. Pattern Analysis and Machine
Intelligence 23(2), 129–139 (2001)
3. Gross, R., Brajovic, V.: An image processing algorithm for illumination invari-
ant face recognition. In: 4th International Conference on Audio and Video Based
Biometric Person Authentication, pp. 10–18 (2003)
4. Wang, H., Li, S.Z., Wang, Y.: Generalized quotient image. In: Proc. IEEE Conf.
Computer Vision and Pattern Recognition (2004)
5. Hallinan, P.: A low-dimensional representation of human faces for arbitrary lighting
conditions. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp.
995–999 (1994)
6. Nayar, S., Murase, H.: Dimensionality of illumination in appearance matching. In:
Proc. IEEE Conf. Robotics and Automation, pp. 1326–1332 (1996)
7. Belhumeur, P., Kriegman, D.J.: What is the set of images of an object under all possible
lighting conditions? In: Proc. IEEE Conf. Computer Vision and Pattern Recogni-
tion, pp. 270–277 (1996)
8. Georghiades, A., Belhumeur, P., Kriegman, D.: From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Analysis and Machine Intelligence 23(6), 643–660 (2001)
9. Yuille, A., Snow, D., Epstein, R., Belhumeur, P.: Determining generative models of
objects under varying illumination: shape and albedo from multiple images using
SVD and integrability. International Journal of Computer Vision 35(3), 203–222
(1999)
10. Ramamoorthi, R., Hanrahan, P.: On the relationship between radiance and irra-
diance: determining the illumination from images of a convex Lambertian object. J.
Optical. Soc. Am. A 18(10), 2448–2459 (2001)
11. Basri, R., Jacobs, D.: Lambertian reflectance and linear subspaces. IEEE Trans.
Pattern Analysis and Machine Intelligence 25(2), 218–233 (2003)
12. Zhang, L., Samaras, D.: Face recognition under variable lighting using harmonic
image exemplars. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition
(2003)
13. Wen, Z., Liu, Z., Huang, T.: Face relighting with radiance environment map. In:
Proc. IEEE Conf. Computer Vision and Pattern Recognition (2003)
14. Zhang, L., Wang, S., Samaras, D.: Face synthesis and recognition from a single im-
age under arbitrary unknown lighting using a spherical harmonic basis morphable
model. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2005)
15. Lee, J., Moghaddam, B., Pfister, H., Machiraju, R.: A bilinear illumination model
for robust face recognition. In: Proc. IEEE International Conference on Computer
Vision (2005)
16. Wang, Y., Liu, Z., Hua, G., et al.: Face re-lighting from a single image under
harsh lighting conditions. In: Proc. IEEE Conf. Computer Vision and Pattern
Recognition (2007)
17. Chen, H.F., Belhumeur, P.N., Jacobs, D.W.: In search of illumination invariants.
In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (2000)
18. Zhou, S., Chellappa, R.: Illuminating light field: image-based face recognition across
illuminations and poses. In: Proc. IEEE Intl. Conf. on Automatic Face and Gesture
Recognition (May 2004)
Learning from Real Images to Model Lighting Variations for Face Images 297
19. Lee, K., Ho, J., Kriegman, D.: Acquiring linear subspaces for face recognition under
variable lighting. IEEE Trans. Pattern Analysis and Machine Intelligence 27(5),
684–698 (2005)
20. Sim, T., Kanade, T.: Combining models and exemplars for face recognition: an
illumination example. In: Proc. Of Workshop on Models versus Exemplars in Com-
puter Vision, CVPR 2001 (2001)
21. Adini, Y., Moses, Y., Ullman, S.: Face recognition: the problem of compensating
for changes in illumination directions. IEEE Trans. Pattern Analysis and Machine
Intelligence 19(7), 721–733 (1997)
22. Atkeson, C., Moore, A., Schaal, S.: Locally weighted learning. Artificial Intelli-
gence Review (1996)
23. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression (pie)
database. In: Proc. IEEE International Conference on Automatic Face and Gesture
Recognition (May 2002)
24. Martinez, A.M., Benavente, R.: The AR face database. CVC Tech. Report No.24
(1998)
25. Scharf, L.: Statistical signal processing: detection, estimation and time series anal-
ysis, p. 54. Addison-Wesley, Reading (1991)
Toward Global Minimum through Combined
Local Minima
Abstract. There are many local and greedy algorithms for energy minimization over Markov Random Fields (MRFs), such as iterated conditional modes (ICM) and various gradient descent methods. Local minimum solutions can be obtained with simple implementations and usually require less computational time than global algorithms. Also, methods such as ICM can be readily applied to various difficult problems that may involve larger-than-pairwise-clique MRFs. However, their shortcomings are evident in comparison to newer methods such as graph cut and belief propagation: the local minimum depends largely on the initial state, which is the fundamental problem of this class of methods. In this paper, the disadvantages of local minima techniques are addressed by proposing ways to combine multiple local solutions. First, multiple ICM solutions are obtained using different initial states. The solutions are then combined with a random-partitioning-based greedy algorithm called Combined Local Minima (CLM). There are numerous MRF problems that cannot be
efficiently implemented with graph cut and belief propagation, and so
by introducing ways to effectively combine local solutions, we present a
method to dramatically improve many of the pre-existing local minima
algorithms. The proposed approach is shown to be effective on pairwise
stereo MRF compared with graph cut and sequential tree re-weighted be-
lief propagation (TRW-S). Additionally, we tested our algorithm against
belief propagation (BP) over randomly generated 30×30 MRF with 2×2
clique potentials, and we experimentally illustrate CLM’s advantage over
message passing algorithms in computation complexity and performance.
1 Introduction
Recently, there has been great interest in energy minimization methods over MRFs. The pairwise MRF is currently the most prominent model and the most frequent subject of study in computer vision. Also, at the forefront, there is a movement toward 2×2 and higher clique potentials for de-noising and segmentation problems [1,2,3,4,5,6]. These works claim better performance through larger clique potentials that can impose more specific constraints.
However, conventional belief propagation, which has been so effective on the pairwise MRF, is shown to have a severe computational burden over large cliques. In factor graph belief propagation, the computational load increases exponentially as the size of the clique increases, although for linear constraint
MRFs, the calculation can be reduced to linear time [6,3]. Graph cut based methods have also been introduced for energy functions with global constraints and larger clique potentials with pair-wise elements [5,4]. However, these methods are targeted toward a specific category of energy functions, and their applicability is limited.
A practical and proven method for minimizing even the higher order MRFs
is simulated annealing. Gibbs sampler, generalized Gibbs sampler, data-driven
Markov chain Monte Carlo and Swendsen-Wang cut were respectively applied
to de-noising, texture synthesis, and segmentation problems that involved
large clique potentials [7,8,9,10]. However, simulated annealing is considered im-
practically slow compared to belief propagation and graph cut even in pairwise
MRFs [10,11]. More recently, simulated annealing has been modified by local-
ized temperature scheduling and additional window scheduling to increase its
effectiveness [12,13].
Another class of approaches that is often ignored is greedy local minimum algorithms. With the introduction of theoretically sound graph cut and belief propagation over pairwise MRFs, older methods such as ICM [14] and various gradient descent methods are often disregarded as under-performing alternatives [11]. However, methods like ICM and other local minimum algorithms do not have any constraints on the size of cliques in the MRF. Gradient descent was readily implemented over 5 × 5 and 3 × 2 clique potentials in the de-noising problem [2,1]. Texture synthesis and segmentation problems were modeled by high-order MRFs and the energy was minimized using ICM [15]. Thus, when considering both computational time and performance, local greedy methods that depend largely on the initial state are still viable for many high-order MRFs.
In this paper we propose a new algorithm to effectively combine these local
minima to obtain a solution that is closer to the global minimum state. First,
local solutions are calculated from various initial states. Then, they are com-
bined by random partitioning process such that the energy is minimized. The
proposed Combined Local Minima (CLM) approach is very simple, but it can effectively find a lower energy state than graph cut and belief propagation. CLM is tested on the pairwise stereo MRFs provided by [16,17,18,19,20], and it is shown that its performance can be better than graph cut [21] and sequential tree-reweighted belief propagation (TRW-S) [22]. We also performed tests over randomly generated 2 × 2 clique MRFs, and showed that the proposed method not only converges faster but also finds a lower energy state than belief propagation.
However, the biggest advantage of the proposed algorithm is that it can bring
further improvement over various local minima algorithms that are applicable
to general energy functions.
Section 2 reviews the ICM algorithm. Section 3 presents the proposed CLM. In the experiment section, CLM is shown to be competitive over pairwise MRFs and superior over 2 × 2 clique MRFs. The paper closes with conclusions and possible future work.
Fig. 1. (a) to (e) show ICM solutions from different initial states. Homogeneous states
of disparity 0, 5, 8, 10, and 14 are respectively used as the initial states of (a), (b),
(c), (d), and (e). Combined local minima algorithm effectively combines these ICM
solutions into a lower energy state (f).
The problem of choosing the right initial state is the big disadvantage of ICM. Figure 1 (a) to (e) show the ICM solutions for the Tsukuba stereo MRF, obtained with different initial homogeneous states. Even though the energy cannot be driven as low as with graph cut or belief propagation, the computational time is very small because the comparative inequality of step 5 can be evaluated in O(1) for most energy functions, including pairwise functions. ICM is guaranteed to converge, but the performance is very poor, as shown in Figure 1. Also, because of its simplicity, ICM can be applied to high-order MRFs with larger cliques, where graph cut and BP have problems.
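For concreteness, a minimal ICM sketch for a 4-connected pairwise grid MRF is given below. The unary cost array, the pairwise cost function, and all names are hypothetical placeholders; this is the plain coordinate-wise greedy update, not necessarily the exact implementation used in the paper.

```python
import numpy as np

def icm(unary, pairwise, init, max_iters=10):
    """Iterated conditional modes on a 4-connected grid MRF.

    unary:    array (H, W, Q) of data costs D_p(l).
    pairwise: function V(l1, l2) returning the discontinuity cost.
    init:     initial labeling, array (H, W) of ints in [0, Q).
    """
    H, W, Q = unary.shape
    x = init.copy()
    for _ in range(max_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                # cost of each candidate label at site (i, j), keeping neighbors fixed
                costs = unary[i, j].astype(float)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        costs = costs + np.array(
                            [pairwise(l, x[ni, nj]) for l in range(Q)])
                best = int(np.argmin(costs))
                if best != x[i, j]:
                    x[i, j] = best
                    changed = True
        if not changed:  # converged: no site changed its label
            break
    return x
```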
L_j is obtained from the set of the local solutions such that L_j = {l_{s_1}^j, l_{s_2}^j, l_{s_3}^j, ..., l_{s_k}^j}.
The search for the minimum energy state will be over ΩS , although there
is no guarantee that the global minimum is in the reduced space. Choosing the
right combinations of local minima for CLM will admittedly be heuristic for
each problem. More on the choices of local minima will be discussed in the later
sections. However, when a sufficient number and variety of local minima are
present in the proposed CLM, the solution space will be the original Ω.
3. Repeat for i = 1 to i = m.
4. Make k + 1 proposal states {x_0, x_1, x_2, ..., x_k} from combinations of the current state x and s_1, ..., s_k, such that the vector partition V^i of x is replaced by the V^i of the local minima states:
x_0 = x = (V_x^1, V_x^2, ..., V_x^i, ..., V_x^m),
x_j = (V_x^1, V_x^2, ..., V_{s_j}^i, ..., V_x^m),  j = 1, ..., k.
Among the set S = {x_0, x_1, ..., x_k}, take the lowest energy state as the current state.
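The following sketch illustrates the combination step described above, assuming the labelings are stored as 2D integer arrays and the vector partitions V^i are boolean masks; the energy function, array layout, and names are our own placeholders rather than the authors' code.

```python
def clm(energy, local_minima, partitions, init, num_sweeps=5):
    """Combine k local minima by greedy partition-wise replacement.

    energy:       function mapping a labeling to its scalar energy phi(x).
    local_minima: list of k labelings s_1, ..., s_k (e.g. ICM results).
    partitions:   list of boolean masks V^1, ..., V^m covering the grid.
    init:         starting labeling x.
    """
    x = init.copy()
    for _ in range(num_sweeps):
        for mask in partitions:            # i = 1, ..., m
            proposals = [x]                # x_0 = current state
            for s in local_minima:         # x_j: replace V^i of x by V^i of s_j
                xj = x.copy()
                xj[mask] = s[mask]
                proposals.append(xj)
            # keep the lowest-energy proposal as the new current state
            x = min(proposals, key=energy)
    return x
```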
The computational complexity of CLM depends largely on the complexity of evaluating ϕ(x_i). If ϕ(x_i) must be evaluated in O(N), the complexity is O(kmN). If m is randomly chosen, the worst case is m = N, and the time complexity is O(kN^2) per iteration. However, if the maximum clique size is small compared to the MRF size, both the worst-case and best-case complexity are O(kN), because only V^i and the areas around V^i need to be evaluated to find the lowest energy state in S = {x_0, x_1, ..., x_k}. Also, the complexity can be lowered further using computational techniques such as the integral image method [24].
The proposed algorithm is greedy and guarantees that the energy does not
increase for each iteration. Figure 2 shows the iterative results of the proposed
Fig. 2. (a) shows the initial state of CLM. (b), (c), (d), (e), and (f) show respectively
the first, second, third, fourth, and sixth iterations of combined local minima algorithm.
CLM over the Tsukuba stereo pair MRF. k = 16 local minima were used; a few of them are shown in Figure 1. With only a small number of iterations, CLM outputs an energy minimization result far superior to the ICM method, and with enough iterations it can be as effective as the message passing and graph cut algorithms.
However, there are two heuristics that must be resolved for CLM. First, it is
unclear how current state x and {s1 , s2 , ..., sk } should be randomly partitioned
in step 2 of the algorithm. Second, the choice of local minima and the value of k
are subject to question. These two issues are important to the performance of the
proposed algorithm, and basic guidelines are provided in the next subsections.
Thus, in order to obtain different local minima states, ICM with different homogeneous initial states was used; see the experimental section and Figures 4 and 5. In both comparison tests, the number of local minima is set to Q, the number of labels. {s_1, ..., s_Q} are obtained by ICM from homogeneous initial states with labels l_1, l_2, ..., l_Q, respectively. For both the stereo MRFs and the randomly generated MRFs, such initial states resulted in energy minimization comparable to message passing algorithms. Thus, the rule of thumb is to use Q local minima derived from the respective homogeneous initial states.
However, by increasing the number of local minima, as shown in Figure 5, much lower energy can be achieved with only an incremental addition to the computation time. In Figure 5, CLM200 minimizes energy using a total of 200 local minima, composed of the Q solutions from homogeneous initial states and 200 − Q ICM solutions obtained from random initial states. CLM200 achieves much lower energy than belief propagation. Although random initial states are used here, more adaptive initial states can also be applied for different problems.
4 Experiments
In order to show the effectiveness of the proposed CLM, we compared its performance with graph cut and TRW-S over pairwise stereo MRFs. Additionally, window annealing (WA) [13] results are included in the test. The pairwise stereo MRF is known to be effectively optimized by alpha expansion graph cut (GC) and TRW-S [21,22], but it is poorly suited to greedy algorithms such as ICM. The experiments were performed over the stereo pairs provided by [16,17,18,20,19].
Fig. 3. (a) Rectangular partition into vector partitions V^1, V^2, ..., V^m. (b) 2 × 2 clique MRF.
Recently, larger-than-pairwise clique models have also often been proposed for vision problems. Gradient descent and belief propagation have been used over 2 × 2 and larger clique MRFs to attack such problems as de-noising and shape from shading [6,1,2,3]. Thus, we tested our algorithm over randomly generated MRFs with 2 × 2 clique potentials, see Figure 5 (a). Alpha expansion algorithms cannot deal with randomly generated larger-than-pairwise MRFs, so they were excluded from this test. CLM reaches a lower energy faster than the belief propagation (BP) and WA methods. The computational complexity of the proposed method is O(kN), allowing CLM to be a practical minimization scheme over large clique MRFs. All computations were done on a 3.4 GHz desktop.
[Energy versus time plots for CLM 60, WA, Graph Cut, TRW-S, and the lower bound on four stereo pairs; see the Fig. 4 caption below.]
Fig. 4. (a) Cones uses the truncated quadratic discontinuity cost. (c) Bowling2 is the result for the truncated linear discontinuity cost. (b) Teddy and (d) Art use the Potts discontinuity cost. CLM 60 means that 60 local minima are used in the CLM algorithm. CLM's performance is shown to lie between TRW-S and GC. The performance difference from the state-of-the-art methods is very small; moreover, CLM's performance does not seem to vary strongly with the discontinuity model, as opposed to TRW-S and graph cut.
the annealing scheduling for lower minimization, we kept the same temperature
and window scheduling of [13].
For the energy function, gray-image Birchfield-Tomasi matching costs [26] and Potts, truncated linear, and truncated quadratic discontinuity costs are used.

ϕ(x) = ∑_{p∈V} D(p) + ∑_{(p,q)∈N_g} V(p, q).   (3)

D(p) is the pixel-wise matching cost between the left and right images. V(p, q) is the pairwise discontinuity cost. The implementations of graph cut and TRW-S by [11,21,27,28,29,22] are used in this experiment.
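As a concrete illustration of evaluating the energy (3), the sketch below assumes that the Birchfield-Tomasi matching costs have already been precomputed into a per-pixel cost volume and uses a Potts discontinuity cost on a 4-connected neighborhood; the array shapes, the Potts weight, and all names are our assumptions.

```python
import numpy as np

def stereo_energy(disparity, data_cost, lam=20.0):
    """Evaluate phi(x) = sum_p D(p) + sum_{(p,q) in N_g} V(p, q).

    disparity: array (H, W) of integer disparity labels.
    data_cost: array (H, W, Q) of per-pixel matching costs
               (e.g. precomputed Birchfield-Tomasi costs).
    lam:       Potts penalty for neighboring pixels with different labels.
    """
    H, W = disparity.shape
    rows, cols = np.indices((H, W))
    unary = data_cost[rows, cols, disparity].sum()
    # 4-connected Potts discontinuity costs
    horiz = lam * np.count_nonzero(disparity[:, 1:] != disparity[:, :-1])
    vert = lam * np.count_nonzero(disparity[1:, :] != disparity[:-1, :])
    return unary + horiz + vert
```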
For the implementation of CLM, Q local minima ICM solutions are obtained from the following set of initial states {(0, 0, ..., 0), (1, 1, ..., 1), ..., (Q − 1, Q − 1, ..., Q − 1)}. As mentioned before, a rule of thumb seems to be Q local minima obtained from homogeneous initial states, especially if the MRF is known to have a smoothness constraint. For the state partition technique of step 2 of CLM, a simple rectangular partitioning method is used, see Figure 3.
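One simple way to realize the rectangular partitioning of step 2 is sketched below: the grid is cut at randomly drawn row and column positions, and each resulting rectangle becomes one mask V^i. This is only an illustrative realization of the random partitioning, not necessarily the authors' exact scheme; the masks can be passed directly as the partitions argument of the CLM sketch above.

```python
import numpy as np

def random_rectangular_partition(height, width, rows=4, cols=4, rng=None):
    """Split an H x W grid into rectangles at random cut positions and
    return one boolean mask per rectangle (the partitions V^i)."""
    rng = np.random.default_rng() if rng is None else rng
    # random interior cut positions plus the image borders
    r_cuts = [0] + sorted(rng.integers(1, height, size=rows - 1).tolist()) + [height]
    c_cuts = [0] + sorted(rng.integers(1, width, size=cols - 1).tolist()) + [width]
    masks = []
    for r0, r1 in zip(r_cuts[:-1], r_cuts[1:]):
        for c0, c1 in zip(c_cuts[:-1], c_cuts[1:]):
            m = np.zeros((height, width), dtype=bool)
            m[r0:r1, c0:c1] = True
            if m.any():  # skip empty rectangles caused by duplicate cuts
                masks.append(m)
    return masks
```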
[Energy versus time plots for CLM, CLM 200, WA, ICM, and BP on randomly generated MRFs; panels (a) Q = 2, (b) Q = 3, (c) Q = 4, (d) Q = 5. See the Fig. 5 caption below.]
Fig. 5. Energy versus time results of max-product BP, ICM, WA, and CLM over 30 × 30 randomly generated MRFs. Figures (a), (b), (c), and (d) respectively have label set sizes Q = 2, Q = 3, Q = 4, and Q = 5. CLM using k = Q and k = 200 local minima is performed for each random MRF. The increase in the number of local minima allows a lower energy state to be achieved in exchange for computation time and memory. However, this price is very small compared to the computation time of BP.
Figure 4 shows energy versus time results using the Potts, truncated linear, and truncated quadratic discontinuity models. Qualitatively, there is a very small difference between TRW-S, graph cut, WA, and CLM, see Figure 6. However, the energy versus time graphs offer a more edifying comparison. The first iteration of CLM takes much longer than the other iterations because all the local solutions must be computed. The overall performance of the proposed CLM lies between graph cut and TRW-S. Compared with window annealing, CLM outperforms it everywhere except during the initial computations.
Fig. 6. This figure shows the qualitative stereo energy minimization results at roughly the same computation time. (a-1) to (a-4) are the left reference stereo images. (b-1) to (b-4) are the results of the proposed CLM. (c), (d), and (e) respectively show the results of window annealing, graph cut, and TRW-S. For each stereo pair, the same energy function is used. The qualitative differences between the four methods are very small, except for the Teddy image, where graph cut's lower energy makes a difference over the roof area of the image. Otherwise, the energy differences between the four methods are small enough to cause no visible differences.
ϕ(x) = ∑_{(p,q,r,s)∈N_g} V(p, q, r, s)   (4)
Acknowledgement
This research was supported in part by the Defense Acquisition Program Admin-
istration and Agency for Defense Development, Korea, through the Image Infor-
mation Research Center under the contract UD070007AD, and in part by the
MKE (Ministry of Knowledge Economy), Korea under the ITRC (Information
Technology Research Center) Support program supervised by the IITA (Institute
of Information Technology Advancement) (IITA-2008-C1090-0801-0018).
References
1. Roth, S., Black, M.J.: Steerable random fields. In: ICCV (2007)
2. Roth, S., Black, M.J.: Fields of experts: A framework for learning image priors. In:
CVPR (2005)
3. Potetz, B.: Efficient belief propagation for vision using linear constraint nodes. In:
CVPR (2007)
4. Kohli, P., Mudigonda, P., Torr, P.: p3 and beyond: Solving energies with higher
order cliques. In: CVPR (2007)
5. Rother, C., Kolmogorov, V., Minka, T., Blake, A.: Cosegmentation of image pairs by histogram matching - incorporating a global constraint into mrfs. In: CVPR
(2006)
6. Lan, X., Roth, S., Huttenlocher, D., Black, M.J.: Efficient belief propagation with
learned higher-order markov random fields. In: Leonardis, A., Bischof, H., Pinz, A.
(eds.) ECCV 2006. LNCS, vol. 3952, pp. 269–282. Springer, Heidelberg (2006)
7. Geman, S., Geman, D.: Stochastic relaxation, gibbs distributions, and the bayesian
restoration of images. PAMI 6 (1984)
8. Zhu, S.C., Liu, X.W., Wu, Y.N.: Exploring texture ensembles by efficient markov
chain monte carlo: Toward a trichromacy theory of texture. PAMI 22(6) (2000)
9. Tu, Z., Zhu, S.C.: Image segmentation by data-driven markov chain monte carlo.
PAMI 24 (2002)
10. Barbu, A., Zhu, S.C.: Generalizing swendsen-wang cut to sampling arbitrary pos-
terior probabilities. PAMI 27 (2005)
11. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A.,
Tappen, M., Rother, C.: A comparative study of energy minimization methods for
markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 16–29. Springer, Heidelberg (2006)
12. Woodford, O.J., Reid, I.D., Torr, P.H.S., Fitzgibbon, A.W.: Field of experts for
image-based rendering. BMVC (2006)
13. Jung, H.Y., Lee, K.M., Lee, S.U.: Window annealing over square lattice markov
random field. ECCV (2008)
14. Besag, J.: On the statistical analysis of dirty pictures (with discussion). Journal of
the Royal Statistical Society Series B 48 (1986)
15. Mignotte, M.: Nonparametric multiscale energy-based model and its application in
some imagery problems. PAMI 26 (2004)
16. http://vision.middlebury.edu/stereo/
17. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. In: IJCV (2002)
18. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light.
In: CVPR (2003)
Toward Global Minimum through Combined Local Minima 311
19. Hirshmuller, H., Szeliski, R.: Evaluation of cost functions for stereo matching. In:
CVPR (2007)
20. Scharstein, D., Pal, C.: Learning conditional random fields for stereo. In: CVPR
(2007)
21. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via
graph cuts. PAMI 23 (2001)
22. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimiza-
tion. PAMI 28 (2006)
23. Lempitsky, V., Rother, C., Blake, A.: Logcut - efficient graph cut optimization for
markov random fields. In: ICCV (2007)
24. Crow, F.: Summed-area tables for texture mapping. SIGGRAPH (1984)
25. Tappen, M.F., Freeman, W.T.: Comparison of graph cuts with belief propagation
for stereo, using identical mrf parameters. In: ICCV (2003)
26. Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sampling. PAMI 20 (1998)
27. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph
cuts? PAMI 26 (2004)
28. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow
algorithms for energy minimization in vision. PAMI 26 (2004)
29. Wainwright, M.J., Jaakkola, T.S., Willsky, A.S.: Map estimation via agreement on
trees: Message-passing and linear-programming approaches. IEEE Trans. Informa-
tion Theory 51(11) (2005)
Differential Spatial Resection - Pose Estimation
Using a Single Local Image Feature
1 Introduction
but initial start values from as little data as possible, suitable for further pro-
cessing. For instance, when dealing with small or hand-clicked data sets or when
RANSAC-like estimators[6] are used, it is often desirable to obtain a minimal
solution, which requires as little of the data as possible. In RANSAC, the prob-
ability of picking an all-inlier-set from correspondences with many mismatches
depends exponentially on the number of samples required to construct a solution
hypothesis. Using our novel approach, it is now possible to obtain an estimate of a camera's pose from as little as one feature, e.g. an MSER [21] or a comparable feature (cf. [24] for a discussion), or one suitable photogrammetric ground control point (cf. [22], p. 1111), in an image, given the local plane in 3D space where it is located and its texture.
For instance, when a feature descriptor is recognized in an unknown image, the
6 DOF camera or object pose can be obtained by the methods given here. To im-
prove the pose estimation result, gradient based optimization techniques[19,16]
can be applied between the current view and a reference texture. The reference
texture can either be an orthophoto (cf. to [22], p.758) or any other view with
sufficient resolution for which the warp to an orthophoto is known. When sev-
eral such feature correspondences and the camera poses are optimized at once,
this is similar to the approach of Jin et al.[11]. However, their approach is for-
mulated in a nonlinear fashion only and requires an initialization, comparable
to the requirements for bundle adjustment. Since we exploit the perspectiv-
ity concept, a plane-to-plane mapping in Euclidean space, in section 3 we also
present the related work in homography estimation [33,13,10] and projective
reconstruction[27], which did not inspect the differential constraints on the per-
spectivity, because often the calibrated camera case is not considered in projec-
tive approaches. The exploitation of the Jacobian of the texture warp has been
proposed though for the estimation of a conjugate rotation in [15].
Notation. To improve the readability of the equations we use the following
notation: Boldface italic serif letters x denote Euclidean vectors while boldface
upright serif letters x denote homogeneous vectors. For matrices we do not use
serifs, so that Euclidean matrices are denoted as A and homogeneous matrices
are denoted as A, while functions H [x] appear in typewriter font.
2 Perspectivity
The contribution is based on estimating a transformation between two theoreti-
cal planes: The first plane is tangent to a textured surface in 3D and the second
plane is orthogonal to the optical axis of a camera. The estimation of the pose is
then formulated as the problem of obtaining a perspectivity between these two
planes (see figure 1). A 2D perspectivity is a special kind of homography (cf.
also to [9], pp. 34), which has only 6 degrees of freedom and which is particularly
important for mappings between planes in Euclidean space. We assume a locally
planar geometry at the origin of 3D space facing into z-direction and attach
x, y-coordinates onto it, which coincide with the x, y coordinates in 3D space.
If we now move a perspective pinhole camera to position C with orientation R
In the next section we describe the differential correspondence and how it can
be exploited to obtain constraints on H.
3 Differential Correspondence
Progress in robust local features (cf. to [24,23] for a thorough discussion) al-
lows automatic matching of images in which appearance of local regions un-
dergoes approximately affine changes of brightness and/or of shape, e.g. for
automated panorama generation[1], scene reconstruction[30] or wide-baseline
matching[18,21]. The idea is that interesting features are detected in each im-
age and that the surrounding region of each feature is normalized with respect
to the local image structure in this region, leading to about the same normal-
ized regions for correspondences in different images, which can be exploited for
matching. The concatenation of the normalizations provides affine correspon-
dences between different views, i.e. not only a point-to-point relation but also
a relative transformation of the local region (e.g. scale, shear or rotation). Al-
though such correspondences carry more information than the traditional point
perspectivity. This derivative ∂H/∂p tells us something about the relative scaling of coordinates between the plane in the origin and the image, e.g. if C is large and the camera is far away from the origin, ∂H/∂p will be small, because a large step on the origin plane will result in a small step in the image far away. Actually, ∂H/∂p carries information about rotation, scale and shear through perspective effects. Since the matrix H can be scaled arbitrarily without changing the mapping H, we set H3,3 = 1 without loss of generality¹ and compute the derivative at the origin:
∂H/∂p |_0 = [ r̃11 − r̃13 t1   r̃12 − r̃13 t1 ;  r̃21 − r̃23 t2   r̃22 − r̃23 t2 ] = [ a11 a12 ; a21 a22 ]   (7)

r̃11² + r̃12² + r̃13² = r̃21² + r̃22² + r̃23²   ∧   r̃1ᵀ r̃2 = 0   (9)
We can now compute H by first substituting t into eq. (7), then solving for
r̃11 , r̃21 , r̃12 and r̃22 and substituting into eq.(9), leaving us with two quadratic
equations in the two unknowns r̃13 and r̃23 :
(r̃13 t1 + a11 )(r̃23 t2 + a21 ) + (r̃13 t1 + a12 )(r̃23 t2 + a22 ) + r̃13 r̃23 = 0 (11)
The first equation is about the length and the second about the orthogonality of the r̃-vectors, as is typical for constraints on rotation matrices. We find it instructive to interpret them as the intersection problem of two planar conics, the length conic Cl and the orthogonality conic Co:

Co = [ 0,  t1 t2 + 1/2,  (1/2)(a21 + a22) t1 ;  t1 t2 + 1/2,  0,  (1/2)(a11 + a12) t2 ;  (1/2)(a21 + a22) t1,  (1/2)(a11 + a12) t2,  a11 a21 + a12 a22 ]   (15)
¹ This is not a restriction because the only unrepresented value H3,3 = 0 maps the origin to the line at infinity and therefore such a feature would not be visible.
Solving for the Pose Parameters. Two conics cannot have more than four
intersection points, therefore, we can obtain at most four solutions for our camera
pose. To solve the intersection of the two conics we use the elegant method of
Finsterwalder and Scheufele[5], which proved also to be the numerically most
stable method of the six different 3-point algorithms for spatial resection [8]:
Since a common solution of equations (12) and (13) must also fulfill any linear
combination of both, we construct a linear combination of both conics, which
does not have full rank (zero determinant), but which still holds all solutions.
This creates a third order polynomial, which has at least one real root and which
can be solved easily:
det(λCo + (1 − λ)Cl ) = 0 (16)
The resulting degenerate conic will in general consist of two lines. The inter-
section of these lines with the original conics is only a quadratic equation and
determines the solutions. The resulting R and C have to be selected and normal-
ized in such a way that we obtain an orthonormal rotation matrix (determinant
+1) and the camera looks towards the plane. We have now obtained up to four
hypotheses for the pose of the camera in the object coordinate system (relative
to the feature). If there is a world coordinate system, in which the plane is not at
the origin, the rigid world transformation has to be appended to the computed
pose of the camera. Computing the relative pose in the object coordinate system
in general also improves conditioning since the absolute numbers of the object’s
pose in the world become irrelevant.
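A rough numerical sketch of this pencil-based intersection is given below: it finds a real root of det(λCo + (1 − λ)Cl) = 0, splits the resulting degenerate conic into two lines, and intersects each line with one of the original conics. The helper routines, tolerances, and complex-number handling are our own, degenerate special cases are ignored, and this is not the authors' implementation.

```python
import numpy as np

def _cross_matrix(p):
    # [p]_x such that _cross_matrix(p) @ q == cross(p, q)
    return np.array([[0, -p[2], p[1]],
                     [p[2], 0, -p[0]],
                     [-p[1], p[0], 0]], dtype=complex)

def _adjugate(M):
    # transposed cofactor matrix of a 3x3 matrix
    A = np.empty((3, 3), dtype=complex)
    for i in range(3):
        for j in range(3):
            minor = np.delete(np.delete(M, i, 0), j, 1)
            A[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return A.T

def _split_degenerate_conic(D):
    # write a rank-2 conic as a pair of lines g, h with D ~ g h^T + h g^T
    B = _adjugate(D)
    i = int(np.argmax(np.abs(np.diag(B))))
    beta = np.sqrt(-B[i, i])
    p = B[:, i] / beta if abs(beta) > 1e-12 else B[:, i]
    C = D + _cross_matrix(p)
    r, c = np.unravel_index(int(np.argmax(np.abs(C))), C.shape)
    return C[r, :], C[:, c]

def _intersect_line_conic(l, A):
    # intersection points of line l with conic A (quadratic equation)
    M = _cross_matrix(l)
    B = M.T @ A @ M
    i = int(np.argmax(np.abs(l)))
    sub = np.delete(np.delete(B, i, 0), i, 1)
    alpha = np.sqrt(-np.linalg.det(sub)) / l[i]
    C = B + alpha * M
    r, c = np.unravel_index(int(np.argmax(np.abs(C))), C.shape)
    return [C[r, :], C[:, c]]

def intersect_conics(Cl, Co, tol=1e-7):
    """Up to four intersection points of two conics via a degenerate
    member of their pencil det(lam*Co + (1 - lam)*Cl) = 0."""
    ts = np.array([0.0, 1.0, 2.0, 3.0])
    dets = [np.linalg.det(t * Co + (1.0 - t) * Cl) for t in ts]
    coeffs = np.polyfit(ts, dets, 3)        # exact cubic through 4 samples
    lam = min(np.roots(coeffs), key=lambda r: abs(r.imag)).real
    D = (lam * Co + (1.0 - lam) * Cl).astype(complex)
    points = []
    for line in _split_degenerate_conic(D):
        for x in _intersect_line_conic(line, Cl.astype(complex)):
            if np.max(np.abs(x.imag)) < tol * max(1.0, np.max(np.abs(x.real))):
                points.append(x.real / x.real[np.argmax(np.abs(x.real))])
    return points
```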
CS on the space plane of the previous section maps to a conic CI in the image
with the equation
CI = HT CS H, (17)
where H is the perspectivity of the previous sections. First, we show how the
two primitives used in our differential correspondence can be related to conic
representations: For each affine feature, e.g. MSER, there exists a local image
coordinate system, the local affine frame[2], such that coordinates can be speci-
fied relative to the size, shear, position and orientation of a feature. Imagine that
L takes (projective) points from local feature coordinates to image coordinates:
xI = LxLAF (18)
If the same feature is seen in two images, points with identical feature (LAF)
coordinates will have the same grey value. The local affine frames of the features
in the different images are then called L1 and L2 and their concatenation is the
first order Taylor approximation HTaylor of the texture warp (e.g. a homography)
between the two images at the feature positions:
HTaylor = L1 L2^{-1}   (19)
If we now just think of a single image and imagine a small ellipse through
the points (0; λ)T ,(λ; 0)T ,(0; −λ)T and (−λ; 0)T of the local feature coordinate
system, this ellipse can be represented by a conic equation in homogeneous co-
ordinates such that points at the ellipse contour fulfill the quadratic constraint:
0 = x_LAFᵀ diag(1, 1, −λ²) x_LAF   (20)
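The following sketch collects the conic bookkeeping from Eqs. (17)-(20): the small test ellipse in LAF coordinates, its expression in image coordinates given a local affine frame L, and its transfer between views by the first-order warp. Function names and direction conventions are our own assumptions.

```python
import numpy as np

def laf_ellipse_conic(lam=1.0):
    """Conic matrix of the small test ellipse in local-affine-frame (LAF)
    coordinates, i.e. the quadratic constraint x^T diag(1, 1, -lam^2) x = 0 of (20)."""
    return np.diag([1.0, 1.0, -lam ** 2])

def conic_in_image(C_laf, L):
    """Express a conic given in LAF coordinates in image coordinates, assuming
    x_image = L @ x_laf as in (18); points map with L, conics with L^{-T} C L^{-1}."""
    L_inv = np.linalg.inv(L)
    return L_inv.T @ C_laf @ L_inv

def conic_between_views(C_1, H_taylor):
    """Transfer a conic from view 1 to view 2 with the first-order warp
    H_taylor = L1 @ inv(L2) of (19): since x_1 = H_taylor @ x_2, the conic
    transforms as C_2 = H_taylor^T @ C_1 @ H_taylor, cf. (17)."""
    return H_taylor.T @ C_1 @ H_taylor
```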
5 Evaluation
In this section the differential correspondence-based pose estimation is evaluated
first using synthetic sensitivity experiments. Next, rendered images with known
ground truth information are used to evaluate the real-world applicability, where
everything has to be computed from image data. In the final experiments, object
pose estimation from one feature is shown qualitatively using non-ideal cameras.
corners of the local patch. On the other hand, when the 3 individual 3D points
of Grunert’s solution approach each other, the standard spatial resection can
become unstable, because it is based on the difference of the distances to the 3
points. To overcome this issue, Kyle [17] proposed an approximate initial guess
Fig. 3. Camera Pose From Noisy Images. A ground plane has been textured with
an aerial image serving as an orthophoto and a series of 40 views have been rendered
with different levels of noise (upper row: sample views with low noise). A reference
MSER feature with orientation has been chosen in the orthophoto. This feature is
then detected in the other views and refined using a simple 6-parametric affine warp
(see ellipses in bottom left image) according to [16] based upon a half window size of
10 pixels. From such differential correspondences, the camera pose is estimated and
compared against the known ground truth value as explained earlier. Whenever the
error was above 20◦ or the algorithm did not come up with a solution a failure was
recorded. The bottom right graph shows the average pose errors as a function of the added image noise. When adding much more image noise, the MSER detector is no
longer able to find the feature. This experiment is particularly interesting because it
shows that the concept does still work when the ellipse is not infinitely small.
for narrow angle images, which is the same as the POS (Pose from Orthography
and Scaling) in the POSIT[4] algorithm: Both require 4 non-coplanar points.
For the POSIT algorithm however, there exists also a planar variant[25], which
copes with planar 3D points.
Therefore we compare our novel algorithm (well-suited for small solid an-
gles) to the spatial resection[7,8] implemented as proposed in the Manual of
Photogrametry[22, pp. 786] and the planar POSIT[25] algorithm kindly pro-
vided on the author's homepage, which are both designed for larger solid angles.
We vary the size of a local square image patch from ten to several hundred
pixels and use the corners as individual 2D-3D correspondences in the existing
algorithms. For our new method the patch corner points are used to compute
a virtual local affine transform which approximates the required Jacobian. An
evaluation of the quality of the approximation can be seen in the bottom right
of fig. 5, which shows that for small solid angles the novel solution outper-
forms spatial resection, while for large solid angles - as expected - the affine
Fig. 4. Object Pose Estimation from a Single Feature. This figure shows that
in a real camera with radial distortion object pose estimation is possible from a single
feature. The orthophoto of the object is displayed in the right image with the local fea-
ture region enlarged. The two left images show cluttered views with the object partially
occluded. The “M” has been detected using MSER and refined, the resulting object
poses from this single differential correspondence are then displayed by augmenting a
contour model (white).
approximation is not suitable. It is, however, still better on average than the or-
thographic approximation in the planar POSIT algorithm. Particularly, when
the solid angle approaches zero, the error in the novel solution tends to zero,
while for the other algorithms no solution can be obtained or the best solution
is worse than the robust error threshold of 10◦ .
Normal or Pose Error of the Local Plane. An error of the normal of the
3D reference plane, for which the orthophoto exists, or an error of the pose of this plane cannot be detected within the algorithm. The pose is computed relative
to this plane and an error of the plane in global coordinates will consequently
result in a relative error of the camera pose in global coordinates.
6 Conclusion
A method for estimating a camera pose based upon a single local image feature
has been proposed which exploits the often readily available local affine warp
between two images. This differential correspondence provides more constraints
than a point or a conic and can be used easily in calibrated cameras even if they
deviate from the linear projection model. The algorithm proved to be stable
under several kinds of disturbance and can also be applied when the 3 individual
3D points of a general spatial resection problem come very close because the
novel formulation avoids directly computing the 3 distances, which can lead to
numerical difficulties in practice. Another benefit of the novel minimal solution is that it now allows computing the pose from a single image-model match of common robust features, which could reduce RANSAC complexity compared
to the previously required set of 3 correspondences.
References
1. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant fea-
tures. International Journal of Computer Vision 74(1), 59–73 (2007)
2. Chum, O., Matas, J., Obdrzalek, S.: Epipolar geometry from three correspon-
dences. In: Computer Vision Winter Workshop, Prague, pp. 83–88 (2003)
3. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: Monoslam: Real-time sin-
gle camera slam. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 29(6), 1052–1067 (2007)
4. DeMenthon, D., Davis, L.S.: Model-based object pose in 25 lines of code. Interna-
tional Journal of Computer Vision 15, 123–141 (1995)
5. Finsterwalder, S., Scheufele, W.: Das Rückwärtseinschneiden im Raum. In: Bay-
erische, K., der Wissenschaften, A. (eds.) Sitzungsberichte der mathematisch-
physikalischen Klasse, vol. 23/4, pp. 591–614 (1903)
6. Fischler, M., Bolles, R.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communi-
cations of the ACM 24(6), 381–395 (1981)
7. Grunert, J.A.: Das Pothenot’sche Problem, in erweiterter Gestalt; nebst Bemerkun-
gen über seine Anwendung in der Geodäsie. In: Archiv der Mathematik und Physik,
vol. 1, pp. 238–248, Greifswald. Verlag C.A. Koch (1841)
8. Haralick, B., Lee, C., Ottenberg, K., Nölle, M.: Review and analysis of solutions
of the three point perspective pose estimation problem. International Journal of
Computer Vision 13(3), 331–356 (1994)
9. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge University Press, Cambridge (2004)
10. Irani, M., Rousso, B., Peleg, S.: Recovery of ego-motion using region alignment.
Transact. on Pattern Analysis and Machine Intelligence 19(3), 268–272 (1997)
11. Jin, H., Favaro, P., Soatto, S.: A semi-direct approach to structure from motion.
The Visual Computer 19(6), 377–394 (2003)
12. Kahl, F., Heyden, A.: Using conic correspondence in two images to estimate the
epipolar geometry. In: Proceedings of ICCV, pp. 761–766 (1998)
13. Kähler, O., Denzler, J.: Rigid motion constraints for tracking planar objects. In:
Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp.
102–111. Springer, Heidelberg (2007)
Differential Spatial Resection 325
14. Kannala, J., Salo, M., Heikkila, J.: Algorithms for computing a planar homography
from conics in correspondence. In: Proceedings of BMVC 2006 (2006)
15. Koeser, K., Beder, C., Koch, R.: Conjugate rotation: Parameterization and esti-
mation from an affine feature correspondence. In: Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2008)
16. Koeser, K., Koch, R.: Exploiting uncertainty propagation in gradient-based image
registration. In: Proc. of BMVC 2008 (to appear, 2008)
17. Kyle, S.: Using parallel projection mathematics to orient an object relative to a
single image. The Photogrammetric Record 19, 38–50 (2004)
18. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision 60(2), 91–110 (2004)
19. Lucas, B.D., Kanade, T.: An iterative image registration technique with an appli-
cation to stereo vision. In: IJCAI 1981, pp. 674–679 (1981)
20. De Ma, S.: Conics-based stereo, motion estimation, and pose determination. Inter-
national Journal of Computer Vision 10(1), 7–25 (1993)
21. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from
maximally stable extremal regions. In: Proceedings of BMVC 2002 (2002)
22. McGlone, J.C. (ed.): Manual of Photogrammetry, 5th edn. ASPRS (2004)
23. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. Trans-
act. on Pattern Analysis and Machine Intell. 27(10), 1615–1630 (2005)
24. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffal-
itzky, F., Kadir, T., van Gool, L.: A comparison of affine region detectors. Inter-
national Journal of Computer Vision 65(1-2), 43–72 (2005)
25. Oberkampf, D., DeMenthon, D., Davis, L.S.: Iterative pose estimation using copla-
nar feature points. CVGIP 63(3) (1996)
26. Riggi, F., Toews, M., Arbel, T.: Fundamental matrix estimation via TIP - transfer
of invariant parameters. In: Proceedings of the 18th International Conference on
Pattern Recognition, Hong Kong, August 2006, pp. 21–24 (2006)
27. Rothganger, F., Lazebnik, S., Schmid, C., Ponce, J.: Segmenting, modeling, and
matching video clips containing multiple moving objects. IEEE Transactions on
Pattern Analysis and Machine Intelligence 29(3), 477–491 (2007)
28. Schmid, C., Zisserman, A.: The geometry and matching of lines and curves over
multiple views. International Journal of Computer Vision 40(3), 199–234 (2000)
29. Se, S., Lowe, D.G., Little, J.: Vision-based global localization and mapping for
mobile robots. IEEE Transactions on Robotics 21(3), 364–375 (2005)
30. Skrypnyk, I., Lowe, D.G.: Scene modelling, recognition and tracking with invari-
ant image features. In: IEEE and ACM International Symposium on Mixed and
Augmented Reality, pp. 110–119 (2004)
31. Thompson, E.H.: Space resection: Failure cases. The Photogrammetric
Record 5(27), 201–207 (1966)
32. Williams, B., Klein, G., Reid, I.: Real-time slam relocalisation. In: Proceedings of
ICCV, Rio de Janeiro, Brazil, pp. 1–8 (2007)
33. Zelnik-Manor, L., Irani, M.: Multiview constraints on homographies. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 24(2), 214–223 (2002)
Riemannian Anisotropic Diffusion for Tensor Valued
Images
Abstract. Tensor valued images, for instance originating from diffusion tensor
magnetic resonance imaging (DT-MRI), have become more and more important
over the last couple of years. Due to the nonlinear structure of such data it is
nontrivial to adapt well-established image processing techniques to them. In this
contribution we derive anisotropic diffusion equations for tensor-valued images
based on the intrinsic Riemannian geometric structure of the space of symmetric positive definite tensors. In contrast to anisotropic diffusion approaches proposed so far, which are based on the Euclidean metric, our approach considers the nonlin-
ear structure of positive definite tensors by means of the intrinsic Riemannian
metric. Together with an intrinsic numerical scheme our approach overcomes
a main drawback of former proposed anisotropic diffusion approaches, the so-
called eigenvalue swelling effect. Experiments on synthetic data as well as real
DT-MRI data demonstrate the value of a sound differential geometric formula-
tion of diffusion processes for tensor valued data.
1 Introduction
In this paper anisotropic diffusion driven by a diffusion tensor is adapted to tensor-
valued data in a way respecting the Riemannian geometry of the data structure. Nonlin-
ear diffusion has become a widely used technique with a well understood theory (see
e.g. [1,2] for overviews). It was introduced in [3] and has been frequently applied to
scalar-, color- or vector-valued data. Anisotropic diffusion¹ driven by a diffusion ten-
sor [2] is the most general form of diffusion processes. Tensor-valued data frequently
occur in image processing, e.g. covariance matrices or structure tensors in optical flow
estimation (see e.g. [4]). Due to rapid technological developments in magnetic reso-
nance imaging (MRI) also interest in tensor-valued measurement data increases. Due
to the increasing need of processing tensor valued data, the development of appropri-
ate regularization techniques become more and more important (e.g. see [5,6,7,8] and
[9] as well as references therein). Riemannian geometry refers to the fact that the set
of positive definite tensors P (n) of size n does not form a vector space but a nonlin-
ear manifold embedded in the vector space of all symmetric matrices. The nonlinear
¹ Please note that the term 'anisotropic diffusion' is not uniquely defined in the literature. In this contribution we use the term in accordance with the definition given in [2].
structure of P(n) has been studied from a differential geometric point of view for a long time [10]. Due to the nonlinear structure of P(n), well-established image processing techniques for scalar and vector valued data might destroy the positive definiteness of the tensors. Approaches for processing tensor valued images can be classified into two groups: those using an extrinsic [5,11,12,13,14] or an intrinsic view [15,16,17,18,19,20,21,7,22]. Methods using the extrinsic point of view consider the space of positive definite symmetric tensors as embedded in the space of all symmetric tensors, which constitutes a vector space. Distances, as e.g. required for derivatives, are computed with respect to the flat Euclidean metric of the space of symmetric matrices. To keep tensors on the manifold of positive definite tensors, solutions are projected back onto the manifold [5], selected only on the manifold in a stochastic sampling approach [11], or processing is restricted to operations not leading out of P(n), e.g. convex filters [12,13,14]. Although the tensors then stay positive definite, the use of a flat metric is not appropriate for dealing with P(n). For instance in regularization, the processed tensors become deformed when using the flat Euclidean metric [7], an artifact known as the eigenvalue swelling effect [5,6,7,8].
Tschumperlé and Deriche [5] avoid the eigenvalue swelling effect by applying a spectral decomposition and regularizing eigenvalues and eigenvectors separately. Chefd'hotel et al. [6] proposed to take the metric of the underlying manifold for deriving evolution equations from energy functionals that intrinsically fulfill the constraints upon them (e.g. rank or eigenvalue preserving), as well as for the numerical solution scheme. However, they consider the Euclidean metric for measuring distances between tensors, such that their methods suffer from the eigenvalue swelling effect for some of the proposed evolution equations. Methods using the intrinsic point of view consider P(n) as a Riemannian symmetric space (see [23] and Sect. 3 for an introduction to symmetric Riemannian spaces) equipped with an affine invariant metric on the tangent space at each point. Consequently, using this metric the eigenvalue swelling effect is avoided. The symmetry property of the Riemannian manifold easily allows one to define evolution equations on the tangent spaces, approximate derivatives by tangent vectors, and construct intrinsic gradient descent schemes, as we will show for anisotropic diffusion in the following.
first introduced in the current paper. A computationally more efficient approach than the framework of Pennec et al. [7], based on the so-called log-Euclidean metric, has been introduced in [28]. There, the positive definite tensors are mapped onto the space of
symmetric matrices by means of the matrix logarithmic map. In this new space com-
mon vector valued approaches can be applied. The final result is obtained by mapping
the transformed symmetric matrices back onto the space of positive definite matrices
using the matrix exponential map. However, the log-Euclidean metric is not affine in-
variant. As a consequence the approach might suffer from a change of coordinates. However, the formulation of anisotropic diffusion for tensor valued data based on the log-Euclidean metric might be a computationally efficient alternative not proposed in the literature so far. In [22,29] a Riemannian framework based on local coordinates has been
proposed (see also [30] for a variational framework for general manifolds). Although the authors of [22,29] consider the affine invariant metric, their approach may only be classified as intrinsic in a continuous formulation. For computing discrete data, a simple finite difference approximation is applied. Inferring from a continuous formulation without proof to a discrete approximation can be misleading, as constraints holding in the continuous case may be relaxed by discretization. As a consequence, the proposed approaches do not necessarily preserve positive definiteness of tensors (for a detailed dis-
cussion of this topic for scalar valued signals we refer to [2]). Furthermore, the approach
of [29] shows no significant difference from the log-Euclidean framework, whereas our approach clearly outperforms it. We refer to our approach as the full intrinsic scheme
in order to distinguish it from schemes that are only intrinsic in the continuous setting.
Anisotropic diffusion based on an extrinsic view [12,31] and by means of the exponential map [6] has been proposed. In both cases the Euclidean metric is used to measure
distances between tensors. As a consequence, both approaches suffer from the eigen-
value swelling effect.
Our contribution. We derive an intrinsic anisotropic diffusion equation for the mani-
fold of positive definite tensors. To this end, second order derivatives in the continuous setting as well as discrete approximations are derived, as they occur in the anisotropic diffusion equation. The derived numerical scheme could also be used to generalize other PDEs involving mixed derivatives from scalar valued images to the manifold P(n) without the need for local coordinates. In the experimental part, we provide a study in which we compare different state-of-the-art regularization approaches with our approach.
with initial condition f = u(x, 0) and diffusion tensor D with components dij . Note
that we could also formulate the image restoration task as the solution of a diffusion-reaction equation by adding a data-dependent term to (1). We will discuss the pure diffusion process only; all of the following results remain valid for a formulation with data-dependent reaction terms. The diffusion equation can be reformulated by applying the chain rule in the form ∂t u = ∑_{i,j} [(∂i dij)(∂j u) + dij ∂i ∂j u], which will be more convenient for the formulation on tensor valued data. The diffusion process can be classified according to the diffusion tensor D. If the diffusion tensor does not depend upon the evolving image, the diffusion process is denoted as linear, due to the linearity of (1); otherwise it is termed nonlinear. The diffusion process can furthermore be classified as isotropic when the diffusion tensor is proportional to the identity matrix; otherwise it is denoted as
anisotropic. Except for the nonlinear anisotropic diffusion scheme, the diffusion equa-
tion can be derived from a corresponding energy functional E(u) via calculus of varia-
tion, i.e. the gradient descent scheme of these energy functionals can be identified with
a diffusion equation. Let L(u) denote the energy density such that E(u) = ∫ L(u) dx, let w : R^N → R be a test function, and let ε be a real-valued variable. The functional derivative δE := d E(u + εw)/dε |_{ε=0} of an energy functional E(u) can be written as

δE = ∫ ⟨∇L(u), w⟩_u dx ,   (2)

where ∇L(u) defines the gradient of the energy density and ⟨∇L(u), w⟩_u denotes the scalar product of the energy density gradient ∇L(u) and the test function evaluated at
x. Note that w as well as ∇L(u) are elements of the tangent space at u which is the
Euclidean space itself for scalar valued images. As we will see in Sect. 4, this formu-
lation allows a direct generalization to the space of symmetric positive definite tensors.
The gradient descent scheme of the energy functional leads to the diffusion equation in
terms of the energy density
∂t u = −∇L(u) . (3)
Let us now consider the linear anisotropic diffusion equation (1), i.e. D not depending
on the evolving signal. The corresponding energy function is known to be
E(u) = (1/2) ∫ ∇uᵀ D ∇u dx .   (4)
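To fix ideas before moving to tensor-valued data, the sketch below performs one explicit Euler step of scalar anisotropic diffusion ∂t u = div(D ∇u) with a spatially varying symmetric diffusion tensor, using central differences and periodic boundaries. It is only illustrative; step-size restrictions and the discretizations discussed in [2] are ignored.

```python
import numpy as np

def diffusion_step(u, d11, d12, d22, dt=0.1):
    """One explicit step of du/dt = div(D grad u) for a scalar image u, with a
    spatially varying symmetric diffusion tensor D = [[d11, d12], [d12, d22]]
    given as three arrays of the same shape as u."""
    def dx(f):  # central difference along axis 1, periodic via np.roll
        return 0.5 * (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1))
    def dy(f):  # central difference along axis 0, periodic via np.roll
        return 0.5 * (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0))
    # flux j = D grad u, then the divergence of the flux
    jx = d11 * dx(u) + d12 * dy(u)
    jy = d12 * dx(u) + d22 * dy(u)
    return u + dt * (dx(jx) + dy(jy))
```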
In the following we review the structure of the space of positive definite tensors P (n)
and introduce the differential geometric tools necessary for deriving anisotropic diffu-
sion equations for P (n). By introducing a basis, any tensor can be identified with its
corresponding matrix representation A ∈ R^{n×n}. The space of n × n matrices constitutes a vector space equipped with the scalar product ⟨A, B⟩ = Tr(Aᵀ B), inducing the norm ||A|| = √⟨A, A⟩. However, tensors Σ frequently occurring in computer vi-
sion and image processing applications, e.g. covariance matrices and DT-MRI tensors,
embody further structure on the space of tensors: they are symmetric, Σ^T = Σ, and positive definite, i.e. x^T Σ x > 0 holds for all nonzero x ∈ ℝ^n. The approach to anisotropic diffusion presented here measures distances between tensors by the length of the shortest path, the geodesic, with respect to the GL(n) (affine) invariant Riemannian metric on P(n). This metric takes the nonlinear structure of P(n) into account and has demonstrated its superiority over the flat Euclidean metric in several other applications [17,18,20,21,7,22]. Such an intrinsic treatment requires the formulation of P(n) as
a Riemannian manifold, i.e. each tangent space is equipped with an inner product that
smoothly varies from point to point. A geodesic Γ_X(t), parameterized by the 'time' t and going through the tensor Γ_X(0) = Σ at time t = 0, is uniquely defined by its tangent vector X at Σ. This allows one to describe each geodesic by a mapping from the subspace A = {tX, t ∈ ℝ} spanned by the tangent vector onto the manifold P(n). The
GL(n) invariant metric is induced by the scalar product

⟨W1, W2⟩_Σ = Tr( Σ^{−1/2} W1 Σ^{−1} W2 Σ^{−1/2} ) ,   (6)

as one can easily verify. The GL(n) invariant metric allows one to derive an expression for the geodesic going through Σ with tangent vector X [7]

Γ_Σ(t) = Σ^{1/2} exp( t Σ^{−1/2} X Σ^{−1/2} ) Σ^{1/2} .   (7)

For t = 1 this map is denoted as the exponential map, which is one-to-one in the case of the space of positive definite tensors. Its inverse, denoted as the logarithmic map, reads

X = Σ^{1/2} log( Σ^{−1/2} Γ_Σ(1) Σ^{−1/2} ) Σ^{1/2} .   (8)
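The exponential and logarithmic maps (7) and (8) translate directly into code; the following is a minimal sketch assuming SciPy's matrix functions (the function names are ours, not from the paper):

```python
import numpy as np
from scipy.linalg import expm, logm

def _sqrt_and_inv_sqrt(Sigma):
    """Matrix square root and its inverse of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(Sigma)
    return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

def exp_map(Sigma, X, t=1.0):
    """Point Gamma_Sigma(t) on the geodesic through Sigma with tangent vector X, eq. (7)."""
    S, S_inv = _sqrt_and_inv_sqrt(Sigma)
    return S @ expm(t * S_inv @ X @ S_inv) @ S

def log_map(Sigma, Gamma):
    """Tangent vector at Sigma pointing towards Gamma, eq. (8)."""
    S, S_inv = _sqrt_and_inv_sqrt(Sigma)
    return S @ np.real(logm(S_inv @ Gamma @ S_inv)) @ S
```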
As the gradient ∇L of any energy density is an element of the tangent space [33], we can formulate a diffusion process as ∂t Σ = −∇L on the tangent space. The evolution of the tensor Σ is obtained by going a small step in the negative direction of the gradient, −dt ∇L, and mapping this point back onto the manifold using the geodesic equation (7). The energy density gradient is then computed at Γ_Σ(dt), which in turn can be used for finding the next tensor in the evolution scheme as described above. This gradient descent approach, denoted as the geodesic marching scheme, is defined for energy densities on P(n) and by construction ensures that we cannot leave the manifold.
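A single geodesic marching step then amounts to evaluating the exponential map at the scaled negative gradient; a sketch reusing exp_map from above (the computation of ∇L itself is assumed):

```python
def geodesic_marching_step(Sigma, grad_L, dt):
    """Walk -dt*grad_L in the tangent space at Sigma and map back onto P(n)
    with the exponential map from the sketch above."""
    return exp_map(Sigma, -dt * grad_L)
```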
Inserting this energy density in (3) results in the desired diffusion equation. Using the identity ∂i Σ^{−1} = −Σ^{−1} (∂i Σ) Σ^{−1}, the energy density gradient can be simplified to

∇L = −2 ∑_{i,j} dij [ ∂i ∂j Σ − (∂i Σ) Σ^{−1} (∂j Σ) ] − 2 ∑_{i,j} (∂i dij)(∂j Σ) .   (16)

The terms on the right side of (16) for which i = j, Δi Σ = ∂i² Σ − (∂i Σ) Σ^{−1} (∂i Σ), are components of the Laplace–Beltrami operator Δ = ∑_i Δi derived in [7]. In addition to the work in [20,7], we also derive the mixed components

Δij Σ = ∂i ∂j Σ − (∂i Σ) Σ^{−1} (∂j Σ),   i ≠ j ,   (17)
needed for the linear anisotropic diffusion equation. The nonlinear anisotropic diffusion equation is defined by exchanging the diffusion tensor components in (4) with components depending on the evolved tensor field. So we have all components to define an anisotropic diffusion equation on the space of positive definite matrices in an intrinsic way. To this end, only the second order derivatives ∂i² and ∂i ∂j occurring in (1) need to be replaced by their intrinsic counterparts Δi and Δij. So far we have not specified the explicit form of the diffusion tensor, which we do now. We generalize the structure tensor to the nonlinear space and afterwards, as in the case of scalar valued images, construct the diffusion tensor from the spectral decomposition of the structure tensor. Let ∇Σ = (∂1 Σ, ..., ∂N Σ)^T denote the gradient and a a unit vector in ℝ^N such that we can express the derivative in direction a as ∂a = a^T ∇. The direction of least variation in the tensor space can then, analogous to the structure tensor in linear spaces, be estimated by minimizing the local energy

E(a) = ∫_V ⟨∂a Σ, ∂a Σ⟩_Σ dx = a^T J a ,   (18)
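A sketch of how J could be assembled at one grid point, assuming the spatial derivatives of the tensor field are available as tangent vectors and that the local averaging over V has already been applied (helper names are illustrative):

```python
import numpy as np

def inner_product(Sigma, W1, W2):
    """GL(n)-invariant scalar product (6); by cyclicity of the trace it equals
    Tr(Sigma^{-1} W1 Sigma^{-1} W2)."""
    Si = np.linalg.inv(Sigma)
    return np.trace(Si @ W1 @ Si @ W2)

def structure_tensor(Sigma, dSigma):
    """J[i, j] = <d_i Sigma, d_j Sigma>_Sigma for the N spatial derivatives
    dSigma (a list of tangent vectors at Sigma)."""
    N = len(dSigma)
    J = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            J[i, j] = inner_product(Sigma, dSigma[i], dSigma[j])
    return J

# The direction of least variation is the eigenvector of J with the smallest
# eigenvalue (np.linalg.eigh returns eigenvalues in ascending order):
# a_min = np.linalg.eigh(structure_tensor(Sigma, dSigma))[1][:, 0]
```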
5 Numerical Issues
So far we have assumed the tensor field to be defined on a continuous domain. In the experimental setting we are confronted with tensor fields defined on a discrete grid. The application of Riemannian anisotropic diffusion requires discrete approximations of the derivatives derived in Sect. 4. In principle, we could use matrix differences to approximate the derivatives, but this would contradict our effort to derive an intrinsic expression of the anisotropic diffusion equation. Such finite differences are extrinsic since they are based on Euclidean differences between tensors, i.e. they use the difference in the space of symmetric matrices and not the Riemannian metric of the space P(n). In order to approximate the gradient ∇L in (16) on a discrete grid, we need discrete approximations of derivatives of first and second order. Intrinsic approximations to first order derivatives have already been proposed in [20] and are reviewed here in the following proposition. Let us denote by TΣ^{ej}(x) the tangent vector from Σ(x) towards Σ(x + εej) defined by the logarithmic map as

TΣ^{ej}(x) = Σ^{1/2} log( Σ^{−1/2} Σ(x + εej) Σ^{−1/2} ) Σ^{1/2} ,   with Σ := Σ(x) .   (19)

Proposition 1. The first order discrete approximation of the first order derivative of Σ in direction j reads

∂j Σ = (1/(2ε)) ( TΣ^{ej}(x) − TΣ^{−ej}(x) ) + O(ε) .   (20)
A second order discrete approximation scheme for the second order derivative in direction ej has been derived in [7]. We state it here as a second proposition; for the proof see [7].

Proposition 2. The second order discrete approximation of the second order derivative in direction ej is

Δj Σ = (1/ε²) ( TΣ^{ej}(x) + TΣ^{−ej}(x) ) + O(ε²) .   (21)

For the anisotropic diffusion equation we also need the mixed derivatives Δij Σ, which can be approximated according to Proposition 3.

Proposition 3. The second order discrete approximation of the second order mixed derivative in directions i and j is given by

(Δij Σ + Δji Σ)/2 = (1/ε²) ( TΣ^{en}(x) + TΣ^{−en}(x) − TΣ^{ep}(x) − TΣ^{−ep}(x) ) + O(ε²) ,   (22)

with the abbreviations en = (1/√2)(ei + ej) and ep = (1/√2)(ei − ej).
Proof. We expand the tangent vector as

TΣ^{en}(x) = ε ∂n Σ + (ε²/2) ∂n² Σ − (ε²/2) (∂n Σ) Σ^{−1} (∂n Σ) + O(ε³) .   (23)

Now, we express the derivative in direction n by derivatives along the coordinate axes in i and j direction, ∂n = (1/√2) ∂i + (1/√2) ∂j, yielding

TΣ^{en}(x) = (ε/√2)(∂i Σ + ∂j Σ) + (ε²/4)( ∂i² Σ + ∂j² Σ + 2 ∂i ∂j Σ − (∂i Σ) Σ^{−1} (∂i Σ) − (∂j Σ) Σ^{−1} (∂j Σ) − (∂i Σ) Σ^{−1} (∂j Σ) − (∂j Σ) Σ^{−1} (∂i Σ) ) + O(ε³) .   (24)

Expanding TΣ^{Δep}(x) := TΣ^{ep}(x) + TΣ^{−ep}(x) in the same way yields

TΣ^{Δep}(x) = (ε²/4)( ∂i² Σ + ∂j² Σ − 2 ∂i ∂j Σ − (∂i Σ) Σ^{−1} (∂i Σ) − (∂j Σ) Σ^{−1} (∂j Σ) + (∂i Σ) Σ^{−1} (∂j Σ) + (∂j Σ) Σ^{−1} (∂i Σ) ) + O(ε⁴) .   (25)

By subtracting (25) from (24) and dividing by the square of the grid size ε² we obtain the claimed second order approximation for the mixed derivatives, which concludes the proof.
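For illustration, the intrinsic finite differences of Propositions 1–3 could be sketched as follows, reusing log_map from the earlier sketch; boundary handling is omitted, and note that on a square grid the diagonal neighbours used for en and ep lie at a distance of ε√2, so a constant rescaling may be required:

```python
import numpy as np

def tangent(field, x, y, dx, dy):
    """Tangent vector at field[x, y] pointing towards a neighbouring tensor,
    via log_map from the earlier sketch; field has shape (H, W, n, n)."""
    return log_map(field[x, y], field[x + dx, y + dy])

def first_derivative(field, x, y, axis, eps=1.0):             # eq. (20)
    d = (1, 0) if axis == 0 else (0, 1)
    return (tangent(field, x, y, d[0], d[1])
            - tangent(field, x, y, -d[0], -d[1])) / (2.0 * eps)

def laplace_component(field, x, y, axis, eps=1.0):            # eq. (21)
    d = (1, 0) if axis == 0 else (0, 1)
    return (tangent(field, x, y, d[0], d[1])
            + tangent(field, x, y, -d[0], -d[1])) / eps ** 2

def mixed_component(field, x, y, eps=1.0):                    # eq. (22)
    """Approximates (Delta_ij + Delta_ji) Sigma / 2 using diagonal neighbours,
    which stand in for steps along e_n and e_p."""
    return (tangent(field, x, y, 1, 1) + tangent(field, x, y, -1, -1)
            - tangent(field, x, y, 1, -1) - tangent(field, x, y, -1, 1)) / eps ** 2
```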
6 Experiments
Fig. 1. Line reconstruction experiment; upper row (from left to right): original tensor field, EAD
scheme, LEAD scheme; lower row (from left to right): LEID scheme, RID scheme, RAD scheme
Fig. 2. Denoising experiment; upper row (from left to right): noise corrupted tensor field, EAD
scheme, our LEAD scheme; lower row (from left to right): LEID scheme, RID scheme, RAD
scheme
Experiment 1. In the first experiment on synthetic data we examine the ability of the different diffusion processes to complete interrupted line structures. To this end, we generate a 32 × 32 tensor field of 3 × 3 tensors (see Fig. 1 upper left; in order to visualize details more precisely, only a cutout of the tensor field is shown). Each tensor is represented by an ellipsoid; the orientation of its main axis is additionally color coded, whereas the FA is encoded in the saturation of the depicted tensors. The line structure is interrupted by isotropic tensors with small eigenvalues (λj = 0.05) that are hardly visible due to the saturation encoding of the FA. The results for all diffusion processes are shown in Fig. 1. The nonlinear isotropic processes LEID and RID stop at the line interruption and are not able to complete the line. This results from the fact that, although the smoothing process is also anisotropic for nonlinear isotropic diffusion processes [20], the diffusivity function depends only on its direct neighbors and therefore does not 'see' the line behind the gap. The anisotropic diffusion schemes are steered by the diffusion tensor, which encodes the directional information of a neighborhood depending on the averaging region of the structure tensor. The anisotropic diffusion approaches fill the gap and reconstruct the line. However, the EAD process again suffers from the eigenvalue swelling effect and only one tensor connects both interrupted line structures; increasing the averaging region of the structure tensor might fill the gap more clearly. Our RAD and LEAD schemes reconstruct the line structure. However, we observe a small
decrease of the anisotropy for the log-Euclidean metric, whereas the anisotropy for the affine invariant metric increases in the vicinity of image borders.
Fig. 3. Denoising experiment 3; (upper row, from left to right): noisy DT-MRI image, LEID
scheme, RID scheme; (lower row, from left to right) EAD scheme, LEAD scheme, RAD scheme
as follows: TR = 6925 ms, TE = 104 ms, 192 matrix with 6/8 phase partial Fourier, 23 cm field of view (FOV), and 36 2.4-mm-thick contiguous axial slices. The in-plane resolution was 1.2 mm/pixel. We estimate a volumetric tensor field of size 192 × 192 × 36 and take one slice for further processing. For evaluation purposes we recorded tensor fields of the brain with 6 different signal-to-noise ratios (SNR), denoted as DTI1-6 in the following. Thus, we can use the DT-MRI images (DTI6) from the long measurement (i.e. good SNR) as a reference data set, against which we compare the FA of the tensor fields obtained from the lower SNR data sets (DTI1-5), which can be acquired in a clinically feasible measurement time. Starting from the five different noisy tensor fields, we compute the evolved tensor fields for all considered diffusion schemes (Fig. 3 shows cutouts of the noisy field and evolved fields) and compare their FA with the reference field. All schemes lead to rather smooth tensor fields. However, the anisotropic diffusion schemes (EAD, LEAD and RAD) lead to an enhancement of oriented structures within the tensor fields, which is most distinct for our RAD scheme. As in the previous experiments, the eigenvalue swelling effect can be observed in case of the EAD scheme. Our RAD/LEAD schemes yield the best results among anisotropic regularization schemes with respect to the FA measure, as shown in Tab. 1.
Table 1. Results of experiment 3: the average and standard deviation of the fractional anisotropy error |FA − FA_ref|, where FA_ref belongs to the reference tensor field, computed over 1000 time steps for each diffusion scheme and for five different noise levels
7 Conclusion
We generalized the concept of anisotropic diffusion to tensor valued data with respect to the affine invariant Riemannian metric. We derived the intrinsic mixed second order derivatives as they are required for the anisotropic diffusion process. Furthermore, we derived a discrete intrinsic approximation scheme for the mixed second order derivatives. Since mixed second order derivatives also appear in other methods based on partial differential equations, this contribution could serve as a basis for generalizing these methods in an intrinsic way in a discrete formulation. Experiments on synthetic as well as real world data demonstrate the value of our fully intrinsic differential geometric formulation of the anisotropic diffusion concept. As a computationally efficient alternative, we proposed an anisotropic diffusion scheme based on the log-Euclidean metric. Summing up, our proposed anisotropic diffusion schemes show promising results on the given test images. Further work might examine the reconstruction properties of other tensor characteristics as well as the influence of so far heuristically chosen parameters, e.g. the diffusivity function.
References
1. Berger, M.-O., Deriche, R., Herlin, I., Jaffré, J., Morel, J.-M. (eds.): Icaos 1996: Images and
wavelets and PDEs. Lecture Notes in Control and Information Sciences, vol. 219 (1996)
2. Weickert, J.: Anisotropic diffusion in image processing. Teubner, Stuttgart (1998)
3. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 12 (1990)
4. Bigün, J., Granlund, G.H.: Optimal orientation detection of linear symmetry. In: ICCV, Lon-
don, UK, pp. 433–438 (1987)
5. Tschumperlé, D., Deriche, R.: Diffusion tensor regularization with constraints preservation.
In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2001), pp. 948–953 (2001)
6. Chefd’hotel, C., Tschumperlé, D., Deriche, R., Faugeras, O.: Regularizing flows for con-
strained matrix-valued images. J. Math. Imaging Vis. 20(1-2), 147–162 (2004)
7. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. Interna-
tional Journal of Computer Vision 66(1), 41–66 (2006)
8. Castano-Moraga, C.A., Lenglet, C., Deriche, R., Ruiz-Alzola, J.: A Riemannian approach to
anisotropic filtering of tensor fields. Signal Processing 87(2), 263–276 (2007)
9. Weickert, J., Hagen, H.: Visualization and Processing of Tensor Fields (Mathematics and
Visualization). Springer, New York (2005)
10. Rao, C.: Information and accuracy attainable in estimation of statistical parameters. Bull.
Calcutta Math. Soc. 37, 81–91 (1945)
11. Martin-Fernandez, M., San-Jose, R., Westin, C.F., Alberola-Lopez, C.: A novel Gauss-
Markov random field approach for regularization of diffusion tensor maps. In: Moreno-Dı́az
Jr., R., Pichler, F. (eds.) EUROCAST 2003. LNCS, vol. 2809, pp. 506–517. Springer, Hei-
delberg (2003)
12. Weickert, J., Brox, T.: Diffusion and regularization of vector- and matrix-valued images. In:
Inverse Problems, Image Analysis, and Medical Imaging. Contemporary Mathematics, pp.
251–268 (2002)
13. Westin, C.-F., Knutsson, H.: Tensor field regularization using normalized convolution. In:
Moreno-Dı́az Jr., R., Pichler, F. (eds.) EUROCAST 2003. LNCS, vol. 2809, pp. 564–572.
Springer, Heidelberg (2003)
14. Burgeth, B., Didas, S., Florack, L., Weickert, J.: A generic approach to the filtering of matrix
fields with singular PDEs. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS,
vol. 4485, pp. 556–567. Springer, Heidelberg (2007)
15. Gur, Y., Sochen, N.A.: Denoising tensors via Lie group flows. In: Paragios, N., Faugeras, O.,
Chan, T., Schnörr, C. (eds.) VLSM 2005. LNCS, vol. 3752, pp. 13–24. Springer, Heidelberg
(2005)
16. Moakher, M.: A differential geometric approach to the geometric mean of symmetric
positive-definite matrices. SIAM J. Matrix Anal. Appl (2003)
17. Fletcher, P., Joshi, S.: Principal geodesic analysis on symmetric spaces: Statistics of diffusion
tensors. In: Computer Vision and Mathematical Methods in Medical and Biomedical Image
Analysis, ECCV 2004 Workshops CVAMIA and MMBIA, pp. 87–98 (2004)
18. Lenglet, C., Rousson, M., Deriche, R., Faugeras, O.D., Lehericy, S., Ugurbil, K.: A Rie-
mannian approach to diffusion tensor images segmentation. In: Christensen, G.E., Sonka, M.
(eds.) IPMI 2005. LNCS, vol. 3565, pp. 591–602. Springer, Heidelberg (2005)
19. Batchelor, P.G., Moakher, M., Atkinson, D., Calamante, F., Connelly, A.: A rigorous frame-
work for diffusion tensor calculus. Magn. Reson. Med. 53(1), 221–225 (2005)
20. Fillard, P., Arsigny, V., Ayache, N., Pennec, X.: A Riemannian framework for the processing
of tensor-valued images. In: Fogh Olsen, O., Florack, L.M.J., Kuijper, A. (eds.) DSSCV
2005. LNCS, vol. 3753, pp. 112–123. Springer, Heidelberg (2005)
21. Lenglet, C., Rousson, M., Deriche, R., Faugeras, O.: Statistics on the manifold of multivariate
normal distributions: Theory and application to diffusion tensor MRI processing. J. Math.
Imaging Vis. 25(3), 423–444 (2006)
22. Zéraı̈, M., Moakher, M.: Riemannian curvature-driven flows for tensor-valued data. In: Sgal-
lari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 592–602. Springer,
Heidelberg (2007)
23. Helgason, S.: Differential Geometry, Lie groups and symmetric spaces. Academic Press,
London (1978)
24. El-Fallah, A., Ford, G.: On mean curvature diffusion in nonlinear image filtering. Pattern
Recognition Letters 19, 433–437 (1998)
25. Sochen, N., Kimmel, R., Malladi, R.: A geometrical framework for low level vision. IEEE
Transaction on Image Processing, Special Issue on PDE based Image Processing 7(3), 310–
318 (1998)
26. Begelfor, E., Werman, M.: Affine invariance revisited. In: CVPR ’06: Proceedings of the
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.
2087–2094. IEEE Computer Society Press, Washington (2006)
27. Zhang, F., Hancock, E.: Tensor MRI regularization via graph diffusion. In: BMVC 2006, pp.
578–589 (2006)
28. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Log-Euclidean metrics for fast and simple
calculus on diffusion tensors. Magnetic Resonance in Medicine 56(2), 411–421 (2006)
29. Gur, Y., Sochen, N.A.: Fast invariant Riemannian DT-MRI regularization. In: Proc. of IEEE
Computer Society Workshop on Mathematical Methods in Biomedical Image Analysis (MM-
BIA), Rio de Janeiro, Brazil, pp. 1–7 (2007)
30. Mémoli, F., Sapiro, G., Osher, S.: Solving variational problems and partial differential equa-
tions mapping into general target manifolds. Journal of Computational Physics 195(1), 263–
292 (2004)
31. Brox, T., Weickert, J., Burgeth, B., Mrázek, P.: Nonlinear structure tensors. Revised version
of technical report no. 113. Saarland University, Saarbrücken, Germany (2004)
32. Nielsen, M., Johansen, P., Olsen, O., Weickert, J. (eds.): Scale-Space 1999. LNCS, vol. 1682.
Springer, Heidelberg (1999)
33. Maaß, H.: Siegel’s Modular Forms and Dirichlet Series. Lecture notes in mathematics,
vol. 216. Springer, Heidelberg (1971)
34. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Fast and simple calculus on tensors in the log-
Euclidean framework. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3749,
pp. 115–122. Springer, Heidelberg (2005)
35. Bihan, D.L., Mangin, J.F., Poupon, C., Clark, C.A., Pappata, S., Molko, N., Chabriat, H.:
Diffusion tensor imaging: Concepts and applications. Journal of Magnetic Resonance Imag-
ing 13(4), 534–546 (2001)
FaceTracer: A Search Engine for
Large Collections of Images with Faces
N. Kumar, P. Belhumeur, and S. Nayar
Columbia University
Abstract. We have created the first image search engine based entirely
on faces. Using simple text queries such as “smiling men with blond hair
and mustaches,” users can search through over 3.1 million faces which
have been automatically labeled on the basis of several facial attributes.
Faces in our database have been extracted and aligned from images down-
loaded from the internet using a commercial face detector, and the num-
ber of images and attributes continues to grow daily. Our classification
approach uses a novel combination of Support Vector Machines and Ad-
aboost which exploits the strong structure of faces to select and train on
the optimal set of features for each attribute. We show state-of-the-art
classification results compared to previous works, and demonstrate the
power of our architecture through a functional, large-scale face search
engine. Our framework is fully automatic, easy to scale, and computes
all labels off-line, leading to fast on-line search performance. In addition,
we describe how our system can be used for a number of applications,
including law enforcement, social networks, and personal photo manage-
ment. Our search engine will soon be made publicly available.
1 Introduction
We have created the first face search engine, allowing users to search through
large collections of images which have been automatically labeled based on the
appearance of the faces within them. Our system lets users search on the basis
of a variety of facial attributes using natural language queries such as, “men
with mustaches,” or “young blonde women,” or even, “indoor photos of smiling
children.” This face search engine can be directed at all images on the internet,
tailored toward specific image collections such as those used by law enforcement
or online social networks, or even focused on personal photo libraries.
The ability of current search engines to find images based on facial appear-
ance is limited to images with text annotations. Yet, there are many problems
with annotation-based search of images: the manual labeling of images is time-
consuming; the annotations are often incorrect or misleading, as they may refer
to other content on a webpage; and finally, the vast majority of images are
Supported by the National Defense Science & Engineering Graduate Fellowship.
Fig. 1. Results for the query “smiling asian men with glasses,” using (a) the Google
image search engine and (b) our face search engine. Our system currently has over
3.1 million faces, automatically detected and extracted from images downloaded from
the internet, using a commercial face detector [1]. Rather than use text annotations
to find images, our system has automatically labeled a large number of different facial
attributes on each face (off-line), and searches are performed using only these labels.
Thus, search results are returned almost instantaneously. The results also contain links
pointing back to the original source image and associated webpage.
simply not annotated. Figures 1a and 1b show the results of the query, “smil-
ing asian men with glasses,” using a conventional image search engine (Google
Image Search) and our search engine, respectively. The difference in quality of
search results is clearly visible. Google’s reliance on text annotations results in
it finding images that have no relevance to the query, while our system returns
only the images that match the query.
Like much of the work in content-based image retrieval, the power of our
approach comes from automatically labeling images off-line on the basis of a
large number of attributes. At search time, only these labels need to be queried,
resulting in almost instantaneous searches. Furthermore, it is easy to add new
images and face attributes to our search engine, allowing for future scalability.
Defining new attributes and manually labeling faces to match those attributes
can also be done collaboratively by a community of users.
Figures 2a and 2b show search results of the queries, “young blonde women”
and “children outdoors,” respectively. The first shows a view of our extended
interface, which displays a preview of the original image in the right pane when
the user holds the mouse over a face thumbnail. The latter shows an example of
a query run on a personalized set of images. Incorporating our search engine into
photo management tools would enable users to quickly locate sets of images and
then perform bulk operations on them (e.g., edit, email, or delete). (Since current
tools depend on manual annotation of images, they are significantly more time-
consuming to use.) Another advantage of our attribute-based search on personal
collections is that with a limited number of people, simple queries can often find
images of a particular person, without requiring any form of face recognition.
Fig. 2. Results of queries (a)“young blonde women” and (b) “children outside,” using
our face search engine. In (a), search results are shown in the left panel, while the right
panel shows a preview of the original image for the selected face. (b) shows search
results on a personalized dataset, displaying the results as thumbnails of the original
images. Note that these results were correctly classified as being “outside” using only
the cropped face images, showing that face images often contain enough information
to describe properties of the image which are not directly related to faces.
Our search engine owes its superior performance to the following factors:
– A large and diverse dataset of face images with a significant subset
containing attribute labels. We currently have over 3.1 million aligned
faces in our database – the largest such collection in the world. In addition to
its size, our database is also noteworthy for being a completely “real-world”
dataset. The images are downloaded from the internet, encompass a wide range of pose, illumination, and imaging conditions, and were taken using a large variety of cameras. The faces have been automatically extracted and aligned
using a commercial face and fiducial point detector [1]. In addition, 10 at-
tributes have been manually labeled on more than 17,000 of the face images,
creating a large dataset for training and testing classification algorithms.
– A scalable and fully automatic architecture for attribute classi-
fication. We present a novel approach tailored toward face classification
problems, which uses a boosted set of Support Vector Machines (SVMs) [2]
to form a strong classifier with high accuracy. We describe the results of this
algorithm on a variety of different attributes, including demographic infor-
mation such as gender, age, and race; facial characteristics such as eye wear
and facial hair; image properties such as blurriness and lighting conditions;
and many others as well. A key aspect of this work is that classifiers for
new attributes can be trained automatically, requiring only a set of labeled
examples. Yet, the flexibility of our framework does not come at the cost of
reduced accuracy – we compare against several state-of-the-art classification
methods and show the superior classification rates produced by our system.
We will soon be releasing our search engine for public use.
2 Related Work
Our work lies at the intersection of several fields, including computer vision,
machine learning, and content-based image retrieval. We present an overview of
the relevant work, organized by topic.
(e.g., celebrity names and professions). The latter allows us to sample from the
more general distribution of images on the internet. In particular, it lets us
include images that have no corresponding textual information, i.e., that are
effectively invisible to current image search engines. Our images are downloaded
from a wide variety of online sources, such as Google Images, Microsoft Live
Image Search, and Flickr, to name a few. Relevant metadata such as image and
page URLs are stored in the EXIF tags of the downloaded images.
Next, we apply the OKAO face detector [1] to the downloaded images to
extract faces. This detector also gives us the pose angles of each face, as well as
the locations of six fiducial points (the corners of both eyes and the corners of
the mouth). We filter the set of faces by resolution and face pose (±10◦ from
front-center). Finally, the remaining faces are aligned to a canonical pose by
applying an affine transformation. This transform is computed using linear least
squares on the detected fiducial points and corresponding points defined on a
template face. (In future work, we intend to go beyond near frontal poses.)
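A minimal sketch of such an alignment step, estimating a 2-D affine transform from fiducial correspondences by linear least squares (the template coordinates and the detector's output format are assumptions):

```python
import numpy as np

def affine_from_points(src, dst):
    """Least-squares 2-D affine transform mapping src points onto dst points.
    src, dst: (K, 2) arrays of corresponding points, K >= 3 (here K = 6
    fiducials: eye and mouth corners vs. a canonical template)."""
    K = src.shape[0]
    A = np.zeros((2 * K, 6))
    A[0::2, 0:2], A[0::2, 2] = src, 1.0    # rows for x' = a*x + b*y + c
    A[1::2, 3:5], A[1::2, 5] = src, 1.0    # rows for y' = d*x + e*y + f
    b = dst.reshape(-1)                    # interleaved target coordinates
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params.reshape(2, 3)            # 2x3 affine matrix

# M = affine_from_points(detected_fiducials, template_fiducials)
# The face is then warped with M into the canonical pose.
```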
We present various statistics of our current face database in Table 1, divided
by image source. We would like to draw attention to three observations about
our data. First, from the statistics of randomly downloaded images, it appears
that a significant fraction of them contain faces (25.7%), and on average, each
image contains 0.5 faces. Second, our collection of aligned faces is the largest
such collection of which we are aware. It is truly a “real-world” dataset, with
completely uncontrolled lighting and environments, taken using unknown cam-
eras and in unknown imaging conditions, with a wide range of image resolutions.
In this respect, our database is similar to the LFW dataset [15], although ours is
larger by 2 orders of magnitude and not targeted specifically for face recognition.
In contrast, existing face datasets such as Yale Face A&B [16], CMU PIE [17],
and FERET [6] are either much smaller in size and/or taken in highly controlled
settings. Even the more expansive FRGC version 2.0 dataset [18] has a limited
number of subjects, image acquisition locations, and all images were taken with
the same camera type. Finally, we have labeled a significant number of these im-
ages for our 10 attributes, enumerated in Table 2. In total, we have over 17,000
attribute labels.
Table 1. Image database statistics. We have collected what we believe to be the largest
set of aligned real-world face images (over 3.1 million so far). These faces have been
extracted using a commercial face detector [1]. Notice that more than 45% of the
downloaded images contain faces, and on average, there is one face per two images.
Image Source         | # Images Downloaded | # Images With Faces | % Images With Faces | Total # Faces Found | Average # Faces Per Image
Randomly Downloaded  | 4,289,184 | 1,102,964 | 25.715 | 2,156,287 | 0.503
Celebrities          | 428,312   | 411,349   | 96.040 | 285,627   | 0.667
Person Names         | 17,748    | 7,086     | 39.926 | 10,086    | 0.568
Face-Related Words   | 13,028    | 5,837     | 44.804 | 14,424    | 1.107
Event-Related Words  | 1,658     | 997       | 60.133 | 1,335     | 0.805
Professions          | 148,782   | 75,105    | 50.480 | 79,992    | 0.538
Series               | 7,472     | 3,950     | 52.864 | 8,585     | 1.149
Camera Defaults      | 895,454   | 893,822   | 99.818 | 380,682   | 0.425
Miscellaneous        | 417,823   | 403,233   | 96.508 | 194,057   | 0.464
Total                | 6,219,461 | 2,904,343 | 46.698 | 3,131,075 | 0.503
Table 2. List of labeled attributes. The labeled face images are used for training our
classifiers, allowing for automatic classification of the remaining faces in our database.
Note that these were labeled by a large set of people, and thus the labels reflect a group
consensus about each attribute rather than a single user’s strict definition.
image. Instead, we use our large sets of manually-labeled images to build accurate
classifiers for each of the desired attributes.
In creating a classifier for a particular attribute, we could simply choose all
pixels on the face, and let our classifier figure out which are important for the
task and which are not. This, however, puts too great a burden on the classifier,
confusing it with non-discriminative features. Instead, we create a rich set of local
feature options from which our classifier can automatically select the best ones.
Each option consists of four choices: the region of the face to extract features
from, the type of pixel data to use, the kind of normalization to apply to the
data, and finally, the level of aggregation to use.
Face Regions. We break up the face into a number of functional regions, such as
the nose, mouth, etc., much like those defined in the work on modular eigenspaces
Fig. 4. The face regions used for automatic feature selection. On the left is one region
corresponding to the whole face, and on the right are the remaining regions, each
corresponding to functional parts of the face. The regions are large enough to be robust
against small differences between individual faces and overlap slightly so that small
errors in alignment do not cause a feature to go outside of its region. The letters in
parentheses denote the code letter for the region, used later in the paper.
[19]. The complete set of 10 regions we use are shown in Fig. 4. Our coarse divi-
sion of the face allows us to take advantage of the common geometry shared by
faces, while allowing for differences between individual faces, as well as robust-
ness to small errors in alignment.
Types of Pixel Data. We include different color spaces and image derivatives
as possible feature types. These can often be more discriminative than standard
RGB values for certain attributes. Table 3 lists the various options.
Normalizations. Normalizations are important for removing lighting effects, allowing for better generalization across images. We can remove illumination gains by using mean normalization, x̂ = x/μ, or both gains and offsets by using energy normalization, x̂ = (x − μ)/σ. In these equations, x refers to the input value, μ and σ are the mean and standard deviation of all the x values within the region, and x̂ refers to the normalized output value.
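As a small illustration, the two normalizations could be written as follows (illustrative NumPy sketch, applied to the pixel values of one region):

```python
import numpy as np

def mean_normalize(x):
    """x_hat = x / mu : removes illumination gain within a region."""
    return x / x.mean()

def energy_normalize(x):
    """x_hat = (x - mu) / sigma : removes gain and offset within a region."""
    return (x - x.mean()) / x.std()
```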
Aggregations. For some attributes, aggregate information over the entire re-
gion might be more useful than individual values at each pixel. This includes
histograms of values over the region, or simply the mean and variance.
To concisely refer to a complete feature option, we define a shorthand nota-
tion using the format, “Region:pixel type.normalization.aggregation.” The re-
gion notation is shown in Fig. 4; the notation for the pixel type, normalization,
and aggregation is shown in Table 3.
In recent years, Support Vector Machines (SVMs) [2] have been used success-
fully for many classification tasks [20,21]. SVMs aim to find the linear hyper-
plane which best separates feature vectors of two different classes, so as to maximize the margin between the two classes.
Table 3. Feature type options. A complete feature type is constructed by first convert-
ing the pixels in a given region to one of the pixel value types from the first column,
then applying one of the normalizations from the second column, and finally aggregat-
ing these values into the output feature vector using one of the options from the last
column. The letters in parentheses are used as code letters in a shorthand notation for
concisely designating feature types.
Table 4. Error rates and top feature combinations for each attribute, computed by
training on 80% of the labeled data and testing on the remaining 20%, averaging over
5 runs (5-fold cross-validation). Note that the attribute-tuned global SVM performs as
well as, or better than, the local SVMs in all cases, and requires much less memory and
computation than the latter. The top feature combinations selected by our algorithm
are shown in ranked order from more important to less as “Region:feature type” pairs,
where the region and feature types are listed using the code letters from Fig. 4 and
Table 3. For example, the first combination for the hair color classifier, “H:r.n.s,” takes
from the hair region (H) the RGB values (r) with no normalization (n) and using only
the statistics (s) of these values.
Fig. 5. Illustrations of automatically-selected region and feature types for (a) gender,
(b) smiling, (c) environment, and (d) hair color. Each face image is surrounded by
depictions of the top-ranked feature combinations for the given attribute, along with
their corresponding shorthand label (as used in Table 4). Notice how each classifier
uses different regions and feature types of the face.
We emphasize the fact that these numbers are computed using our real-world
dataset, and therefore reflect performance on real images.
A limitation of this architecture is that classification will require keeping a
possibly large number of SVMs in memory, and each one will need to be evalu-
ated for every input image. Furthermore, one of the drawbacks of the Adaboost
formulation is that different classifiers can only be combined linearly. Attributes
which might depend on non-linear combinations of different regions or feature
types would be difficult to classify using this architecture.
We solve both of these issues simultaneously by training one “global” SVM
on the union of the features from the top classifiers selected by Adaboost. We do
this by concatenating the features from the N highest-weighted SVMs (from the
output of Adaboost), and then training a single SVM classifier over these features
(optimizing over N ). In practice, the number of features chosen is between 2
(for “mustache”) and 6 (e.g., for “hair color”). Error rates for this algorithm,
denoted as “Attribute-Tuned Global SVM,” are shown in the third column of
Table 4. Notice that for each attribute, these rates are equal to, or less than,
the rates obtained using the combination of local SVMs, despite the fact that
these classifiers run significantly faster and require only a fraction of the memory
(often less by an order of magnitude).
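A hedged sketch of this attribute-tuned global SVM, concatenating the features of the top-ranked combinations and training a single classifier (scikit-learn is used for brevity; the kernel choice and helper names are assumptions, not the paper's exact settings):

```python
import numpy as np
from sklearn.svm import SVC

def train_global_svm(top_extractors, faces, labels):
    """top_extractors: the N highest-weighted region/feature-type combinations
    selected by Adaboost, as callables mapping a face image to a 1-D feature
    vector. Their outputs are concatenated and a single SVM is trained."""
    X = np.array([np.concatenate([extract(face) for extract in top_extractors])
                  for face in faces])
    clf = SVC(kernel='rbf', probability=True)   # probability outputs reused for ranking
    clf.fit(X, labels)
    return clf
```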
The automatically-selected region and feature type combinations for each at-
tribute are shown in the last column of Table 4. Listed in order of decreasing
importance, the combinations are displayed in a shorthand notation using the
codes given in Fig. 4 and Table 3. In Fig. 5, we visually illustrate the top feature
combinations chosen for the gender, smiling, environment, and hair color at-
tributes. This figure shows the ability of our feature selection approach to iden-
tify the relevant regions and feature types for each attribute.
Fig. 6. Results of queries (a) “older men with mustaches” and (b) “dark-haired people
with sunglasses” on our face search engine. The results are shown with aligned face
images on the left, and a preview of the original image for the currently selected face
on the right. Notice the high quality of results in both cases.
For a search engine, the design of the user interface is important for enabling
users to easily find what they are looking for. We use simple text-based queries,
since these are both familiar and accessible to most internet users. Search queries
are mapped onto attribute labels using a dictionary of terms. Users can see the
current list of attributes supported by the system on the search page, allowing
them to construct their searches without having to guess what kinds of queries
are allowed. This approach is simple, flexible, and yields excellent results in prac-
tice. Furthermore, it is easy to add new phrases and attributes to the dictionary,
or maintain separate dictionaries for different languages.
Results are ranked in order of decreasing confidence, so that the most relevant
images are shown first. (Our classifier gives us confidence values for each labeled
attribute.) For searches with multiple query terms, we combine the confidences
of different labels such that the final ranking shows images in decreasing order of
relevance to all search terms. To prevent high confidences for one attribute from
dominating the search results, we convert the confidences into probabilities, and
then use the product of the probabilities as the sort criteria. This ensures that
the images with high confidences for all attributes are shown first.
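The ranking rule could be sketched as follows (per-attribute probabilities are assumed to be available for every image, e.g. from probability-calibrated classifiers):

```python
import numpy as np

def rank_images(prob_per_attribute, query_attributes):
    """prob_per_attribute: dict mapping attribute name -> array of per-image
    probabilities; returns image indices sorted from most to least relevant."""
    n_images = len(next(iter(prob_per_attribute.values())))
    scores = np.ones(n_images)
    for attr in query_attributes:
        scores *= prob_per_attribute[attr]   # product of per-attribute probabilities
    return np.argsort(-scores)
```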
Example queries on our search engine are shown in Figs. 1b, 2, and 6. The
returned results are all highly relevant, and the user can view the results in a
variety of ways, as shown in the different examples. Figure 2b shows that we can
learn useful things about an image using just the appearance of the faces within
it – in this case determining whether the image was taken indoors or outdoors.
Our search engine can be used in many other applications, replacing or aug-
menting existing tools. In law enforcement, eyewitnesses to crimes could use our
system to quickly narrow a list of possible suspects and then identify the actual
criminal from this reduced list, saving time and increasing the chances of finding
the right person. On the internet, our face search engine is a perfect match for
social networking websites such as Facebook and Myspace, which contain large
numbers of images with people. Additionally, the community aspect of these
websites would allow for collaborative creation of new attributes. Finally, users
can utilize our system to more easily organize and manage their own personal
photo collections. For example, searches for blurry or other poor-quality images
can be used to find and remove all such images from the collection.
7 Discussion
In this work, we have described a new approach to searching for images in large
databases and have constructed the first face search engine using this approach.
By limiting our focus to images with faces, we are able to align the images to a
common coordinate system. This allows us to exploit the commonality of facial
structures across people to train accurate classifiers for real-world face images.
Our approach shows the power of combining the strengths of different algorithms
to create a flexible architecture without sacrificing classification accuracy.
As we continue to grow and improve our system, we would also like to ad-
dress some of our current limitations. For example, to handle more than just
frontal faces would require that we define the face regions for each pose bin.
Rather than specifying the regions manually, however, we can define them once
on a 3D model, and then project the regions to 2D for each pose bin. The other
manual portion of our architecture is the labeling of example images for train-
ing classifiers. Here, we can take advantage of communities on the internet by
offering a simple interface for both defining new attributes and labeling example
images. Finally, while our dictionary-based search interface is adequate for most
simple queries, taking advantage of methods in statistical natural language pro-
cessing (NLP) could allow our system to map more complex queries to the list
of attributes.
References
1. Omron: OKAO vision (2008), http://www.omron.com/rd/vision/01.html
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3) (1995)
3. Golomb, B.A., Lawrence, D.T., Sejnowski, T.J.: Sexnet: A neural network identifies
sex from human faces. NIPS, 572–577 (1990)
4. Belhumeur, P.N., Hespanha, J., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recog-
nition using class specific linear projection. In: Buxton, B.F., Cipolla, R. (eds.)
ECCV 1996. LNCS, vol. 1065, pp. 45–58. Springer, Heidelberg (1996)
5. Moghaddam, B., Yang, M.-H.: Learning gender with support faces. TPAMI 24(5),
707–711 (2002)
6. Phillips, P., Moon, H., Rizvi, S., Rauss, P.: The FERET evaluation methodology
for face-recognition algorithms. TPAMI 22(10), 1090–1104 (2000)
7. Shakhnarovich, G., Viola, P.A., Moghaddam, B.: A unified learning framework for
real time face detection and classification. ICAFGR, 14–21 (2002)
8. Baluja, S., Rowley, H.: Boosting sex identification performance. IJCV (2007)
9. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: ICML
(1996)
10. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: CVPR (2001)
11. Bartlett, M.S., Littlewort, G., Fasel, I., Movellan, J.R.: Real time face detection and
facial expression recognition: Development and applications to human computer
interaction. CVPRW 05 (2003)
12. Wang, Y., Ai, H., Wu, B., Huang, C.: Real time facial expression recognition with
adaboost. In: ICPR, pp. 926–929 (2004)
13. Datta, R., Li, J., Wang, J.Z.: Content-based image retrieval: Approaches and trends
of the new age. Multimedia Information Retrieval, 253–262 (2005)
14. Pentland, A., Picard, R., Sclaroff, S.: Photobook: Content-based manipulation of
image databases. IJCV, 233–254 (1996)
15. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild:
A database for studying face recognition in unconstrained environments. Technical
Report 07-49 (2007)
16. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From few to many: Il-
lumination cone models for face recognition under variable lighting and pose.
TPAMI 23(6), 643–660 (2001)
17. Sim, T., Baker, S., Bsat, M.: The CMU pose, illumination, and expression (PIE)
database. In: ICAFGR, pp. 46–51 (2002)
18. Phillips, P.J., Flynn, P.J., Scruggs, T., Bowyer, K.W., Chang, J., Hoffman, K.,
Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge.
CVPR, 947–954 (2005)
19. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces
for face recognition. CVPR, 84–91 (1994)
20. Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector
machines (SVM). In: ICPR, pp. 154–156 (1998)
21. Osuna, E., Freund, R., Girosi, F.: Training support vector machines: An application
to face detection. CVPR (1997)
22. Schapire, R., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new
explanation for the effectiveness of voting methods. The Annals of Statistics 26(5),
1651–1686 (1998)
23. Drucker, H., Cortes, C.: Boosting decision trees. NIPS, 479–485 (1995)
24. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/cjlin/libsvm/
What Does the Sky Tell Us about the Camera?
J.-F. Lalonde, S.G. Narasimhan, and A.A. Efros
1 Introduction
Fig. 1. The sky appearance is a rich source of information about the scene illumination
time. When the scene is mostly static, the resulting sequence of images contains a
wealth of information that has been exploited in several different ways, the most
commonly known being background subtraction, but also shadow detection and
removal [1], video factorization and compression [2], radiometric calibration [3],
camera geo-location [4], temporal variation analysis [5] and color constancy [6].
The main contribution of this paper is to show what information about the cam-
era is available in the visible portion of the sky in a time-lapse image sequence,
and how to extract this information to calibrate the camera.
The sky appearance has long been studied by physicists. One of the most popular physically-based sky models was introduced by Perez et al. [7]. This model has been used in graphics for relighting [8] and rendering [9]. Surprisingly, however, very little work has been done on extracting information from the visible sky. One notable exception is the work of Jacobs et al. [10], where they use the sky to infer the camera azimuth by using a correlation-based approach. In our work, we
address a broader question: what does the sky tell us about the camera? We show
how we can recover the viewing geometry using an optimization-based approach.
Specifically, we estimate the camera focal length, its zenith angle (with respect
to vertical), and its azimuth angle (with respect to North). We will assume that
a static camera is observing the same scene over time, with no roll angle (i.e.
the horizon line is parallel to the image horizontal axis). Its location (GPS co-
ordinates) and the times of image acquisition are also known. We also assume
that the sky region has been segmented, either manually or automatically [5].
Once the camera parameters are recovered, we then show how we can use our
sky model in two applications. First, we present a novel sky-cloud segmentation
algorithm that identifies cloud regions within an image. Second, we show how
we can use the resulting sky-cloud segmentation in order to find matching skies
across different cameras. To do so, we introduce a novel bi-layered sky model
which captures both the physically-based sky parameters and cloud appearance,
and determine a similarity measure between two images. This distance can then
be used for finding images with similar skies, even if they are captured by differ-
ent cameras at different locations. We show qualitative cloud segmentation and
sky matching results that demonstrate the usefulness of our approach.
In order to thoroughly test our algorithms, we require a set of time-lapse
image sequences which exhibit a wide range of skies and cameras. For this, we
use the AMOS (Archive of Many Outdoor Scenes) database [5], which contains
image sequences taken by static webcams over more than a year.
Fig. 2. Geometry of the problem, when a camera is viewing a sky element (blue patch
in the upper-right). The sky element is imaged at pixel (up , vp ) in the image, and the
camera is rotated by angles (θc , φc ). The camera focal length fc , not shown here, is
the distance between the origin (center of projection), and the image center. The sun
direction is given by (θs , φs ), and the angle between the sun and the sky element is γp .
Here (up , vp ) are known because the sky is segmented.
single parameter, the turbidity t. Intuitively, the turbidity encodes the amount
of scattering in the atmosphere, so the lower t, the clearer the sky. For clear
skies, the constants take on the following values: a = −1, b = −0.32, c = 10,
d = −3, e = 0.45, which corresponds approximately to t = 2.17.
The model expresses the absolute luminance Lp of a sky element as a function
of another arbitrary reference sky element. For instance, if the zenith luminance
Lz is known, then

Lp = Lz · f(θp, γp) / f(0, θs) ,   (2)

where θs is the zenith angle of the sun.
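For reference, a small sketch of this luminance ratio, assuming the standard Perez gradation/indicatrix form for the function f referenced as (1) (the exact form of (1) is not reproduced in this excerpt, so that form is an assumption) with the clear-sky constants quoted above:

```python
import numpy as np

def perez_f(theta, gamma, a=-1.0, b=-0.32, c=10.0, d=-3.0, e=0.45):
    """Standard Perez gradation/indicatrix function (assumed form of eq. (1));
    angles in radians, defaults are the clear-sky constants quoted above."""
    return (1.0 + a * np.exp(b / np.cos(theta))) * \
           (1.0 + c * np.exp(d * gamma) + e * np.cos(gamma) ** 2)

def relative_luminance(theta_p, gamma_p, theta_s):
    """Lp / Lz from eq. (2): sky element at zenith angle theta_p, angular
    distance gamma_p from the sun, with the sun at zenith angle theta_s."""
    return perez_f(theta_p, gamma_p) / perez_f(0.0, theta_s)
```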
[Plots for Fig. 3: scaled luminance versus pixel height vp of the sky element; in (a) the field of view is varied from 40° to 120°, in (b) the camera zenith angle θc is varied from 70° to 110°.]
Fig. 3. Luminance profiles predicted by the azimuth-independent model (5). For clear
skies, intensity diminishes as pixel height (x-axis) increases. (a) The camera zenith
angle is kept constant at θc = 90◦ , while the field of view is varied. (b) The field of
view is kept constant at 80◦ , while the camera zenith angle is varied. Both parameters
have a strong influence on the shape and offset of the predicted sky gradient.
Before we present how we use the models presented above, recall that we are
dealing with ratios of sky luminances, and that a reference element is needed.
Earlier, we used the zenith luminance Lz as a reference in (2) and (4), which
unfortunately is not always visible in images. Instead, we can treat this as an
additional unknown in the equations. Since the denominators in (2) and (4) do
not depend on camera parameters, we can combine them with Lz into a single
unknown scale factor k.
Given a set of clear-sky images I, we estimate θc and fc by minimizing

min_{θc, fc, k^(i)}  ∑_{i∈I} ∑_{p∈P} ( yp^(i) − k^(i) g(vp, θc, fc) )² ,   (8)

where yp^(i) is the observed intensity of pixel p in image i, and the k^(i) are unknown scale factors (Sect. 2.3), one per image. This non-linear least-squares minimization can be solved iteratively using standard optimization techniques such as Levenberg-Marquardt, or fminsearch in Matlab. fc is initialized to a value corresponding to a 35° field of view, and θc is set such that the horizon line is aligned with the lowest visible sky pixel. All k^(i)'s are initialized to 1.
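A hedged sketch of this fit using SciPy's least-squares solver; the model function g of eq. (5) is assumed to be implemented elsewhere, and the zenith-angle initialization is a placeholder for the horizon-based rule described above:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_camera(y, v, g, image_height=480, fov0_deg=35.0):
    """y, v: lists with one entry per clear-sky image, holding the observed
    sky intensities and the pixel heights vp of the sampled sky pixels.
    g(vp, theta_c, f_c) is the azimuth-independent sky model of eq. (5)."""
    n_images = len(y)
    f0 = 0.5 * image_height / np.tan(np.radians(fov0_deg) / 2.0)  # focal length for a 35 deg fov
    theta0 = np.radians(90.0)   # placeholder; the paper aligns the horizon with the lowest sky pixel
    x0 = np.concatenate([[theta0, f0], np.ones(n_images)])        # [theta_c, f_c, k^(1..N)]

    def residuals(x):
        theta_c, f_c, k = x[0], x[1], x[2:]
        return np.concatenate([y[i] - k[i] * g(v[i], theta_c, f_c)
                               for i in range(n_images)])

    return least_squares(residuals, x0, method='lm')   # Levenberg-Marquardt
```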
From the azimuth-independent model (5) and images where the sun is far from
the camera field of view, we were able to estimate the camera focal length fc
and its zenith angle θc . Now if we consider the general model (7) that depends
on the sun position, we can also estimate the camera azimuth angle using the
same framework as before.
Suppose we are given a set of images J where the sky is clear, but where the
sun is now closer to the camera field of view. Similarly to (8), we seek to find
the camera azimuth angle which minimizes
min_{φc, k^(j)}  ∑_{j∈J} ∑_{p∈P} ( yp^(j) − k^(j) g(up, vp, θc, φc, fc, θs, φs) )² .   (9)
We tested our model and fitting technique on a very diverse set of scenarios
using data synthetically generated by using the original Perez sky model in (1).
During these experiments, the following parameters were varied: the camera
focal length fc , the camera zenith and azimuth angles (θc , φc ), the number of
Table 1. Camera calibration from the sky on 3 real image sequences taken from the
AMOS database [5]. Error in focal length, zenith and azimuth angle estimation is shown
for each sequence. The error is computed with respect to values obtained by using the
sun position to estimate the same parameters [14].
input images used in the optimization, the number of visible sky pixels, and the camera latitude (which affects the maximum sun height). In all our experiments,
1000 pixels are randomly selected from each input image, and each experiment
is repeated for 15 random selections.
The focal length can be recovered with at most 4% error even in challenging
conditions: 30% visibility, over a wide range of field of view ([13◦ , 93◦ ] interval),
zenith angles ([45◦ , 135◦]), azimuth angles ([−180◦, 180◦ ]), and sun positions
(entire hemisphere). We note a degradation in performance at wider fields of
view (> 100◦ ), because the assumption of independent zenith and azimuth angles
starts to break down (Sect. 2.3). Less than 0.1◦ error for both zenith and azimuth
angles is obtained in similar operating conditions.
Fig. 4. Illustration of estimated camera parameters. First row: Example image for the
three sequences in Table 1. The horizon line is drawn in red. Note that the horizon line
in sequence 414 is found to be just below the image. Second row: Graphical illustration
of all three estimated parameters: focal length, zenith and azimuth angles. The sun is
drawn at the position corresponding to the image in the first row.
we follow [9] and express the five weather coefficients as a linear function of a single value, the turbidity t. Strictly speaking, this means minimizing over x = [t  k^(1)  k^(2)  k^(3)]:

min_x  ∑_{i=1}^{3} ∑_{p∈P} ( yp^(i) − k^(i) g(up, vp, θs, φs, τ^(i)(t)) )² ,   (10)

where i indexes the color channel. Here the camera parameters are fixed, so we omit them for clarity. The vector τ^(i)(t) represents the coefficients (a, . . . , e) obtained by multiplying the turbidity t with the linear transformation M^(i): τ^(i)(t) = M^(i) [t 1]^T. The entries of M^(i) for the xyY space are given in the appendix in [9]. The k^(i) are initialized to 1, and t to 2 (low turbidity).
Unfortunately, solving this simplified minimization problem does not yield
satisfying results because the L2-norm is not robust to outliers, so even a small
amount of clouds will bias the results.
We therefore minimize instead

min_x  ∑_{i=1}^{3} ∑_{p∈P} wp ( yp^(i) − k^(i) g(up, vp, θs, φs, τ^(i)(t)) )² + β ||x − xc||² ,   (11)
where wp ∈ [0, 1] is a weight given to each pixel, and β = 0.05 controls the
importance of the prior term in the optimization. We initialize x to the prior xc .
Let us now look at how xc is obtained. We make the following observation:
clear skies should have low turbidities, and they should be smooth (i.e. no patchy
clouds). Using this insight, if minimizing (10) on a given image yields low residual
error and turbidity, then the sky must be clear. We compute a database of clear
skies by keeping all images with turbidity less than a threshold (we use 2.5), and
keep the best 200 images, sorted by residual error. Given an image, we compute
xc by taking the mean over the K nearest neighbors in the clear sky database,
using the angular deviation between sun positions as distance measure (we use
K = 2). This allows us to obtain a prior model of what the clear sky should look
like at the current sun position. Note that we simply could have used the values
for (a, . . . , e) from Sect. 2 and fit only the scale factors k (i) , but this tends to
over-constrain, so we fit t as well to remain as faithful to the data as possible.
To obtain the weights wp in (11), the color distance λ between each pixel and
the prior model is computed and mapped to the [0, 1] interval with an inverse
exponential: wp = exp{−λ2 /σ 2 } (we use σ 2 = 0.01 throughout this paper). After
the optimization is over, we re-estimate wp based on the new parameters x, and
repeat the process until convergence, or until a maximum number of iterations
Fig. 5. Sky-cloud separation example results. First row: input images (radiometrically
corrected). Second row: sky layer. Third row: cloud segmentation. The clouds are color-
coded by weight: 0 (blue) to 1 (red). Our fitting algorithm is able to faithfully extract
the two layers in all these cases.
is reached. The process typically converges in 3 iterations, and the final value for w_p is used as the cloud segmentation. Cloud coverage is then computed as (1/|P|) Σ_{p∈P} w_p.
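The alternation between fitting and reweighting described above can be summarized in the following sketch. The routines `fit_sky_parameters` (the minimization of (11)) and `color_distance` (the per-pixel distance λ to the prior model) are hypothetical placeholders, since their exact implementations are not spelled out here.

```python
import numpy as np

def cloud_segmentation(sky_pixels, x_prior, fit_sky_parameters, color_distance,
                       sigma2=0.01, max_iters=10, tol=1e-3):
    """Alternate between fitting the sky model (Eq. 11) and re-estimating the
    per-pixel weights w_p; returns the parameters x, the final weights w and
    the cloud coverage (1/|P|) * sum_p w_p."""
    w = np.ones(len(sky_pixels))
    x = x_prior
    for _ in range(max_iters):
        x = fit_sky_parameters(sky_pixels, w, x_prior)   # minimize Eq. (11)
        lam = color_distance(sky_pixels, x)              # per-pixel distance to the model
        w_new = np.exp(-lam ** 2 / sigma2)               # inverse exponential mapping to [0, 1]
        if np.max(np.abs(w_new - w)) < tol:              # stop on convergence
            w = w_new
            break
        w = w_new
    coverage = float(np.mean(w))                         # (1/|P|) * sum_p w_p
    return x, w, coverage
```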
Fig. 6. More challenging cases for the sky-cloud separation, and failure cases. First
row: input images (radiometrically corrected). Second row: sky layer. Third row: cloud
layer. The clouds are color-coded by weight: 0 (blue) to 1 (red). Even though the sky
is more than 50% occluded in the input images, our algorithm is able to recover a good
estimate of both layers. The last two columns illustrate a failure case: the sun (either
when very close or in the camera field of view) significantly alters the appearance of
the pixels such that they are labeled as clouds.
statistics in order to find skies that have similar properties. We first present our
novel bi-layered representation for sky and clouds, which we then use to define
a similarity measure between two images. We then present qualitative matching
results on real image sequences.
Fig. 7. Sky matching results across different cameras. The left-most column shows
several images taken from different days of sequence 466 in the AMOS database. The
three other columns are the nearest-neighbor matches in sequences 257, 407 and 414
respectively, obtained using our distance measure. Sky conditions are well-matched,
even though cameras have different parameters.
Once this layered sky representation is computed, similar images can be re-
trieved by comparing their turbidities and cloud statistics (we use χ2 distance
for histogram comparison). A combined distance is obtained by taking the sum
of cloud and turbidity distance, with the relative importance between the two
determined by the cloud coverage.
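One plausible way to realize such a combined distance is sketched below; the χ² histogram distance follows the text, but the specific coverage-dependent weighting `alpha` is our own assumption, since the exact mixing rule is not given.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two histograms."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def sky_distance(turbidity1, cloud_hist1, coverage1,
                 turbidity2, cloud_hist2, coverage2):
    """Combined distance: chi-squared distance between cloud-weight histograms
    plus a turbidity term, mixed according to the cloud coverage (assumed rule)."""
    d_cloud = chi2_distance(cloud_hist1, cloud_hist2)
    d_turb = abs(turbidity1 - turbidity2)
    alpha = 0.5 * (coverage1 + coverage2)   # mostly cloudy: rely on cloud statistics
    return alpha * d_cloud + (1.0 - alpha) * d_turb
```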
7 Summary
In this paper, we explore the following question: what information about the
camera is available in the visible sky? We show that, even if a very small portion
of the hemisphere is visible, we can reliably estimate three important camera
parameters by observing the sky over time. We do so by expressing a well-
known physically-based sky model in terms of the camera parameters, and by
fitting it to clear sky images using standard minimization techniques. We then
demonstrate the accuracy of our approach on synthetic and real data. Once the
camera parameters are estimated, we show how we can use the same model to
segment out clouds from sky and build a novel bi-layered representation, which
can then be used to find similar skies across different cameras.
We plan to use the proposed sky illumination model to see how it can help
us predict the illumination of the scene. We expect that no parametric model
will be able to capture this information well enough, so data-driven methods will
become even more important.
Acknowledgements
This research is supported in part by an ONR grant N00014-08-1-0330 and NSF
grants IIS-0643628, CCF-0541307 and CCF-0541230. A. Efros is grateful to the
WILLOW team at ENS Paris for their hospitality.
References
1. Weiss, Y.: Deriving intrinsic images from image sequences. In: IEEE International
Conference on Computer Vision (2001)
2. Sunkavalli, K., Matusik, W., Pfister, H., Rusinkiewicz, S.: Factored time-lapse
video. ACM Transactions on Graphics (SIGGRAPH 2007) 26(3) (August 2007)
3. Kim, S.J., Frahm, J.M., Pollefeys, M.: Radiometric calibration with illumination
change for outdoor scene analysis. In: IEEE Conference on Computer Vision and
Pattern Recognition (2008)
4. Jacobs, N., Satkin, S., Roman, N., Speyer, R., Pless, R.: Geolocating static cameras.
In: IEEE International Conference on Computer Vision (2007)
5. Jacobs, N., Roman, N., Pless, R.: Consistent temporal variations in many outdoor
scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (2007)
6. Sunkavalli, K., Romeiro, F., Matusik, W., Zickler, T., Pfister, H.: What do color
changes reveal about an outdoor scene? In: IEEE Conference on Computer Vision
and Pattern Recognition (2008)
7. Perez, R., Seals, R., Michalsky, J.: All-weather model for sky luminance distribution
– preliminary configuration and validation. Solar Energy 50(3), 235–245 (1993)
8. Yu, Y., Malik, J.: Recovering photometric properties of architectural scenes from
photographs. Proceedings of ACM SIGGRAPH 1998 (July 1998)
9. Preetham, A.J., Shirley, P., Smits, B.: A practical analytic model for daylight.
Proceedings of ACM SIGGRAPH 1999 (August 1999)
10. Jacobs, N., Roman, N., Pless, R.: Toward fully automatic geo-location and geo-
orientation of static outdoor cameras. In: Workshop on applications of computer
vision (2008)
11. Committee, C.T.: Spatial distribution of daylight – luminance distributions of var-
ious reference skies. Technical Report CIE-110-1994, International Commission on
Illumination (1994)
12. Ineichen, P., Molineaux, B., Perez, R.: Sky luminance data validation: comparison
of seven models with four data banks. Solar Energy 52(4), 337–346 (1994)
13. Reda, I., Andreas, A.: Solar position algorithm for solar radiation applications.
Technical Report NREL/TP-560-34302, National Renewable Energy Laboratory
(November 2005)
14. Lalonde, J.F., Narasimhan, S.G., Efros, A.A.: Camera parameters estimation from
hand-labelled sun positions in image sequences. Technical Report CMU-RI-TR-08-
32, Robotics Institute. Carnegie Mellon University (July 2008)
15. Lin, S., Gu, J., Yamazaki, S., Shum, H.Y.: Radiometric calibration from a single
image. In: IEEE Conference on Computer Vision and Pattern Recognition (2004)
16. Lalonde, J.F., Hoiem, D., Efros, A.A., Rother, C., Winn, J., Criminisi, A.: Photo
clip art. ACM Transactions on Graphics (SIGGRAPH 2007) 26(3) (August 2007)
Three Dimensional Curvilinear Structure Detection
Using Optimally Oriented Flux
Abstract. This paper proposes a novel curvilinear structure detector, called Op-
timally Oriented Flux (OOF). OOF finds an optimal axis on which image gradi-
ents are projected in order to compute the image gradient flux. The computation
of OOF is localized at the boundaries of local spherical regions. It avoids con-
sidering closely located adjacent structures. The main advantage of OOF is its
robustness against the disturbance induced by closely located adjacent objects.
Moreover, the analytical formulation of OOF introduces no additional computa-
tion load as compared to the calculation of the Hessian matrix which is widely
used for curvilinear structure detection. It is experimentally demonstrated that
OOF delivers accurate and stable curvilinear structure detection responses under
the interference of closely located adjacent structures as well as image noise.
1 Introduction
structure direction. Bouix et al. proposed to compute the image gradient flux for extracting centerlines of curvilinear structures [3]. Siddiqi et al. [15] showed promising vascular segmentation results by evolving an image gradient flux driven active surface model. But the major disadvantage of the image gradient flux is that it disregards directional information.
Grounded on the multiscale Hessian matrix, Sato et al. [12] presented a thorough study on the properties of the eigenvalues extracted from the Hessian matrix at
different scales, and their performance in curvilinear structure segmentation and visu-
alization. The study showed that the eigenvalues extracted from the Hessian matrix can
be regarded as the results of convolving the image with the second derivative of a Gaus-
sian function. This function offers differential effects which compute the difference
between the intensity inside an object and in the vicinity of the object. However, if the
intensity around the objects is not homogeneous due to the presence of closely located
adjacent structures, the differential effect given by the second derivatives of Gaussian
is adversely affected.
In this paper, we propose a novel detector of curvilinear structures, called optimally
oriented flux (OOF). Specifically, the oriented flux encodes directional information by
projecting the image gradient along some axes, prior to measuring the amount of the
projected gradient that flows in or out of a local spherical region. Meanwhile, OOF dis-
covers the structure direction by finding an optimal projection axis which minimizes
the oriented flux. OOF is evaluated for each voxel in the entire image. The evaluation
of OOF is based on the projected image gradient at the boundary of a spherical region
centered at a local voxel. When the local spherical region boundary touches the object
boundary of a curvilinear structure, the image gradient at the curvilinear object bound-
ary produces an OOF detection response. Depending on whether the voxels inside the
local spherical region have stronger intensity, the sign of the OOF detection response
varies. It can be utilized to distinguish between regions inside and outside curvilinear
structures.
The major advantage of the proposed method is that the OOF based detection is lo-
calized at the boundary of the local spherical region. Distinct from the Hessian matrix,
OOF does not consider the region in the vicinity of the structure where a nearby ob-
ject is possibly present. As such, OOF detection result is robust against the disturbance
introduced by closely located objects. With this advantage, utilizing OOF for curvilin-
ear structure analysis is highly beneficial when closely located structures are present.
Moreover, the computation of OOF does not introduce additional computation load
compared to the Hessian matrix. Validated by a set of experiments, OOF is capable of
providing more accurate and stable detection responses than the Hessian matrix, with
the presence of closely located adjacent structures.
2 Methodology
2.1 Optimally Oriented Flux (OOF)
The notion of oriented flux along a particular direction refers to the amount of image
gradient projected along that direction at the surface of an enclosed local region. The
image gradient can flow either in or out of the enclosed local region. Without loss of
370 M.W.K. Law and A.C.S. Chung
generality, our elaboration focuses on the situation where the structures have stronger
intensity than background regions. As such, optimally oriented flux (OOF) aims at find-
ing an optimal projection direction that minimizes the inward oriented flux for the de-
tection of curvilinear structure.
The outward oriented flux along a direction ρ̂ is calculated by projecting the image
gradient v(·) along the direction of ρ̂ prior to the computation of flux in a local spherical
region Sr with radius r. Based on the definition of flux [13], the computation of the
outward oriented flux along the direction of ρ̂ is,
$$f(x; r, \hat{\rho}) = \oint_{\partial S_r} \left( \left( v(x + h) \cdot \hat{\rho} \right) \hat{\rho} \right) \cdot \hat{n} \; dA, \qquad (1)$$
where dA is the infinitesimal area on ∂S_r and n̂ is the outward unit normal of ∂S_r at the position h. As ∂S_r is a sphere surface, h = r n̂, thus
$$f(x; r, \hat{\rho}) = \oint_{\partial S_r} \sum_{k=1}^{3} \sum_{l=1}^{3} v_k(x + r\hat{n})\, \rho_k \rho_l n_l \; dA = \hat{\rho}^T Q_{r,x}\, \hat{\rho}, \qquad (2)$$

where Q_{r,x} is a matrix whose entry at the ith row and jth column (i, j ∈ {1, 2, 3}) is

$$q_{i,j}^{r,x} = \oint_{\partial S_r} v_i(x + r\hat{n})\, n_j \; dA. \qquad (3)$$
where aˆ1 , aˆ2 and aˆ3 are the unit vectors along the x-, y- and z-directions respectively.
Assuming that v is continuous, by the divergence theorem,
$$q_{i,j}^{r,x} = \int_{S_r} \nabla \cdot \left[ v_i(x + y)\, \hat{a}_j \right] dV = \int_{S_r} \frac{\partial}{\partial \hat{a}_j}\, v_i(x + y)\, dV, \qquad (6)$$
where y is the position vector inside the sphere Sr and dV is the infinitesimal volume in
Sr . The continuous image gradient v(x) is acquired by convolving the discrete image
with the first derivatives of Gaussian with a small scale factor, i.e. vi (x) = (gaˆi ,σ ∗
I)(x), where ∗ is the convolution operator, gaˆi ,σ is the first derivative of Gaussian
along the direction of aˆi and σ = 1 in all our implementations. Furthermore, the volume
integral of Equation 6 is extended to the entire image domain Ω by employing a step function
$$b_r(x) = \begin{cases} 1, & \|x\| \le r, \\ 0, & \text{otherwise}, \end{cases}$$
hence
$$q_{i,j}^{r,x} = \int_{\Omega} b_r(y) \left( (g_{\hat{a}_i \hat{a}_j, \sigma} * I)(x + y) \right) dV = \left( b_r * g_{\hat{a}_i \hat{a}_j, \sigma} \right)(x) * I(x), \qquad (7)$$
where gaˆi aˆj ,σ is the second derivative of Gaussian along the axes aˆi and aˆj . Therefore,
the set of linear filters of Equation 4 is ψr,i,j (x) = (br ∗ gaˆi ,aˆj ,σ )(x). The next step is to
obtain the analytical Fourier expression of ψr,i,j (x) in order to compute the convolution
in Equation 4 by Fourier coefficient multiplication. Let Ψ_{r,i,j}(u) denote the Fourier expression of ψ_{r,i,j}(x), where u = (u_1, u_2, u_3)^T is the position vector in the frequency domain. The values of u_1, u_2 and u_3 are in "cycles per unit voxel" and in a range of
[−0.5, 0.5). By employing Fourier transforms on gaˆi aˆj ,σ and Hankel transforms [4] on
br (x),
$$\Psi_{r,i,j}(u) = 4\pi r\, \frac{u_i u_j}{\|u\|^2}\, e^{-2(\pi \|u\| \sigma)^2} \left( \cos(2\pi r \|u\|) - \frac{\sin(2\pi r \|u\|)}{2\pi r \|u\|} \right). \qquad (8)$$
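For illustration, a minimal NumPy sketch of this computation is given below. It builds the frequency grid in cycles per unit voxel, evaluates Ψ_{r,i,j}(u) of Equation 8, and multiplies it with the Fourier coefficients of the image to obtain q^{r,x}_{i,j} of Equation 7 at every voxel. It assumes a 3-D volume, the default σ = 1, and an FFT-based (circular) convolution; the function name is ours, not the paper's.

```python
import numpy as np

def oof_tensor(volume, r, sigma=1.0):
    """Compute q_{i,j}^{r,x} (Eq. 7) for every voxel of a 3-D volume via the
    Fourier expression Psi_{r,i,j} of Eq. 8. Returns an array of shape
    volume.shape + (3, 3)."""
    shape = volume.shape
    F = np.fft.fftn(volume)
    # frequency coordinates u1, u2, u3 in cycles per unit voxel, within [-0.5, 0.5)
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing='ij')
    u_norm = np.sqrt(sum(u ** 2 for u in freqs))
    u_norm[0, 0, 0] = 1.0          # avoid division by zero at the DC component
    radial = (4.0 * np.pi * r / u_norm ** 2
              * np.exp(-2.0 * (np.pi * u_norm * sigma) ** 2)
              * (np.cos(2.0 * np.pi * r * u_norm)
                 - np.sin(2.0 * np.pi * r * u_norm) / (2.0 * np.pi * r * u_norm)))
    radial[0, 0, 0] = 0.0          # the oriented-flux filter has no DC response
    Q = np.empty(shape + (3, 3))
    for i in range(3):
        for j in range(i, 3):
            Psi = radial * freqs[i] * freqs[j]        # Eq. (8)
            q_ij = np.real(np.fft.ifftn(Psi * F))     # Eq. (7) as an FFT convolution
            Q[..., i, j] = q_ij
            Q[..., j, i] = q_ij
    return Q
```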
Based on the above formulation, the optimal projection axis which minimizes the inward oriented flux can be computed analytically. Denoting the optimal direction as ω_{r,x}, minimizing the inward oriented flux is equivalent to maximizing f(x; r, ω_{r,x}) subject to the constraint ||ω_{r,x}|| = [ω_{r,x}]^T ω_{r,x} = 1. The solution is found by taking the first derivative of the Lagrange equation,

$$\mathcal{L}(\omega_{r,x}) = [\omega_{r,x}]^T Q_{r,x}\, \omega_{r,x} + \lambda_{r,x} \left( 1 - [\omega_{r,x}]^T \omega_{r,x} \right), \qquad (9)$$

and setting ∇L(ω_{r,x}) = 0. Since q_{i,j}^{r,x} = q_{j,i}^{r,x} (see Equation 7) and thus Q_{r,x} = Q_{r,x}^T, this yields

$$Q_{r,x}\, \omega_{r,x} = \lambda_{r,x}\, \omega_{r,x}. \qquad (10)$$
Equation 10 is in turn solved as a generalized eigenvalue problem. For volumetric images, there are at most three distinct pairs of λ_{r,x} and ω_{r,x}. The eigenvalues can be positive, zero or negative. These eigenvalues are denoted as λ_i(x; r), for λ_1(·) ≤ λ_2(·) ≤ λ_3(·), and the corresponding eigenvectors are ω_i(x; r). Inside a curvilinear structure having stronger intensity than the background, the first two eigenvalues are much smaller than the third one, λ_1(·) ≤ λ_2(·) << λ_3(·) and λ_3(·) ≈ 0. The first two eigenvectors span the normal plane of the structure and the third eigenvector is the structure direction.
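Since Q_{r,x} is symmetric, the per-voxel eigen-decomposition can be obtained with a standard symmetric eigensolver; the short sketch below (names are ours) returns eigenvalues in the ascending order λ_1 ≤ λ_2 ≤ λ_3 used in the text.

```python
import numpy as np

def oof_eigen(Q):
    """Per-voxel eigen-decomposition of the symmetric matrices Q_{r,x}.
    Q has shape (..., 3, 3); np.linalg.eigh returns eigenvalues in ascending
    order, matching lambda_1 <= lambda_2 <= lambda_3 in the text."""
    eigvals, eigvecs = np.linalg.eigh(Q)
    # eigvecs[..., :, 2] is the third eigenvector, i.e. the estimated structure
    # direction inside bright curvilinear structures (where lambda_3 ~ 0)
    return eigvals, eigvecs
```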
sphere surface (∂Sr in Equation 3). In contrast, as pointed out by Sato et al. [12], the
computation of the Hessian matrix is closely related to the results of applying the sec-
ond derivative of Gaussian function on the image. This function computes the weighted
intensity average difference between the regions inside the structure and in the vicinity
of the structure. As such, the coverage of this function extends beyond the boundary
of target structures and possibly includes structures nearby. As a result, the weighted
intensity average difference computed by the second derivative of Gaussian function
can be affected by the adjacent objects. It can be harmful to the detection accuracy of
the Hessian matrix when closely located adjacent structures are present.
On the contrary, the evaluation of OOF is performed on the boundary of a local spher-
ical region ∂Sr . Detection response of OOF is induced from the intensity discontinuities
at the object boundary when the local sphere surface touches the object boundary of the
structure. The detection of OOF is localized at the boundary of the local spherical re-
gion. The localized detection avoids the inclusion of objects nearby. Therefore, the OOF
based detection is robust against the disturbance introduced by closely located adjacent
structures.
The eigenvalues extracted from OOF are the values of oriented flux along the corresponding eigenvectors,

$$\lambda_i(x; r) = f(x; r, \omega_i(x; r)) = [\omega_i(x; r)]^T Q_{r,x}\, \omega_i(x; r). \qquad (11)$$
The image gradient at the object boundary of a strong intensity curvilinear structure
points to the centerline of the structure. Inside the structure, when the local spherical
region boundary ∂Sr (see Equation 1) touches the object boundary, at the contacting
position of these two boundaries, the image gradient v(·) is aligned in the opposite
direction of the outward normal n̂, hence λ_1(·) ≤ λ_2(·) << 0. On the other hand, since the image gradient is perpendicular to the structure direction, the projected image gradient along ω_3(·) has zero or very small magnitude, thus λ_3(·) ≈ 0. In contrast, if OOF
is computed for a voxel which is just outside the curvilinear structure, at the position
where ∂Sr touches the curvilinear structure boundary, the image gradient v(·) is in the
same direction as the outward normal n̂. It results in a large positive eigenvalue, that is
λ3 (·) >> 0.
Combining multiple eigenvalues to tailor a measure for identifying structures in a
specific shape is now possible. For instance Λ12 (x; r) = λ1 (x; r) + λ2 (x; r) can pro-
vide responses at the curvilinear object centerline with circular cross section. According
to Equations 1 and 11,
Fig. 1. (a, b, c) The values of ||[W_{12}(·)]^T n̂||. (d) The intensity scale of the images in (a-c).

$$\Lambda_{12}(x; r) = \oint_{\partial S_r} \left( [W_{12}(x; r)]^T v(x + h) \right) \cdot \left( [W_{12}(x; r)]^T \hat{n} \right) dA,$$
where W_{12}(x; r) = [ω_1(x; r)  ω_2(x; r)]. The term involving the projection of n̂ in the second half of the surface integral of the above equation is independent of the image gradient. This term varies along the boundary of the spherical region ∂S_r. It is a
weighting function that makes the projected image gradients at various positions on the
sphere surface contribute differently to the resultant values of Λ12 (x; r). The values of
||[W12 (x; r)]T n̂|| on the local spherical region surface are shown in Figures 1a-c. A
large value of ||[W12 (x; r)]T n̂|| represents the region where Λ12 (x; r) is sensitive, as
the projected image gradient at that region receives a higher weight for the computa-
tion of Λ12 (x; r). The large valued regions of ||[W12 (x; r)]T n̂|| are distributed in a
ring shape around the axis ω 3 (x; r). In a curvilinear structure having circular cross sec-
tion, the image gradient at the object boundary points to the centerline of the structure.
Therefore, at the centerline of the structure, Λ12 (x; r) delivers the strongest response if
r and the radius of the structure are matched.
Finally, it is worth mentioning that the elaboration of Λ12 (·) merely demonstrates
a possibility to integrate different eigenvalues to facilitate the analysis of curvilinear
structures. It is possible to devise other combinations of eigenvalues of the proposed
method analogous to those presented in [12] and [6].
Multiscale detection is an essential technique for handling structures with various sizes.
The multiscale detection of OOF involves repetitive computations of OOF using a set
of radii (r in Equation 1). The radius set should cover both the narrowest and the widest
curvilinear structures in an image volume. Since the evaluation of OOF is localized
at the spherical region boundary, the spherical region has to touch the target structure
boundary to obtain detection responses of OOF. As such, linear radius samples should
be taken for OOF with the consideration of the voxel length in order to properly detect
vessels in a given range of radii. It also ensures that a structure with non-circular cross
section can induce detection responses of OOF obtained in at least one radius sample.
We suggest that radius samples are taken in every 0.5 voxel length according to the
Nyquist sampling rate.
For different values of r, the area covered by the surface integral of Equation 1
varies. Dividing the computation result of Equation 1 by 4πr2 (the surface area of the
spherical region) is an appropriate means to normalize the detection response over radii and hence make the computation of Equation 1 scale-invariant. Such normalization is es-
sential to aggregating OOF responses in a multiple scale setting. For the same rea-
son, the eigenvalues of Qr,x , λi (r, x) are divided by 4πr2 prior to being utilized in
any multiscale framework. This OOF normalization scheme is distinct to the average-
outward-flux (AOF) measure [5], which divides the outward flux by the surface area of
the spherical region to attain the AOF-limiting-behavior. The AOF measure works only
on the gradient of a distance function of a shape with its boundary clearly delineated.
OOF, in contrast, is applied to a gradient of a gray-scale image, where no explicit shape
boundary is embedded and noise is possibly present.
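A sketch of this multiscale aggregation, using the scale-normalized trace of Q_{r,x} as the response measure and keeping the most negative response over radii (as discussed below), is given here. A Q-computation routine such as the one sketched earlier is passed in as `oof_tensor`; the names and the zero initialization are our own choices.

```python
import numpy as np

def multiscale_oof_trace(volume, radii, oof_tensor):
    """Scale-normalized trace of Q_{r,x} aggregated over a set of radii, keeping
    at every voxel the most negative response (bright structures on a dark
    background). `oof_tensor(volume, r)` returns the (..., 3, 3) tensor field."""
    best = np.zeros(volume.shape)
    for r in radii:
        Q = oof_tensor(volume, r)
        trace = np.trace(Q, axis1=-2, axis2=-1) / (4.0 * np.pi * r ** 2)
        keep = trace < best                  # larger magnitude among negative values
        best[keep] = trace[keep]
    return best

# radii sampled every 0.5 voxel length, as suggested above:
# response = multiscale_oof_trace(volume, np.arange(1.0, 6.5, 0.5), oof_tensor)
```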
Fig. 2. Examples of evaluating OOF using multiple radii. (a, b) The slices of z = 0 (left) and
x = 0 (right) of two synthetic image volumes consisting of synthetic tubes with a radius of 4
voxel length. C1 and C2 are the positions of the centers of the tubes. L1, R1 and L2, R2 are the
positions of the boundaries of the tubes centered at C1 and C2 respectively. (b) The width of the
separation between the closely located tubes is 2 voxel length. (c, d) The intensity profiles along
the line x = 0, z = 0 of the synthetic image volumes shown in (a) and (b) respectively. (e, f) The
normalized trace of Qr,x along the line x = 0, z = 0 of the image volumes shown in (a) and (b)
respectively.
In Figures 2a-f, we show two examples of evaluating OOF on image volumes con-
sisting of one synthetic tube (Figures 2a and c) and two closely located synthetic tubes
(Figures 2b and d) using multiple radii. The normalized trace of the matrix Qr,x (Equa-
tions 9), which is equal to the sum of the normalized eigenvalues of Qr,x , is utilized
to quantify the detection response strength of OOF. The normalized trace of the matrix
Qr,x is computed using multiple radii in both of the synthetic image volumes. In Fig-
ures 2e and f, it is observed that the normalized trace of Qr,x is negative for all radii
inside the tubes. It attains its maximal negative values at the tube centers and with the
radius r matching the tube radius, i.e. r = 4. The magnitudes of the normalized trace
of Qr,x with r = 4 decline at positions away from the tube centers. In these positions,
it attains its maximal magnitudes with smaller values of r when approaching the tube
boundaries. Therefore, making use of the normalized trace of Qr,x as well as the nor-
malized eigenvalues of Qr,x , (the trace of Qr,x is equal to the sum of its eigenvalues),
with maximal negative values or maximal magnitudes over radii is capable of delivering strong detection responses inside curvilinear structures.
When OOF is computed using multiple radii, the spherical regions of OOF with
large radii possibly overshoot the narrow structure boundaries. The computation of OOF
with overshot radii can include the objects nearby and adversely affects the detection
responses of OOF (see Figure 2e, r = 5 and 6 versus Figure 2f, r = 5 and 6). In such cases,
utilizing the normalized eigenvalues or the normalized trace of the matrix Qr,x with
the maximal negative values or maximal magnitudes over radii as mentioned above can
eliminate the responses obtained by using overshot radii. Furthermore, it excludes the
OOF responses associated with undersized radii at the center of curvilinear structures
(see Figures 2e and f, r = 1, 2 and 3). In the case that the radius of the spherical region
r matches the target structures, OOF avoids the inclusion of objects nearby. It therefore
reports the same response at the centerlines of the tubes with r = 4 despite the presence
of closely located structures (see Figure 2e, r = 4 versus Figure 2f, r = 4).
3 Experimental Results
In this section, we compare the performance of OOF and the Hessian matrix by using
both synthetic data and real clinical cases. The differential terms of the Hessian matrix
are obtained by employing the central mean difference scheme on the image smoothed
by a Gaussian kernel with a given scale factor. The eigenvalues and eigenvectors extracted from the Hessian matrix and from Q_{r,x} for OOF (Equation 10) are represented as λ^H_i(x; r), ω^H_i(x; r) and λ^Q_i(x; r), ω^Q_i(x; r), respectively. The order of the eigenvalues and the notation for the sums of the first two eigenvalues (Λ^H_12(x; r) and Λ^Q_12(x; r)) are analogous to those described in Section 2.2.
Fig. 3. The description of the tori. These tori have been used in the synthetic data experiments.
The center of the tori in each layer is randomly selected from the positions of (x = 35, y =
35), (x = 45, y = 35), (x = 35, y = 45) and (x = 45, y = 45). The values of d and R are
fixed to generate a torus image. In the experiments, there are 10 torus images generated by using
10 pairs of {d, R}, {2, 1}, {2, 2}, {2, 3}, {2, 4}, {2, 5}, {5, 1}, {5, 2}, {5, 3}, {5, 4} and {5, 5}.
of that torus is randomly selected among the positions (x = 45, y = 45, z = 8),
(x = 35, y = 45, z = 8), (x = 45, y = 35, z = 8) and (x = 35, y = 35, z = 8). We
keep deploying adjacent tori centered at the same position of the first torus but having
larger values of D, in intervals of 2R + d, while D ≤ 42. Each successive layer of tori
is generated in a 2R + d interval of altitude z for z ≤ 90. The center of each layer of
tori is randomly selected among the positions of (x = 35, y = 35), (x = 45, y = 35),
(x = 35, y = 45) and (x = 45, y = 45). The background intensity of these images
is 0 and the intensity inside the tori is assigned to 1. The torus images are smoothed
by a Gaussian kernel with scale factor 1 to mimic the smooth intensity transition from
structures to background. Each synthetic image is corrupted by two levels of additive
Gaussian noise, with standard deviations of σnoise = {0.75, 1}. Finally, 20 testing cases
are generated for this experiment.
The experiment results are based on the measures obtained at the estimated object scales of both methods. For testing objects with circular cross section such as the tori used in this experiment, computing the sums of the first two eigenvalues Λ^H_12(·) and Λ^Q_12(·) at structure centerlines is useful to determine the structure scales. The reason is that Λ^H_12(·) of the Hessian matrix quantifies the second order intensity change occurring along the radial direction of a circle on the normal plane of the structure. Meanwhile, for OOF, Λ^Q_12(·) evaluates the amount of gradient pointing to the centerlines of tubes with
circular cross section. Based on the above observation, the object scale is obtained as
$S^H_x = \arg\max_{s \in E} \left( -\frac{s^2}{3}\, \Lambda^H_{12}\!\left(x; \frac{s}{\sqrt{3}}\right) \right)$ for the Hessian matrix (see [7,14] for details regarding the structure scale detection and [10] for Hessian matrix based feature normalization over scales) and $S^Q_x = \arg\max_{s \in F} \left( -\frac{1}{4\pi s^2}\, \Lambda^Q_{12}(x; s) \right)$ for OOF. The set of discrete detection
scales of OOF and detection scales of the Hessian matrix are represented as F and E
respectively. These scales cover the structure radii ranged from 1 to 6 voxel length. The
radii of OOF are taken for each 0.5 voxel length and there are in total 11 different radii in
F . Meanwhile, the same number of scales are logarithmically sampled for the Hessian
matrix scale set E so as to minimize the detection error of the Hessian matrix [12].
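Given precomputed stacks of Λ_12 responses, one per detection scale, the two scale selectors can be evaluated per voxel as sketched below; the array layout and names are our own assumptions, and a 3-D volume is assumed.

```python
import numpy as np

def detect_scales(L12_oof, radii_oof, L12_hessian, scales_hessian):
    """Per-voxel scale selection from response stacks over a 3-D volume.
    L12_oof[k] holds Lambda^Q_12(x; radii_oof[k]); L12_hessian[k] holds
    Lambda^H_12(x; scales_hessian[k] / sqrt(3))."""
    radii_oof = np.asarray(radii_oof, float)
    scales_hessian = np.asarray(scales_hessian, float)
    norm_q = -L12_oof / (4.0 * np.pi * radii_oof[:, None, None, None] ** 2)
    norm_h = -(scales_hessian[:, None, None, None] ** 2 / 3.0) * L12_hessian
    S_Q = radii_oof[np.argmax(norm_q, axis=0)]        # S^Q_x
    S_H = scales_hessian[np.argmax(norm_h, axis=0)]   # S^H_x
    return S_Q, S_H
```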
There are two measures being studied for the comparison of OOF and the Hessian
matrix, “Angular discrepancy” and “Response fluctuation”. For objects with circular
cross section and having stronger intensity than the background, the third eigenvector
represents the structure direction. At the estimated structure scales, we measure the
angular discrepancy of the Hessian matrix and OOF by
$$\arccos\!\left( \left| G_t \cdot \omega^H_3(x; S^H_t) \right| \right), \qquad \arccos\!\left( \left| G_t \cdot \omega^Q_3(x; S^Q_t) \right| \right), \qquad (12)$$
respectively, where Gt is the ground truth direction, which is defined as the tangent
direction of the torus inner-tube centerline at the position t, t ∈ T , where T is a set
of samples taken in every unit voxel length at the inner-tube centerlines of the tori.
Bilinear interpolation is applied if t does not fall on an integer coordinate. The value
of the angular discrepancy is in a range of [0, π/2] and a small value of the angular
discrepancy represents an accurate estimation of structure direction.
The second measure, “Response fluctuation” for the tori having circular cross section
is defined as the ratio between the variance and the mean absolute value of Λ12 (·). The
“Response fluctuation” of the Hessian matrix and OOF are defined as
Table 1. The performance of optimally oriented flux and the Hessian matrix obtained in the syn-
thetic data experiments. The entries in the columns of ”Angular discrepancy” include two values,
the mean and the standard deviation (the bracketed values) of the resultant values of Equation 12.
The values in the columns of ”Response fluctuation” are the results based on Equation 13.
d = 5, σnoise = 1 d = 2, σnoise = 1
Angular discrepancy Response fluctuation Angular discrepancy Response fluctuation
R Hessian matrix OOF Hessian matrix OOF R Hessian matrix OOF Hessian matrix OOF
1 0.518 (0.288) 0.409 (0.239) 0.321 0.291 1 0.532 (0.305) 0.414 (0.243) 0.338 0.298
2 0.331 (0.252) 0.246 (0.148) 0.210 0.200 2 0.435 (0.278) 0.319 (0.192) 0.272 0.239
3 0.204 (0.218) 0.169 (0.109) 0.129 0.105 3 0.279 (0.243) 0.200 (0.132) 0.177 0.134
4 0.112 (0.158) 0.110 (0.080) 0.089 0.080 4 0.181 (0.220) 0.125 (0.095) 0.127 0.108
5 0.107 (0.159) 0.082 (0.044) 0.073 0.061 5 0.157 (0.217) 0.097 (0.088) 0.107 0.085
$$\frac{\operatorname{Var}_{t \in T}\!\left( \Lambda^H_{12}(x; S^H_t) \right)}{\operatorname{Mean}_{t \in T}\!\left( \left| \Lambda^H_{12}(x; S^H_t) \right| \right)}, \qquad \frac{\operatorname{Var}_{t \in T}\!\left( \Lambda^Q_{12}(x; S^Q_t) \right)}{\operatorname{Mean}_{t \in T}\!\left( \left| \Lambda^Q_{12}(x; S^Q_t) \right| \right)}, \qquad (13)$$
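For completeness, both measures can be evaluated from sampled centerline data as in the sketch below; the inputs (unit direction vectors and Λ_12 values at the detected scales) are assumed to have been gathered beforehand.

```python
import numpy as np

def angular_discrepancy(ground_truth_dirs, estimated_dirs):
    """Eq. (12): arccos(|G_t . omega_3|) per centerline sample; both inputs are
    (num_samples, 3) arrays of unit vectors."""
    cos_angles = np.abs(np.sum(ground_truth_dirs * estimated_dirs, axis=1))
    return np.arccos(np.clip(cos_angles, 0.0, 1.0))

def response_fluctuation(L12_at_detected_scales):
    """Eq. (13): variance of Lambda_12 along the centerline divided by the mean
    absolute value of Lambda_12, both taken over the sample set T."""
    values = np.asarray(L12_at_detected_scales, float)
    return float(np.var(values) / np.mean(np.abs(values)))
```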
Table 2. The changes of mean angular discrepancy and response fluctuation from the case of
”d = 5, σnoise = 0.75” to the other three cases presented in Table 1
(a)
From ”d = 5, σnoise = 0.75” to ”d = 2, σnoise = 0.75”
Changes of mean angular discrepancy Changes of response fluctuation
R Hessian matrix OOF Hessian matrix OOF
1 +0.002 -0.005 +0.013 +0.006
2 +0.073 +0.048 +0.052 +0.035
3 +0.053 +0.025 +0.041 +0.023
4 +0.035 +0.024 +0.033 +0.031
5 +0.025 +0.005 +0.034 +0.012
(b)
From ”d = 5, σnoise = 0.75” to ”d = 5, σnoise = 1”
Changes of mean angular discrepancy Changes of response fluctuation
R Hessian matrix OOF Hessian matrix OOF
1 +0.112 +0.100 +0.050 +0.045
2 +0.099 +0.067 +0.044 +0.040
3 +0.095 +0.059 +0.036 +0.010
4 +0.049 +0.047 +0.030 +0.026
5 +0.053 +0.023 +0.021 +0.004
(c)
From ”d = 5, σnoise = 0.75” to ”d = 2, σnoise = 1”
Changes of mean angular discrepancy Changes of response fluctuation
R Hessian matrix OOF Hessian matrix OOF
1 +0.126 +0.104 +0.068 +0.052
2 +0.203 +0.139 +0.106 +0.079
3 +0.170 +0.090 +0.085 +0.039
4 +0.118 +0.062 +0.068 +0.054
5 +0.103 +0.037 +0.054 +0.029
matrix are larger than those of OOF, especially when R increases, in the cases that the
torus separation is reduced from 5 voxel length to 2 voxel length (the second and the
fourth columns versus the first and the third columns of Table 2a).
Moreover, in the situation where noise is increased (Table 2b), it is observed that
OOF (the second and the fourth columns) has less increment of the mean angular dis-
crepancies than the Hessian matrix (the first and the third columns), particularly when
R increases. Although the Gaussian smoothing taken by the Hessian matrix partially
eliminates noise from the image volume, the smoothing process also reduces the edge
sharpness of the structure boundaries. In particular, the scale factor of the Gaussian
smoothing process of the Hessian matrix has to rise to deal with large scale structures.
Consequently, the Hessian matrix performs detection based on the smoothed object
boundaries, which are more easily corrupted by image noise. For OOF, the detection
does not require Gaussian smoothing using a large scale factor (σ = 1 for OOF). It re-
tains the edge sharpness of the structure boundaries. Therefore, the OOF detection has
higher robustness against image noise than the Hessian matrix. As expected, when the
torus separation is reduced to 2 voxel length and the noise level is raised to σnoise = 1,
OOF has higher robustness than the Hessian matrix against the presence of both closely located adjacent structures and high level noise (Table 2c).
To summarize the results of the synthetic data experiments (Tables 1 and 2), OOF
is validated in several aspects: the structure direction estimation accuracy, the stabil-
ity of responses, the robustness against the disturbance introduced by closely located
structures and the increment of noise levels. In some applications, an accurate structure
direction estimation is vital. For instance, in vascular image enhancement, the estimated
structure direction is used to avoid smoothing along the directions across object boundaries.
Furthermore, for tracking curvilinear structure centerlines (a centerline tracking exam-
ple is given in [1]), the estimated structure direction is used to guide the centerline tracking process.
Also, small response fluctuation facilitates the extraction of curvilinear structures or the localization of object centerlines by discovering the local maxima or ridges of the response.
On the other hand, the structure direction estimation accuracy and the stability of
structure responses of OOF are robust against the reduction of structure separation and
the increment of noise levels. As such, employing OOF to provide information of curvi-
linear structures is highly beneficial for curvilinear structure analysis.
The vessel extraction results are shown in Figures 4b and c. The interesting positions
in the results are highlighted by the numbered arrows in Figures 4b and c. In the regions
pointed at by the fifth and sixth arrows in Figures 4b and c, the Hessian based method
misidentifies closely located vessels as merged structures. On the contrary, the OOF
based method is capable of discovering the small separation between the closely located
vessels. This result is consistent with the findings in the synthetic experiments, where
OOF is more robust than the Hessian matrix when handling closely located structures
(Table 2a).
In Figure 4c, it is found that several vessels with weak intensity (arrows 1, 2, 3, 4
and 7) are missed by the Hessian based method, whereas the OOF based method has no problem extracting them (Figure 4b). The reason is that the noise level relative to the weak intensity structures is higher than that relative to strong intensity structures.
Fig. 4. (a) A phase contrast magnetic resonance angiographic image volume (coronal view) with the size of 213 × 143 × 88 voxels. (b) The vessel extraction results obtained by using the optimally oriented flux based method. (c) The vessel extraction results obtained by using the Hessian matrix based method.
This is coherent with the synthetic experiments, in which OOF shows higher robustness against image noise as compared to the Hessian matrix (see Table 2b). The vessel extraction results in this real case experiment reflect that robustness against image noise is important for extracting vessels with weak intensity.
In this paper, we have presented the use of optimally oriented flux (OOF) for detecting
curvilinear structures. With the aid of the analytical Fourier expression of OOF, no dis-
cretization and orientation sampling are needed. It therefore leads to a highly efficient
computation of OOF. Computation-wise, it has the same complexity as in the compu-
tation of the commonly used approach, Hessian matrix. Furthermore, computation of
OOF is based on the image gradient at the boundary of local spheres. It focuses on the
detection of intensity discontinuities occurring at the object boundaries of curvilinear
structures.
The OOF based detection avoids including the adjacent objects. Thus, it exhibits the
robustness against the interference introduced by closely located adjacent structures.
This advantage is validated and demonstrated by a set of experiments on synthetic and
real image volumes. In addition, in the experiments, it is observed that OOF has higher structure direction estimation accuracy and more stable detection responses under the disturbance of high level image noise. With the aforementioned detection accuracy and robustness, OOF is more beneficial than the Hessian matrix for supplying information of curvilinear structures for curvilinear structure analysis.
In this paper, our current focus is on formulating OOF as a general detector for
extracting reliable information of curvilinear structures. Identifying branches, high cur-
vature curvilinear structures or distinguishing between blob-like, sheet-like and tubular
structures would involve post-processing steps of the information extracted by the curvi-
linear structure detector, such as those presented in [12]. Considering the robustness of
OOF against image noise and interference of closely located adjacent structures, tailor-
ing appropriate post-processing steps of OOF for various kinds of structures will be an
interesting direction for the future developments of this work.
References
1. Aylward, S., Bullitt, E.: Initialization, noise, singularities, and scale in height ridge traversal
for tubular object centerline extraction. TMI 21(2), 61–75 (2002)
2. Bouix, S., Siddiqi, K., Tannenbaum, A.: Flux driven fly throughs. CVPR 1, 449–454 (2003)
3. Bouix, S., Siddiqi, K., Tannenbaum, A.: Flux driven automatic centerline extraction. Me-
dIA 9(3), 209–221 (2005)
4. Bracewell, R.: The Fourier Transform and Its Application. McGraw-Hill, New York (1986)
5. Dimitrov, P., Damon, J.N., Siddiqi, K.: Flux invariants for shape. CVPR 1, I-835–I-841 (2003)
6. Frangi, A., Niessen, W., Viergever, M.: Multiscale vessel enhancement filtering. In: Wells,
W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 130–137.
Springer, Heidelberg (1998)
382 M.W.K. Law and A.C.S. Chung
7. Koller, T., Gerig, G., Szekely, G., Dettwiler, D.: Multiscale detection of curvilinear structures
in 2-d and 3-d image data. In: IEEE International Conference on Computer Vision, pp. 864–
869 (1995)
8. Krissian, K.: Flux-based anisotropic diffusion applied to enhancement of 3-d angiogram.
TMI 21(11), 1440–1442 (2002)
9. Krissian, K., Malandain, G., Ayache, N., Vaillant, R., Trousset, Y.: Model-based multiscale
detection of 3d vessels. CVPR 3, 722–727 (1998)
10. Lindeberg, T.: Edge detection and ridge detection with automatic scale selection. IJCV 30(2),
117–156 (1998)
11. Manniesing, R., Viergever, M.A., Niessen, W.J.: Vessel enhancing diffusion: a scale space representation of vessel structures. MedIA 10(6), 815–825 (2006)
12. Sato, Y., Nakajima, S., Shiraga, N., Atsumi, H., Yoshida, S., Koller, T., Gerig, G., Kikinis,
R.: Three-dimensional multi-scale line filter for segmentation and visualization of curvilinear
structures in medical images. MedIA 2(2), 143–168 (1998)
13. Schey, H.M.: div, grad, curl, and all that, 3rd edn. W.W. Norton & Company (1997)
14. Steger, C.: An unbiased detector of curvilinear structures. PAMI 20(2), 113–125 (1998)
15. Vasilevskiy, A., Siddiqi, K.: Flux maximizing geometric flows. PAMI 24(12), 1565–1578
(2002)
Scene Segmentation for Behaviour Correlation
1 Introduction
Automatic abnormal behaviour detection has been a challenging task for visual
surveillance. Traditionally, anomaly is defined according to how individuals be-
have in isolation over space and time. For example, objects can be tracked across
a scene and if a trajectory cannot be matched by a set of known trajectory model
templates, it is considered to be abnormal [1,2]. However, due to scene complex-
ity, many types of abnormal behaviour are not well defined by only analysing
how individuals behave alone. In other words, many types of anomaly definition
are only meaningful when behavioural interactions/correlations among differ-
ent objects are taken into consideration. In this paper, we present a framework
for detecting abnormal behaviour by examining correlations of behaviours from
multiple objects. Specifically, we are interested in subtle multiple object abnor-
mality detection that is only possible when behaviours of multiple objects are
interpreted in correlation as the behaviour of each object is normal when viewed
in isolation. To that end, we formulate a novel approach to representing visual
behaviours and modelling behaviour correlations among multiple objects.
In this paper, a type of behaviour is represented as a class of visual events
bearing similar features in position, shape and motion information [3]. However,
instead of using per frame image events, atomic video events as groups of image
events with shared attributes over a temporal window are extracted and utilised
as the basic units of representation in our approach. This reduces the sensitivity
of events to image noise in crowded scenes. The proposed system relies on both
globally and locally classifying atomic video events. Behaviours are inherently
context-aware, exhibited through constraints imposed by scene layout and the
temporal nature of activities in a given scene. In order to constrain the number of
meaningful behavioural correlations from potentially a very large number of all
possible correlations of all the objects appearing everywhere in the scene, we first
decompose semantically the scene into different spatial regions according to the
spatial distribution of atomic video events. In each region, events are re-clustered
into different groups with ranking on both types of events and their dominating
features to represent how objects behave locally within each region. As shown in
Section 5, by avoiding any attempt to track individual objects over a prolonged
period in space, our representation provides an object-independent representa-
tion that aims to capture categories of behaviour regardless of the contributing objects that are associated with scene location.
that such an approach is more suitable and effective for discovering unknown
and detecting subtle abnormal behaviours attributed to the unusual presence of and correlation among multiple objects.
Behavioural correlation has been studied before, although it is relatively new
compared to the more established traditional trajectory matching based tech-
niques. Xiang and Gong [3] clustered local events into groups and activities
are modelled as sequential relationships among event groups using Dynamic
Bayesian Networks. Their extended work was shown to have the capability of
detecting suspicious behaviour in front of a secured entrance [4]. However, the
types of activities modelled were restricted to a small set of events in a small
local region without considering any true sense of global context. Brand and
Kettnaker [5] attempted modelling scene activities from optical flows using a
Multi-Observation-Mixture+Counter Hidden Markov Model (MOMC-HMM). A
traffic circle at a crossroad is modelled as sequential states and each state is a
mixture of multiple activities (observations). However, their anomaly detection
is based only on how an individual behaves in isolation. How activities inter-
act in a wider context is not considered. Wang et al [6] proposed hierarchical
Bayesian models to learn visual interactions from low-level optical flow features.
However, their framework is difficult to extend to model behaviour correlation across different types of features, as adding more features would
significantly increase complexity of their models.
In our work, we model behaviour correlation by measuring the frequency of
co-occurrence of any pairs of commonly occurring behaviours both locally and re-
motely over spatial locations. An accumulated concurrence matrix is constructed
for a given training video set and matched with an instance of this matrix cal-
culated for any testing video clip in order to detect irregular object correlations
in the video clip both within the same region and across different regions in the
scene. The proposed approach enables behaviour correlation to be modelled be-
yond a local spatial neighbourhood. Furthermore, representing visual behaviours
using different dominant features at different spatial locations makes it possible
Scene Segmentation for Behaviour Correlation 385
Fig. 1. Semantic scene segmentation and behaviour correlation for anomaly detection
to discover subtle unusual object behaviour correlations that either human prior
knowledge is unaware of or it is difficult to be defined by human analysis. An
overall data flow of the system is shown in Fig. 1.
$$v_f = [x, y, w, h, r_s, r_p, u, v, r_u, r_v], \qquad (1)$$
where (x, y) and (w, h) are the centroid position and the width and height of the
bounding box respectively, rs = w/h is the ratio between width and height, rp is
the percentage of foreground pixels in a bounding box, (u, v) is the mean optic
flow vector for the bounding box, ru = u/w and rv = v/h are the scaling features
between motion information and blob shape. Clearly, some of these features are
more dominant for certain image events depending on their loci in a scene, as
they are triggered by the presence and movement of objects in those areas of the
scene. However, at this stage of the computation, we do not have any information
about the scene therefore all 10 features are used at this initial step to represent
all the detected image events across the entire scene.
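For illustration, the sketch below builds this 10-dimensional vector for a single detected foreground blob from its binary mask and the optic-flow field. Details such as using the blob centroid for (x, y) and averaging the flow over the bounding box are our own reading of the description, not specifications from the text.

```python
import numpy as np

def image_event_features(mask, flow_u, flow_v):
    """Build the 10-D feature vector of Eq. (1) for one foreground blob.
    `mask` is a boolean image of the blob; `flow_u`, `flow_v` are the optic-flow
    components over the same image."""
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    w, h = float(x1 - x0 + 1), float(y1 - y0 + 1)
    x_c, y_c = xs.mean(), ys.mean()                 # blob centroid (assumed choice)
    r_s = w / h                                     # width-to-height ratio
    box = (slice(y0, y1 + 1), slice(x0, x1 + 1))
    r_p = mask[box].mean()                          # fraction of foreground pixels in the box
    u, v = flow_u[box].mean(), flow_v[box].mean()   # mean optic flow over the box
    r_u, r_v = u / w, v / h                         # motion-to-shape scaling features
    return np.array([x_c, y_c, w, h, r_s, r_p, u, v, r_u, r_v])
```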
Given detected image events, we wish to seek a behavioural grouping of these
image events with each group associated with a similar type of behaviour. This
shares the spirit with the work of Xiang and Gong [3]. However, direct grouping
of these image events is unreliable because they are too noisy due to their spread
over a wide-area outdoor scene under variable conditions. It has been shown
by Gong and Xiang [9] that precision of feature measurements for events af-
fects strongly the performance of event grouping. When processing video data of
crowded outdoor scenes of wide-areas, variable lighting condition and occlusion
can inevitably introduce significant noise to the feature measurement. Instead
of directly grouping image events, we introduce an intermediate representation
of atomic video event which is less susceptible to scene noise.
3 Scene Segmentation
This behavioural grouping of atomic video events gives a concise and semanti-
cally more meaningful representation of a scene (top middle plot in Fig. 1). We
consider that each group represents a behaviour type in the scene. However, such
a behaviour representation is based on a global clustering of all the atomic video
events detected in the entire scene without any spatial or temporal restriction.
It thus does not provide a good model for capturing behaviour correlations more
selectively, both in terms of spatial locality and temporal dependency. In order
to impose more contextual constraints, we segment a scene semantically into
regions according to event distribution with behaviour labelling, as follows.
We treat the problem similarly to an image segmentation problem, except that
we represent each image position by a multivariate feature vector instead of RGB
values. To that end, we introduce a mapping procedure transferring features from
event domain to image domain. We assign each image pixel location of the scene
a feature vector p with K components, where K is the number of groups of
atomic video events estimated for a given scene, i.e. the number of behaviour
types automatically determined by the BIC algorithm (Section 2.3). The value
of the kth component pk is given as the count of the kth behaviour type that occurred
at this image position throughout the video. In order to obtain reliable values
of p, we use the following procedure. First of all, the behavioural type label for
an atomic video event is applied to all image events belonging to this atomic
video event. Secondly, given an image event, its label is applied to all pixels
within its rectangular bounding box. In other words, each image position is
assigned with a histogram of different types of behaviours occurred at that pixel
location for a given video. Moreover, because we perform scene segmentation by
activities, those locations without or with few activities are to be removed from
the segmentation procedure. For doing this, we apply a lower bound threshold
T Hp to the number of events that happened at each pixel location, i.e. the sum of
component values of p. Finally the value of this K dimensional feature vector p
at each pixel location is scaled to [0, 1] for scene segmentation.
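A sketch of this accumulation is given below; events are assumed to arrive as (bounding box, behaviour label) pairs, and scaling each active location's histogram by its maximum component is our reading of "scaled to [0, 1]", not necessarily the exact normalization used.

```python
import numpy as np

def behaviour_histograms(frame_shape, events, K, TH_p):
    """Accumulate, at every pixel, a K-bin histogram of behaviour types and keep
    only locations with at least TH_p events (TH_p >= 1 assumed). `events` is an
    iterable of ((x0, y0, x1, y1), label) pairs with integer labels in [0, K)."""
    H, W = frame_shape
    hist = np.zeros((H, W, K))
    for (x0, y0, x1, y1), label in events:
        hist[y0:y1 + 1, x0:x1 + 1, label] += 1      # spread the label over the box
    active = hist.sum(axis=2) >= TH_p               # drop locations with few events
    p = np.zeros_like(hist)
    # scale each active location's K-dimensional vector to [0, 1]
    p[active] = hist[active] / hist[active].max(axis=1, keepdims=True)
    return p, active
```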
With this normalised behavioural histogram representation in the image do-
main, we employ a spectral clustering technique modified from the method pro-
posed by Zelnik-Manor and Perona [12]. Given a scene in which there are N locations with activities, an N × N affinity matrix A is constructed, and the similarity between
the features at the ith position and the jth position is computed according to
Eqn. (3),
$$A(i, j) = \begin{cases} \exp\left( -\dfrac{(d(p_i, p_j))^2}{\sigma_i \sigma_j} \right) \exp\left( -\dfrac{(d(x_i, x_j))^2}{\sigma_x^2} \right), & \text{if } \|x_i - x_j\| \le r \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where pi and pj are feature vectors at the ith and the jth locations, d represents
Euclidean distance, σi and σj correspond to the scaling factors for the feature
vectors at the ith and the jth positions, xi and xj are the coordinates and σx
is the spatial scaling factor. r is the radius indicating a circle only within which,
similarity is computed.
Proper computation of the scaling factors is a key for reliable spectral cluster-
ing. The original Zelnik-Perona’s method computes σi using the distance between
the current feature and the feature for a specific neighbour. This setting is very
arbitrary and we will show that it suffers from under-fitting in our experiment. In
order to capture more accurate statistics of local feature similarities, we compute
σi as the standard deviation of feature distances between the current location
and all locations within a given radius r. The scaling factor σx is computed as
the mean of the distances between all positions and the circle center within the
radius r. The affinity matrix is then normalised according to:
$$\bar{A} = L^{-\frac{1}{2}} A L^{-\frac{1}{2}} \qquad (4)$$

where L is a diagonal matrix with $L(s, s) = \sum_{t=1}^{N} A(s, t)$. Ā is then used as
the input to the Zelnik-Perona’s algorithm which automatically determines the
number of segments and performs segmentation. This procedure groups those
pixel locations with activities into M regions for a given scene.
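Putting Equations 3 and 4 together with the modified scaling factors, the affinity construction can be sketched as follows; dense N × N arrays are used for clarity, and the exact computation of σ_x is our approximation of the description above.

```python
import numpy as np

def normalised_affinity(p, coords, r):
    """Affinity matrix of Eq. (3) over N active locations, normalised as in
    Eq. (4). `p` is an (N, K) array of behaviour histograms, `coords` an (N, 2)
    array of pixel coordinates, `r` the neighbourhood radius."""
    N = len(p)
    feat_d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)       # d(p_i, p_j)
    spat_d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    within = spat_d <= r
    # sigma_i: std of feature distances to all locations within radius r
    sigma = np.array([feat_d[i, within[i]].std() + 1e-10 for i in range(N)])
    # sigma_x: mean spatial distance within the radius (approximation)
    sigma_x = np.mean([spat_d[i, within[i]].mean() for i in range(N)]) + 1e-10
    A = np.exp(-feat_d ** 2 / np.outer(sigma, sigma)) \
        * np.exp(-spat_d ** 2 / sigma_x ** 2)
    A[~within] = 0.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1e-10))
    return D_inv_sqrt @ A @ D_inv_sqrt              # A_bar = L^{-1/2} A L^{-1/2}
```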
5 Experiments
We evaluated the performance of the proposed system using video data captured
from two different public road junctions (Scene-1 and Scene-2). Example frames
are shown in Fig. 2. Scene-1 is dominated by three types of traffic patterns:
the vertical traffic, the leftward horizontal traffic and the rightward traffic, from
multiple entry and exit points. In addition, vehicles are allowed to stop between
the vertical traffic lanes waiting for turning right or left. In Scene-2, vehicles
usually move in from the entrances near the left boundary and near the right
bottom corner. They move towards the exits located on the top, at left bottom
corner and near the right boundary. Both videos were recorded at 25Hz and have
a frame size of 360×288 pixels.
[Table: the dominant features, among x, y, w, h, rs, rp, u, v, ru and rv, selected for each of the regions R1–R6 of Scene-1.]
distinguished by colour and labels. After mapping from feature domain to image
domain, the modified Zelnik-Manor and Perona’s image segmentation algorithm
was then used to segment Scene-1 and Scene-2 into 6 regions and 9 regions,
respectively, as shown in Fig. 4 (b) and (e). For comparison, we also segmented
the scenes using Zelnik-Manor and Perona’s original algorithm (ZP) which re-
sulted in 4 segments for Scene-1 and 2 segments for Scene-2 (Fig. 4 (c) and
(f)). It is evident that Zelnik-Manor and Perona’s original algorithm suffered
from under-fitting severely and was not able to segment those scenes correctly
according to expected traffic behaviours. In contrast, our approach provides a
more meaningful semantic segmentation of both scenes.
Fig. 5. Local events classification and anomaly detection. In (a), the mean and covari-
ance of the location of different classes of regional events are illustrated using ellipses
in different colour.
detected in Clip 30. Moreover, the second fire engine also caused strange driving
behaviour for another car labelled in Clip 28 which strongly conflicted with the
normal traffic. In Clip 9 and 37, two right-turn vehicles were detected in Region 2
and Region 5 respectively showing that they were quite close to each other which
were not observed in the training data. Clip 27 indicates a false alarm mainly due
to the imperfect blob detection which resulted in regional events being classified
into wrong classes. In Clip 38, the irregular atomic events were detected in the
same clip without frame overlapping (Fig. 6 (g) and (h)). This is an example
that when the sizes of objects are large enough to cover two regions, errors could also be introduced, as most of the vehicles in the training data have smaller sizes.
For comparison, we performed irregular concurrence detection without scene
segmentation, i.e. only using globally clustered behaviours. The results are shown
in Fig. 7. Compared with the proposed scheme, the scheme without scene seg-
mentation gave many more false alarms (comparing (a) of Fig. 7 with (c) of
Fig. 5). From the examples of false detections in Fig. 7 (b) and (c), it can be
seen that using global behaviours without scene decomposition cannot accu-
rately represent how objects behave locally. In other words, each of the global
behaviour categories for the vehicles and pedestrians may not truly reflect the
local behaviours of the objects and this would introduce more errors in detect-
ing such abnormal correlations of subtle and short-duration behaviours. On the
other hand, true irregular incidents were missed, e.g. the interruption from the
fire engine was ignored. To summarise, when only using global classification,
contextual constraints on local behaviour are not described accurately enough
and general global correlation is too arbitrary. This demonstrates the advantage
in behaviour correlation based on contextual constraint from semantic scene
segmentation.
6 Conclusion
This paper presented a novel framework for detecting abnormal pedestrian and
vehicle behaviour by modelling cross-correlation among different co-occurring
objects both locally and globally in a given scene. Without tracking objects, the
system was built based on local image events and atomic video events, which
Scene Segmentation for Behaviour Correlation 395
made the system more suitable for crowded scenes. Based on globally classified
atomic video events, a scene was semantically segmented into regions and in
each region, more detailed local events were re-classified. Local and global events
correlations were learned by modelling event concurrence within the same region
and across different regions. The correlation model was then used for detecting
anomaly.
The experiments with public traffic data have shown the effectiveness of the
proposed system on scene segmentation and anomaly detection. Compared with
the scheme which identified irregularities only using atomic video events clas-
sified globally, the proposed system provided more detailed description of local
behaviour, and showed more accurate anomaly detection and fewer false alarms.
Furthermore, the proposed system is entirely unsupervised which ensures its gen-
eralisation ability and flexibility on processing video data with different scene
content and complexity.
Robust Visual Tracking Based on an Effective
Appearance Model
Abstract. Most existing appearance models for visual tracking usually construct
a pixel-based representation of object appearance so that they are incapable of
fully capturing both global and local spatial layout information of object ap-
pearance. In order to address this problem, we propose a novel spatial Log-
Euclidean appearance model (referred to as SLAM) under the recently introduced
Log-Euclidean Riemannian metric [23]. SLAM is capable of capturing both the
global and local spatial layout information of object appearance by constructing
a block-based Log-Euclidean eigenspace representation. Specifically, the process
of learning the proposed SLAM consists of five steps—appearance block division,
online Log-Euclidean eigenspace learning, local spatial weighting, global spatial
weighting, and likelihood evaluation. Furthermore, a novel online Log-Euclidean
Riemannian subspace learning algorithm (IRSL) [14] is applied to incrementally
update the proposed SLAM. Tracking is then carried out within a Bayesian state inference
framework in which a particle filter is used for propagating sample distributions
over time. Theoretical analysis and experimental evaluations demonstrate the
promise and effectiveness of the proposed SLAM.
1 Introduction
Black et al.[4] employ a mixture model to represent and recover the appearance
changes in consecutive frames. Jepson et al.[5] develop a more elaborate mixture model
with an online EM algorithm to explicitly model appearance changes during tracking.
Zhou et al.[6] embed appearance-adaptive models into a particle filter to achieve robust
visual tracking. Wang et al.[20] present an adaptive appearance model based on the
Gaussian mixture model (GMM) in a joint spatial-color space (referred to as SMOG).
SMOG captures rich spatial layout and color information. Yilmaz [15] proposes an
object tracking algorithm based on the asymmetric kernel mean shift that adaptively
varies the scale and orientation of the kernel. Nguyen et al.[17] propose a kernel-
based tracking approach based on maximum likelihood estimation.
Lee and Kriegman [7] present an online learning algorithm to incrementally learn
a generic appearance model for video-based recognition and tracking. Lim et al.[8]
present a human tracking framework using robust system dynamics identification and
nonlinear dimension reduction techniques. Black et al.[2] present a subspace learning
based tracking algorithm with the subspace constancy assumption. A pre-trained, view-
based eigenbasis representation is used for modeling appearance variations. However,
the algorithm does not work well in the scene clutter with a large lighting change due
to the subspace constancy assumption. Ho et al.[9] present a visual tracking algorithm
based on linear subspace learning. Li et al.[10] propose an incremental PCA algorithm
for subspace learning. In [11], a weighted incremental PCA algorithm for subspace
learning is presented. Lim et al.[12] propose a generalized tracking framework based
on the incremental image-as-vector subspace learning methods with a sample mean
update. Chen and Yang [19] present a robust spatial bias appearance model learned
dynamically in video. The model fully exploits local region confidences for robustly
tracking objects against partial occlusions and complex backgrounds. In [13], Li et al.
present a visual tracking framework based on online tensor decomposition.
However, the aforementioned appearance-based tracking methods share a problem
that their appearance models lack a competent object description criterion that captures
both statistical and spatial properties of object appearance. As a result, they are usually
sensitive to the variations in illumination, view, and pose. In order to tackle this prob-
lem, Tuzel et al. [24] and Porikli et al.[21] propose a covariance matrix descriptor for
characterizing the appearance of an object. The covariance matrix descriptor, based on
several covariance matrices of image features, is capable of fully capturing the infor-
mation of the variances and the spatial correlations of the extracted features inside an
object region. In particular, the covariance matrix descriptor is robust to the variations
in illumination, view, and pose. Since a nonsingular covariance matrix is a symmetric
positive definite (SPD) matrix lying on a connected Riemannian manifold, statistics for
covariance matrices of image features may be computed through Riemannian geome-
try. Nevertheless, most existing algorithms for statistics on a Riemannian manifold rely
heavily on the affine-invariant Riemannian metric, under which the Riemannian mean
has no closed form. Recently, Arsigny et al.[23] propose a novel Log-Euclidean Rie-
mannian metric for statistics on SPD matrices. Under this metric, distances and Riemannian
means take a much simpler form than under the widely used affine-invariant Riemannian
metric.
Euclidean Riemannian eigenspace observation model follows the updating online. The
architecture of the framework is shown in Fig. 1.
The process of learning the SLAM consists of five steps—appearance block division, on-
line Log-Euclidean eigenspace learning, local spatial weighting, global spatial weight-
ing, and likelihood evaluation. The details of these five steps are given as follows.
Euclidean covariance subtensors {(LAij )m∗ ×n∗ }t=1,2,...,N forms a Log-Euclidean co-
variance tensor LA associated with the object appearance tensor F ∈ Rm×n×N . With
the emergence of new object appearance subtensors, F is extended along the time axis t
(i.e., N increases gradually), leading to the extension of each Log-Euclidean covariance
subtensor LAij along the time axis t. Consequently, we need to track the changes of
LAij , and need to identify the dominant projection subspace for a compact representa-
tion of LAij as new data arrive.
Due to the vector space structure of log(C^t_{ij}) under the Log-Euclidean Riemannian
metric, log(C^t_{ij}) is unfolded into a d^2-dimensional vector vec^t_{ij}, which is formulated as:

\mathrm{vec}^t_{ij} = \mathrm{UT}(\log(C^t_{ij})),   (2)
where UT(·) is an operator unfolding a matrix into a column vector. The unfolding
process can be illustrated by Figs. 2(e) and 3(a). In Fig. 3(a), the left part displays the
covariance tensor Aij ∈ Rd×d×N , the middle part corresponds to the Log-Euclidean
covariance tensor LAij , and the right part is associated with the Log-Euclidean unfold-
ing matrix LAij with the t-th column being vectij for 1 ≤ t ≤ N . As a result, LAij is
formulated as:
LA_{ij} = \left[\, \mathrm{vec}^1_{ij}\ \ \mathrm{vec}^2_{ij}\ \cdots\ \mathrm{vec}^t_{ij}\ \cdots\ \mathrm{vec}^N_{ij} \,\right].   (3)
The next step of the SLAM is to learn an online Log-Euclidean eigenspace model
for LAij . Specifically, we will introduce an incremental Log-Euclidean Riemannian
subspace learning algorithm (IRSL) [14] for the Log-Euclidean unfolding matrix LAij .
IRSL applies the online learning technique (R-SVD [12,27]) to find the dominant pro-
jection subspaces of LAij . Furthermore, a new operator CVD(·) used in IRSL is de-
fined as follows. Given a matrix H = {K1 , K2 , . . . , Kg } and its column mean K,
we let CVD(H) denote the SVD (i.e., singular value decomposition) of the matrix
{K1 − K, K2 − K, . . . , Kg − K}.
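As a concrete illustration of the CVD(·) operator just defined, the following NumPy sketch computes the SVD of the column-mean-centred matrix; the function name and the random test data are our own and are not part of the paper.

```python
import numpy as np

def cvd(H):
    """CVD(H): SVD of the column-mean-centred matrix.

    H is a (d, g) matrix whose columns are the vectors K_1, ..., K_g.
    Returns the column mean K_bar and the factors U, D, Vt of the SVD of
    [K_1 - K_bar, K_2 - K_bar, ..., K_g - K_bar].
    """
    K_bar = H.mean(axis=1, keepdims=True)                     # column mean
    U, D, Vt = np.linalg.svd(H - K_bar, full_matrices=False)  # mean-centred SVD
    return K_bar, U, D, Vt

# Tiny usage example with random data.
H = np.random.rand(16, 5)
K_bar, U, D, Vt = cvd(H)
```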
Fig. 3. Illustration of Log-Euclidean unfolding and IRSL. (a) shows the generative process of the
Log-Euclidean unfolding matrix; (b) displays the incremental learning process of IRSL.
For a better understanding of IRSL, Fig. 3(b) is used to illustrate the incremental
learning process of IRSL. Please see the details of IRSL in [14].
The distance between a candidate sample Bi,j and the learned block-(i, j) Log-
Euclidean eigenspace model (i.e. LAij ’s column mean L̄ij and CVD(LAij ) =
U_{ij} D_{ij} V^T_{ij}) is determined by the reconstruction error norm:

RE_{ij} = \left\| (\mathrm{vec}_{ij} - \bar{L}_{ij}) - U_{ij} U^T_{ij} (\mathrm{vec}_{ij} - \bar{L}_{ij}) \right\|,   (4)

where \|\cdot\| is the Frobenius norm and \mathrm{vec}_{ij} = \mathrm{UT}(\log(B_{i,j})) is obtained from Eq. (2).
Thus, the block-(i, j) likelihood p_{ij} is computed as p_{ij} \propto \exp(-RE_{ij}). The smaller
RE_{ij} is, the larger the likelihood p_{ij}. As a result, we can obtain a likelihood map
M = (p_{ij})_{m^* \times n^*} \in R^{m^* \times n^*} for all the blocks.
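To make the block-likelihood computation concrete, here is a rough NumPy/SciPy sketch that assumes the usual subspace reconstruction error (the part of the centred candidate vector not explained by the learned eigenbasis); the function and variable names are ours, and scipy's matrix logarithm stands in for the paper's log(·).

```python
import numpy as np
from scipy.linalg import logm

def block_likelihood(B, L_mean, U):
    """Likelihood of one appearance block under its Log-Euclidean eigenspace model.

    B      : (d, d) SPD covariance matrix of the candidate block.
    L_mean : (d*d,) column mean of the learned unfolding matrix LA_ij.
    U      : (d*d, r) dominant eigenbasis learned by IRSL.
    """
    vec = logm(B).real.reshape(-1)               # unfold log(B) into a d^2-vector
    centred = vec - L_mean
    residual = centred - U @ (U.T @ centred)     # component outside the learned subspace
    re = np.linalg.norm(residual)                # reconstruction error norm RE_ij
    return np.exp(-re)                           # p_ij proportional to exp(-RE_ij)
```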
(3) Local spatial weighting. In this step, the likelihood map M is filtered into a new
map M^l \in R^{m^* \times n^*}. The details of the filtering process are given as follows. Denote the
Fig. 4. Illustration of local spatial weighting for the i-th and j-th block. (a) shows the original
likelihood map while (b) displays the filtered map by local spatial weighting for the i-th and j-th
block.
original map by M = (p_{ij})_{m^* \times n^*} and the filtered map by M^l = (p^l_{ij})_{m^* \times n^*}. After filtering
by local spatial weighting, the entry p^l_{ij} of M^l is formulated as:

p^l_{ij} \propto p_{ij} \cdot \exp\left( \frac{N^+_{ij} - N^-_{ij}}{\sigma_{ij}} \right),   (5)

where N^+_{ij} = \frac{1}{k_{ij}} \sum_{u,v \in \mathcal{N}_{ij}} \mathrm{sgn}\left[ \frac{|p_{uv} - p_{ij}| + (p_{uv} - p_{ij})}{2} \right] and N^-_{ij} = \frac{1}{k_{ij}} \sum_{u,v \in \mathcal{N}_{ij}} \mathrm{sgn}\left[ \frac{|p_{uv} - p_{ij}| - (p_{uv} - p_{ij})}{2} \right],
|·| denotes the absolute value of its argument, sgn[·] is the sign function, σ_{ij}
is a positive scaling factor (σ_{ij} = 8 in the paper), N_{ij} denotes the neighbouring elements
of p_{ij}, and k_{ij} stands for the number of neighbouring elements. In this paper, if all
the 8-neighbour elements of p_{ij} exist, k_{ij} = 8; otherwise, k_{ij} is the number of
valid 8-neighbour elements of p_{ij}. A brief discussion of the theoretical properties of
Eq. (5) is given as follows. The second term of Eq. (5) (i.e., the exp(·) term) is a local spatial
weighting factor. If N^+_{ij} is smaller than N^-_{ij}, the factor will penalize p_{ij}; otherwise it
will encourage p_{ij}. The process of local spatial weighting is illustrated in Fig. 4.
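A minimal NumPy sketch of the local spatial weighting of Eq. (5) might look as follows; the nested loops and boundary handling are our own choices, with σ_ij = 8 as stated above.

```python
import numpy as np

def local_spatial_weighting(M, sigma=8.0):
    """Filter the block likelihood map M as in Eq. (5), returning M^l (unnormalised)."""
    m, n = M.shape
    Ml = np.zeros_like(M)
    for i in range(m):
        for j in range(n):
            # valid 8-neighbourhood of block (i, j)
            nbrs = [M[u, v]
                    for u in range(max(0, i - 1), min(m, i + 2))
                    for v in range(max(0, j - 1), min(n, j + 2))
                    if (u, v) != (i, j)]
            k = len(nbrs)
            diff = np.array(nbrs) - M[i, j]
            n_plus = np.sum(np.sign((np.abs(diff) + diff) / 2.0)) / k   # fraction of larger neighbours
            n_minus = np.sum(np.sign((np.abs(diff) - diff) / 2.0)) / k  # fraction of smaller neighbours
            Ml[i, j] = M[i, j] * np.exp((n_plus - n_minus) / sigma)
    return Ml
```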
(4) Global spatial weighting. In this step, the filtered likelihood map M^l = (p^l_{ij})_{m^* \times n^*}
is further globally weighted by a spatial Gaussian kernel into a new map M^g = (p^g_{ij}) \in R^{m^* \times n^*}. The global spatial weighting process is formulated as:

p^g_{ij} \propto p^l_{ij} \cdot \exp\left( -\|pos_{ij} - pos_o\|^2 / 2\sigma^2_{p_{ij}} \right)
        \propto p_{ij} \cdot \exp\left( -\|pos_{ij} - pos_o\|^2 / 2\sigma^2_{p_{ij}} \right) \cdot \exp\left( \frac{N^+_{ij} - N^-_{ij}}{\sigma_{ij}} \right),   (6)
where posij is the block-(i, j) positional coordinate vector, poso is the positional co-
ordinate vector associated with the center O of the likelihood map Ml , and σpij is a
scaling factor (σpij = 3.9 in the paper). The process of global spatial weighting can
be illustrated by Fig. 5, where the likelihood map Ml (shown in Fig. 5(a)) is spatially
weighted by the Gaussian kernel (shown in Fig. 5(b)).
Fig. 5. Illustration of global spatial weighting. (a) shows the original likelihood map Ml while
(b) exhibits the spatial weighting kernel for Ml .
(5) Likelihood evaluation for SLAM. In this step, the overall likelihood between
a candidate object region and the learned SLAM is computed by multiplying all the
block-specific likelihoods after local and global spatial weighting. Mathematically, the
likelihood is formulated as:
\mathrm{LIKI} \propto \prod_{1 \le i \le m^*} \prod_{1 \le j \le n^*} p^g_{ij}
            \propto \prod_{i} \prod_{j} p_{ij} \cdot \exp\left( -\|pos_{ij} - pos_o\|^2 / 2\sigma^2_{p_{ij}} \right) \cdot \exp\left( \frac{N^+_{ij} - N^-_{ij}}{\sigma_{ij}} \right).   (7)
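Continuing the sketch, the global spatial weighting of Eq. (6) and the overall likelihood of Eq. (7) could be approximated as below; the block-centre convention and the small numerical floor inside the logarithm are our own assumptions, with σ_p = 3.9 as stated above.

```python
import numpy as np

def global_spatial_weighting(Ml, sigma_p=3.9):
    """Weight the filtered map M^l by a Gaussian kernel centred on the map (Eq. (6))."""
    m, n = Ml.shape
    centre = np.array([(m - 1) / 2.0, (n - 1) / 2.0])
    ii, jj = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
    sq_dist = (ii - centre[0]) ** 2 + (jj - centre[1]) ** 2
    return Ml * np.exp(-sq_dist / (2.0 * sigma_p ** 2))

def overall_likelihood(Mg):
    """Multiply all weighted block likelihoods (Eq. (7)); log-sum avoids numerical underflow."""
    return np.exp(np.sum(np.log(Mg + 1e-12)))
```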
For visual tracking, a Markov model with a hidden state variable is used for motion
estimation. In this model, the object motion between two consecutive frames is usually
assumed to be an affine motion. Let Xt denote the state variable describing the affine
motion parameters (the location) of an object at time t. Given a set of observed images
Ot = {O1 , . . . , Ot }, the posterior probability is formulated by Bayes’ theorem as:
p(X_t|O_t) \propto p(O_t|X_t) \int p(X_t|X_{t-1})\, p(X_{t-1}|O_{t-1})\, dX_{t-1}   (8)
where p(Ot | Xt ) denotes the observation model, and p(Xt | Xt−1 ) represents the dy-
namic model. p(Ot |Xt ) and p(Xt |Xt−1 ) decide the entire tracking process. A particle
filter [3] is used for approximating the distribution over the location of the object using
a set of weighted samples.
In the tracking framework, we apply an affine image warping to model the object
motion of two consecutive frames. The six parameters of the affine transform are used
to model p(Xt | Xt−1 ) of a tracked object. Let Xt = (xt , yt , ηt , st , βt , φt ) where
xt , yt , ηt , st , βt , φt denote the x, y translations, the rotation angle, the scale, the aspect
ratio, and the skew direction at time t, respectively. We employ a Gaussian distribution
to model the state transition distribution p(Xt | Xt−1 ). Also the six parameters of the
affine transform are assumed to be independent. Consequently, p(Xt |Xt−1 ) is formu-
lated as:
p(Xt |Xt−1 ) = N (Xt ; Xt−1 , Σ) (9)
where Σ denotes a diagonal covariance matrix whose diagonal elements are σ^2_x, σ^2_y, σ^2_η,
σ^2_s, σ^2_β, σ^2_φ, respectively. The observation model p(O_t | X_t) reflects the similarity between
a candidate sample and the learned SLAM. In this paper, p(O_t|X_t) is formulated
as p(O_t|X_t) ∝ LIKI, where LIKI is defined in Eq. (7). After maximum a posteriori
(MAP) estimation, we use the block-related Log-Euclidean covariance matrices of
features inside the affinely warped image region associated with the highest-weighted
hypothesis to update the block-related Log-Euclidean eigenspace model.
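A minimal particle-filter step consistent with Eqs. (8)–(9) is sketched below; observation_likelihood is a placeholder for the SLAM likelihood LIKI of Eq. (7), and the diagonal standard deviations are illustrative defaults rather than values taken from the paper's text at this point.

```python
import numpy as np

def propagate_and_weight(particles, observation_likelihood,
                         sigmas=(5.0, 5.0, 0.03, 0.03, 0.005, 0.001)):
    """One particle-filter iteration: sample X_t ~ N(X_{t-1}, Sigma), weight by p(O_t|X_t).

    particles : (N, 6) array of affine states (x, y, rotation, scale, aspect, skew).
    observation_likelihood : callable mapping one 6-vector state to a LIKI value.
    """
    noise = np.random.randn(*particles.shape) * np.asarray(sigmas)
    new_particles = particles + noise                        # dynamic model p(X_t | X_{t-1})
    weights = np.array([observation_likelihood(x) for x in new_particles])
    weights /= weights.sum()                                 # normalise the weights
    # MAP-style estimate: the highest-weighted hypothesis, as used for the model update.
    best = new_particles[np.argmax(weights)]
    return new_particles, weights, best
```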
3 Experiments
In order to evaluate the performance of the proposed tracking framework, four videos
are used in the experiments. The first three videos are recorded with moving cameras
while the last video is taken from a stationary camera. The first two videos consist of
8-bit gray scale images while the last two are composed of 24-bit color images. Video
1 consists of dark gray scale images, where a man moves in an outdoor scene with
drastically varying lighting conditions. In Video 2, a man walks from left to right in a
bright road scene; his body pose varies over the time, with a drastic motion and pose
change (bowing down to reach the ground and standing up back again) in the middle of
the video stream. In Video 3, a girl changes her facial pose over the time in a color scene
with varying lighting conditions. Besides, the girl’s face is severely occluded by a man
in the middle of the video stream. In the last video, a pedestrian moves along a corridor
in a color scene. In the middle of the video stream, his body is severely occluded by the
bodies of two other pedestrians.
During the visual tracking, the size of each object region is normalized to 36 × 36
pixels. Then, the normalized object region is uniformly divided into thirty-six 6 × 6
blocks. Further, a block-specific SLAM is learned online and updated by IRSL
every three frames. The maintained dimension r_{ij} of the block-(i, j) Log-Euclidean
eigenspace model (i.e., U_{ij} referred to in Sec. 2.2) learned by IRSL is determined
experimentally. For the particle filtering in the visual tracking, the number of particles is
set to 200. The six diagonal elements (σ^2_x, σ^2_y, σ^2_η, σ^2_s, σ^2_β, σ^2_φ) of the covariance
matrix Σ in Eq. (9) are set to (5^2, 5^2, 0.03^2, 0.03^2, 0.005^2, 0.001^2), respectively.
Three experiments are conducted to demonstrate the claimed contributions of the
proposed SLAM. In these experiments, we compare tracking results of SLAM with
those of a state-of-the-art Riemannian metric based tracking algorithm [21], referred
to here as CTMU, in different scenarios including drastic illumination changes, object pose
variation, and occlusion. CTMU is a representative Riemannian metric based tracking
algorithm which uses the covariance matrix of features for object representation. By using
a model updating mechanism, CTMU adapts to ongoing object deformations
and appearance changes, resulting in robust tracking. In contrast to CTMU,
SLAM constructs a block-based Log-Euclidean eigenspace representation to reflect the
appearance changes of an object. Consequently, it is interesting and desirable to make a
comparison between SLAM and CTMU. Furthermore, CTMU does not need additional
parameter settings since CTMU computes the covariance matrix of image features as
the object model. More details of CTMU are given in [21].
The first experiment is to compare the performances of the two methods SLAM and
CTMU in handling drastic illumination changes using Video 1. In this experiment, the
maintained eigenspace dimension rij in SLAM is set as 8. Some samples of the final
tracking results are demonstrated in Fig. 6, where rows 1 and 2 are for SLAM and
CTMU, respectively, in which five representative frames (140, 150, 158, 174, and 192)
of the video stream are shown. From Fig. 6, we see that SLAM is capable of tracking
the object all the time, even under poor lighting conditions. In comparison, CTMU loses
track of the object from time to time.
The second experiment is for a comparison between SLAM and CTMU in the sce-
narios of drastic pose variation using Video 2. In this experiment, rij in SLAM is set as
6. Some samples of the final tracking results are demonstrated in Fig. 7, where rows 1
and 2 correspond to SLAM and CTMU, respectively, in which five representative frames
(142, 170, 178, 183, and 188) of the video stream are shown. From Fig. 7, it is clear
Fig. 6. The tracking results of SLAM (row 1) and CTMU (row 2) over representative frames with
drastic illumination changes
Fig. 7. The tracking results of SLAM (row 1) and CTMU (row 2) over representative frames with
drastic pose variation
that SLAM is capable of tracking the target successfully even through a drastic pose and
motion change, while CTMU loses the target after this change.
The last experiment compares the tracking performance of SLAM with that of
CTMU in color scenarios with severe occlusions using Videos 3 and 4. The RGB
color space is used in this experiment, and r_{ij} is set to 6 for Video 3 and 8 for Video 4.
We show some samples of the final tracking results for SLAM and CTMU
in Fig. 8. The first and second rows correspond to the performance of SLAM
and CTMU on Video 3, respectively, for five representative frames (158, 160,
162, 168, and 189) of the video stream; the third and fourth rows
correspond to the performance of SLAM and CTMU on Video 4, respectively, for
five representative frames (22, 26, 28, 32, and 35).
Clearly, SLAM succeeds in tracking for both Video 3 and Video 4 while CTMU fails.
In summary, we observe that SLAM outperforms CTMU in the scenarios of illu-
mination changes, pose variations, and occlusions. SLAM constructs a block-based
Log-Euclidean eigenspace representation to capture both the global and local spatial
properties of object appearance. The spatial correlation information of object appear-
ance is incorporated into SLAM. Even if the information of some local blocks is partially
lost or drastically varies, SLAM is capable of recovering it using cues
from other local blocks. In comparison, CTMU only captures the
statistical properties of object appearance in one mode, resulting in the loss of the local
Fig. 8. The tracking results of SLAM and CTMU over representative frames in the color scenarios
of severe occlusions. Rows 1 and 2 show the tracking results of SLAM and CTMU for Video
3, respectively. Rows 3 and 4 display the tracking results of SLAM and CTMU for Video 4,
respectively.
spatial correlation information inside the object region. In particular, SLAM constructs a
robust Log-Euclidean Riemannian eigenspace representation of each object appearance
block. The representation fully explores the distribution information of covariance ma-
trices of image features under the Log-Euclidean Riemannian metric, whereas CTMU
relies heavily on an intrinsic mean in the Lie group structure without considering the
distribution information of the covariance matrices of image features. Consequently,
SLAM is an effective appearance model which performs well in modeling appearance
changes of an object in many complex scenarios.
4 Conclusion
In this paper, we have developed a visual tracking framework based on the proposed
spatial Log-Euclidean appearance model (SLAM). In this framework, a block-based
Log-Euclidean eigenspace representation is constructed by SLAM to reflect the appear-
ance changes of an object. Then, the local and global spatial weighting operations on
the block-based likelihood map are performed by SLAM to capture the local and global
spatial layout information of object appearance. Moreover, a novel criterion for the
likelihood evaluation, based on the Log-Euclidean Riemannian subspace reconstruc-
tion error norms, has been proposed to measure the similarity between the test image
and the learned subspace model during the tracking. SLAM is incrementally updated by
the proposed online Log-Euclidean Riemannian subspace learning algorithm (IRSL).
Experimental results have demonstrated the robustness and promise of the proposed
framework.
Acknowledgment
This work is partly supported by NSFC (Grant No. 60520120099, 60672040 and
60705003) and the National 863 High-Tech R&D Program of China (Grant No.
2006AA01Z453). Z.Z. is supported in part by NSF (IIS-0535162). Any opinions, find-
ings, and conclusions or recommendations expressed in this material are those of the
authors and do not necessarily reflect the views of the NSF.
References
1. Hager, G., Belhumeur, P.: Real-time tracking of image regions with changes in geometry and
illumination. In: Proc. CVPR, pp. 410–430 (1996)
2. Black, M.J., Jepson, A.D.: Eigentracking: Robust matching and tracking of articulated ob-
jects using view-based representation. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996.
LNCS, vol. 1064, pp. 329–342. Springer, Heidelberg (1996)
3. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional density. In:
Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1065, pp. 343–356. Springer, Hei-
delberg (1996)
4. Black, M.J., Fleet, D.J., Yacoob, Y.: A framework for modeling appearance change in image
sequence. In: Proc. ICCV, pp. 660–667 (1998)
5. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust Online Appearance Models for Visual
Tracking. In: Proc. CVPR, vol. 1, pp. 415–422 (2001)
6. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual Tracking and Recognition Using
Appearance-Adaptive Models in Particle Filters. IEEE Trans. on Image Processing 13, 1491–
1506 (2004)
7. Lee, K., Kriegman, D.: Online Learning of Probabilistic Appearance Manifolds for Video-
based Recognition and Tracking. In: Proc. CVPR, vol. 1, pp. 852–859 (2005)
8. Lim, H., Morariu, V.I., Camps, O.I., Sznaier, M.: Dynamic Appearance Modeling for Hu-
man Tracking. In: Proc. CVPR, vol. 1, pp. 751–757 (2006)
9. Ho, J., Lee, K., Yang, M., Kriegman, D.: Visual Tracking Using Learned Linear Subspaces.
In: Proc. CVPR, vol. 1, pp. 782–789 (2004)
10. Li, Y., Xu, L., Morphett, J., Jacobs, R.: On Incremental and Robust Subspace Learning.
Pattern Recognition 37(7), 1509–1518 (2004)
11. Skocaj, D., Leonardis, A.: Weighted and Robust Incremental Method for Subspace Learning.
In: Proc. ICCV, pp. 1494–1501 (2003)
12. Lim, J., Ross, D., Lin, R., Yang, M.: Incremental Learning for Visual Tracking. In: NIPS,
pp. 793–800. MIT Press, Cambridge (2005)
13. Li, X., Hu, W., Zhang, Z., Zhang, X., Luo, G.: Robust Visual Tracking Based on Incremental
Tensor Subspace Learning. In: Proc. ICCV (2007)
14. Li, X., Hu, W., Zhang, Z., Zhang, X., Luo, G.: Visual Tracking Via Incremental Log-
Euclidean Riemannian Subspace Learning. In: Proc. CVPR (2008)
15. Yilmaz, A.: Object Tracking by Asymmetric Kernel Mean Shift with Automatic Scale and
Orientation Selection. In: Proc. CVPR (2007)
16. Silveira, G., Malis, E.: Real-time Visual Tracking under Arbitrary Illumination Changes. In:
Proc. CVPR (2007)
17. Nguyen, Q.A., Robles-Kelly, A., Shen, C.: Kernel-based Tracking from a Probabilistic View-
point. In: Proc. CVPR (2007)
18. Zhao, Q., Brennan, S., Tao, H.: Differential EMD Tracking. In: Proc. ICCV (2007)
19. Chen, D., Yang, J.: Robust Object Tracking Via Online Dynamic Spatial Bias Appearance
Models. IEEE Trans. on PAMI 29(12), 2157–2169 (2007)
20. Wang, H., Suter, D., Schindler, K., Shen, C.: Adaptive Object Tracking Based on an Effective
Appearance Filter. IEEE Trans. on PAMI 29(9), 1661–1667 (2007)
21. Porikli, F., Tuzel, O., Meer, P.: Covariance Tracking using Model Update Based on Lie Al-
gebra. In: Proc. CVPR, vol. 1, pp. 728–735 (2006)
22. Tuzel, O., Porikli, F., Meer, P.: Human Detection via Classification on Riemannian Mani-
folds. In: Proc. CVPR (2007)
23. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric Means in a Novel Vector Space
Structure on Symmetric Positive-Definite Matrices. SIAM Journal on Matrix Analysis and
Applications (2006)
24. Tuzel, O., Porikli, F., Meer, P.: Region Covariance: A Fast Descriptor for Detection and
Classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952,
pp. 589–600. Springer, Heidelberg (2006)
25. Pennec, X., Fillard, P., Ayache, N.: A Riemannian Framework for Tensor Computing. In:
IJCV, pp. 41–66 (2006)
26. Rossmann, W.: Lie Groups: An Introduction Through Linear Groups. Oxford University Press (2002)
27. Levy, A., Lindenbaum, M.: Sequential Karhunen-Loeve Basis Extraction and Its Application
to Images. IEEE Trans. on Image Processing 9, 1371–1374 (2000)
Key Object Driven Multi-category Object Recognition,
Localization and Tracking Using Spatio-temporal
Context
1 Introduction
(Fig. 1: illustration of the spatial and temporal relationships among the hypothesized objects, namely human (head-shoulder), table, whiteboard, computer, projector and paper, and the resulting recognition output.)
These concepts are modeled by a dynamic Markov random field (MRF). Figure 1
gives an illustration. Instead of letting each node represent a pixel or image blob in a
pre-defined grid, as is commonly done in segmentation, in our model a node represents
a hypothetical object in one frame, which enables integration of object-level informa-
tion during inference. Spatial and temporal relationships are modeled by intra-frame
and inter-frame edges respectively. Since objects are recognized on-the-fly and change
with time, the structure of the MRF is also dynamic. To avoid building an MRF with
excessive false hypothetical object nodes, key objects are detected first and provide con-
textual guidance for finding other objects. Inference over the resulting MRF gives an
estimate of the states of all objects through the sequence. We apply our approach to
meeting room scenes with humans as the key objects.
The rest of the paper is organized as follows: Section 2 summarizes related work by
categories; Section 3 gives the formulation of the model; Section 4 defines the potential
functions of the MRF and Section 5 describes the inference algorithm; Section 6 shows
the experimental results; Section 7 discusses future work and concludes the paper.
2 Related Work
Our approach uses elements from both object recognition and detection. Object recog-
nition focuses on categorization of objects [1][2]; many approaches assume a close-up
view of a single object in the input image. Object detection focuses on single category
object classification and localization from the background [3][4][5]. Both have received
intense research interest recently, bringing forward a large body of literature. While our
approach assimilates several established ideas from the two, our emphasis is on inte-
gration of spatio-temporal context. We hereby focus on the recent growing effort in
tackling object-related problems based on contextual relationships.
Object in the scene. Modeling object-scene relationship enables the use of prior knowl-
edge regarding object category, position, scale and appearance. [6] learns a scene-
specific prior distribution of the reference position of each object class to improve
classification accuracy of image features. It assumes that a single reference position
explains all observed features. [7] proposes a framework for placing local object detec-
tion in the 3D scene geometry of an image. Some other work seeks to classify the scene
and objects at the same time [8][9]. [8] uses the recognized scene to provide a strong prior
on object position and scale. Inter-object relationships are not considered. [9] proposes an
approach to recognize events and label semantic regions in images, but the focus is not
on localizing individual objects.
Object and human action. There have been several attempts in collaborative recog-
nition of object category and human action [13][14][15]. [13] uses the hand motion to
improve the shape-based object classification from the top-down view of a desktop. In
[14], objects such as chairs and keyboards are recognized from surveillance video of an office
scene, where Bayesian classification of regions is based entirely on human pose
and action signatures. Given an estimated human upper-body pose, [15] accomplishes human
action segmentation and object recognition at the same time. All these approaches
require the ability to track human poses or recognize actions, which is not a trivial
task, but they reflect the fact that many visual tasks are human centered; namely,
the objects of most interest for recognition are those interacting closely with humans.
This is also our motivation for choosing humans as the key objects in our framework.
Fig. 2. The MRF defined in our problem (left) and an ideal graph structure for one input frame
(right). Section 5 explains how to build such a graph.
where x = {xv |v ∈ V} and y = {yv |v ∈ V}, ψv,u (xv , xu ) models the spatio-temporal
relationship, and ψv (xv , yv ) models the image observation likelihood. Given the struc-
ture of the MRF and the potential functions, the states of the objects can be inferred.
Note that rather than letting each node correspond to an image blob in a pre-defined
image grid or a pixel, as is commonly done in segmentation literature [12][10], we let
each node represent an object, which is similar to some tracking frameworks such as
the Markov chain in Particle Filtering and MRF in collaborative tracking proposed by
[16]. The reason is twofold: 1) object-based graph enables us to use object-level infor-
mation during inference, while pixel- or grid-based graph can only model inter-object
relationships locally along the boundary of objects; 2) object-based graph has fewer
nodes and therefore the complexity of inference is much lower. However, one draw-
back of object-based graph is that accurate segmentation cannot be directly obtained.
One new property of the object-based graph is that its structure is dynamic. In Section
5 we show that the nodes for new objects can be added to the graph online driven by
detected key objects. Before that we first give our models for the potential functions.
4 Potential Functions
There are three types of edges in our model, each associated with one kind of potential
function representing a specific semantic meaning.
¹ Logarithm is used because scale is multiplicative.
where R(x_v) stands for the set of segments associated with the object v, and ζ(r_i, x_v)
is a position weight for r_i, which allows the use of an object shape prior. In our implementation,
R(x_v) includes all segments that have at least 50% of their area within v's
bounding box, and ζ(r_i, x_v) is defined as a Gaussian centered at p_v. Figure 3 shows an
example for the category paper. We can see that it is hard to distinguish the paper from
a bright computer screen or the whiteboard by appearance alone (feature points and regions).
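As a rough sketch of how R(x_v) and the position weight ζ(r_i, x_v) might be computed, the code below assumes each segment is given as a pixel mask and that the object state provides a bounding box and a centre p_v; the 50% overlap rule follows the text, while the Gaussian bandwidth is a placeholder of our own.

```python
import numpy as np

def segments_for_object(masks, bbox, p_v, sigma=30.0, min_overlap=0.5):
    """Return (index, weight) pairs for segments in R(x_v) with Gaussian position weights.

    masks : list of boolean (H, W) arrays, one per segment r_i.
    bbox  : (x0, y0, x1, y1) bounding box of object v.
    p_v   : (x, y) centre of object v.
    """
    x0, y0, x1, y1 = bbox
    selected = []
    for idx, m in enumerate(masks):
        ys, xs = np.nonzero(m)
        if len(xs) == 0:
            continue
        inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
        if inside.mean() < min_overlap:                 # keep segments with >= 50% area inside the box
            continue
        centroid = np.array([xs.mean(), ys.mean()])
        dist2 = np.sum((centroid - np.asarray(p_v)) ** 2)
        weight = np.exp(-dist2 / (2.0 * sigma ** 2))    # zeta(r_i, x_v): Gaussian centred at p_v
        selected.append((idx, weight))
    return selected
```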
Fig. 3. An example of finding paper based on appearance. (a) Input image; (b) SIFT features
(green: feature with positive weight in the classifier, red: feature with negative weight); (c) Seg-
mentation; (d) observation likelihood p(paper |ri ) for each region ri (yellow: high likelihood).
Note that the observation potential here can be substituted by any object recognition
method, possibly with a more complicated model and higher accuracy such as [2]. Here
we do not elaborate on this since our emphasis is on the effect of introducing contextual
relationship.
where {(x_v^{(j)}, x_u^{(j)})} is the set of all pairs of objects that co-exist in a training sample
and satisfy c_v = c_1 and c_u = c_2.
Position:  \tilde{p}_{v_t} = p_{v_{t-1}} + \frac{1}{m} \sum_{i=1}^{m} \left( q_t^{(i)} - q_{t-1}^{(i)} \right),   (6)

Scale:  \tilde{s}_{v_t} = s_{v_{t-1}} + \log\left[ \frac{\sum_{i=1}^{m} \mathrm{Dist}(q_t^{(i)}, \tilde{p}_{v_t})}{\sum_{i=1}^{m} \mathrm{Dist}(q_{t-1}^{(i)}, p_{v_{t-1}})} \right],   (7)
where Dist(·) is the distance between two points. The temporal potential is defined as
a Gaussian distribution centered at the estimated position and scale with fixed variance:
5.1 Inference
We choose BP as the inference method because of two main reasons. First, our graph has
cycles and the structure is not fixed (due to addition and removal of object nodes, also
inference is not done over the whole sequence but over a sliding window). Therefore it is
inconvenient to use methods that require rebuilding the graph (such as the junction tree
algorithm). While loopy BP is not guaranteed to converge to the true marginal, it has
demonstrated excellent empirical performance. Second, BP is based on local message passing
and update, which is efficient and more importantly, gives us an explicit representation
of the interrelationship between nodes (especially useful for the augmenting nodes).
At each iteration of BP, the message passing and update process is as follows. Define
the neighborhood of a node u ∈ V as Γ(u) = {v | (u, v) ∈ E}; each node u sends a
message to its neighbor v ∈ Γ(u):

m_{u,v}(x_v) = \alpha \int_{x_u} \psi_{u,v}(x_u, x_v)\, \psi_u(x_u, y_u) \prod_{w \in \Gamma(u) \setminus v} m_{w,u}(x_u)\, dx_u.   (9)
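For intuition, a discrete analogue of the message update in Eq. (9) is sketched below; the paper actually uses nonparametric BP over weighted samples, so the discretisation and the function signature here are purely illustrative.

```python
import numpy as np

def bp_message(psi_uv, phi_u, incoming):
    """Discrete BP message m_{u,v}(x_v) = alpha * sum_{x_u} psi(x_u, x_v) phi_u(x_u) prod_w m_{w,u}(x_u).

    psi_uv   : (K, K) pairwise potential over (x_u, x_v).
    phi_u    : (K,) local observation potential psi_u(x_u, y_u).
    incoming : list of (K,) messages m_{w,u} from u's neighbours w != v.
    """
    belief_u = phi_u * np.prod(incoming, axis=0) if incoming else phi_u
    msg = psi_uv.T @ belief_u          # marginalise over x_u
    return msg / msg.sum()             # alpha: normalise the message
```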
Augmenting nodes find new objects by receiving “hints” (messages) from key object
nodes. It is reasonable because we are more interested in finding objects that are closely
related to key objects; by combining inter-category spatial relationships with detection
techniques specially developed for key objects, other objects can be detected and recog-
nized more robustly and efficiently.
Let the set of key objects in one frame be K, and consider finding new objects of
category c ≠ c*. The ideal way is: for every subset K′ of K, make the hypothesis
that there is a new object a which is in context with K′. Based on the NBP paradigm,
we estimate a's state by p(x_a|y) ∝ ψ_a(x_a, y_a) \prod_{v ∈ K′} m_{v,a}(x_a). The number of such
hypotheses is exponential in |K|, so we simplify it by letting K′ contain only one key
object (this is reasonable because if a new object is hinted at by a set of key objects, it is at
least hinted at by one of them to some extent). In this case K′ = {v}, and the distribution of a's state is
estimated as p(x_a|y) ∝ ψ_a(x_a, y_a) m_{v,a}(x_a). This is done for each v in K, each resulting
in a weighted sample set of a hypothetical new object's state.
Further, if two hypotheses of the same category are close in position and scale,
they should be the same new object. So for each category, Agglomerative Clustering
is done on the union of the |K| sample sets to avoid creating duplicated nodes. For each
Fig. 4. Use of augmenting nodes to update graph structure. Augmenting nodes for each category
are shown as one (dotted circle). For weighted samples, red indicates the highest possible weight,
while blue indicates the lowest.
Table 1. The overall algorithm (sliding window of L frames over the graph G; for each new frame t):
– Output the estimated state x̂_v for each object node v of frame (t − L). Remove sub-graph
  (V_{t−L}, E_{t−L}) from G and move the sliding window one frame forward.
– Add a new sub-graph (V_t, E_t) for frame t to G by the algorithm in Table 2.
– Inference: perform the nonparametric BP algorithm over G. For each object node v a weighted sample
  set is obtained: {x_v^{(i)}, ω_v^{(i)}}_{i=1}^{M} ∼ p(x_v|y).
– Evaluate the confidence of each object v by W = Σ_{j=t−L+1}^{t} Σ_{i=1}^{M} ω_{v_j}^{(i)}. If W < γ, remove node v_j
  from frame j for each j = (t − L + 1) . . . t, where γ is an empirical threshold.

Table 2. Building the sub-graph (V_t, E_t) for a new frame t:
– For each object node v_{t−1} ∈ V_{t−1}, let V_t ← V_t ∪ {v_t}, E ← E ∪ {(v_{t−1}, v_t)}. Pass a message
  forward along edge (v_{t−1}, v_t) to get an approximation of p(x_{v_t}|y) ∝ ψ_{v_t}(x_{v_t}, y_{v_t}) m_{v_{t−1},v_t}(x_{v_t}).
– Detect key objects by applying p(c*|x) to all possible states x in the image. Cluster the responses with
  confidence higher than τ_{c*}. For each cluster non-overlapping with any existing node, create a new node
  v_t. Let the initial estimated state x̂_{v_t} be the cluster mean. Denote the set of all key object nodes by K.
– For each category c ≠ c*:
  - Create an augmenting node a for each key object node v ∈ K and add an edge (v, a) between them.
  - For each such augmenting node and key object node pair {a, v}, sample {x_a^{(i)}, ω_a^{(i)}}_{i=1}^{M} ∼
    p(x_a|y) ∝ ψ_a(x_a, y_a) m_{v,a}(x_a).
  - Define the union of samples S = ∪_a {x_a^{(i)}, ω_a^{(i)}}_{i=1}^{M}; let S′ be the subset of S containing the samples
    whose weight is higher than τ_c.
  - Do clustering on S′; for each cluster non-overlapping with any existing node, create an object node
    u_t of category c. Let the initial estimated state x̂_{u_t} be the cluster mean.
  - V_t ← V_t ∪ {u_t}. E_t ← E_t ∪ {(u_t, v_t) | v_t ∈ V_t, ψ_{u_t,v_t}(x̂_{u_t}, x̂_{v_t}) > λ}.
  - Remove augmenting nodes and the corresponding edges.
high-weight cluster, a new object node is created. Figure 4 illustrates how to use aug-
menting nodes to update the graph.
More details of our overall algorithm and of the algorithm for building the sub-graph for
each new frame are given in Table 1 and Table 2.
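The confidence test in Table 1 (accumulating an object's particle weights over the sliding window and removing weak objects) can be sketched as follows; the data layout and threshold value are illustrative only.

```python
import numpy as np

def prune_weak_objects(window_weights, gamma):
    """Keep objects whose accumulated weight over the sliding window exceeds gamma.

    window_weights : dict mapping object id -> list of per-frame weight arrays
                     {omega_{v_j}^{(i)}} for frames j = t-L+1 .. t.
    Returns the set of object ids to keep.
    """
    keep = set()
    for obj_id, per_frame in window_weights.items():
        W = sum(np.sum(w) for w in per_frame)   # W = sum_j sum_i omega_{v_j}^{(i)}
        if W >= gamma:
            keep.add(obj_id)
    return keep
```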
6 Experiments
Experiments are done on the CHIL meeting video corpus [23]. Eight categories of ob-
jects are of interest: human, table, chair, computer, projector, paper, cup and whiteboard
(or projection screen).
For testing we use 16 videos captured from three sites (IBM, AIT and UPC) and three
camera views for each site. Each sequence has about 400 frames. One frame out of every
60 is fully annotated for evaluation. For training the parameters of spatial potential func-
tion, we selected 200 images from two views of the IBM and UPC site (no intersection
between training images and test videos), and manually annotated the object size and
position. Observation models for objects are trained with object instances from images
of various meeting room and office scenes including a training part from CHIL.
We design our experiments to compare three methods with different levels of context:
1) no context, i.e. object observation model is directly applied to each frame; 2) spatial
context only, i.e. an MRF without the temporal edges is applied in a frame-by-frame
manner; 3) spatio-temporal context, i.e. the full model with both spatial and temporal
edges is applied to the sequence.
Quantitative analysis is performed with metrics focusing on three different aspects: ob-
ject detection and tracking, image segment categorization and pixel-level segmentation
accuracy.
Object-level detection and tracking. The overall object detection rate and false alarm
rate are shown in Figure 6 (left). Two methods are compared: the frame-based method with
only spatial context, and the spatio-temporal method using the complete model. For the
spatial-only method, an ROC curve is obtained by changing the threshold τc for creating
new object nodes. The result shows that integrating temporal information helps improve
detection rate and reduce false alarms, which is the effect of temporal smoothing and
evidence accumulation. In object-level evaluation we do not include the non-contextual
method, because the object observation model is based on classifying image segments,
and we find that applying exhaustive search using such a model does not give a mean-
ingful result. Some visual results of these methods can be found in Figure 5(a)(c)(d)
respectively.
Fig. 5. Comparison among observation with no context, inference using spatial relationship only
and inference using spatio-temporal relationship
Fig. 6. Object detection rate and false alarm rate (left); pixel-level segmentation precision and
recall (right)
For the spatio-temporal method, we further evaluate its performance by the number
of objects that are consistently tracked through the sequence, as shown in Table 3. All
the numbers stand for trajectories, where mostly tracked means that at least 80% of the
trajectory is tracked, and partially tracked means that at least 50% is tracked. When a
trajectory is broken into two, a fragment is counted. We can see that small objects such
as cups and computers are harder to detect and track. Paper has a high false alarm rate,
probably due to lack of distinct interior features (Figure 8(h) shows segments of hu-
man clothes detected as paper). Most fragments belong to human trajectories, because
humans exhibit much more motion than other objects.
(a) Without context (b) Spatial-only (c) Spatio-temporal
Fig. 7. Confusion matrix of image region categorization by different methods. The value at (i, j)
stands for the proportion of segments of category i classified as category j.
accuracy is not high, since this is only a simple post-processing step. But from the sample re-
sults of the spatio-temporal method in Figure 8 we can see that most detected objects
are reasonably segmented when the object position and scale are correctly inferred.
7 Conclusion
In this paper we address the problem of recognizing, localizing and tracking multiple
categories of objects in a certain type of scenes. Specifically, we consider eight cate-
gories of common objects in meeting room videos. Given the difficulty of approaching
this problem by purely appearance-based methods, we propose the integration of spatio-
temporal context through a dynamic MRF, in which each node represents an object and
the edges represent inter-object relationships. New object hypotheses are proposed on-
line by adding augmenting nodes, which receive belief messages from the detected key
objects of the scene (humans in our case). Experimental results show that the perfor-
mance is greatly enhanced by incorporating contextual information.
There are many open problems and promising directions regarding the topic of ob-
ject analysis in video. First, a stronger object observation model is needed, and our
current training and testing sets are very limited. Second, we made no assumption of a
fixed camera, but it can be a strong cue for inference, e.g. the position and scale of the
stationary objects (such as tables) can be inferred from the activity area of the moving
objects (such as humans). Third, 3D geometry of the scene or depth information should
be useful for modeling occlusions. Last but not least, object recognition and tracking
can be combined with action recognition [14][15] so as to better understand the seman-
tics of human activities.
References
1. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-
invariant learning. In: CVPR (2003)
2. Cao, L., Fei-Fei, L.: Spatially coherent latent topic model for concurrent object segmentation
and classification. In: ICCV (2007)
3. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
CVPR (2001)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
5. Wu, B., Nevatia, R.: Cluster boosted tree classifier for multi-view, multi-pose object detec-
tion. In: ICCV (2007)
6. Sudderth, E.B., Torralba, A., Freeman, W.T., Willsky, A.S.: Learning hierarchical models of
scenes, objects, and parts. In: ICCV (2005)
7. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. In: CVPR (2006)
8. Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place
and object recognition. In: ICCV (2003)
9. Li, L.-J., Fei-Fei, L.: What, where and who? classifying events by scene and object recogni-
tion. In: ICCV (2007)
10. Shotton, J., Winn, J., Rother, C., Criminisi, A.: Textonboost: Joint appearance, shape and
context modeling for multi-class object recognition and segmentation. In: ECCV (2006)
11. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context.
In: ICCV (2007)
12. Carbonetto, P., de Freitas, N., Barnard, K.: A statistical model for general contextual object
recognition. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 350–362.
Springer, Heidelberg (2004)
13. Moore, D.J., Essa, I.A., Hayes, M.H.: Exploiting human actions and object context for recog-
nition tasks. In: ICCV (1999)
14. Peursum, P., West, G., Venkatesh, S.: Combining image regions and human activity for indi-
rect object recognition in indoor wide-angle views. In: ICCV (2005)
15. Gupta, A., Davis, L.S.: Objects in action: an approach for combining action understanding
and object perception. In: CVPR (2007)
16. Yu, T., Wu, Y.: Collaborative tracking of multiple targets. In: CVPR (2004)
17. Wu, B., Nevatia, R.: Tracking of multiple humans in meetings. In: V4HCI (2006)
18. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of
boosting. Annals of Statistics 28(2), 337–407 (2000)
19. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE
Transaction on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002)
20. Sutton, C., McCallum, A.: Piecewise training for undirected models. In: Conference on Un-
certainty in Artificial Intelligence (2005)
21. Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufman, San Mateo
(1988)
22. Sudderth, E.B., Ihler, A.T., Freeman, W.T., Willsky, A.S.: Nonparametric belief propagation.
In: CVPR (2003)
23. CHIL: The CHIL project, http://chil.server.de/
A Pose-Invariant Descriptor for Human
Detection and Segmentation
1 Introduction
Human detection is a widely-studied problem in vision. It still remains challeng-
ing due to highly articulated body postures, viewpoint changes, varying illumi-
nation conditions, and background clutter. Combinations of these factors result
in large variability of human shapes and appearances in images. We present
an articulation-insensitive feature extraction method and apply it to machine
learning-based human detection. Our research goal is to robustly and efficiently
detect and segment humans under varying poses.
Numerous approaches have been developed for human detection in still im-
ages or videos. Most of them use shape information as the main discriminative
cue. These approaches can be roughly classified into two categories. The first
category models human shapes globally or densely over image locations, e.g.
a shape template hierarchy in [1], an over-complete set of Haar wavelet features
in [2], rectangular features in [3], histograms of oriented gradients (HOG) in [4]
or locally deformable Markov models in [5]. Global schemes such as [4, 6] are
designed to tolerate certain degrees of occlusions and shape articulations with a
large number of samples and have been demonstrated to achieve excellent per-
formance with well-aligned, more-or-less fully visible training data. The second
category of approaches uses local feature-based approaches to learn body part
¹ For negative samples, pose estimation is forced to proceed even though no person is present in them.
3 Pose-Invariant Descriptors
3.1 Low-Level Feature Representation
For pedestrian detection, histograms of oriented gradients (HOG) [4] exhib-
ited superior performance in separating image patches into human/non-human.
These descriptors ignore spatial information locally, hence are very robust to
small alignment errors. We use a very similar representation as our low-level fea-
ture description, i.e. (gradient magnitude-weighted) edge orientation histograms.
Given an input image I, we calculate gradient magnitudes |GI | and edge
orientations OI using simple difference operators (−1, 0, 1) and (−1, 0, 1)^T in
horizontal-x and vertical-y directions, respectively. We quantize the image re-
gion into local 8 × 8 non-overlapping cells, each represented by a histogram
of (unsigned) edge orientations (each surrounding pixel contributes a gradient
magnitude-weighted vote to the histogram bins). Edge orientations are quantized
into N_b = 9 orientation bins [kπ/N_b, (k + 1)π/N_b), where k = 0, 1, . . . , N_b − 1.
For reducing aliasing and discontinuity effects, we also use trilinear interpola-
tion as in [4] to vote for the gradient magnitudes in both spatial and orientation
dimensions. Additionally, each set of neighboring 2 × 2 cells form a block. This
results in overlapping blocks where each cell is contained in multiple blocks. For
reducing illumination sensitivity, we normalize the group of histograms in each
block using L2 normalization with a small regularization constant to avoid
division by zero. Figure 2 shows example visualizations of our low-level HOG
descriptors.
The above computation results in our low-level feature representation consist-
ing of a set of raw (cell) histograms (gradient magnitude-weighted) and a set of
normalized block descriptors indexed by image locations. As will be explained in
the following, both unnormalized cell histograms and block descriptors are used
for inferring poses and computing final features for detection.
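A simplified NumPy version of the low-level feature computation described above (omitting the trilinear interpolation) is sketched below; the helper names are ours.

```python
import numpy as np

def cell_histograms(image, cell=8, nbins=9):
    """Gradient-magnitude-weighted, unsigned edge-orientation histograms per 8x8 cell."""
    image = np.asarray(image, dtype=float)
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]        # (-1, 0, 1) filter, horizontal
    gy[1:-1, :] = image[2:, :] - image[:-2, :]        # (-1, 0, 1)^T filter, vertical
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx) % np.pi                  # unsigned orientation in [0, pi)
    bins = np.minimum((ori / (np.pi / nbins)).astype(int), nbins - 1)

    H, W = image.shape
    hists = np.zeros((H // cell, W // cell, nbins))
    for ci in range(H // cell):
        for cj in range(W // cell):
            b = bins[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            m = mag[ci*cell:(ci+1)*cell, cj*cell:(cj+1)*cell].ravel()
            hists[ci, cj] = np.bincount(b, weights=m, minlength=nbins)
    return hists

def block_descriptors(hists, eps=1e-3):
    """L2-normalised descriptors for each 2x2 group of neighbouring cells."""
    nci, ncj, nb = hists.shape
    blocks = np.zeros((nci - 1, ncj - 1, 4 * nb))
    for i in range(nci - 1):
        for j in range(ncj - 1):
            v = hists[i:i+2, j:j+2].ravel()
            blocks[i, j] = v / np.sqrt(np.sum(v**2) + eps**2)   # regularised L2 normalisation
    return blocks
```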
parameters. As shown in the figure, the tree consists of 186 part-templates, i.e.
6 head-torso (ht) models, 18 upper-leg (ul) models, and 162 lower-leg (ll) models,
organized hierarchically based on the layout of human body parts in a top-
to-bottom manner. Due to the tree structure, a fast hierarchical shape (or pose)
matching scheme can be applied using the model. For example, using hierarchical
part-template matching (which will be explained later), we only need to match
24 part-templates to cover the equivalent of matching 486 global shape
models with the method in [1], so it is extremely fast. For the details of the tree
model construction method, readers are referred to [14].
L(\theta|I) = L(\theta_{ht}, \theta_{ul}, \theta_{ll}|I) = \sum_{j \in \{ht, ul, ll\}} L(\theta_j|I).   (1)
For the purpose of pose estimation, we should jointly consider different parts θj
for optimization of L. Hence, based on the layer structure of the tree in Figure 3,
the likelihood L is decomposed into conditional likelihoods as follows:
where |T_{θ_j}| denotes the length of the part-template, and t represents individual
contour points along the template. Suppose the edge orientation of contour point
t is O(t); its corresponding orientation bin index B(t) is computed as B(t) =
⌊O(t)/(π/9)⌋ (⌊x⌋ denotes the largest integer less than or equal to x), and the un-
normalized (raw) orientation histogram at location (x + st) is H = {h_i}. Then,
the individual matching score d_I at contour point t is expressed as:

d_I(x + st) = \sum_{b=-\delta}^{\delta} w(b)\, h_{B(t)+b},   (4)
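The per-point matching score of Eq. (4) could be computed roughly as follows; the histogram lookup is assumed to be provided, the circular bin indexing is our own choice, the default weights follow the footnote below (δ = 1, w(±1) = 0.25, w(0) = 0.5), and averaging over contour points reflects the |T_{θ_j}| normalisation mentioned above.

```python
import numpy as np

def template_matching_score(template_points, template_bins, raw_hist_at, x, s,
                            weights=(0.25, 0.5, 0.25)):
    """Average per-point matching score d_I along a part-template (Eq. (4)).

    template_points : list of contour point offsets t of the template.
    template_bins   : B(t), the orientation-bin index of each contour point.
    raw_hist_at     : callable returning the unnormalised 9-bin histogram at a pixel.
    x, s            : template location and scale.
    """
    nbins = 9
    delta = len(weights) // 2
    total = 0.0
    for t, bt in zip(template_points, template_bins):
        hist = raw_hist_at(np.asarray(x) + s * np.asarray(t))
        # neighbouring-bin voting; wrap-around indexing is an assumption of this sketch
        score = sum(weights[b + delta] * hist[(bt + b) % nbins]
                    for b in range(-delta, delta + 1))
        total += score
    return total / max(len(template_points), 1)
```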
Optimization. The structure of our part-template model and the form (sum-
mation) of the global object likelihood L suggest that the optimization problem
⁴ For simplicity, we use δ = 1 and w(1) = w(−1) = 0.25, w(0) = 0.5 in our experiments.
4 Experiments
4.1 Datasets
We use both the INRIA person dataset and the MIT-CBCL pedestrian dataset
for detection and segmentation performance evaluation. The MIT-CBCL dataset
contains 924 front/back-view positive images (no negative images), and the INRIA
dataset contains 2416 positive training samples and 1218 negative training images
plus 1132 positive testing samples and 453 negative testing images. Compared to
the MIT dataset, the INRIA dataset is much more challenging due to significant
pose articulations, occlusion, clutter, viewpoint and illumination changes.
Training. We first extract pose-invariant descriptors for the set of 2416 posi-
tive and 12180 negative samples and batch-train a discriminative classifier for
the initial training algorithm. We use the publicly available LIBSVM tool [24]
Testing. For evaluation on the MIT dataset, we chose its first 724 image patches
as positive training samples and 12180 training images from the INRIA
dataset as negative training samples. The test set contains 200 positive samples
(Figure 5: Detection Error Tradeoff (DET) curves, miss rate versus false positives per window (FPPW), comparing the pose-invariant descriptor with classification on Riemannian manifolds, Dalal and Triggs' kernel and linear HoG detectors, and Zhu et al.'s cascade of rejectors; a comparison of single-scale scanning (1132 positive and 898016 negative windows) with multi-scale scanning (1132 positive and 2156585 negative windows); and the confidence distributions of the positive and negative test samples.)
from the MIT dataset and 1200 negative samples from the INRIA dataset. As a
result, we achieve a 1.0 (100%) true positive rate and a 0.00% false positive rate even
without retraining. Direct comparisons on the MIT dataset are difficult since
there are no negative samples and no separation of training and testing samples
in this dataset. Indirect comparisons show that our results on this dataset are
similar to the performance achieved previously in [4].
For the INRIA dataset, we evaluated our detection performance on 1132 pos-
itive image patches and 453 negative images. Negative test images are scanned
exhaustively in the same way as in retraining. The detailed comparison of our
detector with current state of the art detectors on the INRIA dataset is plotted
using the DET curves as shown in Figure 5. The comparison shows that our
approach is comparable to state of the art human detectors. The dimensionality
of our features is less than half of that used in HOG-SVM [4], but we achieve
better performance. Another advantage of our approach is that it is capable
of not only detecting but also segmenting human shapes and poses. In this re-
gard, our approach can be further improved because our current pose model is
very simple and can be extended to cover a much wider range of articulations.
Fig. 6. Detection results. Top: Example detections on the INRIA test images; nearby
windows are merged based on distances. Bottom: Examples of false negatives (FNs)
and false positives (FPs) generated by our detector.
Figure 6 shows examples of detection on whole images and examples of false neg-
atives and false positives from our experiments. Note that FNs are mostly due to
unusual poses or illumination conditions, or significant occlusions; FPs mostly
appear in highly textured samples (such as trees) and structures resembling
human shapes.
5 Conclusion
Acknowledgement
This work was funded, in part, by Army Research Laboratory Robotics Col-
laborative Technology Alliance program (contract number: DAAD 19-012-0012
ARL-CTA-DJH). We would like to thank Fatih Porikli, Oncel Tuzel, and Mo-
hamed Hussein for providing results of their approaches for comparison.
References
1. Gavrila, D.M., Philomin, V.: Real-time object detection for smart vehicles. In:
ICCV (1999)
2. Papageorgiou, C., Evgeniou, T., Poggio, T.: A trainable pedestrian detection system.
In: Proc. of Intelligent Vehicles (1998)
3. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: CVPR (2001)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
5. Wu, Y., Yu, T., Hua, G.: A statistical field model for pedestrian detection. In:
CVPR (2005)
6. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on riemannian
manifold. In: CVPR (2007)
7. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a proba-
bilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV
2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
8. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In:
CVPR (2005)
9. Shotton, J., Blake, A., Cipolla, R.: Contour-based learning for object detection.
In: ICCV (2005)
10. Opelt, A., Pinz, A., Zisserman, A.: A boundary-fragment-model for object detec-
tion. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952,
pp. 575–588. Springer, Heidelberg (2006)
11. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments
for object detection. IEEE Trans. PAMI 30(1), 36–51 (2008)
12. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single
image by bayesian combination of edgelet part detectors. In: ICCV (2005)
13. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in im-
ages by components. IEEE Trans. PAMI 23(4), 349–361 (2001)
14. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: Hierarchical part-template
matching for human detection and segmentation. In: ICCV (2007)
15. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and
appearance. In: ICCV (2003)
16. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of
flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006)
17. Sharma, V., Davis, J.W.: Integrating appearance and motion cues for simultaneous
detection and segmentation of pedestrians. In: ICCV (2007)
18. Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast human detection using a cascade
of histograms of oriented gradients. In: CVPR (2006)
19. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support
vector machines is efficient. In: CVPR (2008)
20. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In:
CVPR (2007)
21. Wu, B., Nevatia, R.: Optimizing discrimination-efficiency tradeoff in integrating
heterogeneous local features for object detection. In: CVPR (2008)
22. Gavrila, D.M.: A bayesian, exemplar-based approach to hierarchical shape match-
ing. IEEE Trans. PAMI 29(8), 1408–1421 (2007)
23. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recogni-
tion. International Journal of Computer Vision 61(1), 55–79 (2005)
24. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Texture-Consistent Shadow Removal
1 Introduction
Shadow removal is often required in digital photography as well as in many vision
applications. For clarity, we define the problem of shadow removal at the very
beginning. Following previous work [1,2,3], an image I can be represented as the
composition of the reflectance field R and the illumination field L as follows:
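A minimal statement of this decomposition, assuming the standard multiplicative intrinsic-image form (the exact notation of Equation 1 may differ), is:

I(x, y) = R(x, y) \cdot L(x, y)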
(a) original image (b) shadow boundary (c) zoom in (d) our result
Fig. 1. Given a rough shadow boundary 'P' provided by users (b and c), our algorithm
removes the shadow (d). The red curve inside the brush stroke is the trajectory of the
brush center. Users do not need to provide a precise shadow boundary as shown in
(c) (notice the eagle's right wing). The brush strokes divide the image into three areas:
definite umbra areas, 'U', definite lit areas, 'L', and boundary, 'P', which contains the
penumbra area as well as parts of the umbra and lit area.
We now examine how the illumination change surface C affects an image. Since an image
can be reconstructed from its gradient field with proper boundary conditions,
we focus on how C affects the gradient field in the log domain.
1. C will affect the gradients in the penumbra area, where it is not uniform.
Ideally, C will not affect the gradients in the umbra and lit areas, since it is
uniform in these two areas and is canceled out when calculating the gradients.
However, this is often not true in practice, as explained in the following.
2. In practice, the imaging process suffers from noise and quantization errors.
Usually the signal-to-noise/quantization-error ratio in the shadow area is
lower than in the lit area. In this way, C makes the effect of noise and quantization
error on the gradients in the shadow area more significant than in the lit area.
3. Normally, the poor lighting in shadow areas can weaken the texture, and
even diminish the details. However, this is not always true for images
containing highly specular surfaces. If the illumination in the scene is strong,
texture details in the lit area disappear, while in the shadow area the
reduction of the illumination can preserve the textures there.
4. If the surface response curve has a different shape in the shadow and lit area,
scaling up the shadow region to cancel C will change the texture character-
istics.
From the above observations, we can see that applying the illumination change
surface C not only affects the gradients in the penumbra area but also affects
the characteristics of the gradient fields in the whole shadow area. We call the
former the shadow effect on the penumbra gradients and the latter the shadow
effect on the gradient characteristics in the shadow area.
This paper focuses on removing shadows from a single image. Many methods
have been presented to address this problem. Shadow removal is usually achieved
(d) texture preserving [3] (e) in-painting [4] (f) our result
Fig. 2. Motivating example. (b): multiplying constant to the image intensities inside
the shadow region. (c): zeroing gradients inside the shadow boundary. (d): texture-
preserving shadow removal [3]. (e): in-painting the shadow boundary region [4].
These methods usually work in the log image domain. However, as shown in Fig. 2(c),
zeroing gradients in the penumbra area nullifies the texture there. To
solve this problem, in-painting techniques are applied to fill in the missing texture
[12,4]. However, in-painting sometimes introduces inconsistent textures, as
illustrated in Fig. 2(e). Alternatively, Mohan et al. [10] estimate a soft shadow
model in the penumbra area and remove the shadow effect in the gradient domain
accordingly.
Although previous methods vary in estimating the illumination change surface
C, they share a common idea for reconstructing the shadow-free image in the
umbra area: multiplying by a constant scalar to cancel the effect of C. Applying 2D
integration in the log domain with proper boundary conditions is equivalent to
multiplying by a constant in the image domain. This scheme can effectively match
the overall illumination in the umbra area to that in the lit area, and applying
proper scalar constants in the penumbra area can also cancel the shadow effect
there. However, these methods cannot remove the shadow effect
on the texture characteristics of the shadow area. Multiplying by a constant
can magnify the noise and quantization error in the original shadow region. For
particular images with strong specular surfaces and strong lighting, the details
in the shadow area, which disappear in the lit area, will be enhanced. All these
lead to inconsistent texture between the shadow area and lit area. For example,
the texture in the shadow area in Fig. 2(c), (d) and (e) is not compatible with
that in the lit area.
In this paper, we provide a brush tool for users to mark the shadow boundary.
As illustrated in Fig. 1(c), users can select a brush much larger than
the boundary and do not need to delineate the boundary precisely. The brush
strokes divide an image into three areas: definite umbra area, definite lit area,
and boundary, which consists of penumbra area as well as parts of the umbra
and lit area. Our algorithm precisely locates the penumbra area from the user
specified boundary, and removes the shadow seamlessly. A working example of
our algorithm is illustrated in Fig. 1.
This paper aims to remove shadow effects such that the resulting shadow-free
image has consistent texture between the shadow and lit area. We first construct
a new image gradient field that removes the gradients induced by the shadow
effect and has consistent gradient characteristics between the shadow and lit
area. Then we can reconstruct the shadow-free image from the new gradient
field through 2D integration by solving a Poisson equation similar to previous
work (c.f. [2,6,13]). The major challenge is to construct the new image gradient
field Gn given only the rough shadow boundary from users. In § 2.1, we de-
scribe a novel algorithm to estimate the illumination change curves across the
shadow boundary and cancel the effect of illumination change on the gradient
field in the penumbra area. In § 2.2, we describe a method to estimate the
shadow effect on the texture characteristics in the shadow area and transform
the characteristics of gradients there to be compatible with that in the lit area.
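As a rough illustration of this reconstruction step (not the authors' implementation), the sketch below recovers an image from a modified gradient field G_n = (g_x, g_y) by solving the Poisson equation with a 5-point Laplacian; the clamped-border boundary handling and the dense per-pixel loop are simplifying assumptions.

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def poisson_reconstruct(gx, gy, boundary):
    # Recover image I from target gradients (gx, gy) by solving div(grad I) = div(G).
    # boundary: image providing fixed values on the outer border (simplified assumption).
    H, W = gx.shape
    # divergence of the target gradient field (backward differences)
    div = np.zeros((H, W))
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[1:, :] += gy[1:, :] - gy[:-1, :]

    idx = lambda y, x: y * W + x
    A = lil_matrix((H * W, H * W))
    b = np.zeros(H * W)
    for y in range(H):
        for x in range(W):
            i = idx(y, x)
            if x in (0, W - 1) or y in (0, H - 1):      # clamp border pixels
                A[i, i] = 1.0
                b[i] = boundary[y, x]
            else:                                       # interior: 5-point Laplacian
                A[i, i] = -4.0
                for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                    A[i, idx(y + dy, x + dx)] = 1.0
                b[i] = div[y, x]
    return spsolve(A.tocsr(), b).reshape(H, W)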
Properly handling the shadow boundary or the penumbra area is a challenge for
shadow removal. The ambiguity of the shadow boundary often makes automatic
shadow boundary detection methods fail. Relying on users to provide the pre-
cise shadow boundary casts a heavy burden on them. To relieve users’ burden,
Mohan et al. [10] presented a piece-wise model where users only need to specify
connected line segments to delineate the boundary. However, when dealing with
complex shadow boundaries like the eagle’s right wing in Fig. 1(c), their method
will still require users to specify a large number of key points. To further reduce
users’ burden, we only require a rough specification of the shadow boundary
from users using brush tools as illustrated in Fig. 1(c).
Given an inaccurate shadow boundary specification, our method simultane-
ously locates the shadow boundary precisely and estimates the illumination
change C(x, y) in Equation 2 in the penumbra area. The complex shape of
the shadow boundary makes devising a parametric model of C(x, y) difficult.
However, we observe that any line segment crossing the boundary has an easily
parameterizable illumination profile. Therefore, we model C(x, y) by sampling
line segments across the boundary and estimating a parametric model for each as
illustrated in Fig. 3(a). Since the user-provided boundary is usually not accurate
enough, unlike [3], we do not sample C(x, y) using line segments perpendicular
to the boundary. Instead, like [10], we use a vertical/horizontal sampling line per
(a) vertical sampling lines (b) illumination change model (illumination change plotted against position on the sampling line)
Fig. 3. Sampling illumination change surface using line segments. (a): vertical sampling
lines. (b): t0 and r are the brush center and brush radius. [t1 , t2 ] is the penumbra
area. extent is the range in the umbra and lit area, used to estimate the gradient
characteristics.
pixel along the boundary and use the estimated illumination change to cancel
the shadow effect on the gradient in Y/X direction. We estimate horizontal and
vertical illumination change sampling lines independently.
We model the illumination change along each line segment as the following
C^1 continuous piece-wise polynomial, as illustrated in Fig. 3(b):

C_l(t) = \begin{cases} c, & t < t_1; \\ f(t), & t_1 \le t \le t_2; \\ 0, & \text{otherwise}, \end{cases}    (3)
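For illustration only, here is a small Python sketch of such a profile; the paper does not specify f(t) beyond C^1 continuity, so a cubic Hermite "smoothstep" joining c to 0 with zero slope at t1 and t2 is assumed.

def illumination_change(t, c, t1, t2):
    # Piecewise C^1 illumination-change profile along a sampling line (Eq. 3, sketch).
    # c is the log-domain illumination step inside the umbra; f(t) here is a cubic
    # Hermite segment joining c to 0 with zero slope at t1 and t2 (an assumption).
    if t < t1:
        return c
    if t > t2:
        return 0.0
    s = (t - t1) / (t2 - t1)                    # normalized position in the penumbra
    return c * (1.0 - (3 * s**2 - 2 * s**3))    # smoothstep from c down to 0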
where E_fit(M_li, Ĩ) measures the fitness error of the illumination change model
M_li to the original shadow image Ĩ, E_sm(M_li, M_lj) measures the similarity between
M_li and M_lj, and N(li) denotes the neighborhood of sampling line li. λ
is a weighting parameter, with a default value of 10.
We measure E_fit(M_li, Ĩ), the fitness error of the model M_li to the shadow
image Ĩ, as how well the gradient in the penumbra area fits into its neighborhood
along the sampling line after shadow-effect compensation according to M_li:
E_{fit}(M_{li}, \tilde{I}) = -\prod_{t \in [t_{i0}-r_i,\, t_{i0}+r_i]} \varphi(\hat{G}_{li}(t), T^{tex}_{li})    (5)

\hat{G}_{li}(t) = \tilde{G}_{li}(t) - C'_{li}(t)    (6)
where C_li is the illumination change curve of M_li as defined in Equation 3, C'_li
is its first derivative, G̃_li is the gradient along li, and Ĝ_li(t) is the gradient
after canceling the shadow effect. T_li^tex is the texture distribution along li, and φ(·,·)
measures the fitness of the gradient to the distribution T_li^tex. We model the
texture distribution along li as a normal distribution N(μ_i, σ_i²) of the gradients,
which can be estimated explicitly from the umbra and lit extensions along li, as
illustrated in Fig. 3(b). Accordingly, we define the fitness measure as follows:

\varphi(G_{li}(t), T^{tex}_{li}) = \frac{\exp\!\left(-(G_{li}(t) - \mu_i)^2 / 2\sigma_i^2\right)}{\sqrt{2\pi\sigma_i^2}}.    (7)
We define E_sm(M_li, M_lj), the smoothness cost between neighboring illumination
change models, as follows:
where the first term measures the difference between the illumination steps from
the umbra to lit area, and the second term measures the difference between the
location of the penumbra area along sampling lines. We emphasize the fact that
the illumination change inside the umbra area is mostly uniform by weighting
the first term significantly. The default value for γ is 0.9.
Directly solving the minimization problem in Equation 4 is time-consuming.
We approximate the optimal solution in two steps:
1. For each sampling line li, we find the optimal illumination change model M_li^o
that best fits the shadow image by minimizing the fitness error defined
in Equation 5. Since the extent of the penumbra area is small, we use a
brute-force search method (see the sketch after this list).
2. With the optimal illumination change model M_li^o of each sampling line, we
approximate the fitness error term in Equation 4 using the difference between
the illumination change models M_li and M_li^o as follows:
E = \sum_{li} E_{sm}(M_{li}, M_{li}^{o}) + \lambda \sum_{li} \sum_{lj \in N(li)} E_{sm}(M_{li}, M_{lj})
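A sketch of the per-line brute-force search in step 1, under assumed discretizations of the penumbra extent [t1, t2] and of the umbra step c; the variable names, search grids and the smoothstep profile are illustrative choices rather than the paper's exact settings.

import numpy as np

def fit_line_model(grad, center, radius, c_grid, extent=10):
    # Step 1 (sketch): brute-force search of the illumination-change model on one
    # sampling line. grad: log-domain gradients along the line; center/radius: brush
    # center t0 and radius r; returns the (t1, t2, c) minimizing the fitness error.
    # texture statistics from the umbra/lit extensions outside the brush (Fig. 3(b))
    outside = np.r_[grad[:center - radius][-extent:], grad[center + radius:][:extent]]
    mu, sigma = outside.mean(), outside.std() + 1e-6

    best, best_err = None, np.inf
    for t1 in range(center - radius, center):
        for t2 in range(center + 1, center + radius):
            for c in c_grid:                              # candidate umbra steps
                C = np.zeros_like(grad)
                s = np.linspace(0.0, 1.0, t2 - t1)
                C[t1:t2] = c * (1 - (3 * s**2 - 2 * s**3))
                C[:t1] = c
                g_hat = grad - np.gradient(C)             # cancel shadow effect (Eq. 6)
                seg = g_hat[center - radius:center + radius]
                err = np.sum((seg - mu) ** 2) / (2 * sigma**2)  # -log Gaussian fit (up to constants)
                if err < best_err:
                    best, best_err = (t1, t2, c), err
    return best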
(a) original image (b) after removing shadow (c) after texture transfer
(d), (e): gradient profiles for the original image, after shadow removal, and after texture transfer
Fig. 4. Reconstruct the gradient field for shadow removal. (a) shows the original image
and its gradient field along X direction. For the sake of illustration, we encode the
negative and positive gradient values using the GREEN and RED channels respectively.
From the original gradient field, we can see the shadow effect on the gradient field by
noticing the strong edges along the shadow boundary. By estimating the illumination
change across the penumbra area, the shadow effect on the gradient field is canceled
as illustrated in (b) and (d). However, as we can see in (b) and (e) right, the shadow
area is more contrasty than the lit area, causing inconsistent texture characteristics.
This inconsistency is removed after gradient transformation as shown in (c) and (e).
After obtaining the illumination change model along each sampling line, we
apply it to the gradient field to cancel the shadow effect according to Equation 6.
An example of canceling the shadow effect on the gradients in the penumbra area
is shown in Fig. 4(a) and (b).
Like transferring color between images [15], where the global color characteristics
of an image are parameterized using its sample mean and deviation,
we model the texture characteristics using the sample mean and deviation of
the gradient field. Given the target mean and deviation, we transform the
gradient field in the shadow area as follows:
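Following the mean/deviation transfer used for color in [15], the transformation presumably takes the form below (a sketch of the assumed form, stated with the symbols defined next):

G^{s} = \frac{\hat{\sigma}^{t}}{\hat{\sigma}^{s}} \left( \hat{G}^{s} - \hat{\mu}^{s} \right) + \hat{\mu}^{t}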
where Ĝ^s and G^s are the gradients in the shadow area before and after transformation,
respectively, μ̂^s and σ̂^s are the mean and deviation of Ĝ^s, and μ̂^t and
σ̂^t are the target mean and deviation.
Like transferring color [15], using the characteristic parameters of the lit area
as the target parameters can achieve consistent texture characteristics between
the shadow and lit area. However, this scheme works well only if the texture
distribution is globally homogeneous in the image; otherwise it can destroy local
textures in the shadow area. We instead calculate the target characteristic parameters
by estimating the shadow effect on the gradient distribution and canceling this
effect from the original gradient field. Assuming the gradient distribution around
the shadow boundary is homogeneous and the shadow effect is independent of
the shadow-free image, we estimate the shadow effect parameters from gradients
around the boundary as follows:
\mu_{se} = \mu_b^s - \mu_b^l, \qquad \sigma_{se}^2 = (\sigma_b^s)^2 - (\sigma_b^l)^2    (9)

where μ_se and σ_se are the mean and deviation of the shadow effect on gradients
in the shadow area, μ_b^s and σ_b^s are the mean and deviation of the gradients on
the umbra side along the shadow boundary (the extent parts as illustrated in
Fig. 3(b)), and μ_b^l and σ_b^l are those on the lit-area side. Accordingly, the target
mean and deviation can be calculated by canceling the shadow effect as follows:

\hat{\mu}^t = \hat{\mu}^s - \mu_{se}, \qquad (\hat{\sigma}^t)^2 = (\hat{\sigma}^s)^2 - \sigma_{se}^2    (10)
Fig. 4(b) and (c) show that the gradient field transformation leads to consistent
texture characteristics between the shadow and lit area. Please refer to the whole
image in Fig. 6(a) to examine the consistency of the texture.
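Putting Equations 9 and 10 together with the mean/deviation transfer, a compact numpy sketch follows; the boundary-band gradient arrays are assumed to be extracted beforehand, and the clamping of negative variance estimates is our own safeguard rather than part of the method.

import numpy as np

def transfer_gradient_stats(g_shadow, g_umbra_band, g_lit_band):
    # Transform shadow-area gradients so their statistics match the lit area
    # after removing the estimated shadow effect (Eqs. 9-10, sketch).
    # g_shadow     : gradients in the (illumination-corrected) shadow area
    # g_umbra_band : gradients in the umbra-side band along the boundary
    # g_lit_band   : gradients in the lit-side band along the boundary
    mu_se = g_umbra_band.mean() - g_lit_band.mean()              # Eq. 9
    var_se = max(g_umbra_band.var() - g_lit_band.var(), 0.0)

    mu_t = g_shadow.mean() - mu_se                               # Eq. 10
    var_t = max(g_shadow.var() - var_se, 1e-12)

    # mean/deviation transfer of the shadow-area gradients (cf. [15])
    return (g_shadow - g_shadow.mean()) * np.sqrt(var_t / g_shadow.var()) + mu_t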
3 Results
We have experimented with our method on photos with shadows from Flickr.
These photos have different texture characteristics. We report some representa-
tive ones together with the results in Fig. 1, Fig. 2, Fig. 6, Fig. 7 and Fig. 8, as
well as comparisons to many representative works [2,6,4,3,10]. (Please refer to
Fig. 5. Images in (a) and (c) are from [10]. (b) shadow removed by nullifying the
gradients in the boundary [2,6]. (c) shadow removed using the method from [10]. There,
not only is the illuminance level in the lit area changed, but the shadow area is also
not as contrasty as the lit area. Our method creates a texture-consistent result.
contrasty levels as shown in Fig. 5(c). Our method effectively removes the shadow
while keeping consistent texture characteristics across the whole image, as
shown in Fig. 5(d) and other examples. For instance, in Fig. 7(b), the texture
of small shell grains in the shadow area and in the lit area is consistent. For the
desert example in Fig. 7(c), the highlights across the original shadow boundary
are consistent between the shadow and lit area. For the river surface example
in Fig. 7(d), the ripples in the shadow area are consistent with that in the lit
area. Particularly, the wavefront in the middle is continuous across the original
shadow boundaries. For the tree example in Fig. 7(a), the soil inside the shadow
region is consistent with the lit area surrounding it. The hill example in Fig. 8(a)
is similar.
From the results in Fig. 6, 7 and 8, we can see that the proposed algorithm
can seamlessly remove shadows in images with various texture characteristics.
For example, the shadows are on the beach (Fig. 6(a)), on the road surfaces
(Fig. 6(b)), on the sands (Fig. 7(b)), on the desert (Fig. 7(c)), on the river
surface (Fig. 7(d)), on the hills (Fig. 7(a) and Fig. 8(a)), etc. Our method works
well on specular surfaces such as Fig. 6(a), as well as Lambertian surfaces, such
as examples in Fig. 7.
The examples in Fig. 8(b) and (c) are very interesting. Notice the mountains
in these examples: shadow removal reveals the beautiful texture details in the
original dark shadow areas, which are concealed in the original shadow images.
What is particularly interesting is that shadow removal recovers the blue glacier
ice phenomenon1 in Fig. 8(b) (notice the blue-cyan area of the snow at the
bottom left).
We found from the experiments that our method does not work well on some
images. Taking Fig. 8(d) as an example, the shadow area in the original im-
age looks more reddish than its surrounding lit area. This is because when the
lighting is blocked by the semi-transparent red leaf, its red component can still
pass through. For this kind of cast shadow, the general shadow model in Equa-
tion 2 used in previous work (including ours) does not hold. Noticing the original
shadow region in the resulting image, we can still sense the reddish component
there. In the future, analyzing the caustics of shadow from its context may help solve
this problem. However, our current method is effective for many images.
4 Conclusion
1 http://www.northstar.k12.ak.us/schools/joy/denali/OConnor/colorblue.html
References
1. Barrow, H., Tenenbaum, J.: Recovering intrinsic scene characteristics from images.
In: Computer Vision Systems. Academic Press, London (1978)
2. Weiss, Y.: Deriving intrinsic images from image sequences. In: IEEE ICCV, pp.
68–75 (2001)
3. Arbel, E., Hel-Or, H.: Texture-preserving shadow removal in color images contain-
ing curved surfaces. In: IEEE CVPR (2007)
4. Finlayson, G.D., Hordley, S.D., Lu, C., Drew, M.S.: On the removal of shadows
from images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 59–68 (2006)
5. Liu, Z., Huang, K., Tan, T., Wang, L.: Cast shadow removal combining local and
global features. In: The 7th International Workshop on Visual Surveillance (2007)
6. Finlayson, G.D., Hordley, S.D., Drew, M.S.: Removing shadows from images. In:
7th European Conference on Computer Vision, pp. 823–836 (2002)
7. Salvador, E., Cavallaro, A., Ebrahimi, T.: Cast shadow segmentation using invari-
ant color features. Comput. Vis. Image Underst. 95(2), 238–259 (2004)
8. Levine, M.D., Bhattacharyya, J.: Removing shadows. Pattern Recognition Let-
ters 26(3), 251–265 (2005)
9. Wu, T.P., Tang, C.K., Brown, M.S., Shum, H.Y.: Natural shadow matting. ACM
Trans. Graph. 26(2), 8 (2007)
10. Mohan, A., Tumblin, J., Choudhury, P.: Editing soft shadows in a digital photo-
graph. IEEE Comput. Graph. Appl. 27(2), 23–31 (2007)
11. Baba, M., Mukunoki, M., Asada, N.: Shadow removal from a real image based on
shadow density. ACM SIGGRAPH 2004 Posters, 60 (2004)
12. Fredembach, C., Finlayson, G.D.: Hamiltonian path based shadow removal. In:
BMVC, pp. 970–980 (2005)
13. Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM Trans.
Graph. 22(3), 313–318 (2003)
14. Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J., Dongarra, J., Eijkhout,
V., Pozo, R., Romine, C., van der Vorst, H.: Templates for the Solution of Linear
Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia (1994)
15. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images.
IEEE Comput. Graph. Appl. 21(5), 34–41 (2001)
Scene Discovery by Matrix Factorization
1 Introduction
Classification of scenes has useful applications in content-based image indexing and re-
trieval and as an aid to object recognition (improving retrieval performance by removing
irrelevant images). Even though a significant amount of research has been devoted to
the topic, the question of what constitutes a scene has not been addressed. The task
is ambiguous because of the diversity and variability of scenes, but mainly because of
the subjectivity of the task. Just as in other areas of computer vision such as activity
recognition, it is not simple to define the vocabulary to label scenes. Thus, most ap-
proaches have used the physical setting where the image was taken to define the scene
(e.g., beach, mountain, forest, etc.).
What is a scene? In current methods, visual similarity is used to classify scenes into a
known set of types. We expect there are many types of scene, so that it will be hard to
write down a list of types in a straightforward way. We should like to build a vocabulary
of scene types from data. We believe that two images depict the same scene category if:
1. Objects that appear in one image could likely appear in the other
2. The images look similar under an appropriate metric.
This means one should be able to identify scenes by predicting the objects that are
likely to be in the image, or that tend to co-occur with objects that are in the image.
Thus, if we could estimate a list of all the annotations that could reasonably be attached
to the image, we could cluster using that list of annotations. The objects in this list of
annotations don’t actually have to be present – not all kitchens contain coffee makers –
but they need to be plausible hypotheses. We would like to predict hundreds of words
for each of thousands of images. To do so, we need stable features and it is useful to
exploit the fact that annotating words are correlated.
All this suggests a procedure akin to collaborative filtering. We should build a set of
classifiers, that, from a set of image features, can predict a set of word annotations that
are like the original annotations. For each image, the predicted annotations will include
words that annotators may have omitted, and we can cluster on the completed set of
annotations to obtain scenes. We show that, by exploiting natural regularization of this
problem, we obtain image features that are stable and good at word prediction. Clus-
tering with an appropriate metric in this space is equivalent to clustering on completed
annotations; and the clusters are scenes.
We will achieve this goal by using matrix factorization [21,1] to learn a word classi-
fier. Let Y be a matrix of word annotations per image, X the matrix of image features
per image, and W a linear classifier matrix; we look for W to minimize
The regularization term will be constructed to minimize the rank of W , in order to im-
prove generalization by forcing word classifiers to share a low dimensional represen-
tation. As the name “matrix factorization” indicates, W is represented as the product
Fig. 1. Matrix factorization for word prediction. Our proxy goal is to find a word classifier W
on image features X. W factorizes into the product W = F G. We regularize with the rank of
W; this makes F^t X a low-dimensional feature space that maximizes word predictive power.
In this space, where correlated words are mapped close, we learn the classifiers G.
of two matrices F and G. This factorization learns a feature mapping (F ) with shared
characteristics between the different words. This latent representation should be a good
space to learn correlated word classifiers G (see figure 1).
Our problem is related to multi-task learning as clearly the problem of assigning one
word to an image is correlated with the other words. In a related approach [2], Ando
and Zhang learn multiple classifiers with a shared structure, alternating between fixing the
structure to learn SVM classifiers and fixing the classifiers to learn the structure using
SVD. Ando and Zhang propose an interesting insight into the problem: instead of do-
ing dimensionality reduction on the data space (like PCA), they do it in the classifier
space. This means the algorithm looks for low-dimensional structures with good pre-
dictive, rather than descriptive, power. This leads to an internal representation where the
tasks are easier to learn. This is a big conceptual difference with respect to approaches
like [14,3]. It is also different from the CRF framework of [20], where pairwise co-
occurrence frequencies are modeled.
Quattoni et al. [18] proposed a method for supervised classification of topics using
auxiliary tasks, following [2]. In contrast, our model discovers scenes without super-
vision. We also differ in that [18] first learns word classifiers, fixes them, and then finds
the space for the topic (scene) prediction. We learn both the internal structure and the
classifiers simultaneously, in a convex formulation. Thus our algorithm is able to use
correlation between words not only for the scene classification task but also for word
prediction. This results in improved word prediction performance. In section 4 we show
the model also produces better results than [18] for the scene task, even without having
the scene labels!
Y ∈ {±1}^{M×N}, where each column is an image and each row a word, W ∈ R^{d×M}
is the classifier matrix, and X ∈ R^{d×N} is the observation matrix. We will initially
consider the words to be decoupled (as in regular SVMs), and use the L2 regularization
\sum_m \|w_m\|_2^2 = \|W\|_F^2 (known as the Frobenius norm of W). A suitable loss for a
max-margin formulation is the hinge function h(z) = max(0, 1 − z). The problem can
then be stated as

\min_{W} \; \frac{1}{2}\|W\|_F^2 + C \sum_{i=1}^{N} \sum_{m=1}^{M} \Delta(y_{im})\, h(y_{im} \cdot (w_m^t x_i))    (2)
where C is the trade-off constant between data loss and regularization, and Δ is a slack
re-scaling term we introduce to penalize errors differently: false negatives have Δ(1) = 1
and false positives Δ(−1) < 1. The rationale is that missing word annotations are
much more common than wrong annotations for this problem.
Our word prediction formulation of the loss is different from [21] (a pure collabora-
tive filtering model) and [1] (a multi-class classifier), even though our tracenorm regu-
larization term is similar to theirs. Our formulation is, to the best of our knowledge, the
first application of the tracenorm regularization to a problem of these characteristics.
From [1] we took the optimization framework, although we are using different losses
and approximations and we are using BFGS to perform the minimization. Finally, we
introduce an unsupervised model on top of the internal representation this formulation
produces to discover scenes.
\min_{W} \; \frac{1}{2}\|W\|_{\Sigma} + C \sum_{i=1}^{N} \sum_{m=1}^{M} \Delta(y_{im})\, h(y_{im} \cdot (w_m^t x_i))    (3)
Rennie [21] showed (3) can be recast as a Semidefinite Program (SDP). Unfortunately,
SDPs don’t scale nicely with the number of dimensions of the problem, making any
decent size problem intractable. Instead, he proposed gradient descent optimization.
Fig. 2. Smooth approximations of the hinge function (left) and absolute value function (right),
used in the gradient descent optimization
In our experiments we use ρ = σ = 10^{-7}. Plots of both approximations are depicted in
figure 2.
We will then consider the smooth cost
J(W ; Y, X, σ, ρ) = JR (W ; σ) + C · JD (W ; Y, X, ρ) (6)
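To make the cost concrete, here is a small numpy sketch of evaluating Equation 6 and a (sub)gradient, simplified by using the exact trace norm and hinge in place of the smoothed J_R, J_D surrogates; the false-positive penalty Δ(−1) is called eps here, and all names are illustrative rather than the paper's.

import numpy as np

def objective_and_grad(W, X, Y, C=1.0, eps=0.3):
    # Trace-norm regularized, slack-rescaled hinge objective (Eqs. 3 and 6, sketch).
    # W: d x M classifiers, X: d x N features, Y: M x N labels in {-1, +1}.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    reg = 0.5 * s.sum()                       # (1/2) ||W||_Sigma
    grad_reg = 0.5 * U @ Vt                   # (sub)gradient of the trace norm

    margins = Y * (W.T @ X)                   # M x N matrix of y_im * (w_m^t x_i)
    delta = np.where(Y > 0, 1.0, eps)         # slack re-scaling Delta(y_im)
    active = (margins < 1).astype(float)      # hinge is active where margin < 1
    loss = C * np.sum(delta * np.maximum(0.0, 1.0 - margins))
    grad_loss = -C * X @ (delta * active * Y).T   # d x M
    return reg + loss, grad_reg + grad_loss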
2.2 Kernelization
An interesting feature of problem 3 is that it admits a solution when high-dimensional
features X are not available but instead the Gram matrix K = X^t X is provided. Theorem
1 in [1] can be applied with small modifications to prove that there exists a matrix
α ∈ R^{M×N} so that the minimizer of (3) is W = Xα. But instead of solving the dual
Lagrangian problem, we use this representation of W to minimize the primal problem
(actually, its smoothed version) using gradient descent. The derivatives in terms of
K and α only become
\frac{\partial J_R}{\partial \alpha} = \frac{\partial \|X\alpha\|_S}{\partial \alpha} = X^t \frac{\partial \|X\alpha\|_S}{\partial (X\alpha)} = K\alpha V D^{-1} a_{\sigma}(D) V^t    (11)

using that D(V V^t)D^{-1} = I, Xα = U D V^t, and K = X^t X. The gradient of the
data loss term is

\frac{\partial J_D}{\partial \alpha} = -K * (\Delta(Y) \cdot h'_{\rho}(\alpha^t K \alpha) \cdot Y)    (12)
4 Experiments
To demonstrate the performance of our scene discovery model we need a dataset with
multiple object labels per image. We chose the standard subset of the Corel image
collection [7] as our benchmark dataset. This subset has been extensively used and
consists of 5000 images grouped in 50 different sets (CDs). These images are separated
into 4500 training and 500 test images. The vocabulary size of this dataset is 374, out
of which 371 appear in train and 263 in test set. The annotation length varies from 1 to
5 words per image.
We employ features used in the PicSOM [23] image content analysis framework.
These features convey image information using 10 different, but not necessarily uncorre-
lated, feature extraction methods. Feature vector components include: DCT coefficients
of average color in 20x20 grid (analogous to MPEG-7 ColorLayout feature), CIE LAB
color coordinates of two dominant color clusters, 16 × 16 FFT of Sobel edge image,
MPEG-7 EdgeHistogram descriptor, Haar transform of quantised HSV color histogram,
three first central moments of color distribution in CIE LAB color space, average CIE
LAB color, co-occurrence matrix of four Sobel edge directions, histogram of four Sobel
edge directions, and a texture feature based on relative brightness of neighboring pixels.
The final image descriptor is a 682-dimensional vector. We append a constant value 1
to each vector to learn a threshold for our linear classifiers.
Fig. 3. Example clustering results on the Corel training set. Each row consists of the closest im-
ages to the centroid of a different cluster. The number on the right of each image is the Corel CD
label. The algorithm is able to discover scenes even when there is high visual variability in the
images (e.g., people cluster, swimmers, CD-174 cluster). Some of the clusters (e.g., sunsets, peo-
ple) clearly depict scenes, even if the images come from different CDs. (For display purposes,
portrait images were resized)
Scene discovery. First, we explore the latent space described in section 3. As mentioned
there, the cosine distance is natural to represent dissimilarity in this space. To be able
to use it for clustering we will employ graph-based methods. We expect scene clusters
to be compact and thus use complete link clustering. We look initially for many more
clusters than scene categories, and then remove clusters with a small number of images
allocated to them. We reassign those images to the remaining clusters using the closest
5 nearest neighbors. This produced approximately 1.5 clusters per CD label. For the test
set we use again the 5 nearest neighbors to assign images to the train clusters. As shown
in figure 3, the algorithm found highly plausible scene clusters, even in the presence of
Fig. 4. Example results on the Corel test set. Each row consists of the closest 7 test images to
each centroid found on the training set. The number on the right of each image is the Corel CD
label. Rows correspond to scenes, which would be hard to discover with pure visual clustering.
Because our method is able to predict word annotations while clustering scenes, it is able to
discount large but irrelevant visual differences. Despite this, some of the mistakes are due to visual
similarity (e. g. the bird in the last image of the plane cluster, or the skyscraper in the last image
of the mountain cluster). (For displaying purposes, portrait images were resized).
large visual variability. This is due to the fact that these images depict objects that
tend to appear together. The algorithm also generalizes well: when the clusters were
transferred to the test set it still produced a good output (see figure 4).
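A sketch of this clustering step on the latent features (rows of F^t X) is shown below; the scipy routines, the cluster count, the minimum cluster size and k are illustrative choices rather than the exact settings used in the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, cdist

def discover_scenes(Z, n_clusters=75, min_size=20, k=5):
    # Complete-link clustering of latent features with the cosine distance (sketch).
    # Z: n_images x latent_dim matrix. Small clusters are dissolved and their images
    # reassigned by majority vote over the k nearest images in the remaining clusters.
    D = pdist(Z, metric='cosine')
    labels = fcluster(linkage(D, method='complete'), n_clusters, criterion='maxclust')

    sizes = np.bincount(labels)
    keep = np.isin(labels, np.where(sizes >= min_size)[0])
    for i in np.where(~keep)[0]:                     # reassign images from tiny clusters
        d = cdist(Z[i:i + 1], Z[keep], metric='cosine')[0]
        nn = np.argsort(d)[:k]
        vals, counts = np.unique(labels[keep][nn], return_counts=True)
        labels[i] = vals[np.argmax(counts)]
    return labels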
Word prediction. Our approach to scene discovery is based on the internal representa-
tion of the word classifier, so these promising results suggest a good word annotation
prediction performance. Table 1 shows that the precision, recall and F1-measure of our word
prediction model are competitive with those of the best state-of-the-art methods on this dataset.
Changing the value of the false-positive penalty Δ(−1) in Equation 3 traces out the precision-recall curve; we show the
equal error rate (P = R) result. It is remarkable that the kernelized classifier does not
provide a substantial improvement over the linear classifier. The reason for this may lie
in the high dimensionality of the feature space, in which all points are roughly at the
same distance. In fact, using a standard RBF kernel produced significantly lower re-
sults; thus the sigmoid kernel, with a broader support, performed much better. Because
of this and the higher computational complexity of the kernelized classifier, we use
the linear classifier for the rest of the experiments.
The influence of the tracenorm regularization is clear when the results are com-
pared to independent linear SVMs on the same features (that corresponds to using the
Frobenius norm regularization, equation 2). The difference in performance indicates
Table 1. Comparison of the performance of our word annotation prediction method with that
of the Co-occurrence model (Co-occ), Translation Model (Trans), Cross-Media Relevance Model
(CMRM), Text space to image space (TSIS), Maximum Entropy model (MaxEnt), Continuous
Relevance Model (CRM), 3×3 grid of color and texture moments (CT-3×3), Inference Network
(InfNet), Multiple Bernoulli Relevance Models (MBRM), Mixture Hierarchies model (MixHier),
PicSOM with global features, and linear independent SVMs on the same features. The perfor-
mance of our model is provided for the linear and kernelized (sigmoid) classifiers.* Note: the
results of the PicSOM method are not directly comparable as they limit the annotation length to
be at most five (we do not place this limit as we aim to complete the annotations for each image).
Method P R F1 Ref
Co-occ 0.03 0.02 0.02 [16]
Trans 0.06 0.04 0.05 [7]
CMRM 0.10 0.09 0.10 [9]
TSIS 0.10 0.09 0.10 [5]
MaxEnt 0.09 0.12 0.10 [10]
CRM 0.16 0.19 0.17 [11]
CT-3×3 0.18 0.21 0.19 [25]
CRM-rect 0.22 0.23 0.23 [8]
InfNet 0.17 0.24 0.23 [15]
Independent SVMs 0.22 0.25 0.23
MBRM 0.24 0.25 0.25 [8]
MixHier 0.23 0.29 0.26 [4]
This work (Linear) 0.27 0.27 0.27
This work (Kernel) 0.29 0.29 0.29
PicSOM 0.35∗ 0.35∗ 0.35∗ [23]
Example predicted word lists shown under the images: sky sun clouds sea; sea; tree birds snow fly; sky sun jet plane; sky water beach; mountain sky water; waves birds water; people sand sailboats; clouds park
Fig. 5. Example word completion results. Correctly predicted words are below each image in
blue, predicted words not in the annotations (“False Positives”) are italic red, and words not
predicted but annotated (“False Negatives”) are in green. Missing annotations are not uncommon
in the Corel dataset. Our algorithm performs scene clustering by predicting all the words that
should be present on an image, as it learns correlated words (e. g. images with sun and plane
usually contain sky, and images with sand and water commonly depict beaches). Completed
word annotations are a good guide to scene categories while original annotations might not be;
this indicates visual information really matters.
the sharing of features among the word classifiers is beneficial. This is especially true for
words that are less common.
Annotation completion. The promising performance of the approach results from its
generalization ability; this in turn lets the algorithm predict words that are not anno-
tated in the training set but should have been. Figure 5 shows some examples of word
completion results. It should be noted that performance evaluation in the Corel dataset
is delicate, as missing words in the annotation are not uncommon.
Discriminative scene prediction. The Corel dataset is divided into sets (CDs) that do
not necessarily depict different scenes. As can be observed in figure 3, some correctly
clustered scenes are spread among different CD labels (e.g., sunsets, people). In order
to evaluate our unsupervised scene discovery, we selected a subset of 10 out of the 50
CDs from the dataset so that the CD number can be used as a reliable proxy for scene
labels. The subset consists of CDs: 1 (sunsets), 21 (race cars), 34 (flying airplanes),
130 (african animals), 153 (swimming), 161 (egyptian ruins), 163 (birds and nests),
182 (trains), 276 (mountains and snow) and 384 (beaches). This subset has visually
very dissimilar pictures with the same labels and visually similar images (but depicting
different objects) with different labels. The train/test split of [7] was preserved.
To evaluate the performance of the unsupervised scene discovery method, we label
each cluster with the most common CD label in the training set and then evaluate the
scene detection performance in the test set. We compare our results with the same clus-
tering technique on the image features directly. In this space the cosine distance loses
Table 2. Comparison of the performance of our scene discovery on the latent space with another
unsupervised method and four supervised methods on image features directly. Our model pro-
duced significantly better results than the unsupervised method on the image features, and is only
surpassed by the supervised kernelized SVM. For both unsupervised methods, clustering is done
on the train set and performance is measured on the test set (see text for details).
Method Accuracy
Unsupervised Latent space (this work) 0.848
Unsupervised Image features clustering 0.697
Supervised Image features KNN 0.848
Supervised Image features SVM (linear) 0.798
Supervised Image features SVM (kernel) 0.948
Supervised ”structural learning” [2,18] 0.818
its meaning, and thus we use the Euclidean distance. We also computed the performance
of several supervised approaches on the image features: k nearest neighbors (KNN), support
vector machines (SVM), and "structural learning" (introduced in [2] and used in a
vision application, Reuters image classification, in [18]). We use a one-vs-all approach
for the SVMs. Table 2 shows that the latent space is indeed a suitable space for scene de-
tection: it clearly outperforms clustering on the original space, and only the supervised
SVM using a kernel provides an improvement over the performance of our method.
The difference with [18] deserves further exploration. Their algorithm classifies top-
ics (in our case scenes) by first learning a classification of auxiliary tasks (in this case
words), based on the framework introduced in [2]. [18] starts by building independent
Fig. 6. Dendrogram for our clustering method. Our scene discovery model produces 1.5 proto-
scenes per scene. Clusters belonging to the same scene are among the first to be merged
Fig. 7. Future work includes unsupervised region annotation. Example images show promising
results for region labeling. Images are presegmented using normalized cuts (red lines), features
are computed in each region and fed to our classifier as if they were whole image features.
5 Conclusions
Scene discovery and classification is a challenging task that has important
applications in object recognition. We have introduced a principled way of defining
a meaningful vocabulary of what constitutes a scene. We consider scenes to depict correlated
objects and to present visual similarity. We introduced a max-margin factorization
model to learn these correlations. The algorithm allows for scene discovery on par with
supervised approaches even without explicitly labeling scenes, producing highly plausi-
ble scene clusters. This model also produced state of the art word annotation prediction
results including good annotation completion.
Future work will include using our classifier for weakly supervised region annota-
tion/labeling. For a given image, we use normalized cuts to produce a segmentation.
Using our classifier, we know what words describe the image. We then restrict our clas-
sifier to these word subsets and to the features in each of the regions. Figure 7 depicts
examples of such annotations. These are promising preliminary results; since quantita-
tive evaluation of this procedure requires ground truth labels for each segment,
we only show qualitative results.
Acknowledgements
The authors would like to thank David Forsyth for helpful discussions.
This work was supported in part by the National Science Foundation under IIS -
0534837 and in part by the Office of Naval Research under N00014-01-1-0890 as part
of the MURI program. Any opinions, findings and conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily reflect those
of the National Science Foundation or the Office of Naval Research.
References
1. Amit, Y., Fink, M., Srebro, N., Ullman, S.: Uncovering shared structures in multiclass clas-
sification. In: ICML, pp. 17–24 (2007)
2. Ando, R.K., Zhang, T.: A high-performance semi-supervised learning method for text chunk-
ing. In: ACL (2005)
3. Bosch, A., Zisserman, A., Munoz, X.: Scene classification via plsa. In: Leonardis, A.,
Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 517–530. Springer, Hei-
delberg (2006)
4. Carneiro, G., Vasconcelos, N.: Formulating semantic image annotation as a supervised learn-
ing problem. In: CVPR, vol. 2, pp. 163–168 (2005)
5. Celebi, E., Alpkocak, A.: Combining textual and visual clusters for semantic image retrieval
and auto-annotation. In: 2nd European Workshop on the Integration of Knowledge, Seman-
tics and Digital Media Technology, 30 November - 1 December 2005, pp. 219–225 (2005)
6. Chapelle, O., Haffner, P., Vapnik, V.: SVMs for histogram-based image classification. IEEE
Transactions on Neural Networks, special issue on Support Vectors (1999)
7. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.A.: Object recognition as machine
translation: Learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G.,
Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 97–112. Springer, Hei-
delberg (2002)
8. Feng, S.L., Manmatha, R., Lavrenko, V.: Multiple bernoulli relevance models for image and
video annotation. In: CVPR, vol. 02, pp. 1002–1009 (2004)
9. Jeon, J., Lavrenko, V., Manmatha, R.: Automatic image annotation and retrieval using cross-
media relevance models. In: SIGIR, pp. 119–126 (2003)
10. Jeon, J., Manmatha, R.: Using maximum entropy for automatic image annotation. In: Enser,
P.G.B., Kompatsiaris, Y., O’Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds.) CIVR
2004. LNCS, vol. 3115, pp. 24–32. Springer, Heidelberg (2004)
11. Lavrenko, V., Manmatha, R., Jeon, J.: A model for learning the semantics of pictures. In:
NIPS (2003)
12. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for
recognizing natural scene categories. In: CVPR, pp. 2169–2178 (2006)
13. Li, F.-F., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In:
CVPR, vol. 2, pp. 524–531 (2005)
14. Liu, J., Shah, M.: Scene modeling using co-clustering. In: ICCV (2007)
15. Metzler, D., Manmatha, R.: An inference network approach to image retrieval. In: Enser,
P.G.B., Kompatsiaris, Y., O’Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds.) CIVR
2004. LNCS, vol. 3115, pp. 42–50. Springer, Heidelberg (2004)
16. Mori, Y., Takahashi, H., Oka, R.: Image-to-word transformation based on dividing and vector
quantizing images with words. In: Proc. of the First International Workshop on Multimedia
Intelligent Storage and Retrieval Management (1999)
17. Oliva, A., Torralba, A.B.: Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision 42(3), 145–175 (2001)
18. Quattoni, A., Collins, M., Darrell, T.: Learning visual representations using images with
captions. In: CVPR (2007)
19. Quelhas, P., Odobez, J.-M.: Natural scene image modeling using color and texture visterms.
Technical report, IDIAP (2006)
20. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context.
In: ICCV (2007)
21. Rennie, J.D.M., Srebro, N.: Fast maximum margin matrix factorization for collaborative pre-
diction. In: ICML, pp. 713–719 (2005)
22. van Gemert, J.C., Geusebroek, J.-M., Veenman, C.J., Snoek, C.G.M., Smeulders, A.W.M.:
Robust scene categorization by learning image statistics in context. In: CVPRW Workshop
(2006)
23. Viitaniemi, V., Laaksonen, J.: Evaluating the performance in automatic image annotation:
Example case by adaptive fusion of global image features. Image Commun. 22(6), 557–568
(2007)
24. Vogel, J., Schiele, B.: Natural scene retrieval based on a semantic modeling step. In: CIVR,
pp. 207–215 (2004)
25. Yavlinsky, A., Schofield, E., Rüger, S.: Automated image annotation using global features and
robust nonparametric density estimation. In: Leow, W.-K., Lew, M., Chua, T.-S., Ma, W.-
Y., Chaisorn, L., Bakker, E.M. (eds.) CIVR 2005. LNCS, vol. 3568, pp. 507–517. Springer,
Heidelberg (2005)
Simultaneous Detection and Registration for Ileo-Cecal
Valve Detection in 3D CT Colonography
1 Introduction
1 The Ileo-Cecal Valve (ICV) is a small, deformable anatomic structure connecting the small and
large intestine in the human body. In addition to its significant clinical value, automated detection
of the ICV is of great practical value for automatic colon segmentation and automatic detection of
colonic cancer in CT colonography (CTC) [11,17,5].
detection in section 3 and its evaluation in section 4. We conclude the paper with dis-
cussion in section 5.
For noisy 3D medical data volumes, the scanning or navigation processes of finding
objects of interest can be very ambiguous and time-consuming for human experts. When
the searched target is partially or fully coated by other types of noisy voxels (such as
colonic objects embedded within stool, or tagging materials in CT), 3D anatomic structure
detection by human experts becomes extremely difficult and sometimes impossible.
These characteristics make it necessary to solve this type of problem using a computer-aided
detection and diagnosis (CAD) system for clinical purposes. This is the main
motivation for our paper.
The diagram of our proposed incremental parameter learning (IPL) framework is
shown in figure 1, by taking a full 3D object detection problem as an illustrative exam-
ple. We define the detection task as finding a 3D bounding box including the object in
3D data volume as closely as possible. The object’s (or the box’s) spatial configuration
space Ω can be uniquely determined by its 3D (center) position (ΩT ), 3D size (ΩS ) and
where v_1^i is one of the eight vertices of box_1 and v_2^i is its corresponding vertex of box_2;
||v_1^i − v_2^i|| is the Euclidean distance between the two 3D vectors v_1^i and v_2^i. Again, the PBT algorithm
and steerable features are used for training to obtain P_S.
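The distance itself is presumably an average over the eight vertex-to-vertex offsets, along the lines of the form below (whether it is a mean or a sum is an assumption here):

d(box_1, box_2) = \frac{1}{8} \sum_{i=1}^{8} \left\| v_1^i - v_2^i \right\|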
In the third step, P_S is employed to evaluate the positive-class probabilities for the M × n
samples {(T_i, S_j, R*)}, i = 1, 2, ..., M; j = 1, 2, ..., n, and keep a subset of M candidates
with the highest outputs. We denote them {(T_i, S_i, R*)}, i = 1, 2, ..., M, which are
further expanded in Ω_R as {(T_i, S_i, R_j)}, i = 1, 2, ..., M; j = 1, 2, ..., n. After this,
the whole process of training dataset construction and classifier training for P_R is the same
as in step 2. The box-to-box distance is employed, and the two distance thresholds are denoted
η_1 and η_2. Finally we have {(T_k, S_k, R_k)}, k = 1, 2, ..., M returned by our whole
algorithm as the object detection result of multiple hypotheses. In testing, there are
3.1 Features
In the domain of 3D object detection, 3D Haar wavelet features [13] are designed to
capture region-based contrasts, which are effective for classification. However, 3D Haar
features are inefficient for object orientation estimation because they require a very
time-consuming process of rotating 3D volumes for integral volume computation. With
steerable features [19], only a sampling grid pattern needs to be translated, rotated and
re-scaled instead of the data volumes. This allows fast 3D data evaluation and has been shown to
be effective for object detection tasks [19]. The pattern is composed of a number of sampling
grids/points where 71 local intensity, gradient and curvature based features are computed
at each grid. The whole sampling pattern models semi-local context. For details,
refer to [19].
Fig. 2. System diagram of Ileo-Cecal Valve detection. The upper block is prior learning and the
lower block is incremental parameter learning for ICV spatial parameter estimation. Examples of
the annotated ICV bounding boxes are shown in red.
Fig. 3. Steerable sampling grid patterns for (a) 3D point detector and (b) 3D box detector
In this paper, we design two specific steerable patterns for our ICV detection task as
shown in figure 3. In (a), we design an axis-based pattern for detecting the ICV's orifice.
Assume that the sampling pattern is placed with its center grid at a certain voxel v. It
contains three sampling axes given by the gradient directions averaged in v's neighborhoods
under three scales, respectively. Along each axis, nine grids are evenly sampled. This
process is repeated for half- and quarter-downsampled CT volumes as well. Altogether
we have M = 81 = 3 × 9 × 3 grid nodes, which yields 71 × 81 = 5751 features.
In (b), we fit each box-based pattern with 7 × 7 × 5 evenly spaced sampling grids. The total
feature number is 52185 by integrating features from three different scales. This type of
feature is used for all of the Ω_T, Ω_S, Ω_R detection steps. The detector trained with the axis pattern and
PBT is named the 3D point detector, while the detector with the box pattern and PBT is denoted
the 3D box detector.
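As a quick arithmetic check of the feature counts quoted above:

3 \times 9 \times 3 = 81 \text{ grids}, \qquad 81 \times 71 = 5751 \text{ features};

7 \times 7 \times 5 = 245 \text{ grids}, \qquad 245 \times 71 \times 3 \text{ scales} = 52185 \text{ features}.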
Fig. 4. (a) ICV orifice sampling pattern of three sampling axes and nine sampling grids along
each axis; (b) detected ICV voxel/orifice candidates shown in white
informative, but far from fully unique, surface profile that can possibly indicate the ICV
location as multiple hypotheses. It also allows very efficient detection using a 3D point
detector, which involves less feature computation (5751 vs. 52185 for training) than a
box detector. Furthermore, it is known that the ICV orifice only lies on the colon surface,
which is computed using a 3D version of Canny edge detection. Thus we can prune all
voxel locations inside the tissue or in the air for even faster scanning. An illustrative
example of the orifice sampling pattern and detection result is shown in figure 4. Note
that multiple clusters of detections often occur in practice. From the annotated ICV
orifice positions in our training CT volume set, we generate the positive training samples
from surface voxels within α1 voxel distance and negatives outside α2 voxel distance.
We set α2 > α1, so the discriminative boosting training [12] will not focus on samples
with distances in [α1, α2], which are ambiguous for classifier training but not important for
target finding. The trained classifier P_O is used to exhaustively scan all surface voxels and
prune the scanned ICV orifice candidates so that only a few hypotheses (e.g., N = 100)
are preserved. In summary, the 3D point detector for ICV orifice detection is efficient and
suitable for exhaustive search as the first step.
Given any detected orifice hypothesis, we place ICV bounding boxes centered at
its location and with the mean size estimated from the annotations. In the local 3D coordinates
of an ICV box, the XY plane is assumed to be orthogonal to the gradient vector
at the orifice, which is taken as the Z-axis. This is important domain knowledge that we can use to
initially prune the ICV's orientation space Ω_R in 2 degrees of freedom (DOF). Boxes are
then rotated around the Z-axis at 10° intervals to generate training samples. Based on
their box-to-box distances against the ground truth ICV box3 and the β1, β2 thresholds as
above, our routine process is: (1) generating positive/negative training sets by distance
thresholding; (2) training a PBT classifier P_R using the box-level steerable features;
(3) evaluating the training examples using the trained classifier, and keeping the top 100
hypotheses ranked by probability (ρ_R^i, i = 1, 2, ..., 100). In our experiments, we show results
with α1 = 4, α2 = 20 (normally outside the ICV scope), β1 = 6 and β2 = 30.
³ The ground truth annotations are normalized with the mean size to count only the translational and orientational distances.
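To make the routine above concrete, the following Python sketch (our own illustration, not the authors' code) runs one stage of the incremental learning loop: train_classifier stands in for the PBT training of [12], hypotheses and features are plain NumPy arrays, and the distance thresholds default to the β values quoted above.

import numpy as np

def incremental_stage(hypotheses, features, gt_distance, train_classifier,
                      d_pos=6.0, d_neg=30.0, n_keep=100):
    # (1) Positive/negative training sets by distance thresholding; candidates in
    #     the ambiguous band [d_pos, d_neg] are excluded from training.
    pos = gt_distance < d_pos
    neg = gt_distance > d_neg
    keep = pos | neg
    # (2) Train a discriminative classifier (a stand-in for the PBT classifier).
    clf = train_classifier(features[keep], pos[keep].astype(int))
    # (3) Evaluate all candidates and keep the top n_keep hypotheses by probability.
    prob = clf.predict_proba(features)[:, 1]
    order = np.argsort(prob)[::-1][:n_keep]
    return hypotheses[order], prob[order], clf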
during five stages of training. The training scale for our PBT classifier ranges over
10K ∼ 250K positives and 2M ∼ 20M negatives. The ROC curves are shown in
figure 5(a). From the evidence of these plots, our training process is generally well
behaved and gradually improves in later steps. We then discuss the error distribution
curves between the top 100 ICV hypotheses maintained for all five stages of detection
and the ground truth, using five-fold cross-validation. The error curves, as shown in
figure 5(b), also demonstrate that more accurate ICV spatial configurations are obtained
as the detection process proceeds through stages. This convergence is bounded by
the good training performance of the ROC curves, with positive-class distance boundaries
that move gradually closer to the global optimum (or ground truth) as 6, 5, 4, 4, and
Fig. 5. (a) Receiver operating characteristic curves of different stages of training in our Ileo-Cecal
Valve detection system. (b) Error ratio curves of the top 100 ICV hypotheses at different stages of
detection. Each curve shows the ratio of hypotheses (Y-axis) under the particular error reading
(X-axis) against ground truth. All numbers are averaged over the testing sets of volumes, under
five-fold cross-validation of 116 total labeled ICV examples. (c) Overlap ratios between 114
detected ICV examples and their ground truth. (d) A typical example of 3D ICV detection in CT
Colonography, with an overlap ratio of 79.8%. Its box-to-box distance as defined in equation 8 is
3.43 voxels, where the annotation box size is 29.0 × 18.0 × 12.0 voxels. Its orientational errors
are 7.68°, 7.77°, 2.52° with respect to the three axes. The red box is the annotation; the green box is
the detection. This figure is better viewed in color.
decreasing distance margins between the positive and negative classes (e.g., β2 − β1 = 24;
θ2 − θ1 = 20; τ2 − τ1 = 16 and η2 − η1 = 11) over stages.
ICV Detection Evaluation: Our training set includes 116 ICV-annotated volumes from
the dataset of clean colon CT volumes acquired on both Siemens and GE scanners. With a
fixed threshold ρR > 0.5 for the final detection, 114 ICVs are found, a detection
rate of 98.3% under five-fold cross-validation. After manual examination, we find that
the two missed ICVs have very abnormal shapes compared with the general training pool and are
probably heavily diseased. The ICV detection accuracy is first measured by a symmetric
overlap ratio between a detected box Box_d and its annotated ground truth Box_a:
γ(Box_a, Box_d) = 2 · Vol(Box_a ∩ Box_d) / (Vol(Box_a) + Vol(Box_d))    (10)
where Vol() is the box-volume function (e.g., the number of voxels inside a box). The accuracy
distribution over the 114 detected ICV examples is shown in figure 5(c). The mean overlap
ratio γ(Box_a, Box_d) is 74.9%. This error measurement is directly relevant to our
end goal of removing polyp-like false findings in our CAD system. Additionally, the
mean and standard deviation of the orientational detection errors are 5.89°, 6.87°, 6.25°
and 4.46°, 5.01°, 4.91° respectively for the three axes. The distribution of absolute box-to-box
distances (i.e., equation 8) has a mean of 4.31 voxels and a standard deviation of 4.93 voxels.
The two missed cases were further verified by clinicians as heavily
diseased ICVs, which are rare in nature. Our trained classifiers treat them as outliers.
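As an illustration of equation (10), a minimal Python sketch for axis-aligned boxes given as (min-corner, max-corner) pairs follows; the paper's ICV boxes are oriented, where Vol(Box_a ∩ Box_d) would instead be obtained by counting the voxels shared by the two boxes.

import numpy as np

def overlap_ratio(box_a, box_d):
    # Symmetric overlap ratio: 2 * Vol(A ∩ D) / (Vol(A) + Vol(D)), cf. equation (10).
    a_min, a_max = (np.asarray(v, dtype=float) for v in box_a)
    d_min, d_max = (np.asarray(v, dtype=float) for v in box_d)
    inter = np.clip(np.minimum(a_max, d_max) - np.maximum(a_min, d_min), 0.0, None)
    vol_a = np.prod(a_max - a_min)
    vol_d = np.prod(d_max - d_min)
    return 2.0 * np.prod(inter) / (vol_a + vol_d)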
Next we applied our detection system to other previously unseen clean and tagged
CT datasets. For clean data, 138 detections are found in 142 volumes. After manual
validation, 134 detections are true ICVs and 4 cases are non-ICVs. This results in a detection
rate of 94.4%. We also detected 293 ICVs from 368 (both solid and liquid) tagged
colon CT volumes, where 236 detections are real ICVs, 22 cases are non-ICVs and
35 cases are unclear (very difficult even for an expert to decide). Tagged CT
data are generally much more challenging than clean cases, due to low-contrast imaging
and the very high noise level of the tagging materials. Some positive ICV detections are illustrated
in figure 6. The processing time varies from 4 ∼ 10 seconds per volume on a P4
3.2 GHz machine with 2 GB memory.
Without prior learning for ICV detection, our system achieves detection performance
comparable to that with prior learning. However, it requires about 3.2 times more
computation time, since a 3D box detector is applied exhaustively to the translational search
instead of the cheaper 3D point detector used in prior learning. Note that prior learning is performed
in exactly the same probabilistic manner as the incremental 3D translation, scale and orientation
parameter estimation. It is not a simple, deterministic task, and multiple
(e.g., 100) detection hypotheses must be kept for desirable results.
Polyp False Positive (FP) Reduction: The ICV contains many polyp-like local structures
which confuse colon CAD systems [11,17,5]. By identifying a reasonably accurate
bounding box for the ICV, this type of ambiguous false-positive polyp candidate can be
removed. For this purpose, we enhanced the ICV orifice detection stage by adding the
labeled polyp surface voxels to its negative training dataset. The subsequent stages are
then retrained in the same way. Polyp FP reduction is tested on 802 unseen CT
volumes: 407 clean volumes from 10 different hospital sites acquired on Siemens and
GE scanners; 395 tagged volumes, including iodine and barium preparations, from 2
sites acquired on Siemens and GE scanners. The ICV detection is implemented as a post-filter
for our existing colon CAD system and is only applied to those candidates that are
labeled as “Polyp” in the preceding classification phases⁴. In clean cases, ICV detection
reduced the number of false positives (fp) from 3.92 fp/patient (2.04 fp/vol.) to
3.72 fp/patient (1.92 fp/vol.) without impacting the overall sensitivity of the CAD system,
i.e., no true polyps were missed due to the integration of our ICV detection component.
In tagged cases, ICV detection reduced the number of false marks from 6.2
fp/patient (3.15 fp/vol.) to 5.78 fp/patient (2.94 fp/vol.). One polyp out of 121 polyps
with sizes ranging from 6 up to 25 mm was wrongly labeled as ICV, resulting in a sensitivity
drop of 0.8%. An alternative implementation that uses ICV detection as a soft
constraint, instead of a hard-decision post-filter, avoids missing true polyps without
sacrificing FP reduction. In summary, our ICV system achieved 5.8% and 6.7% false-positive
reduction rates for clean and tagged data respectively, which has significant
clinical importance.
Contextual K-Box ICV Model: To more precisely identify the 3D ICV region, in addition
to detection, a contextual K-box model is investigated. The idea is to use the final
ICV detection box B1 as an anchor to explore reliable expansions. For all other high-probability
hypotheses {B̂i} returned in the last step of detection, we sort them according
to Vol(B̂i − B1 ∩ B̂i) while two constraints are satisfied: γ(B1, B̂i) ≥ γ1 and
ρR(B̂i) ≥ ρ1. Then the box that gives the largest gain of Vol(B̂i − B1 ∩ B̂i) is selected
⁴ Note that the use of ICV detection as a post-process is dedicated to handling “difficult” polyp cases which cannot be correctly classified in the preceding processes.
as the second box B2. The two constraints guarantee that B2 is spatially correlated with
B1 (γ1 = 0.5) and is a highly likely ICV detection hypothesis by itself (ρ1 = 0.8). By
taking B1 and B2 as a union, Box_d = B1 ∪ B2, it is straightforward to extend the
model to a K-box ICV model with K > 2. Our initial experimental results show that
the 2-box model improves the mean overlap ratio γ(Box_a, Box_d) from 74.9% to 88.2%
and, surprisingly, removes 30.2% more polyp FPs without losing true polyps.
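A hedged sketch of the B2 selection just described: overlap_ratio corresponds to γ of equation (10) and vol_gain(b, b1) is an assumed helper returning Vol(b) − Vol(b ∩ B1); the names and interface are ours.

def select_second_box(b1, candidates, probs, overlap_ratio, vol_gain,
                      gamma1=0.5, rho1=0.8):
    # Among the remaining high-probability hypotheses, keep only those that are
    # spatially correlated with B1 and confident by themselves, then pick the one
    # adding the largest volume; its union with B1 forms the 2-box ICV model.
    best_box, best_gain = None, -1.0
    for b, rho in zip(candidates, probs):
        if overlap_ratio(b1, b) >= gamma1 and rho >= rho1:
            gain = vol_gain(b, b1)
            if gain > best_gain:
                best_box, best_gain = b, gain
    return best_box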
Previous Work on ICV Detection: Our proposed approach is the first reported fully
automatic Ileo-Cecal Valve detection system in 3D CT colonography, given the difficulties
discussed in sections 1 and 3. The closest previous work is by Summers et al.
[11], which is also considered the state-of-the-art technique in the medical imaging community.
We discuss and compare [11] and our work in two respects. (1) For localization of the ICV,
Summers et al. rely on a radiologist to interactively identify the ICV by clicking on
a voxel inside (approximately in the center of) the ICV. This is a requisite step for the
subsequent classification process and takes minutes for an expert to finish. In contrast, our
automatic system takes 4 ∼ 10 seconds for the whole detection procedure. (2) For classification,
[11] primarily relies on heuristic rules discovered from dozens of cases
by clinicians. It depends on the performance of a volume segmentor [16], which fails
on 16% ∼ 38% of ICV cases [11]. Their overall sensitivity of ICV detection is 49% and
50% on the testing (70 ICVs) and training datasets (34 ICVs) [11], respectively.
This rule-based classification method largely restricts its applicability and effectiveness
in recognizing the variety of ICV samples, as reflected by the low detection rates reported in [11].
Our detection rate is 98.3% for training data and 94.4% for unseen data. The superiority
of our approach is attributable to our effective and efficient incremental parameter learning
framework, which optimizes the object spatial configuration in a full 3D parameter space, and to the
discriminative feature selection algorithm (PBT + steerable features), which explores hundreds
of thousands of volume features.
References
1. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as Space-Time Shapes.
In: ICCV (2005)
2. Geman, D., Jedynak, B.: An Active Testing Model for Tracking Roads in Satellite Images.
IEEE Trans. Pattern Anal. Mach. Intell. 18(1), 1–14 (1996)
3. Han, F., Tu, Z., Zhu, S.C.: Range Image Segmentation by an Effective Jump-Diffusion
Method. IEEE Trans. PAMI 26(9) (2004)
4. Huang, C., Ai, H., Li, Y., Lao, S.: High-performance rotation invariant multiview face detec-
tion. IEEE Trans. PAMI 29(4), 671–686 (2007)
5. Jerebko, A., Lakare, S., Cathier, P., Periaswamy, S., Bogoni, L.: Symmetric Curvature Pat-
terns for Colonic Polyp Detection. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI
2006. LNCS, vol. 4191, pp. 169–176. Springer, Heidelberg (2006)
6. Jones, M., Viola, P.: Fast multi-view face detection. In: CVPR (2003)
7. Ke, Y., Sukthankar, R., Hebert, M.: Efficient Visual Event Detection using Volumetric Fea-
tures. In: ICCV (2005)
8. Lu, L., Hager, G.: Dynamic Background/Foreground Segmentation From Images and Videos
using Random Patches. In: NIPS (2006)
9. Rowley, H., Baluja, S., Kanade, T.: Neural Network-Based Face Detection. In: CVPR (1996)
10. Rowley, H., Baluja, S., Kanade, T.: Rotation Invariant Neural Network-Based Face Detec-
tion. In: CVPR (1998)
11. Summers, R., Yao, J., Johnson, C.: CT Colonography with Computer-Aided Detection:
Automated Recognition of Ileocecal Valve to Reduce Number of False-Positive Detections.
Radiology 233, 266–272 (2004)
12. Tu, Z.: Probabilistic boosting-tree: Learning discriminative methods for classification, recog-
nition, and clustering. In: ICCV (2005)
13. Tu, Z., Zhou, X.S., Barbu, A., Bogoni, L., Comaniciu, D.: Probabilistic 3D polyp detection
in CT images: The role of sample alignment. In: CVPR (2006)
14. Wu, B., Nevatia, R.: Cluster Boosted Tree Classifier for Multi-View, Multi-Pose Object De-
tection. In: ICCV (2007)
15. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
CVPR, pp. 511–518 (2001)
16. Yao, J., Miller, M., Franaszek, M., Summers, R.: Colonic polyp segmentation in CT
Colongraphy-based on fuzzy clustering and deformable models. IEEE Trans. on Medical
Imaging (2004)
17. Yoshida, H., Dachman, A.H.: CAD techniques, challenges, and controversies in computed
tomographic colonography. Abdominal Imaging 30(1), 26–41 (2005)
18. Yuille, A.L., Coughlan, J.M.: Twenty Questions, Focus of Attention, and A*: A Theoretical
Comparison of Optimization Strategies. In: Pelillo, M., Hancock, E.R. (eds.) EMMCVPR
1997. LNCS, vol. 1223, pp. 197–212. Springer, Heidelberg (1997)
19. Zheng, Y., Barbu, A., Georgescu, B., Scheuering, M., Comaniciu, D.: Fast Automatic Heart
Chamber Segmentation from 3D CT Data Using Marginal space Learning and Steerable
Features. In: ICCV (2007)
Constructing Category Hierarchies
for Visual Recognition
1 Introduction
Visual object classification is one of the basic computer vision problems. In spite
of significant research progress, the problem is still far from being solved and a
considerable effort is still being put into this research area [1].
In recent years, one could witness remarkable progress in the development of
robust image representations and also observe successful applications of sophis-
ticated machine learning techniques in computer vision. Developments in image
representation include research on interest point detectors [2,3], SIFT features [4]
and bag-of-features [5]. Support Vector Machines (SVMs) [6] were successfully
applied to vision with the design of specialized kernels [7,8]. Combining these
techniques allowed researchers to construct successful visual object recognition
systems [1]. We build on those works to construct our baseline.
Still, the typical problems tackled today by state-of-the-art visual
object class recognition systems involve only a few object categories. Very
recently, datasets that include more than a hundred categories, like the most
recent Caltech datasets [9,10], have been introduced. Furthermore, there is an
obvious need to further increase this number. In this paper we examine the
problem of classifying a large number of categories and use the Caltech-256 [10]
dataset for evaluation. Figure 1 shows a few sample images.
Fig. 1. Sample Caltech-256 images for the most difficult (left), the most confused
(middle) and the easiest (right) classes are shown. In parentheses the per-class accuracy
of our method is given.
2 Existing Approaches
Thus, a Support Vector Machine can be trained for each node of the tree. If
the tree is balanced, only ⌈log₂ N⌉ SVM runs are necessary to perform the N-class
classification. In the worst case (degenerate trees) the complexity is linear
in the number of classes. Therefore, in general, hierarchy-based classification
approaches scale well with the number of classes.
Note that assoc(A, B) is often denoted in the literature as cut(A, B). As the
distance measures used in spectral clustering are often positive definite, the
adjacency matrix EV is often denoted as K. The common choice is the RBF
kernel, which can be generalized to an extended Gaussian kernel
w^T D^{-1/2} K D^{-1/2} w
2.3 Discussion
Most existing class hierarchy construction methods assume that at each level of
the hierarchy the feature-space can be partitioned into disjoint subspaces. We
predict an inevitable conflict between generalization and precision requirements.
Especially for the earliest decisions, where the boundary is supposed to split
very distinct categories of objects (natural vs. man-made objects, for example),
a requirement is enforced to precisely trace the boundaries between the tens or
hundreds of similar classes that fall at the explored decision boundary (a bear
vs. a teddy bear and a fountain vs. a waterfall, for example). Note that a mistake
at a boundary of such a high-level decision is as costly as a mistake at lower
levels, where the classifier can tune to minor class differences without degrading
its generalization properties.
Given a few distinct visual object categories, class separability can be good.
But this certainly cannot hold for hundreds or thousands of classes. Let us
motivate our hypothesis with some simplified examples before evaluating it experimentally.
Figure 2 presents some simplistic efforts to separate 2-dimensional multi-class
data with a linear boundary. A carefully crafted example (Fig. 2a) shows that
even if any two of the three classes can be easily separated with a hyperplane, this does
not assure good separation of all three classes. If there are a few classes which are
well separated (Fig. 2b), a good recursive partitioning can be found. With the
growing number of classes, however, it will be increasingly difficult to find a
disjoint class-set partitioning (Fig. 2c).
As we show in Sect. 4, early enforcement of hard decisions can be costly in
the hierarchic setup and can significantly lower the classification performance.
Thus, we propose a novel approach for constructing top-down hierarchies, which
postpones final decisions in the presence of uncertainty.
Fig. 2. Simple examples of separating 2-dimensional multi-class data with a linear decision
boundary. Difficulties in separating classes (left) might not arise for a few well-separated
classes (middle), but can emerge when the number of classes increases (right).
3 Our Approach
Our approach is based on the observation that finding a feature-space partition-
ing that reflects the class-set partitioning becomes more and more difficult with
a growing number of classes. Thus, we propose to avoid disjoint partitioning and
split the class-set into overlapping sets instead. This allows us to postpone uncertain
classification decisions until the number of classes is reduced and learning
good decision boundaries becomes tractable.
The proposed solution is to discover classes that lie on the partition boundary
and could introduce classification errors. Those classes should not be forced
into either of the partitions, but they should be included in both. With our
approach, a number of classes can still be separated with one decision. This
assures a computational gain compared to setups with linear complexity like
OAR. However, since disjoint partitioning is not enforced, the performance is
not degraded. As the resulting partitioning is relaxed, we call our hierarchy
Relaxed Hierarchy (RH).
Figure 3 demonstrates how our method applies to the problem sketched in
Subsect. 2.3. The boundary from Fig. 2a which separates members of a class
can be used if both subpartitions (Fig. 3a) contain this class. Moreover, the
subsequent splits are straightforward. Note that the resulting hierarchy (Fig. 3b)
is no longer a tree, but a rooted directed acyclic graph (DAG).
Our method can be applied to most top-down partitioning approaches. This
includes methods based on k-means clustering and normalized cuts. Here we
build on normalized cuts. Note that the kernel matrix constructed for SVMs
can be reused. Furthermore, only one eigenvector corresponding to the second
largest eigenvalue needs to be computed, so optimized algorithms can be used.
By partitioning the set of training samples S instead of the set of classes
C = {[s] : s ∈ S},¹ a separating boundary between the samples can be found. A
disjoint bi-partitioning of samples S = A ∪ B leads to a disjoint tri-partitioning
¹ [s] denotes the class assigned to sample s ∈ S.
L = A ∪ X = {C : ∃ s ∈ A, [s] = C}
R = B ∪ X = {C : ∃ s ∈ B, [s] = C}    (6)
In practice, we can also slightly relax the requirement for A (B) to have all
samples in A (B). Given a partitioning p : S → {−1, 1} of the training set S, we
define a function q : C → [−1, 1] on the set of classes C:
q(C) = (1/|C|) · Σ_{s∈C} p(s)    (7)
where C ∈ C is a class.
This allows us to define a split:
L = q^{-1}([−1, 1−α))
R = q^{-1}((−1+α, 1])    (8)
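A minimal NumPy sketch of equations (7) and (8), assuming the sample partitioning p and the class labels are given as arrays; the function name and interface are ours.

import numpy as np

def relaxed_split(p, labels, alpha):
    # p: array of -1/+1 sample assignments; labels: class label of each sample.
    classes = np.unique(labels)
    q = {c: p[labels == c].mean() for c in classes}      # equation (7)
    L = {c for c in classes if q[c] < 1.0 - alpha}        # equation (8)
    R = {c for c in classes if q[c] > -1.0 + alpha}
    return L, R                                           # overlap X = L ∩ R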
Fig. 4. Illustration of the split procedure. Note how the value of α influences the
overlap.
i.e., until |Cn | = 1 or Ln = Rn . In the second case we use OAR on the sub-
set of classes that is too complex to split.
To train the hierarchy, for each node of the computed rooted DAG we train
an SVM using samples belonging to classes in Rn \ Ln as the positive set and those
belonging to classes in Ln \ Rn as the negative set. Note that samples belonging to classes in
Xn = Ln ∩ Rn are not used for training. This does not matter, since classification
of a sample that belongs to a class in Xn is not relevant at this stage. This is the
key point of our method, since the decision for these classes could be erroneous
and is postponed until later.
For testing, the hierarchy is descended until a leaf is reached. The decision is either
directly the class label (leaves containing only one class), or OAR classification is
performed on the remaining classes (complex leaves with more than one class).
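The per-node training rule can be sketched as follows; a scikit-learn SVC with an RBF kernel stands in for the χ²-based extended Gaussian kernel SVM actually used, and L, R are Python sets of class labels. The helper name is illustrative.

import numpy as np
from sklearn.svm import SVC

def train_node_svm(features, labels, L, R):
    # Positives: classes in R \ L; negatives: classes in L \ R.
    # Samples of the overlap X = L ∩ R are not used for training at this node.
    pos_mask = np.isin(labels, list(R - L))
    neg_mask = np.isin(labels, list(L - R))
    X = np.concatenate([features[pos_mask], features[neg_mask]])
    y = np.concatenate([np.ones(pos_mask.sum()), -np.ones(neg_mask.sum())])
    return SVC(kernel='rbf').fit(X, y)

At test time one descends the DAG, at each node following the branch given by the sign of the node SVM, until a leaf is reached.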
4 Experiments
m(H_i, H_j) = (1/2) · Σ_{n=1}^{V} (h_{in} − h_{jn})² / (h_{in} + h_{jn})    (9)
where V is the vocabulary size. We use k-means to construct the vocabulary and
V = 8000 in our experiments.
To use this distance measure in Support Vector Machines, we use the extended
Gaussian kernel, cf. (3). This results in a Mercer kernel [23]. The parameter γ is
set to the mean value of the distances between all training samples.
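A small sketch of the χ² distance of equation (9) and the resulting extended Gaussian kernel with γ set to the mean pairwise distance; it assumes bag-of-features histograms stacked as the rows of H, and the small constant in the denominator (to avoid division by zero) is our addition.

import numpy as np

def chi2_distance(H):
    # Pairwise distances m(H_i, H_j) of equation (9) between rows of H.
    num = (H[:, None, :] - H[None, :, :]) ** 2
    den = H[:, None, :] + H[None, :, :] + 1e-12
    return 0.5 * (num / den).sum(axis=-1)

def extended_gaussian_kernel(H):
    # K_ij = exp(-m(H_i, H_j) / gamma), with gamma the mean pairwise distance.
    D = chi2_distance(H)
    gamma = D[np.triu_indices_from(D, k=1)].mean()
    return np.exp(-D / gamma)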
Using the above image representation with Support Vector Machines in the
OAR setup corresponds to the method of Zhang et al. [8]. This method has
shown an excellent performance on varying object class datasets, including 2005
and 2006 Pascal VOC challenges [8,24]. Extended with additional channels
and a separate optimization framework to combine them, this approach won the
Pascal VOC classification challenge in 2007 [1].
4.2 Caltech-256
4.3 Results
Figure 5 shows a class hierarchy constructed by our method for the Caltech-
256 dataset, displayed for a subset of 10 categories. The categories were chosen
to include animals, natural phenomena and man-made objects. They include
class pairs with apparent visual similarities that are semantically close (bear
and dog, top hat and cowboy hat) as well as those that have a secondary or
no semantic relationship at all (bear and teddy bear, top hat and Saturn). The
Fig. 5. Class hierarchy constructed by our method for the Caltech-256 dataset, dis-
played for a subset of 10 categories
hierarchy reveals many intuitive relationships and groupings. At the top node
man-made objects and natural phenomena (hats, lightning, rainbow, Saturn)
are separated from animals (octopus, starfish, bear). Classes at the partition
boundary (dog and teddy bear) are included in both partitions. Subsequent
splits further separate sea animals from land animals (with a teddy bear) and
hat-like objects (including Saturn) from natural phenomena and mascot-like
objects. Even though it is based on visual data only, the constructed hierarchy
turns out to be similar to hierarchies extracted from semantic networks [25].
Unlike the purely semantic hierarchies, however, it also groups classes that are
related by semantic links difficult to model (bear and teddy bear) or that feature
accidental similarity (top hat and Saturn).
Table 1 shows the average per-class classification accuracy on the Caltech-
256 dataset. The upper half of the table compares our approach, i.e., a Relaxed
Hierarchy (RH), to the OAR setup. We can see that the proposed hierarchy
does not lead to accuracy loss. The image representation is the one described in
Subsection 4.1. The lower half of the table shows a result for a different image
representation, i.e., based on a reimplementation of the method of Lazebnik et
al. [7]. This representation obtains better results for the Caltech-256 dataset,
as most objects are centered in the image and relatively small. Again, we can
observe that the results obtained with our RH and an OAR approach (see results
obtained by Griffin et al. [10]) are comparable.
As expected, our approach does not depend on the image representation.
The best results on the Caltech-256 dataset in a similar setup (53% average accuracy
for 10 training images) were achieved by Varma [26] using a combination of
multiple channels. Our method could be combined with this multi-representation
approach. Note that it could even be applied to different data types, but this is
beyond the scope of this paper. In the following we use the image representation
described in Sect. 4.1 as it is fast to compute and does not impact the evaluation
of our class hierarchy construction.
Figure 6 compares the complexity as a function of the number of categories. The complexity
of the OAR setup is linear (red squares). The complexity of our Relaxed
Hierarchy method is confirmed to be sublinear. The exact gain depends on the
parameter α, see the datapoints along the right edge. Note that α is expressed
here as r, the number of relaxed training samples per class, i.e., α = r/15. For
250 categories and a setting of α = 3/15 = 0.2 (blue diamonds), which corresponds
to a minor performance loss, we observe a reduction of the computation time by
1/3. This ratio will further increase with the number of categories.
Figure 7 demonstrates the speed-for-accuracy trade-off (green circles) that
can be tuned with the α parameter. As shown in Sect. 3, as the parameter value
increases, the set of classes is more readily treated as separable. Greater
α values lead to a better computational gain, but can degrade the classification
accuracy. Note that the complexity is sublinear independently of the parameter
setting (see Fig. 6), but for a smaller number of classes one may choose to
accept a small loss in accuracy for a significant gain in computation time. For
instance, for Caltech-256 we find the setting of α = 0.2 (r = 3) reasonable, as
the absolute loss in accuracy is only about 2%, while the computational gain
[Figs. 6 and 7 (plots): left, the number of SVM runs per test image vs. the number of classes for OAR and our RH with r = 0, 1, 3, 5; right, relative accuracy vs. relative complexity for OAR, our RH (r), a standard top-down and a standard bottom-up hierarchy.]
5 Summary
We have shown that existing approaches for constructing class hierarchies for
visual recognition do not scale well with the number of categories. Methods that
perform disjoint class-set partitioning assume good class separability and thus
fail to achieve good performance on visual data when the number of categories
becomes large. Thus, we have proposed a method that detects classes at the par-
titioning boundary and postpones uncertain decisions until the number of classes
becomes smaller. Experimental validation shows that our method is sublinear in
the number of classes and its classification accuracy is comparable to the OAR
setup. Furthermore, our approach allows us to tune the speed-for-accuracy trade-off
and, therefore, to significantly reduce the computational costs.
Our method finds a reliable partitioning of the categories, but the hierarchy
may be far from optimal. Finding the optimal partitioning is a hard problem.
In future work we plan to use semantic information to drive the optimization.
References
1. Everingham, M., van Gool, L., Williams, C., Winn, J., Zisserman, A.: Overview and
results of classification challenge. In: The PASCAL VOC 2007 Challenge Workshop,
in conj. with ICCV (2007)
Constructing Category Hierarchies for Visual Recognition 491
2. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors.
IJCV (2004)
3. Lindeberg, T.: Feature detection with automatic scale selection. IJCV (1998)
4. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
5. Willamowski, J., Arregui, D., Csurka, G., Dance, C.R., Fan, L.: Categorizing nine
visual classes using local appearance descriptors. In: IWLAVS (2004)
6. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization and Beyond (2002)
7. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: CVPR (2006)
8. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for
classification of texture and object categories: A comprehensive study. IJCV (2007)
9. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. PAMI
(2007)
10. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical
report (2007)
11. Chen, Y., Crawford, M., Ghosh, J.: Integrating support vector machines in a hier-
archical output space decomposition framework. In: IGARSS (2004)
12. Liu, S., Yi, H., Chia, L.T., Deepu, R.: Adaptive hierarchical multi-class SVM clas-
sifier for texture-based image classification. In: ICME (2005)
13. Zhigang, L., Wenzhong, S., Qianqing, Q., Xiaowen, L., Donghui, X.: Hierarchical
support vector machines. In: IGARSS (2005)
14. Yuan, X., Lai, W., Mei, T., Hua, X., Wu, X., Li, S.: Automatic video genre cate-
gorization using hierarchical SVM. In: ICIP (2006)
15. Zweig, A., Weinshall, D.: Exploiting object hierarchy: Combining models from
different category levels. In: ICCV (2007)
16. He, X., Zemel, R.: Latent topic random fields: Learning using a taxonomy of labels.
In: CVPR (2008)
17. Griffin, G., Perona, P.: Learning and using taxonomies for fast visual category
recognition. In: CVPR (2008)
18. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR
(2006)
19. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with
large vocabularies and fast spatial matching. In: CVPR (2007)
20. Casasent, D., Wang, Y.C.: A hierarchical classifier using new support vector ma-
chines for automatic target recognition. Neural Networks (2005)
21. Shi, J., Malik, J.: Normalized cuts and image segmentation. PAMI (2000)
22. Rahimi, A., Recht, B.: Clustering with normalized cuts is clustering with a hyper-
plane. In: SLCV (2004)
23. Fowlkes, C., Belongie, S., Chung, F., Malik, J.: Spectral grouping using the
Nyström method. PAMI (2004)
24. Everingham, M., Zisserman, A., Williams, C., van Gool, L.: The PASCAL visual
object classes challenge 2006 (VOC 2006) results. Technical report (2006)
25. Marszalek, M., Schmid, C.: Semantic hierarchies for visual object recognition. In:
CVPR (2007)
26. Perona, P., Griffin, G., Spain, M.: The Caltech 256 Workshop. In: Conj. with ICCV
(2007)
Sample Sufficiency and PCA Dimension for Statistical
Shape Models
Lin Mei* , Michael Figl, Ara Darzi, Daniel Rueckert, and Philip Edwards∗
1 Introduction
Statistical shape modelling (SSM) is a technique for analysing variation of shape and
generating or inferring unseen shapes. A set of sample shapes is collected and PCA is
performed to determine the principal modes of shape variation. These modes can be
optimised to fit the model to a new individual, which is the familiar active shape model
(ASM) [1,2,3]. Further information, such as texture, can be included to create an active
appearance model [4] or morphable model [5].
* We would like to thank Tyco Healthcare for funding Lin Mei’s PhD studentship. We are also
grateful to many other members of the Department of Computing and the Department of Bio-
surgery and Surgical Technology at Imperial College.
Despite their popularity, PCA-based SSMs are normally trained from datasets for which
the issue of sufficiency is not considered. The PCA dimension for an SSM is often
chosen by rules that assume either a given percentage of the total variance or a level of noise. As will be shown
later in this paper, these two methods are highly dependent on sample size.
In this paper, we review the discussions on sample size sufficiency for a closely
related field, common factor analysis (CFA), and design a mathematical framework
to investigate the source of PCA model error. This framework provides a theoretical
evaluation of the conventional rules for retaining PCA modes, and enables analysis
of sample size sufficiency for PCA. We then propose a rule for retaining only stable
PCA modes that uses a t-test between the bootstrap stability of mode directions from
the training data and those from pure Gaussian noise. The convergence of the PCA
dimension can then be used as an indication of sample sufficiency.
We verify our framework by applying a 4-way ANOVA for reconstruction accuracy
to the models trained from synthetic datasets generated under different conditions. Our
PCA dimension rule and procedure for sample sufficiency determination are validated
on the synthetic datasets and demonstrated on real data.
2 Background
There is little literature on the issue of minimum sample size for PCA. In the related
field of CFA, however, this issue has been thoroughly discussed. CFA is commonly
used to test or discover common variation shared by different test datasets. Guidelines
for minimum sample size in CFA involve either a universal size regardless of the data di-
mension or a ratio to the data dimension. Recommendations for minimum size neglect-
ing the sample dimension and the number of expected factors vary from 100 to 500 [6].
Such rules are not supported by tests on real data. Doubts have been raised about a uni-
versal sample size guideline since it neglects the data dimension. Size-variable ratios
(SVR) may be more appropriate and values of between 2:1 to 20:1 have been sug-
gested [7]. There have been a number of tests using real data, but no correlation was
found between SVR and the mode stability [8], nor has any minimum value for SVR
emerged [9]. The minimum sample size needed in these real tests is not consistent ei-
ther, varying from 50 [8], to 78-100 [9], 144 [10], 400 [11] and 500 or more [12].
The inconsistency among these results shows that the minimum size depends on
some property of the data other than its dimension. MacCallum et al. [13,14] proposed
a mathematical framework for relating the minimum sample size for CFA with its
communality and overdetermination level. They then designed an experiment using
4-way ANOVA to study the effects of communality, overdetermination level, model
error and sample size on the accuracy in recovering the genuine factors from synthetic
data. The results showed that communality had the dominant effect on the accuracy
regardless of the model error. The effect of overdetermination level was almost negligible
when communality was high. In low-communality tests, accuracy improved with
larger sample size, and higher accuracy was found in tests with lower overdetermination
levels.
3 Theories
3.1 Sources of PCA Model Inaccuracy
We propose the following mathematical framework to examine the characteristics affecting
the sufficiency of a sample set drawn from a population with genuine modes of
variation, listed in the columns of A. Due to the presence of noise, we have X̂ instead of
X, and the PCA modes from X̂ are Â. The model inaccuracy can be expressed as the
difference between the covariance matrices Δ = X̂X̂^T − XX^T.
Let X = AW and X̂ = ÂŴ; we have:
X̂ = ÂŴ = AA^T ÂŴ + (I − AA^T)ÂŴ    (1)
Since A is orthonormal, (I − AA^T) is a diagonal matrix with only 1s and 0s. Hence
(I − AA^T) = NN^T. Equation 1 becomes:
X̂ = AA^T ÂŴ + NN^T ÂŴ = AW̃_A + NW̃_N    (2)
X̂X̂^T = AW̃_A W̃_A^T A^T + AW̃_A W̃_N^T N^T + NW̃_N W̃_A^T A^T + NW̃_N W̃_N^T N^T
      = AΣ̃_AA A^T + AΣ̃_AN N^T + NΣ̃_NA A^T + NΣ̃_NN N^T    (3)

Δ = X̂X̂^T − XX^T = A(Σ̃_AA − Σ_AA)A^T + AΣ̃_AN N^T + NΣ̃_NA A^T + NΣ̃_NN N^T    (4)
According to the framework in section 3.1, the sample size requirement for PCA only
depends on two factors: number of structural modes in the dataset, and the level of
noise. Hence we propose the following procedure for sample sufficiency determination.
For a sample set, X, of n samples:
Fig. 1. Comparison of the leading 8 eigenmodes from two mutually exclusive sets of 50 samples
from our 3D face mesh database, aligned according to eigenvalue rank. Darker texture implies
larger variation, showing many mismatches after the 4th mode.
d(𝒜_k, ℬ_k) = k − trace(AA^T BB^T)    (5)
where 𝒜_k (ℬ_k) denotes the principal subspace spanned (PSS) by the first k modes, collected as the columns of A (B).
For two sets of PCA modes, {a_i} and {b_i}, trained from different sample sets of a
common distribution, the following rule can be used to establish correspondence. The
first mode in {a_i} corresponds to the mode of a replicate that minimises d(𝒜_1, ℬ_1),
and we proceed iteratively. Assume we have already aligned 𝒜_k, the PSS of the
first k modes in {a_i}, to the space spanned by k modes in the replicate {b_i}. The
mode in {b_i} that corresponds to the (k+1)th mode in {a_i} will be the one that minimises
d(𝒜_{k+1}, ℬ_{k+1}).
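A hedged NumPy sketch of the subspace distance of equation (5) and of the iterative correspondence rule; it assumes the modes are orthonormal columns of matrices A and B, uses the identity trace(AA^T BB^T) = ||A^T B||_F^2 to avoid forming large matrices, and the greedy loop is our reading of the rule above.

import numpy as np

def subspace_distance(A, B):
    # d of equation (5) for the subspaces spanned by the k columns of A and of B.
    k = A.shape[1]
    return k - np.sum((A.T @ B) ** 2)      # equals k - trace(A A^T B B^T)

def match_modes(A, B):
    # matched[k] is the column of B corresponding to the (k+1)th mode of A.
    matched, unused = [], list(range(B.shape[1]))
    for k in range(A.shape[1]):
        best = min(unused, key=lambda j: subspace_distance(A[:, :k + 1],
                                                           B[:, matched + [j]]))
        matched.append(best)
        unused.remove(best)
    return matched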
4 Experiments
We demonstrate the correctness of our theories with three sets of experiments. First, a 4-way
ANOVA is performed on synthetic datasets to show how the PCA model accuracy
is affected by different features, as discussed in section 3.1. Then we show that our
stopping rule is able to identify the correct number of modes in the synthetic samples
where commonly used rules fail. This shows that our rule can be used to determine
PCA sample sufficiency by following the procedure presented in section 3.2. This is
applied to two different sets of real samples.
Fig. 2. Examples from the real 3D face database (a) and landmarks of the 2D AR face database (b)
Fig. 3. Examples of three synthetic faces generated with 70 modes, with the shape vector dimension
being 2100. Different noise levels are applied: 0.1 mm (a), 0.25 mm (b) and 0.5 mm (c). Noise
starts to become visible in (b) and (c).
from the landmarks (22 points) [24] of 2D AR face database [25]. Examples from these
two datasets are shown in figure 2.
in equation 5 is used to calculate the error of the models trained from the subsets. A
4-way ANOVA was performed to find out which characteristics influence the model
accuracy. As shown in table 1, the results confirm the correctness of the framework
introduced in section 3.1. Sample size and the number of genuine modes in the dataset act
as the major sources of influence on the model accuracy. Noise also has a significant but
small influence. The results also showed that the effect of sample dimension is negligible.
Fig. 4. Synthetic faces generated using 80 modes, with 1 mm Gaussian noise added to each element
of the shape vector with dimension 1500
Fig. 5. 95% thresholded compactness plots of synthetic 3D face datasets (a) with 100, 200, 400
and 600 samples and real 3D face datasets (b) with 30, 50, 100 and 150 samples. The number of
retained modes is clearly dependent on sample size.
Number of Samples 50 100 150 200 250 300 350 400 450 500
Number of Modes 32 60 95 108 120 140 169 186 204 219
Fig. 6. Instability of the PSS for synthetic datasets sized from 200 to 2000
for the real data as shown in figure 5(b), which strongly suggests that this rule is un-
reliable and should not be used. A similar effect, as shown in table 2, was found for
the stopping rule that discards the least principal modes until the average error of each
point reaches 1mm.
Fig. 8. Results of the real dataset sufficiency tests. Left: 2D faces; right: 3D faces.
The method of Besse et al. [21] was validated with synthetic datasets sized from
200 to 400. A plot of instability, measured as the distance between subspaces spanned
by different replicates, is shown in figure 6. Although this method provides a visible
indication of the correct number of modes to retain when the sample size is sufficiently
large, it cannot identify the lower number of modes that should be retained when the
sample size is insufficient.
Our method was validated with synthetic datasets sized from 100 to 2000. Figure 7
shows the number of modes to retain versus the sample size. Our stopping
rule does not tend to go beyond 80 modes for large sample sizes. It also
identifies a lower number of stable modes to retain for smaller sample sizes. It appears
that a sample size of around 500 is sufficient.
Figure 8 shows the results of the sample size sufficiency tests on the real datasets we
have. For the 2D dataset, the plot clearly converges at 24 modes with 50 samples.
With the 3D faces, the graph appears close to convergence at around 70 modes for the
150 samples. These results suggest both face datasets are sufficient.
References
1. Cootes, T., Hill, A., Taylor, C., Haslam, J.: The use of active shape models for locating
structures in medical images. In: Proc. IPMI, pp. 33–47 (1993)
2. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Active shape models and their training and
application. Comput. Vis. Image Underst. 61(1), 38–59 (1995)
3. Sukno, F.M., Ordas, S., Butakoff, C., Cruz, S., Frangi, A.F.: Active shape models with invariant optimal
features: Application to facial analysis. IEEE Trans. Pattern Anal. Mach. Intell. 29(7), 1105–
1117 (2007)
4. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. IEEE Transactions on Pattern
Analysis and Machine Intelligence 23(6), 681–685 (2001)
5. Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans-
actions On Pattern Analysis And Machine Intelligence 25, 1063–1074 (2003)
6. Osborne, J., Costello, A.: Sample size and subject to item ratio in principal components
analysis. Practical Assessment, Research and Evaluation 9(11) (2004)
7. Guadagnoli, E., Velicer, W.: Relation of sample size to the stability of component patterns.
Psychological Bulletin 103, 265–275 (1988)
8. Barrett, P., Kline, P.: The observation to variable ratio in factor analysis. Personality Study
and Group Behavior 1, 23–33 (1981)
9. Arrindell, W., van der Ende, J.: An empirical test of the utility of the observations-to-variables
ratio in factor and components analysis. Applied Psychological Measurement 9(2), 165–178
(1985)
Sample Sufficiency and PCA Dimension for Statistical Shape Models 503
10. Velicer, W., Peacock, A., Jackson, D.: A comparison of component and factor patterns: A
monte carlo approach. Multivariate Behavioral Research 17(3), 371–388 (1982)
11. Aleamoni, L.: Effects of size of sample on eigenvalues, observed communalities, and factor
loadings. Journal of Applied Psychology 58(2), 266–269 (1973)
12. Comfrey, A., Lee, H.: A First Course in Factor Analysis. Lawrence Erlbaum, Hillsdale (1992)
13. MacCallum, R., Widaman, K., Zhang, S., Hong, S.: Sample size in factor analysis. Psycho-
logical Methods 4, 84–99 (1999)
14. MacCallum, R., Widaman, K., Hong, K.P.S.: Sample size in factor analysis: The role of
model error. Multivariate Behavioral Research 36, 611–637 (2001)
15. Jackson, D.: Stopping rules in principal components analysis: a comparison of heuristical
and statistical approaches. Ecology 74, 2204–2214 (1993)
16. Jolliffe, I.: Principal Component Analysis, 2nd edn. Springer, Heidelberg (2002)
17. Sinha, A., Buchanan, B.: Assessing the stability of principal components using regression.
Psychometrika 60(3), 355–369 (2006)
18. Daudin, J., Duby, C., Trecourt, P.: Stability of principal component analysis studied by the
bootstrap method. Statistics 19, 341–358 (1988)
19. Besse, P.: PCA stability and choice of dimensionality. Statistics & Probability Letters 13, 405–410
(1992)
20. Babalola, K., Cootes, T., Patenaude, B., Rao, A., Jenkinson, M.: Comparing the similarity of
statistical shape models using the bhattacharya metric. In: Larsen, R., Nielsen, M., Sporring,
J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 142–150. Springer, Heidelberg (2006)
21. Besse, P., de Falguerolles, A.: Application of resampling methods to the choice of dimension
in PCA. In: Hardle, W., Simar, L. (eds.) Computer Intensive Methods in Statistics, pp. 167–
176. Physica-Verlag, Heidelberg (1993)
22. University of Notre Dame Computer Vision Research Laboratory: Biometrics database dis-
tribution (2007), http://www.nd.edu/∼cvrl/UNDBiometricsDatabase.html
23. Papatheodorou, T.: 3D Face Recognition Using Rigid and Non-Rigid Surface Registration.
PhD thesis, VIP Group, Department of Computing, Imperial College, London University
(2006)
24. Cootes, T.: The AR face database 22 point markup (N/A),
http://www.isbe.man.ac.uk/∼bim/data/tarfd markup/
tarfd markup.html
25. Martinez, A., Benavente, R.: The AR face database (2007),
http://cobweb.ecn.purdue.edu/∼aleix/aleix face DB.html
Locating Facial Features with an Extended
Active Shape Model
1 Introduction
Automatic and accurate location of facial features is difficult. The variety of
human faces, expressions, facial hair, glasses, poses, and lighting contribute to
the complexity of the problem.
This paper focuses on the specific application of locating features in unob-
structed frontal views of upright faces. We make some extensions to the Active
Shape Model (ASM) of Cootes et al. [4] and show that it can perform well in
this application.
Fig. 1. A face with correctly positioned landmarks. This image is from the BioID
set [15].
repeats the following two steps until convergence: (i) suggest a tentative shape by
adjusting the locations of the shape points by template matching of the image texture
around each point; (ii) conform the tentative shape to a global shape model.
The individual template matches are unreliable and the shape model pools the
results of the weak template matchers to form a stronger overall classifier. The
entire search is repeated at each level in an image pyramid, from coarse to fine
resolution.
It follows that two types of submodel make up the ASM: the profile model
and the shape model.
The profile models (one for each landmark at each pyramid level) are used
to locate the approximate position of each landmark by template matching.
Any template matcher can be used, but the classical ASM forms a fixed-length
normalized gradient vector (called the profile) by sampling the image along a
line (called the whisker ) orthogonal to the shape boundary at the landmark.
During training on manually landmarked faces, at each landmark we calculate
the mean profile vector ḡ and the profile covariance matrix Sg . During searching,
we displace the landmark along the whisker to the pixel whose profile g has lowest
Mahalanobis distance from the mean profile ḡ, i.e., the profile minimizing (g − ḡ)^T S_g^{-1} (g − ḡ).
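As an illustration, the profile match along the whisker can be sketched as follows; the interface (candidate profiles stacked as rows, a precomputed inverse covariance) is our choice rather than the authors'.

import numpy as np

def best_profile_offset(profiles, g_mean, S_inv):
    # Return the whisker offset whose profile g minimizes the Mahalanobis
    # distance (g - g_mean)^T S_inv (g - g_mean) to the trained mean profile.
    diffs = profiles - g_mean
    dists = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)
    return int(np.argmin(dists))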
We can generate various shapes with Equation 2 by varying the vector pa-
rameter b. By keeping the elements of b within limits (determined during model
building) we ensure that generated face shapes are lifelike.
Conversely, given a suggested shape x, we can calculate the parameter b that
allows Equation 2 to best approximate x with a model shape x̂. Cootes and
Taylor [8] describe an iterative algorithm that gives the b and T that minimizes
distance(x, T(x̄ + Φb)) (3)
where T is a similarity transform that maps the model space into the image
space.
3 Related Work
Active Shape Models belong to the class of models which, after a shape is situated
near an image feature, interact with the image to warp the shape to the feature.
They are deformable models like snakes [16], but unlike snakes they use an
explicit shape model to place global constraints on the generated shape. ASMs
were first presented by Cootes et al. [3]. Cootes and his colleagues followed with
a succession of papers culminating in the classical ASM described above [8] [4].
Many modifications to the classical ASM have been proposed. We mention
just a few. Cootes and Taylor [6] employ a shape model which is a mixture of multivariate
Gaussians, rather than assuming that the shapes come from the single
Gaussian distribution implicit in the shape model of the classical ASM. Romdhani
et al. [22] use Kernel Principal Components Analysis [23] and a Support Vector
Machine. Their software trains on 2D images, but models non-linear changes to
face shapes as they are rotated in 3D. Rogers and Graham [21] robustify ASMs
by applying robust least-squares techniques to minimize the residuals between
the model shape and the suggested shape. Van Ginneken et al. [12] take the tack
of replacing the 1D normalized first derivative profiles of the classical ASM with
local texture descriptors calculated from “locally orderless images” [17]. Their
method automatically selects the optimum set of descriptors. They also replace
the classical ASM profile model search (using Mahalanobis distances) with a k-
nearest-neighbors classifier. Zhou et al. [25] estimate shape and pose parameters
using Bayesian inference after projecting the shapes into a tangent space. Li and
Ito [24] build texture models with AdaBoosted histogram classifiers. The Active
Appearance Model [5] merges the shape and profile model of the ASM into a
single model of appearance, and itself has many descendants. Cootes et al. [7]
report that landmark localization accuracy is better on the whole for ASMs than
AAMs, although this may have changed with subsequent developments to the
AAM.
[Figure: point-to-point error relative to the 68-point model vs. the number of landmarks.]
The classical ASM uses a one-dimensional profile at each landmark, but using
two-dimensional “profiles” can give improved fits. Instead of sampling a one-
dimensional line of pixels along the whisker, we sample a square region around
the landmark. Intuitively, a 2D profile area captures more information around
the landmark and this information, if used wisely, should give better results.
During search we displace the sampling region in both the “x” and “y” direc-
tions, where x is orthogonal to the shape edge at the landmark and y is tangent to
the shape edge. We must rely on the face being approximately upright because 2D
profiles are aligned to the edges of the image. The profile covariance matrix Sg of a
set of 2D profiles is formed by treating each 2D profile matrix as a long vector (by
appending the rows end to end), and calculating the covariance of the vectors.
Any two-dimensional template matching scheme can be used, but the authors
found that good results were obtained using gradients over a 13×13 square
around the landmark, after prescaling faces to a constant width of 180 pixels.
The values 13 and 180 were determined during model building by measurements
on a validation set, as were all parameter values in this paper (Sec. 5).
Gradients were calculated with a 3×3 convolution mask ((0,0,0),(0,-2,1),(0,1,0))
and normalized by dividing by the Frobenius norm of the gradient matrix. The
effect of outliers was reduced by applying a mild sigmoid transform to the elements
x_i of the gradient matrix: x_i ← x_i / (|x_i| + constant).
Good results were obtained using 2D profiles for the nose and eyes and sur-
rounding landmarks, with 1D profiles elsewhere.
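A sketch of the 2D profile construction under the stated choices (the 3×3 mask, Frobenius normalization, then the mild sigmoid); the constant c is a placeholder value, since it is not given here.

import numpy as np

def profile_2d(patch, c=10.0):
    # Gradient via the mask ((0,0,0),(0,-2,1),(0,1,0)):
    # g[i, j] = -2*p[i, j] + p[i, j+1] + p[i+1, j]
    g = -2.0 * patch[:-1, :-1] + patch[:-1, 1:] + patch[1:, :-1]
    g = g / (np.linalg.norm(g) + 1e-12)    # divide by the Frobenius norm
    g = g / (np.abs(g) + c)                # mild sigmoid to damp outliers
    return g.ravel()                       # 2D profile treated as a long vector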
The XM2VTS set used for training (Sec. 5) contains frontal images of mostly
Caucasian working adults and is thus a rather limited representation of the variety
of human faces. A shape model built with noise added to the training shapes
helps the trained model generalize to a wider variety of faces. Good results can
be obtained with the following techniques:
1. Add Gaussian noise with a standard deviation of 0.75 pixels to the x- and y-
positions of each training shape landmark. In effect, this increases the variability
of the face shapes in the training set.
2. Randomly choose the left or the right side of each face. Generate a stretching
factor ε for each face from a Gaussian distribution with a standard deviation
of 0.08. Stretch or contract the chosen side of the face by multiplying the x
position (relative to the face center) of each landmark on that side by 1 + ε.
This is roughly equivalent to rotating the face slightly; a sketch of both perturbations follows below.
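A sketch of the two perturbations, assuming shapes are (n_landmarks, 2) arrays of (x, y) positions and rng is a NumPy random generator; which landmarks count as the chosen side is decided here by the face-center x coordinate, a simplification of ours.

import numpy as np

def augment_shape(shape, face_center_x, rng, sd_noise=0.75, sd_stretch=0.08):
    out = shape + rng.normal(0.0, sd_noise, shape.shape)   # step 1: jitter landmarks
    eps = rng.normal(0.0, sd_stretch)                      # step 2: stretching factor
    if rng.random() < 0.5:                                 # pick left or right side
        side = out[:, 0] > face_center_x
    else:
        side = out[:, 0] <= face_center_x
    out[side, 0] = face_center_x + (out[side, 0] - face_center_x) * (1.0 + eps)
    return out

For example, rng = np.random.default_rng(0) gives a reproducible generator.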
In Equation 2, the constraints on the generated face shape are determined by the
number of eigenvectors neigs in Φ and the maximum allowed values of the elements
in the parameter vector b. When conforming the shape suggested by the profile
models to the shape model, we clip each element b_i of b to b_max√λ_i, where λ_i is
the corresponding eigenvalue. The parameters neigs and b_max are global constants
determined during model building by parameter selection on a validation set.
See [8] for details.
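A one-line sketch of the clipping step, assuming b and the eigenvalues are NumPy arrays; b_max is the global constant described above.

import numpy as np

def conform_shape_parameters(b, eigvals, b_max):
    # Clip each element b_i to the range +/- b_max * sqrt(lambda_i).
    limits = b_max * np.sqrt(eigvals)
    return np.clip(b, -limits, limits)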
The profile models are most unreliable when starting the search (for exam-
ple, a jaw landmark can snag on the collar), but become more reliable as the
search progresses. We can take advantage of this increase in reliability with two
modifications to the standard ASM procedure described above. The first mod-
ification sets neigs and bmax for the final pyramid level (at the original image
scale) to larger values. The second sets neigs and bmax for the final iteration at
each pyramid level to larger values. In both cases the landmarks at that stage of
the search tend to be already positioned fairly accurately, for the given pyramid
level. It is therefore less likely that the profile match at any landmark is grossly
mispositioned, allowing the shape constraints to be weakened.
These modifications are effective for 2D but not for 1D profiles. The 1D profile
matches are not reliable enough to allow the shape constraints to be weakened.
5 Experimental Results
Before giving experimental results we briefly review model assessment in more
general terms [13]. The overall strategy for selecting parameters is
1. for each model parameter
2. for each parameter value
3. train on a set of faces
4. evaluate the model by using it to locate landmarks
5. select the value of the parameter that gives the best model
6. test the final model by using it to locate landmarks.
510 S. Milborrow and F. Nicolls
[Figure: me17 (mean point-to-point error / eye distance) over all BioID faces found by the Viola-Jones detector, and search time on a 3 GHz Pentium, for successive model variants: 20-point model (1D), 2D profiles, training noise, trimmed, stacked.]
Two processes are going on here: model selection which estimates the perfor-
mance of different models in order to choose one (steps 2-5 above), and model
assessment which estimates the final model’s performance on new data (step 6
above). We want to measure the generalization ability of the model, not its abil-
ity on the set it was trained on, and therefore need three independent datasets
(i) a training set for step 3 above (ii) a parameter selection or validation set for
step 4 above, and (iii) a test set for step 6 above.
For the training set we used the XM2VTS [19] set. We effectively doubled the
size of the training set by mirroring images, but excluded faces that were of poor
quality (eyes closed, blurred, etc.).
For the validation set we used the AR [18] set. So, for example, we used the
AR set for choosing the amount of noise discussed in section 4.3. We minimized
overfitting to the validation set by using a different subset of the AR data for
selecting each parameter. Subsets consisted of 200 randomly chosen images.
For the test set we used the BioID set [15]. More precisely, the test set is
those faces in the BioID set that were successfully found by the OpenCV [14]
implementation of the Viola-Jones face detector (1455 faces, which is 95.7% of
the total 1521 BioID faces).
We used manual landmarks for these three sets from the FGNET project [9].
Cross validation on a single data set is another popular approach. We did not
use cross validation because three datasets were available and because of the
many instances of near duplication of images within each dataset.
[Fig. 4: proportion of BioID faces (Y-axis) vs. me17 for the stacked model, compared against the Constrained Local Model curve discussed below.]
templates. During search, the feature templates are matched to the image using
an efficient shape constrained search. The model is more accurate and more
robust than the original Active Appearance Model.
The results in Cristinacce and Cootes’ paper appear to be the best previously
published facial landmark location results and are presented in terms of the me17
on the BioID set, which makes a direct comparison possible. The dotted curve in
Fig. 4 reproduces the curve in Fig. 4(c) in their paper. The figure shows that the
stacked model on independent data outperforms the Constrained Local Model.
The median me17 for the stacked model is 0.045 (2.4 pixels), the best me17 is
0.0235 (1.4 pixels), and the worst is 0.283 (14 pixels). The long right hand tail
of the error distribution is typical of ASMs.
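For reference, the me17 measure quoted above is straightforward to compute from predicted and ground-truth landmarks; a minimal sketch (assuming the 17 points are indexed consistently in both arrays and that ground-truth pupil positions are available for normalization) is:

import numpy as np

def me17(pred, truth, left_eye, right_eye):
    """Mean point-to-point error over the 17 landmarks,
    normalized by the inter-pupil distance (the me17 measure)."""
    pred = np.asarray(pred, dtype=float)    # shape (17, 2)
    truth = np.asarray(truth, dtype=float)  # shape (17, 2)
    eye_dist = np.linalg.norm(np.asarray(left_eye) - np.asarray(right_eye))
    point_errors = np.linalg.norm(pred - truth, axis=1)
    return point_errors.mean() / eye_dist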
References
1. Bates, D., Maechler, M.: Matrix: A Matrix package for R. See the nearPD function
in this R package for methods of forcing positive definiteness (2008),
http://cran.r-project.org/web/packages/Matrix/index.html
2. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth
(1984)
3. Cootes, T.F., Cooper, D.H., Taylor, C.J., Graham, J.: A Trainable Method of
Parametric Shape Description. BMVC 2, 54–61 (1991)
4. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models —
their Training and Application. CVIU 61, 38–59 (1995)
5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. In:
Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 484–498.
Springer, Heidelberg (1998)
6. Cootes, T.F., Taylor, C.J.: A Mixture Model for Representing Shape Variation.
Image and Vision Computing 17(8), 567–574 (1999)
7. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Comparing Active Shape Models with
Active Appearance Models. In: Pridmore, T., Elliman, D. (eds.) Proc. British Ma-
chine Vision Conference, vol. 1, pp. 173–182 (1999)
8. Cootes, T.F., Taylor, C.J.: Technical Report: Statistical Models of Appearance for
Computer Vision. The University of Manchester School of Medicine (2004),
www.isbe.man.ac.uk/~bim/refs.html
9. Cootes, T.F., et al.: FGNET manual annotation of face datasets (2002),
www-prima.inrialpes.fr/FGnet/html/benchmarks.html
10. Cristinacce, D., Cootes, T.: Feature Detection and Tracking with Constrained Local
Models. BMVC 17, 929–938 (2006)
11. Gentle, J.E.: Numerical Linear Algebra for Applications in Statistics. Springer,
Heidelberg (1998); See page 178 for methods of forcing positive definiteness
12. van Ginneken, B., Frangi, A.F., Staal, J.J., ter Haar Romeny, B.: Active Shape
Model Segmentation with Optimal Features. IEEE-TMI 21, 924–933 (2002)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, Heidelberg (2003); See chapter 7 for
methods of model assessment
14. Intel: Open Source Computer Vision Library. Intel (2007)
15. Jesorsky, O., Kirchberg, K., Frischholz, R.: Robust Face Detection using the Haus-
dorff Distance. AVBPA 90–95 (2001)
16. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. IJCV 1,
321–331 (1987)
17. Koenderink, J.J., van Doorn, A.J.: The Structure of Locally Orderless Images.
IJCV 31(2/3), 159–168 (1999)
18. Martinez, A.M., Benavente, R.: The AR Face Database: CVC Tech. Report 24
(1998)
19. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTS: The Extended
M2VTS Database. AVBPA (1999)
20. Milborrow, S.: Stasm software library (2007),
http://www.milbo.users.sonic.net/stasm
21. Rogers, M., Graham, J.: Robust Active Shape Model Search. In: Heyden, A., Sparr,
G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 517–530.
Springer, Heidelberg (2002)
22. Romdhani, S., Gong, S., Psarrou, A.: A Multi-view Non-linear Active Shape Model
using Kernel PCA. BMVC 10, 483–492 (1999)
23. Schölkopf, B., Smola, A.J., Müller, K.-R.: Nonlinear Component Analysis as a Kernel
Eigenvalue Problem. Neural Computation 10(5), 1299–1319 (1998)
24. Li, Y., Ito, W.: Shape Parameter Optimization for AdaBoosted Active Shape
Model. ICCV 1, 251–258 (2005)
25. Zhou, Y., Gu, L., Zhang, H.J.: Bayesian Tangent Shape Model: Estimating Shape
and Pose Parameters via Bayesian Inference. In: CVPR (2003)
Dynamic Integration of Generalized Cues for
Person Tracking
1 Introduction
Visual person tracking is a basic prerequisite for applications in fields like surveil-
lance, multimodal man-machine interaction or smart spaces. Our envisioned
scenario is that of an autonomous robot with limited computational resources
operating in a common space together with its users. The tracking range varies
from close distance, where the portrait of the user spans the entire camera im-
age, to far distance, where the entire body is embedded in the scene. In order to
tackle the problem, we present a multi-cue integration scheme within the frame-
work of particle filter-based tracking. It is capable of dealing with deficiencies of
single features as well as partial occlusion by means of the very same dynamic
fusion mechanism. A set of simple but fast cues is defined, allowing the system to cope with limited on-board resources.
The choice of cues is a crucial design criterion for a tracking system. In real-
world applications, each single cue is likely to fail in certain situations such
as occlusion or background clutter. Thus, a dynamic integration mechanism is
needed to smooth over a temporary weakness of certain cues as long as there
are other cues that still support the track. In [1], Triesch and von der Malsburg
introduced the concept of democratic integration that weights the influence of
the cues according to their agreement with the joint hypothesis. The competing
cues in [1] were based on different feature types such as color, motion, and shape.
In this paper, we use the principle of democratic integration in a way that also
includes the competition between different regions of the target object. We show
that this allows us to deal with deficiencies of single feature types as well as with
partial occlusion using one joint integration mechanism.
The combination of democratic integration and particle filters has been ap-
proached before by Spengler and Schiele [2]. In their work, however, the integration weights were held constant, thus falling short of the real power of democratic integration. This has also been pointed out by Shen et al. [3], who did provide a cue quality criterion for dynamic weight adaptation. This criterion is formulated as the distance between the tracking hypothesis based on all cues and the hypothesis based on the cue alone. The problem with this formulation is that,
due to resampling, the proposal distribution is generally strongly biased toward
the final hypothesis. Thus, even cues with uniformly mediocre scores tend to
agree well with the joint mean of the particle set. We therefore propose a new
quality criterion based on weighted MSE that prefers cues which actually focus
their probability mass around the joint hypothesis.
Democratic integration combines cues in the form of a weighted sum. In a
particle filter framework, this means that all cues have to be evaluated simulta-
neously for all particles. As pointed out by Pérez et al. [4], this can be alleviated
by layered sampling, if the cues are ordered from coarse to fine. In the proposed
algorithm, we therefore combine two-stage layered sampling with democratic in-
tegration on each stage to increase efficiency by reducing the required number
of particles.
For each object to be tracked, we employ one dedicated Condensation-like
tracker [5]. By using separate trackers instead of one single tracker running in
a joint state space, we accept the disadvantage of potentially not being able
to find the global optimum. On the other hand, however, we thereby avoid
the exponential increase in complexity that typically prevents the use of par-
ticle filters in high-dimensional state spaces. There are a number of approaches
dealing with this problem, such as Partitioned Sampling [6], Trans-dimensional
MCMC [7], or the Hybrid Joint-Separable formulation [8]. Although these ap-
proximations reduce the complexity of joint state space tracking significantly,
they still require noticeably more computational power than the separate tracker
approach.
The remainder of this paper is organized as follows: In section 2, we briefly
describe the concept of particle filters and layered sampling. In section 3 we
present our multi-cue integration scheme, which is the main contribution of this
paper. It is followed, in section 4, by the definition of the cues that we actually
use in the live tracking system. In section 5, the multi-person tracking logic
including automatic track initialization and termination is described. Finally,
section 6 shows the experiments and results.
According to [4], the state evolution can then be decomposed into M successive
intermediate steps:
p(s_t | s_{t-1}) = \int \cdots \int p_M(s_t | s^{M-1}) \cdots p_1(s^1 | s_{t-1}) \, ds^1 \cdots ds^{M-1}     (3)
p(z | s) = \sum_{c \in C} r_c \, p_c(z | s),     (4)
where p_c(z | s) is the single-cue observation model, and r_c is the mixture weight for cue c, with \sum_c r_c = 1.
Democratic integration [1] is a mechanism to dynamically adjust the mixture
weights rc , termed reliabilities, with respect to the agreement of the single cue c
with the joint result. For each cue, a quality measure qc is defined that quantifies
the agreement, with values close to zero indicating little agreement and values
close to one indicating good agreement. The reliabilities are updated after each
frame by a leaky integrator using the normalized qualities:
r_c^{t+1} = (1 - \tau) \, r_c^t + \tau \, q_c / \sum_c q_c     (5)
with the parameter τ controlling the speed of adaptation.
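A compact sketch of this reliability update, written as a NumPy function (the default value of tau is an illustrative assumption, not a value taken from the paper):

import numpy as np

def update_reliabilities(r, q, tau=0.05):
    """Leaky-integrator update of cue reliabilities (eq. 5).

    r   -- current reliabilities, one per cue (sums to 1)
    q   -- current quality scores q_c (non-negative)
    tau -- adaptation speed
    """
    r = np.asarray(r, dtype=float)
    q = np.asarray(q, dtype=float)
    q_norm = q / q.sum() if q.sum() > 0 else np.full_like(q, 1.0 / len(q))
    r_new = (1.0 - tau) * r + tau * q_norm
    return r_new / r_new.sum()   # keep the reliabilities normalized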
The exponent λ > 0 can be used to tweak the volatility of the quality measure:
high values of λ emphasize the quality difference between cues whereas low values
produce more similar qualities for all cues.
The other option to generate orthogonal cues is to use different state model
transformations A(s):
pc (z|s) = pc (z|A(s)) (8)
This is motivated by the fact that cues relying on certain aspects of the state
vector may still be used while other aspects of the state are not observable. In
our implementation, A(s) represents a certain projection from state space to
image space, i.e. a certain image sub-region of the target. This is useful in a
situation, where due to partial occlusion one region of the target object can be
observed, while another region cannot.
In this work, we aim at combining the advantages of both strategies, i.e.
dynamically combining cues that are based on different feature types as well as
dynamically weighting cues that focus on different regions of the target but are
based on the same feature type. Therefore, we use a generalized definition of the
cues c = (F , A) that comprises different feature types F (z) and different state
transformations A(s):
All cues in this unified set will then compete equally against each other, guided
by the very same integration mechanism. Thus, the self-organizing capabilities
of democratic integration can be used to automatically select the specific feature
types as well as the specific regions of the target that are most suitable in the
current situation.
One issue with adaptation is that, after an update step, the cue is not guaranteed to perform better than before. Although the update step always results in a higher score for the prototype region at ŝ, it can happen that the updated model also produces higher scores for regions other than the correct one. This actually reduces the cue's discriminative power and, in consequence, its reliability r_c. We therefore propose the following test to be carried out before accepting an update:
1. Calculate q̂_c (eq. 6) using the new parameters P̂_c
2. Perform the update step (eq. 10) only if q̂_c > q_c
Fig. 1. The 3-box model of the human body: the state vector s is transformed into the
image space as the projection of a cuboid representing either the head, torso, or leg
region. The projection of the cuboid is approximated by a rectilinear bounding box.
Fig. 2. Snapshot from a test sequence showing the different feature types. In this
visualization, the color support maps for head, torso and legs of the respective person
are merged into the RGB-channels of the image. The tracking result is superimposed.
The left factor seeks to maximize the amount of foreground within the region.
The right factor seeks to cover all foreground pixels in the image. It prevents the
motion cue from preferring tiny regions filled with motion, while ignoring the rest.
We employ 3 motion cues, termed m-h, m-t and m-l, dedicated to either
the head, torso or legs region as depicted in Fig. 1. We rely on the ability of
the integration mechanism (see section 3) to automatically cancel the influence
of the motion cues in case of camera motion. This is justified by the fact that
the agreement of the motion cues with the final tracking hypothesis will drop
whenever large portions of the image exceed the threshold.
1st layer:
– resample s_{t-1}^{(1..n)} w.r.t. \pi_{t-1}^{(1..n)}
– propagate with partial evolution model (cf. eq. 3): s_t^{1,(i)} \leftarrow p_1(s_t^1 | s_{t-1}^{(i)})
– evaluate stereo cues: \pi_t^{1,(i)} \propto \sum_{c \in C_S} r_c \, p_c(z | s_t^{1,(i)})
– apply collision penalty: \pi_t^{1,(i)} \leftarrow \pi_t^{1,(i)} - v(s_t^{1,(i)})
2nd layer:
– resample s_t^{1,(1..n)} w.r.t. \pi_t^{1,(1..n)}
– propagate with partial evolution model (cf. eq. 3): s_t^{(i)} \leftarrow p_2(s_t | s_t^{1,(i)})
– evaluate regular cues: \pi_t^{(i)} \propto \sum_{c \in C_R} r_c \, p_c(z | s_t^{(i)})
Dem. integration:
– calculate track hypothesis \hat{s}_t = \sum_i \pi_t^{(i)} s_t^{(i)}
– update reliabilities (cf. eqs. 5 and 6):
  r_{c \in C_S} \leftarrow (\hat{s}_t, s_t^{1,(1..n)}, \pi_t^{1,(1..n)})
  r_{c \in C_R} \leftarrow (\hat{s}_t, s_t^{(1..n)}, \pi_t^{(1..n)})
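The two-layer update above can be sketched as follows for a single tracker; the evolution samplers, cue likelihoods and collision penalty are assumed callables, and the resampling shown is plain multinomial resampling — a sketch of the structure, not the authors' implementation.

import numpy as np

def resample(particles, weights, rng):
    """Multinomial resampling of particles proportional to their weights."""
    weights = np.asarray(weights, dtype=float)
    idx = rng.choice(len(particles), size=len(particles), p=weights / weights.sum())
    return np.asarray(particles)[idx]

def layered_step(particles, weights, p1, p2, stereo_cues, regular_cues,
                 r_stereo, r_regular, collision_penalty, rng):
    """One time step of the two-layer scheme for a single tracker.

    p1, p2              -- partial state-evolution samplers (cf. eq. 3)
    stereo_cues         -- per-particle likelihoods p_c(z|s) for the 1st layer
    regular_cues        -- per-particle likelihoods p_c(z|s) for the 2nd layer
    r_stereo, r_regular -- current cue reliabilities (each set sums to 1)
    """
    # 1st layer: stereo cues, then the collision penalty
    s = resample(particles, weights, rng)
    s1 = np.array([p1(x, rng) for x in s])
    w1 = np.array([sum(r * c(x) for r, c in zip(r_stereo, stereo_cues)) for x in s1])
    w1 = np.maximum(w1 - np.array([collision_penalty(x) for x in s1]), 1e-12)

    # 2nd layer: regular cues on the refined particle set
    s2 = np.array([p2(x, rng) for x in resample(s1, w1, rng)])
    w2 = np.array([sum(r * c(x) for r, c in zip(r_regular, regular_cues)) for x in s2])

    # track hypothesis: weighted mean of the particle set
    s_hat = (w2[:, None] * s2).sum(axis=0) / w2.sum()
    return s2, w2, s_hat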
Apart from the geometrical constraints implied by the stereo cues, there is
another strict constraint, namely the collision penalty, which is enforced in the
1st layer of the algorithm in Fig. 3. The function v(s) penalizes particles that
are close to those tracks with a higher track quality than the current track (see
following section). Thereby, we guarantee mutual exclusion of tracks.
The question of when to spawn a new tracker and when to terminate a tracker
that has lost its target is of high importance, and can become more difficult than
the actual tracking problem. We define the quality measure for a tracker to be
the joint response from both stereo and regular cues at the tracker’s hypothesis ŝ:
Q(\hat{s}) = \Big( \sum_{c \in C_S} r_c \, p_c(z | \hat{s}) \Big) \cdot \Big( \sum_{c \in C_R} r_c \, p_c(z | \hat{s}) \Big)     (15)
6 Experiments
Fig. 4. Evolution of cue reliabilities in an example sequence. The three stereo cues
constitute the first layer of the algorithm, their reliabilities sum up to 1. The remaining
ten cues are used in layer 2 and sum up to 1 likewise. In the beginning of the interval,
the subject approaches the camera. While he is walking (frame 250), the motion cues
for legs and torso (M-L,M-T) contribute significantly to the track. At around frame
300, the subject’s legs disappear, and in consequence the reliabilities of all leg-related
cues (M-L, C-L, S-L) drop automatically. While the subject is standing in front of the
camera (frames 300-500), the frontal face detection cue D-F and the head color cue
C-H dominate the track. The influence of the head color cue C-H drops dramatically,
when the subject turns around (frame 520) and walks in front of the wooden pinboard,
which has a skin-color like appearance.
its model is therefore shared among all trackers. A new box type for the upper-body detector was used; it comprises the head and the upper half of the torso. To avoid
dominance, we limited the range for a cue’s influence to 0.03 ≤ rc ≤ 0.6. We
found, however, that these situations rarely occur. Boxes that get projected
outside the visible range or that are clipped to less than 20% of their original
size, are scored with a minimum score of 0.001. The approximate runtime of
the algorithm was 30ms per frame for an empty scene, plus another 10ms per
person being tracked. These values are based on an image size of 320×240 pixels,
and a 2.4GHz Pentium CPU. The most important parameter values are given
in Table 2.
7 Conclusion
We have presented a new approach for dynamic cue combination in the frame-
work of particle filter-based tracking. It combines the concepts of democratic
integration and layered sampling and enables a generalized kind of competition
among cues. With this method, cues based on different feature types compete di-
rectly with cues based on different target regions. In this way, the self-organizing
capabilities of democratic integration can be fully exploited. In an experimental
validation, the proposed new cue quality measure has been shown to improve
the tracking performance significantly.
Acknowledgments
This work has been funded by the German Research Foundation (DFG) as part
of the Sonderforschungsbereich 588 ”Humanoid Robots”.
References
1. Triesch, J., von der Malsburg, C.: Democratic integration: Self-organized integration
of adaptive cues. Neural Comput. 13(9), 2049–2074 (2001)
2. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking.
Machine Vision and Applications 14, 50–58 (2003)
3. Shen, C., Hengel, A., Dick, A.: Probabilistic multiple cue integration for particle
filter based tracking. In: International Conference on Digital Image Computing -
Techniques and Applications, pp. 309–408 (2003)
4. Pérez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles.
Proceedings of the IEEE 92(3), 495–513 (2004)
5. Isard, M., Blake, A.: Condensation–conditional density propagation for visual
tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
6. MacCormick, J., Blake, A.: A probabilistic exclusion principle for tracking multiple
objects. International Journal of Computer Vision 39(1), 57–71 (2000)
7. Smith, K., Gatica-Perez, D., Odobez, J.M.: Using particles to track varying num-
bers of interacting people. In: IEEE Conf. on Computer Vision and Pattern Recog-
nition, Washington, DC, USA, pp. 962–969 (2005)
8. Lanz, O.: Approximate bayesian multibody tracking. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence 28(9), 1436–1449 (2006)
9. Viola, P., Jones, M.: Robust real-time object detection. In: ICCV Workshop on
Statistical and Computation Theories of Vision (July 2001)
10. Lienhart, R., Maydt, J.: An extended set of haar-like features for rapid object
detection. In: ICIP, vol. 1, pp. 900–903 (September 2002)
11. Kruppa, H., Castrillon-Santana, M., Schiele, B.: Fast and robust face finding via
local context. In: IEEE Intl. Workshop on Visual Surveillance and Performance
Evaluation of Tracking and Surveillance (October 2003)
12. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms. IJCV 47(1/2/3), 7–42 (2002)
13. Veksler, O.: Fast variable window for stereo correspondence using integral images.
In: IEEE Conf. on Computer Vision and Pattern Recognition, pp. 556–561 (2003)
Extracting Moving People from Internet Videos
1 Introduction
Human motion analysis is notoriously difficult because human bodies are highly
articulated and people tend to wear clothing with complex textures that obscure
the important features needed to distinguish poses. Uneven lighting, clutter,
occlusions, and camera motions cause significant variations and uncertainties.
Hence it is no surprise that the most reliable person detectors are built for
upright walking pedestrians seen in typically high quality images or videos.
Our goal in this work is to be able to automatically and efficiently carve out
spatio-temporal volumes of human motions from arbitrary videos. In particular,
we focus our attention on videos that are typically present on internet sites such
as YouTube. These videos are representative of the kind of real-world data that
is highly prevalent and important. As the problem is very challenging, we do
not assume that we can find every individual. Rather, our aim is to enlarge
the envelope of upright human detectors by tracking detections from typical to
atypical poses. Sufficient data of this sort will allow us in the future to learn
even more complex models that can reliably detect people in arbitrary poses.
Two example sequences and the system output are shown in Fig. 1.
Our first objective is to find moving humans automatically. In contrast to
much of the previous work in tracking and motion estimation, our framework
does not rely on manual initialization or a strong a priori assumption on the
Fig. 1. Two example outputs. Our input videos are clips downloaded from YouTube and
thus are often low resolution, captured by hand-held moving cameras, and contain a
wide range of human actions. In the top sequence, notice that although the boundary
extraction is somewhat less accurate in the middle of the jump, the system quickly
recovers once more limbs become visible.
number of people in the scene, the appearance of the person or the background,
the motion of the person or that of the camera. To achieve this, we improve a
number of existing techniques for person detection and pose estimation, leverag-
ing temporal consistency to improve both the accuracy and speed of existing
techniques. We initialize our system using a state-of-the-art upright pedestrian
detection algorithm [1]. While this technique works well on average, it produces many false positive windows and often fails to detect people. We improve this sit-
uation by building an appearance model and applying a two-pass constrained
clustering algorithm [2] to verify and extend the detections.
Once we have these basic detections, we build articulated models following
[3,4,5] to carve out arbitrary motions of moving humans into continuous spatio-
temporal volumes. The result can be viewed as a segmentation of the moving
person, but we are not aiming to achieve pixel-level accuracy for the extraction.
Instead, we offer a relatively efficient and accurate algorithm based on the prior
knowledge of the human body configuration. Specifically, we enhance the speed
and potential accuracy of [4,5] by leveraging temporal continuity to constrain
the search space and applying semi-parametric density propagation to speed up
evaluation.
The paper is organized as follows. After reviewing previous work in the area
of human motion analysis in Section 1.1, we describe the overall system archi-
tecture in Section 2. Two main parts of our system, person detection/clustering
and extraction of moving human boundaries, are presented in Sections 3 and 4,
respectively. Finally, implementation details and experimental results are de-
scribed in Section 5.
2 System Architecture
Our system consists of two main components. The first component generates
object-level hypotheses by coupling a human detector with a clustering algo-
rithm. In this part, the state of each person, including location, scale and trajec-
tory, is obtained and used to initialize the body configuration and appearance
models for limb-level analysis. Note that in this step two separate problems
– detection and data association – are handled simultaneously, based on the
spatio-temporal coherence and appearance similarity.
The second component extracts detailed human motion volumes from the
video. In this stage, we further analyze each person’s appearance and spatio-
temporal body configuration, resulting in a probability map for each body part.
We have found that we can improve both the robustness and efficiency of the
algorithm by limiting the search space of the measurement and inference around
the modes of the distribution. To do this, we model the density function as a
mixture of Gaussians in a sequential Bayesian filtering framework [23,24,25].
The entire system architecture is illustrated in Fig. 2. More details about each
step are described in the following two sections.
The focus of our work is to extract arbitrarily complex human motions from
YouTube videos that involve a large degree of variability. We face several difficult
challenges, including:
1. Compression artifacts and low quality of videos
2. Multiple shots in a video
3. Unknown number of people in each shot or sequence
4. Unknown human motion and poses
5. Unknown camera parameters and motion
6. Background clutter, motion and occlusions
We will refer back to these points in the rest of the paper as we describe how
the components try to overcome them.
Fig. 3. Human detection and clustering result. From noisy detections, three tracks of
people are identified successfully by filling gaps and removing outliers. (In this figure,
the horizontal and vertical axes are the x locations and frame numbers, respectively.) (a) Original detection (b) Initial clusters after step 1 (c) Final clusters (d) Example images of three similar people that are correctly clustered into different groups.
4.1 Overview
We summarize here the basic theory for the belief propagation and inference
in [3,4]. Suppose that each body part pi is represented with a 4D vector of
(xi , yi , si , θi ) – location, scale and orientation. The entire human body B is
composed of m parts, i.e. B = {p1 , p2 , . . . , pm }. Then, the log-likelihood given
the measurement from the current image I is
L(B | I) \propto \sum_{(i,j) \in E} \Psi(p_i - p_j) + \sum_i \Phi(p_i)     (1)
where Ψ (pi − pj ) is the relationship between two body parts pi and pj , and Φ(pi )
is the observation for body part pi . E is a set of edges between directly connected
body parts. Based on the given objective function, the inference procedure by
message passing is characterized by
M_i(p_j) \propto \sum_{p_i} \Psi(p_i - p_j) \, O(p_i)     (2)

O(p_i) \propto \Phi(p_i) \prod_{k \in C_i} M_k(p_i)     (3)
which generates the probability map of each body part in the 4D state.
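For intuition, the message passing of eqs. (2)-(3) can be written for a toy discrete setting in which each part takes one of S candidate configurations and the potentials are given as tables; this is a didactic sketch (a single upward pass over a tree), not the authors' 4D implementation.

import numpy as np

def part_beliefs(children, root, phi, psi):
    """Sum-product on a tree of body parts with S discrete states per part.

    children[i] -- list of child parts of part i
    phi[i]      -- unary observation scores Phi(p_i), shape (S,)
    psi[(k,i)]  -- table psi[(k,i)][s_k, s_i] approximating Psi(p_k - p_i)
    Returns unnormalized beliefs O(p_i) from a single upward pass
    (exact for the root; a downward pass would complete the other parts).
    """
    beliefs = {}

    def upward(i):
        # O(p_i) = Phi(p_i) * prod over children k of M_k(p_i)   (cf. eq. 3)
        belief = np.asarray(phi[i], dtype=float).copy()
        for k in children.get(i, []):
            child = upward(k)
            # M_k(p_i) = sum over p_k of Psi(p_k - p_i) * O(p_k)  (cf. eq. 2)
            belief *= np.asarray(psi[(k, i)], dtype=float).T @ child
        beliefs[i] = belief
        return belief

    upward(root)
    return beliefs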
Based on this framework, we propose a method to propagate the density function in the temporal domain in order to reduce the search space and obtain temporally consistent results. The rest of the section describes the details of our algorithm.
4.2 Initialization
The first step for human body extraction is to estimate an initial body configu-
ration and create a reliable appearance model. The initial location of the human
is given by the method presented in Section 3. Note that the bounding box pro-
duced by the detection algorithm does not need to be very accurate since most
of the background area will be removed by further processing. Once a potential
human region is found, we apply a pose estimation technique [4] based on the
same pictorial structure and obtain the probability map of the configuration of
each body part through the measurement and inference step. In other words, the
output of this algorithm is the probability map Pp (u, v, s, θ) for each body part
p, where (u, v) is location, s is scale and θ is orientation. A sample probability
map is presented in Fig. 4 (b)-(d). Although this method creates accurate proba-
bility maps for each human body part, it is too computationally expensive to be
used in video processing. Thus, we adopt this algorithm only for initialization.
where Vx and Vθ are (co)variance matrices in spatial and angular domain, re-
spectively. The representation of the combined density function based on the
entire orientation maps is given by
\hat{f}(x) = \frac{1}{(2\pi)^{d/2}} \sum_{k=1}^{N} \frac{\kappa^{(k)}}{|P^{(k)}|^{1/2}} \exp\Big( -\tfrac{1}{2} D^2\big(x, x^{(k)}, P^{(k)}\big) \Big)     (8)

where D^2(x, x^{(k)}, P^{(k)}) is the Mahalanobis distance from x to x^{(k)} with covariance P^{(k)}.
Although we simplify the density functions for each orientation as a Gaussian,
it is still difficult to manage them in an efficient way especially because the
number of components will increase exponentially when we propagate the density
to the next time step. We therefore adopt Kernel Density Approximation (KDA)
[26] to further simplify the density function with little sacrifice in accuracy. KDA
is a density approximation technique for a Gaussian mixture. The algorithm finds
the mode locations of the underlying density function by an iterative procedure,
such that a compact mixture of Gaussians based on the detected mode locations
is found.
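As an illustration, the Gaussian-mixture density of eq. (8) can be evaluated as below; the mode-finding step of KDA itself [26] is omitted, so this is only a sketch of how the approximated density is used.

import numpy as np

def mixture_density(x, means, covs, kappas):
    """Evaluate the Gaussian-mixture density of eq. (8) at x.

    means  -- list of component means x^(k), each of dimension d
    covs   -- list of covariance matrices P^(k), each d x d
    kappas -- list of component weights kappa^(k)
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    total = 0.0
    for mu, P, kappa in zip(means, covs, kappas):
        diff = x - np.asarray(mu, dtype=float)
        maha2 = diff @ np.linalg.solve(P, diff)        # D^2(x, x^(k), P^(k))
        norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(P))
        total += kappa * np.exp(-0.5 * maha2) / norm
    return total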
Fig. 4 presents the original probability map and our approximation using a
mixture of Gaussians for each body part after the pose estimation. Note that
the approximated density function is very close to the original one and that the
multi-modality of the original density function is well preserved.
¹ Arms occasionally have significant outliers due to their flexibility. A uni-modal Gaussian fitting may result in more error here.
Fig. 5. Density functions in one step of the human motion extraction. (a) Original
frame (cropped for visualization) (b) Diffused density function (c) Measurement and
inference results (d) Posterior (Note that the probability maps for all orientations are
shown in a single image by projection.)
5 Experiments
In order to evaluate our proposed approach, we have collected a dataset of 50
sequences containing moving humans downloaded from YouTube. The sequences
contain natural and complex human motions and various challenges mentioned
in Section 2. Many videos have multiple shots (challenge 2), so we divide the
original videos into several pieces based on the shot boundary detection, which
is performed by global color histogram comparison with threshold [27]. We deal
with each shot as a separate video. We have made this dataset public and it can
be found at http://vision.cs.princeton.edu/projects/extractingPeople.html.
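The shot-boundary step mentioned above can be sketched as follows; the histogram bin count, the intersection-based distance and the threshold are our own illustrative choices, not the settings of [27].

import cv2
import numpy as np

def shot_boundaries(video_path, bins=16, threshold=0.5):
    """Return frame indices where a new shot is assumed to start."""
    cap = cv2.VideoCapture(video_path)
    prev_hist, cuts, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # global color histogram over the three channels, L1-normalized
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [bins, bins, bins], [0, 256] * 3).flatten()
        hist /= hist.sum() + 1e-12
        if prev_hist is not None:
            # histogram intersection distance between consecutive frames
            if 1.0 - np.minimum(hist, prev_hist).sum() > threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts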
Instead of the 4D state space for the human body configuration, a 3D state space for location and orientation is used, and the scale is determined from the detection size. Although the person detector is not very accurate in its scale estimates, the extraction algorithm is robust enough to handle some variation in scale. Also, the gaps between detections are generally not long, and it is rare to observe a significant change in scale between two detections.
The measurement is based on edge templates and color histograms as in [4], but the search space for the measurement is significantly reduced. Fig. 5 (b) illustrates
the search space reduction, where low density areas are not sampled for the
observations.
We evaluate the retrieval performance of our system in terms of the precision-
recall measures. For each sequence, we have generated ground-truth by manually
labeling every human present in each frame with a bounding box. We compare
the precision-recall rates at three stages of our system: pedestrian detection only
[1], people detection and clustering, and the full model. For a fixed threshold of
the human detector, we obtain the three precision-recall pairs in each row of
Table 1. Our full system provides the highest performance in terms of the F-
measure2 . This reflects the fact that our system achieves much higher recall rates
by extracting non-upright people beyond the pedestrian detections.
We also evaluate the performance of our system in terms of the segmentation
of the moving people. We create ground-truth for the spatial support of the
moving people in the form of binary masks. We have labeled a random sample
of 122 people from our 50 sequences. The evaluation of the pose estimation is
performed at frames td , td +5 and td +10, where td is a frame containing a pedes-
trian detection, and no detections are available in [td + 1, td + 10]. The average
accuracies are 0.68, 0.68 and 0.63 respectively. Note that the accuracy decrease
in the extracted person mask is moderate, and the temporal error propagation
is small.
2
The F-measure is defined [28] as: 2 · (precision · recall)/(precision + recall).
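For completeness, the reported precision, recall and F-measure follow the standard definitions, e.g.:

def f_measure(tp, fp, fn):
    """Precision, recall and F-measure from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f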
Fig. 6. Experimental results for various sequences. Each row corresponds to a sep-
arate sequence and two failure examples are illustrated in the last two rows. Please
visit http://vision.cs.princeton.edu/projects/extractingPeople.html for more
sample videos.
The results for several YouTube videos are presented in Fig. 6. Various general
and complex human motions are extracted with reasonable accuracy, but there
are some failures that are typically caused by inaccurate measurements. In a
PC with a 2.33 GHz CPU, our algorithm requires around 10-20 seconds for the
measurement and inference of each person, one order of magnitude faster than
the full search method of [4].
References
1. Laptev, I.: Improvements of object detection using boosted histograms. In: BMVC,
Edinburgh, UK, vol. III, pp. 949–958 (2006)
2. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-
level constraints: Making the most of prior knowledge in data clustering. In: ICML
(2002)
3. Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition.
IJCV 61, 55–79 (2005)
4. Ramanan, D.: Learning to parse images of articulated objects. In: NIPS, Vancouver,
Canada (2006)
5. Ramanan, D., Forsyth, D., Zisserman, A.: Tracking people by learning their ap-
pearance. PAMI 29, 65–81 (2007)
6. Lucas, B., Kanade, T.: An iterative image registration technique with an applica-
tion to stereo vision. IJCAI, 674–679 (1981)
7. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using
mean shift. In: CVPR, Hilton Head, SC, vol. II, pp. 142–149 (2000)
8. Cham, T., Rehg, J.: A multiple hypothesis approach to figure tracking. In: CVPR,
Fort Collins, CO, vol. II, pp. 219–239 (1999)
9. Deutscher, J., Blake, A., Reid, I.: Articulated body motion capture by annealed
particle filtering. In: CVPR, Hilton Head, SC (2000)
10. Han, T.X., Ning, H., Huang, T.S.: Efficient nonparametric belief propagation with
application to articulated body tracking. In: CVPR, New York, NY (2006)
11. Haritaoglu, I., Harwood, D., Davis, L.: W4: Who? When? Where? What? - A real
time system for detecting and tracking people. In: Proc. of Intl. Conf. on Automatic
Face and Gesture Recognition, Nara, Japan, pp. 222–227 (1998)
12. Lee, C.S., Elgammal, A.: Modeling view and posture manifolds for tracking. In:
ICCV, Rio de Janeiro, Brazil (2007)
13. Sigal, L., Bhatia, S., Roth, S., Black, M., Isard, M.: Tracking loose-limbed people.
In: CVPR, Washington DC, vol. I, pp. 421–428 (2004)
14. Sminchisescu, C., Triggs, B.: Covariance scaled sampling for monocular 3D body
tracking. In: CVPR, Kauai, Hawaii, vol. I, pp. 447–454 (2001)
15. Sminchisescu, C., Kanaujia, A., Li, Z., Metaxas, D.: Discriminative density prop-
agation for 3d human motion estimation. In: CVPR, San Diego, CA, vol. I, pp.
390–397 (2005)
16. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In:
CVPR, San Diego, CA, vol. I, pp. 878–885 (2005)
17. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR, San Diego, CA, vol. I, pp. 886–893 (2005)
18. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on riemannian
manifolds. In: CVPR, Minneapolis, MN (2007)
19. Viola, P., Jones, M.J., Snow, D.: Detecting pedestrians using patterns of motion
and appearance. In: ICCV, Nice, France, pp. 734–741 (2003)
20. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single
image by bayesian combination of edgelet part detectors. In: ICCV, Beijing, China,
vol. I, pp. 90–97 (2005)
21. Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction
for human pose estimation. In: CVPR, Anchorage, AK (2008)
22. Ren, X., Malik, J.: Tracking as repeated figure/ground segmentation. In: CVPR,
Minneapolis, MN (2007)
23. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle fil-
ters for on-line non-linear/non-gaussian bayesian tracking. IEEE Trans. Signal
Process. 50, 174–188 (2002)
24. Doucet, A., de Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Prac-
tice. Springer, Heidelberg (2001)
25. Han, B., Zhu, Y., Comaniciu, D., Davis, L.: Kernel-based bayesian filtering for
object tracking. In: CVPR, San Diego, CA, vol. I, pp. 227–234 (2005)
26. Han, B., Comaniciu, D., Zhu, Y., Davis, L.: Sequential kernel density approxima-
tion and its application to real-time visual tracking. PAMI 30, 1186–1197 (2008)
27. Lienhart, R.: Reliable transition detection in videos: A survey and practitioner’s
guide. International Journal of Image and Graphics 1, 469–486 (2001)
28. Van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
29. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time
shapes. In: ICCV, Beijing, China, pp. 1395–1402 (2005)
30. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumet-
ric features. In: ICCV, Beijing, China, pp. 166–173 (2005)
31. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action cate-
gories using spatial-temporal words. IJCV 79, 299–318 (2008)
Multiple Instance Boost Using Graph Embedding
Based Decision Stump for Pedestrian Detection
1 Introduction
Pedestrian detection is a practical requirement of many of today's automated surveillance, vehicle driver assistance and robot vision systems. However, large appearance and stance variations, combined with different viewpoints, make pedestrian detection very difficult. The reasons are manifold: variable human clothing, the articulated structure of the human body, illumination changes, and so on. These variations bring various challenges, including the misalignment problem often encountered in non-rigid object detection.
A variety of pedestrian detection algorithms exist, approaching the problem from different perspectives: direct template matching [2], unsupervised models [3], traditional supervised models [4,5,6] and so on. Generally, these approaches cope with the “mushroom” shape – the torso wider than the legs – which dominates frontal pedestrians, and with the “scissor” shape – the legs crossing while walking – which dominates lateral pedestrians. However, for some uncommon stances, such as riding a bike, they tend to fail. In these conditions, the variations often impair the performance of these conventional approaches. Fig. 1 shows some false negatives produced by the detector of Dalal et al. [4]. These false negatives are typically neither “mushroom” nor “scissor” shaped, and vary greatly from one another.
Fig. 1. Some detection results where our method produces fewer false negatives than Dalal et al. [4]
The key notion of our solution is that the variations are represented within multiple instances, and the “well” aligned instances are automatically selected to train a classifier via multiple instance learning (MIL) [7,8]. In MIL, a training example is not a singleton, but is represented as a “bag” where all of the instances in a bag share the bag's label. A positive bag means that at least one instance in the bag is positive, while a negative bag means that all instances in the bag are negative. For pedestrian detection, the standard scanning window is considered the “bag”, and a set of sub-images within the window are treated as instances. If one instance is classified as a pedestrian, the pedestrian is located at the detection stage. The logistic multiple instance boost (LMIBoost) [9] is utilized to learn the pedestrian appearance; it assumes an average relationship between the bag's label and the instance labels.
Considering the non-Gaussian distributions that dominate the positive and negative examples, and the aims of detection (accuracy and speed), a graph embedding based weak classifier is proposed for histogram features in boosting. Graph embedding can effectively model non-Gaussian distributions, and maximally separates the pedestrians from negative examples in a low-dimensional space [10]. After the feature is projected onto a discriminative one-dimensional manifold, K-means is used to quickly locate the multiple decision planes of the decision stump. The proposed weak classifier has the following advantages: 1) it handles training examples with any distribution; and 2) it requires less computation and results in a more robust boosted classifier. The main contributions of the proposed algorithm are summarized as follows:
– The pose variations are handled by multiple instance learning. The variations between examples are represented within the instances, and are automatically reduced during the learning stage.
– Considering the boosting setting, a graph embedding based decision stump is proposed to handle training data with non-Gaussian distributions.
2 Related Work
Generally, the “mushroom” or “scissor” shape encourages the use of template matching and traditional machine learning approaches, as discussed in Section 1. Contour templates are hierarchically matched via Chamfer matching [2]. A polynomial support vector machine (SVM) is learned with Haar wavelets as a human descriptor [5] (variants are described in [11]). Similarly to still images, a real-time boosted cascade detector also uses Haar wavelet descriptors, but extracted from space-time differences in video [6]. In [4], an excellent pedestrian detector is described that trains a linear SVM classifier on densely sampled histogram of oriented gradients (HOG) features (a variant of Lowe's SIFT descriptor [12]). In a similar approach [13], near real-time detection performance is achieved by training a cascade detector using linear SVMs on HOG features within AdaBoost. However, these “fixed-template-style” detectors are sensitive to pose variations: if the pose or appearance of the pedestrian changes substantially, such “template”-like methods are doomed to fail. Therefore, more robust features have been proposed to withstand translation and scale transformations [14].
Several existing publications have been aware of the pose variation problem, and have handled it by “divide and conquer” – the parts-based approach. In [15], the body parts are explicitly represented by co-occurrences of local orientation features. A separate detector is trained for each part using AdaBoost. The pedestrian location is determined by maximizing the joint likelihood of the part occurrences according to their geometric relations. The codebook approach avoids explicitly modeling the body segments or the body parts, and instead uses unsupervised methods to find part decompositions [16]. Recently, body configuration estimation has been exploited to improve pedestrian detection via structure learning [17]. However, parts-based approaches have two drawbacks. First, a different part detector has to be applied to the same image patch, which reduces detection speed. Second, labeling and aligning the local parts is tedious and time-consuming work in supervised learning. Therefore, the deformable part model learns a holistic classifier in a supervised manner to coarsely locate the person, and then uses part filters to refine the body parts without part-level supervision [18].
The multiple instance learning (MIL) problem was first identified in [8], which represents ambiguously labeled examples using axis-parallel hyper-rectangles. Previous applications of MIL in vision have focused on image retrieval [19]. The work most similar to ours may be the upper-body detection of [20]. Viola et al. use a Noisy-OR boost which assumes that only sparse instances are upper-bodies in a positive bag. In our pedestrian detection setting, however, the instances in a positive bag are all positive, which allows us to simply assume that every instance in a bag contributes equally to the bag's class label.
In pedestrian detection, the histogram feature (such as SIFT, HOG) is typi-
cally used. The histogram feature can be computed rapidly using an intermediate
data representation called “Integral Histogram” [21]. However, the efficient use
of the histogram feature is not well discussed. In [13], the linear SVM and HOG
feature is used as weak classifier. Kullback-Leibler (K-L) Boost uses the log-ratio
between the positive and negative projected histograms as weak classifier. The
where n_i is the number of instances in the i-th bag, and y_ij is the instance-level class label for the instance x_ij. Equation (1) indicates that every instance contributes equally to the bag's label. This simple assumption is suitable for instances generated by perturbing the window around the person, because every generated instance is then a positive pedestrian image.
The instance-level class probability is given as p(y|x) = 1/(1 + eβx ), where
β is the parameter to be estimated. Controlling the parameter β gives different
Fig. 2. Overview of the multiple instance learning process. The training example is first
converted into a bag of instances. Note that we only generate the instances spatially,
and the instances can also be generated at different scales. Therefore, the resulting
classifier will withstand translation and scale transformations.
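A sketch of how one annotated window could be turned into a bag of instances by spatial perturbation, in the spirit of Fig. 2 (the offsets, instance count and patch size are illustrative assumptions, not the paper's settings):

import cv2

def make_bag(image, box, offsets=((0, 0), (-4, 0), (4, 0), (0, 4)),
             size=(64, 128)):
    """Crop a bag of instances around one annotated pedestrian window.

    image -- H x W x 3 NumPy array
    box   -- (x, y, w, h) of the annotated window
    Returns a list of image patches, all sharing the bag's label.
    """
    x, y, w, h = box
    bag = []
    for dx, dy in offsets:
        x0, y0 = max(x + dx, 0), max(y + dy, 0)
        patch = image[y0:y0 + h, x0:x0 + w]
        if patch.size:                      # skip crops falling off the image
            bag.append(cv2.resize(patch, size))
    return bag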
E[I(F(x) \neq y)] = -\frac{1}{N} \sum_{i=1}^{N} y_i F(x_i),     (3)

where I(\cdot) is the indicator function. We are interested in wrapping the bag-level weak classifier with the instance-level weak classifier f. Using Equation (1), Equation (3) is converted into the instance-level exponential loss E_x E_{y|x}[e^{-yf}], since e^{-yH} \geq I(H(x) \neq y). One searches for the optimal update c_m f_m that minimizes

E_x E_{y|x}\big[ e^{-y_{ij} F_{m-1}(x_{ij}) - c_m y_{ij} f_m(x_{ij})} \big] = \sum_i w_i \, e^{(2\epsilon_i - 1) c_m},     (4)

where \epsilon_i = \sum_j \mathbf{1}_{f_m(x_{ij}) \neq y_{ij}} / n_i and w_i is the example's weight. The error \epsilon_i describes the discrepancy between the bag's label and the instance labels. An instance in a positive bag with a higher score f(x_{ij}) gives higher confidence to the bag's label, even though there are some negative instances occurring in the positive bag. Therefore, the final classifier often classifies these bags as positive, and the variation problem in the training examples is reduced.
1
We refer the interested reader to [10] for more details.
[Fig. 3: positive and negative samples with their clustering centers; panels (a)-(e).]
s^b_{i,j} = \begin{cases} 1/n - 1/n_c & \text{if } y_i = y_j = \omega_c \\ 1/n & \text{if } y_i \neq y_j \end{cases}, \qquad s^w_{i,j} = \begin{cases} 1/n_c & \text{if } y_i = y_j = \omega_c \\ 0 & \text{if } y_i \neq y_j \end{cases}     (6)

where n_c is the cardinality of the \omega_c class. The pairwise s^b_{i,j} and s^w_{i,j} try to keep within-class samples close (since s^w_{i,j} is positive and s^b_{i,j} is negative if y_i = y_j) and between-class sample pairs apart (since s^b_{i,j} is positive if y_i \neq y_j). The projection
According to Bayesian decision theory, if the class posterior satisfies p(\omega_1 | x) > p(\omega_2 | x) we naturally decide that the true label of x is \omega_1, and vice versa. Using Bayes' rule p(\omega | x) \propto p(x | \omega) p(\omega), the optimal decision plane is located where p(x | \omega_1) = p(x | \omega_2) when p(\omega_1) = p(\omega_2), and the Bayes error is p(error) = \int \min[p(x | \omega_1), p(x | \omega_2)] \, dx. However, p(x | \omega_c) is not directly available. To estimate p(x | \omega_c) accurately, a histogram needs a large number of bins obtained by uniform sampling [25,26]. We avoid estimating p(x | \omega_c) with uniform or rejection sampling. As demonstrated in Fig. 3(c), we consider a local region of the feature space, and the location midway between two modes is a natural decision plane. This decision plane approximately minimizes the Bayes error if p(\omega_1) = p(\omega_2). Algorithm 1 shows the graph embedding based decision stump.
Note that the number of decision planes is automatically decided.
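A rough sketch of how such a stump could be built on the projected one-dimensional feature: K-means locates the modes of each class, and decision planes are placed midway between neighboring modes of opposite classes. The projection itself, the cluster counts and the use of scikit-learn are assumptions for illustration, not the paper's exact algorithm.

import numpy as np
from sklearn.cluster import KMeans

def build_stump_thresholds(x_proj, labels, k_pos=3, k_neg=3):
    """Place decision planes between class modes on the 1D projection.

    x_proj -- projected feature values w^T x, shape (N,)
    labels -- +1 / -1 class labels, shape (N,)
    Returns sorted thresholds and the majority label of each interval.
    """
    x_proj = np.asarray(x_proj, dtype=float).reshape(-1, 1)
    labels = np.asarray(labels)
    centers = []
    for cls, k in ((1, k_pos), (-1, k_neg)):
        km = KMeans(n_clusters=k, n_init=10).fit(x_proj[labels == cls])
        centers += [(c, cls) for c in km.cluster_centers_.ravel()]
    centers.sort()                            # sort modes along the manifold
    # a decision plane lies midway between adjacent modes of opposite class
    thresholds = [0.5 * (a + b) for (a, ca), (b, cb) in
                  zip(centers, centers[1:]) if ca != cb]
    # label of each interval = class of the mode(s) it contains
    interval_labels = [centers[0][1]] + [cb for (a, ca), (b, cb) in
                                         zip(centers, centers[1:]) if ca != cb]
    return thresholds, interval_labels

def stump_predict(x_proj, thresholds, interval_labels):
    """Classify by locating each projected value in its interval."""
    idx = np.searchsorted(thresholds, np.asarray(x_proj, dtype=float))
    return np.asarray(interval_labels)[idx]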
5 Pedestrian Detection
To achieve fast pedestrian detection, we adopt the cascade detector structure [6]. Each stage is designed to achieve a high detection rate and a modest false positive rate. We combine K = 30 LMIBoost classifiers on HOG features in a rejection cascade. To exploit the discriminative ability of the HOG feature, we design 4 types of block features as shown in Fig. 4. In each cell, a 9-bin HOG feature is extracted, and the cell histograms are concatenated into a single histogram to represent the block feature. To obtain a modicum of illumination invariance, the feature is normalized with the L2 norm. The dimensions of the 4 feature types are 9, 18, 27 and 36, respectively. In total, 453 × 4 block HOG features can be computed from a single detection window.
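A simplified sketch of one block-level HOG feature (9 orientation bins per cell, cell histograms concatenated, L2-normalized); the gradient computation and cell layout here are generic, not the authors' integral-histogram implementation.

import numpy as np

def block_hog(gray_block, cells=(2, 2), bins=9):
    """9-bin HOG per cell, concatenated over the block and L2-normalized.

    gray_block -- 2D float array covering one block (e.g. 16 x 16 pixels)
    cells      -- cell grid inside the block (2 x 2 gives a 36-D feature)
    """
    gy, gx = np.gradient(gray_block.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned gradients
    ch, cw = gray_block.shape[0] // cells[0], gray_block.shape[1] // cells[1]
    feat = []
    for i in range(cells[0]):
        for j in range(cells[1]):
            m = mag[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            a = ang[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feat.append(hist)
    feat = np.concatenate(feat)
    return feat / (np.linalg.norm(feat) + 1e-6)           # L2 normalization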
Assuming that the i-th cascade stage is to be trained, we classify all possible detection windows on the negative training images with the cascade of the previous i−1 LMIBoost classifiers. The windows that are misclassified form the new negative training set, while the positive training samples do not change during bootstrapping. Let N_p^i and N_n^i be the cardinalities of the positive and negative training examples at the i-th stage. Considering the influence of asymmetric training data on the classifier, and computer RAM limitations, we constrain N_p^i and N_n^i to be approximately equal.
According to the “no free lunch” theorem, it is important to choose a suitable number of instances in a bag for training and detection. More instances in a bag represent more variation and improve the detection results, but also reduce the training and detection speed. We experimentally set 4 instances per bag for both training and detection. Each level of the cascade classifier is optimized to correctly detect at least 99% of the positive bags, while rejecting at least 40% of the negative bags.
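The resulting training loop can be summarized schematically as follows; train_stage and mine_negatives stand in for the LMIBoost stage training and the bootstrapping scan described above, and are assumptions rather than the authors' code.

def train_cascade(pos_bags, neg_images, train_stage, mine_negatives,
                  n_stages=30, min_det=0.99, min_rej=0.40):
    """Schematic cascade training with negative bootstrapping.

    train_stage(pos_bags, neg_bags, min_det, min_rej) -> stage classifier
    mine_negatives(cascade, neg_images, n)            -> hard negative bags
    """
    cascade = []
    # initial negatives: windows sampled from the person-free images
    negatives = mine_negatives(cascade, neg_images, n=len(pos_bags))
    for _ in range(n_stages):
        stage = train_stage(pos_bags, negatives, min_det, min_rej)
        cascade.append(stage)
        # rescan the negative images with the cascade built so far and keep
        # only windows it still accepts (false positives) as new negatives;
        # the positive bags are left unchanged during bootstrapping
        negatives = mine_negatives(cascade, neg_images, n=len(pos_bags))
        if not negatives:            # no false positives left: stop early
            break
    return cascade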
6 Experiments
To test our method, we perform experiments on two public datasets: INRIA [4] and VOC2006 [1]. The INRIA dataset contains 1239 pedestrian images (2478 with their left-right reflections) and 1218 person-free images for training. In the test set, there are 566 images containing pedestrians. The pedestrian images provided by the INRIA dataset have large variations (although most of them show a standing pose), different clothing and urban backgrounds. This dataset is very close to a real-life setting. The VOC2006 person detection subtask supplies 319 images with 577 people as the training set, and 347 images with 579 people as the validation set. 675 images with 1153 people are supplied as test data. Note that the VOC2006 person detection dataset contains various human activities, different stances and clothing. Some examples from the two datasets are shown in Fig. 8.
[Fig. 5 plot: miss rate vs. false positives per window (FPPW, log scale), comparing Dalal & Triggs (kernel SVM + HOG) [4], Zhu et al. (AdaBoost + HOG) [13], Dalal & Triggs (linear SVM + HOG) [4], Tuzel et al. (boosted covariance descriptors) [14], and our approach (LMIBoost + HOG).]
Fig. 5. Comparison results on INRIA dataset. Note that the curve of our detector is
generated by changing the number of cascade stages used.
[Fig. 6 plots: left — miss rate vs. false positives per window (FPPW), comparing FLDA with the graph embedding based decision stump; right — precision vs. recall.]
The performance results on INRIA show that the detector based on the graph embedding decision stump outperforms the detector based on FLDA (Fig. 6). Unlike other LUT weak classifiers [25,26], the bins of the decision stumps are decided automatically by the algorithm.
90% of the negative examples are rejected in the first five stages. The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. For the INRIA dataset, on average our method evaluates 10.05 HOG features per negative detection window. Densely scanning a 320 × 240 image with a scale factor of 0.8 and a 4-pixel step takes 150 ms on average on a PC with a 2.8 GHz CPU and 512 MB RAM, whereas 250 ms per 320 × 240 image is reported for Zhu et al.'s detector [13].
We have introduced multiple instance learning into pedestrian detection to cope with pose variations. A training example does not need to be well aligned, but is represented as a bag of instances. To efficiently utilize the histogram feature, a graph embedding based decision stump is proposed. This weak classifier guarantees fast detection and better discriminative ability. Promising performance is shown on INRIA and on the VOC2006 person detection subtask.
Using multiple instance learning makes the detector robust to pose and appearance variations. Theoretically, the more instances that are supplied, the more variation can be learned. Modeling the average relationship between the instance labels and the bag's label may be unsuitable when there are large numbers of instances in a positive bag. In future work, more experiments will be carried out to compare different ways of modeling this relationship.
Acknowledgements
References
1. Everingham, M., Zisserman, A., Williams, C.K.I., Gool, L.V.: The PASCAL Visual
Object Classes Challenge (VOC 2006) Results (2006),
http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
2. Gavrila, D.M.: Pedestrian detection from a moving vehicle. In: Vernon, D. (ed.)
ECCV 2000. LNCS, vol. 1843, pp. 37–49. Springer, Heidelberg (2000)
3. Bissacco, A., Yang, M., Soatto, S.: Detecting humans via their pose. In: Proc. NIPS
(2006)
4. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
Proc. CVPR, vol. I, pp. 886–893. IEEE, Los Alamitos (2005)
5. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV, 15–33
(2000)
6. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and
appearance. In: Proc. ICCV (2003)
7. Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In:
Proc. NIPS, pp. 570–576 (1998)
8. Dietterich, T., Lathrop, R., Lozano-Perez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 31–71 (1997)
9. Xu, X., Frank, E.: Logistic regression and boosting for labeled bags of instances.
In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056,
pp. 272–281. Springer, Heidelberg (2004)
10. Sugiyama, M.: Local Fisher discriminant analysis for supervised dimensionality reduction. In: Proc. ICML (2006)
11. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by components. IEEE Trans. PAMI 23, 349–360 (2001)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004)
13. Zhu, Q., Avidan, S., Yeh, M.C., Cheng, K.T.: Fast human detection using a cascade
of histograms of oriented gradients. In: Proc. CVPR, vol. 2, pp. 1491–1498. IEEE,
Los Alamitos (2006)
14. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: Proc. CVPR. IEEE, Los Alamitos (2007)
15. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
16. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: Proc.
CVPR, pp. 878–885. IEEE, Los Alamitos (2005)
17. Tran, D., Forsyth, D.A.: Configuration estimates improve pedestrian finding. In:
Proc. NIPS (2007)
18. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multi-
scale, deformable part model. In: Proc. CVPR. IEEE, Los Alamitos (2008)
19. Maron, O., Ratan, A.: Multiple-instance learning for natural scene classification.
In: Proc. ICML (1998)
20. Viola, P., Platt, J.C., Zhang, C.: Multiple instance boosting for object detection.
In: Proc. NIPS (2006)
21. Porikli, F.M.: Integral histogram: a fast way to extract histogram in cartesian
space. In: Proc. CVPR, pp. 829–836. IEEE, Los Alamitos (2005)
22. Liu, C., Shum, H.Y.: Kullback-leibler boosting. In: Proc. CVPR, pp. 587–594
(2003)
23. Laptev, I.: Improvements of object detection using boosted histograms. In: BMVC
(2006)
24. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple
features. In: Proc. CVPR. IEEE, Los Alamitos (2001)
25. Huang, C., Ai, H., Wu, B., Lao, S.: Boosting nested cascade detector for multi-view
face detection. In: Proc. ICPR. IEEE, Los Alamitos (2004)
26. Xiao, R., Zhu, H., Sun, H., Tang, X.: Dynamic cascades for face detection. In: Proc.
ICCV. IEEE, Los Alamitos (2007)
Object Detection from Large-Scale 3D Datasets
Using Bottom-Up and Top-Down Descriptors
University of Pennsylvania
{aiv,mordohai,kostas}@seas.upenn.edu
1 Introduction
Fig. 1. Car detection results from real LIDAR data. Cars are shown in false colors.
objects and clutter and Extended Gaussian Images (EGIs) [8] to ascertain the
presence of a target at the hypothesized locations. This scheme enables us to
process very large datasets with high precision and recall. Training requires
little effort, since the user has to click one point in each target object, which
is then automatically segmented from the scene. The remaining points are used
as negative examples. Spin images are computed on both positive and negative
examples. EGIs only need to be computed for the positive exemplars of the
training set, since they are used to align the bottom-up detection with the model
database. Accurate alignment estimates between similar but not identical objects
enable us to segment the target objects from the clutter. The contributions of
our work can be summarized as follows:
– The combination of bottom-up and top-down processing to detect potential
targets efficiently and verify them accurately.
– The capability to perform training on instances that come from the same
object category as the queries, but are not necessarily identical to the queries.
– Minimal user effort during training.
– Object detection for large-scale datasets captured in uncontrolled environ-
ments.
– Accurate segmentation of target objects from the background.
2 Related Work
In this section, we briefly overview related work on local and global 3D shape
descriptors and 3D object recognition focusing only on shape-based descriptors.
Research on appearance-based recognition has arguably been more active re-
cently, but is not directly applicable in our experimental setup.
Global shape descriptors include EGIs [8], superquadrics [9], complex EGIs
[10], spherical attribute images [11] and the COSMOS [12]. Global descriptors
are more discriminative since they encapsulate all available information. On the
other hand, they are applicable to single segmented objects and they are sensitive
to clutter and occlusion. A global representation in which occlusion is explicitly
handled is the spherical attribute image proposed by Hebert et al. [11].
A method to obtain invariance to rigid transformations was presented by Os-
ada et al. [13] who compute shape signatures for 3D objects in the form of shape
statistics, such as the distance between randomly sampled pairs of points. Liu
et al. [14] introduced the directional histogram model as a shape descriptor and
achieved orientation invariance by computing the spherical harmonic transform.
Kazhdan et al. [15] proposed a method to make several types of shape descrip-
tors rotationally invariant, via the use of spherical harmonics. Makadia et al. [16]
compute the rotational Fourier transform [17] to efficiently compute the corre-
lation between EGIs. They also propose the constellation EGI, which we use in
Section 5 to compute rotation hypotheses.
Descriptors with local support are more effective than global descriptors for
partial data and data corrupted by clutter. Stein and Medioni [18] combined sur-
face and contour descriptors, in the form of surface splashes and super-segments,
respectively. Spin images were introduced by Johnson and Hebert [7] and are
among the most popular such descriptors (See Section 4). Ashbrook et al. [19]
took a similar approach based on the pairwise relationships between triangles
of the input mesh. Frome et al. [20] extended the concept of shape contexts
to 3D. Their experiments show that 3D shape contexts are more robust to oc-
clusion and surface deformation than spin images but incur significantly higher
computational cost. Huber et al. [21] propose a technique to divide range scans
of vehicles into parts and perform recognition under large occlusions using spin
images as local shape signatures.
Local shape descriptors have been used for larger scale object recognition.
Johnson et al.[1] use PCA-compressed spin images and nearest neighbor search
to find the most similar spin images to the query. Alignment hypotheses are
estimated using these correspondences and a variant of the ICP algorithm [22] is
used for verification. Shan et al. [23] proposed the shapeme histogram projection
algorithm which can match partial objects by projecting the descriptor of the
query onto the subspace of the model database. Matei et al. [3] find potential
matches for spin images using locality sensitive hashing. Geometric constraints
are then used to verify the match. Ruiz-Correa et al. [5] addressed deformable
shape recognition via a two-stage approach that computes numeric signatures
(spin images) to label components of the data and then computes symbolic
signatures on the labels. This scheme is very effective, but requires extensive
manual labeling of the training data. Funkhouser and Shilane [24] presented a
shape matching system that uses multi-scale, local descriptors and a priority
queue that generates the most likely hypotheses first.
In most of the above methods, processing is mostly bottom-up, followed in
some cases by a geometric verification step. A top-down approach was proposed
by Mian et al. [4] who represent objects by 3D occupancy grids which can be
matched using a 4D hash table. The algorithm removes recognized objects from
the scene and attempts to recognize the remaining data until no additional
library object can be found.
Our method can detect cars in real scenes in the presence of clutter and
sensor noise. Very few of the papers mentioned above ([1,2,3,4,5]) present results
on real data. Among the ones that do, Matei et al. [3] classified cars that had
been previously segmented. Johnson et al. [1], Carmichael et al. [2] and Mian
et al. [4] show object detection from real scenes containing multiple objects. It
should be noted, however, that the number of objects in the scene is small and
that all objects were presented to the algorithm during training. Ruiz-Correa
et al. [5] are able to handle intra-class variation, at the cost of large manual
labeling effort. The goal of our work is more ambitious than [1,2,3,4,20,23] in
order to make more practical applications possible. Our algorithm is not trained
on exemplars identical to the queries, but on other instances from the same
class. This enables us to deploy the system on very large-scale datasets with
moderate training efforts, since we only have to label a few instances from the
object categories we are interested in.
3 Algorithm Overview
Our algorithm operates on 3D point clouds and entails a bottom-up and a top-
down module. The steps for annotation and training are the following:
1. The user selects one point on each target object.
2. The selected target objects are automatically extracted from the background.
3. Compute surface normals for all points¹ in both objects and background.
4. Compute spin images on a subset of the points for both objects and back-
ground and insert into spin image database DBSI (Section 4).
5. Compute an EGI for each object (not for the background). Compute con-
stellation EGI and density approximation. Insert into EGI database DBEGI
(Section 5).
Processing on test data is performed as follows:
1. Compute normals for all points and spin images on a subset of the points.
2. Classify spin images as positive (object) or negative (background) according
to their nearest neighbors in DBSI .
3. Extract connected components of neighboring positive spin images. Each
connected component is a query (object hypothesis).
4. Compute an EGI and the corresponding constellation EGI for each query.
5. For each query and model in DBEGI (Section 5):
(a) Compute rotation hypothesis using constellation EGIs.
(b) For each rotation hypothesis with low distance according to Section 5.3,
compute translation in frequency domain.
(c) Calculate the overlap between query and model.
6. If the overlap is above the threshold, declare positive detection (Section 5).
7. Label all points that overlap with each of the models of DBEGI after align-
ment as object points to obtain segmentation.
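To make the control flow of this test-time procedure concrete, the following sketch expresses the outer loop in Python. The helper callables (match_egi, overlap) and the 0.5 overlap threshold are assumptions standing in for the components described in Sections 4 and 5, not the authors' implementation.

```python
# Hypothetical skeleton of the test-time loop (steps 4-7 above). The two
# callables are placeholders for the EGI machinery of Section 5:
#   match_egi(query, model)      -> iterable of (rotation, translation) pairs (5a-5b)
#   overlap(query, model, R, t)  -> inlier ratio after alignment (5c)
def detect(queries, db_egi, match_egi, overlap, overlap_thresh=0.5):
    detections = []
    for q in queries:                      # each connected component of positive spin images
        for model in db_egi:
            best = max((overlap(q, model, R, t) for R, t in match_egi(q, model)),
                       default=0.0)
            if best > overlap_thresh:      # step 6: declare a positive detection
                detections.append((q, model, best))
                break                      # step 7: the points of q are labelled as object
    return detections
```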
4 Bottom-Up Detection
The goal of the bottom-up module is to detect potential target locations in the
point cloud with a bias towards high recall to minimize missed detections. Since
detection has to be performed on very large point clouds, we need a represen-
tation that can be computed and compared efficiently. To this end we use spin
images [7]. A spin image is computed in a cylindrical coordinate system defined
by a reference point and its corresponding normal. All points within this region
are transformed by computing α, the distance from the reference normal ray, and
β, the height above the reference normal plane. Finally, a 2D histogram of α and
β is computed and used as the descriptor. Due to integration around the normal
of the reference point, spin images are invariant to rotations about the normal.
This is not the case with 3D shape contexts [20] or EGIs (Section 5) for which
several rotation hypotheses have to be evaluated to determine a match. Since
¹ During normal computation, we also estimate the reliability of the normals, which is used to select reference points for the spin images.
Fig. 2. Left: spin image computation on real data. The blue circles delineate the cylin-
drical support region and the red vector is the normal at the reference point. Middle:
illustration of spin image computation. O is the reference point and n its normal. A
spin image is a histogram of points that fall into radial (α) and elevation (β) bins.
Right: the spin image computed for the point on the car.
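As an illustration of the descriptor just described, the sketch below computes a single spin image with NumPy. The 15×15 resolution and the 2 m support follow the values reported in Section 6; the binning details are our own assumptions rather than the authors' code.

```python
import numpy as np

def spin_image(points, ref_point, ref_normal, radius=2.0, height=2.0, bins=15):
    """Sketch of a spin image: a 2D histogram of (alpha, beta) coordinates of the
    points falling inside the cylindrical support of the reference point."""
    n = ref_normal / np.linalg.norm(ref_normal)
    d = points - ref_point
    beta = d @ n                                            # elevation above the normal plane
    alpha = np.linalg.norm(d - np.outer(beta, n), axis=1)   # distance to the normal ray
    keep = (alpha <= radius) & (np.abs(beta) <= height)     # cylindrical support region
    hist, _, _ = np.histogram2d(alpha[keep], beta[keep], bins=bins,
                                range=[[0.0, radius], [-height, height]])
    return hist
```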
Fig. 3. Left: input point cloud. Middle: Classification of spin images as target (blue)
and background (cyan). (Only the reference points are shown.) Right: target spin image
centers clustered into object hypotheses. Isolated target spin images are rejected.
The second stage of processing operates on the queries (clustered points with
normals) proposed by the bottom-up stage and verifies whether targets exist at
those locations. Spin images without geometric constraints are not discriminative
enough to determine the presence of a target with high confidence. Spin image
classification is very efficient, but only provides local evidence for the presence
of a potential part of a target and not for a configuration of parts consistent
with a target. For instance a row of newspaper boxes can give rise to a number
of spin images that are also found in cars, but cannot support a configuration
of those spin images that is consistent with a car. The top-down stage enforces
these global configuration constraints by computing an alignment between the
query and the database models using EGI descriptors.
Early research has shown that there is a unique EGI representation for any
convex object [27], which can be obtained by computing the density function
of all surface normals on the unit sphere. If the object is not convex, its shape
cannot be completely recovered from the EGI, but the latter is still a powerful
shape descriptor. The EGI does not require a reference point since the relative
positions of the points are not captured in the representation. This property
makes EGIs effective descriptors for our data in which a reference point cannot
be selected with guaranteed repeatability due to occlusion, but the distribution
of normals is fairly stable for a class of objects.
Fig. 4. Left: a database model of a car. Middle: illustration of an EGI in which points
are color-coded according to their density. Right: the corresponding constellation EGI.
using EGIs. We can use the constellation EGI to cue a more efficient distance
computation. Therefore, we avoid quantizing the orientations of the normals in
an EGI and do not treat it as an orientation histogram.
Instead of an exhaustive search using either spatial or Fourier methods, we use
a technique that generates discrete alignment hypotheses, which was originally
proposed in [16]. A constellation EGI records the locations of local maxima in the
distribution of normals in the EGI. We call these maxima stars, since they resem-
ble stars in the sky. An EGI and the corresponding constellation EGI for an object
can be seen in Fig. 4. Two constellation EGIs can be matched by sampling pairs of
stars that subtend the same angle on the sphere. Each sample generates matching
hypotheses with two stars of the other EGI. If the angles between each pair are
large enough and similar, a rotation hypothesis for the entire descriptor is gener-
ated. Note that a correspondence between two pairs of stars produces two possible
rotations. Similar rotations can be clustered to reduce the number of hypotheses
that need to be tested. The resulting set of rotations are evaluated based on the
distance between the entire EGIs and not just the stars.
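A minimal sketch of this hypothesis-generation step is given below, assuming the stars of each constellation EGI are available as unit vectors. The 30° and 5° thresholds follow Section 6; using a Kabsch-style fit to turn a pair-to-pair correspondence into a rotation is our own simplification, not necessarily the authors' procedure.

```python
import numpy as np

def pair_rotations(stars_a, stars_b, min_angle=np.radians(30), tol=np.radians(5)):
    """For pairs of stars (unit normals of EGI maxima) subtending similar, large
    enough angles on both spheres, return rotations mapping the pair of A onto
    the pair of B."""
    def angle(u, v):
        return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    hypotheses = []
    for i in range(len(stars_a)):
        for j in range(i + 1, len(stars_a)):
            a_ij = angle(stars_a[i], stars_a[j])
            if a_ij < min_angle:
                continue
            for k in range(len(stars_b)):
                for l in range(len(stars_b)):
                    if k == l or abs(angle(stars_b[k], stars_b[l]) - a_ij) > tol:
                        continue
                    # Kabsch-style fit of the two-vector correspondence (our simplification)
                    H = np.outer(stars_b[k], stars_a[i]) + np.outer(stars_b[l], stars_a[j])
                    U, _, Vt = np.linalg.svd(H)
                    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
                    hypotheses.append(U @ D @ Vt)
    return hypotheses
```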
Dj = V † Di , (1)
where Dj are the coefficients at the sparse set of normals Ns , and dmax is the
range of the kernel function. Using this new representation, we can compute
the distance between two EGIs, using a sparse set of samples, after applying a
rotation hypothesis. If the two shapes are identical, the density values should
be equal over the entire sphere. We measure the deviation from an ideal match
by predicting the density on the samples of one EGI using the interpolation
Fig. 5. Alignment of a database model (left car and left EGI) and a query (right car
and right EGI). The car models are shown separately for clarity
of the visualization. Notice the accuracy of the rotation estimation. The query has been
segmented by the positive spin image clustering algorithm and the model by removing
the ground after the user specified one point.
function of the other EGI and comparing them with the original density values.
Specifically we use the l1 distance computed at the query points which we can
now interpolate once the normals Ns are rotated according to each hypothesized
rotation. The minimum distance provides an estimate of the best rotation to
align the two objects, but no estimate of translation and most importantly no
indication of whether the objects actually match. Typically, 1-5 rotations are
close enough to the minimum distance. For these, we estimate the translation
and compute the final distance in the following section.
Given the few best rotation hypotheses from Section 5.3, we compute the
translation that best aligns the two models in the frequency domain. We adopt
the translation estimation method of [16] in which translation is estimated us-
ing a Fourier transform in R3 . This is less sensitive to noise in the form of
missing parts or clutter than global alignment methods that estimate com-
plete rigid transformations in the Fourier domain. We begin by voxelizing the
model and the query to obtain binary occupancy functions in 3D. We then com-
pute their convolution efficiently using the FFT and take the maximum as our
translation.
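A minimal sketch of this translation step, under the assumption of a fixed cubic voxel grid, might look as follows; the grid size and voxel pitch are illustrative choices, not values from the paper.

```python
import numpy as np

def fft_translation(query_pts, model_pts, voxel=0.2, size=64):
    """Voxelize both point sets into binary occupancy grids and pick the shift
    that maximises their correlation, computed with the FFT."""
    origin = np.minimum(query_pts.min(axis=0), model_pts.min(axis=0))
    def voxelize(pts):
        idx = np.floor((pts - origin) / voxel).astype(int)
        grid = np.zeros((size, size, size))
        ok = np.all((idx >= 0) & (idx < size), axis=1)
        grid[tuple(idx[ok].T)] = 1.0
        return grid
    Q, M = voxelize(query_pts), voxelize(model_pts)
    # correlation(Q, M) = IFFT( FFT(Q) * conj(FFT(M)) )
    corr = np.real(np.fft.ifftn(np.fft.fftn(Q) * np.conj(np.fft.fftn(M))))
    shift = np.array(np.unravel_index(np.argmax(corr), corr.shape))
    shift = np.where(shift > size // 2, shift - size, shift)  # wrap to signed offsets
    return shift * voxel  # translation that moves the model onto the query
```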
Finally, we need a measure of distance to characterize the quality of the align-
ment that is flexible enough to allow for deformation between the query and the
model. We experimented with the ICP distance [22], without performing ICP
iterations, but found the overlap between the query and model to be more ef-
fective because the quantization in the translation estimation caused large ICP
distance errors even though the models were similar. The overlap is computed as
the inlier ratio over all points of the model and query, where an inlier is a point
with a neighboring point from the other model that is closer than a threshold
distance and whose normal is similar to that of the point under consideration.
Figure 5 shows an alignment between a query and a database object and their
corresponding EGIs. Selecting the overlap points after alignment results in pre-
cise segmentation of the object from the background.
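The overlap measure can be sketched as below, assuming both point sets carry unit normals and have already been aligned with the estimated rotation and translation; the 30 cm and 35° thresholds are those reported in Section 6.

```python
import numpy as np
from scipy.spatial import cKDTree

def overlap(pts_a, nrm_a, pts_b, nrm_b, dist_thresh=0.30, angle_thresh=np.radians(35)):
    """Fraction of points, over both aligned models, that have a neighbour in the
    other model within dist_thresh whose (unit) normal deviates by less than
    angle_thresh from their own."""
    def inliers(p, n, q_pts, q_nrm):
        d, idx = cKDTree(q_pts).query(p)                     # nearest neighbour in the other model
        normal_ok = np.sum(n * q_nrm[idx], axis=1) > np.cos(angle_thresh)
        return np.sum((d < dist_thresh) & normal_ok)
    total = len(pts_a) + len(pts_b)
    return (inliers(pts_a, nrm_a, pts_b, nrm_b) + inliers(pts_b, nrm_b, pts_a, nrm_a)) / total
```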
Fig. 6. Left: The precision-recall curve for car detection on 200 million points con-
taining 1221 cars. (Precision is the x-axis and recall the y-axis.) Right: Screenshot of
detected cars. Cars are in random colors and the background in original colors.
Fig. 7. Screenshots of detected cars, including views from above. (There is a false negative at the bottom of the left image.) Best viewed in color.
6 Experimental Results
We processed very large-scale point clouds captured by a moving vehicle equipped
with four range scanners and precise Geo-location sensors. The dataset consists of
about 200 million points, 2.2 million of which were used for training. The training
set included 17 cars which were selected as target objects. We compute 81,172 spin
images for the training set (of which 2657 are parts of cars) and 6.1 million for the
test set. Each spin image has a 15×15 resolution computed in a cylindrical support
region with height and radius both set to 2m. Reference points for the spin images
are selected as in Section 4 with an average distance between vertices of 0.4m. The
spin images of the training set are inserted into DBSI .
EGIs are computed for each target object in the training set and approximated
by picking a smaller set of 200 normals that minimizes the interpolation error on
all samples. The approximated EGIs are inserted into DBEGI, which is a simple
list with 17 entries. Since our method only requires very few representatives from
each class, we were able to perform the experiments using a few sedans, SUVs
and vans as models.
The query grouping threshold is set to 1 m (Section 4). This groups points roughly
up to two grid positions away. The EGI matching thresholds (Section 5.2) are set
as follows: each pair of stars must subtend an angle of at least 30◦ and the
two angles must not differ by more than 5◦. Rotations that meet these requirements
are evaluated according to Section 5.3. For the best rotation hypotheses, the metric
used to make the final decision is computed: the percentage of inliers on both models
after alignment. For a point to be an inlier there has to be at least one other point
from the other model that is within 30cm and whose normal deviates by at most
35◦ from the normal of the current point. We have found the inlier fraction to be
more useful than other distance metrics.
Results on a test area comprising 220 million points and 1221 cars are shown
in Figs. 6 and 7. After bottom-up classification there were approximately 2200
detections of which about 1100 were correct. The top-down step removes about
1200 false positives and 200 true positives. The precision-recall curve as the inlier
threshold varies for the full system is shown in Fig. 6. For the point marked with
a star, there are 905 true positives, 74 false positives and 316 false negatives
(missed detections) for a precision of 92.4% and a recall of 74.1%.
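These figures are consistent with the usual definitions, as the following quick check shows.

```python
# Sanity check of the reported operating point.
tp, fp, fn = 905, 74, 316
precision = tp / (tp + fp)   # 0.9244..., i.e. 92.4%
recall = tp / (tp + fn)      # 0.7412..., i.e. 74.1%
```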
7 Conclusion
We have presented an approach for object detection from 3D point clouds that
is applicable to very large datasets and requires limited training efforts. Its effec-
tiveness is due to the combination of bottom-up and top-down mechanisms to
hypothesize and test locations of potential target objects. An application of our
method on car detection has achieved very satisfactory precision and recall on an
area far larger than the test area of any previously published method. Moreover,
besides a high detection rate, we are able to accurately segment the objects of
interest from the background. We are not aware of any other methodology that
obtains comparable segmentation accuracy without being trained on the same
instances that are being segmented.
A limitation of our approach that we intend to address is that search is linear in
the number of objects in the EGI database. We are able to achieve satisfactory
results with a small database, but sublinear search is a necessary enhancement
to our algorithm.
Acknowledgments
This work is partially supported by DARPA under the Urban Reasoning and
Geospatial ExploitatioN Technology program and is performed under National
Geospatial-Intelligence Agency (NGA) Contract Number HM1582-07-C-0018.
The ideas expressed herein are those of the authors, and are not necessarily
endorsed by either DARPA or NGA. This material is approved for public release;
distribution is unlimited.
The authors are also grateful to Ioannis Pavlidis for his help in labeling the
ground truth data.
References
1. Johnson, A., Carmichael, O., Huber, D., Hebert, M.: Toward a general 3-d matching
engine: Multiple models, complex scenes, and efficient data filtering. In: Image
Understanding Workshop, pp. 1097–1108 (1998)
2. Carmichael, O., Huber, D., Hebert, M.: Large data sets and confusing scenes in
3-d surface matching and recognition. In: 3DIM, pp. 358–367 (1999)
3. Matei, B., Shan, Y., Sawhney, H.S., Tan, Y., Kumar, R., Huber, D., Hebert, M.:
Rapid object indexing using locality sensitive hashing and joint 3d-signature space
estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(7), 1111–
1126 (2006)
4. Mian, A., Bennamoun, M., Owens, R.: Three-dimensional model-based object
recognition and segmentation in cluttered scenes. IEEE Trans. Pattern Analysis
and Machine Intelligence 28(10), 1584–1601 (2006)
5. Ruiz-Correa, S., Shapiro, L.G., Meila, M., Berson, G., Cunningham, M.L., Sze, R.W.:
Symbolic signatures for deformable shapes. IEEE Trans. on Pattern Analysis and
Machine Intelligence 28(1), 75–90 (2006)
6. Frueh, C., Jain, S., Zakhor, A.: Data processing algorithms for generating textured
3d building facade meshes from laser scans and camera images. IJCV 61(2), 159–
184 (2005)
7. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in clut-
tered 3d scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(5),
433–449 (1999)
8. Horn, B.: Extended gaussian images. Proceedings of the IEEE 72(12), 1656–1678
(1984)
9. Solina, F., Bajcsy, R.: Recovery of parametric models from range images: The
case for superquadrics with global deformations. IEEE Transactions on Pattern
Analysis and Machine Intelligence 12(2), 131–147 (1990)
10. Kang, S., Ikeuchi, K.: The complex egi: A new representation for 3-d pose deter-
mination. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(7),
707–721 (1993)
11. Hebert, M., Ikeuchi, K., Delingette, H.: A spherical representation for recognition
of free-form surfaces. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 17(7), 681–690 (1995)
12. Dorai, C., Jain, A.K.: Cosmos: A representation scheme for 3d free-form objects.
IEEE Trans. on Pattern Analysis and Machine Intelligence 19(10), 1115–1130
(1997)
13. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM
Transactions on Graphics 21(4) (2002)
14. Liu, X., Sun, R., Kang, S.B., Shum, H.Y.: Directional histogram model for three-
dimensional shape similarity. In: Int. Conf. on Computer Vision and Pattern Recog-
nition (2003)
15. Kazhdan, M., Funkhouser, T., Rusinkiewicz, S.: Rotation invariant spherical har-
monic representation of 3D shape descriptors. In: Symposium on Geometry Pro-
cessing (2003)
16. Makadia, A., Patterson, A.I., Daniilidis, K.: Fully automatic registration of 3d
point clouds. In: Int. Conf. on Computer Vision and Pattern Recognition, vol. I,
pp. 1297–1304 (2006)
17. Driscoll, J., Healy, D.: Computing fourier transforms and convolutions on the 2-
sphere. Advances in Applied Mathematics 15, 202–250 (1994)
18. Stein, F., Medioni, G.: Structural hashing: Efficient three dimensional object recog-
nition. IEEE Trans. on Pattern Analysis and Machine Intelligence 14(2), 125–145
(1992)
19. Ashbrook, A., Fisher, R., Robertson, C., Werghi, N.: Finding surface correspon-
dence for object recognition and registration using pairwise geometric histograms.
In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 674–686.
Springer, Heidelberg (1998)
20. Frome, A., Huber, D., Kolluri, R., Bulow, T., Malik, J.: Recognizing objects in
range data using regional point descriptors. In: Pajdla, T., Matas, J(G.) (eds.)
ECCV 2004. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004)
21. Huber, D., Kapuria, A., Donamukkala, R., Hebert, M.: Parts-based 3d object clas-
sification. In: Int. Conf on Computer Vision and Pattern Recognition, vol. II, pp.
82–89 (2004)
22. Besl, P.J., McKay, N.D.: A method for registration of 3-d shapes. IEEE Trans. on
Pattern Analysis and Machine Intelligence 14(2), 239–256 (1992)
23. Shan, Y., Sawhney, H.S., Matei, B., Kumar, R.: Shapeme histogram projection
and matching for partial object recognition. IEEE Trans. on Pattern Analysis and
Machine Intelligence 28(4), 568–577 (2006)
24. Funkhouser, T., Shilane, P.: Partial matching of 3d shapes with priority-driven
search. In: Symposium on Geometry Processing (2006)
25. Medioni, G., Lee, M., Tang, C.: A Computational Framework for Segmentation
and Grouping. Elsevier, New York (2000)
26. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal
algorithm for approximate nearest neighbor searching. Journ. of the ACM 45, 891–
923 (1998)
27. Smith, D.A.: Using enhanced spherical images. Technical Report AIM-530. MIT
(1979)
28. Carr, J.C., Beatson, R.K., Cherrie, J.B., Mitchell, T.J., Fright, W.R., McCallum,
B.C., Evans, T.R.: Reconstruction and representation of 3d objects with radial
basis functions. In: SIGGRAPH, pp. 67–76. ACM, New York (2001)
Making Background Subtraction Robust to
Sudden Illumination Changes
J. Pilet, C. Strecha, and P. Fua
1 Introduction
Background subtraction is a critical component of many applications, ranging
from video surveillance to augmented reality. State-of-the-art algorithms can
handle progressive illumination changes but, as shown in Fig. 1, remain vulner-
able to sudden changes. Shadows cast by moving objects can easily be misinter-
preted as additional objects.
This is especially true of approaches [2,3,4,1] that rely on statistical back-
ground models that are progressively updated as time goes by. They can handle
both illumination effects and moving background elements, such as tree leaves or
flowing water. This is an obvious strength, but can result in mistakenly integrat-
ing foreground elements into the background model. This is a potentially serious
problem in surveillance applications: a forgotten piece of luggage could accidentally
become part of the background. Furthermore, the model update is usually relatively
slow, making it difficult to rapidly adjust to sudden illumination changes and to
shadows cast by moving objects.
Here, we propose an approach that overcomes this problem by replacing the
statistical background model by a statistical illumination model. More specifi-
cally, we model the ratio of intensities between a stored background image and
an input image in all three channels as a Gaussian Mixture Model (GMM) that
accounts for the fact that different parts of the scene can be affected in differ-
ent ways. We incorporate this GMM in an efficient probabilistic framework that
Fig. 1. Top row: Three very different input images and a model image of the same
scene. The changes are caused by lights being turned on one after the other and the
person moving about. Bottom row: Our algorithm successfully segments out the person
in all three input images. The rightmost image depicts the completely wrong output of
a state-of-the-art approach [1] applied on the third image.
accounts for texture, background illumination, and foreground colour clues. Its
parameters are computed by Expectation Maximization (EM) [5].
This approach reflects our key insight that, assuming that the background is
static, changes in intensity of non-occluded pixels are mainly caused by illumi-
nation effects that are relatively global: They are not the same in all parts of
the image but typically affect similarly whole portions of the image as opposed
to individual pixels. As a result, they can be modelled using GMMs with only
a few components—2 in the experiments presented in this paper—which leads to
a very robust algorithm.
We will demonstrate that our algorithm outperforms state-of-the-art back-
ground subtraction techniques when illumination changes quickly. The key
difference between these techniques and ours is that they directly estimate dis-
tributions of pixel intensities as opposed to illumination effects as we do. We
will also show that our approach performs well in an Augmented Reality context
where a moving object is treated as the background from which occluders such
as the hands holding it must be segmented out.
2 Related Work
3 Method
Our method can serve in two different contexts: background subtraction, where
both the scene and the camera are static, and augmented reality applications,
where an object moves in the camera's field of view and occlusions have to be
segmented for realistic augmentation.
Let us assume that we are given an unoccluded model image of a background
scene or an object. Our goal is to segment the pixels of an input image into two
parts, those that belong to the same object in both images and those that are
occluded. If we are dealing with a moving object, we first need to register the
input image and create an image that can be compared to the model image
pixelwise. In this work, we restrict ourselves to planar objects and use publicly
available software [13] for registration. If we are dealing with a static scene
and camera, that is, if we are performing standard background subtraction,
registration is not necessary. This is the only difference between the two contexts;
the rest of the method is common to both. In both cases, the intensity and colour of
individual pixels are affected mostly by illumination changes and the presence
of occluding objects.
Changes due to illumination effects are highly correlated across large portions
of the image and can therefore be represented by a low dimensional model that
accounts for variations across the whole image. In this work, we achieve this by
representing the ratio of intensities between the stored background image and an
input image in all three channels as a Gaussian Mixture Model (GMM) that has
very few components—2 in all the experiments shown in this paper. This is in
stark contrast with more traditional background subtraction methods [2,3,4,1]
that introduce a model for each pixel and do not explicitly account for the fact
that inter-pixel variations are correlated.
Following standard practice [14], we model the pixel colours of occluding ob-
jects, such as people walking in front of the camera, as a mixture of Gaussian
and uniform distributions.
To fuse these clues, we model the whole image — background, foreground and
shadows — with a single mixture of distributions. In our model, each pixel is
drawn from one of five distributions: Two Gaussian kernels account for illumi-
nation effects, and two more Gaussians, completed by a uniform distribution,
represent the foreground. An Expectation Maximization algorithm assigns pix-
els to one of the five distributions (E-step) and then optimizes the distributions'
parameters (M-step).
Since illumination changes preserve texture whereas occluding objects radi-
cally change it, the correlation between image patches in the model and input
images provides a hint as to whether pixels are occluded or not in the latter,
especially where there is enough texture.
In order to lower the computational burden, we assume pixel independence.
Since this strong assumption entails the loss of the relation between a pixel
and its neighbors, it makes it impossible to model texture. However, to circum-
vent this issue, we characterize each pixel of the input image by a five dimen-
sional feature vector: The usual red, green, and blue values plus the normalized
cross-correlation and texturedness values. Feature vectors are then assumed in-
dependent, allowing an efficient maximization of a global image likelihood, by
optimizing the parameters of our mixture. In the remainder of this section, we
introduce in more details the different components of our model.
First, we consider the background model, which is responsible for all pixels that
have a counterpart in the model image m. If a pixel ui of the input image u
shows the occlusion free target object, the luminance measured by the camera
depends on the light reaching the surface (the irradiance ei ) and on its albedo.
Irradiance ei is a function of the visible light sources and of the surface normal. Under
the Lambertian assumption, the pixel value ui is ui = ei ai , where ai is the
albedo of the target object at the location pointed by ui . Similarly, we can
write mi = em ai , with em assumed constant over the surface. This assumption
is correct if the model image m has been taken under uniform illumination, or
if a textured model free of illumination effects is available. Combining the above
equations yields:
li = ui / mi = ei / em ,
which does not depend on the surface albedo. It depends on the surface orienta-
tion and on the illumination environment. In the specific case of a planar surface
lit by distant light sources and without cast shadows, this ratio can be expected
to be constant for all i [9]. In the case of a 3 channel colour camera, we can write
the function li that computes a colour illumination ratio for each colour band:
li = [ ui,r / mi,r ,  ui,g / mi,g ,  ui,b / mi,b ]^T ,
where the additional indices r, g, b denote the red, green and blue channels of
pixel ui , respectively.
In our background illumination model we suppose that the whole scene can be
described by K different illumination ratios that correspond to areas in ui with
different orientations and/or possible cast shadows. Each area is modelled by a
Gaussian distribution around the illumination ratio μk and with full covariance
Σk . Furthermore we introduce a set of binary latent variables xi,k that take the
value 1 iff pixel i belongs to Gaussian k and 0 otherwise. Then, the probability
of the ratio li is given by:
p(li | xi , μ, Σ) = ∏_{k=1}^{K} πk^{x_{i,k}} N(li ; μk , Σk)^{x_{i,k}} ,    (1)
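For a pixel assigned to component k (x_{i,k} = 1), the background term therefore reduces to πk N(li ; μk , Σk) evaluated at the colour ratio li. The sketch below computes these ratios and the Gaussian density with NumPy; it is an illustration of Eq. 1 under the stated model, not the authors' code.

```python
import numpy as np

def illumination_ratios(u, m, eps=1e-6):
    """Per-pixel colour illumination ratios l_i = u_i / m_i for H x W x 3 float
    images u (input) and m (background model); returns an (N, 3) array."""
    return u.reshape(-1, 3) / np.maximum(m.reshape(-1, 3), eps)

def gaussian_density(x, mu, sigma):
    """Row-wise multivariate normal density N(x; mu, sigma); pi_k times this value,
    evaluated at the ratios, is the background term of Eq. 1 for component k."""
    d = x - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ij,jk,ik->i', d, inv, d)
    norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm
```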
The overall model consists of the background (Eq. 1) and the foreground (Eq. 2)
models. Our latent variables xi select the one distribution among the total K+K̄+1
components which is active for pixel i. Consider figures 2(a) and 2(b) for example:
The background pixels could be explained by K = 2 illumination ratios, one
for the cast shadow and one for all other background pixels. The hand in the
foreground could be modelled by the skin colour and the black colour of the shirt
(K̄ = 2). The example in Fig. 2 shows clearly that the importance of the latent
variable components is not equal. In practice, there is often one Gaussian which
models a global illumination change, i.e. most pixels are assigned to this model
by the latent variable component xi,k. To account for the possibly changing
importance, we have introduced weights πk that globally weight the contribution of all
Gaussian components k = 1 . . . K + K̄ and the uniform distribution k = K + K̄ + 1.
A formal expression of our model requires combining the background pdf of
Eq. 1 and the foreground pdf of Eq. 2. However, one is defined over illumination,
whereas the other over pixel colour, making direct probabilities incompatible.
We therefore express the background model as a function of pixel colour instead
of illumination:
p(ui | xi , μ, Σ) = (1 / |Ji|) p(li | xi , μ, Σ) ,    (3)
where | Ji | is the determinant of the Jacobian of function li (ui ). Multiplying
this equation with Eq. 2 composes the complete colour pdf.
Some formulations define an appropriate prior model on the latent variables
x. Such a prior model would incorporate the prior belief that the model se-
lection x shows spatial [14] and spatio-temporal [12] correlations. These priors
on the latent variable x have been shown to improve the performance of many vi-
sion algorithms [15]. However, they increase the complexity and slow down the
computation substantially. To circumvent this, we propose in the next section
a spatial likelihood model, which can be seen as a model to capture the spatial
nature of pixels and which allows real-time performance.
Fig. 2. Elements of the approach. (a) Background image m. (b) Input image u. (c)
Textureness image f 2 . (d) Correlation image f 1 . (e) Probability of observing f on the
background, according to the histogram h(fi | vi ). (f) Probability of observing f on the
foreground, according to the histogram h̄(fi | v̄i ).
fi^2 = √( Σ_{j∈wi} (uj − ūi)^2 ) + √( Σ_{j∈wi} (mj − m̄i)^2 ) .
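The two texture clues can be computed as sketched below: f1 is the normalised cross-correlation between corresponding patches (its exact definition is not reproduced in the surviving text, so the NCC used here is an assumption) and f2 is the texturedness defined above. Greyscale float images and an explicit loop are used purely for clarity.

```python
import numpy as np

def pixel_features(u, m, w=3):
    """f1: normalised cross-correlation of the w x w patches of u and m around each
    pixel (assumed definition); f2: the texturedness defined above."""
    h, wd = u.shape
    f1 = np.zeros((h, wd))
    f2 = np.zeros((h, wd))
    r = w // 2
    for y in range(r, h - r):
        for x in range(r, wd - r):
            pu = u[y - r:y + r + 1, x - r:x + r + 1]
            pm = m[y - r:y + r + 1, x - r:x + r + 1]
            du, dm = pu - pu.mean(), pm - pm.mean()
            f2[y, x] = np.sqrt(np.sum(du ** 2)) + np.sqrt(np.sum(dm ** 2))
            denom = np.linalg.norm(du) * np.linalg.norm(dm)
            if denom > 0:
                f1[y, x] = np.sum(du * dm) / denom
    return f1, f2
```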
Fig. 3. Joint correlation and texturedness distributions over background and fore-
ground pixels
Having defined the illumination and the spatial likelihood models, we are now
in a position to describe the Maximum Likelihood (ML) estimation of the
combined model. Let θ = {μ, Σ, π} denote the vector of all unknowns. The ML
estimate θ̃ is given by:
θ̃ = arg max_θ  log Σ_x p(u, f , x | θ)    (4)
E-Step. On the (t + 1)th iteration, the conditional expectation b^{t+1} of the log-
likelihood w.r.t. the posterior p(x | u, θ) is computed in the E-step. By construc-
tion, i.e. by the pixel independence, this leads to a closed-form solution for the
latent variable expectations bi , which are often called beliefs. Note that in other
formulations, where the spatial correlation is modelled explicitly, the E-step re-
quires graph-cut optimisation [14] or other iterative approximations like mean
field [15]. The update equations for the expected values bi,k of xi,k are given by:
b_{i,k}^{t+1} = (1/N) (1/|Ji|) πk N(li ; μ_k^t , Σ_k^t) h(fi | vi) ,        k = 1 . . . K ,    (6)
b_{i,k}^{t+1} = (1/N) πk N(ui ; μ_k^t , Σ_k^t) h̄(fi | v̄i) ,               k = K+1 . . . K+K̄ ,    (7)
b_{i,K+K̄+1}^{t+1} = (1/N) π_{K+K̄+1} (1/256³) h̄(fi | v̄i) ,
where N = Σ_k b_{i,k}^{t+1} normalises the beliefs b_{i,k} to one. The first line in Eq. 6
M-Step. Given the beliefs b_{i,k}^{t+1}, the M-step maximises the log-likelihood by
replacing the binary latent variables x_{i,k} by their expected values b_{i,k}^{t+1}.
μ_k^{t+1} = (1/Nk) Σ_{i=1}^{N} b_{i,k}^{t+1} li   if k ≤ K ,        μ_k^{t+1} = (1/Nk) Σ_{i=1}^{N} b_{i,k}^{t+1} ui   otherwise ,    (8)

where Nk = Σ_{i=1}^{N} b_{i,k}^{t+1} . Similarly, we obtain:

Σ_k^{t+1} = (1/Nk) Σ_{i=1}^{N} b_{i,k}^{t+1} (li − μk)(li − μk)^T   if k ≤ K ,        Σ_k^{t+1} = (1/Nk) Σ_{i=1}^{N} b_{i,k}^{t+1} (ui − μk)(ui − μk)^T   otherwise ,    (9)

π_k^{t+1} = Nk / Σ_k Nk .    (10)
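One EM iteration restricted to the background components (k ≤ K) can be sketched as follows; for brevity the foreground components, the texture histograms h and h̄, and the 1/|Ji| factor of Eq. 6 are omitted, so this is a simplified illustration of Eqs. 6 and 8–10 rather than the full model.

```python
import numpy as np

def em_step(l, pi, mu, sigma):
    """One EM iteration for the background components only (k <= K); l is an
    (N, 3) array of illumination ratios, pi (K,), mu (K, 3), sigma (K, 3, 3)."""
    N, K = l.shape[0], len(pi)
    resp = np.zeros((N, K))
    for k in range(K):                                   # E-step, cf. Eq. 6 (histogram and
        d = l - mu[k]                                    # Jacobian factors omitted here)
        inv = np.linalg.inv(sigma[k])
        quad = np.einsum('ij,jk,ik->i', d, inv, d)
        norm = np.sqrt((2 * np.pi) ** 3 * np.linalg.det(sigma[k]))
        resp[:, k] = pi[k] * np.exp(-0.5 * quad) / norm
    resp /= resp.sum(axis=1, keepdims=True)              # beliefs b_{i,k}
    Nk = resp.sum(axis=0)
    mu_new = (resp.T @ l) / Nk[:, None]                  # M-step, Eq. 8
    sigma_new = np.stack([(resp[:, k, None] * (l - mu_new[k])).T @ (l - mu_new[k]) / Nk[k]
                          for k in range(K)])            # Eq. 9
    pi_new = Nk / Nk.sum()                               # Eq. 10
    return resp, pi_new, mu_new, sigma_new
```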
4 Results
In this section, we show results on individual frames of video sequences that
feature both sudden illumination changes and shadows cast by occluding ob-
jects. We also compare those results to those produced by state-of-the-art tech-
niques [1,11].
Fig. 4. Segmenting the light switch test images from [16]. (a) Background model.
(b) Test image. (c) Manually segmented ground truth. (d) The output of Zivkovic’s
method [1]. (e) Result published in [11], using an approach based on local binary pat-
terns. (f) Our result, obtained solely by comparing (a) and (b). Unlike the other two
methods, we used no additional video frames.
the rate at which the background adapts, but, as shown in Fig. 5(b), it results
in the sleeve being lost. By contrast, by explicitly reevaluating the illumination
parameters at every frame, our algorithm copes much better with this situation,
as shown in Fig. 5(c). To compare these two methods independently of specific
parameter choices, we computed the ROC curve of Fig. 5(d). We take precision
to be the number of pixels correctly tagged as foreground divided by the total
number of pixels marked as foreground and recall to be the number of pixels
correctly tagged as foreground divided by the number of foreground pixels in the ground
truth. The curve is obtained by binarizing using different thresholds for the prob-
ability of Eq. 11. We also represent different runs of [1] by crosses corresponding
to different choices of its learning rate and the decision threshold. As expected,
our method exhibits much better robustness towards illumination effects.
Fig. 1 depicts a sequence with even more drastic illumination changes that
occur when the subject turns on one light after the other. The GMM-based
method [1] immediately reacts by classifying most of the image as foreground. By
contrast, our algorithm correctly compares the new images with the background
image, taken to be the average of the first 25 frames of the sequence.
Fig. 4 shows the light switch benchmark of [16]. We again built the background
representation by averaging 25 consecutive frames showing the room with the
light switched off. We obtain good results when comparing it to an image where
the light is turned on even though, unlike the other algorithms [1,11], we use a
single frame instead of looking at the whole video. The foreground recall of 82%
that appears in [11] entails a precision of only 25%, whereas our method achieves
Fig. 5. Segmenting the hand of Fig. 2(b). (a) Result of [1] when the background model
adjusts too slowly to handle a quick illumination change. (b) When the background
model adjusts faster. (c) Our result. (d) ROC curve for our method obtained by varying a threshold
on the probability of Eq. 11. The crosses represent results obtained by [1] for different
choices of learning rate and decision threshold.
49% for the same recall. With default parameters, the algorithm of [1] cannot
handle this abrupt light change and yields a precision of 13% for a recall of 70%.
Finally, as shown in Fig. 6, we ran our algorithm on one of the PETS 2006
video sequences that features an abandoned luggage to demonstrate that our
technique is indeed appropriate for surveillance applications because it does not
lose objects by unduly merging them in the background.
Fig. 6. PETS 2006 Dataset. (a) Initial frame of the video, used as background model.
(b) Frame number 2800. (c) The background subtraction of [1]: The abandoned bag in
the middle of the scene has mistakenly been integrated into the background. (d) Our
method correctly segments the bag, the person who left after sitting in the bottom left
corner, and the chair that has been removed on the right.
Fig. 7. Occlusion segmentation on a moving object. (a) Input frame in which the card
is tracked. (b) Traditional background subtraction provides unsatisfactory results because
of the shadow cast by the hand, and because it learned the fingers hiding the bottom
left corner as part of the background. (c): Our method is far more robust and produces
a better segmentation. (d) We use its output as an alpha channel to convincingly draw
the virtual text and account for the occluding hand.
the following: A user holds an object that is detected and augmented. If the
detected pattern is occluded by a real object, the virtual object should also be
occluded. In order to augment only the pixels actually showing the pattern, a
visibility mask is required. Technically, any background subtraction technique
could produce it, by unwarping the input images in a reference frame, and by
rewarping the resulting segmentation back to the input frame.
The drastic illumination changes produced by quick rotation of the pattern
might hinder a background subtraction algorithm that has not been designed for
such conditions. That is why the Gaussian mixture based background subtraction
method of [1] has difficulty handling our test sequence, illustrated by figure 7.
On the other hand, the illumination modeling of our approach is able to handle
this situation well and, unsurprisingly, shows superior results. The quality of the
resulting segmentation we obtain allows convincing occluded augmented reality,
as illustrated by figure 7(d).
5 Conclusion
We presented a fast background subtraction algorithm that handles heavy illumi-
nation changes by relying on a statistical model, not of the pixel intensities, but
of the illumination effects. The optimized likelihood also fuses texture correlation
clues by exploiting histograms trained off-line.
We demonstrated the performance of our approach under drastic light changes
that state-of-the-art techniques have trouble handling.
Moreover, our technique can be used to segment the occluded parts of a mov-
ing planar object and therefore allows occlusion handling for augmented reality
applications.
Although we do not explicitly model spatial consistency, the learnt histograms
of correlation capture texture. Similarly, we could easily extend our
method by integrating temporal dependence using temporal features.
References
1. Zivkovic, Z., van der Heijden, F.: Efficient adaptive density estimation per image
pixel for the task of background subtraction. Pattern Recognition Letters 27(7),
773–780 (2006)
2. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking
of the human body. In: Photonics East, SPIE, vol. 2615 (1995)
3. Friedman, N., Russell, S.: Image segmentation in video sequences: A probabilistic
approach. In: Annual Conference on Uncertainty in Artificial Intelligence, pp. 175–
181 (1997)
4. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time
tracking. In: CVPR, pp. 246–252 (1999)
5. Bishop, C.: Pattern Recognition and Machine Learning. Springer, Heidelberg
(2006)
6. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.: Background and fore-
ground modeling using nonparametric kernel density for visual surveillance. Pro-
ceedings of the IEEE 90, 1151–1163 (2002)
7. Sheikh, Y., Shah, M.: Bayesian modeling of dynamic scenes for object detection.
PAMI 27, 1778–1792 (2005)
8. Prati, A., Mikic, I., Trivedi, M., Cucchiara, R.: Detecting moving shadows: Algo-
rithms and evaluation. PAMI 25, 918–923 (2003)
9. Stauder, J., Mech, R., Ostermann, J.: Detection of moving cast shadows for object
segmentation. IEEE Transactions on Multimedia 1(1), 65–76 (1999)
10. Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Detection and location of people in
video images using adaptive fusion of color and edge information. In: International
Conference on Pattern Recognition, vol. 4, pp. 627–630 (2000)
11. Heikkila, M., Pietikainen, M.: A texture-based method for modeling the background
and detecting moving objects. PAMI 28(4), 657–662 (2006)
12. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer segmentation of live
video. In: CVPR, pp. 53–60 (2006)
13. Lepetit, V., Pilet, J., Geiger, A., Mazzoni, A., Oezuysal, M., Fua, P.: Bazar,
http://cvlab.epfl.ch/software/bazar
14. Rother, C., Kolmogorov, V., Blake, A.: Grabcut: Interactive foreground extraction
using iterated graph cuts. ACM SIGGRAPH (2004)
15. Fransens, R., Strecha, C., Van Gool, L.: A mean field EM-algorithm for coherent
occlusion handling in map-estimation problems. In: CVPR (2006)
16. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: principles and prac-
tice of background maintenance. In: International Conference on Computer Vision,
vol. 1, pp. 255–261 (1999)
Closed-Form Solution to Non-rigid 3D Surface
Registration
EPFL - CVLab,
1015 Lausanne, Switzerland
1 Introduction
3D shape recovery of deformable surfaces from individual images is known to be
highly ambiguous. The standard approach to overcoming this is to introduce a
deformation model and to recover the shape by optimizing an objective func-
tion [1,2,3,4,5,6,7,8] that measures the fit of the model to the data. However, in
practice, this objective function is either non-convex or involves temporal con-
sistency. Thus, to avoid being trapped in local minima, these methods require
initial estimates that must be relatively close to the true shape. As a result, they
have been shown to be effective for tracking, but not for registration without a
priori shape knowledge.
By contrast, we propose here a solution to detecting and reconstructing in-
elastic 3D surfaces from correspondences between an individual image and a
reference configuration, in closed-form, and without any initial shape estimate.
More specifically, we model flexible inelastic surfaces as triangulated meshes
whose edge lengths cannot change. Given an image of the surface in a known
This work was supported in part by the Swiss National Science Foundation and in
part by the European Commission under the IST-project 034307 DYVINE (Dynamic
Visual Networks).
Fig. 1. 3D reconstruction of non-rigid objects from an individual image and a refer-
ence configuration. Results were obtained in closed-form, without any initial estimate.
Top: Recovered mesh overlaid on the original image. Bottom: Re-textured side view of
the retrieved surface.
2 Related Work
3D reconstruction of non-rigid surfaces from images has attracted increasing
attention in recent years. It is a severely under-constrained problem and many
different kinds of prior models have been introduced to restrict the space of
possible shapes to a manageable size.
Most of the models currently in use trace their roots to the early physics-
based models that were introduced to delineate 2D shapes [11] and reconstruct
relatively simple 3D ones [12].
As far as 2D problems are concerned, their more recent incarnations have
proved effective for image registration [13,14] and non-rigid surface detection
[15,16]. Many variations of these models have also been proposed to address
3D problems, including superquadrics [1], triangulated surfaces [2], or thin-
plate splines [17]. Additionally, dimensionality reduction was introduced through
modal analysis [3,18], where shapes are represented as linear combinations of
deformation modes. Finally, a very recent work [19] proposes to set bounds on
distances between feature points, and use them in conjunction with a thin-plate
splines model to reconstruct inextensible surfaces.
One limitation of the physics-based models is that they rarely describe ac-
curately the non-linear physics of large deformations. In theory, this could be
remedied by introducing more sophisticated finite-element modeling. However,
in practice, this often leads to vastly increased complexity without a commensu-
rate gain in performance. As a result, in recent years, there has been increasing
interest in statistical learning techniques that build surface deformation models
from training data. Active Appearance Models [20] pioneered this approach by
learning low-dimensional linear models for 2D face tracking. They were quickly
followed by Active Shape Models [5] and Morphable Models [4] that extended it
to 3D. More recently, linear models have also been learned for structure-from-
motion applications [6,21] and tracking of smoothly deforming 3D surfaces [7].
There have also been a number of attempts at performing 3D surface recon-
struction without resorting to a deformation model. One approach has been
to use lighting information in addition to texture clues to constrain the recon-
struction process [8], which has only been demonstrated under very restrictive
assumptions on lighting conditions and is therefore not generally applicable.
Other approaches have proposed to use motion models over video sequences.
The reconstruction problem was then formulated either as solving a large lin-
ear system [22] or as a Second Order Cone Programming problem [23]. These
formulations, however, rely on tightly bounding the vertex displacements from
one frame to the next, which makes them applicable only in a tracking context
where the shape in the first frame of the sequence is known.
In all the above methods, shape recovery entails minimizing an objective func-
tion. In most cases, the function is non convex, and therefore, one can never be
sure to find its global minimum, especially if the initial estimate is far from the
correct answer. In the rare examples formulated as convex problems [23], the so-
lution involves temporal consistency, which again requires a good initialization.
By contrast, many closed-form solutions have been proposed for pose estima-
tion of rigid objects [24,25,26]. In fact, the inspiration for our method came from
our earlier work [9] in that field. However, reconstructing a deformable surface
involves many more variables than the 6 rigid motion degrees of freedom. In
the remainder of this paper, we show that this therefore requires a substantially
different approach.
3 Closed-Form 3D Reconstruction
In this section, we show that recovering the 3D shape of a flexible surface from
3D-to-2D correspondences can be achieved by solving a set of quadratic equations
accounting for inextensibility, which can be done in closed-form.
We first show that, given a set of 3D-to-2D correspondences, the vector of vertex
coordinates X can be found as the solution of a linear system.
Let x be a 3D point belonging to facet f with barycentric coordinates [a1, a2, a3].
Hence, we can write it as x = Σ_{i=1}^{3} ai vf,i , where {vf,i}_{i=1,2,3} are the three
vertices of facet f. The fact that x projects to the 2D image location (u, v) can now
be expressed by the relation
A (a1 vf,1 + a2 vf,2 + a3 vf,3) = k [u, v, 1]^T ,    (1)
where k is a scalar accounting for depth. Since, from the last row of Eq. 1, k can
be expressed in terms of the vertex coordinates, we have
( a1 B   a2 B   a3 B ) ( vf,1 ; vf,2 ; vf,3 ) = 0 ,   with   B = A2×3 − [u, v]^T A3 ,    (2)
where A2×3 are the first two rows of A, and A3 is the third one. nc such corre-
spondences between 3D surface points and 2D image locations therefore provide
2nc linear constraints such as those of Eq. 2. They can be jointly expressed by
the linear system
MX = 0 , (3)
where M is a 2nc × 3nv matrix obtained by concatenating the [ a1 B  a2 B  a3 B ]
matrices of Eq. 2.
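Building M from the correspondences is mechanical; the sketch below assembles the two rows contributed by one correspondence, assuming A is the 3×3 projection matrix with the vertices expressed in the camera frame (this convention is our assumption, not a detail stated in the surviving text).

```python
import numpy as np

def correspondence_rows(A, bary, facet_vertex_ids, uv, n_vertices):
    """Two rows of M contributed by one 3D-to-2D correspondence (Eq. 2).
    A: assumed 3x3 projection matrix (vertices in the camera frame),
    bary: (a1, a2, a3), facet_vertex_ids: indices of the facet's vertices,
    uv: observed image location."""
    B = A[:2, :] - np.outer(np.asarray(uv), A[2, :])   # B = A_{2x3} - [u, v]^T A_3
    rows = np.zeros((2, 3 * n_vertices))
    for a, vid in zip(bary, facet_vertex_ids):
        rows[:, 3 * vid:3 * vid + 3] = a * B
    return rows

# M is the vertical stack of these 2-row blocks over all n_c correspondences,
# and the concatenated vertex coordinates X then satisfy M @ X = 0 (Eq. 3).
```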
Fig. 2. (a,b) Original and side views of a surface used to generate a synthetic sequence.
The 3D shape was reconstructed by an optical motion capture system. (c,d) Eigenval-
ues of the linear system written from correspondences randomly established for the
synthetic shape of (a). (c) The system was written in terms of 243 vertex coordinates.
One third of the eigenvalues are close to zero. (d) The system was written in terms of
50 PCA modes. There are still a number of near zero eigenvalues. (e) First derivative
of the curve (d) (in reversed x-direction). We take the maximum value of nl to be the
one with maximum derivative, which corresponds to the jump in (d).
Following the idea introduced in [9], we write the solution of the linear system of
Eq. 3 as a weighted sum of the eigenvectors li , 1 ≤ i ≤ nv of MT M, which are
those associated with the eigenvalues that are almost zero. Therefore we write
X = Σ_{i=1}^{nv} βi li ,    (4)
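In practice the li can be obtained from an eigen-decomposition of M^T M, keeping the eigenvectors whose eigenvalues are close to zero. The sketch below also illustrates one way to pick the cut-off from the jump in the eigenvalue curve, in the spirit of Fig. 2(e); this selection rule is our reading of the text, not necessarily the authors' exact criterion.

```python
import numpy as np

def null_space_basis(M, n_modes=None):
    """Eigenvectors of M^T M associated with the smallest eigenvalues; if n_modes
    is not given, place the cut-off at the largest jump of the sorted eigenvalue
    curve (an assumption inspired by Fig. 2(e))."""
    w, V = np.linalg.eigh(M.T @ M)          # eigenvalues in ascending order
    if n_modes is None:
        n_modes = int(np.argmax(np.diff(w))) + 1
    return V[:, :n_modes]                   # columns l_1 ... l_n; then X = L @ beta
```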
where we only show the first line of the original system of Eq. 6 and its product
with β1, and where D^1_{i,j} stands for the coefficient on the first line of D
corresponding to the product βi βj .
It can be shown that multiplying the inextensibility equations by all the βi
only yields a sufficient number of equations for very small meshes, i.e. less than
12 vertices for a hexagonal mesh. In theory, one could solve this problem by
applying Extended Linearization iteratively by re-multiplying the new equations
by the linear terms. However, in practice, the resulting system quickly becomes
so large that it is intractable, i.e. for a 10 × 10 mesh, the number of equations
only becomes larger than the number of unknowns when the size of the system
is of the order of 10^10. In other words, Extended Linearization cannot deal with a
problem as large as ours and we are not aware of any other closed-form approach
to solving systems of quadratic equations that could. We address this issue in
the next section.
where the pi are the deformation modes and the αi their associated weights.
In our implementation, modes were obtained by applying Principal Component
Analysis to a matrix of registered training meshes in deformed configurations,
from which the mean shape X0 was subtracted [7]. The pi therefore are the
eigenvectors of the data covariance matrix. Nonetheless, they could also have
been derived by modal analysis, which amounts to computing the eigenvectors
of a stiffness matrix, and is a standard approach in physics-based modeling [3].
In this formulation, recovering the shape amounts to computing the weights
α. Since the shape must satisfy Eq. 3, α must then satisfy
M(X0 + Pα) = 0 . (9)
When solving this system, to ensure that the recovered weights do not generate
shapes exceedingly far from our training data, we introduce a regularization term
by penalizing αi with the inverse of the corresponding eigenvalue σi of the data
covariance matrix. We therefore solve
$$\begin{bmatrix} MP & MX_0 \\ w_r S & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ 1 \end{bmatrix} = 0\,, \qquad (10)$$
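One way to obtain a concrete solution of the homogeneous system of Eq. 10 is to move the column multiplying the fixed entry 1 to the right-hand side and solve in the least-squares sense. The sketch below does exactly that; the choice S = diag(1/σ_i) and the use of numpy's lstsq are our assumptions, not a statement of the authors' implementation.

```python
import numpy as np

def solve_modal_weights(M, P, X0, sigmas, w_r=1.0):
    """Least-squares solution of Eq. 10 for the mode weights alpha.

    Assumptions (ours): S = diag(1/sigma_i) penalizes each mode by the
    inverse of its PCA eigenvalue, and the trailing '1' entry is moved
    to the right-hand side so a standard lstsq call applies.
    """
    S = np.diag(1.0 / np.asarray(sigmas))
    top = np.hstack([M @ P, (M @ X0).reshape(-1, 1)])
    bottom = np.hstack([w_r * S, np.zeros((S.shape[0], 1))])
    A = np.vstack([top, bottom])            # [[MP, MX0], [w_r S, 0]]
    lhs, rhs = A[:, :-1], -A[:, -1]         # move the column multiplying 1
    alpha, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return X0 + P @ alpha                   # recovered shape
```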
4 Experimental Results
In this section we show that our method can be successfully applied to recon-
structing non-rigid shapes from individual images and a reference configuration.
We present results on both synthetic data and real images.
[Plot panels (figure content not recoverable): axes show mean curvature (×10⁻⁴) and mean 3D distance [mm], for noise levels σg ∈ {0, 5, 10}, outlier rates ro ∈ {0%, 5%}, and correspondence fractions ncf ∈ {1, 0.5}.]
Fig. 4. Comparison of our closed-form results against the results of constrained op-
timization. Optimization was performed on the vertex coordinates using Matlab’s
fmincon function, and starting from the flat position. (a) Mean vertex-to-vertex dis-
tance. (b) Reprojection error. Constrained optimization is both much slower and far
less accurate than our approach.
shape, the latter remains meaningful and could be used to initialize an iterative
algorithm.
In Fig. 4, we compare our results against results obtained with Matlab’s con-
strained optimization fmincon function. We use it to minimize the residual of
the linear system of Eq. 3 with respect to the vertex coordinates, under the
constraints that edge lengths must remain constant. We first tried to use a similar representation in terms of modes. However, since the constraints could
never be truly satisfied, the algorithm would never converge towards an accept-
able solution. This forced us to directly use the vertex coordinates. To improve
convergence and prevent the surface from crumpling, we added a smoothness
term [11]. For all the frames, the initialization was set to the flat position. In
Fig. 4(a), we show the mean 3D vertex-to-vertex distance for the case where
σg = 5, ro = 0, and ncf = 5. The red curve corresponds to our closed-form solu-
tion and the blue one to constrained optimization. Note that our approach gives
much better results. Furthermore, it is also much faster, requiring only 1.5 min-
utes per frame as opposed to 1.5 hours for constrained optimization. Fig. 4(b)
shows the reprojection errors for the same cases.
Fig. 6. Shape recovery of a bed-sheet. Top Row: Recovered mesh overlaid on the orig-
inal image. Bottom Row: Mesh seen from a different viewpoint.
video sequences, nothing links one frame to the next, and no initialization is
required. Corresponding videos are given as supplementary material.
In the case of the sheet, we deformed it into several unrelated shapes, took
pictures from 2 different views for each deformation, and reconstructed the sur-
face from a single image and a reference configuration. In Fig. 5, we show the
results on four different cases. From our recovered shape, we generated synthetic
textured images roughly corresponding to the viewpoint of the second image.
As can be seen in the two bottom rows of Fig. 5, our synthetic images closely
match the real side views. Additionally, we also reconstructed the same sheet
Fig. 7. Shape recovery of a piece of cloth. From Top to Bottom: Mesh computed in
closed-form overlaid on the input image, side view of that mesh, refined mesh after 5
Gauss-Newton iterations.
Fig. 8. Shape recovery of the central part of a t-shirt. From Top to Bottom: Mesh
computed in closed-form overlaid on the input image, side view of that mesh, refined
mesh after 5 Gauss-Newton iterations.
from the images of a video sequence, and show the results in Fig. 6. Note that
no initialization was required, and that nothing links one frame to the next.
In Figs. 7 and 8, we show results for images of a piece of cloth and of a
t-shirt waved in front of the camera. Note that in both cases, the closed-form
solution closely follows what we observe in the videos. To further refine it, we
implemented a simple Gauss-Newton optimization scheme and minimized the residual D̃b̃ − d̃ corresponding to Eq. 11 with respect to the β̃i. In the third row of the figures, we show the refined mesh after 5 iterations of this scheme.
This proved sufficient to recover finer details at a negligible increase in overall
computation time.
5 Conclusion
Implementing Decision Trees and Forests on a
GPU
Toby Sharp
1 Introduction
Since their introduction, randomized decision forests (or random forests) have
generated considerable interest in the machine learning community as new tools
for efficient discriminative classification [1,2]. Their introduction in the computer
vision community was mostly due to the work of Lepetit et al in [3,4]. This gave
rise to a number of papers using random forests for: object class recognition and
segmentation [5,6], bilayer video segmentation [7], image classification [8] and
person identification [9].
Random forests naturally enable a wide variety of visual cues (e.g. colour,
texture, shape, depth etc.). They yield a probabilistic output, and can be made
computationally efficient. Because of these benefits, random forests are being
established as efficient and general-purpose vision tools. Therefore an optimized
implementation of both their training and testing algorithms is desirable.
1.2 Outline
Algorithm 1 describes how a binary decision tree is conceptually evaluated on
input data. In computer vision techniques, the input data typically correspond
to feature values at pixel locations. Each parent node in the tree stores a binary
function. For each data point, the binary function at the root node is evaluated
on the data. The function value determines which child node is visited next.
This continues until reaching a leaf node, which determines the output of the
procedure. A forest is a collection of trees that are evaluated independently.
In §2 we describe the features we use in our application which are useful for
object class recognition. In §3, we show how to map the evaluation of a decision
Fig. 1. Left: A 320 × 213 image from the Microsoft Research recognition database
[14] which consists of 23 labeled object classes. Centre: The mode of the pixelwise
distribution given by a forest of 8 trees, each with 256 leaf nodes, trained on a subset
of the database. This corresponds to the ArgMax output option (§3.3). This result was
generated in 7 ms. Right: The ground truth labelling for the same image.
Algorithm 1. Evaluate the binary decision tree with root node N on input x
1. while N has valid children do
2.   if TestFeature(N, x) = true then
3.     N ← N.RightChild
4.   else
5.     N ← N.LeftChild
6.   end if
7. end while
8. return data associated with N
forest to a GPU. The decision forest data structure is mapped to a forest texture
which can be stored in graphics memory. GPUs are highly data parallel machines
and their performance is sensitive to flow control operations. We show how to
evaluate trees with a non-branching pixel shader. Finally, the training of decision
trees involves the construction of histograms – a scatter operation that is not
possible in a pixel shader. In §4, we show how new GPU hardware features
allow these histograms to be computed with a combination of pixel shaders and
vertex shaders. In §5 we show results with speed gains of 100 times over a CPU
implementation.
Our framework allows clients to use any features which can be computed in a
pixel shader on multi-channel input. Our method is therefore applicable to more
general classification tasks within computer science, such as multi-dimensional
approximate nearest neighbour classification. We present no new theory but con-
centrate on the highly parallel implementation of decision forests. Our method
yields very significant performance increases over a standard CPU version, which
we present in §5.
We have chosen Microsoft’s Direct3D SDK and High Level Shader Language
(HLSL) to code our system, compiling for Shader Model 3.
2 Visual Features
2.1 Choice of Features
To demonstrate our method, we have adopted visual features that generalize
those used by many previous works for detection and recognition, including
[15,16,3,17,7]. Given a single-channel input image I and a rectangle R, let σ represent the sum $\sigma(I, R) = \sum_{x \in R} I(x)$.
The features we use are differences of two such sums over rectangles R0 , R1
in channels c0 , c1 of the input data. The response of a multi-channel image I to
a feature F = {R0 , c0 , R1 , c1 } is then ρ(I, F ) = σ(I[c0 ], R0 ) − σ(I[c1 ], R1 ). The
Boolean test at a tree node is given by the threshold function θ0 ≤ ρ(I, F ) < θ1 .
This formulation generalizes the Haar-like features of [15], the summed rectan-
gular features of [16] and the pixel difference features of [3]. The generalization
of features is important because it allows us to execute the same code for all
the nodes in a decision tree, varying only the values of the parameters. This will
enable us to write a non-branching decision evaluation loop.
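A small Python sketch of the generalized rectangle-difference feature ρ(I, F) computed from summed-area tables; the rectangle convention (half-open ranges in row/column order) and the function names are our own.

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with a zero top row/column for easy box sums."""
    return np.pad(channel, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def box_sum(sat, y0, x0, y1, x1):
    """Sum of channel[y0:y1, x0:x1] from its summed-area table."""
    return sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]

def feature_response(sats, F):
    """rho(I, F) = sigma(I[c0], R0) - sigma(I[c1], R1)."""
    (R0, c0, R1, c1) = F
    return box_sum(sats[c0], *R0) - box_sum(sats[c1], *R1)

def node_test(sats, F, theta0, theta1):
    """Boolean node test: theta0 <= rho(I, F) < theta1."""
    r = feature_response(sats, F)
    return (theta0 <= r) and (r < theta1)
```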
Fig. 3. HLSL code which represents the features used to demonstrate our system.
These features are suitable for a wide range of detection and recognition tasks.
Figure 3 shows the HLSL code which is used to specify our choice of features
(§2.1). The variables for the feature are encoded in the Parameters structure.
The Boolean test for a given node and pixel is defined by the TestFeature
method, which will be called by the evaluation and training procedures as
necessary.
We would like to stress that, although we have adopted these features to
demonstrate our implementation and show results, there is nothing in our frame-
work which requires us to use a particular feature set. We could in practice use
any features that can be computed in a pixel shader independently at each input
data point, e.g. pixel differences, dot products for BSP trees or multi-level forests
as in [6].
3 Evaluation
Once the input data textures have been prepared, they can be supplied to a
pixel shader which performs the evaluation of the decision forest at each pixel
in parallel.
Our strategy for the evaluation of a decision forest on the GPU is to transform
the forest’s data structure from a list of binary trees to a 2D texture (Figure 4).
We lay out the data associated with a tree in a four-component float texture,
with each node’s data on a separate row in breadth-first order.
In the first horizontal position of each row we store the texture coordinate of
the corresponding node’s left child. Note that we do not need to store the right
child’s position as it always occupies the row after the left child. We also store
all the feature parameters necessary to evaluate the Boolean test for the node.
For each leaf node, we store a unique index for the leaf and the required output
– a distribution over class labels learned during training.
To navigate through the tree during evaluation, we write a pixel shader that
uses a local 2D node coordinate variable in place of a pointer to the current node
(Figure 5). Starting with the first row (root node) we read the feature parameters
Fig. 4. Left: A decision tree structure containing parameters used in a Boolean test at
each parent node, and output data at each leaf node. Right: A 7 × 5 forest texture built
from the tree. Empty spaces denote unused values.
Fig. 5. An HLSL pixel shader which evaluates a decision tree on each input point in
parallel without branching. Here we have omitted evaluation on multiple and unbal-
anced trees for clarity.
and evaluate the Boolean test on the input data using texture-dependent reads.
We then update the vertical component of our node coordinate based on the
result of the test and the value stored in the child position field. This has the
effect of walking down the tree according to the computed features. We continue
this procedure until we reach a row that represents a leaf node in the tree, where
we return the output data associated with the leaf.
For a forest consisting of multiple trees, we tile the tree textures horizontally.
An outer loop then iterates over the trees in the forest; we use the horizontal
component of the node coordinate to address the correct tree, and the vertical
component to address the correct node within the tree. The output distribution
for the forest is the mean of the distributions for each tree.
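The paper evaluates the forest with an HLSL pixel shader over the forest texture; purely as an illustration, the following CPU reference in Python mimics that layout (one row per node in breadth-first order, the left-child row stored with the node, the right child in the row after it) for a single data point. The dictionary-based encoding is our own stand-in for the texture rows.

```python
import numpy as np

def evaluate_forest(forest, x, test_feature):
    """CPU reference of the forest-texture traversal for one data point.

    forest: list of trees; each tree is a list of rows laid out
            breadth-first, row = dict(left=..., params=..., dist=...),
            with left=None marking a leaf (our own encoding).
    test_feature: callable(params, x) -> bool, the node's Boolean test.
    """
    dists = []
    for tree in forest:
        row = 0                               # root is the first row
        while tree[row]["left"] is not None:  # walk down until a leaf row
            go_right = test_feature(tree[row]["params"], x)
            # right child always occupies the row after the left child
            row = tree[row]["left"] + (1 if go_right else 0)
        dists.append(tree[row]["dist"])       # class distribution at the leaf
    return np.mean(dists, axis=0)             # forest output = mean over trees
```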
This method allows our pixel shader to be non-branching (i.e. it does not con-
tain flow control statements) which is crucial for optimal execution performance.
4 Training
Training of randomized trees is achieved iteratively, growing a tree by one level
each training round. For each training round, a pool of candidate features is
sampled, and these are then evaluated on all the training data to assess their
discriminative ability. Joint histograms over ground truth labels and feature
responses are created, and these histograms may be used in conjunction with
various learning algorithms, e.g. ID3 [21] or C4.5 [22], to choose features for new
tree nodes. Thus learning trees can be a highly compute-intensive task. We adopt
a general approach for efficient training on the GPU, suitable for any learning
algorithm.
A training database consists of training examples together with ground truth
class labels. Given a training database, a pool of candidate features and a decision
tree, we compute and return to our client a histogram that can be used to extend
the tree in accordance with a learning algorithm. For generality, our histogram is
4D and its four axes are: the leaf node index, ground truth class label, candidate
feature index and quantized feature response. Armed with this histogram, clients
can add two new children to each leaf of the current tree, selecting the most
discriminative candidate feature as the new test.
In one sweep over the training database we visit each labeled data point and
evaluate its response to each candidate feature. We also determine the active
leaf node in the tree and increment the appropriate histogram bin. Thus for each
training round we evaluate the discriminative ability of all candidate features at
all leaf nodes of the current tree.
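A sketch of one training sweep filling the 4D histogram over (leaf node, class label, candidate feature, quantized response), as described above; the clamping-to-end-bins behaviour follows the text, while the array layout and names are our assumptions.

```python
import numpy as np

def training_histogram(points, labels, leaf_of, responses, intervals, n_bins):
    """4D histogram: (leaf node, class label, candidate feature, response bin).

    leaf_of[i]      : index of the leaf reached by training point i
    responses[i, f] : response of point i to candidate feature f
    intervals[f]    : (lo, hi) response interval of interest for feature f
    """
    n_leaves = leaf_of.max() + 1
    n_classes = labels.max() + 1
    n_feats = responses.shape[1]
    H = np.zeros((n_leaves, n_classes, n_feats, n_bins), dtype=np.int64)
    for i in range(len(points)):
        for f in range(n_feats):
            lo, hi = intervals[f]
            b = int((responses[i, f] - lo) / (hi - lo) * n_bins)
            b = min(max(b, 0), n_bins - 1)       # clamp to the end bins
            H[leaf_of[i], labels[i], f, b] += 1
    return H
```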
Fig. 6. Processing training data. (a) A set of four training examples, already pre-
processed for feature computation. (b) The same data, rearranged so that each texture
contains corresponding channels from all four training examples. (c) The appropriate
textures are selected for a given feature and the box filters applied. (d) The final feature
response is the difference of the two box filtered textures.
Since we are pre-filtering our sRGB image data (§2), we can either perform
all the pre-processing to the training database in advance, or we can apply the
pre-processing as each image is fetched from the database. After the pre-filtering
(Figure 6a) we re-arrange the texture channels so that each texture contains one
filtered component from each of the four training examples (Figure 6b). The
input textures are thus prepared for evaluating our features efficiently.
We then iterate through the supplied set of candidate features, computing
the response of the current training examples to each feature. For each feature
we select two input textures according to the channels specified in the feature
(Figure 6c). We compute each box filter convolution on four training images
in parallel by passing the input texture to a pixel shader that performs the
necessary look-ups on the integral image. In a third pass, we subtract the two
box filter responses to recover the feature response (Figure 6d).
We ensure that our leaf image (§4.2) also comprises four components that
correspond to the four current training examples.
Fig. 7. An HLSL vertex shader that scatters feature response values to the appropriate
position within a histogram
A simple pixel shader emits a constant value of 1 and, with additive blending
enabled, the histogram values are incremented as desired.
We execute this pipeline four times for the four channels of data to be ag-
gregated into the histogram. A shader constant allows us to select the required
channel.
In order to histogram the real-valued feature responses they must first be quan-
tized. We require that the client provides the number of quantization bins to
use for the training round. An interval of interest for response values is also pro-
vided for each feature in the set of candidates. In our Scatter vertex shader, we
then linearly map the response interval to the histogram bins, clamping to end
bins. We make the quantization explicit in this way because different learning
algorithms may have different criteria for choosing the parameters used for the
tree’s Boolean test.
One approach would be to use 20-30 quantization levels during a training
round and then to analyze the histogram, choosing a threshold value adaptively
to optimize the resulting child distributions. For example, the threshold could
be chosen to minimize the number of misclassified data points or to maximize
the KL-divergence. Although this method reduces training error, it may lead to
over-fitting. Another approach would be to use a very coarse quantization (only
2 or 3 bins) with randomized response intervals. This method is less prone to
over-fitting but may require more training rounds to become sufficiently discrim-
inative.
We have tested both of the above approaches and found them effective. We
currently favour the latter approach, which we use with the ID3 algorithm [21]
to select for every leaf node the feature with the best information gain. Thus we
double the number of leaf nodes in a tree after each training round.
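For illustration, an ID3-style selection step that consumes the histogram from the previous sketch and picks, per leaf, the candidate feature (and bin threshold) with the largest information gain; the "split the bins at t" convention is ours and only loosely mirrors the randomized-interval variant favoured in the text.

```python
import numpy as np

def entropy(counts):
    p = counts / max(counts.sum(), 1)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_features(H):
    """For each leaf, choose the candidate feature and bin threshold with
    the largest information gain (ID3-style)."""
    n_leaves, n_classes, n_feats, n_bins = H.shape
    choices = []
    for leaf in range(n_leaves):
        parent = H[leaf, :, 0, :].sum(axis=1)        # class counts at this leaf
        h_parent, n = entropy(parent), parent.sum()
        best = (-np.inf, None, None)
        for f in range(n_feats):
            for t in range(1, n_bins):
                left = H[leaf, :, f, :t].sum(axis=1)   # class counts, bins < t
                right = H[leaf, :, f, t:].sum(axis=1)
                gain = h_parent - (left.sum() * entropy(left) +
                                   right.sum() * entropy(right)) / max(n, 1)
                if gain > best[0]:
                    best = (gain, f, t)
        choices.append(best)                           # (gain, feature, threshold)
    return choices
```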
We create flexibility by not requiring any particular learning algorithm. In-
stead, by focusing on computation of the histogram, we enable clients to adopt
their preferred learning algorithm efficiently.
5 Results
Our test system consists of a dual-core Intel Core 2 Duo 2.66 GHz and an nVidia
GeForce GTX 280. (Timings on a GeForce 8800 Ultra were similar.) We have
coded our GPU version using Direct3D 9 with HLSL shaders, and a CPU version
using C++ for comparison only. We have not developed an SSE version of the
CPU implementation which we believe may improve the CPU results somewhat
(except when performance is limited by memory bandwidth). Part of the appeal
of the GPU implementation is the ability to write shaders using HLSL which
greatly simplifies the adoption of vector instructions.
In all cases identical output is attained using both CPU and GPU versions.
Our contribution is a method of GPU implementation that yields a considerable
speed improvement, thereby enabling new real-time recognition applications.
We separate our results into timings for pre-processing, evaluation and train-
ing. All of the timings depend on the choice of features; we show timings for our
generalized recognition features. For reference, we give timings for our feature
pre-processing in Figure 8.
Training time can be prohibitively long for randomized trees, particularly with large
databases. This leads to pragmatic short-cuts such as sub-sampling training data,
which in turn has an impact on the discrimination performance of learned trees.
Our training procedure requires time linear in the number of training exam-
ples, the number of trees, the depth of the trees and the number of candidate
features evaluated.
Fig. 9. Breakdown of time spent during one training round with 100 training examples
and a pool of 100 candidate features. Note the high proportion of time spent updating
the histogram.
To measure training time, we took 100 images from the labeled object recog-
nition database of [14] with a resolution of 320×213. This data set has 23 labeled
classes. We used a pool of 100 candidate features for each training round. The
time taken for each training round was 12.3 seconds. With these parameters, a
balanced tree containing 256 leaf nodes takes 98 seconds to train. Here we have
used every pixel of every training image.
Training time is dominated by the cost of evaluating a set of candidate features
on a training image and aggregating the feature responses into a histogram.
Figure 9 shows a breakdown of these costs. These figures are interesting as they
reveal two important insights:
First, the aggregation of the histograms on the GPU is comparatively slow,
dominating the training time significantly. We experimented with various different methods for accumulating the histograms, including maintaining the histogram in system memory and performing the incrementation on the CPU. Unfortunately,
this did not substantially reduce the time required for a training round. Most
recently, we have begun to experiment with using CUDA [24] for this task and
we anticipate a significant benefit over using Direct3D.
Second, the computation of the rectangular sum feature responses is extremely
fast. We timed this operation as able to compute over 10 million rectangular sums
per ms on the GPU. This computation time is insignificant next to the other
timings, and this leads us to suggest that we could afford to experiment with
more arithmetically complex features without harming training time.
Fig. 10. Timings for evaluating a forest of decision trees. Our GPU implementation
evaluates the forest in about 1% of the time required by the CPU implementation.
Fig. 11. (a)-(b) Object class recognition. A forest of 8 trees was trained on labelled
data for grass, sky and background labels. (a) An outdoor image which is not part of
the training set for this example. (b) Using the Distribution output option, the blue
channel represents the probability of sky and the green channel the probability of grass
at each pixel (5 ms). (c)-(d) Head tracking in video. A random forest was trained
using spatial and temporal derivative features instead of the texton filter bank. (c) A
typical webcam video frame with an overlay showing the detected head position. This
frame was not part of the training set for this example. (d) The probability that each
pixel is in the foreground (5 ms).
5.3 Conclusion
We have shown how it is possible to use GPUs for the training and evaluation
of general purpose decision trees and forests, yielding speed gains of around 100
times.
References
1. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees.
Neural Computation 9(7), 1545–1588 (1997)
2. Breiman, L.: Random forests. ML Journal 45(1), 5–32 (2001)
3. Lepetit, V., Fua, P.: Keypoint recognition using randomized trees. IEEE Trans.
Pattern Anal. Mach. Intell. 28(9), 1465–1479 (2006)
4. Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code.
In: IEEE CVPR (2007)
5. Winn, J., Criminisi, A.: Object class recognition at a glance. In: IEEE CVPR,
video track (2006)
6. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image catego-
rization and segmentation. In: IEEE CVPR, Anchorage (2008)
7. Yin, P., Criminisi, A., Winn, J.M., Essa, I.A.: Tree-based classifiers for bilayer
video segmentation. In: CVPR (2007)
8. Bosch, A., Zisserman, A., Munoz, X.: Image classification using random forests and
ferns. In: IEEE ICCV (2007)
9. Apostolof, N., Zisserman, A.: Who are you? - real-time person identification. In:
BMVC (2007)
10. Yang, R., Pollefeys, M.: Multi-resolution real-time stereo on commodity graphics
hardware. In: CVPR, vol. (1), pp. 211–220 (2003)
11. Brunton, A., Shu, C., Roth, G.: Belief propagation on the gpu for stereo vision. In:
CRV, p. 76 (2006)
12. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision.
International Journal of Computer Vision 70(1), 41–54 (2006)
13. Steinkraus, D., Buck, I., Simard, P.: Using gpus for machine learning algorithms.
In: Proceedings of Eighth International Conference on Document Analysis and
Recognition, 2005, 29 August-1 September 2005, vol. 2, pp. 1115–1120 (2005)
14. Winn, J.M., Criminisi, A., Minka, T.P.: Object categorization by learned universal
visual dictionary. In: ICCV, pp. 1800–1807 (2005)
15. Viola, P.A., Jones, M.J.: Robust real-time face detection. International Journal of
Computer Vision 57(2), 137–154 (2004)
16. Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: TextonBoost: Joint appearance,
shape and context modeling for multi-class object recognition and segmentation.
In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp.
1–15. Springer, Heidelberg (2006)
17. Deselaers, T., Criminisi, A., Winn, J.M., Agarwal, A.: Incorporating on-demand
stereo for real time recognition. In: CVPR (2007)
18. James, G., O’Rorke, J.: Real-time glow. In: GPU Gems: Programming Techniques,
Tips and Tricks for Real-Time Graphics, pp. 343–362. Addison-Wesley, Reading
(2004)
19. Blelloch, G.E.: Prefix sums and their applications. Technical Report CMU-CS-90-
190, School of Computer Science, Carnegie Mellon University (November 1990)
20. Hensley, J., Scheuermann, T., Coombe, G., Singh, M., Lastra, A.: Fast summed-
area table generation and its applications. Comput. Graph. Forum 24(3), 547–555
(2005)
21. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
22. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, California
(1992)
23. Scheuermann, T., Hensley, J.: Efficient histogram generation using scattering on
gpus. In: SI3D, pp. 33–37 (2007)
24. http://www.nvidia.com/cuda
General Imaging Geometry
for Central Catadioptric Cameras
size 6 × 10 and how this may be used for calibrating catadioptric cameras with
a straightforward DLT approach, something which has not been possible up to
now. We then give analogous results for the backprojection of image points and
the projection of quadrics and conics. These are the basis for our main result,
the general fundamental matrix for catadioptric cameras. It is of size 15× 15 and
an explicit compact expression is provided. Finally, we also show the existence
of plane homographies, again of size 15 × 15, that relate sets of matching image
points that are the projections of coplanar scene points.
Our results, like those cited above for para-catadioptric cameras, are based on
the use of so-called lifted coordinates to represent geometric objects. For example,
2D points are usually represented by 3-vectors of homogeneous coordinates; their
lifted coordinates are 6-vectors containing all degree-2 monomials of the original
coordinates. Lifted coordinates have also been used to model linear pushbroom
cameras [8] and to perform multi-body structure from motion [17,18].
Fig. 1. Left: Projection of a 3D point to two image points (in cyan). Middle: Backpro-
jection of an image point to two 3D lines. Right: Illustration of plane homography. The
image point in cyan in the right camera is backprojected to the scene plane, giving the
two red points. Each one of them is projected to the unit sphere of the second camera
on two blue points. The four image points in the second camera are shown in cyan.
the second intersection point with the sphere is hidden from the 3D point by the
mirror. An exception is the case of an elliptical mirror where a 3D point may
actually be seen twice in the image. Although for the most useful catadioptric
cameras, a 3D point is in reality visible in only one image point, it turns out
that in order to obtain multi-linear expressions for epipolar geometry, plane
homographies etc., both mathematical image points have to be considered, see
later. So, from now on we consider two image points per 3D point and similar
considerations will be done for backprojection etc. in the following. The algebraic
formulation of the projection of 3D points, in the form of a 6 × 10 general
catadioptric projection matrix, is given in section 4.
Epipolar geometry. The basic question of epipolar geometry is: what is the
locus of points in the second image, that may be matches of a point q1 in the
first image? The answer follows from the insights explained so far. Let us first
backproject q1 . The two 3D lines we get can then be projected into the second
image, giving one conic each. Hence, the locus of matching points is the union of
two conics. This can be represented by a single geometric entity, a quartic curve
(note of course that not every quartic curve is the union of two conics).
Hence, if a multi-linear fundamental matrix exists that represents this epipo-
lar geometry, it must map an image point into some representation of a quartic
curve. The equation of a planar quartic curve depends on 15 coefficients (defined
up to scale), one per 4-th order monomial of a 2D point’s homogeneous coordi-
nates. Hence, we may expect the fundamental matrix to be of size 15 × · · · . Like
for perspective images, we may expect that the transpose of the fundamental
matrix gives the fundamental matrix going from the second to the first image.
The fundamental matrix for catadioptric images should thus intuitively be of
size 15 × 15. This is indeed the case, as is shown in section 7.
3 Background
Camera model. As mentioned before, we use the sphere based model [19].
Without loss of generality, let the unit sphere be located at the origin and the
optical center of the perspective camera at the point $C_p = (0, 0, -\xi)^T$. The perspective camera is modeled by the projection matrix $P \sim A_p R_p \left[\, I \;\; -C_p \,\right]$. For
full generality, we include a rotation Rp ; this may encode an actual rotation
of the true camera looking at the mirror, but may also simply be a projective
change of coordinates in the image plane, like for para-catadioptric cameras,
where the true camera’s rotation is fixed, modulo rotation about the mirror axis.
Note that all parameters of the perspective camera, i.e. both its intrinsic and
extrinsic parameter sets, are intrinsic parameters for the catadioptric camera.
Hence, we replace Ap Rp by a generic projective transformation K from now on.
The intrinsic parameters of the catadioptric camera are thus ξ and K.
The projection of a 3D point Q goes as follows (cf. section 2). The two
intersection points of the sphere and the line joining its center and Q, are $\left(Q_1, Q_2, Q_3, \pm\sqrt{Q_1^2 + Q_2^2 + Q_3^2}\right)^T$. Their images in the perspective camera are
$$q_\pm \sim K r_\pm \sim K \begin{pmatrix} Q_1 \\ Q_2 \\ Q_3 \pm \xi \sqrt{Q_1^2 + Q_2^2 + Q_3^2} \end{pmatrix}$$
In the following, we usually first work with the intermediate image points r± ∼
K−1 q± , before giving final results for the actual image points q± .
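A direct transcription of the displayed projection formula into Python, giving the two mathematical image points q± of a 3D point under the sphere model; K and ξ are the catadioptric intrinsics.

```python
import numpy as np

def project_catadioptric(Q, K, xi):
    """Two image points q+ and q- of the 3D point Q = (Q1, Q2, Q3) under the
    sphere model (homogeneous 3-vectors, defined up to scale)."""
    Q = np.asarray(Q, dtype=float)
    d = np.linalg.norm(Q)
    r_plus = np.array([Q[0], Q[1], Q[2] + xi * d])
    r_minus = np.array([Q[0], Q[1], Q[2] - xi * d])
    return K @ r_plus, K @ r_minus
```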
Plücker line coordinates. 3D lines may be represented by 6-vectors of so-
called Plücker coordinates. Let A and B be the non-homogeneous coordinates
of two generic 3D points. Let us define the line’s Plücker coordinates as the
6-vector $L = \left( (A - B)^T ,\; (A \times B)^T \right)^T$.
All lines satisfy the Plücker constraint $L^T W L = 0$, where
$$W = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$$
Two lines L and L′ cut one another if and only if $L^T W L' = 0$. Consider a rigid transformation for points
$$\begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix}.$$
Lines are mapped accordingly using the transformation
$$T = \begin{pmatrix} R & 0_{3\times 3} \\ [t]_\times R & R \end{pmatrix}$$
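A small numpy sketch of the Plücker machinery just introduced: line coordinates from two points, the matrix W, and the 6 × 6 line motion matrix T for a rigid transformation (R, t).

```python
import numpy as np

def skew(t):
    """3x3 skew-symmetric matrix [t]_x."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def pluecker(A, B):
    """6-vector L = ((A - B)^T, (A x B)^T)^T of the line through A and B."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    return np.concatenate([A - B, np.cross(A, B)])

W = np.block([[np.zeros((3, 3)), np.eye(3)],
              [np.eye(3), np.zeros((3, 3))]])

def line_motion(R, t):
    """6x6 matrix T mapping Plücker lines under the rigid motion (R, t)."""
    return np.block([[R, np.zeros((3, 3))],
                     [skew(t) @ R, R]])

# Two lines L, Lp intersect iff L @ W @ Lp == 0 (the Plücker constraint for
# a single line is L @ W @ L == 0).
```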
The Veronese map $V_{n,d}$ of degree d embeds an n-dimensional projective space into the points of an m-dimensional projective space $P^m$, with $m = \binom{n+d}{d} - 1$.
Consider the second order Veronese map V2,2 , that embeds the projective
plane into the 5D projective space, by lifting the coordinates of point q to
$$\hat{q} = \left( q_1^2,\; q_1 q_2,\; q_2^2,\; q_1 q_3,\; q_2 q_3,\; q_3^2 \right)^T$$
Vector q̂ and matrix $qq^T$ are composed of the same elements. The former can
be derived from the latter through a suitable re-arrangement of parameters.
Define v(U) as the vector obtained by stacking the columns of a generic matrix
U [22]. For the case of qqT , v(qqT ) has several repeated elements because of
matrix symmetry. By left multiplication with a suitable permutation matrix S
that adds the repeated elements, it follows that
$$\hat{q} = D^{-1} \underbrace{\begin{pmatrix} 1&0&0&0&0&0&0&0&0 \\ 0&1&0&1&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0&0 \\ 0&0&1&0&0&0&1&0&0 \\ 0&0&0&0&0&1&0&1&0 \\ 0&0&0&0&0&0&0&0&1 \end{pmatrix}}_{S} v(qq^T), \qquad (2)$$
with D a diagonal matrix, $D_{ii} = \sum_{j=1}^{9} S_{ij}$. This process of computing the lifted
representation of a point q can be extended to any second order Veronese map
Vn,2 independently of the dimensionality of the original space. It is also a mech-
anism that provides a compact representation for square symmetric matrices.
If U is symmetric, then it is uniquely represented by vsym (U), the column-wise
vectorization of its upper right triangular part:
$$v_{sym}(U) = D^{-1} S\, v(U) = \left( U_{11}, U_{12}, U_{22}, U_{13}, \cdots, U_{nn} \right)^T$$
Let us now discuss the lifting of linear transformations. Consider A such that r = Aq. The relation $rr^T = A (qq^T) A^T$ can be written as a vector mapping
$$v(rr^T) = (A \otimes A)\, v(qq^T),$$
with ⊗ denoting the Kronecker product [22]. Using the symmetric vectorization, we have $\hat{q} = v_{sym}(qq^T)$ and $\hat{r} = v_{sym}(rr^T)$, thus
$$\hat{r} = D^{-1} S (A \otimes A) S^T\, \hat{q}.$$
We have just derived the expression for lifting linear transformations: A has a lifted counterpart Â such that r = Aq iff r̂ = Â q̂. For the case of a second-order Veronese map, the lifting of a 2D projective transformation A is Â of size 6 × 6. This lifting generalizes to any projective transformation, independently of the dimensions of its original and target spaces, i.e. it is also applicable to rectangular matrices. We summarize a few useful properties [22].
$$\widehat{AB} = \hat{A}\,\hat{B}, \qquad \widehat{A^{-1}} = \hat{A}^{-1}, \qquad \widehat{A^{T}} = D^{-1}\hat{A}^{T} D \qquad (3)$$
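For illustration, the lifting operations in numpy. The matrix S and the normalization D follow Eq. 2; the explicit expression Â = D⁻¹S(A ⊗ A)Sᵀ used in lift_transform is our reconstruction of the lifted transformation (it is consistent with q̂ = D⁻¹S v(qqᵀ) and reproduces the properties of Eq. 3), not a formula quoted from the paper.

```python
import numpy as np

# Matrix S that adds the repeated entries of v(q q^T) (column-wise
# vectorization), ordered as (q1^2, q1 q2, q2^2, q1 q3, q2 q3, q3^2);
# D sums the rows of S (Eq. 2).
S = np.array([[1,0,0, 0,0,0, 0,0,0],
              [0,1,0, 1,0,0, 0,0,0],
              [0,0,0, 0,1,0, 0,0,0],
              [0,0,1, 0,0,0, 1,0,0],
              [0,0,0, 0,0,1, 0,1,0],
              [0,0,0, 0,0,0, 0,0,1]], dtype=float)
D = np.diag(S.sum(axis=1))
D_inv = np.linalg.inv(D)

def lift_point(q):
    """Second-order Veronese lifting of a homogeneous 3-vector."""
    q = np.asarray(q, float)
    return D_inv @ S @ np.outer(q, q).flatten('F')   # v(q q^T), column-wise

def lift_transform(A):
    """6x6 lifted counterpart of a 3x3 transform: lift(A q) = Â lift(q).
    The form D^{-1} S (A kron A) S^T is our reconstruction; it satisfies
    the properties listed in Eq. (3)."""
    return D_inv @ S @ np.kron(A, A) @ S.T
```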
Also, for symmetric matrices U and M, we have the following property:
4 Projection of 3D Points
As explained in section 2, a 3D point is mathematically projected to two image
points. How to represent two 2D points via a single geometric entity? One way
is to compute the degenerate dual conic generated by them, i.e. the dual conic
containing exactly the lines going through at least one of the two points. Let the
two image points be q+ and q− (see section 3). The dual conic is given by
$$\Omega \sim q_+ q_-^T + q_- q_+^T \sim K \begin{pmatrix} Q_1^2 & Q_1 Q_2 & Q_1 Q_3 \\ Q_1 Q_2 & Q_2^2 & Q_2 Q_3 \\ Q_1 Q_3 & Q_2 Q_3 & Q_3^2 - \xi^2\left(Q_1^2 + Q_2^2 + Q_3^2\right) \end{pmatrix} K^T$$
We thus find an expression that is very similar to that for perspective cameras
and that may be directly used for calibrating catadioptric cameras using e.g. a
standard DLT like approach. While a 3 × 3 skew symmetric matrix has rank
2, its lifted counterpart is rank 3. Therefore, each 3D-to-2D match provides 3
linear constraints on the 59 parameters of Pcata , and DLT calibration can be
done with a minimum of 20 matches.
This is essential for deriving the proposed expression of the fundamental matrix.
Similarly to the case of projection, we want to express the backprojection func-
tion of a catadioptric camera as a linear mapping. Recall from section 2, that
the backprojection of an image point gives two 3D lines. How to represent two
3D lines via a single geometric entity? Several possibilities may exist; the one
that seems appropriate is to use a second order line complex: consider two 3D
lines L+ and L− . All lines that cut at least one of them, form a second order
line complex, represented by a 6 × 6 matrix C such that lines on the complex
satisfy $L^T C L = 0$. The matrix C is given as (with W as defined in section 3)
$$C \sim W \left( L_+ L_-^T + L_- L_+^T \right) W$$
with $r \sim K^{-1} q$ and $C_p$ the center of the perspective camera (cf. section 3). The line complex C generated by the two lines is
$$C \sim \begin{pmatrix} 0 & 0 \\ 0 & b_+ b_-^T + b_- b_+^T \end{pmatrix} \sim \begin{pmatrix} 0 & 0 \\ 0 & \;\xi^2 (r^T r)\, e_3 e_3^T - \xi^2 r_3 \left( e_3 r^T + r e_3^T \right) + (\xi^2 - 1)\, r r^T \end{pmatrix}$$
where u = vsym (U), û is a 21-vector and the 15 × 21 matrix Xlc only depends
on the intrinsic parameter ξ and is highly sparse (not shown due to lack of
space). Since the coefficients of q̂ˆ are 4th order monomials of q, we conclude
that a central catadioptric image of any line complex, and thus of any quadric
or conic, is a quartic curve. We may call the 15×21 matrix Plc the line complex
projection matrix for catadioptric cameras. It maps the lifted coefficients of
the line complex to the 15 coefficients of the quartic curve in the image.
$$\hat{\hat{q}}_2^T\, F_{cata}\, \hat{\hat{q}}_1 = 0$$
which has the familiar form known for perspective cameras. Fcata has rank 6 and
its left/right null space has dimension 9. While a perspective view has a single
epipole, in an omnidirectional view there are a pair of epipoles, e+ and e− ,
corresponding to the two antipodal intersections of the baseline with the sphere,
cf. section 4. The nullspace of Fcata comprises the doubly lifted coordinates
vectors of both epipoles. We conjecture that they are the only doubly lifted
3-vectors in the nullspace of Fcata , but this has to be proven in future work.
We have no space to discuss it, but for mixtures involving one hyper-catadiop-
tric and one other camera, the size of Fcata is smaller (15 × 6 for para–hyper and
6 × 6 for hyper–perspective). Other special cases are already known [14,5].
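As a sketch only: one possible degree-4 Veronese lifting of an image point (the 15 fourth-order monomials) and the corresponding evaluation of the catadioptric epipolar constraint; the monomial ordering and scaling are our own convention and must simply match the one used when estimating F_cata.

```python
import numpy as np
from itertools import combinations_with_replacement

def lift4(q):
    """Degree-4 Veronese lifting of a homogeneous 3-vector: the 15
    fourth-order monomials (our ordering/scaling; any fixed convention
    consistent with the one used to build F_cata works)."""
    q = np.asarray(q, float)
    return np.array([np.prod(q[list(idx)])
                     for idx in combinations_with_replacement(range(3), 4)])

def epipolar_residual(F_cata, q1, q2):
    """Residual of the catadioptric epipolar constraint
    lift4(q2)^T F_cata lift4(q1) = 0 for a 15x15 fundamental matrix."""
    return float(lift4(q2) @ F_cata @ lift4(q1))
```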
with b± as in section 5. We project them to the second image using the projection
matrix given in section 4. To do so, we first have to lift the coordinates of these
3D points: Q̂± ∼ Ŷ10×6 b̂± . The projection then gives two dual conics in the
second image (cf. section 4), represented by 6-vectors ω± ∼ Pcata Ŷ b̂± .
Let us compute the following symmetric 6 × 6 matrix:
$$\Gamma \sim \omega_+ \omega_-^T + \omega_- \omega_+^T \sim P_{cata}\, \hat{Y}\, \underbrace{\left( \hat{b}_+ \hat{b}_-^T + \hat{b}_- \hat{b}_+^T \right)}_{Z}\, \hat{Y}^T P_{cata}^T \qquad (7)$$
The matrix Hcata is the catadioptric plane homography. Its explicit form is
omitted due to lack of space.
By the same approach as in section 4, we can derive the following constraint
equation:
$$\widehat{[q_2]_\times}\; H_{cata}\; \hat{\hat{q}}_1 = 0_{15}$$
Of the 15 constraints contained in the above equation, only five are linearly
independent. Hence, in order to estimate the $15^2 = 225$ coefficients of Hcata, we
need at least 45 matches.
In the special case of para-catadioptric cameras, the homography is of size 6×6
and each match gives 6 equations, 3 of which are linearly independent. Hence,
12 matches are needed to estimate the 36 coefficients of that plane homography.
Fig. 2. Estimation of the homography mapping a planar grid into a catadioptric image.
It was determined from 12 clicked points (left side). Each corner is mapped into a pair
of image projections. The lines joining corresponding pairs form a pencil going through
the principal point which confirms the correctness of the estimation.
using a corner detector. From this initial step only 7 out of 91 points were missed.
The procedure was repeated a second time using all the good points and 6 more
points were correctly detected. The estimated homography maps each plane point
to a pair of antipodal image points (right side of Fig. 2). The shown result suggests
that the plane-to-image homography can be well estimated and that it is useful
for extracting and matching corners of a planar grid. Current work deals with cal-
ibrating the camera from such homographies, from multiple images of the grid.
There are several perspectives for our work. The shown results can be spe-
cialized to e.g. para-catadioptric cameras, leading to simpler expressions. It may
also be possible that due to the coordinate liftings used, some of the results hold
not only for catadioptric cameras, but also for other models, e.g. classical radial
distortion; this will be investigated. Current work is concerned with developing
practical DLT like calibration approaches for catadioptric cameras, using 3D or
planar calibration grids. Promising results have already been obtained.
References
1. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn.
Cambridge University Press, Cambridge (2004)
2. Svoboda, T., Pajdla, T.: Epipolar geometry for central catadioptric cameras.
IJCV 49, 23–37 (2002)
3. Kang, S.: Catadioptric self-calibration. In: CVPR, pp. 1201–1207 (2000)
4. Geyer, C., Daniilidis, K.: Structure and motion from uncalibrated catadioptric
views. In: CVPR, pp. 279–286 (2001)
5. Sturm, P.: Mixing catadioptric and perspective cameras. In: OMNIVIS, pp. 37–44
(2002)
6. Mičušík, B., Pajdla, T.: Structure from motion with wide circular field of view
cameras. PAMI 28(7), 1135–1149 (2006)
7. Claus, D., Fitzgibbon, A.: A rational function for fish-eye lens distortion. In: CVPR,
pp. 213–219 (2005)
8. Gupta, R., Hartley, R.: Linear pushbroom cameras. PAMI 19, 963–975 (1997)
9. Feldman, D., Pajdla, T., Weinshall, D.: On the epipolar geometry of the crossed-
slits projection. In: ICCV, pp. 988–995 (2003)
10. Pajdla, T.: Stereo with oblique cameras. IJCV 47, 161–170 (2002)
11. Seitz, S., Kim, J.: The space of all stereo images. IJCV 48, 21–38 (2002)
12. Menem, M., Pajdla, T.: Constraints on perspective images and circular panoramas.
In: BMVC (2004)
13. Baker, S., Nayar, S.: A theory of catadioptric image formation. In: ICCV, pp. 35–42
(1998)
14. Geyer, C., Daniilidis, K.: Properties of the catadioptric fundamental matrix. In:
Tistarelli, M., Bigun, J., Jain, A.K. (eds.) ECCV 2002. LNCS, vol. 2359, pp. 140–
154. Springer, Heidelberg (2002)
15. Barreto, J.: A unifying geometric representation for central projection systems.
CVIU 103 (2006)
16. Barreto, J.P., Daniilidis, K.: Epipolar geometry of central projection systems using
veronese maps. In: CVPR, pp. 1258–1265 (2006)
17. Wolf, L., Shashua, A.: Two-body segmentation from two perspective views. In:
CVPR, pp. 263–270 (2001)
18. Vidal, R., Ma, Y., Soatto, S., Sastry, S.: Two-view multibody structure from mo-
tion. IJCV 68, 7–25 (2006)
19. Geyer, C., Daniilidis, K.: A unifying theory for central panoramic systems. In: Ver-
non, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 445–461. Springer, Heidelberg (2000)
20. Barreto, J., Araujo, H.: Geometric properties of central catadioptric line images
and its application in calibration. PAMI 27, 1237–1333 (2005)
21. Semple, J., Kneebone, G.: Algebraic Projective Geometry. Clarendon Press (1998)
22. Horn, R., Johnson, C.: Topics in Matrix Analysis. Cambridge University Press,
Cambridge (1991)
23. Ying, X., Zha, H.: Using sphere images for calibrating fisheye cameras under the
unified imaging model of the central catadioptric and fisheye cameras. In: ICPR,
pp. 539–542 (2006)
24. Ponce, J., McHenry, K., Papadopoulo, T., Teillaud, M., Triggs, B.: On the absolute
quadratic complex and its application to autocalibration. In: CVPR, pp. 780–787
(2005)
25. Valdés, A., Ronda, J., Gallego, G.: The absolute line quadric and camera autocal-
ibration. IJCV 66, 283–303 (2006)
26. Mei, C., Rives, P.: Single view point omnidirectional camera calibration from planar
grids. In: ICRA, pp. 3945–3950 (2007)
Estimating Radiometric Response Functions
from Image Noise Variance
1 Introduction
Many computer vision algorithms rely on the assumption that image intensity is
linearly related to scene radiance recorded at the camera sensor. However, this
assumption does not hold with most cameras; the linearity is not maintained in
the actual observation due to non-linear camera response functions. Linearization
of observed image intensity is important for many vision algorithms to work,
therefore the estimation of the response functions is needed.
Scene radiance intensity (input) I and observed image intensity (output) O
are related by the response function f as O = f (I). Assuming it is contin-
uous and monotonic, the response function can be inverted to obtain the in-
verse response function g (= f −1 ), and measured image intensities can be lin-
earized using I = g(O). Since only observed output intensities O are usually
available, most estimation methods attempt to estimate the inverse response
functions.
Fig. 1. The noise variance in the input domain has an affine relationship with input
intensity level. Due to the non-linearity of the response function, the affine relationship
is lost in the output domain. The proposed method estimates the response functions
by recovering the affinity of the measured noise variances.
Fig. 2. The relationship between response function and noise variances in input and
output domains. The magnitude of output noise variance (height of the filled region)
varies with the slope of the response function with a fixed input noise variance.
the exact alternation has not been explicitly described. Second, it introduces
a method that has a wide range of applicability even with noisy observations.
While many existing algorithms break down when the noise level is high, we
show that the proposed method is not sensitive to the noise level because it uses
the noise as information.
The conditional density function (cdf) p(O|Õ) represents the noise distribution
in the output domain, i.e., the probability that the output intensity becomes O
when the noise-free intensity is Õ. Likewise, the cdf p(I|Ĩ) represents the noise distribution in the input domain when the noise-free input intensity is Ĩ. The function f is the response function, and Õ and Ĩ are related by f as Õ = f(Ĩ).
μO (= μO (Õ)) is the expectation of the output intensity with the cdf p(O|Õ).
Using the Taylor series of f(I) around Ĩ and assuming that the second- and higher-degree terms of the series are negligible (discussed in Section 2.3), we obtain
$$\sigma_O^2(\tilde{O}) \simeq f'^2(\tilde{I})\,\sigma_I^2(\tilde{I}), \qquad (1)$$
where $\sigma_I^2(\tilde{I})$ is the function of noise variance in the input domain when the noise-free input intensity is Ĩ. This equation shows the relationship between the
noise variance and the response functions. This relationship has been pointed out
in [8,20], however, the correctness of the equation was not thoroughly discussed.
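A small numerical check (our own, not an experiment from the paper) of Eq. 1: input noise with affine variance is pushed through a synthetic gamma-like response (an assumption; any smooth monotonic f would do), and the measured output variance is compared with f′²(Ĩ)σ_I²(Ĩ).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 2.2
f = lambda I: I ** (1.0 / gamma)                       # synthetic response (assumed)
f_prime = lambda I: (1.0 / gamma) * I ** (1.0 / gamma - 1.0)

A, B = 1e-4, 1e-5                                      # affine input-noise model (Eq. 10)
for I_tilde in [0.2, 0.5, 0.8]:
    var_in = A * I_tilde + B
    I = I_tilde + rng.normal(0.0, np.sqrt(var_in), size=200_000)
    measured = np.var(f(np.clip(I, 1e-6, 1.0)))
    predicted = f_prime(I_tilde) ** 2 * var_in         # Eq. (1)
    print(I_tilde, measured, predicted)                # the two agree closely
```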
To simplify the notation in derivation of Eq. (1), we define:
$$\mu_I = \mu_I(\tilde{I}) = \int I\, p(I|\tilde{I})\, dI, \qquad I_d = \mu_I - \tilde{I},$$
$$M_n = M_n(\tilde{I}) = \int (I - \mu_I)^n\, p(I|\tilde{I})\, dI \quad (n \ge 0), \qquad (2)$$
where $M_n$ is the n-th moment about the mean. Note that we do not place any assumptions
about the profile and model of the noise distributions.
The expectation of the output intensity μO where the noise-free output in-
tensity is Õ can be written as
∞
f (j) (I)
˜
μO = μO (Õ) = f (I)p(I|I)dI
˜ = Õ + Nj , (3)
j=1
j!
Note that the response function f is represented using its Taylor series.
From the definition, the noise variance function $\sigma_O^2(\tilde{O})$ where the noise-free output intensity is Õ can be derived as
$$\sigma_O^2(\tilde{O}) = \sum_{j=1}^{\infty} \left[ \frac{f^{(j)}(\tilde{I})}{j!} \right]^2 L_{j,j} \;+\; 2 \sum_{j=1}^{\infty} \sum_{k>j} \frac{f^{(j)}(\tilde{I})}{j!}\, \frac{f^{(k)}(\tilde{I})}{k!}\, L_{j,k}, \qquad (5)$$
$$\sigma_O^2(\tilde{O}) = f'^2(\tilde{I})\, L_{1,1} + f'(\tilde{I}) f''(\tilde{I})\, L_{1,2} + \cdots = f'^2(\tilde{I})\,\sigma_I^2(\tilde{I}) + f'(\tilde{I}) f''(\tilde{I}) \left( 2 I_d\, \sigma_I^2(\tilde{I}) + M_3 \right) + \cdots. \qquad (7)$$
Eq. (7) is the exact form of the relationship between the response function and
noise variances in the input and output domains. By discarding the second- and
higher-degree terms of Eq. (7), Eq. (1) is obtained. We discuss the validity of
this approximation in Section 2.3.
I = aP + NDC + NS + NR , (8)
$$\sigma_I^2(\tilde{I}) = \tilde{I}\,\sigma_S^2 + \sigma_{DC}^2 + \sigma_R^2, \qquad (9)$$
where $\sigma_*^2$ denotes the variances of the different noise sources [8]. Eq. (9) can be written in a simplified form as
$$\sigma_I^2(\tilde{I}) = A\tilde{I} + B, \qquad (10)$$
since the minimum unit of the distribution equals a (see Eq. (8)). $\sigma_{S\tilde{I}}^2$ ($= \tilde{I}\sigma_S^2$) is the variance of shot noise where the noise-free input intensity is Ĩ. Substituting this into Eq. (4) yields
$$N_{S_i} \simeq I_d^i + \sum_{j=2}^{i} \binom{i}{j}\, I_d^{\,i-j}\, a^{\,j-2}\, \sigma_{S\tilde{I}}^2. \qquad (13)$$
Even in the worst case where $\binom{i}{j}$ is overestimated as $2^i$, $N_{S_i}$ becomes exponentially smaller, since Eq. (13) is rewritten as
$$N_{S_i} \le I_d^i + \sum_{j=2}^{i} (2I_d)^{\,i-j}\, (2a)^{\,j-2}\, (2\sigma_{S\tilde{I}})^2, \qquad (14)$$
and we know that $2I_d \ll 1$, $2a \ll 1$, and $(2\sigma_{S\tilde{I}})^2 \ll 1$. This equation shows that
NSi exponentially decreases as i increases.
The term Li,j is defined as Li,j = Ni+j − Ni Nj in Eq. (6). Because the i-th
moment of image noise Ni can be computed as the sum of the readout and shot
noise as Ni = NRi + NSi , it also becomes exponentially smaller as i increases.
From these results, we see that the term Li,j becomes exponentially smaller as
i + j increases.
Ratio of L1,1 to L1,2 . Now we show that L1,2 is small enough to be negligible
compared with L1,1 . A detailed calculation gives us L1,2 = 2Id M2 + M3 . The
third moment of shot noise MS3 can be computed from Eq. (12). Also, the third
moment of readout noise can be obtained using Eq. (4) as
$$M_{R_3} = N_{R_3} - 3 I_d N_{R_2} - I_d^3 = -3 I_d\,\sigma_R^2 - I_d^3. \qquad (15)$$
From these results, the following equation is obtained:
$$L_{1,2} = 2 I_d M_2 - 3 I_d\,\sigma_R^2 - I_d^3 + a\,\sigma_{S\tilde{I}}^2. \qquad (16)$$
Since $M_2 \simeq \sigma_R^2 + \sigma_{S\tilde{I}}^2$ and $a \lesssim I_d$, if $M_2 \gtrsim I_d^2$ the order of $L_{1,2}$ is roughly the same as the order of $I_d M_2$. Since $I_d$, the difference between the noise-free input intensity and the mean, can naturally be considered very small, it is implausible to have cases where $M_2 \ll I_d^2$.
From these results, the order of $L_{1,2}$ is roughly equivalent to the order of $I_d L_{1,1}$, and $I_d$ is small because it is computed in the normalized input domain, e.g., of the order of $10^{-2}$ ($\simeq 1/2^8$) in the 8-bit image case. Therefore, $L_{1,2}$ is roughly $10^{-2}$ times $L_{1,1}$.
To summarize, L1,2 is sufficiently small compared with L1,1 , and Li,j decreases
exponentially as i + j increases. Also, because response functions are smooth,
f (I)
˜ ) f (I).
˜ Therefore, Eq. (7) can be well approximated by Eq. (1).
3 Estimation Algorithm
This section designs an evaluation function for estimating inverse response func-
tions g, using the result of the previous section.
Eq. (20) involves the estimation of A and B, which can be simply solved by
linear least square fitting, given g.
To make the algorithm robust against measurement errors, namely erroneous components in the measured noise, we use weighting factors. Eq. (20) is changed to
$$E_2\!\left(g; \sigma_{O_m}^2(\tilde{O})\right) = \min_{A,B}\; \frac{1}{\sum_{\tilde{O}} w(\tilde{O})} \sum_{\tilde{O}} w(\tilde{O}) \left[ \sigma_O^2(\tilde{O}) - \sigma_{O_m}^2(\tilde{O}) \right]^2, \qquad (21)$$
where the weight function w(Õ) controls the reliability of the measured noise variance $\sigma_{O_m}^2(\tilde{O})$ at the intensity level Õ. We use a Cauchy distribution (Lorentzian function) for computing the weight function w(Õ):
$$w(\tilde{O}) = \frac{1}{e^2 + \rho}, \qquad (22)$$
where e is defined as $e = \sigma_O^2(\tilde{O}) - \sigma_{O_m}^2(\tilde{O})$. A damping factor ρ controls the relationship between the difference e and the weight w(Õ). As ρ becomes smaller, the weight w(Õ) decreases more rapidly as the difference e increases.
We also add a smoothness constraint to the evaluation function, and the
evaluation function becomes
$$E_3\!\left(g; \sigma_{O_m}^2(\tilde{O})\right) = \frac{1}{\sum_{\tilde{O}} \sigma_{O_m}^2(\tilde{O})}\, E_2 + \lambda_s\, \frac{1}{n_{\tilde{O}}} \sum_{\tilde{O}} g''(\tilde{O})^2, \qquad (23)$$
where $n_{\tilde{O}}$ is the number of possible noise-free output intensity levels, e.g., 256 in the 8-bit case. $\lambda_s$ is a regularization factor that controls the effect of the smoothness constraint. The factor $1/\sum_{\tilde{O}} \sigma_{O_m}^2(\tilde{O})$ is a normalization factor that makes $E_2$ independent of the noise level.
Our method estimates the inverse response function ĝ by minimizing Eq. (23)
given the measured noise variance $\sigma_{O_m}^2(\tilde{O})$:
$$\hat{g} = \underset{g}{\operatorname{argmin}}\; E_3\!\left(g; \sigma_{O_m}^2(\tilde{O})\right). \qquad (24)$$
3.3 Implementation
In our implementation, we set the damping factor ρ to the variance of the dif-
ference e in Eq. (22). The regularization factor λs is set to 5 × 10−7 from our
empirical observation. Minimization is performed in an alternating manner. We
perform the following steps until convergence:
1. minimize the evaluation function in Eq. (23) with fixing the weight func-
tion w(Õ)
2. recompute the values of the weight function w(Õ) using the current estima-
tion result
We use the Nelder-Mead simplex method [24], implemented in Matlab as the function fminsearch, as the minimization algorithm. The values of the weight
function w(Õ) are set to one for every Õ at the beginning. During the exper-
iments, we used five initial guesses for the inverse response function g as the
input to the algorithm. The converged result that minimizes the energy score is
finally taken as the global solution.
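A minimal sketch of the minimization, assuming a low-order monotone polynomial parameterization of g (the paper does not specify one here) and scipy's Nelder-Mead in place of Matlab's fminsearch; it implements a simplified form of Eqs. (20)-(23), with the robust Cauchy weights of the alternating scheme omitted.

```python
import numpy as np
from scipy.optimize import minimize

def g_and_gp(c, O):
    # Polynomial parameterization g(O) = sum_k c_k O^k with g(0) = 0
    # (our own choice; the paper leaves the parameterization unspecified here).
    ks = np.arange(1, len(c) + 1)
    g = sum(ck * O**k for ck, k in zip(c, ks))
    gp = sum(ck * k * O**(k - 1) for ck, k in zip(c, ks))
    return g, np.maximum(gp, 1e-6)

def energy(c, O, var_meas, lam_s=5e-7):
    # Simplified E3: data term of Eq. (21) with A, B fitted by linear least
    # squares (Eq. 20), plus the smoothness penalty of Eq. (23).
    g, gp = g_and_gp(c, O)
    X = np.c_[g, np.ones_like(O)] / gp[:, None] ** 2   # model (A g + B) / g'^2
    AB, *_ = np.linalg.lstsq(X, var_meas, rcond=None)
    resid = X @ AB - var_meas
    data = np.mean(resid ** 2) / max(var_meas.sum(), 1e-12)
    gpp = np.gradient(gp, O)                           # numeric g''(O)
    return data + lam_s * np.mean(gpp ** 2)

# Usage (O: grid of output levels in [0, 1]; var_meas: measured variances):
# res = minimize(energy, x0=np.array([1.0, 0.0, 0.0]),
#                args=(O, var_meas), method='Nelder-Mead')
# coeffs = res.x   # defines the estimated inverse response g
```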
4 Experiments
We used two different setups to evaluate the performance of the proposed al-
gorithm; one is with multiple images taken by a fixed video camera, the other
is using a single image. The two setups differ in the means for collecting noise
variance information.
$$p(O|\tilde{O}) = \frac{h(O, \tilde{O})}{\sum_O h(O, \tilde{O})}. \qquad (25)$$
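For the fixed-camera setup, the measured noise variance per output level can be collected as sketched below (our own convention): the per-pixel temporal mean over the frame stack serves as the estimate of the noise-free output Õ, and pixels are grouped by its quantized value.

```python
import numpy as np

def measure_output_variance(frames, n_levels=256):
    """Measured noise variance per output level from a stack of frames of a
    static scene (fixed camera, one colour channel).

    frames: array of shape (n_frames, H, W) with values in [0, 1]."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0)
    levels = np.clip((mean * (n_levels - 1)).round().astype(int), 0, n_levels - 1)
    var_m = np.full(n_levels, np.nan)
    for o in range(n_levels):
        mask = levels == o
        if mask.any():
            var_m[o] = var[mask].mean()    # measured sigma^2_{O_m} at level o
    return var_m
```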
Results. We used three different video cameras for this experiment: Sony
DCR-TRV9E (Camera A), Sony DCR-TRV900 NTSC (Camera B), and Sony
DSR-PD190P (Camera C). To obtain the ground truth of Camera C, we
used Mitsunaga and Nayar’s method [4], and the Macbeth color checker-based
method [1], and combined these results by taking the mean. For Camera A
and B, we used only the Macbeth color checker-based method [1] to obtain the
ground truth because the exposure setting was not available in these cameras.
The results obtained by the proposed method are compared with the ground
truth curves.
Figure 3 shows the results of our algorithm. The top row shows the plot of
the estimated inverse response functions with the corresponding ground truth
Fig. 3. Results of our estimation method. Top row: comparison of inverse response
functions. Bottom row: measured noise variance and fitting result.
curves. The bottom row shows the estimated and measured distributions of
noise variances; the horizontal axis is the normalized output, and the vertical
axis corresponds to the noise variance. Figure 4 shows the scenes used to obtain
these results.
Figure 3 (a) shows an estimation result using the blue channel of Camera A.
The maximum difference is 0.052 and the RMSE is 0.025 in terms of normalized
input. As shown in the bottom of (a), the noise variances in lower output levels
contain severe measured errors. Our algorithm is robust against such errors be-
cause of the use of adaptive weighting factors. Figure 3 (b) shows the result of
Camera B (green channel). The maximum difference is 0.037 and the RMSE is
0.022. Figure 3 (c) shows the estimation result of Camera C (red channel). The
input frames are obtained by setting the camera gain to 12 dB, which causes a high
noise level. The maximum difference is 0.037 and the RMSE is 0.024.
Table 1 summarizes all the experimental results. For each camera, three differ-
ent scenes are used. The algorithm is applied to the RGB channels independently;
therefore, 9 datasets are used for each camera. Disparity represents the mean
of the maximum differences in normalized input. These results show that the proposed
method performs well even though the algorithm uses only the noise variance as
input.
Figure 5 shows the comparison between our method and Matsushita and Lin’s
method [15]. Unlike other estimation methods, these two methods take noise as
input. We use Camera B for the comparison.
As shown in the results, the estimation results of the two methods are equivalent
when the number of images is relatively large. However, Matsushita and Lin's method
breaks down when the number of samples becomes small, whereas our method shows a
significant advantage. In statistics, it is known that the variance of the sample
variance is inversely proportional to the number of samples. Therefore, as the number
of samples increases, the measured variance stabilizes more quickly than the profile
of the noise distribution does. In addition, Matsushita and Lin's symmetry criterion
naturally requires a large number of samples to obtain smooth noise profiles, which
is not satisfied for the smaller sample counts shown in Figure 5. This is why our
method works well when the number of samples is relatively small.
Fig. 5. Comparison between our method and Matsushita and Lin’s method [15]. Our
method uses noise variance, but not profiles of noise distributions. Our method works
well even when the sampling number is relatively small.
p\big(\sigma^2_{O_m}(\tilde O) \mid g\big) = \frac{1}{Z} \exp\!\big( -\lambda_p\, E_3(g; \sigma^2_{O_m}(\tilde O)) \big) ,    (27)
Fig. 6. Relationship between the noise level and mean RMSE of the estimates. Left
image shows one of the photographed scenes. Top row shows magnification of a part of
the image at different ISO levels. Bottom row shows the mean RMSE of RGB channels
at each ISO gain level, and demonstrates that our estimation method is independent
of noise levels.
Results. We used a Canon EOS-20D camera for the experiment. To obtain the
ground truth, we used Mitsunaga and Nayar’s method [4] using images taken
with different exposures. Since our focus is on estimating the inverse response
functions from the measured noise variances, we photographed a scene composed
of relatively flat and uniformly colored surfaces, so that the noise variances can
be easily obtained. The left image in Figure 6 shows one of two scenes used for
the experiment. We photographed them five times each at six different camera
gains (ISO 100–3200). We manually selected 21 homogeneous image regions
to obtain the noise variances as input. In total, we ran our estimation algorithm
60 times (= 2 scenes × 5 shots × 6 ISO levels) for each RGB color channel.
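The sketch below indicates how noise-variance samples could be collected from such manually selected homogeneous regions; the region format and the per-channel handling are assumptions made for illustration, not the paper's actual implementation.

import numpy as np

def noise_variance_samples(channel, regions):
    # channel: 2D array holding one color channel of a photograph.
    # regions: list of (row_slice, col_slice) pairs marking homogeneous patches
    # (hypothetical format).  Each region yields one (mean output, variance) pair.
    means, variances = [], []
    for rs, cs in regions:
        patch = channel[rs, cs].astype(float)
        means.append(patch.mean())            # proxy for the noise-free output level
        variances.append(patch.var(ddof=1))   # measured noise variance
    return np.array(means), np.array(variances)

With 21 regions this yields 21 (mean, variance) pairs per channel, which play the role of the measured σ²_{Om}(Õ) fed to the estimation algorithm.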
Figure 6 summarizes the results of estimation at different ISO levels. The
noise level increases with the ISO gain level, as shown by the cropped images
on the top. The results indicate that the estimation is unaffected by the greater
noise level. The mean RMSE is almost constant across the different ISO levels,
which verifies that our method is not sensitive to the noise level.
5 Conclusions
In this paper, we have proposed a method for estimating a radiometric response
function using noise variance, rather than the noise distribution, as input. The relationship
between the radiometric response function and noise variances in input and out-
put domains is explicitly derived, and this result is used to develop the estimation
algorithm. The experiments were performed for two different scenarios: one with
multiple shots of the same scene, and the other with only a single image. These
experiments quantitatively demonstrate the effectiveness of the proposed algo-
rithm, especially its robustness against noise. With our method, neither special
equipment nor images taken with multiple exposures are necessary.
Limitations. Our method works best when the measured noise variances cover
a wide range of intensity levels: wider coverage provides more information to
the algorithm, so the problem becomes better constrained. This becomes an is-
sue particularly in the single-image case, where we used a simple method to
collect the noise variances; more sophisticated methods such as [20] can be
used to obtain more accurate measurements that could potentially cover a
wider range of intensity levels.
Acknowledgement
The authors would like to thank Dr. Bennett Wilburn for his useful feedback on
this research.
References
1. Chang, Y.C., Reid, J.F.: RGB calibration for color image analysis in machine vision.
IEEE Trans. on Image Processing 5, 1414–1422 (1996)
2. Nayar, S.K., Mitsunaga, T.: High dynamic range imaging: Spatially varying pixel
exposures. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 472–479 (2000)
3. Mann, S., Picard, R.: Being ‘undigital’ with digital cameras: Extending dynamic
range by combining differently exposed pictures. In: Proc. of IS & T 48th Annual
Conf., pp. 422–428 (1995)
4. Mitsunaga, T., Nayar, S.K.: Radiometric self-calibration. In: Proc. of Comp. Vis.
and Patt. Recog. (CVPR), pp. 374–380 (1999)
5. Grossberg, M.D., Nayar, S.K.: What is the space of camera response functions? In:
Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 602–609 (2003)
6. Mann, S.: Comparametric equations with practical applications in quantigraphic
image processing. IEEE Trans. on Image Processing 9, 1389–1406 (2000)
7. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from pho-
tographs. Proc. of ACM SIGGRAPH, 369–378 (1997)
8. Tsin, Y., Ramesh, V., Kanade, T.: Statistical calibration of CCD imaging process.
In: Proc. of Int’l Conf. on Comp. Vis. (ICCV), pp. 480–487 (2001)
9. Pal, C., Szeliski, R., Uyttendale, M., Jojic, N.: Probability models for high dynamic
range imaging. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 173–180
(2004)
10. Grossberg, M.D., Nayar, S.K.: What can be known about the radiometric response
function from images? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.)
ECCV 2002. LNCS, vol. 2353, pp. 189–205. Springer, Heidelberg (2002)
11. Kim, S.J., Pollefeys, M.: Radiometric alignment of image sequences. In: Proc. of
Comp. Vis. and Patt. Recog. (CVPR), pp. 645–651 (2004)
12. Lin, S., Gu, J., Yamazaki, S., Shum, H.Y.: Radiometric calibration from a single
image. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 938–945 (2004)
13. Lin, S., Zhang, L.: Determining the radiometric response function from a single
grayscale image. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 66–73
(2005)
14. Wilburn, B., Xu, H., Matsushita, Y.: Radiometric calibration using temporal irra-
diance mixtures. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR) (2008)
15. Matsushita, Y., Lin, S.: Radiometric calibration from noise distributions. In: Proc.
of Comp. Vis. and Patt. Recog. (CVPR) (2007)
16. Takamatsu, J., Matsushita, Y., Ikeuchi, K.: Estimating camera response functions
using probabilistic intensity similarity. In: Proc. of Comp. Vis. and Patt. Recog.
(CVPR) (2008)
17. Matsushita, Y., Lin, S.: A probabilistic intensity similarity measure based on noise
distributions. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR) (2007)
18. Janesick, J.R.: Photon Transfer. SPIE Press (2007)
19. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: Multiplexing for optimal lighting.
IEEE Trans. on Patt. Anal. and Mach. Intell. 29, 1339–1354 (2007)
20. Liu, C., Freeman, W.T., Szeliski, R., Kang, S.B.: Noise estimation from a single
image. In: Proc. of Comp. Vis. and Patt. Recog. (CVPR), pp. 901–908 (2006)
21. Healey, G.E., Kondepudy, R.: Radiometric CCD camera calibration and noise esti-
mation. IEEE Trans. on Patt. Anal. and Mach. Intell. 16, 267–276 (1994)
22. Alter, F., Matsushita, Y., Tang, X.: An intensity similarity measure in low-light
conditions. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS,
vol. 3954, pp. 267–280. Springer, Heidelberg (2006)
23. Consul, P.C.: Generalized Poisson Distributions: Properties and Applications. Mar-
cel Dekker Inc., New York (1989)
24. Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer
Journal 7, 308–312 (1965)
25. Botev, Z., Kroese, D.: Global likelihood optimization via the cross-entropy method
with an application to mixture models. In: Proc. of the 36th Conf. on Winter simul.,
pp. 529–535 (2004)
Solving Image Registration Problems Using
Interior Point Methods
1 Introduction
Mikolajczyk and Schmid [4] proposed a very effective scheme for detecting
and matching interest points under severe affine deformations. This approach
works best when the interframe motion is close to affine, since more complicated
deformation models can distort the feature points beyond recognition. Further,
it becomes increasingly difficult to apply robust estimation methods as the com-
plexity of the deformation model increases, since an ever increasing number of
reliable point matches is required.
Belongie and Malik [5] proposed an elegant approach to matching shapes
based on information derived from an analysis of contour features. This approach
is similar to [4] in that it revolves around feature extraction and pointwise cor-
respondence. The method described in this work is very different from these in
that it avoids the notion of features altogether; instead, it proceeds by construct-
ing a matching function based on low-level correlation volumes and allows every
pixel in the image to constrain the match to the extent that it can.
Shekhovstov, Kovtun and Hlavac [6] have developed a novel method for image
registration that uses Sequential Tree-Reweighted Message Passing to solve a lin-
ear program that approximates a discrete Markov Random Field optimization
problem. Their work also seeks to construct a globally convex approximation
to the underlying image matching problem, but the approach taken to formulat-
ing and solving the optimization problem differs substantially from the method
discussed in this paper.
Linear programming has been previously applied to motion estimation [7,8].
The work by Jiang et al. [7] on matching feature points is similar to ours in that
the data term associated with each feature is approximated by a convex com-
bination of points on the lower convex hull of the match cost surface. However,
their approach is formulated as an optimization over the interpolating coeffi-
cients associated with these convex hull points which is quite different from the
approach described in this paper. Also their method uses the simplex method
for solving the LP while the approach described in this paper employs an inte-
rior point solver which allows us to exploit the structure of the problem more
effectively.
The problem of recovering the deformation that maps a given base image onto a
given target image can be phrased as an optimization problem. For every pixel
in the target image one can construct an objective function, exy , which captures
how similar the target pixel is to its correspondent in the base image as a function
of the displacement applied at that pixel.
Figure 1(a) shows an example of one such function for a particular pixel in one
of the test images. This particular profile was constructed by computing the ℓ2 dif-
ference between the RGB value of the target pixel and the RGB values of the pixels
in the base image for various displacements up to ±10 pixels in each direction.
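A rough sketch of how such a per-pixel error volume could be computed with NumPy is shown below; the border handling (edge padding) and array layout are illustrative choices rather than details taken from the paper.

import numpy as np

def error_volume(target, base, max_disp=10):
    # target, base: HxWx3 float arrays.  Returns e[y, x, i, j], the squared L2
    # RGB difference when pixel (y, x) of the target is compared against pixel
    # (y + dy, x + dx) of the base image, with dy, dx in [-max_disp, max_disp].
    H, W, _ = target.shape
    D = 2 * max_disp + 1
    pad = np.pad(base, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)),
                 mode='edge')                      # keep every shift inside the array
    e = np.empty((H, W, D, D))
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            shifted = pad[max_disp + dy:max_disp + dy + H,
                          max_disp + dx:max_disp + dx + W]
            e[:, :, i, j] = np.sum((target - shifted) ** 2, axis=2)
    return e

In practice such volumes would be built on the downsampled images of the coarse-to-fine scheme described later, which keeps their size manageable.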
Our goal then is to minimize an objective function E(px , py ) which models
how the discrepancy between the target and base images varies as a function of
the deformation parameters, px and py .
E(p_x, p_y) = \sum_x \sum_y e_{xy}\big( D_x(x, y, p_x),\, D_y(x, y, p_y) \big)    (1)
In general, since the component exy functions can have arbitrary form the land-
scape of the objective function E(px , py ) may contain multiple local minima
Fig. 1. (a) Error surface associated with a particular pixel in the target image that
encodes how compatible that pixel is with various (x, y) displacements. (b) Piecewise
planar convex approximation of the error surface.
which can confound most standard optimization methods that proceed by con-
structing local approximations of the energy function.
The crux of the proposed approach is to introduce a convex approximation for
the individual objective functions exy . This leads directly to an approximation
of the global objective function E (px , py ) which is convex in the deformation
parameters. Once this has been done, one can recover estimates for the deforma-
tion parameters and, hence, the deformation by solving a convex optimization
problem which is guaranteed to have a unique minimum.
The core of the approximation step is shown in Figure 1(b): here the original
objective function is replaced by a convex lower bound which is constructed
by considering the convex hull of the points that define the error surface. This
convex lower hull is bounded below by a set of planar facets.
In order to capture this convex approximation in the objective function we
introduce one auxiliary variable z(x, y) for every pixel in the target image. There
are a set of linear constraints associated with each of these variables which reflect
the constraint that this value must lie above all of the planar facets that define
the convex lower bound.
z(x, y) \ge a^i_x(x, y)\, D_x(x, y, p_x) + a^i_y(x, y)\, D_y(x, y, p_y) - b^i(x, y) \quad \forall i    (2)

Here the terms a^i_x, a^i_y and b^i denote the coefficients associated with each of the
facets in the approximation.
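One way to obtain the facet coefficients a^i_x, a^i_y and b^i for a single pixel is to build the 3D convex hull of the sampled points (dx, dy, e(dx, dy)) and keep its downward-facing facets; the sketch below uses scipy.spatial.ConvexHull and illustrates the idea rather than reproducing the authors' implementation.

import numpy as np
from scipy.spatial import ConvexHull

def lower_hull_facets(dx, dy, e):
    # dx, dy, e: 1D arrays of sampled displacements and their matching costs.
    # Returns (ax, ay, b) so that the convex lower bound obeys
    # z >= ax*Dx + ay*Dy - b for every facet.
    pts = np.column_stack([dx, dy, e])
    hull = ConvexHull(pts)
    # Each row of hull.equations is [nx, ny, nz, c] with nx*x + ny*y + nz*z + c = 0
    # on the facet; nz < 0 selects the downward-facing (lower-envelope) facets.
    eqs = hull.equations[hull.equations[:, 2] < -1e-12]
    nx, ny, nz, c = eqs[:, 0], eqs[:, 1], eqs[:, 2], eqs[:, 3]
    # On each facet plane z = -(nx*x + ny*y + c)/nz, i.e. z = ax*x + ay*y - b.
    return -nx / nz, -ny / nz, c / nz

Each returned facet contributes one linear constraint of the form of Eq. (2).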
The problem of minimizing the objective function E (px , py ) can now be
rephrased as a linear program as follows:
\min_{p_x, p_y, z} \; \sum_x \sum_y z(x, y)    (3)

\text{s.t.} \quad z(x, y) \ge a^i_x(x, y)\, D_x(x, y, p_x) + a^i_y(x, y)\, D_y(x, y, p_y) - b^i(x, y) \quad \forall x, y, i    (4)

Collecting these constraints in matrix form gives

\min_{p_x, p_y, z} \; \mathbf{1}^T z \quad \text{s.t.} \quad A_x D_x + A_y D_y - I_z z \le b ,    (5)

where A_x, A_y and I_z are sparse matrices obtained by concatenating the con-
straints associated with all of the planar facets, and z and b are vectors obtained
by collecting the z(x, y) and b^i(x, y) variables respectively.
Note that the A_x, A_y and I_z matrices all have the same fill pattern and are
structured as shown in Equation (6); the non-zero entries in the I_z matrix are all
1. In this equation M denotes the total number of pixels in the image and S_i
refers to the number of planar facets associated with pixel i.
A = \begin{bmatrix}
a_{11} & 0 & \cdots & \cdots & 0 \\
a_{21} & 0 & \cdots & \cdots & 0 \\
\vdots & 0 & \cdots & \cdots & 0 \\
a_{S_1 1} & 0 & \cdots & \cdots & 0 \\
0 & a_{12} & 0 & \cdots & 0 \\
0 & a_{22} & 0 & \cdots & 0 \\
0 & \vdots & 0 & \cdots & 0 \\
0 & a_{S_2 2} & 0 & \cdots & 0 \\
0 & 0 & \ddots & \ddots & 0 \\
0 & \cdots & \cdots & 0 & a_{1M} \\
0 & \cdots & \cdots & 0 & \vdots \\
0 & \cdots & \cdots & 0 & a_{S_M M}
\end{bmatrix}    (6)
The linear program shown in Equation 5 can be augmented to include constraints
on the displacement entries, Dx , Dy and the z values as shown in Equation 7.
Here the vectors blb and bub capture the concatenated lower and upper bound
constraints respectively. It would also be a simple matter to include bounding
constraints on the parameter values at this stage. Alternatively one could easily
add a convex regularization term to reflect a desire to minimize the bending
energy associated with the deformation.
\min_{p_x, p_y, z} \; \mathbf{1}^T z    (7)

\begin{bmatrix} A_x \;\; A_y \;\; -I_z \\ -I \\ I \end{bmatrix}
\begin{pmatrix} C & 0 & 0 \\ 0 & C & 0 \\ 0 & 0 & I \end{pmatrix}
\begin{pmatrix} p_x \\ p_y \\ z \end{pmatrix}
\le
\begin{pmatrix} b \\ b_{lb} \\ b_{ub} \end{pmatrix}
Note that the proposed approximation procedure increases the ambiguity asso-
ciated with matching any individual pixel since the convex approximation is a
lower bound which may significantly underestimate the cost associated with
assigning a particular displacement to a pixel. What each pixel ends up con-
tributing is a set of convex terms to the global objective function. The linear
program effectively integrates the convex constraints from tens of thousands of
pixels, constraints which are individually ambiguous but which collectively iden-
tify the optimal parameters. In this scheme each pixel contributes to constraining
the deformation parameters to the extent that it is able. Pixels in homogeneous
regions may contribute very little to the global objective while well defined fea-
tures may provide more stringent guidance. There is no need to explicitly identify
distinguished features since local matching ambiguities are handled through the
approximation process.
A convex optimization problem of the form

\min f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, \ldots, m    (8)

is solved by minimizing φ(x, t) = t f_0(x) − Σ_{i=1}^{m} log(−f_i(x)) for increasing values
of t until convergence. At each value of t a local step direction, the Newton step,
needs to be computed. This involves the solution of a system of linear equations
involving the Hessian and the gradient of φ(x, t). The Hessian can be computed
from the expression H = A^T diag(s^{−2}) A, where s = b − Ax and s^{−2}
denotes the vector formed by inverting and squaring the elements of s. Similarly,
the gradient of φ(x, t) can be computed as g = t ∇f_0(x) + A^T s^{−1}, where s^{−1}
denotes the vector formed by inverting the elements of s.
In short, computing the Newton Step boils down to solving the linear system in
Equation 14. Note that the size of this system depends only on the dimension of
the parameter vector, p. For example if one were interested in fitting an affine
model which involves 6 parameters, 3 for px and 3 for py , one would only end
up solving a linear system with six degrees of freedom. Note that the compu-
tational complexity of this key step does not depend on the number of pixels
being considered or on the number of constraints that were used to construct the
convex approximation. This is extremely useful since typical matching problems
will involve hundreds of thousands of pixels and a similar number of constraint
equations. Even state of the art LP solvers like MOSEK and TOMLAB would
have difficulty solving problems of this size.
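For concreteness, the sketch below shows the generic Newton step of the log-barrier method for an LP min c^T x s.t. A x ≤ b, using the Hessian and gradient expressions quoted above; it does not reproduce the paper's structured elimination that reduces the system to the dimension of the deformation parameters, and the variable names are illustrative.

import numpy as np

def barrier_newton_step(A, b, c, x, t):
    # One Newton step of phi(x, t) = t*c^T x - sum_i log(b_i - a_i^T x).
    s = b - A @ x                             # slacks; must stay strictly positive
    grad = t * c + A.T @ (1.0 / s)            # gradient of phi
    H = A.T @ ((1.0 / s ** 2)[:, None] * A)   # Hessian: A^T diag(s^-2) A
    return np.linalg.solve(H, -grad)          # Newton direction

A backtracking line search that keeps all slacks positive would follow each step, and t is increased geometrically between centering phases.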
Experiments were carried out with two classes of deformation models. In the first
class the displacements at each pixel are computed as a polynomial function of
the image coordinates. For example for a second order model:
D_x(x, y) = c_1 + c_2 x + c_3 y + c_4 xy + c_5 x^2 + c_6 y^2    (15)
These deformations are parameterized by the coefficients of the polynomials. The
complexity of the model can be adjusted by varying the degree of the polynomial.
A number of interesting deformation models can be represented in this manner,
including affine, bilinear, quadratic and bicubic.
Another class of models can be represented as a combination of an affine
deformation and a radial basis function. That is
D_x(x, y) = c_1 + c_2 x + c_3 y + \sum_i k_i\, \phi\big( (x, y) - (x_i, y_i) \big)    (16)
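As an illustration of the polynomial parameterization of Eq. (15), the sketch below builds the matrix that maps the six coefficients of a second-order model to per-pixel x-displacements; identifying this matrix with the C block of Eq. (7) is an assumption about bookkeeping, not a detail stated in the paper.

import numpy as np

def quadratic_basis(H, W):
    # Returns C with shape (H*W, 6) so that Dx.ravel() = C @ [c1, ..., c6]
    # for the model Dx = c1 + c2*x + c3*y + c4*x*y + c5*x^2 + c6*y^2.
    y, x = np.mgrid[0:H, 0:W].astype(float)
    x, y = x.ravel(), y.ravel()
    return np.column_stack([np.ones_like(x), x, y, x * y, x ** 2, y ** 2])

For example, a coefficient vector with only c2 = 0.01 non-zero produces the horizontal shear Dx = 0.01 x; the same basis is reused for Dy with its own coefficient vector.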
be considered at the finer scales which limits the size of the correlation volumes
that must be constructed.
In the experiments described in section 3.1 the images are first downsampled
by a factor of 4 and then matched. The deformations computed at this scale in-
form the search for correspondences at the next finer scale which is downsampled
from the originals by a factor of 2.
Note that as the approach proceeds to finer scales, the convex approximation
is effectively being constructed over a smaller range of disparities which means
that it increasingly approaches the actual error surface.
3 Experimental Results
Two different experiments were carried out to gauge the performance of the
registration scheme quantitatively. In the first experiment each of the images in
our data set was warped by a random deformation and the proposed scheme was
employed to recover the parameters of this warp. The recovered deformation was
compared to the known ground truth deformation to evaluate the accuracy of
the method.
In the second set of experiments the registration scheme was applied to por-
tions of the Middlebury stereo data set. The disparity results returned by the
method were then compared to the ground truth disparities that are provided
for these image pairs.
Table 1. This table details the deformation applied to each of the images in the data
set and reports the discrepancy between the deformation field returned by the method
and the ground truth displacement field
                                                      error in pixels
Image      Deformation Model         no. of parameters   mean     median   max
Football   Gaussian                  38                  0.1524   0.1306   0.5737
Hurricane  Gaussian                  38                  0.1573   0.1262   0.7404
Spine      Affine                    6                   0.1468   0.1314   0.4736
Peppers    Gaussian                  38                  0.1090   0.0882   0.7964
Cells      Thin Plate Spline         38                  0.1257   0.1119   0.8500
Brain      Gaussian                  38                  0.1190   0.0920   0.8210
Kanji      Third-degree polynomial   20                  0.1714   0.0950   2.5799
Aerial     Bilinear                  8                   0.0693   0.0620   0.2000
Face1      Gaussian                  38                  0.1077   0.0788   0.6004
Face2      Gaussian                  38                  0.5487   0.3095   4.6354
results are tabulated in Table 1. This table also indicates what type of defor-
mation model was applied to each of the images along with the total number of
parameters required by that model.
Note that in every case the deformed result returned by the procedure is al-
most indistinguishable from the given target. More importantly, the deformation
fields returned by the procedure are consistently within a fraction of a pixel of
the ground truth values. The unoptimized Matlab implementation of the match-
ing procedure takes approximately 5 minutes to proceed through all three scales
and produce the final deformation field for a given image pair.
Fig. 2. Results obtained by applying the proposed method to actual image pairs:
(a) Football, (b) Hurricane, (c) Spine, (d) Peppers, (e) Cells, (f) Brain, (g) Kanji,
(h) Aerial, (i) Face1, (j) Face2. The first two columns correspond to the input base
and target images respectively, while the last column corresponds to the result
produced by the registration scheme.
The first column of Figure 4 shows the left image in the pair, the second
column shows what would be obtained if one used the raw SSD stereo results
and the final column shows the ground truth disparities.
Fig. 4. The proposed image registration scheme was applied to the delineated regions
in the Middlebury Stereo Data Set: (a) Teddy, (b) Venus. The first column shows the
left image, the second column the raw results of the SSD correlation matching and
the last column the ground truth disparity.
Table 2. This table reports the discrepancy between the affine deformation field re-
turned by the method and the ground truth disparities within each region
error in pixels
Image Region mean median
teddy bird house roof 0.2558 0.2245
teddy foreground 0.9273 0.8059
venus left region 0.0317 0.0313
venus right region 0.0344 0.0317
The selected rectangles are overlaid on each of the images. These regions
were specifically chosen in areas where there was significant ambiguity in the
raw correlation scores to demonstrate that the method was capable of correctly
integrating ambiguous data. Table 2 summarizes the results of the fitting proce-
dure. The reconstructed disparity fields within the regions were compared to the
ground truth disparities and the mean and median discrepancy between these
two fields is computed over all of the pixels within the region.
4 Conclusion
This paper has presented a novel approach to tackling the image registration
problem wherein the original image matching objective function is approximated
by a linear program which can be solved using the interior point method. The
paper also describes how one can exploit the special structure of the resulting
linear program to develop efficient algorithms. In fact the key step in the resulting
procedure only involves inverting a symmetric matrix whose dimension
reflects the complexity of the model being recovered.
While the convex approximation procedure typically increases the amount of
ambiguity associated with any individual pixel, the optimization procedure ef-
fectively aggregates information from hundreds of thousands of pixels so the net
result is a convex function that constrains the actual global solution. In a cer-
tain sense, the proposed approach is dual to traditional non-linear optimization
schemes which seek to construct a local convex approximation to the objective
function. The method described in this work proceeds by constructing a global
convex approximation over the specified range of displacements.
A significant advantage of the approach is that once the deformation model
and displacement bounds have been selected, the method is insensitive to ini-
tialization since the convex optimization procedure will converge to the same
solution regardless of the start point. This means that the method can be di-
rectly applied to situations where there is a significant deformation.
The method does not require any special feature detection or contour extrac-
tion procedure. In fact all of the correlation volumes used in the experiments
were computed using nothing more than pointwise pixel comparisons. Since the
method does not hinge on the details of the scoring function more sophisticated
variants could be employed as warranted. The results indicate the method pro-
duces accurate results on a wide range of image types and can recover fairly
large deformations.
References
1. Bajcsy, R., Kovacic, S.: Multiresolution elastic matching. Computer Vision, Graph-
ics and Image Processing 46(1), 1–21 (1989)
2. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. IEEE Transactions
on Pattern Analysis and Machine Intelligence 23(6), 681–685 (2001)
3. Baker, S., Matthews, I.: Equivalence and efficiency of image alignment algorithms.
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1090–1097
(2001)
4. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors.
International Journal of Computer Vision 60(1), 63–86 (2004)
5. Belongie, S., Malik, J.: Shape matching and object recognition using shape con-
texts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(24),
509 (2002)
6. Shekhovstov, A., Kovtun, I., Hlavac, V.: Efficient MRF deformation model for non-
rigid image matching. In: IEEE Conference on Computer Vision and Pattern
Recognition (2007)
7. Jiang, H., Drew, M., Li, Z.N.: Matching by linear programming and successive
convexification. PAMI 29(6) (2007)
8. Ben-Ezra, M., Peleg, S., Werman, M.: Real-time motion analysis with linear pro-
gramming. In: ICCV (1999)
9. Friston, K.J., Ashburner, J., Frith, C.D., Poline, J.B., Heather, J.D., Frackowiak,
R.S.J.: Spatial registration and normalization of images. Human Brain Mapping 2,
165–189 (1995)
10. Modersitzki, J.: Numerical Methods for Image Registration. Oxford University
Press, Oxford (2004)
11. Boyd, S., VandenBerghe, L.: Convex Optimization. Cambridge University Press,
Cambridge (2004)
3D Face Model Fitting for Recognition
1 Introduction
The use of 3D scan data for face recognition purposes has become a popular research
area. With high recognition rates reported for several large sets of 3D face scans, the 3D
shape information of the face proved to be a useful contribution to person identification.
The major advantage of 3D scan data over 2D color data, is that variations in scaling and
illumination have less influence on the appearance of the acquired face data. However,
scan data suffers from noise and missing data due to self-occlusion. To deal with these
problems, 3D face recognition methods should be invariant to noise and missing data,
or the noise has to be removed and the holes interpolated. Alternatively, data could be
captured from multiple sides, but this requires complex data acquisition. In this work
we propose a method that produces an accurate fit of a statistical 3D shape model of
the face to the scan data. The 3D geometry of the generated face instances, which are
without noise and holes, are effectively used for 3D face recognition.
Related work. The task to recognize 3D faces has been approached with many dif-
ferent techniques as described in surveys of Bowyer et al. [1] and Scheenstra et al. [2].
Several of these 3D face recognition techniques are based on 3D geodesic surface infor-
mation, such as the methods of Bronstein et al. [3] and Berretti et al. [4]. The geodesic
distance between two points on a surface is the length of the shortest path between two
points. To compute accurate 3D geodesic distances for face recognition purposes, a 3D
face without noise and without holes is desired. Since this is typically not the case with
laser range scans, the noise has to be removed and the holes in the 3D surface interpolated.
However, the success of basic noise removal techniques, such as Laplacian smoothing is
very much dependent on the resolution of the scan data. Straightforward techniques to
interpolate holes using curvature information or flat triangles often fail in case of com-
plex holes, as pointed out in [5]. The use of a deformation model to approximate new
scan data and interpolate missing data is a gentle way to regulate flaws in scan data.
A well known statistical deformation model specifically designed for surface meshes
of 3D faces, is the 3D morphable face model of Blanz and Vetter [6]. This statistical
model was built from 3D face scans with dense correspondences to which Principal
Component Analysis (PCA) was applied. In their early work, Blanz and Vetter [6] fit
this 3D morphable face model to 2D color images and cylindrical depth images from the
Cyberware™ scanner. In each iteration of their fitting procedure, the model parame-
ters are adjusted to obtain a new 3D face instance, which is projected to 2D cylindrical
image space allowing the comparison of its color values (or depth values) to the in-
put image. The parameters are optimized using a stochastic Newton algorithm. More
recently, Blanz et al. [7] proposed a method to fit their 3D morphable face model to
more common textured depth images. The fitting process is similar to their previous
algorithm, but now the cost function is minimized using both color and depth values
after the projection of the 3D model to 2D cylindrical image space. To initialize their
fitting process, they manually select seven corresponding face features on their model
and in the depth scan. A morphable model of expressions was proposed by Lu et al.
[8]. Starting from an existing neutral scan, they use their expression model to adjust the
vertices in a small region around the nose to obtain a better fit of the neutral scan to a
scan with a certain expression.
Non-statistical deformation models were proposed as well. Huang et al. [9] proposed
a global to local deformation framework to deform a shape with an arbitrary dimension
(2D, 3D or higher) to a new shape of the same class. They show their framework’s ap-
plicability to 3D faces, for which they deform an incomplete source face to a target face.
Kakadiaris et al. [10] deform an annotated face model to scan data. Their deformation
is driven by triangles of the scan data attracting the vertices of the model. The deforma-
tion is restrained by a stiffness, mass and damping matrix, which control the resistance,
velocity and acceleration of the model’s vertices. The advantage of such deformable
faces is that they are not limited to the statistical changes of the input shapes, so the de-
formation has fewer restrictions. However, this is also their disadvantage, because these
models cannot rely on statistics in case of noise and missing data.
Contribution. First, we propose a fully automatic algorithm to efficiently optimize
the parameters of the morphable face model, creating a new face instance that accurately
fits the 3D geometry of the scan data. Unlike other methods, ours needs no manual ini-
tialization, so that batch processing of large data sets has become feasible. Second, we
quantitatively evaluate our fitted face models and show that the use of multiple compo-
nents improves the fitting process. Third, we show that our model fitting method is
more accurate than existing methods. Fourth, we show that the accurately generated
face instances can be effectively used for 3D face recognition.
3D faces. This statistical point distribution model (PDM) was built from 100 cylin-
drical 3D face scans with neutral expressions from which n=75,972 correspondences
were selected using an optic flow algorithm. Each face shape Si was described us-
ing the set of correspondences S = (x_1, y_1, z_1, ..., x_n, y_n, z_n)^T ∈ ℝ^{3n} and a mean
face S̄ was determined. PCA was applied to these 100 sets Si to obtain the m=99
most important eigenvectors of the PDM. The mean face S̄, the eigenvectors
s_i = (Δx_1, Δy_1, Δz_1, ..., Δx_n, Δy_n, Δz_n)^T, the eigenvalues λ_i (σ_i² = λ_i) and weights w_i
are used to model new face instances according to S_inst = S̄ + Σ_{i=1}^{m} w_i σ_i s_i. Weight
w_i represents the number of standard deviations a face instance morphs along eigen-
vector s_i. Since the connectivity of the n correspondences in the PDM is known, each
instance is a triangular mesh with proper topology and without holes.
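A minimal sketch of instancing the PDM from this description is given below; the array shapes are assumptions about how the mean shape, eigenvectors and standard deviations are stored.

import numpy as np

def face_instance(S_mean, eigvecs, sigmas, w):
    # S_mean: (3n,) mean shape; eigvecs: (m, 3n) eigenvectors s_i;
    # sigmas: (m,) standard deviations sigma_i; w: (m,) weights expressed in
    # numbers of standard deviations.  Returns S_inst = S_mean + sum_i w_i*sigma_i*s_i.
    return S_mean + (np.asarray(w) * np.asarray(sigmas)) @ np.asarray(eigvecs)

Setting w = 0 reproduces the mean face; keeping each |w_i| within a few standard deviations keeps the instance inside the statistically plausible range of the model.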
3 Face Scans
We fit the morphable face model to the 3D frontal face scans of the University of Notre
Dame (UND) Biometrics Database [12]. This set contains 953 range scans and a corre-
sponding 2D color texture from 277 different subjects. All except ten scans were used
in the Face Recognition Grand Challenge (FRGC v.1). Because the currently used mor-
phable model is based on faces with neutral expressions only, it makes no sense to use
collections containing many non-neutral scans such as the FRGC v.2. Nevertheless, our
proposed method performs well for the small expression variations of the UND set.
Throughout this work, we have only used the 3D scan data and neglected the available
2D color information.
We aim at 3D face recognition, so we need to segment the face from each scan. For
that, we employ our pose normalization method [13] that normalizes the pose of the face
and localizes the tip of the nose. Before pose normalization was applied to the UND
scan data, we applied a few basic preprocessing steps to the scan data: the 2D depth
images were converted to triangle meshes by connecting the adjacent depth samples
with triangles, slender triangles and singularities were removed, and only considerably
large components were retained.
The cleaned surface meshes were randomly sampled, such that every ≈2.0 mm² of
the surface is approximately sampled once. The pose normalization method uses these
locations in combination with their surface normal as initial placements for a nose tip
template. To locations where this template fits well, a second template of global face
Fig. 1. Face segmentation. The depth image (left) is converted to a surface mesh (middle). The
surface mesh is cleaned, the tip of the nose is detected and the face segmented (right, in pink).
features is fitted to normalize the face’s pose and to select the tip of the nose. The face
is then segmented by removing the scan data with a Euclidean distance larger than 100
mm from the nose tip. These face segmentation steps are visualized in Fig. 1.
using a kD-tree. The RMS distance is then measured between M_1 and M_2 as:

d_{rms}(M_1, M_2) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} e_{min}(p_i, M_2)^2 } ,    (2)

using n vertices from M_1. Closest point pairs (p, p′) for which p′ belongs to the bound-
ary of the face scan are not used in the distance measure.
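A sketch of this RMS measure using SciPy's cKDTree is shown below; it approximates e_min by the distance to the nearest vertex of M2 rather than to the nearest surface point, and represents the boundary exclusion by a caller-supplied mask, both simplifications of the procedure described above.

import numpy as np
from scipy.spatial import cKDTree

def rms_distance(M1_vertices, M2_vertices, M2_boundary_mask=None):
    # M1_vertices: (n, 3) query vertices; M2_vertices: (k, 3) scan vertices;
    # M2_boundary_mask: optional (k,) bool array flagging boundary vertices of
    # the scan, whose closest-point pairs are discarded.
    tree = cKDTree(M2_vertices)
    d, idx = tree.query(M1_vertices)          # e_min(p_i, M2) for each vertex of M1
    if M2_boundary_mask is not None:
        d = d[~M2_boundary_mask[idx]]         # drop pairs ending on the scan boundary
    return np.sqrt(np.mean(d ** 2))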
The morphable face model has n=75,972 vertices that cover the face, neck and ear
regions and its resolution in the upward direction is three times higher than in its side-
ways direction. Because the running time of our measure is dependent on the number of
vertices, we recreated the morphable face model such that it contains only the face (data
within 110 mm from the tip of the nose) and not the neck and ears. To obtain a more
uniform resolution for the model, we reduced the upward resolution to one third of
that of the original model. The number of vertices of this adjusted morphable mean face is now
n=12,964, a sample every ≈2.6 mm² of the face area.
vertices according to S_inst = S̄ + Σ_{i=1}^{m} w_i σ_i s_i, measuring the RMS-distance of the
new instance to the scan data, selecting new weights and continue until the optimal
instance is found. Knowing that each instance is evaluated using a large number of
vertices, an exhaustive search for the optimal set of m weights is too computationally
expensive.
A common method to solve large combinatorial optimization problems is simulated
annealing (SA) [14]. In our case, random m-dimensional vectors could be generated
which represent different morphs for a current face instance. A morph that brings the
current instance closer to the scan data is accepted (downhill), and otherwise it is ei-
ther accepted (uphill to avoid local minima) or rejected with a certain probability. In
each iteration, the length of the m-dimensional morph vector can be reduced as imple-
mentation of the “temperature” scheme. The problem with such a naive SA approach
is that most random m-dimensional morph vectors are uphill. In particular close to the
optimal solution, a morph vector is often rejected, which makes it hard to produce an
accurate fit. Besides this inefficiency, it doesn’t take the eigensystem of the morphable
face model into account.
Instead, we propose an iterative downhill walk along the consecutive eigenvectors
from a current instance towards the optimal solution. Starting from the mean face S̄
(w_i = 0 for all i = 1, ..., m), try new values for w_1 and keep the best fit, then try new values for w_2
and keep the best fit, and continue until the face is morphed downhill along all m eigen-
vectors. Then iterate this process with a smaller search space for wi . The advantage
in computation costs of this method is twofold. First, the discrete number of morphs
in the selected search space directly defines the number of rejected morphs per itera-
tion. Second, optimizing one wi at a time means only a one (instead of m) dimensional
modification of the current face instance S_new = S_prev + (w_new − w_prev) σ_i s_i.
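A simplified sketch of this coordinate-wise search is given below: each weight is optimized in turn over a discrete candidate set whose range shrinks between iterations. The candidate grid, shrink factor and stopping rule are illustrative choices, and the paper's additional strategies (excluding the previously selected value and cycling vertex subsets) are omitted.

import numpy as np

def fit_weights(cost, m, n_iters=4, w_range=3.0, n_cand=7, shrink=0.5):
    # cost(w) returns the RMS distance between the instance generated with
    # weight vector w and the scan data.
    w = np.zeros(m)                           # start from the mean face
    for _ in range(n_iters):
        for i in range(m):                    # eigenvectors in order: global -> local
            candidates = w[i] + np.linspace(-w_range, w_range, n_cand)
            costs = [cost(np.concatenate([w[:i], [c], w[i + 1:]]))
                     for c in candidates]
            w[i] = candidates[int(np.argmin(costs))]   # keep the downhill best
        w_range *= shrink                     # smaller search space next iteration
    return w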
Because the first eigenvectors induce the fitting of global face properties (e.g. face
height and width) and the last eigenvectors change local face properties (e.g. nose length
and width), each iteration follows a global to local fitting scheme (see Fig. 2). To avoid
local minima, two strategies are applied. (1) The selected wi in one iteration is not
evaluated in the next iteration, forcing a new (similar) path through the m-dimensional
space. (2) The vertices of the morphable face model are uniformly divided over three
Fig. 2. Face morphing along eigenvectors starting from the mean face (center column). Differ-
ent weights for the principal eigenvectors (e.g. i=1,2) change the global face shape. For later
eigenvectors the shape changes locally (e.g. i=50).
sets and in each iteration a different set is modified and evaluated. Only in the first
and last iteration all vertices are evaluated. Notice that this also reduces the number of
vertices to fit and thus the computation costs.
The fitting process starts with the mean face and morphs in place towards the scan
data, which means that the scan data should be well aligned to the mean face. To do so,
the segmented and pose normalized face is placed with its center of mass on the center
of mass of the mean face, and finely aligned using the Iterative Closest Point (ICP)
algorithm [15]. The ICP algorithm iteratively minimizes the RMS distance between
vertices. To further improve the effectiveness of the fitting process, our approach is
applied in a coarse fitting and a fine fitting step.
Component selection. All face instances generated with the morphable model are
assumed to be in correspondence, so a component is simply a subset of vertices in the
mean shape S̄ (or any other instance). We define seven components in our adjusted
morphable face model (see Fig. 3). Starting with the improved alignment, we can in-
dividually fit each of the components to the scan data using the fine fitting scheme,
obtaining a higher precision of the fitting process (as shown in Sect. 6.1). Individual
components for the left and right eyes and cheeks were selected, so that our method ap-
plies to non-symmetric faces as well. The use of multiple components has no influence
on the fitting time, because the total number of vertices remains the same and only the
selected vertices are modified and evaluated.
Component blending. A drawback of fitting each component separately is that in-
consistencies may appear at the borders of the components. During the fine fitting, the
border triangles of two components may start to intersect, move apart, or move across
(Fig. 3). The connectivity of the complete mesh remains the same, so two components
moving apart remain connected with elongated triangles at their borders. We solve these
inconsistencies by means of a post-processing step, as described in more detail below.
Fig. 3. Multiple components (a) may intersect (b1), move apart (b2), or move across (b3).
Simulating a cylindrical scan (c) and smoothing the new border vertices (d) solves these
problems (e).
Knowing that the morphable face model is created from cylindrical range scans and
that the position of the face instance doesn’t change, it is easy to synthetically rescan the
generated face instance. Each triangle of the generated face instance Sfine is assigned
to a component (Fig. 3a). A cylindrical scanner is simulated, obtaining a cylindrical
depth image d(θ, y) with a surface sample for angle θ, height y with radius distance
d from the y-axis through the center of mass of S̄ (Fig. 3c). Basically, each sample is
the intersection point of a horizontal ray with its closest triangle, so we still know to
which component it belongs. The cylindrical depth image is converted to a 3D triangle
mesh by connecting the adjacent samples and projecting the cylindrical coordinates to
3D. This new mesh Sfine has a guaranteed resolution depending on the step sizes of
θ and y, and the sampling solves the problem of intersecting and stretching triangles.
However, ridges may still appear at borders where components moved across. There-
fore, Laplacian smoothing is applied to the border vertices and their neighbors (Fig.
3d). Finally, data further than 110 mm from the tip of the nose is removed to have the
final model Sfinal (Fig. 3e) correspond to the segmented face. In Sect. 6.1, we evaluate
both the single and multiple component fits.
5 Face Recognition
Our model fitting algorithm provides a clean model of a 3D face scan. In this section,
we use this newly created 3D geometry as input for two 3D face matching methods.
One compares facial landmarks and the other compares extracted contour curves.
Landmarks. All vertices of two different instances of the morphable model are as-
sumed to have a one-to-one correspondence. Assuming that facial landmarks such as
the tip of the nose, corners of the eyes, etc. are morphed towards the correct position in
the scan data, we can use them to match two 3D faces. So, we assigned 15 anthropo-
morphic landmarks to the mean face and obtain their new locations by fitting the model
to the scan data. To match two faces A and B we use the sets of c=15 corresponding
landmark locations:
d_{corr}(A, B) = \sum_{i=1}^{c} d_p(a_i, b_i) ,    (3)
Contour curves. Another approach is to fit the model to scans A and B and use the new
clean geometry as input for a more complex 3D face recognition method. To perform
3D face recognition, we extract from each fitted face instance three 3D facial contour
curves, and match only these curves to find similar faces. The three curves were ex-
tracted and matched as described by ter Haar and Veltkamp [13].
In more detail, after pose normalization and the alignment of the face scan to both S̄
and Scoarse , a correct pose of the face scan is assumed and thus a correct pose of the
final face instance Sfinal . Starting from the nose tip landmark pnt , 3D profile curves can
Fig. 4. The similarity of two 3D faces is determined using one-to-one correspondences, with on
the left 15 corresponding landmarks and on the right 135 corresponding contour samples. The
optimal XY-, C-, and G-contour curves (inner to outer) were extracted, for which the G-contour
uses the (colored) geodesic distances. The line shown in black is one of the Np profiles.
6 Results
The results described in this section are based on the UND face scans. For each of the
953 scans we applied our face segmentation method (Sect. 3). Our face segmentation
method correctly normalized the pose of all face scans and adequately extracted the tip
of the nose in each of them. The average distance and standard deviation of the 953
automatically selected nose tips to our manually selected nose tips was 2.3 ±1.2 mm.
Model fitting was applied to the segmented faces, once using only a single compo-
nent and once using multiple components. Both instances are quantitatively evaluated
in Sect. 6.1, and both instances were used for 3D face recognition in Sect. 6.2.
Fig. 5. Fitted face models Sfinal based on a single component (1st and 3rd column) and multiple
components (2nd and 4th column) to scan data in blue. Results from the front and side view,
show a qualitative better fit of the multiple components to the scan data. The last two subjects on
the right were also used in [7].
Table 1. The quantitative evaluation (in mm) of our face fitting method
Table 2. Recognition rates and mean average precisions based on landmarks and contour curves
for single and multiple component fits.
Comparison. Blanz et al. [7] reported the accuracy of their model fitting method
using the average depth error between the cylindrical depth images of the input scan and
the output model. The mean depth error over 300 FRGC v.1 scans was 1.02 mm when
they neglected outliers (distance > 10 mm) and 2.74 mm otherwise. To compare the
accuracy of our method with their accuracy, we produced cylindrical depth images (as in
Fig. 3c) for both the segmented face scan and the fitted model and computed the average
depth error |dscan (θ, y)−dfinal (θ, y)| without and with the outliers. For the fitted single
component these errors davr .depth are 0.656 mm and 0.692 mm, respectively. For the
fitted multiple components these errors are 0.423 mm and 0.444 mm, respectively. So
even our single component fits are more accurate then those of Blanz et al.
Our time to process a raw scan requires ≈3 seconds for the face segmentation, ≈1
second for the coarse fitting, and ≈30 seconds for the fine fitting on a Pentium IV 2.8
GHz. The method of Blanz et al. requires ≈4 minutes on a 3.4 GHz Xeon processor, but includes
texture fitting as well. Huang et al. [9] report for their deformation model a matching
error of 1.2 mm after a processing time of 4.6 minutes.
the ranked lists are reported, to elaborate on the retrieval of all relevant faces, i.e. all
faces from the same subject.
Four 3D face recognition experiments were conducted, namely face recognition
based on landmark locations from the fitted single component and the fitted multiple
components, and based on contour curves from the fitted single component and the
fitted multiple components. Results in Table 2 show that the automatically selected an-
thropomorphic landmarks are not reliable enough for effective 3D face recognition with
85.8% and 85.2% recognition rates (RR). Notice that the landmarks obtained from the
single component fit perform better than those from the multiple component fit. This
is probably caused by three landmarks (outer eye corners and Sellion) lying close to
component boundaries, where the fitting can be less reliable.
The fitted face model is an accurate representation of the 3D scan data. This accuracy
allows the contour based method to achieve high recognition rates (see Table 2). For the
single component fits, the contour matching achieves a RR of 96.3% and for multiple
component fits even 97.5%. For a high recognition rate, only one of the relevant faces
in the dataset is required on top of each ranked list. The reported MAPs show that most
of the other relevant faces are retrieved before the irrelevant ones. Some of the queries
that were not identified have a non-neutral expression (happy, angry, biting lips, etc.)
while their relevant faces have a neutral expression. A face recognition method invariant
to facial expressions, will most likely increase the performance even further.
Comparison. Blanz et al. [7] achieved a 96% RR for 150 queries in a set of 150 faces
(from the FRGC v.1). To determine the similarity of two face instances, they computed
the scalar product of the 1000 obtained model coefficients. Using a set of facial depth
curves, Samir et al. [17] reported a 90.4% RR for 270 queries in a set of 470 UND
scans. Mian et al. [18] reported a 86.4% RR for 277 queries in a set of 277 UND
scans.
7 Concluding Remarks
Where other methods need manual initialization, we presented a fully automatic 3D face
morphing method that produces a fast and accurate fit for the morphable face model to
3D scan data. Based on a global to local fitting scheme the face model is coarsely
fitted to the automatically segmented 3D face scan. After the coarse fitting, the face
model is either finely fitted as a single component or as a set of individual components.
Inconsistencies at the borders are resolved using an easy to implement post-processing
method. Our results show that the use of multiple components produces a tighter fit of
the face model to the face scan, but assigned anthropomorphic landmarks may lose their
reliability for 3D face identification. Face matching using facial contours, shows higher
recognition rates based on the multiple component fits than for the single component
fits. This means that the obtained 3D geometry after fitting multiple components has
a higher accuracy. With a recognition rate of 97.5% for a large dataset of 3D faces,
our model fitting method proves to produce highly accurate fits usable for 3D face
recognition.
Acknowledgements
This research was supported by the FP6 IST Network of Excellence 506766 AIM@-
SHAPE and partially supported by FOCUS-K3D FP7-ICT-2007-214993. The authors
thank the University of South Florida for providing the USF Human ID 3D Database.
References
1. Bowyer, K.W., Chang, K., Flynn, P.: A survey of approaches and challenges in 3D and multi-
modal 3D + 2D face recognition. CVIU 101(1), 1–15 (2006)
2. Scheenstra, A., Ruifrok, A., Veltkamp, R.C.: A Survey of 3D Face Recognition Methods.
In: Kanade, T., Jain, A., Ratha, N.K. (eds.) AVBPA 2005. LNCS, vol. 3546, pp. 891–899.
Springer, Heidelberg (2005)
3. Bronstein, A.M., Bronstein, M.M., Kimmel, R.: Three-dimensional face recognition.
IJCV 64(1), 5–30 (2005)
4. Berretti, S., Del Bimbo, A., Pala, P., Silva Mata, F.: Face Recognition by Matching 2D and
3D Geodesic Distances. In: Sebe, N., Liu, Y., Zhuang, Y.-t., Huang, T.S. (eds.) MCAM 2007.
LNCS, vol. 4577, pp. 444–453. Springer, Heidelberg (2007)
5. Davis, J., Marschner, S.R., Garr, M., Levoy, M.: Filling holes in complex surfaces using
volumetric diffusion. 3DPVT, 428–861 (2002)
6. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. SIGGRAPH, 187–194
(1999)
7. Blanz, V., Scherbaum, K., Seidel, H.P.: Fitting a Morphable Model to 3D Scans of Faces. In:
ICCV, pp. 1–8 (2007)
8. Lu, X., Jain, A.: Deformation Modeling for Robust 3D Face Matching. PAMI 30(8), 1346–
1356 (2008)
9. Huang, X., Paragios, N., Metaxas, D.N.: Shape Registration in Implicit Spaces Using Infor-
mation Theory and Free Form Deformations. PAMI 28(8), 1303–1318 (2006)
10. Kakadiaris, I., Passalis, G., Toderici, G., Murtuza, N., Theoharis, T.: 3D Face Recognition.
In: BMVC, pp. 869–878 (2006)
11. Sarkar, S.: USF HumanID 3D Face Database. University of South Florida
12. Chang, K.I., Bowyer, K.W., Flynn, P.J.: An Evaluation of Multimodal 2D+3D Face Biomet-
rics. PAMI 27(4), 619–624 (2005)
13. ter Haar, F.B., Veltkamp, R.C.: A 3D Face Matching Framework. In: Proc. Shape Modeling
International (SMI 2008), pp. 103–110 (2008)
14. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Sci-
ence 220, 4598, 671–680 (1983)
15. Besl, P.J., McKay, N.D.: A method for registration of 3D shapes. PAMI 14(2), 239–256
(1992)
16. Kimmel, R., Sethian, J.: Computing geodesic paths on manifolds. Proc. of National Academy
of Sciences 95(15), 8431–8435 (1998)
17. Samir, C., Srivastava, A., Daoudi, M.: Three-Dimensional Face Recognition Using Shapes
of Facial Curves. PAMI 28(11), 1858–1863 (2006)
18. Mian, A.S., Bennamoun, M., Owens, R.: Matching Tensors for Pose Invariant Automatic 3D
Face Recognition. IEEE A3DISS (2005)
A Multi-scale Vector Spline Method for
Estimating the Fluids Motion on Satellite
Images
1 Introduction
based on a compactly supported and rapidly decaying radial basis function, thus
adapted to multiscale representation. The solution is obtained by solving a sparse
and well-conditioned linear system. The motion is computed on a pyramidal
representation of images, as the sum of a coarse scale motion and increments from
one scale to the immediately finer one. Results are presented to demonstrate the
effectiveness of the characteristics of the multiscale vector spline: use of control
points, div-curl regularity and multiscale coarse-to-fine motion estimation.
This paper is organized as follows: section 2 recalls the vector spline
theory applied to fluid motion estimation; the proposed multiscale vector spline
is presented in section 3. Results are analyzed in section 4, and conclusions and
prospects for future work are given in section 5.
Vector splines have been initially introduced [11] for the interpolation and ap-
proximation of vector observations. In this context, the vector spline model is
defined from: (1) a set of n control points xi in a spatial domain Ω; (2) a vec-
tor observation wi at each control point. The vector spline is solution of the
following minimization problem:
\left\{ \begin{array}{l} \min_{w} \|w\|_d^2 \\ \text{s.t. } w(x_i) = w_i \;\; \forall i \end{array} \right.
\qquad \text{or:} \qquad
\min_{w} \sum_i \big( w(x_i) - w_i \big)^2 + \lambda \|w\|_d^2    (1)

(Interpolation)                                          (Approximation)
In equation (1), the parameter λ of the approximating spline controls the com-
promise between regularity and confidence in the data; ‖w‖_d denotes the 2nd-order
div-curl semi-norm, which penalizes the gradients of the divergence and of the curl
of w over Ω. It is a semi-norm whose null space is the set of affine vector fields. It has been
proven [11] that this minimization problem admits a unique solution: a thin-plate
spline based on the harmonic radial basis function φ:
\phi(x) = (128\pi)^{-1}\, \|x\|^4 \log \|x\|    (3)
for the interpolation and approximation cases. It has been proven [12] that the
solution of (5) exists and is unique if the observation operators Li are linear and
non zero, and if the control points are non aligned. The solution is a thin-plate
spline, with the same basis function φ as in equation (3):
w = \sum_{i=1}^{n} c_i\, L_i \phi(x - x_i) + \sum_{i=1}^{6} d_i\, p_i(x)    (6)
as possible, but there is, to our knowledge, no criterion for defining an optimal
distribution of control points.
Rather than exactly solving equation (8), which would lead to the thin-plate
spline, the minimum is searched for among a set of spline functions suitable for
the multiscale formalism and satisfying the two following properties. (1) The
spline is defined from a unique bell-shaped radial basis function of unit support.
The choice of this function is not critical as long as it is positive, decreasing
and at least three times continuously differentiable in order to compute the 2nd
order div-curl semi-norm. We make use of the basis function ψ proposed by [14]
and defined as ψ(r) = (1 − r)^6 (35r^2 + 18r + 3) for |r| ≤ 1. (2) The spline is a
linear combination of translates of the basis function over a regular lattice of m
grid points, whose sampling defines the scale parameter h. These translates are
dilated by a factor γ proportional to h. The parameters defining the spline are
the m weights q = (qj ) (each weight qj being homogeneous to a motion vector
with u and v components) applied to the translates of the basis function. The
parametric expression of the vector spline is thus:
\[
w_{q,h}(x) = \sum_{v_j \in \mathbb{Z}^2,\; h v_j \in \Omega} q_j\, \psi\!\left(\frac{\|x - h v_j\|}{\gamma}\right) \tag{9}
\]
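As a concrete illustration of the parametric model (9) (a sketch, not the authors' code), the following evaluates the spline for given weights; the application of ψ to the scaled distance ‖x − h v_j‖/γ and the choice γ = 3h follow the text, while the example weights are placeholders.

```python
import numpy as np


def wendland(r):
    """Basis function of [14]: psi(r) = (1 - r)^6 (35 r^2 + 18 r + 3) for |r| <= 1, else 0."""
    r = np.abs(r)
    return np.where(r <= 1.0, (1.0 - r) ** 6 * (35.0 * r ** 2 + 18.0 * r + 3.0), 0.0)


def eval_vector_spline(points, centers, q, gamma):
    """Evaluate w_{q,h}(x) = sum_j q_j psi(||x - h v_j|| / gamma), cf. (9).

    points  : (P, 2) query locations x
    centers : (m, 2) lattice points h v_j inside Omega
    q       : (m, 2) weights, one motion vector per lattice point
    gamma   : dilation of the unit-support basis function (e.g. 3h)
    """
    # Pairwise distances between query points and lattice centers, scaled by gamma.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) / gamma
    return wendland(d) @ q  # (P, 2) motion vectors


# Example: a coarse lattice of spacing h over a 64x64 domain, placeholder weights.
h = 8.0
vx, vy = np.meshgrid(np.arange(0, 65, h), np.arange(0, 65, h))
centers = np.stack([vx.ravel(), vy.ravel()], axis=1)
q = np.random.randn(len(centers), 2) * 0.1
w = eval_vector_spline(np.array([[32.0, 32.0]]), centers, q, gamma=3 * h)
```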
The link between the scale parameter h(p) and the real spatial scale of the evolv-
ing image structures is not obvious: at one level of the pyramid, the motion is
computed using a scale parameter h(p) corresponding to a basis function of sup-
port γ = 3h(p). The basis function is thus able to represent motion patterns with
spatial size less than 3h(p); but there is no guarantee that all motion patterns of
that size will be represented: this will occur only if enough control points have
been selected in the existing patterns.
4 Results
The first experiment is intended to demonstrate the benefit of enforcing the conservation equation only at control points. For this purpose, the motion is computed using the multiscale vector spline and compared to the result of Corpetti’s method [15].
Both methods minimize the second order div-curl regularity constraint, make use
of either luminance or mass conservation and are solved in a multiscale scheme.
The two methods differ in the data confidence term of the minimized energy
(computed on control points selected by double thresholding for the multiscale
spline, on the whole image domain for Corpetti’s method) and in the numerical
minimization scheme (multiscale vector spline vs variational minimization). Two
comparisons are displayed. First, the motion is computed using the luminance conservation equation on the synthetic ’OPA’ sequence (on the left in figure 1), obtained by numerical simulation with the OPA ocean circulation model¹.
Fig. 2. Motion fields estimated on the OPA sequence using luminance conservation.
Left to right: reference motion, multiscale spline, Corpetti and Mémin. Top to bottom:
motion field, streamlines, vorticity.
¹ Thanks to Marina Levy, LOCEAN, IPSL, France.
Fig. 3. Motion fields estimated on the Meteosat sequence using mass conservation. Left:
multiscale spline, right: Corpetti and Mémin. Top to bottom: motion field, streamlines.
The OPA sequence consists of simulated images of sea surface temperature, used for computing motion. Additionally, the corresponding surface currents are available and are used as the reference field for validation purposes. The results are displayed in figure 2. The mean angular error between the estimated and reference motion fields is 28 degrees for the multiscale spline and 42 degrees for Corpetti’s method. Qualitative inspection of the motion field’s streamlines and vorticity suggests that the motion of vortices is better assessed by the multiscale spline.
A similar comparison on a Meteosat-5 sequence² acquired in the water vapor band is displayed in figure 3. The mass conservation equation is used, as the 2D atmospheric flow can be considered compressible to accommodate the effects of vertical motion. For this sequence, only a qualitative assessment of the results is possible. The multiscale spline is more accurate with respect to the location of the central vortex. It furthermore succeeds in capturing a rotating motion in the lower left part of the image, whereas Corpetti’s method incorrectly computes a smooth laminar field.
The second comparison is intended to demonstrate that the 2nd order div-curl
regularity must be preferred to L2 regularity for fluid motion assessment. The lu-
minance conservation equation is considered and the motion is computed on the
OPA sequence by the multiscale spline and the Horn and Schunck method [2].
The results are displayed on figure 4. Three different results are presented corre-
sponding to different values of the λ coefficient assigned to the regularity compo-
nent, so that both methods are tested with low, medium and high regularization.
The angular errors for the multiscale spline are 30, 29 and 28 degrees (respectively for low, medium and high regularity); for the Horn and Schunck method they are 43, 47 and 49 degrees. The spline method is much more effective as far as the detected location of eddies is concerned: only one vortex is detected by the Horn and Schunck method with low regularity, and none with medium or high regularity. This is a consequence of the L2 regularization, which favours laminar fields.
² Copyright Eumetsat.
Figure 5 displays the motion fields estimated on the OPA sequence at three
different scales. At the coarsest scale, the main vortices appear in the upper part
of the image, and the large vortex in the bottom part is not detected at all. At
the intermediate scale, more vortices appear. At finest resolution the location
of vortices is improved and the large vortex in the bottom part of the image is even detected. This illustrates that the multiscale scheme actually links the size of the spatial structure with the spatial scale of the spline, although this link is not easy to interpret.

Fig. 5. Motion field estimated on the OPA sequence, from the coarsest to the finest (full) resolution
References
1. Korotaev, G., Huot, E., Le Dimet, F.X., Herlin, I., Stanichny, S., Solovyev, D.,
Wu, L.: Retrieving Ocean Surface Current by 4D Variational Assimilation of Sea
Surface Temperature Images. Remote Sensing of Environment (2007) (Special Issue
on Data Assimilation)
2. Horn, B., Schunck, B.: Determining optical flow. AI 17(1-3), 185–203 (1981)
3. Béréziat, D., Herlin, I., Younes, L.: A generalized optical flow constraint and its
physical interpretation. In: CVPR 2000, pp. 487–492 (2000)
4. Wildes, R., Amabile, M.: Physically based fluid flow recovery from image sequences.
In: CVPR 1997, Puerto Rico, pp. 969–975 (June 1997)
5. Gupta, S., Prince, J.: Stochastic models for div-curl optical flow methods. IEEE Signal Processing Letters 3(2) (1996)
6. Ruhnau, P., Schnoerr, C.: Optical Stokes Flow Estimation: An Imaging-based Con-
trol Approach. Experiments in Fluids 42, 61–78 (2007)
7. Anandan, P.: A computational framework and an algorithm for the measurement
of visual motion. International Journal of Computer Vision 2, 283–310 (1989)
8. Bergen, J.R., Anandan, P., Hanna, K.J., Hingorani, R.: Hierarchical model-based
motion estimation. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 237–252.
Springer, Heidelberg (1992)
9. Enkelmann, W.: Investigation of multigrid algorithms for the estimation of optical
flow fields in image sequences. Computer Vision Graphics and Image Process-
ing 43(2), 150–177 (1988)
10. Moulin, P., Krishnamurthy, R., Woods, J.: Multiscale modeling and estimation of
motion fields for video coding (1997)
11. Amodei, L.: A vector spline approximation. Journal of Approximation Theory 67, 51–79 (1991)
12. Suter, D.: Motion estimation and vector splines. In: CVPR 1994 (1994)
13. Isambert, T., Herlin, I., Berroir, J., Huot, E.: Apparent motion estimation for
turbulent flows with vector spline interpolation. In: XVII IMACS, Scientific Com-
putation Applied Mathematics and Simulation, Paris, July 11-15 (2005)
14. Wendland, H.: Piecewise polynomial, positive definite and compactly supported ra-
dial basis functions of minimal degree. Advances in Computational Mathematics 4,
389–396 (1995)
15. Corpetti, T., Mémin, E., Pérez, P.: Dense estimation of fluid flows. PAMI 24(3), 365–380 (2002)
Continuous Energy Minimization
Via Repeated Binary Fusion
1 Introduction
Several fundamental problems in computer vision can be classified as inverse, ill-
posed problems, where a direct solution is not possible (e.g. deblurring, stereo,
optical flow). In such cases, a prior model of the forward process can help to infer
physically meaningful solutions via a maximum a posteriori (MAP) estimation.
Such MAP formulations naturally lead to energy minimization problems [1],
where an energy term Eprior , representing the prior model, penalizes unlikely
solutions and a data consistency term Edata enforces a close fit to the observed
data:
\[
\min_{u}\; \big\{ E_{\mathrm{prior}}(u) + \lambda\, E_{\mathrm{data}}(u) \big\}. \tag{1}
\]
Since we are dealing with spatially (and radiometrically) discrete images, at some
point any optimization approach for (1) has to take the spatial discretization into
account – there are two predominant strategies to do that. One currently very
popular approach is to state the problem as a discrete, combinatorial optimiza-
tion problem on a Markov Random Field (MRF). Since MRFs are a powerful tool
for solving most low level vision tasks, a considerable research effort has been
dedicated to exploring minimization methods for MRF energies (cf. [2] for a com-
parison of state-of-the-art algorithms). Generally, the optimization approaches are either based on message passing (e.g. loopy belief propagation by Pearl [3] and sequential tree-reweighted message passing by Kolmogorov [4]) or on graph cuts (α-β-swap and α-expansion, introduced by Boykov et al. [5], and the more recent “range moves” by Veksler [6] and LogCut by Lempitsky et al. [7]). Recently, Komodakis et al. proposed a fast optimization approach, based on the duality theory of Linear Programming [8]. The second optimization strategy for (1) uses the tools of the calculus of variations in a continuous context. Once an optimality condition for the energy is derived, the differential operators are discretized and a numerical scheme is used to minimize the energy. In contrast to the aforementioned discrete MRF approach, in the variational approach the discretization is postponed as long as possible.

⋆ This work was supported by the Austrian Science Fund under grant P18110-B15, the Austrian Research Promotion Agency within the VM-GPU project (no. 813396), and the Hausdorff Center for Mathematics.
This work will focus on limitations of the local optimization approaches used
in the variational context. In order to circumvent these limitations, we will in-
troduce a novel optimization strategy inspired by the discrete α-expansion and
LogCut algorithms. In contrast to local methods, such an optimization strategy
allows large moves and therefore is less likely to get stuck in bad local minima.
Unlike combinatorial optimization approaches, the solution space does not have
to be discretized and the algorithm does not induce systematic metrication er-
rors. The proposed variational technique also facilitates high-performance imple-
mentations on massively parallel GPUs and permits an extension to higher-order
“interactions” at low costs in time and memory.
After proposing the novel optimization strategy in Section 2, we evaluate this
technique for optical flow estimation in Section 3. Experiments in Section 4
illustrate state-of-the-art results.
For each subproblem, the global optimum can efficiently be computed using a
max-flow/min-cut algorithm [5].
Inspired by [9], where Chambolle shows close links between the Total Variation
(TV) model and binary MRFs in the context of image denoising, and [10], where
Nikolova et al. show how to find globally optimal solutions for certain nonconvex
optimization problems by restating them as convex problems, we will pose the
minimization problem (2) as a sequence of binary subproblems. Each of these
subproblems can conceptually be understood as a continuous version of an α-
expansion move, i.e. the current solution is changed to a proposed alternative
solution, wherever this is energetically favorable. Repeatedly solving this binary
problem for varying proposed solutions, i.e. performing cycles, as it is called in
the α-expansion terminology, yields increasingly accurate results.
In order to formulate (2) as a binary problem, let α, β : Ω → IRm be two
arbitrary but fixed candidate solutions we will subsequently refer to as “proposed
solutions” or just as “proposals.” Moreover, let φ : Ω → {0, 1} be a binary
function to selectively combine α and β into a new solution
\[
u(x) = \big(1 - \phi(x)\big)\,\alpha(x) + \phi(x)\,\beta(x). \tag{3}
\]
The function φ is free to vary across Ω, as long as the fused solution u fulfills the
regularity requirements posed by the considered energy. Plugging the combined
solution (3) into the model (2) yields
\[
\min_{\phi \in \mathcal{F}} \left\{ \int_\Omega \Psi\!\left(Du(\phi(x),x),\, D^2 u(\phi(x),x),\, \ldots\right) dx
\;+\; \lambda \int_\Omega \Big[\big(1-\phi(x)\big)\,\rho\big(\alpha(x),x\big) + \phi(x)\,\rho\big(\beta(x),x\big)\Big]\, dx \right\} \tag{4}
\]
with $u(x) = (u_1(x), u_2(x))^{T}$ and $\rho(u(x), x) = |I_1(x + u(x)) - I_0(x)|$, where
I0 and I1 are the two input images. Since the data fidelity term ρ is nonlinear
in u, a local linearization is required at some point in the solution procedure,
which limits the approach to recovering small displacements. To circumvent this
restriction, the estimation procedure is typically performed on a scale pyramid
of the input images, cf. [14] for an in-depth discussion. Yet, such a scale pyramid
strategy often fails to recover the flow for small holes in foreground objects
(which allow to observe a distant background), or for fine structures in front of a
background which is moving due to ego motion. This effect can be seen around
the moving leafs in the “Schefflera” sequence of the Middlebury optical flow
evaluation dataset [17]. Figure 1(a) shows the input image 0 of the “Schefflera”
sequence; the corresponding ground truth flow can be seen as color-coded image1
in Fig. 1(b). Figure 1(c) shows a color-coded flow field, estimated using a scale
pyramid-based implementation of the TV-L1 optical flow model (6).
Before applying the new solution strategy, in a short technical detour, we
follow the approach of Aujol et al. [18] and introduce an auxiliary displacement
field v, yielding a strictly convex approximation of (6):
\[
\min_{u,v} \left\{ \sum_{d=1}^{2} \int_\Omega |\nabla u_d|\, dx \;+\; \frac{1}{2\theta} \sum_{d=1}^{2} \int_\Omega (u_d - v_d)^2\, dx \;+\; \lambda \int_\Omega \rho(v, x)\, dx \right\}, \tag{7}
\]
¹ The hue encodes the direction of the flow vector, while the saturation encodes its magnitude. Regions of unknown flow (e.g. due to occlusions) are colored black.
Fig. 1. (a) input image 0 of the Middlebury “Schefflera” sequence; (b) color-coded
ground truth flow (hue = direction, intensity = magnitude, black = unknown); color-
coded flows, estimated using: (c) a continuous TV-L1 flow model; (d) the proposed
optimization strategy (AAE = 2.91◦ )
The subproblem (9) is the well-understood image denoising model of Rudin,
Osher, and Fatemi [19]. For this model, Chambolle proposed an efficient and
globally convergent numerical scheme, based on a dual formulation [20]. In prac-
tice, a gradient descent/reprojection variant of this scheme performs better [9],
although there is no proof for convergence. In order to make this paper self-
contained, we reproduce the relevant results from [20,9]:
Proposition 1. The solution of (9) is given by
Wherever α = β, ρ(α, x) = ρ(β, x), hence φ can be chosen arbitrarily in [0, 1]. Everywhere else, we can divide by $(\beta - \alpha)^{T}(\alpha - \beta)$, yielding (13).
Please note that the data residuals ρ(α, x) and ρ(β, x) are just constants. They have to be calculated only once per fusion step, and the sole requirement is that their range is $\mathbb{R}^{+}_{0}$, i.e. almost any cost function can be used. Once the relaxed
version of problem (8) is solved, a final thresholding of φ is required to obtain the
binary fusion of the two flow proposals α and β. Since the continuous solution of
φ is already close to binary, the threshold μ is not critical. Our heuristic solution
is to evaluate the energy of the original TV-L1 energy (8) for α, β, and a few
different thresholds μ ∈ (0, 1). Finally, we select the threshold yielding the flow
field with the lowest energy.
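The fusion and threshold-selection heuristic can be sketched as follows; this is an illustrative reading of the procedure rather than the authors' code, and `flow_energy` is an assumed callable that evaluates the TV-L1 energy (6)/(8) of a flow field.

```python
import numpy as np


def fuse(alpha, beta, phi, mu):
    """Binary fusion of two flow proposals: keep alpha where phi <= mu, beta elsewhere."""
    mask = (phi > mu)[..., None]           # threshold the relaxed indicator function
    return np.where(mask, beta, alpha)     # (H, W, 2) fused flow field


def select_fusion(alpha, beta, phi, flow_energy, thresholds=(0.25, 0.5, 0.75)):
    """Return the flow with the lowest energy among alpha, beta, and a few
    thresholded fusions (the heuristic described in the text)."""
    candidates = [alpha, beta] + [fuse(alpha, beta, phi, mu) for mu in thresholds]
    energies = [flow_energy(u) for u in candidates]
    return candidates[int(np.argmin(energies))]
```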
affinity. Since spatial second-order derivatives are not orthogonal and the local
information of orientation and shape are entangled, a decorrelation is necessary.
In [22], Danielsson et al. used circular harmonic functions to map the second-
order derivative operators into an orthogonal space. In two spatial dimensions,
the decorrelated operator is given by
\[
\diamondsuit = \sqrt{\tfrac{1}{3}} \left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2},\;\; \sqrt{2}\left(\frac{\partial^2}{\partial x^2} - \frac{\partial^2}{\partial y^2}\right),\;\; \sqrt{8}\,\frac{\partial^2}{\partial x\,\partial y} \right)^{T}. \tag{17}
\]
The Euclidean norm $\|\diamondsuit u\|_2$ measures the local deviation of a function u from being affine. Adapting the TV-
L1 flow model (6) to the new prior is a matter of replacing the TV regularization
in (6–8) with the Euclidean norm of the new operator (18). Instead of minimizing
the ROF energy (9), step 1 of the alternate optimization of u and φ now amounts
to solving
\[
\min_{u_d} \left\{ \int_\Omega \|\diamondsuit u_d\|_2\, dx \;+\; \frac{1}{2\theta} \int_\Omega \big(u_d - [(1-\phi)\,\alpha_d + \phi\,\beta_d]\big)^2\, dx \right\} \tag{19}
\]
where k is the iteration number, $p_d^0 = 0$, and $\tau \leq 3/112$. For a proof and further details please refer to [21].
Moreover, we employ the following standard finite differences approximation of
the ♦ operator:
\[
(\diamondsuit u)_{i,j} =
\begin{pmatrix}
\sqrt{\tfrac{1}{3}}\,\big(u_{i,j-1} + u_{i,j+1} + u_{i-1,j} + u_{i+1,j} - 4u_{i,j}\big) \\[4pt]
\sqrt{\tfrac{2}{3}}\,\big(u_{i-1,j} + u_{i+1,j} - u_{i,j-1} - u_{i,j+1}\big) \\[4pt]
\sqrt{\tfrac{8}{3}}\,\big(u_{i,j} + u_{i+1,j+1} - u_{i,j+1} - u_{i+1,j}\big)
\end{pmatrix}, \tag{22}
\]
where (i, j) denote the indices of the discrete image domain, enforcing Dirichlet
boundary conditions on ∂Ω. For details on the discretization of ♦ · p, please
consult [21].
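A direct transcription of the stencil (22) is sketched below, using the coefficients as reconstructed above (they should be checked against [21]); out-of-domain pixels are taken as zero, mimicking the Dirichlet boundary condition.

```python
import numpy as np


def diamond(u):
    """Apply the discrete second-order operator of (22) to a 2-D array u.

    Returns an array of shape (3, H, W): the three decorrelated second-derivative
    channels. Out-of-domain samples are zero (Dirichlet boundary condition).
    """
    p = np.pad(u, 1)                          # zero padding realizes u = 0 outside the domain
    c = p[1:-1, 1:-1]                         # u_{i,j}
    up, down = p[:-2, 1:-1], p[2:, 1:-1]      # u_{i-1,j}, u_{i+1,j}
    left, right = p[1:-1, :-2], p[1:-1, 2:]   # u_{i,j-1}, u_{i,j+1}
    diag = p[2:, 2:]                          # u_{i+1,j+1}
    lap = np.sqrt(1.0 / 3.0) * (left + right + up + down - 4.0 * c)
    dxx_minus_dyy = np.sqrt(2.0 / 3.0) * (up + down - left - right)
    dxy = np.sqrt(8.0 / 3.0) * (c + diag - right - down)
    return np.stack([lap, dxx_minus_dyy, dxy])
```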
In the continuous setting, such an extension requires minor adaptions of the
solution procedure and incurs only a small increase of time and memory require-
ments. Most combinatorial optimization approaches, however, are limited to unary
and pairwise clique potentials; hence such a second-order prior cannot be used.
Extending combinatorial optimization algorithms to higher-order cliques (e.g. as
proposed in [5,23]) is either expensive in time and space or imposes restrictions on
the potentials, e.g. [23] restricts the potentials to the Potts model.
4 Experiments
In this section, we first restrict the optical flow model to a single dimension
(rectified stereo) in order to analyze its behavior in a simplified setting. In Section
4.2 we will use image sets from the Middlebury optical flow database [17] to
illustrate that the proposed algorithm yields state-of-the-art flow estimates.
Most of the algorithm has been implemented in C++, with the exception of the
numerical schemes of the solvers, which have been implemented using CUDA 1.0.
All subsequent experiments have been performed on an Intel Core 2 Quad CPU
at 2.66 GHz (the host code is single-threaded, so only one core was used) with
an NVidia GeForce 8800 GTX graphics card, running a 32 bit Linux operating
system and recent NVidia display drivers. Unless noted otherwise, in all our
experiments the parameters were set to λ = 50 and θ = 0.1.
Fig. 2. (a) im2 of the Middlebury “Teddy” stereo pair; (b) the corresponding ground
truth disparity map; (c) a mask for the pixels, which are also visible in im6
Moreover, this (rectified) stereo setting simplifies discussing the effects caused by
the relaxation and the seemingly asymmetric formulation.
All experiments in this section use the grayscale version of the “Teddy” stereo
pair [25] and a set of constant disparity proposals in the range 0 to 59 pixels in
0.5 pixel increments. Figure 2(a) shows im2 of the stereo pair, Fig. 2(b) the corre-
sponding ground truth disparity, and Fig. 2(c) is the mask of non-occluded regions.
Relaxation and Thresholding. For every fusion step (4), the binary function φ :
Ω → {0, 1} has to be optimized. Since this is a non-convex problem, we proposed
to relax φ to the range [0, 1], solve the continuous problem, and finally threshold
φ. Obviously, this only leads to reasonable fusions of α and β, if the optimal φ is
close to binary. Figure 3 shows how a 64-bin histogram of the relaxed function φ evolves during a typical optimization procedure. Since φ is initialized with 0, in the beginning proposal α is chosen over β in the whole image. However, the “traces” in Fig. 3 illustrate that in several image regions the value of φ flips to 1, i.e. in these regions proposal β is energetically favored. Once the algorithm has converged, the histogram of φ is close to binary, just as expected.
Fig. 3. The evolution of a 64-bin histogram of φ during the solution procedure. Starting
at φ(x) = 0, i.e. with proposal α, several image regions flip to the alternative proposal
β. It is clearly apparent that the converged histogram is close to binary. Please note
that a logarithmic scale is used for the “iteration” axis.
Fig. 4. Two intermediate disparity maps, before (a) and after (b) a binary fusion with
the proposal β(x) = 20. (c) shows the corresponding continuous optimum of φ.
regions that were switched to proposal β during this fusion step, Fig. 4(c) shows
the continuous optimum of φ (before thresholding). Please note that φ is mostly
binary, except for regions where neither α nor β are close to the true disparity.
Repeating the experiment with α and β switched leads to visually indistinguish-
able results and an energy of 284727, i.e. the order of the proposals does not
matter in the binary fusion step.
Fig. 5. Decrease of the energy of the flow field with every successful fusion. Results
of randomized runs are shown as thin, colored lines; the thick, black line shows the
progress for a sequential run. The dashed vertical lines delimit fusion cycles, the thick
horizontal line marks the global optimum for the TV-L1 flow model (6).
Fig. 6. First row: (a), (b) disparity maps, estimated by repeatedly fusing constant
proposals using the proposed optimization strategy; (c) global optimum for the TV-
L1 model. Second row: average end-point error results for the Middlebury benchmark
dataset; the proposed method, labeled “CBF”, was ranked 2nd at the time of sub-
mission. Third row: (e–g) show color-coded flow fields for the sequences “Schefflera”,
“Grove”, and “Yosemite”. Last row: (h) shows a color-coded flow field for the Mid-
dlebury “RubberWhale” sequence, estimated using the second-order prior (AAE =
3.14◦ ); (i) and (j) show the color-coded flow fields for a TV-regularized fusion of TV and
second-order prior flows for the “RubberWhale” (AAE = 2.87◦ ) and the “Dimetrodon”
(AAE = 3.24◦ ) sequences.
5 Conclusion
The presented optimization strategy permits large optimization moves in a vari-
ational context, by restating the minimization problem as a sequence of binary
subproblems. After verifying that the introduced approximations are reasonable,
we showed that typical solutions for a stereo problem are within a few percent of
the global optimum (in energy as well as in the true error measure). Finally, we
showed that applying this optimization strategy to optical flow estimation yields
state-of-the-art results on the challenging Middlebury optical flow dataset.
References
1. Mumford, D.: Bayesian rationale for energy functionals. In: Geometry-driven dif-
fusion in Computer Vision, pp. 141–153. Kluwer Academic Publishers, Dordrecht
(1994)
2. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A.,
Tappen, M., Rother, C.: A comparative study of energy minimization methods for
Markov random fields with smoothness-based priors. IEEE Trans. Pattern Anal.
Mach. Intell. 30(6), 1068–1080 (2008)
3. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Francisco (1988)
Unified Crowd Segmentation

1 Introduction
The segmentation of crowds into individuals continues to be a challenging re-
search problem in computer vision [1, 2, 3, 4, 5]. The automation of video surveil-
lance systems in public venues such as airports, mass-transit stations and sports
stadiums requires the ability to detect and track individuals through complex
sites. We identify three challenges that make this problem particularly difficult:
(i) Partial occlusion. In many crowded scenes people can be partially occluded
by others. Monolithic detectors [2, 6, 7] that model the shape and appearance of
an entire person typically fail in such situations and hence cannot reliably detect
people in crowded environments. (ii) Dynamic backgrounds. When cameras are
fixed, statistical background models are commonly used to identify foreground
regions [8]. However, this approach fails when the background is dynamic. Fur-
ther, background modeling is not applicable for moving cameras, such as those
mounted on pan-tilt devices or mobile platforms. (iii) Foreground clutter. The
presence of moving non-person objects such as luggage carts, shopping trolleys
and cleaning equipment can clutter the foreground of the scene. A robust crowd
segmentation algorithm should be immune to foreground clutter without having
to explicitly model the appearance of every non-person object.
This paper presents a unified approach to crowd segmentation that effectively
addresses these three challenges. The proposed system combines bottom-up and
top-down approaches in a unified framework to create a robust crowd segmen-
tation algorithm. We first review a number of relevant approaches.
Low level feature grouping has been used to segment crowds [5, 9]. These ap-
proaches take advantage of the fact that the motion field for an individual is
relatively uniform and hence tracked corners with common trajectories can be
grouped together to form individuals. However, difficulties arise when multiple
individuals have similar trajectories. Monolithic classifiers capture the shape
and appearance space for the whole body using relatively simple learning meth-
ods [10,6,7]. The direct application of these classifiers to non-crowded scenes generates reasonable segmentations; however, failure modes can occur when partial occlusions are encountered. Part-based constellation models [11, 12, 13] construct
boosted classifiers for specific body parts such as the head, the torso and the
legs, and each positive detection generates a Hough-like vote in a parametrized
person space. The detection of local maxima in this space constitutes a segmen-
tation. A similar approach [2] uses interest operators to nominate image patches
which are mapped to a learned code book. A drawback of these approaches is
that the identification of local maxima in the Hough space can be problematic
in crowded and cluttered environments; a global approach is required.
The previous approaches can be considered to be bottom-up methods where
local context is used. On the other hand, global approaches that rely on back-
ground segmentation have been proposed in [14, 4]. In [14], Markov Chain Monte
Carlo (MCMC) algorithms are used to nominate various crowd configurations
which are then compared with foreground silhouette images. However, this form
of random search can be computationally expensive. To address this issue an
Expectation Maximization (EM) based approach has been developed [4]. In this
framework, a hypothesis nomination scheme generates a set of possible person
locations. Image features are then extracted from foreground silhouettes and a
global search for the optimal assignment of features to hypotheses is performed.
The set of hypotheses that receive a significant number of assignments constitute
the final segmentation. Reliance on accurate foreground/background segmenta-
tion is a weakness of both of these approaches.
model for every possible patch location, we show that a partial response from a
monolithic whole body classifier operating solely on a given patch can discrimi-
nate between valid and invalid patch assignments. The framework also allows for
the inclusion of grouping terms based on low level image cues so that concepts
such as uniform motion and intra-garment color constancy can be leveraged.
During the E-step we estimate a globally optimal assignment of patches to per-
son hypotheses. The M-step ensures that globally consistent patch assignments
are chosen. This can be viewed as a form of occlusion reasoning.
2 Segmentation
This section provides a detailed overview of the proposed crowd segmentation
algorithm. Figure 1 depicts the various stages used to generate the final seg-
mentation of a crowded scene. We assume that the head and shoulders of all
detectable individuals can be observed. Hence, an initial set of hypothesized
person locations are nominated using a head and shoulders detector (see section
3 for details). These K nominated hypotheses are denoted by C := {ci }. The
parameters of this head and shoulders detector are chosen to minimize missed
detections; hence many false detections are also generated (see Figure 1a). The
scene is partitioned into a set of N rectangular patches Z = {zi }, as shown in
Figure 1b. The segmentation of the scene into individuals is achieved by a glob-
ally optimal assignment of these image patches to the initial hypotheses. The
potential assignment of an image patch to a person hypothesis is evaluated using
both direct affinity and pairwise affinity terms, as described below.
Let gk (zi ) denote the affinity associated with the direct assignment of patch
zi to hypothesis ck . One of the main thrusts of this paper is a novel method for
computing this affinity function based on local shape and appearance informa-
tion - this will be the topic of section 3. Figure 1c illustrates this step for the
patch shown in green. The width and color of the arrow connecting the patch
to a hypothesis indicates the strength of the affinity. Using camera calibration
information and a ground plane assumption, certain direct assignments can be
ruled out based on geometric reasoning (shown with black arrows).
Let gk (zi , zj ) denote the affinity associated with pairwise assignment of patch
zi and zj to hypothesis ck . In this application, pairwise assignment affinity is
computed based on the fact that a given individual’s clothing often exhibits
a certain amount of color and motion constancy. Hence, affinity is based on
a similarity measure sim(zi , zj ) of low-level image cues such as motion fields
and color distributions. In this paper we use the Bhattacharyya distance measure
between the color histograms associated with each patch. Given such a measure
of similarity, we define
Figure 1d shows two pairwise assignments. The pair of pink patches have a large
degree of pairwise affinity while the pair of blue patches exhibit relatively small
pairwise affinity.
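As an illustration of the pairwise cue, the similarity of two patch color histograms can be computed from the Bhattacharyya coefficient as sketched below. The exact mapping from sim(z_i, z_j) to the affinity g_k(z_i, z_j) in Equation 1 is not reproduced here, and the exponential mapping and bin count are assumptions.

```python
import numpy as np


def color_histogram(patch_rgb, bins=8):
    """Normalized joint RGB histogram of an (h, w, 3) uint8 patch."""
    hist, _ = np.histogramdd(patch_rgb.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1.0)


def bhattacharyya_distance(h1, h2, eps=1e-12):
    """Bhattacharyya distance between two normalized histograms."""
    bc = np.sum(np.sqrt(h1 * h2))              # Bhattacharyya coefficient
    return -np.log(max(bc, eps))


def pairwise_similarity(patch_i, patch_j, scale=1.0):
    """sim(z_i, z_j): high when the two color distributions match.
    The exponential mapping is an assumed, illustrative choice."""
    d = bhattacharyya_distance(color_histogram(patch_i), color_histogram(patch_j))
    return np.exp(-d / scale)
```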
Fig. 1. This figure depicts the different steps of the proposed algorithm. a) An initial set
of person hypotheses, b) a partitioning of the scene into a grid of patches, c) an example
of the direct association affinity between the green patch and all the hypotheses where
the width of the arrow is commensurate with the assignment affinity, d) shows two
patches with strong pairwise affinity (pink) and two patches with weak pairwise affinity
(blue), e) depicts the soft assign process where patches are assigned to hypotheses, f)
shows the assignment of patches to hypotheses after the first E-step, g) shows the
result of the M-step consistency analysis where red patch assignments are deemed to
be inconsistent based on occlusion reasoning, h) the final segmentation after multiple
iterations of the EM algorithm.
\[
L(V\,|\,Z; X) \;\propto\; \gamma_1 \sum_{k=1}^{K}\sum_{i=1}^{N} x_{ik}\, g_k(z_i)\, \delta_{c_k}(v_i)
\;+\; \gamma_2 \sum_{k=1}^{K}\sum_{\substack{i,j=1\\ i\neq j}}^{N} x_{ik}\, x_{jk}\, g_k(z_i, z_j)\, \delta_{c_k}(v_i)\,\delta_{c_k}(v_j), \tag{2}
\]
where $\delta_{c_k}(v_i)$ is an indicator function which is one when $v_i = k$ and zero otherwise, and $x_{ik}$ is a consistency parameter that is computed during the M-step (see section 2.1). During the E-step, the consistency parameters are fixed and a distribution for V is computed such that the expectation $\sum_V p(V)\,L(V\,|\,Z;X)$ is maximized. It was shown in [4] that a mechanism similar to soft-assign [15] can be used to efficiently perform the E-step search. Figure 1e illustrates this iterative process, where the distribution of V is parametrized by a matrix of direct assignment probabilities. The element in the i-th row and k-th column of this matrix is the probability of assigning the i-th patch to the k-th hypothesis. The sum along each row must be equal to 1 and there can be no negative values. At the first iteration all matrix values are set uniformly. During each
values. At the first iteration all matrix values are set uniformly. During each
iteration of the soft assign process, the matrix probabilities are updated based
on the gradient of the expectation function. After a number of iterations, the
assignment probabilities are forced to take on binary values and this defines an
estimate of the most likely value of V . If a hypothesis receives no patches then
it is deemed to be a spurious detection. A null hypothesis is created to allow for
the potential assignment of patches to the background and clutter in the scene.
In this application the direct assignment affinity between a patch and the null
hypothesis is set to a nominal value. The reader is directed to [4] for more detail
regarding the E-step process.
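To make the E-step concrete, the following sketch shows one plausible soft-assign update of the row-stochastic assignment matrix; it follows the description above (uniform initialization, gradient-driven updates, progressive hardening) but is not the exact scheme of [4,15], and the annealing schedule and gradient form are assumptions.

```python
import numpy as np


def soft_assign(g_direct, g_pair, x_consistency, gamma1=1.0, gamma2=1.0,
                beta_schedule=(1, 2, 4, 8, 16, 32)):
    """One plausible soft-assign E-step (a sketch, not the exact scheme of [4]).

    g_direct      : (N, K+1) direct affinities g_k(z_i); last column = null hypothesis
    g_pair        : (N, N, K+1) pairwise affinities g_k(z_i, z_j)
    x_consistency : (N, K+1) consistency parameters x_ik from the M-step
    Returns a row-stochastic (N, K+1) matrix of assignment probabilities.
    """
    N, K1 = g_direct.shape
    P = np.full((N, K1), 1.0 / K1)                 # uniform initialization
    for beta in beta_schedule:                     # slowly harden the assignments
        for _ in range(10):
            # Gradient of the expected likelihood w.r.t. the assignment probabilities.
            grad = gamma1 * x_consistency * g_direct
            grad = grad + 2.0 * gamma2 * x_consistency * \
                np.einsum('ijk,jk->ik', g_pair, x_consistency * P)
            # Softmax-style update, then re-normalize each row.
            P = np.exp(beta * (grad - grad.max(axis=1, keepdims=True)))
            P /= P.sum(axis=1, keepdims=True)
    return P
```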
Given the current estimate of V all the patches that are currently assigned to a
given hypothesis ck can be identified. For each patch zi that is assigned to ck , a
path between it and the head location specified by ck can be constructed such
that the number of patches encountered on the path that are not assigned to ck
is minimal. This process takes advantage of the inherent grid like structure of
the patches and can be computed efficiently using dynamic programming. The
value of xik is set to 1 unless the minimum cost path has a cost that is greater
than a threshold, in which case xik is set to a low value. Prior to the first E-step,
all the values of X are set to 1. Using this simple process, inconsistencies, such as the legs being visible while the torso is not, can be identified and addressed before the next iteration of the E-step. Figure 1g shows the result of an M-step analysis
where consistent patch assignments are shown in white and the inconsistent
patch assignments are shown in red. By reducing the value of the consistency
parameters for the red assignments, their inclusion in subsequent estimates of V
will be inhibited.
The EM algorithm operates by iterating between the E-step and the M-step
operations. The process terminates when the estimates of V have converged.
Figure 1h shows the final segmentation for this example. By employing a global
optimization scheme, the system need not rely solely on local information for
making segmentation decisions, which is not the case for many greedy approaches
to crowd segmentation. In the next section, the critical question of how to com-
pute the affinity of direct patch to hypothesis assignments will be addressed.
3 Classification
In the previous section a detailed description of the overall crowd segmentation
process was given. The focus of this section is to describe how the direct patch
to hypothesis affinity function gk (z) can be computed based on local shape and
appearance information. For this purpose we use a whole body monolithic person
classifier consisting of a set of weak classifiers selected by boosting. We will show
that for certain types of weak classifiers, the whole body classifier response can be
computed for a specific patch and that this response can be used to characterize
the patch to hypothesis affinity. The section begins with a discussion of the
basic whole body classifier followed by details regarding the generation of patch
specific partial responses.
where $s_i$ is the i-th training sample, $l_i$ is its label and $p_i$ is the probability associated with sample $s_i$. The sample probability distribution is modified in an iterative fashion so that misclassified samples receive more weight.
The basic idea for generating patch-specific responses is that each weak classifier will only collect statistics over the intersection of R(s) and the patch z. Since average statistics are used, the thresholds learned during boosting remain valid. However, instead of having a +1/−1 response, each weak classifier will have its response modulated by the ratio of the areas of R(s) ∩ z and R(s). Based on this idea, the partial response of a strong classifier with respect to a given patch z and sample s is defined as:
\[
sc(s, z) = \sum_{i=1}^{M} \alpha_i\, wc_i(s, z)\, \frac{\int_{R_i(s)\,\cap\, z} dx}{\int_{R_i(s)} dx}\,, \tag{6}
\]
where
\[
wc_i(s, z) = wc\big(s;\, R_i(s) \cap z\big). \tag{7}
\]
Note that if the region of interest associated with a particular weak classifier
does not intersect with the patch z, then this weak classifier will have no effect
on the strong classifier decision.
For a given person hypothesis ck , a sample sk can be constructed so that
for a particular patch zi , the direct patch to hypothesis affinity measure can be
defined as:
gk (zi ) = sc(sk , zi ) (8)
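The area-modulated partial response (6)–(7) can be sketched as follows; the rectangle representation, the average edge-magnitude statistic and the sign convention of the weak classifiers are illustrative assumptions, while the area-ratio modulation follows the text.

```python
import numpy as np


def rect_intersection(a, b):
    """Intersection of two rectangles given as (x0, y0, x1, y1); None if empty."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def partial_response(edge_mag, weak_classifiers, patch):
    """sc(s, z): strong-classifier response restricted to the patch z (cf. (6), (7)).

    edge_mag         : 2-D array of oriented edge magnitudes for the sample s
                       (the choice of image statistic is an assumption)
    weak_classifiers : list of dicts with keys 'rect', 'alpha', 'threshold', 'sign'
    patch            : rectangle z = (x0, y0, x1, y1) in the same coordinates
    """
    score = 0.0
    for wc in weak_classifiers:
        inter = rect_intersection(wc['rect'], patch)
        if inter is None:
            continue  # a weak classifier that misses the patch has no effect
        x0, y0, x1, y1 = inter
        stat = edge_mag[y0:y1, x0:x1].mean()            # average statistic over R_i(s) ∩ z
        response = wc['sign'] if stat > wc['threshold'] else -wc['sign']
        score += wc['alpha'] * response * area(inter) / area(wc['rect'])
    return score
```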
Figure 2 shows a set of cascaded classifiers that were used to construct the whole
body classifier. In this application the image statistic used is the magnitude of
the edge responses for pixels that exhibited an orientation similar to the preferred
orientation of the weak classifier. Edge magnitude and orientation are calculated
using the Sobel operator. Given such a whole body classifier, the next question is to determine the appropriate patch size. If the patch is too large, then there is a risk of contamination by occlusion. On the other hand, if the patch is too
small the ability to discriminate between correct and incorrect patch assignments
diminishes. To understand this tradeoff a training set of positive and negative
whole body samples was collected. Patches with widths ranging from 0.25W to
1.0W (W = person width) were evaluated across the entire bounding box for
each training sample. For each relative patch location, the average number of
positively responding strong classifiers from the cascaded whole body classifier
was recorded. As shown in Figure 3, when the patch width was reduced below
0.5W the ability to discriminate between positive and negative samples was
reduced significantly. Thus for this application a nominal patch width of 0.5W
is chosen.
Fig. 2. This figure shows the six strong classifiers that were constructed for the whole
body classifier plus all six cascades shown together. Each pink box represents the
region of interest for a weak classifier. The line interior to each region of interest depicts the weak classifier’s preferred orientation. Green lines represent positive features (the
average statistic must be above its threshold) and red lines are for negative features
(average statistic must be below its threshold).
(Figure panels, average responses per patch width: 0.25W — 2.7 and 2.0; 0.50W — 3.7 and 2.3; 0.75W — 4.4 and 2.2; 1.00W — 4.9 and 2.1.)
Fig. 3. This figure shows the effect of changing the patch size. Patch size is varied as a
function of W = person width. In each case the average number of positively responding
strong classifiers from the whole body cascaded classifier is shown as a function of patch
location for both positive (person) and negative (non-person) images. Note that when
the patch width is reduced below 0.5W the ability to discriminate between positive
and negative samples is significantly reduced.
4 Experiments
Unrehearsed imagery acquired at a mass transit site serves as the source of test
imagery for this paper. A whole body classifier was trained for this site. We
first illustrate the intermediate steps of our approach on a few representative
frames (see Figure 4). The “Initial Hypothesis” column of figure 4 shows the
initial set of hypotheses generated by the head and shoulders classifier. Note that
while an appropriate hypothesis was generated for each person in each image,
several false hypotheses were also generated. The “Single Assignment” column of
figure 4 illustrates the direct affinity between each patch and each hypothesis as
computed using equation 8. Each patch is color coded based on the hypothesis
Fig. 4. Six stages of the crowd segmentation process are illustrated for four test images.
Overlapping patches of 0.5W are used. However, for clarity smaller patches are shown.
The initial hypotheses generated by the head and shoulder classifier are shown in
the first column. In the second column, the patches are shown color coded based on
their strongest direct assignment as calculated by the whole body classifier. The null
hypothesis is shown in black. In the third column, neighboring patches with strong
similarity measures based on color constancy are connected by green line segments.
The assignment of patches to hypotheses based on the first E-step is shown in the
fourth column. The assignment after multiple rounds of both the E and M steps are
shown in the fifth column. The final segmentation is shown in the last column.
for which it has the highest direct affinity. Patches that are black have the
greatest affinity for the null hypothesis. A significant number of patches have
the greatest direct affinity for their true hypothesis, however confusion occurs
when multiple hypotheses overlap. An example of this can be seen in row A
of the Single Assignment column. In addition, patches that are only associated
with false detections tend to have a greater affinity for the null hypothesis.
The “Grouping” column of figure 4 illustrates the effectiveness of the pairwise
assignment criteria. For purposes of clarity, only neighboring patches with high
similarity measures are shown to be linked in green (blue otherwise). Note that
Fig. 5. This figure shows an assortment of crowd segmentation results. Note that the
algorithm produces the correct segmentation in case of severe partial occlusion (right
column), and in presence of cleaning equipment (bottom left) and a variety of suitcases
and bags.
patches associated with the same article of clothing tend to be grouped together.
Also, the background often exhibits continuity in appearance and such patches
tend to be grouped together.
The “E-step” column of figure 4 shows the patch assignment after the first
iteration of the “E-step”. Most of the false hypotheses have received very few
patch assignments, while the true hypotheses have been assigned patches in
an appropriate manner. However, inconsistencies have also been generated. For
example, in row A of the “E-step” column, a number of patches have been
assigned to the bottom of the green hypothesis. As seen in the “M-step” column,
these inconsistent assignments have been correctly removed.
The “Final” column of figure 4 shows the final segmentation. Figure 5 shows
similar results from a variety of images. Note that the algorithm is successful
when confronted with partial occlusion and clutter such as the janitor’s equip-
ment and various suitcases.
The algorithm was also applied to a video sequence (see supplemental material). To measure overall performance, 117 frames were processed. The initial hypothesis generator produced 480 true detections, 32 false detections and 79 missed detections. After application of the crowd segmentation algorithm, the number of false detections was reduced by 72 percent at the cost of falsely rejecting 2 percent of the true detections. For purposes of comparison, we applied the Histogram of Oriented Gradients (HOG) detector [6] to this dataset. Our implementation uses camera calibration information for automatic scale selection. The performance tabulated in Table 1 shows that our crowd segmentation outperformed HOG; arguably this is due to the partial occlusions present in the data.
Table 1. Comparison of HOG [6] person detector to the proposed crowd segmentation
algorithm
Fig. 6. Four example frames from tracking the results of the crowd segmentation
process
Fig. 7. This figure illustrates the effect of using motion fields in the pairwise patch
assignment. The example on the left shows a frame where the algorithm results in
both a false positive and a false negative. However, when the motion information from
dense optical flow is used the correct segmentation results, as shown on the right.
The results thus far used a pairwise patch similarity function based on color constancy, as defined in Equation 1. However, this is not always sufficient, as shown in the left image of Figure 7, where the crowd segmentation algorithm resulted in both a false and a missed detection. An experiment was performed where
the pairwise patch similarity measures were augmented by the use of a motion
consistency measure based on dense optical flow. As can be seen from the right
image in Figure 7, this results in a correct segmentation.
5 Discussion
The framework presented in this paper has incorporated many of the strengths of
previously proposed crowd segmentation methods into a single unified approach.
A novel aspect of this paper is that monolithic whole body classifiers were used
to analyze partially occluded regions by considering partial responses associated
with specific image patches. In this way appearance information is incorporated
into a global optimization process, alleviating the need for foreground/background
segmentation. The EM framework was also able to consider low level image cues
such as color histograms and thus take advantage of the potential color constancy
associated with clothing and the background. Parametrization of the likelihood
function allowed for the enforcement of global consistency of the segmentation.
It was shown that these parameters can be estimated during the M-step and
that this facilitates consistency based on occlusion reasoning.
In the course of experimentation it was found that at various times, different
aspects of the crowd segmentation system proved to be the difference between
success and failure. For example when confronted with clutter, the appearance
based classifiers provide the saliency required to overcome these challenges. How-
ever, when multiple people having similar clothing are encountered, the motion
field can become the discriminating factor. A robust system must be able to take advantage of its multiple strengths and degrade gracefully when any one of them fails.
References
7. Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on Riemannian manifolds. IEEE Computer Vision and Pattern Recognition (2007)
8. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time
tracking. IEEE Computer Vision and Pattern Recognition 2, 246–252 (1998)
9. Rabaud, V., Belongie, S.: Counting crowded moving objects. IEEE Computer Vi-
sion and Pattern Recognition, 705–711 (2006)
10. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and
appearance. International Journal of Computer Vision 2, 734–741 (2003)
11. Fergus, R., Perona, P., Zisserman, A.: A visual category filter for Google images.
In: European Conference on Computer Vision, vol. 1, pp. 242–256 (2004)
12. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a proba-
bilistic assembly of robust part detectors. In: European Conference on Computer
Vision (2004)
13. Wu, B., Nevatia, R.: Detection and tracking of multiple partially occluded humans
by bayesian combination of edgelet based part detectors. International Journal of
Computer Vision 75(2), 247–266 (2007)
14. Zhao, T., Nevatia, R.R.: Bayesian human segmentation in crowded situations.
IEEE Computer Vision and Pattern Recognition 2, 459–466 (2003)
15. Chui, H., Rangarajan, A.: A new point matching algorithm for non-rigid registra-
tion. Computer Vision and Image Understanding 89(3), 114–141 (2003)
16. Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of
Computer Vision 57(2), 137–154 (2004)
17. Krahnstoever, N., Tu, P., Sebastian, T., Perera, A., Collins, R.: Multi-view detec-
tion and tracking of travelers and luggage in mass transit environments. In: Proc.
Ninth IEEE International Workshop on Performance Evaluation of Tracking and
Surveillance (PETS) (2006)
18. Leibe, B., Schindler, K., Gool, L.V.: Coupled detection and trajectory estimation
for multi-object tracking. In: International Conference on Computer Vision (ICCV
2007), Rio de Janeiro, Brasil (October 2007)
19. Blackman, S., Popoli, R.: Design and Analysis of Modern Tracking Systems. Artech
House Publishers (1999)
20. Rasmussen, C., Hager, G.: Joint probabilistic techniques for tracking multi-part
objects. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition,
pp. 16–21 (1998)
Quick Shift and Kernel Methods
for Mode Seeking
1 Introduction
We address this issue in two ways. First, we propose using medoid shift to sim-
plify the data and initialize the more accurate mean shift algorithm (Sect. 5.2 and
Sect. 5.3). Second, we propose an alternative mode seeking algorithm that can
trade off mode over- and under-fragmentation (Sect. 3). This algorithm, related
to [13], is particularly simple and fast, yields surprisingly good segmentations,
and returns a one parameter family of segmentations where model selection can
be applied.
We demonstrate these algorithms on three tasks (Sect. 5): Clustering on a
manifold (Sect. 5.1), image segmentation (Sect. 5.2), and clustering image sig-
natures for automatic object categorization (Sect. 5.3). The relative advantages
and disadvantages of the various algorithms are discussed.
2 Mode Seeking
\[
P(x) = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i), \qquad x \in \mathbb{R}^d \tag{1}
\]
where k(x) can be a Gaussian or other window.1 Then each point xi is moved
towards a mode of P (x) evolving the trajectory yi (t), t > 0 uphill, starting from
yi (0) = xi and following the gradient ∇P (yi (t)). All the points that converge to
the same mode form a cluster.
A mode seeking algorithm needs (i) a numerical scheme to evolve the trajec-
tories yi (t), (ii) a halting rule to decide when to stop the evolution and (iii) a
clustering rule to merge the trajectory end-points. Next, we discuss two algo-
rithms of this family.
Mean Shift. Mean shift [9,5] is based on an efficient rule to evolve the trajectories $y_i(t)$ when the window $k(x)$ can be written as $\psi(\|x\|_2^2)$ for a convex function $\psi(z)$ (for instance the Gaussian window has $\psi(z) \propto \exp(-z)$). The idea is to bound the window from below by the quadric $k(z') \geq k(z) + (\|z'\|_2^2 - \|z\|_2^2)\,\dot\psi(\|z\|_2^2)$. Substituting in (1) yields
\[
P(y') \;\geq\; P(y) + \frac{1}{N} \sum_{j=1}^{N} \left(\|y' - x_j\|_2^2 - \|y - x_j\|_2^2\right) \dot\psi\!\left(\|y - x_j\|_2^2\right), \tag{2}
\]
and maximizing this lower bound at $y = y_i(t)$ yields the mean-shift update rule
\[
y_i(t+1) = \operatorname*{argmax}_{y}\; \frac{1}{N} \sum_{j=1}^{N} \|y - x_j\|_2^2\, \dot\psi\!\left(\|y_i(t) - x_j\|_2^2\right)
\;=\; \frac{\sum_{j=1}^{N} \dot\psi\!\left(\|y_i(t) - x_j\|_2^2\right) x_j}{\sum_{j=1}^{N} \dot\psi\!\left(\|y_i(t) - x_j\|_2^2\right)}. \tag{3}
\]
¹ The term “kernel” is also used in the literature. Here we use the term “window” to avoid confusion with the kernels introduced in Sect. 3.
If the profile ψ(z) is monotonically decreasing, then P (yi (t)) < P (yi (t + 1)) at
each step and the algorithm converges in the limit (since P is bounded [5]). The
complexity is O(dN 2 T ), where d is the dimensionality of the data space and T is
the number of iterations. The behavior of the algorithm is illustrated in Fig. 1.
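With a Gaussian window the update (3) reduces to a weighted mean of the data, since the negative sign of ψ̇ cancels between numerator and denominator. The sketch below assumes a Gaussian window of bandwidth σ.

```python
import numpy as np


def mean_shift(X, sigma=1.0, iters=50, tol=1e-5):
    """Evolve every point of X (N, d) uphill with the update rule (3),
    assuming a Gaussian window of bandwidth sigma."""
    Y = X.copy()
    for _ in range(iters):
        # Squared distances between current trajectory points and the data.
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # (N, N)
        W = np.exp(-d2 / (2.0 * sigma ** 2))                       # |psi_dot| up to a constant
        Y_new = (W @ X) / W.sum(axis=1, keepdims=True)             # weighted means
        if np.abs(Y_new - Y).max() < tol:
            return Y_new
        Y = Y_new
    return Y
```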
Medoid Shift. Medoid shift [20] is a modification of mean shift in which the trajectories $y_i(t)$ are constrained to pass through the points $x_i$, $i = 1, \ldots, N$. The advantages of medoid shift are: (i) only one step $y_i(1)$, $i = 1, \ldots, N$, has to be computed for each point $x_i$ (because $y_i(t+1) = y_{y_i(t)}(1)$), (ii) there is no need for a stopping/merging heuristic (as these conditions are met exactly), and (iii) the data space X may be non-Euclidean (since to maximize (4) there is no need to compute derivatives). Eventually, points are linked by the steps into a forest, with clusters corresponding to trees. The algorithm is illustrated in Fig. 1.
According to [20], the main drawback of medoid shift is speed. In fact, maxi-
mizing (3) restricted to the dataset amounts to calculating
\[
y_i(1) = \operatorname*{argmax}_{y \in \{x_1, \ldots, x_N\}}\; \frac{1}{N} \sum_{j=1}^{N} d^2(y, x_j)\, \dot\phi\!\left(d^2(x_j, x_i)\right) \tag{4}
\]
\[
y_i(1) = \operatorname*{argmax}_{k = 1, \ldots, N}\; \sum_{j=1}^{N} D_{kj} F_{ji} = \operatorname*{argmax}_{k = 1, \ldots, N}\; e_k^{\top} D F\, e_i,
\qquad D_{kj} = d^2(x_k, x_j), \quad F_{ji} = \tfrac{1}{N}\dot\phi\big(d^2(x_j, x_i)\big), \tag{5}
\]
where $e_i$ denotes the $i$-th element of the canonical basis.² As noted in [20], $O(N^{2.38})$ operations are sufficient by using the fastest matrix multiplication algorithm available. Unfortunately, the hidden constant of this algorithm is too large to be practical (see [12], p. 501). Thus, a realistic estimate of the time required is more pessimistic than what is suggested by the asymptotic estimate $O(dN^2 + N^{2.38})$.
Here we note that a more delicate issue with medoid shift is that it may fail
to properly identify the modes of the density P (x). This is illustrated in Fig. 2,
where medoid shift fails to cluster three real points −1, +1 and +1/2, finding
two modes −1 and +1 instead of one. To overcome this problem, [20] applies
medoid shift iteratively on the modes (in the example −1 and +1). However, this
solution is not completely satisfactory because (i) the underlying model P (x) is
changed (similarly to blurry mean shift [9,3]) and (ii) the strategy does not work
in all cases (for instance, in Fig. 2 points −1 and +1 still fail to converge to a
single mode).
Finally, consider the interpretation of medoid shift. When X is a Hilbert space,
medoid (and mean) shift follow approximately the gradient of the density P (x)
² For instance $e_2 = (0\;\; 1\;\; 0\;\; \ldots\;\; 0)^{\top}$.
(by maximizing the lower bound (3)). The gradient depends crucially on the
inner product and corresponding metric defined on X , which encodes the cost
of moving along each direction [22]. For general metric spaces X , the gradient
may not be defined, but the term d2 (x, y) in (4) has a similar direction-weighing
effect. In later sections we will make this connection more explicit.
3 Fast Clustering
Faster Euclidean Medoid Shift. We show that the complexity of Euclidean medoid shift is only $O(dN^2)$ (with a small constant) instead of $O(dN^2 + N^{2.38})$ (with a large constant) [20]. Let $X = [x_1 \;\ldots\; x_N]$ be the data matrix. Let $n = (X \odot X)^{\top}\mathbf{1}$ be the vector of the squared norms of the data, where $\mathbf{1}$ denotes the vector of all ones and $\odot$ the Hadamard (component-wise) matrix product. Then we have
\[
D F = \big(n\mathbf{1}^{\top} + \mathbf{1}n^{\top} - 2X^{\top}X\big) F = n(\mathbf{1}^{\top}F) + \mathbf{1}(n^{\top}F) - 2X^{\top}(XF). \tag{6}
\]
The term $\mathbf{1}(n^{\top}F)$ has constant columns and is irrelevant to the maximization (5). Therefore, we need to compute
\[
n(\mathbf{1}^{\top}F) - 2X^{\top}(XF) = \big((I \odot (X^{\top}X))\,\mathbf{1}\big)(\mathbf{1}^{\top}F) - 2X^{\top}(XF), \tag{7}
\]
where I is the identity matrix.³ It is now easy to check that each matrix product in (7) requires $O(dN^2)$ operations only.
³ And we used the fact that $(I \odot (AB))\,\mathbf{1} = (B^{\top} \odot A)\,\mathbf{1}$.
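The O(dN²) computation can be written directly in terms of the matrix products above: after dropping the constant-column term 1(n^⊤F), the score of candidate k for point i is n_k(1^⊤F)_i − 2(X^⊤(XF))_{ki}. The Gaussian profile used below (so that φ̇ is a negative exponential) is an assumption.

```python
import numpy as np


def euclidean_medoid_shift(X, sigma=1.0):
    """One medoid-shift step per point, computed with the O(dN^2) factorization.

    X : (d, N) data matrix. Returns the parent indices y_i(1), cf. (5).
    Assumes a Gaussian profile phi(z) = exp(-z / (2 sigma^2)), so that
    phi_dot(z) = -exp(-z / (2 sigma^2)) / (2 sigma^2) is negative.
    """
    d, N = X.shape
    n = (X * X).sum(axis=0)                            # squared norms, n = (X ⊙ X)^T 1
    D = n[:, None] + n[None, :] - 2.0 * (X.T @ X)      # pairwise squared distances
    F = -np.exp(-D / (2.0 * sigma ** 2)) / (2.0 * sigma ** 2 * N)   # F_ji = phi_dot(D_ji)/N
    # Score of candidate k for point i, up to a term constant in k (cf. (7)).
    S = np.outer(n, F.sum(axis=0)) - 2.0 * X.T @ (X @ F)            # (N, N)
    return S.argmax(axis=0)                            # y_i(1) = argmax_k S_ki
```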
Quick Shift. We propose to simply move each point to the nearest neighbour at which the density estimate is higher:
\[
y_i(1) = \operatorname*{argmin}_{j\,:\,P_j > P_i} D_{ij}, \qquad P_i = \frac{1}{N}\sum_{j=1}^{N} \phi(D_{ij}). \tag{8}
\]
Quick shift has four advantages: (i) simplicity; (ii) speed (O(dN 2 ) with a small
constant); (iii) generality (the nature of D is irrelevant); (iv) a tuning parameter
to trade off under- and over-fragmentation of the modes. The latter is obtained
because there is no a-priori upper bound on the length Dij of the shifts yi (0) →
yi (1). In fact, the algorithm connects all the points into a single tree. Modes
are then recovered by breaking the branches of the tree that are longer than a
threshold τ . Searching τ amounts to performing model selection and balances
under- and over-fragmentation of the modes. The algorithm is illustrated in
Fig. 1.
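A compact sketch of quick shift on Euclidean data: compute the Parzen density estimates P_i, link every point to its closest neighbour of higher density, and break links longer than τ to obtain the clusters. The Gaussian window and the use of squared distances for D are assumptions.

```python
import numpy as np


def quick_shift(X, sigma=1.0, tau=3.0):
    """Quick shift (cf. (8)) on Euclidean data X of shape (N, d).

    Returns (parent, labels): parent[i] is the point each x_i is linked to,
    and labels are cluster indices after breaking links longer than tau.
    """
    N = len(X)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # squared distances D_ij
    P = np.exp(-D / (2.0 * sigma ** 2)).sum(axis=1) / N       # Parzen density estimates
    parent = np.arange(N)
    for i in range(N):
        higher = np.where(P > P[i])[0]                        # neighbours with higher density
        if len(higher):
            j = higher[np.argmin(D[i, higher])]               # nearest of them
            if D[i, j] <= tau ** 2:                           # break links longer than tau
                parent[i] = j
    # Follow the links to the root of each tree to obtain cluster labels.
    labels = parent.copy()
    for _ in range(N):
        labels = parent[labels]
    return parent, labels
```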
Quick shift is related to the classic algorithm from [13]. In fact, we can
rewrite (8) as
\[
y_i(1) = \operatorname*{argmax}_{j = 1, \ldots, N} \frac{\operatorname{sign}(P_j - P_i)}{D_{ij}},
\qquad\text{and compare it to}\qquad
y_i(1) = \operatorname*{argmax}_{j\,:\,d(x_j, x_i) < \tau} \frac{P_j - P_i}{D_{ij}}. \tag{9}
\]
⁴ The kernel K should not be confused with the Parzen window k(z) appearing in (1). In the literature, it is common to refer to the Parzen window as “kernel”, but in most cases it has rather different mathematical properties than the kernel K we consider here. An exception is when the window is Gaussian, in which case $k(d^2(x, y))$ is a p.d. kernel. In this case, we point out an interesting interpretation of mean shift as a local optimization algorithm that, starting from each data point, searches for the pre-image of the global data average computed in kernel space. This explains the striking similarity of the mean shift update Eq. (3) and Eq. (18.22) of [19].
⁵ K is centered if $K\mathbf{1} = 0$. If this is not the case, we can replace K by $HKH$, where $H = I - \mathbf{1}\mathbf{1}^{\top}/N$ is the so-called centering matrix. This operation translates the origin of the kernel space, but does not change the corresponding distance.
4 Cluster Refinement
\[
P(y) = \frac{1}{N} \sum_{j=1}^{N} k\big(d_{\mathcal{H}}^2(y, x_j)\big), \qquad y \in \mathcal{H} \tag{10}
\]
where $d_{\mathcal{H}}^2(x_j, y) = \langle y, y\rangle_{\mathcal{H}} + \langle x_j, x_j\rangle_{\mathcal{H}} - 2\langle y, x_j\rangle_{\mathcal{H}}$. Notice that $y \in \mathcal{H}$, unlike in standard mean shift, does not necessarily belong to the data space X (up to the identification $x \equiv K(x, \cdot)$). However, if $k(z)$ is monotonically decreasing, then the maximization w.r.t. y can be restricted to the linear subspace $\mathrm{span}_{\mathcal{H}} X = \mathrm{span}_{\mathcal{H}}\{x_1, \ldots, x_n\} \subset \mathcal{H}$ (if not, the orthogonal projection of y onto that space decreases all terms $d_{\mathcal{H}}^2(x_j, y)$ simultaneously).
Therefore, we can express all calculations relative to $\operatorname{span}_{\mathcal{H}} X$. In particular,
if $K_{ij} = K(x_i, x_j)$ is the kernel matrix, we have $d^2_{\mathcal{H}}(x_j, y) = y^\top K y + e_j^\top K e_j - 2 e_j^\top K y$,
where $e_j$ is the j-th vector of the canonical basis and y is a vector of N
coefficients. As in standard mean shift, the shifts are obtained by maximizing
the lower bound
⁶ The interpretation is discussed later.
Fig. 3. Kernel mean and medoid shift algorithms. We show basic MATLAB
implementations of two of the proposed algorithms. Here $K = G^\top G$ is a low-rank
decomposition, $G \in \mathbb{R}^{d \times N}$, of the (centered) kernel matrix and sigma is the (isotropic)
standard deviation of the Gaussian Parzen window. Both algorithms are $O(dN^2)$ (for
a fixed number of iterations of mean shift), reduce to their Euclidean equivalents by
setting $G \equiv X$ and $Z \equiv Y$, and can be easily modified to use the full kernel matrix K
rather than a decomposition $G^\top G$ (but the complexity grows to $O(N^3)$).
$$ y_i(t+1) = \operatorname*{argmax}_{y \in \mathbb{R}^N} \sum_{j=1}^{N} \left( y^\top K y + e_j^\top K e_j - 2 e_j^\top K y \right) \dot{\phi}\!\left(d^2_{\mathcal{H}}(x_j, y_i(t))\right). $$
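For a Gaussian window the maximization above has the familiar closed form: the new coefficient vector of $y_i$ is the vector of normalized window weights. The Python sketch below illustrates one such iteration on the full kernel matrix K (hence O(N^3) per iteration, rather than the O(dN^2) of the low-rank MATLAB code of Fig. 3); the Gaussian assumption and all names are ours.

import numpy as np

def kernel_mean_shift_step(K, Y, sigma):
    # K: N x N (centered) kernel matrix.
    # Y: N x N matrix whose i-th column holds the coefficients of y_i in span X.
    d2 = (np.einsum('ji,jk,ki->i', Y, K, Y)[None, :]    # y_i^T K y_i, per column
          + np.diag(K)[:, None]                         # e_j^T K e_j
          - 2.0 * K @ Y)                                # e_j^T K y_i
    W = np.exp(-d2 / (2.0 * sigma ** 2))                # Gaussian window weights
    return W / W.sum(axis=0, keepdims=True)             # normalized weights = new y_i

# Start from Y = np.eye(N) (each y_i at its data point) and iterate until the
# columns stop changing; columns that coincide identify a common mode.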
$$ D = m\mathbf{1}^\top + \mathbf{1}n^\top - 2Y^\top K = m\mathbf{1}^\top + \mathbf{1}n^\top - 2Z^\top G, $$
where $\Sigma^2 = N \operatorname{diag}(\sigma_1, \dots, \sigma_N)$, with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_N$. According to this
decomposition, vectors $x, y \in \operatorname{span}_{\mathcal{H}} X$ can be identified with their coordinates $g, z \in \mathbb{R}^N$,
so that $\langle x, y\rangle_{\mathcal{H}} = \langle g, z\rangle$. Moreover, the data matrix $G = [g_1 \dots g_N]$ has null
mean⁸ and covariance $GG^\top/N = \Sigma/N = \sigma_1^2 \operatorname{diag}(\lambda_1^2, \dots, \lambda_N^2)$. If the $\lambda_i$ decay fast,
the effective dimension of the data can be much smaller than N.
The simplest way to regularize the Parzen estimate is therefore to discard the
dimensions above some index d (which also improves efficiency). Another option
is to blur the coordinates z by adding a small Gaussian noise η of isotropic
standard deviation ε, obtaining a regularized variable $\bar{z} = z + \eta$. The components
of z with smaller variance are "washed out" by the noise, and we can obtain a
consistent estimator of $\bar{z}$ by using the regularized Parzen estimate $\sum_{i=1}^{N} (g * k)(\bar{z}_i)$
(the same idea is implicitly used, for instance, in kernel Fisher discriminant
analysis [15], where the covariance matrix computed in kernel space is regularized
by the addition of $\varepsilon^2 I$). This suggests that using a kernel with sufficient isotropic
smoothing may be sufficient.
Finally, we note that, due to the different scalings $\lambda_1, \dots, \lambda_N$ of the linear
dimensions, it might be preferable to use an adapted Parzen window, which
retains the same proportions [17]. This, combined with the regularization ε,
suggests scaling each axis of the kernel by $\sqrt{\sigma^2\lambda_i^2 + \varepsilon^2}$.⁹
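As a rough illustration of the first option (discarding the dimensions above some index d), the sketch below obtains rank-reduced coordinates G from an eigendecomposition of the centered kernel matrix; this particular route and the names are our assumptions rather than the paper's implementation.

import numpy as np

def kernel_coordinates(K, d=None):
    # Returns G with G^T G ~= centered K; rows are sorted by decreasing variance.
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    Kc = H @ K @ H                                      # centering (cf. footnote 5)
    w, V = np.linalg.eigh(Kc)
    order = np.argsort(w)[::-1]
    w, V = np.clip(w[order], 0.0, None), V[:, order]
    G = np.sqrt(w)[:, None] * V.T                       # coordinates of the data in span X
    return G if d is None else G[:d]                    # optionally drop the weak dimensions
# The blurring alternative amounts to adding eps**2 to the per-axis variances
# instead of truncating them.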
⁸ Because K is assumed to be centered, so that $\mathbf{1}^\top G^\top (G\mathbf{1}) = \mathbf{1}^\top K \mathbf{1} = 0$.
⁹ So far we disregarded the normalization constant of the Parzen window k(x), as it
was irrelevant for our purposes. If, however, windows $k_\sigma(x)$ of variable width σ are
used [6], then the relative weights of the windows become important. Recall that in
the d-dimensional Euclidean case one has $k_{\sigma'}(0)/k_\sigma(0) = (\sigma/\sigma')^d$. In kernel space
one would therefore have
$$ \frac{k_{\sigma'}(0)}{k_\sigma(0)} = \sqrt{\prod_{i=1}^{N} \frac{\sigma^2\lambda_i^2 + \varepsilon^2}{\sigma'^2\lambda_i^2 + \varepsilon^2}}. $$
5 Applications
[20] applies medoid shift to cluster data on manifolds, based on the distance
matrix D calculated by ISOMAP. If the kernel matrix $K = -HDH/2$, $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^\top$,
is p.d., we can directly apply kernel mean or medoid shift to the same
problem. If not, we can use the technique from [4] to regularize the estimate and
enforce this property. In Fig. 4 this idea is used to compare kernel mean shift,
kernel medoid shift and quick shift in a simple test case.
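A minimal sketch of this construction, under our reading that D holds squared geodesic distances; clipping negative eigenvalues is used here as a crude stand-in for the regularization technique of [4].

import numpy as np

def kernel_from_distances(D2):
    # D2: N x N matrix of squared (e.g. ISOMAP geodesic) distances.
    N = D2.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    K = -0.5 * H @ D2 @ H                               # double centering
    w, V = np.linalg.eigh(K)
    return (V * np.clip(w, 0.0, None)) @ V.T            # enforce positive semi-definiteness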
Image segmentation is a typical test case for mode seeking algorithms [5,16,20].
Usually mode seeking is applied to this task by clustering data {(p, f (p)), p ∈ Ω},
where p ∈ Ω are the image pixels and f(p) their color coordinates (we use the
same color space as [5]).
As in [5], we apply mean shift to segment the image into super-pixels (mean
shift variants can be used to obtain directly full segmentations [2,16,24]). We
compare the speed and segmentation quality obtained by using mean shift,
medoid shift, and quick shift (see Fig. 5 for further details).
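The sketch below shows how such (p, f(p)) data can be assembled and passed to the quick_shift routine sketched earlier; the spatial/color weighting and the assumption that the image is already in the color space of [5] are ours. Note that for full-size images the dense distance matrix must be restricted to a spatial window (cf. the 3σ disk used for Fig. 5).

import numpy as np

def segment_superpixels(image, sigma=5.0, tau=10.0, color_weight=0.5):
    # image: H x W x 3 float array, ideally already in the color space of [5].
    H, W = image.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    X = np.vstack([xs.ravel(), ys.ravel(),
                   color_weight * image.reshape(-1, 3).T])    # (p, f(p)), d = 5
    _, _, labels = quick_shift(X, sigma=sigma, tau=tau)
    return labels.reshape(H, W)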
Mean shift is equivalent to [5] and can be considered a reference to evaluate the
other segmentations. Non-iterative medoid shift (first column) over-fragments
significantly (see also Fig. 2), which in [20] is addressed by reiterating the
algorithm. However, since our implementation is only $O(dN^2)$, medoid shift has at
least the advantage of being much faster than mean shift, and can be used to
speed up the latter. In Fig. 5 we compare the time required to run mean shift
from scratch and from the modes found by medoid shift, and report the speedup
(as the number of modes found by medoid shift over the number of pixels).
Fig. 5. Image segmentation. We compare different mode seeking techniques for seg-
menting an image (for clarity we show only a detail). We report the computation time
in seconds (top-right corner of each figure). In order to better appreciate the intrin-
sic efficiency advantages of each method, we use comparable vanilla implementations
of the algorithms (in practice, one could use heuristics and advanced approximation
techniques [23] to significantly accelerate the computation). We use a Gaussian kernel
of isotropic standard deviation σ in the spatial domain and use only one optimization:
We approximate the support of the Gaussian window by a disk of radius 3σ (in the
spatial domain) which results in a sparse matrix F . Therefore the computational effort
increases with σ (top to bottom). The results are discussed in the text.
The interesting work [11] introduces a large family of positive definite kernels
for probability measures which includes many of the popular metrics: the χ² kernel,
Hellinger's kernel, the Kullback-Leibler kernel and the $l_1$ kernel. Leveraging
these ideas, we can use kernel mean shift to cluster probability measures, and in
particular bag-of-features image signatures.
Fig. 6. Automatic visual categorization. We use kernel mean shift to cluster bag-
of-features image descriptors of 1600 images from Caltech-4 (four visual categories:
airplanes, motorbikes, faces, cars). Top. From left to right, iterations of kernel mean
shift on the bag-of-features signatures. We plot the first two dimensions of the rank-
reduced kernel space (z vectors) and color the points based on the ground truth labels.
In the rightmost panel the data converged to five points, but we artificially added
random jitter to visualize the composition of the clusters. Bottom. Samples from the
five clusters found (notice that the airplanes are divided into two categories). We also report
the clustering quality, as the percentage of correct labels compared to the ground truth
(we merge the two airplane categories into one), and the execution time. We use basic
implementations of the algorithms, although several optimizations are possible.
In our experiments we adopt the χ² kernel
$$ K_{\chi^2}(x, y) = 2 \sum_{b=1}^{B} \frac{x_b\, y_b}{x_b + y_b}. $$
For each image we detect interest points (of fixed orientation; see [25] and references therein) and calculate SIFT
descriptors [14], obtaining about $10^3$ features per image. We then generate a vocabulary
of 400 visual words by clustering a random selection of such descriptors
using k-means. For each image, we compute a bag-of-features histogram x by
counting the number of occurrences of each visual word in that image. Finally,
we use the χ² kernel to generate the kernel matrix, which we feed to our clustering
algorithms.
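A compact Python sketch of this pipeline; the 400-word vocabulary and the χ² kernel follow the text, while the use of scikit-learn's k-means, the histogram normalization and all names are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def bag_of_features(descriptors_per_image, n_words=400, seed=0):
    # descriptors_per_image: list of (n_i x 128) SIFT descriptor arrays.
    pool = np.vstack(descriptors_per_image)
    rng = np.random.default_rng(seed)
    sample = pool[rng.choice(len(pool), size=min(len(pool), 50000), replace=False)]
    vocab = KMeans(n_clusters=n_words, n_init=3, random_state=seed).fit(sample)
    hists = np.array([np.bincount(vocab.predict(d), minlength=n_words)
                      for d in descriptors_per_image], dtype=float)
    return hists / hists.sum(axis=1, keepdims=True)      # normalized signatures

def chi2_kernel(Xh, Yh, eps=1e-10):
    # K(x, y) = 2 * sum_b x_b y_b / (x_b + y_b), cf. [11].
    num = Xh[:, None, :] * Yh[None, :, :]
    den = Xh[:, None, :] + Yh[None, :, :] + eps
    return 2.0 * (num / den).sum(axis=2)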
In Fig. 6 we compare kernel mean shift, kernel mean shift initialized by medoid
shift, and quick shift. The problem we solve is considerably harder than [10],
since in our case the number of clusters (categories) is unknown. All algorithms
discover five (rather than four) categories (Fig. 6), but the result is quite reason-
able since the category airplanes contains two distinct and visually quite different
populations (grounded and airborne airplanes). Moreover, compared to [10] we
do not try to explicitly separate an object from its background, but we use a
simple holistic representation of each image.
The execution time of the algorithms (Fig. 6) is very different. Mean shift
is relatively slow, at least in our simple implementation, and its speed greatly
improves when we use medoid shift to initialize it. However, consistently with
our image segmentation experiments, quick shift is much faster.
We also report the quality of the learned clusters (after manually merging the
two airplane subcategories) as the percentage of correct labels. Our algorithm
performs better than [10], which uses spectral clustering and reports 94% accuracy
on selected prototypes and as low as 85% when all the data are considered; our
accuracy in the latter case is at least 94%. We also study rescaling as proposed in
Sect. 4, showing that it (marginally) improves the results of mean/medoid shift,
but makes the convergence slower. Interestingly, however, the best performing
algorithm (not to mention the fastest) is quick shift.
6 Conclusions
In this paper we exploited kernels to extend mean shift and other mode seeking
algorithms to a non-Euclidean setting. This also clarifies issues of regularization
and data scaling when complex spaces are considered. In this context, we showed
how to derive a very efficient version of the recently introduced medoid shift
algorithm, whose complexity is lower than that of mean shift. Unfortunately, we also
showed that medoid shift often results in over-fragmented clusters. Therefore,
we proposed to use medoid shift to initialize mean shift, yielding a clustering
algorithm which is both efficient and accurate.
We also introduced quick shift, which can balance under- and over-fragmentation
of the clusters by the choice of a real parameter. We showed that, in practice, this
algorithm is very competitive, resulting in good (and sometimes better) segmenta-
tions compared to mean shift, at a fraction of the computation time.
References
1. Bach, F.R., Jordan, M.I.: Kernel independent component analysis. Journal of Ma-
chine Learning Research 3(1) (2002)
2. Carreira-Perpiñán, M.: Fast nonparametric clustering with gaussian blurring mean-
shift. In: Proc. ICML (2006)
3. Cheng, Y.: Mean shift, mode seeking, and clustering. PAMI 17(8) (1995)
4. Choi, H., Choi, S.: Robust kernel isomap. Pattern Recognition (2006)
5. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space
analysis. PAMI 24(5) (2002)
6. Comaniciu, D., Ramesh, V., Meer, P.: The variable bandwidth mean shift and
data-driven scale selection. In: Proc. ICCV (2001)
7. Csurka, G., Dance, C.R., Dan, L., Willamowski, J., Bray, C.: Visual categorization
with bags of keypoints. In: Proc. ECCV (2004)
8. Fine, S., Scheinberg, K.: Efficient SVM training using low-rank kernel representa-
tions. Journal of Machine Learning Research (2001)
9. Fukunaga, K., Hostetler, L.D.: The estimation of the gradient of a density function,
with applications in pattern recognition. IEEE Trans. Information Theory 21(1) (1975)
10. Grauman, K., Darrell, T.: Unsupervised learning of categories from sets of partially
matching image features. In: Proc. CVPR (2006)
11. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on proba-
bility measures. In: Proc. AISTAT (2005)
12. Knuth, D.: The Art of Computer Programming: Seminumerical Algorithms, 3rd
edn., vol. 2 (1998)
13. Koontz, W.L.G., Narendra, P., Fukunaga, K.: A graph-theoretic approach to non-
parametric cluster analysis. IEEE Trans. on Computers c-25(9) (1976)
14. Lowe, D.: Implementation of the scale invariant feature transform (2007),
http://www.cs.ubc.ca/~lowe/keypoints/
15. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.-R.: Fisher discrimi-
nant analysis with kernels. In: Proc. IEEE Neural Networks for Signal Processing
Workshop (1999)
16. Paris, S., Durand, F.: A topological approach to hierarchical segmentation using
mean shift. In: Proc. CVPR (2007)
17. Sain, S.R.: Multivariate locally adaptive density estimation. Comp. Stat. and Data
Analysis, 39 (2002)
18. Schölkopf, B.: The kernel trick for distances. In: Proc. NIPS (2001)
19. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002)
20. Sheikh, Y.A., Khan, E.A., Kanade, T.: Mode-seeking by medoidshifts. In: Proc.
CVPR (2007)
21. Subbarao, R., Meer, P.: Nonlinear mean shift for clustering over analytic manifolds.
In: Proc. CVPR (2006)
22. Sundaramoorthy, G., Yezzi, A., Mennucci, A.: Sobolev active contours. Int. J. Com-
put. Vision 73(3) (2007)
23. Yang, C., Duraiswami, R., Gumerov, N.A., Davis, L.: Improved fast Gauss trans-
form and efficient kernel density estimation. In: Proc. ICCV (2003)
24. Yuan, X., Li, S.Z.: Half quadric analysis for mean shift: with extension to a se-
quential data mode-seeking method. In: Proc. CVPR (2007)
25. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for
classification of texture and object categories: A comprehensive study. IJCV (2006)
A Fast Algorithm for Creating a Compact and
Discriminative Visual Codebook
1 Introduction
Recently, patch-based object recognition has attracted particular attention and
demonstrated promising recognition performance [1,2,3,4]. Typically, a visual
codebook is created as follows. After extracting a large number of local patch
descriptors from a set of training images, k-means or hierarchical clustering is
often used to group these descriptors into n clusters, where n is a predefined
number. The center of each cluster is called a "visual word", and a list of them
forms a "visual codebook". By labelling each descriptor of an image with the
most similar visual word, this image is characterized by an n-dimensional his-
togram counting the number of occurrences of each word. The visual codebook
can have critical impact on recognition performance. In the literature, the size
of a codebook can be up to 103 or 104 , resulting in a very high-dimensional
histogram.
A compact visual codebook has advantages in both computational efficiency
and memory usage. For example, when linear or nonlinear SVMs are used, the
complexity of computing the kernel matrix, testing a new image, or storing the
support vectors is proportional to the codebook size, n. Also, many algorithms
working well in a low-dimensional space will encounter difficulties in a very high-dimensional one.
National ICT Australia is funded by the Australian Government’s Backing Aus-
tralia’s Ability initiative, in part through the Australian Research Council. The au-
thors thank Richard I. Hartley for many insightful discussions.
$$ B = \frac{\sum_{i=1}^{c} l_i\, (m_i - m)(m_i - m)^\top}{\sum_{i=1}^{c} l_i} \qquad (1) $$
A large class separability means small within-class scattering but large between-
class scattering. A combination of two of them can be used as a measure, for
example, tr(B)/tr(T) or |B|/|H|, where tr(·) and | · | denote the trace and de-
terminant of a matrix, respectively. In these measures the scattering of data
is evaluated through the mean and variance, which implicitly assumes a Gaus-
sian distribution for each class. This drawback is overcome by incorporating
the kernel trick, which makes the scatter-matrix based measure quite useful, as
demonstrated in Kernel-based Fisher Discriminant Analysis (KFDA) [6].
2. $\sum_{j=1}^{m} W_{ij} = 1$, if requiring that each of the n visual words only be assigned
to one of the m visual words.
3. If no words are to be discarded, the constraint $\sum_{i=1}^{n} W_{ij} \ge 1$ will be
imposed, because each of the n visual words must be assigned to one of the
m visual words;
This results in a large-scale integer programming problem. Solving it efficiently and
optimally may be difficult even for state-of-the-art optimization techniques.
In this paper, we adopt a suboptimal approach that hierarchically merges two
words while maximally maintaining the class separability at each level.
where $A^t_{rs}$ is a matrix defined as $A^t_{rs} = X^t_r (X^t_s)^\top$, where $X^t_r$ is $[x^t_{1r}, \cdots, x^t_{lr}]$.
Hence, it can be obtained that
$$ K^{t-1}_{rs} = K^t + A^t_{rs} + (A^t_{rs})^\top. \qquad (5) $$
A similar relationship exists between the class separability measures at t and
t − 1 levels. Let Bt−1 and Tt−1 be the matrices B and T computed with xt−1 .
It can be proven (the proof is omitted) that for a c-class problem,
$$ \operatorname{tr}(B^{t-1}_{rs}) = \sum_{i=1}^{c} \frac{\mathbf{1}^\top K^{t-1}_{rs,i}\mathbf{1}}{l_i} - \frac{\mathbf{1}^\top K^{t-1}_{rs}\mathbf{1}}{l}; \qquad \operatorname{tr}(T^{t-1}_{rs}) = \operatorname{tr}(K^{t-1}_{rs}) - \frac{\mathbf{1}^\top K^{t-1}_{rs}\mathbf{1}}{l} \qquad (6) $$
where $l_i$ is the number of training samples in class i, and l is the total number. Note that
$\mathbf{1}^\top A^t_{rs}\mathbf{1} = \mathbf{1}^\top (A^t_{rs})^\top \mathbf{1}$, where $\mathbf{1}$ is a vector consisting of "1"s. By combining (5)
and (6), we obtain that
$$ \operatorname{tr}(B^{t-1}_{rs}) = \left(\sum_{i=1}^{c} \frac{\mathbf{1}^\top K^t_i\mathbf{1}}{l_i} - \frac{\mathbf{1}^\top K^t\mathbf{1}}{l}\right) + 2\left(\sum_{i=1}^{c} \frac{\mathbf{1}^\top A^t_{rs,i}\mathbf{1}}{l_i} - \frac{\mathbf{1}^\top A^t_{rs}\mathbf{1}}{l}\right) = \operatorname{tr}(B^t) + 2\left(\sum_{i=1}^{c} \frac{\mathbf{1}^\top A^t_{rs,i}\mathbf{1}}{l_i} - \frac{\mathbf{1}^\top A^t_{rs}\mathbf{1}}{l}\right) \qquad (7) $$
where $f(X^t_r, X^t_s)$ denotes the second term in the previous step. Similarly,
$$ \operatorname{tr}(T^{t-1}_{rs}) = \operatorname{tr}(K^t) - \frac{\mathbf{1}^\top K^t\mathbf{1}}{l} + 2\operatorname{tr}(A^t_{rs}) - \frac{2\,\mathbf{1}^\top A^t_{rs}\mathbf{1}}{l} = \operatorname{tr}(T^t) + 2\left(\operatorname{tr}(A^t_{rs}) - \frac{\mathbf{1}^\top A^t_{rs}\mathbf{1}}{l}\right) \triangleq \operatorname{tr}(T^t) + g(X^t_r, X^t_s). \qquad (8) $$
Since both tr(Bt ) and tr(Tt ) have been computed at level t before any merging
operation, the above results indicate that to evaluate the class separability after
merging two words, only f (Xtr , Xts ) and g(Xtr , Xts ) need to be calculated.
In the following, we further show that at any level t (m ≤ t < n), f (Xtr , Xts )
and g(Xtr , Xts ) can be worked out with little computation. Three cases are dis-
cussed in turn.
i) Neither the r-th nor the s-th visual word is newly generated at level t.
This means that both of them are directly inherited from level t + 1. Assuming
that they are numbered as p and q at level t + 1, it can be known that $f(X^t_r, X^t_s) = f(X^{t+1}_p, X^{t+1}_q)$.
ii) Just one of the r-th and the s-th visual words is newly generated at level t.
Assume that the r-th visual word is newly generated by merging the u-th
and the v-th words at level t + 1, that is, $X^t_r = X^{t+1}_u + X^{t+1}_v$. Furthermore,
assume that $X^t_s$ is numbered as q at level t + 1. It can be shown that
$$ f(X^t_r, X^t_s) = f(X^{t+1}_u, X^{t+1}_q) + f(X^{t+1}_v, X^{t+1}_q); $$
iii) Both the r-th and the s-th visual words are newly generated at level t.
This case does not exist because only one visual word can be newly generated
at each level of a hierarchical clustering.
The above analysis shows that $f(X^t_r, X^t_s)$ can be obtained either by directly
copying from level t + 1 or by a single addition operation. All of the analysis
applies to $g(X^t_r, X^t_s)$ as well. Hence, once the r-th and the s-th visual words are merged,
the class separability measure, $\operatorname{tr}(B^{t-1}_{rs})/\operatorname{tr}(T^{t-1}_{rs})$, can be immediately obtained
with two additions and one division.
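To make this bookkeeping explicit, the following Python sketch (ours; the variable names are not from the paper) evaluates f and g of (7)-(8) directly from the rank-one matrix A_rs, together with the separability obtained after a tentative merge of words r and s.

import numpy as np

def merge_gain(cr, cs, labels, trB, trT):
    # cr, cs: length-l count vectors of words r and s over the l training images
    # (the X_r, X_s of the text); labels: per-image class labels;
    # trB, trT: tr(B^t) and tr(T^t) at the current level.
    labels = np.asarray(labels)
    l = len(cr)
    A = np.outer(cr, cs)                                 # A_rs = X_r X_s^T (rank one)
    f = 2.0 * (sum(A[np.ix_(labels == c, labels == c)].sum() / np.sum(labels == c)
                   for c in np.unique(labels))
               - A.sum() / l)                            # second term of (7)
    g = 2.0 * (np.trace(A) - A.sum() / l)                # second term of (8)
    return (trB + f) / (trT + g), f, g                   # separability after merging r, s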
As a result, the point A must lie within the third quadrant of the Cartesian
coordinate system gOf and above the line f − g = 0. The domain of A is
marked as a hatched region in Fig. 1.
ii) The coordinates of $B(g^t, f^t)$ must satisfy the following constraints:
Fig. 1. Illustration of the region where A(−tr(Tt ), −tr(Bt )) and B(g t , f t ) reside
Indexing structure. To realize the fast search, a polar coordinate based indexing
structure is used to index the t(t − 1)/2 points B(g, f) at level t, as illustrated
in Fig. 2. Each point B is assigned to a bin (i, j) according to its distance from
the origin and its polar angle, where i = 1, · · · , K and j = 1, · · · , S. Here K is
the number of bins with respect to the distance from the origin, whereas S is the
number of bins with respect to the polar angle. In Fig. 2, this indexing structure
is illustrated by K concentric circles, each of which is further divided into S
segments. The total number of bins is KS. Through this indexing structure, we
know which points B reside in a given bin. In this paper, the number of
circles K is set to 40, and their radii are arranged as $r_i = r_{i+1}/2$. S is set
to 36, evenly dividing [0, 2π) into 36 bins.
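A small Python sketch of the bin assignment; K = 40, S = 36 and the radii relation r_i = r_{i+1}/2 follow the text, while r_max and the function name are our assumptions.

import numpy as np

def polar_bin(g, f, r_max, K=40, S=36):
    # Assign a point B(g, f) to a bin (i, j) by distance from the origin and polar angle.
    radii = r_max / 2.0 ** np.arange(K - 1, -1, -1)      # r_i = r_{i+1} / 2, outermost = r_max
    r = np.hypot(g, f)
    i = min(int(np.searchsorted(radii, r)) + 1, K)       # circle index, 1..K
    theta = np.arctan2(f, g) % (2.0 * np.pi)
    j = min(int(theta // (2.0 * np.pi / S)) + 1, S)      # angular sector, 1..S
    return i, j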
Fig. 2. The point A is fixed when searching for B which makes the line AB have the
largest slope. The line AD is tangent to the second largest circle CK−1 at D, and it
divides the largest circle CK into two parts, region I and II. Clearly, a point B in region
I always gives AB a larger slope than any point in region II. Therefore, if the region I
is not empty, the best point B must reside there and searching region I is sufficient.
Search strategy. As shown in Fig. 2, let D denote the point where the line AD is
tangent to the second largest circle, CK−1 . The line AD divides the largest circle
CK into two parts. When connected with A, a point B lying above AD (denoted
by region I) always gives a larger slope than any point below it (denoted by
region II). Therefore, if the region I is not empty, all points in the region II
can be safely ignored. The search is merely to find the best point B from the
region I which gives AB the largest slope. To carry out this search, we have to
know which points reside in the region I. Instead of exhaustively checking each of
the t(t − 1)/2 points against AD, this information is conveniently obtained via the
above indexing structure. Let θE and θF be the polar angles of E and F where
the line AD and CK intersect. Denote the bins (with respect to the polar angle)
into which they fall by S1 and S2 , respectively. Thus, searching the region I can
be accomplished by searching the bins (i, j) with i = K and j = S1, · · · , S2.³
Clearly, the area of the searched region is much smaller than the area of CK
for moderate K and S. Therefore, the number of points B(g, f) to be tested
can be significantly reduced, especially when the points B are distributed sparsely in
the areas away from the origin. If the region I is empty, move the line AD to
be tangent to the next circle, CK−2 , and repeat the above steps. After finding
the optimal pair of words and merging them, all points B(g, f ) related to the
two merged words will be removed. Meanwhile, new points related to the newly
³ The region that is actually searched is slightly larger than the region I. Hence, the
best point B found will be rejected if it is below the line AD. This also means that
the region I is actually empty.
generated word will be added and indexed. This process is conveniently realized
in our algorithm by letting one word "absorb" the other. Then, we finish the
operation at level t and move to level t−1. Our algorithm is described in Table 1.
Before ending this section, it is worth noting that this search problem may
be tackled by the dynamic convex hull [8] in computational geometry. Given
the point A, the best point B must be a vertex of the convex hull of the
points B(g, f). At each level t, some of the points B(g, f) are updated, resulting in a
dynamically changing convex hull. The technique of dynamic convex hull can be
used to update the vertex set accordingly. This will be explored in future work.
Input: the l training images represented as $\{(x_i, y_i)\}_{i=1}^{l}$ ($x_i \in \mathbb{R}^n$, $y_i \in \{1, \cdots, c\}$),
  where n is the size of the initial visual codebook and $y_i$ is the class label of $x_i$;
  m: the size of the target visual codebook.
Output: the n − m level merging hierarchy.
Initialization:
  Compute $f(X^n_i, X^n_j)$ and $g(X^n_i, X^n_j)$ ($1 \le i < j \le n$) and store them in memory;
  Index the n(n − 1)/2 points B(g, f) with a polar coordinate system quantized into bins;
  Compute $A(-\operatorname{tr}(T^n), -\operatorname{tr}(B^n))$.
Merging operation:
  for t = n, n − 1, · · · , m
    (1) fast search for the point $B(g^\star, f^\star)$ that gives the line AB the largest
        slope, where $f^\star = f(X^t_r, X^t_s)$ and $g^\star = g(X^t_r, X^t_s)$;
    (2) compute $\operatorname{tr}(B^{t-1})$ and $\operatorname{tr}(T^{t-1})$ and update the point A:
        $\operatorname{tr}(B^{t-1}) = \operatorname{tr}(B^t) + f(X^t_r, X^t_s)$;  $\operatorname{tr}(T^{t-1}) = \operatorname{tr}(T^t) + g(X^t_r, X^t_s)$;
    (3) update $f(X^t_r, X^t_i)$ and $g(X^t_r, X^t_i)$:
        $f(X^t_r, X^t_i) = f(X^t_r, X^t_i) + f(X^t_s, X^t_i)$;  $g(X^t_r, X^t_i) = g(X^t_r, X^t_i) + g(X^t_s, X^t_i)$;
        remove $f(X^t_s, X^t_i)$ and $g(X^t_s, X^t_i)$;
    (4) re-index $f(X^t_r, X^t_i)$ and $g(X^t_r, X^t_i)$
  end
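The sketch below implements the merging loop of Table 1 in its simplest form: the fast polar-coordinate search is replaced by an exhaustive scan over word pairs and the incremental caching of f and g is omitted, which is enough to illustrate the criterion. It reuses merge_gain from the earlier sketch; all names are ours.

import numpy as np

def csm_merge(counts, labels, m):
    # counts: l x n matrix of visual word counts (images x words); labels: class labels.
    # Greedily merges words until m remain, maximizing tr(B)/tr(T) at every level.
    X = np.asarray(counts, dtype=float).copy()
    labels = np.asarray(labels)
    l = X.shape[0]
    K = X @ X.T                                          # linear kernel of the histograms
    trT = np.trace(K) - K.sum() / l
    trB = sum(K[np.ix_(labels == c, labels == c)].sum() / np.sum(labels == c)
              for c in np.unique(labels)) - K.sum() / l
    history = []
    while X.shape[1] > m:
        best = None
        for r in range(X.shape[1]):                      # exhaustive pair scan; the paper's
            for s in range(r + 1, X.shape[1]):           # fast search avoids this O(n^2) step
                score, f, g = merge_gain(X[:, r], X[:, s], labels, trB, trT)
                if best is None or score > best[0]:
                    best = (score, r, s, f, g)
        _, r, s, f, g = best
        trB, trT = trB + f, trT + g                      # step (2) of Table 1
        X[:, r] += X[:, s]                               # word r "absorbs" word s
        X = np.delete(X, s, axis=1)
        history.append((r, s))                           # indices refer to the current level
    return history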
5 Experimental Results
The proposed class separability measure based fast algorithm is tested on four
classes of the Caltech-101 object database [9], including Motorbikes (798 images),
Airplanes (800), Faces easy (435), and BACKGROUND Google (520), as shown
in Fig. 3. A Harris-Affine detector [10] is used to locate interest regions, which
are then represented by the SIFT descriptor [11]. Other region detectors [12] and
descriptors [13] can certainly be used, because our algorithm places no restriction
on this. The numbers of local descriptors extracted from the images of the four
classes are about 134K, 84K, 57K, and 293K, respectively.
Fig. 4. Time and peak memory cost comparison of our CSM algorithm (using
the proposed fast search or an exhaustive search) and the PRO algorithm in [4]. The
horizontal axis is the size (in logarithm) of an initial visual codebook, while the vertical
axes are time and peak memory cost in (a) and (b), respectively. As shown, the CSM
algorithm with the fast search significantly reduces the time cost for a large-sized visual
codebook with acceptable memory usage.
word used in a histogram is randomly sampled from {0, 1, 2, · · · , 99}. In this ex-
periment, the CSM-based fast algorithm is compared with the PRO algorithm
which uses an exhaustive search to find the optimal pair of words to merge. We
implement the PRO algorithm according to [4], including a trick suggested to
speed up the algorithm by only updating the terms related to the two words to
be merged. Meanwhile, to explicitly show the efficiency of the fast search part in
our algorithm, we purposely replace the fast search in the CSM-based algorithm
with an exhaustive search to demonstrate the resulting quick increase in time cost. A ma-
chine with 2.80GHz CPU and 4.0GB memory is used. The result is in Fig. 4. As
seen in sub-figure(a), the time cost of the PRO algorithm goes up quickly with
the increasing codebook size. It takes 1,624 seconds to hierarchically cluster 1,000
visual words down to 2, whereas the CSM algorithm with an exhaustive search only
uses 9 seconds to accomplish this. The lower time cost is attributed to the simplicity
of the CSM criterion and the fast evaluation method proposed in Section 4.1.
The CSM algorithm with the fast search achieves the highest computational ef-
ficiency. It only takes 1.55 minutes to hierarchically merge 10,000 visual words
to 2, and the time cost increases to 141.1 minutes when an exhaustive search
is used. As shown in sub-figure(b), the price is that the fast search needs more
memory (1.45GB for 10,000 visual words) to store the indexing structure. We
believe that such memory usage is acceptable for a personal computer today.
In the following experiments, the discriminative power of the obtained compact
visual codebooks is investigated.
Fig. 5. Object recognition: Motorbikes vs. Airplanes. Classification error rate versus
the size of the obtained compact visual codebook for the compared methods, including
k-means clustering (KMS); (a) linear SVM, (b) nonlinear SVM.
images and 639 test images. An initial visual codebook of size 1,000 is created
by using k-means clustering. The CSM algorithm with the fast search hierarchically
clusters them into 2 words in 6 seconds, whereas the PRO algorithm takes
6,164 seconds to finish this. Based on the obtained compact visual codebook, a
new histogram is created to represent each image. With the new histograms, a
classifier is trained on a training subset and evaluated on the corresponding test
subset. The average classification error rate is plotted in Fig. 5. The sub-figure
(a) shows the result when a linear SVM classifier is used. As seen, the compact
codebook generated by k-means clustering has poor discriminative power. Its
classification error rate goes up with the decreasing size of the compact code-
book. This is because k-means clustering uses the Euclidean distance between
clusters as the merging criterion, which is not related to the classification perfor-
mance. In contrast, the CSM and PRO algorithms achieve better classification
performance, indicating that they well preserve the discriminative power in the
obtained compact codebooks. For example, when the codebook size is reduced
from 1000 to 20, these two algorithms still maintain excellent classification per-
formance, with an increase of error rate less than 1%. Though the classification
error rate of our CSM algorithm is a little bit higher (about 1.5%) at the initial
stage, it soon drops to a level comparable to the error rate given by the PRO
algorithm with the decreasing codebook size. Similar results can be observed
from Fig. 5(b) where a nonlinear SVM classifier is employed.
This experiment aims to separate the images containing a face from the back-
ground images randomly collected from the Internet. In each training/test split,
there are 100 training images and 1,498 test images. The number of initial visual
Fig. 6. Object detection: Faces easy vs. BACKGROUND Google. Classification error
rate versus the size of the obtained compact visual codebook for the compared methods,
including k-means clustering (KMS), shown in panels (a) and (b).
words is 1,000. They are hierarchically clustered into two words in 6 seconds by
our CSM algorithm with the fast search, and in 1,038 seconds by the PRO al-
gorithm. Again, with the newly obtained histograms, a classifier is trained and
evaluated. The averaged classification error rates are presented in Fig. 6. In this
experiment, the classification performance of the PRO algorithm is not as good
as before. This might be caused by the hyper-parameters used in the PRO al-
gorithm. Their values are preset according to [4] but may be task-dependent. In
contrast, our CSM algorithm achieves the best classification performance. The
small-sized compact codebooks consistently produce an error rate comparable to
that of the initial visual codebook. This indicates that our algorithm effectively
makes the compact codebooks preserve the discriminative power of the initial
codebook. An additional advantage of our algorithm is that the CSM criterion
is free of parameter setting. Meanwhile, a short “transition period” is observed
for the CSM algorithm in Fig. 6, where the classification error rate goes up and
then drops at the early stage. This interesting phenomenon will be looked into
in future work.
6 Conclusion
To obtain a compact and discriminative visual codebook, this paper proposes
using the separability of object classes to guide the hierarchical clustering of
initial visual words. Moreover, a fast algorithm is designed to avoid a lengthy
exhaustive search. As shown by the experimental study, our algorithm not only
ensures the discriminative power of a compact codebook, but also makes the
creation of a compact codebook very fast. This delivers an efficient tool for patch-
based object recognition. In future work, more theoretical and experimental
study will be conducted to analyze its performance.
References
1. Agarwal, S., Awan, A.: Learning to detect objects in images via a sparse, part-
based representation. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 26(11), 1475–1490 (2004)
2. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categoriza-
tion with bags of keypoints. In: Proceedings of ECCV International Workshop on
Statistical Learning in Computer Vision, pp. 1–22 (2004)
3. Jurie, F., Triggs, B.: Creating efficient codebooks for visual recognition. In: Pro-
ceedings of the Tenth IEEE International Conference on Computer Vision, vol. 1,
pp. 604–610 (2005)
4. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal
visual dictionary. In: Proceedings of the Tenth IEEE International Conference on
Computer Vision, vol. 2, pp. 1800–1807 (2005)
5. Varma, M., Zisserman, A.: A statistical approach to texture classification from
single images. International Journal of Computer Vision 62(1-2), 61–81 (2005)
6. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.R.: Fisher discriminant
analysis with kernels. In: Hu, Y.H., Larsen, J., Wilson, E., Douglas, S. (eds.) Neural
Networks for Signal Processing IX, pp. 41–48. IEEE, Los Alamitos (1999)
7. Shen, C., Li, H., Brooks, M.J.: A convex programming approach to the trace quo-
tient problem. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007,
Part II. LNCS, vol. 4844, pp. 227–235. Springer, Heidelberg (2007)
8. Overmars, M.H., van Leeuwen, J.: Maintenance of configurations in the plane.
Journal of Computer and System Sciences 23(2), 166–204 (1981)
9. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few
training examples: an incremental bayesian approach tested on 101 object cate-
gories. In: Conference on Computer Vision and Pattern Recognition Workshop,
vol. 12, pp. 178–178 (2004)
10. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. In-
ternational Journal of Computer Vision 60(1), 63–86 (2004)
11. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings
of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp.
1150–1157 (1999)
12. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffal-
itzky, F., Kadir, T., Gool, L.V.: A comparison of affine region detectors. Interna-
tional Journal of Computer Vision 65(1-2), 43–72 (2005)
13. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE
Transactions on Pattern Analysis and Machine Intelligence 27(10), 1615–1630
(2005)
A Dynamic Conditional Random Field Model
for Joint Labeling of Object and Scene Classes
1 Introduction
Today, object class detection methods are capable of achieving impressive re-
sults on challenging datasets (e.g. PASCAL challenges [1]). Often these methods
combine powerful feature vectors such as SIFT or HOG with the power of dis-
criminant classifiers such as SVMs and AdaBoost. At the same time several
authors have argued that global scene context [2,3] is a valuable cue for ob-
ject detection and therefore should be used to support object detection. This
context-related work however has nearly exclusively dealt with static scenes. As
this paper specifically deals with highly dynamic scenes we will also model object
motion as an additional and important cue for detection.
Pixel-wise scene labeling has also been an active field of research recently. A
common approach is to use Markov or conditional random field (CRF) models to
improve performance by modeling neighborhood dependencies. Several authors
have introduced the implicit notion of objects into CRF-models [4,5,6,7]. The
interactions between object nodes and scene labels however are often limited to
uni-directional information flow and therefore these models have not yet shown
the full potential of simultaneously reasoning about objects and scene. By for-
mulating the problem as a joint labeling problem for object and scene classes,
this paper introduces a more general notion of object-scene interaction enabling
bidirectional information flow. Furthermore, as we are interested in dynamic
scenes, we make use of the notion of dynamic CRFs [8], which we extend to deal
with both moving camera and moving objects.
Therefore we propose a novel approach to jointly label objects and scene
classes in highly dynamic scenes for which we introduce a new real-world dataset
with pixel-wise annotations. Highly dynamic scenes are not only a scientific chal-
lenge but also an important problem, e.g. for applications such as autonomous
driving or video indexing where both the camera and the objects are moving
independently. Formulating the problem as a joint labeling problem allows 1) to
model the dynamics of the scene and the objects separately which is of particular
importance for the scenario of independently moving objects and camera, and 2)
to enable bi-directional information flow between object and scene class labels.
The remainder of this paper is structured as follows. Section 2 reviews related
work from the area of scene labeling and scene analysis in conjunction with
object detection. Section 3 introduces our approach and discusses how object
detection and scene labeling can be integrated as a joint labeling problem in
a dynamic CRF formulation. Section 4 introduces the employed features, gives
details on the experiments and shows experimental results. Finally, section 5
draws conclusions.
2 Related Work
In recent years, conditional random fields (CRFs) [9] have become a popular
framework for image labeling and scene understanding. However, to the best
of our knowledge, there is no work which explicitly models object entities in
dynamic scenes. Here, we propose to model objects and scenes in a joint label-
ing approach on two different layers with different information granularity and
different labels in a dynamic CRF [8].
Related work can roughly be divided into two parts. First, there is related
work on CRF models for scene understanding, and second there are approaches
aiming to integrate object detection with scene understanding.
In [10] Kumar&Hebert detect man-made structures in natural scenes using a
single-layered CRF. Later they extend this work to handle multiple classes in a
two-layered framework [5]. Kumar&Hebert also investigated object-context in-
teraction and combined a simple boosted object detector for side-view cars with
scene context of road and buildings on a single-scale database of static images.
In particular, they are running inference separately on their two layers and each
detector hypothesis is only modeled in a neighborhood relation with an entire
region on the second layer. On the contrary, we integrate multi-scale objects in
a CRF framework where inference is conducted jointly for objects and context.
Additionally, we propose to model edge potentials in a consistent layout by ex-
ploiting the scale given by a state-of-the-art object detector [11]. Torralba et al.
[7] use boosted classifiers to model unary and interaction potentials in order
to jointly label object and scene classes. Both are represented by a dictionary
of patches. However, the authors do not employ an object detector for entire
objects. In our work we found a separate object detector to be essential for im-
proved performance. Also Torralba et al. use separate layers for each object and
scene class and thus inference is costly due to the high graph connectivity, and
furthermore they also work on a single-scale database of static images. We intro-
duce a sparse layer to represent object hypotheses and work on dynamic image
sequences containing objects of multiple scales. Further work on simultaneous
object recognition and scene labeling has been conducted by Shotton et al. [6].
Their confusion matrix shows, that in particular object classes where color and
texture cues do not provide sufficient discriminative power on static images –
such as boat, chair, bird, cow, sheep, dog – achieve poor results. While their
Texton feature can exploit context information even from image pixels with a
larger distance, the mentioned object classes remain problematic due to the un-
known object scale. Furthermore, He et al. [4] present a multi-scale CRF which
contains multiple layers relying on features of different scales. However, they
do not model the explicit notion of objects and their higher level nodes rather
serve as switches to different context and object co-occurrences. Similarly, Ver-
beek&Triggs [12] add information about class co-occurrences by means of a topic
model. Finally, several authors proposed to adopt the CRF framework for ob-
ject recognition as a standalone task [13,14,15] without any reasoning about the
context and only report results on static single-scale image databases.
Dynamic CRFs are exploited by Wang&Ji [16] for the task of image segmenta-
tion with intensity and motion cues in mostly static image sequences. Similarly,
Yin&Collins [17] propose a MRF with temporal neighborhoods for motion seg-
mentation with a moving camera.
The second part of related work deals with scene understanding approaches
from the observation of objects. Leibe et al. [18] employ a stereo camera system
together with a structure-from-motion approach to detect pedestrians and cars in
urban environments. However, they do not explicitly label the background classes
which are still necessary for many applications even if all objects in the scene
are known. Hoiem et al. [3] exploit the detected scales of pedestrians and cars
together with a rough background labeling to infer the camera’s viewpoint which
in turn improves the object detections in a directed Bayesian network. Contrary
to our work, object detections are refined by the background context but not
the other way round. Also, only still images are handled while the presence of
objects is assumed. Similarly, Torralba [2] exploits filter bank responses to obtain
a scene prior for object detection.
3.1 Plain CRF: Single Layer CRF Model for Scene-Class Labeling
In general a CRF models the conditional probability of all class labels yt given an
input image xt . Similar to others, we model the set of neighborhood relationships
N1 up to pairwise cliques to keep inference computationally tractable. Thus, we
model
$$ \log\!\left(P_{\mathrm{pCRF}}(\mathbf{y}^t \mid \mathbf{x}^t, N_1, \Theta)\right) = \sum_i \Phi(y^t_i, \mathbf{x}^t; \Theta_\Phi) + \sum_{(i,j) \in N_1} \Psi(y^t_i, y^t_j, \mathbf{x}^t; \Theta_\Psi) - \log(Z^t) \qquad (1) $$
Z t denotes the so called partition function, which is used for normalization.
N1 is the set of all spatial pairwise neighborhoods. We refer to this model as
plain CRF.
Unary Potentials. Our unary potentials model local features for all classes C,
including scene as well as object classes. We employ the joint boosting framework
[19] to build a strong classifier $H(c, \mathbf{f}(x^t_i); \Theta_\Phi) = \sum_{m=1}^{M} h_m(c, \mathbf{f}(x^t_i); \Theta_\Phi)$. Here,
$\mathbf{f}(x^t_i)$ denotes the features extracted from the input image for grid point i, M is
the number of boosting rounds and c are the class labels. The $h_m$ are weak learners
with parameters $\Theta_\Phi$ and are shared among the classes for this approach. In
order to interpret the boosting confidence as a probability we apply a softmax
transform [5]. Thus, the potential becomes:
$$ \Phi(y^t_i = k, \mathbf{x}^t; \Theta_\Phi) = \log \frac{\exp H(k, \mathbf{f}(x^t_i); \Theta_\Phi)}{\sum_c \exp H(c, \mathbf{f}(x^t_i); \Theta_\Phi)} \qquad (2) $$
Edge Potentials. The edge potentials model the interaction between class
labels at two neighboring sites $y^t_i$ and $y^t_j$ in a regular lattice. The interaction
strength is modeled by a linear discriminative classifier with parameters $\Theta_\Psi = \mathbf{w}$
and depends on the difference of the node features, $\mathbf{d}^t_{ij} := |\mathbf{f}(x^t_i) - \mathbf{f}(x^t_j)|$:
$$ \Psi(y^t_i, y^t_j, \mathbf{x}^t; \Theta_\Psi) = \sum_{(k,l) \in C} \mathbf{w}^\top \begin{pmatrix} 1 \\ \mathbf{d}^t_{ij} \end{pmatrix} \delta(y^t_i = k)\,\delta(y^t_j = l) \qquad (3) $$
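As a small illustration, the Python sketch below evaluates (2) and (3); the boosting scores are taken as given, and the parameter shapes (in particular the weight vector supplied per label pair for the edge term) reflect our reading of the equations rather than the authors' code.

import numpy as np

def unary_potential(H_scores):
    # H_scores: |C| x N array of joint-boosting scores H(c, f(x_i)).
    # Softmax transform of Eq. (2), returned as log-probabilities per class and node.
    z = H_scores - H_scores.max(axis=0, keepdims=True)   # numerical stability
    return z - np.log(np.exp(z).sum(axis=0, keepdims=True))

def edge_potential(fi, fj, w_kl):
    # Linear interaction classifier of Eq. (3) for one label pair (k, l):
    # w_kl is the weight vector for that pair, fi and fj the node features at sites i, j.
    d_ij = np.abs(fi - fj)                               # feature difference d_ij
    return float(w_kl @ np.concatenate(([1.0], d_ij)))   # w^T (1, d_ij)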
3.2 Object CRF: Two Layer Object CRF for Joint Object and
Scene Labeling
Information that can be extracted from an image patch locally is rather limited
and pairwise edge potentials are too weak to model long range interactions.
Ideally, a complete dense layer of hidden variables would be added to encode
possible locations and scales of objects, but since inference for such a model
is computationally expensive we propose to inject single hidden variables ot =
{ot1 , . . . , otD } (D being the number of detections) as depicted in figure 1(a). To
instantiate those nodes any multi-scale object detector can be employed.
The additional nodes draw object appearance from a strong spatial model
and are connected to the set of all corresponding hidden variables {yt }otn whose
Fig. 1. (a) Graphical model for the object CRF ; note that different edge colors denote
different potentials; (b) Graphical model for our full dynamic CRF ; observed nodes are
grey, hidden variables are white, for the sake of readability we omit the spatial layout
of yt with the corresponding edge potential Ψ
evidence $\{\mathbf{x}^t\}_{o^t_n}$ support the object hypotheses. The new nodes' labels in this
work are comprised of O = {object, background}, but the extension to multiple
object classes is straightforward. Thus, we introduce two new potentials into
the CRF model given in equation (1) and yield the object CRF:
$$ \log\!\left(P_{\mathrm{oCRF}}(\mathbf{y}^t, \mathbf{o}^t \mid \mathbf{x}^t, \Theta)\right) = \log\!\left(P_{\mathrm{pCRF}}(\mathbf{y}^t \mid \mathbf{x}^t, N_1, \Theta)\right) + \sum_n \Omega(o^t_n, \mathbf{x}^t; \Theta_\Omega) + \sum_{(i,j,n) \in N_3} \Lambda(y^t_i, y^t_j, o^t_n, \mathbf{x}^t; \Theta_\Lambda) \qquad (4) $$
$$ \Omega(o^t_n, \mathbf{x}^t; \Theta_\Omega) = \log \frac{1}{1 + \exp\!\left(s_1 \cdot (\mathbf{v}^\top \cdot \mathbf{g}(\{\mathbf{x}^t\}_{o^t_n}) + b) + s_2\right)} \qquad (5) $$
$$ \Lambda(y^t_i, y^t_j, o^t_n, \mathbf{x}^t; \Theta_\Lambda) = \sum_{(k,l) \in C;\, m \in O} \mathbf{u}^\top \begin{pmatrix} 1 \\ \mathbf{d}^t_{ij} \end{pmatrix} \delta(y^t_i = k)\,\delta(y^t_j = l)\,\delta(o^t_n = m) \qquad (6) $$
It is important to note that the inter-layer interactions are anisotropic and scale-
dependent. We exploit the scale given by the object detector to train different
weights for different scales and thus can achieve real multi-scale modeling in the
CRF framework. Furthermore, we use different sets of weights for different parts
of the detected object enforcing an object and context consistent layout [15].
3.3 Dynamic CRF: Dynamic Two Layer CRF for Object and Scene
Class Labeling
While the additional information from an object detector already improves the
classification accuracy, temporal information is a further important cue. We
propose two temporal extensions to the framework introduced so far. For highly
dynamic scenes – such as the image sequences taken from a driving car, which we
will use as an example application of our model – it is important to note that
objects and the remaining scene have different dynamics and thus should be
modeled differently. For objects we estimate their motion and track them with
a temporal filter in 3D space. The dynamics for the remaining scene is mainly
caused by the camera motion in our example scenario. Therefore, we use an
estimate of the camera’s ego motion to propagate the inferred scene labels at
time t as a prior to time step t + 1.
Since both – object and scene dynamics – transfer information forward to fu-
ture time steps, we employ directed links in the corresponding graphical model
as depicted in figure 1(b). It would have also been possible to introduce undi-
rected links, but those are computationally more demanding. Moreover, those
might not be desirable from an application point of view, due to the backward
flow of information in time when online processing is required.
4 Experiments
To evaluate our model’s performance we conducted several experiments on two
datasets. First, we describe our features which are used for texture and location
based classification of scene labels on the scene label CRF layer. Then we in-
troduce features employed for object detection on the object label CRF layer.
Next, we briefly discuss the results obtained on the Sowerby database and fi-
nally we present results on image sequences on a new dynamic scenes dataset,
which consist of car traffic image sequences recorded from a driving vehicle under
challenging real-world conditions.
Texture and Location Features. For the unary potential Φ at the lower
level as well as for the edge potentials Ψ and inter-layer potentials Λ we employ
texture and location features. The texture features are computed from the first 16
coefficients of the Walsh-Hadamard transform. This transformation is a discrete
approximation of the cosine transform and can be computed efficiently [24,25] –
even in real-time (e.g. on modern graphics hardware). The features are extracted
at multiple scales from all channels of the input image in CIE Lab color space.
As a preprocessing step, a and b channels are normalized by means of a gray
world assumption to cope with varying color appearance. The L channel is mean-
variance normalized to fit a Gaussian distribution with a fixed mean to cope with
global lighting variations. We also found that normalizing the transformation’s
coefficients according to Varma&Zisserman [26] is beneficial. They propose to
L1 -normalize each filter response first and then locally normalize the responses
at each image pixel. Finally, we take the mean and variance of the normalized
responses as feature for each node in the regular CRF lattice. Additionally, we
use the grid point’s coordinates within the image as a location cue. Therefore,
we concatenate the pixel coordinates to the feature vector.
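The sketch below illustrates the node feature for a single channel and a single aperture size: the first 16 Walsh-Hadamard coefficients of the patch at each grid node, L1-normalized in the spirit of [26], with the node coordinates appended as the location cue. It is a coarse simplification of the description above (natural instead of sequency ordering, one patch per node instead of mean/variance of per-pixel responses over the cell), and all names are ours.

import numpy as np

def hadamard(n):
    # Natural-ordered Hadamard matrix; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def node_features(channel, grid=8, aperture=16, n_coeff=16):
    # channel: H x W array (one normalized CIE Lab channel).
    T = hadamard(aperture)
    k = int(np.sqrt(n_coeff))
    H, W = channel.shape
    feats = {}
    for y in range(0, H - aperture + 1, grid):
        for x in range(0, W - aperture + 1, grid):
            patch = channel[y:y + aperture, x:x + aperture]
            c = (T @ patch @ T)[:k, :k].ravel()           # low-order transform coefficients
            c = c / (np.abs(c).sum() + 1e-10)             # L1 normalization as in [26]
            feats[(y, x)] = np.concatenate([c, [x, y]])   # append the location cue
    return feats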
4.2 Results
Sowerby Dataset. The Sowerby dataset is a widely used benchmark for CRFs,
which contains 7 outdoor rural landscape classes. The dataset comprises 104
images at a resolution of 96×64 pixels. Following the protocol of [5] we randomly
selected 60 images for training and 44 images for testing. Some example images
with inferred labels are shown in figure 2. However, this dataset does neither
contain image sequences nor cars that can be detected with an object detector
Table 1. Pixel-wise accuracy on the Sowerby dataset

                        Unary classification   Plain CRF model
He et al. [4]                  82.4%                89.5%
Kumar & Hebert [5]             85.4%                89.3%
Shotton et al. [6]             85.6%                88.6%
This paper                     84.5%                91.1%
and thus we can only compare our plain CRF model (equation 1) with previous
work on this set.
The experiments show that our features and CRF parameter estimation is
competitive to other state-of-the-art methods. Table 1 gives an overview of pre-
viously published results and how those compare to our model (see figure 3).
While the more sophisticated Texton features [6] do better for unary classifica-
tion, our CRF model can outperform those since our edge potentials are learned
from training data. For this dataset we use a grid with one node for each input
pixel, while the Gaussian prior σ was set to 1.25. The Walsh-Hadamard trans-
form was run on the input images at the aperture size of 2, 4, 8 and 16 pixels.
Moreover, we used a global set of weights for the isotropic linear classifiers of
the edge potentials, but distinguish between north-south neighborhood relations
and east-west neighborhood relations.
Dynamic Scenes Dataset. To evaluate our object and dynamic CRF we set
up a new dynamic scenes dataset with image sequences consisting of overall 1936
images1 . The images are taken from a camera inside a driving car and mainly
show rural roads with high dynamics of driving vehicles at an image resolution
of 752 × 480 pixels. Cars appear at all scales from as small as 15 pixels up to 200
pixels. The database consists of 176 sequences with 11 successive images each.
It is split into equal size training and test sets of 968 images.
¹ The dataset is available at http://www.mis.informatik.tu-darmstadt.de.
To evaluate pixel level labeling accuracy the last frame of each sequence is
labeled pixel-wise, while the remainder only contains bounding box annotations
for the frontal and rear view car object class. Overall, the dataset contains the
eight labels void, sky, road, lane marking, building, trees & bushes, grass and
car. Figure 3 shows some sample scenes. For the following experiments we used
8 × 8 pixels for each CRF grid node and texture features were extracted at the
aperture sizes of 8, 16 and 32 pixels.
We start with an evaluation of the unary classifier performance on the scene
class layer. Table 2 lists the pixel-wise classification accuracy for different varia-
tions of the feature. As expected location is a valuable cue, since there is a huge
variation in appearance due to different lighting conditions. Those range from
bright and sunny illumination with cast shadows to overcast. Additionally, mo-
tion blur and weak contrast complicate the pure appearance-based classification.
Further, we observe that normalization [26] as well as multi-scale features are
helpful to improve the classification results.
Table 2. Pixel-wise accuracy of the unary classifier for different feature variations

                         Normalization on                Normalization off
                   multi-scale    single-scale     multi-scale    single-scale
Location on           82.2%          81.1%            79.7%          79.7%
Location off          69.1%          64.1%            62.3%          62.3%
Next, we analyze the performance of the different proposed CRF models. On the
one hand we report the overall pixel-wise accuracy. On the other hand the pixel-
wise labeling performance on the car object class is of particular interest. Overall,
car pixels cover 1.3% of the overall observed pixels. Yet, those are an important
fraction for many applications and thus we also report those for our evaluation.
For the experiments we used anisotropic linear edge potential classifiers with
16 parameter sets, arranged in four rows and four columns. Moreover, we dis-
tinguish between north-south and east-west neighborhoods. For the inter-layer
edge potentials we trained different weight sets depending on detection scale
(discretized in 6 bins) and depending on the neighborhood location with respect
to the object’s center.
Table 3 shows recall and precision for the proposed models. Firstly, the em-
ployed detector has an equal error rate of 78.8% when the car detections are eval-
uated in terms of precision and recall. When evaluated on a pixel-wise basis the
performance corresponds to 60.2% recall. The missing 39.8% are mostly due to
the challenging dataset. It contains cars with weak contrast, cars at small scales
and partially visible cars leaving the field of view. Precision for the detector eval-
uated on pixels is 37.7%. Wrongly classified pixels are mainly around the objects
and on structured background on which the detector obtains false detections.
Table 3. Pixel-wise recall and precision for the pixels labeled as Car and overall
accuracy on all classes
Let us now turn to the performance of the different CRF models. Without
higher level information from an object detector plain CRFs in combination
with texture-location features achieve a recall of 50.1% with a precision of 57.7%.
The recognition of cars in this setup is problematic since CRFs optimize a global
energy function, while the car class only constitutes a minor fraction of the data.
Thus, the result is mainly dominated by classes which occupy the largest regions
such as sky, road and trees.
With higher level object information (object CRF ) recall can be improved up
to 62.9% with slightly lower precision resulting from the detector’s false positive
detections. However, when objects are additionally tracked with a Kalman filter,
we achieve a recall of 70.4% with a precision of 57.8%. This proves that the
object labeling for the car object class leverages from the object detector and
additionally from the dynamic modeling by a Kalman filter.
Additionally, we observe an improvement of the overall labeling accuracy.
While plain CRFs obtain an accuracy of 88.3%, the object CRF achieves 88.6%
while also including object dynamics further improves the overall labeling accu-
racy to 88.7%. The relative number of 0.4% might appear low, but considering
that the database overall only has 1.3% of car pixels, this is worth noting. Thus,
we conclude that not only the labeling on the car class is improved but also the
overall scene labeling quality.
When the scene dynamics are modeled additionally and posteriors are prop-
agated over time (dynamic CRF ), we again observe an improvement of the
achieved recall from 25.5% to 75.7% with the additional object nodes. And also
the objects’ dynamic model can further improve the recall to 78.0% correctly
labeled pixels. Thus, again we can conclude that the CRF model exploits both
the information given by the object detector as well as the additional object
dynamic to improve the labeling quality.
Finally, when the overall accuracy is analyzed while the scene dynamic is
modeled we observe a minor drop compared to the static modeling. However, we
again consistently observe that the object information and their dynamics allow
to improve from 86.5% without object information to 87.1% with object CRFs
and to 88.1% with the full model.
The consistently slightly worse precision and overall accuracy for the dynamic
scene models need to be explained. Non-car pixels wrongly labeled as car are
mainly located at the object boundary, which are mainly due to artifacts of the
Fig. 3. Dynamic scenes dataset: example scene labelings and corresponding detections,
shown in left-right order (best viewed in color); note that detections can be overruled
by the texture-location potentials and vice versa
scene label forward propagation. Those are introduced by the inaccuracies of the
speedometer and due to the inaccuracies of the projection estimation.
A confusion matrix for all classes of the dynamic scenes database can be
found in Table 4. Figure 3 shows sample detections and scene labelings for the
different CRF models to illustrate the impact of the different models and their
improvements. In example (d) for instance the car which is leaving the field of
view is mostly smoothed out by a plain CRF and object CRF, while the dynamic
CRF is able to classify almost the entire area correctly. Additionally, the smaller
cars which get smoothed out by a plain CRF are classified correctly by the object
and dynamic CRF. Also note that false object detections as in example (c) do
not result in a wrong labeling of the scene.
Table 4. Confusion matrix in percent for the dynamic scenes dataset; entries are
row-normalized
True class (fraction) \ Inferred | Sky  | Road | Lane marking | Trees & bushes | Grass | Building | Void | Car
Sky (10.4%)                      | 91.0 |  0.0 |  0.0 |  7.7 |  0.5 |  0.4 |  0.3 |  0.1
Road (42.1%)                     |  0.0 | 95.7 |  1.0 |  0.3 |  1.1 |  0.1 |  0.5 |  1.3
Lane marking (1.9%)              |  0.0 | 36.3 | 56.4 |  0.8 |  2.9 |  0.2 |  1.8 |  1.6
Trees & bushes (29.2%)           |  1.5 |  0.2 |  0.0 | 91.5 |  5.0 |  0.2 |  1.1 |  0.4
Grass (12.1%)                    |  0.4 |  5.7 |  0.5 | 13.4 | 75.3 |  0.3 |  3.5 |  0.9
Building (0.3%)                  |  1.6 |  0.2 |  0.1 | 37.8 |  4.4 | 48.4 |  6.3 |  1.2
Void (2.7%)                      |  6.4 | 15.9 |  4.1 | 27.7 | 29.1 |  1.4 | 10.6 |  4.8
Car (1.3%)                       |  0.3 |  3.9 |  0.2 |  8.2 |  4.9 |  2.1 |  2.4 | 78.0
5 Conclusions
In this work we have presented a unifying model for joint scene and object class
labeling. While CRFs greatly improve unary pixel-wise classification of scenes, they tend to smooth out smaller regions and objects such as cars in landscape scenes. This is particularly true when objects comprise only a small fraction of the overall pixels. We showed that adding higher level information from
a state-of-the-art HOG object detector ameliorates this shortcoming. Further
improvement – especially when objects are only partially visible – is achieved
when object dynamics are properly modeled and when scene labeling information
is propagated over time. The improvement obtained is bidirectional: on the one hand, the labeling of object classes is improved; on the other hand, the remaining scene classes also benefit from the additional source of information.
For future work we would like to investigate how relations between different
objects such as partial occlusion can be modeled when multiple object classes
are detected. Additionally, we seek to improve the ego-motion estimation of the camera to further improve performance. This will also allow us to employ motion features in the future. Finally, we expect that the integration of different sensors such as radar will allow for a further improvement of the results.
References
1. Everingham, M., Zisserman, A., Williams, C., van Gool, L.: The pascal visual
object classes challenge results. Technical report, PASCAL Network (2006)
2. Torralba, A.: Contextual priming for object detection. IJCV, 169–191 (2003)
3. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. In: CVPR
(2006)
4. He, X., Zemel, R.S., Carreira-Perpiñán, M.Á.: Multiscale conditional random fields
for image labeling. In: CVPR (2004)
5. Kumar, S., Hebert, M.: A hierarchical field framework for unified context-based
classification. In: ICCV (2005)
6. Shotton, J., Winn, J., Rother, C., Criminisi, A.: Textonboost: Joint appearance,
shape and context modeling for multi-class object recognition and segmentation. In:
Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951. Springer,
Heidelberg (2006)
7. Torralba, A., Murphy, K.P., Freeman, W.T.: Contextual models for object detec-
tion using boosted random fields. In: NIPS (2004)
8. McCallum, A., Rohanimanesh, K., Sutton, C.: Dynamic conditional random fields
for jointly labeling multiple sequences. In: NIPS Workshop on Syntax, Semantics
(2003)
9. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In: ICML (2001)
10. Kumar, S., Hebert, M.: Discriminative random fields: A discriminative framework
for contextual interaction in classification. In: ICCV (2003)
11. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
12. Verbeek, J., Triggs, B.: Region classification with markov field aspect models. In:
CVPR (2007)
13. Quattoni, A., Collins, M., Darrell, T.: Conditional random fields for object recog-
nition. In: NIPS (2004)
14. Kapoor, A., Winn, J.: Located hidden random fields: Learning discriminative parts
for object detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3954. Springer, Heidelberg (2006)
15. Winn, J., Shotton, J.: The layout consistent random field for recognizing and seg-
menting partially occluded objects. In: CVPR (2006)
16. Wang, Y., Ji, Q.: A dynamic conditional random field model for object segmenta-
tion in image sequences. In: CVPR (2005)
17. Yin, Z., Collins, R.: Belief propagation in a 3D spatio-temporal MRF for moving
object detection. In: CVPR (2007)
18. Leibe, B., Cornelis, N., Cornelis, K., Van Gool, L.: Dynamic 3D scene analysis
from a moving vehicle. In: CVPR (2007)
19. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: Efficient boosting
procedures for multiclass object detection. In: CVPR (2004)
20. Platt, J.: Probabilistic outputs for support vector machines and comparison to
regularized likelihood methods. In: Smola, A.J., Bartlett, P., Schoelkopf, B., Schu-
urmans, D. (eds.) Advances in Large Margin Classifiers, pp. 61–74 (2000)
21. Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans-
actions of the ASME–Journal of Basic Engineering 82, 35–45 (1960)
22. Sutton, C., McCallum, A.: Piecewise training for undirected models. In: 21th An-
nual Conference on Uncertainty in Artificial Intelligence (UAI 2005) (2005)
23. Vishwanathan, S.V.N., Schraudolph, N.N., Schmidt, M.W., Murphy, K.P.: Accel-
erated training of conditional random fields with stochastic gradient methods. In:
ICML (2006)
24. Hel-Or, Y., Hel-Or, H.: Real-time pattern matching using projection kernels.
PAMI 27, 1430–1445 (2005)
25. Alon, Y., Ferencz, A., Shashua, A.: Off-road path following using region classifica-
tion and geometric projection constraints. In: CVPR (2006)
26. Varma, M., Zisserman, A.: Classifying images of materials: Achieving viewpoint
and illumination independence. In: Tistarelli, M., Bigun, J., Jain, A.K. (eds.)
ECCV 2002. LNCS, vol. 2359. Springer, Heidelberg (2002)
27. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database
and web-based tool for image annotation. IJCV 77, 157–173 (2008)
Local Regularization for Multiclass Classification
Facing Significant Intraclass Variations
1 Introduction
Object recognition systems, viewed as learning systems, face three major chal-
lenges: First, they are often required to discern between many objects; second,
images taken under uncontrolled settings display large intraclass variation; and
third, the number of training images provided is often small.
Previous attempts to overcome these challenges use prior generic knowledge
on variations within object classes [1], employ large amounts of unlabeled data (e.g., [2]), or reuse previously learned visual features [3]. Here, we propose a more generic solution that neither assumes nor benefits from the existence of prior learning stages or of an additional set of training images.
To deal with the challenge of multiple classes, we propose a Canonical Cor-
relation Analysis (CCA) based classifier, which is a regularized version of a
recently proposed method [4], and is highly related to Fisher Discriminant Anal-
ysis (LDA/FDA). We treat the other two challenges as one since large intraclass
variations and limited training data both result in a training set that does not
capture well the distribution of the input space. To overcome this, we propose a
new local learning scheme which is based on the principle of decisiveness.
In local learning schemes, some of the training is deferred to the prediction
phase, and a new classifier is trained for each new (test) example. Such schemes
have been introduced by [5] and were recently advanced and shown to be ef-
fective for modern object recognition applications [6] (see references therein for
additional references to local learning methods). One key difference between our
method and the previous contribution in the field is that we do not select or di-
rectly weigh the training examples by their proximity to the test point. Instead,
we modify the objective function of the learning algorithm to reward components
in the resulting classifier that are parallel to the test example. Thus, we encour-
age the classification function (before thresholding takes place) to be separated
from zero.
Runtime is a major concern for local learning schemes, since a new clas-
sifier needs to be trained or adjusted for every new test example. We show
how the proposed classifier can be efficiently computed by several rank-one up-
dates to precomputed eigenvectors and eigenvalues of constant matrices, with
the resulting time complexity being significantly lower than that of a full eigen-
decomposition. We conclude by showing the proposed methods to be effective
on four varied datasets which exhibit large intraclass variations.
\min_{A,V} \sum_{i=1}^{n} \| A^\top x_i - V^\top z_i \|^2 \qquad (2)

subject to \quad A^\top X X^\top A = V^\top Z Z^\top V = I \qquad (4)
This problem is solved through Canonical Correlation Analysis (CCA) [7]. A
simple solution involves writing the corresponding Lagrangian and setting the
partial derivatives to zero, yielding the following generalized eigenproblem:
\begin{pmatrix} 0 & XZ^\top \\ ZX^\top & 0 \end{pmatrix} \begin{pmatrix} a_i \\ v_i \end{pmatrix} = \lambda_i \begin{pmatrix} XX^\top & 0 \\ 0 & ZZ^\top \end{pmatrix} \begin{pmatrix} a_i \\ v_i \end{pmatrix} \qquad (5)
where λ_i, i = 1..l are the leading generalized eigenvalues, a_i are the columns of A, and v_i are, as defined above, the columns of V. To classify a new sample x, it is first transformed to A^\top x and then compared to the k class vectors, i.e., the predicted class is given by \arg\min_{1 \le j \le k} \|A^\top x - v_j\|.
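As an illustration of this classifier, the following is a minimal sketch of multiclass CCA training and prediction via the generalized eigenproblem of Equation 5 (with the ridge regularization discussed below). It assumes X is the m × n data matrix and Z the k × n class-indicator matrix, keeps l = k leading directions by default, and all function names are ours; the solver's eigenvector normalization differs from the paper's exact constraints by a per-component scaling.

```python
import numpy as np
from scipy.linalg import eigh

def multiclass_cca_train(X, Z, eta_frac=0.1, l=None):
    """Multiclass CCA classifier: solve the generalized eigenproblem of Equation 5.
    X: m x n data matrix (columns are samples); Z: k x n class-indicator matrix.
    eta_frac: ridge parameter as a fraction of the largest eigenvalue of X X^T."""
    m, n = X.shape
    k = Z.shape[0]
    l = k if l is None else l
    Cxx, Czz, Cxz = X @ X.T, Z @ Z.T, X @ Z.T
    eta = eta_frac * np.linalg.eigvalsh(Cxx).max()
    M = np.block([[np.zeros((m, m)), Cxz],
                  [Cxz.T, np.zeros((k, k))]])
    B = np.block([[Cxx + eta * np.eye(m), np.zeros((m, k))],
                  [np.zeros((k, m)), Czz]])
    w, U = eigh(M, B)                       # ascending generalized eigenvalues
    U = U[:, ::-1][:, :l]                   # keep the l leading directions
    A, V = U[:m], U[m:]
    # note: eigh normalizes [a; v] jointly; the per-block constraints of the
    # paper differ from this by a per-component scaling, glossed over here.
    return A, V

def multiclass_cca_predict(A, V, x):
    """Predict arg min_j ||A^T x - v_j||, treating the rows of V as class vectors."""
    proj = A.T @ x
    return int(np.argmin(np.linalg.norm(V - proj[None, :], axis=1)))
```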
This classification scheme is readily extendable to non-linear functions that
satisfy Mercer’s conditions by using Kernel CCA [8,9]. Kernel CCA is also equiv-
alent to solving a generalized eigenproblem of the form of Equation 5, so although
we refer directly to linear CCA throughout this paper, our conclusions are equally
valid for Kernel CCA.
In Kernel CCA, or in the linear case when m > n, and in many other com-
mon scenarios, the problem is ill-conditioned and regularization techniques are
required [10]. For linear regression, ridge regularization is often used, as is its
equivalent in CCA and Kernel CCA [8]. This involves replacing XX^\top and ZZ^\top in Equation 5 with XX^\top + η_X I and ZZ^\top + η_Z I, where η_X and η_Z are regularization parameters. In the CCA case presented here, for multiclass classification, since the number of training examples n is not smaller than the number of classes k, regularization need not be used for Z and we set η_Z = 0. Also, since the X regularization is relative to the scale of the matrix XX^\top, we scale the regularization parameter η_X as a fraction of the largest eigenvalue of XX^\top.
The multiclass classification scheme via CCA presented here is equivalent to
Fisher Discriminant Analysis (LDA). We provide a brief proof of this equivalence.
A previous lemma was proven by Yamada et al. [4] for the unregularized case.
Lemma 1. The multiclass CCA classification method learns the same linear
transformation as multiclass LDA.
Proof. The generalized eigenvalue problem in Equation 5, with added ridge reg-
ularization, can be represented by the following two coupled equations:
(XX^\top + ηI_m)^{-1} X Z^\top v = λ a \qquad (6)

(Z Z^\top)^{-1} Z X^\top a = λ v \qquad (7)

Any solution (a, v, λ) to the above system satisfies:

(XX^\top + ηI_m)^{-1} XZ^\top (ZZ^\top)^{-1} ZX^\top a = (XX^\top + ηI_m)^{-1} XZ^\top λ v = λ^2 a \qquad (8)

(ZZ^\top)^{-1} ZX^\top (XX^\top + ηI_m)^{-1} XZ^\top v = (ZZ^\top)^{-1} ZX^\top λ a = λ^2 v \qquad (9)
Thus the columns of the matrix A are the eigenvectors corresponding to the largest eigenvalues of (XX^\top + ηI_m)^{-1} XZ^\top (ZZ^\top)^{-1} ZX^\top. Examine the product ZZ^\top = \sum_{i=1}^{n} e_{y_i} e_{y_i}^\top. It is a k × k diagonal matrix with the number of training samples in each class (denoted N_i) along its diagonal. Therefore, (ZZ^\top)^{-1} = diag(1/N_1, 1/N_2, . . . , 1/N_k). Now examine XZ^\top: (XZ^\top)_{i,j} = \sum_{s=1}^{n} X_{i,s} Z_{j,s} = \sum_{s: y_s = j} X_{i,s}. Hence, the j'th column is the sum of all training samples of the class j. Denote by X̄_j the mean of the training samples belonging to the class j; then the j'th column of XZ^\top is N_j X̄_j. It follows that

XZ^\top (ZZ^\top)^{-1} ZX^\top = \sum_{j=1}^{k} \frac{N_j^2}{N_j} X̄_j X̄_j^\top = \sum_{j=1}^{k} N_j X̄_j X̄_j^\top = S_B \qquad (10)
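As a quick numerical sanity check of Equation (10), the following few lines verify that XZ^\top (ZZ^\top)^{-1} ZX^\top equals \sum_j N_j X̄_j X̄_j^\top; the dimensions and data are arbitrary choices for illustration.

```python
import numpy as np

# Check of Equation (10): XZ^T (ZZ^T)^{-1} ZX^T == sum_j N_j * mean_j mean_j^T.
m, n, k = 5, 40, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))
y = np.arange(n) % k                       # class labels (every class non-empty)
Z = np.eye(k)[:, y]                        # k x n indicator, column i equals e_{y_i}

lhs = X @ Z.T @ np.linalg.inv(Z @ Z.T) @ Z @ X.T
means = np.stack([X[:, y == j].mean(axis=1) for j in range(k)], axis=1)   # m x k
Nj = np.array([(y == j).sum() for j in range(k)])
S_B = (means * Nj) @ means.T               # sum_j N_j mean_j mean_j^T
assert np.allclose(lhs, S_B)
```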
Our analysis below uses the CCA formulation; the LDA case is equivalent, with
some minor modifications to the way the classification is done after the linear
transformation is applied.
subject to \quad A^\top X X^\top A = V^\top Z Z^\top V = I \qquad (11)

Note that tr(A^\top x x^\top A) = \|A^\top x\|^2, and the added term reflects the principle of decisiveness. ᾱ is a parameter corresponding to the trade-off between the correlation term and the decisiveness term. Adding ridge regularization as before to the
solution of Equation 11, and setting α = ᾱλ^{-1}, gives the following generalized eigenproblem:

\begin{pmatrix} 0 & XZ^\top \\ ZX^\top & 0 \end{pmatrix} \begin{pmatrix} a \\ v \end{pmatrix} = λ \begin{pmatrix} XX^\top + ηI - αxx^\top & 0 \\ 0 & ZZ^\top \end{pmatrix} \begin{pmatrix} a \\ v \end{pmatrix} \qquad (12)
Note that this form is similar to the CCA based multiclass classifier presented in Section 2 above, except that the ridge regularization matrix ηI is replaced by the local regularization matrix ηI − αxx^\top. We proceed to analyze the significance of this form of local regularization. In ridge regression, the influence of all eigenvectors is weakened uniformly by adding η to all eigenvalues before computation of the inverse. This form of regularization encourages smoothness in the learned transformation. In our version of local regularization, smoothness is still achieved by the addition of η to all eigenvalues. The smoothing effect is weakened, however, by α, in the component parallel to x. This can be seen from the representation xx^\top = U_x λ_x U_x^\top, for U_x^\top U_x = U_x U_x^\top = I, with λ_x = diag(\|x\|^2, 0, . . . , 0). Now ηI − αxx^\top = U_x(ηI − αλ_x)U_x^\top, and the eigenvalues of the regularization matrix are (η − α, η, η, . . . , η). Hence, the component parallel to x is multiplied by η − α while all others are multiplied by η. Therefore, encouraging decisiveness by adding the term α\|A^\top x\|^2 to the maximization goal is a form of regularization in which the component parallel to x is smoothed less than the other components.
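The eigenvalue claim above is easy to verify numerically; the short check below assumes a unit-norm x (so that η − α‖x‖² reads as η − α), with arbitrary values for η, α and the dimension.

```python
import numpy as np

# Eigenvalues of the local regularizer eta*I - alpha*x*x^T for a unit-norm x:
# eta - alpha along x, and eta in all orthogonal directions.
m, eta, alpha = 6, 1.0, 0.3
rng = np.random.default_rng(1)
x = rng.normal(size=m)
x /= np.linalg.norm(x)                     # unit norm, so ||x||^2 = 1
R = eta * np.eye(m) - alpha * np.outer(x, x)
w = np.sort(np.linalg.eigvalsh(R))
assert np.isclose(w[0], eta - alpha) and np.allclose(w[1:], eta)
```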
4 Efficient Implementation
In this section we analyze the computational complexity of our method, and
propose an efficient update algorithm that allows it to be performed in time
comparable to standard CCA with ridge regularization. Our algorithm avoids
fully retraining the classifier for each testing example by training it once using
standard CCA with uniform ridge regularization, and reusing the results in the
computation of the local classifiers.
Efficient training of a uniformly regularized multiclass CCA classifier.
In the non-local case, training a multiclass CCA classifier consists of solving
Equations 6 and 7, or, equivalently, Equations 8 and 9. Let r = min(m, k), and
note that we assume m ≤ n, since the rank of the data matrix is at most n,
and if m > n we can change basis to a more compact representation. To solve
Equations 8 and 9, it is enough to find the eigenvalues and eigenvectors of an r × r square matrix. Computing (XX^\top + ηI_m)^{-1} and (ZZ^\top)^{-1} and reconstructing the full classifier (A and V) given the eigenvalues and eigenvectors of the r × r matrix above can be done in O(m^3 + k^3). While this may be a reasonable effort if done once, it may become prohibitive if done repeatedly for each new test
if done once, it may become prohibitive if done repeatedly for each new test
example. This, however, as we show below, is not necessary.
Representing the local learning problem as a rank-one modification.
We first show the problem to be equivalent to the Singular Value Decomposition
(SVD) of a (non-symmetric) matrix, which is in turn equivalent to the eigen-
decomposition of two symmetric matrices. We then prove that one of these two
V̄ = (ZZ^\top)^{1/2} V. By the constraints (Equation 11, with added ridge and local regularization), the problem becomes a minimization over Ā and V̄,

subject to \quad Ā^\top Ā = V̄^\top V̄ = I \qquad (13)

Define J_X = (XX^\top + η_X I_m)^{-1}. Using the Sherman–Morrison identity [12],

(XX^\top + η_X I_m − α xx^\top)^{-1} = (J_X^{-1} − α xx^\top)^{-1} = J_X + \frac{α J_X xx^\top J_X}{1 − α x^\top J_X x} = J_X + \frac{α}{1 − α x^\top J_X x}(J_X x)(J_X x)^\top = (XX^\top + η_X I_m)^{-1} + β bb^\top \qquad (16)

so that

M^\top M = (ZZ^\top)^{-1/2} ZX^\top \big[(XX^\top + ηI_m)^{-1} + β bb^\top\big] XZ^\top (ZZ^\top)^{-1/2}
        = M_0^\top M_0 + β (ZZ^\top)^{-1/2} ZX^\top b b^\top XZ^\top (ZZ^\top)^{-1/2}
        = M_0^\top M_0 + β cc^\top \qquad (17)
ξ_i = \frac{(S − λ_i I)^{-1} z}{\|(S − λ_i I)^{-1} z\|} \qquad (20)

using O(k) operations for each eigenvector, and O(k^2) in total, to arrive at the representation

M^\top M = R_0 R_1 Σ_1 R_1^\top R_0^\top \qquad (21)
Explicit evaluation of Equation 21 to find V̂ requires multiplying k × k matrices, which should be avoided to keep the complexity O(m^2 + k^2). The key observation is that we do not need to find V explicitly, but only A^\top x − v_i for i = 1, 2, . . . , k, with v_i being the i'th class vector (Equation 1). The distances we seek can be evaluated using \|v_i\|^2 = N_i (see Section 3). Hence, finding all exact distances can be done by computing x^\top AA^\top x and V A^\top x, since v_i^\top is the i'th row of V. Transforming back from V̄ to V gives V = (ZZ^\top)^{-1/2} V̄, where (ZZ^\top)^{-1/2} needs to be computed only once.
All the matrices in Equation 23 are known after the first O(k^3 + m^3) computation and O(k^2 + m^2) additional operations per test example, as we have shown.
V A^\top x = (ZZ^\top)^{-1/2} R_0 R_1 A^\top x \qquad (24)
Thus, the distances of the transformed test vector x from all class vectors can be computed in time O(m^2 + k^2), which is far quicker than the O(m^3 + k^3) required to train the classifier from scratch using a full SVD. Note that the transformation of a new vector without local regularization requires O(ml) operations, and the classification itself O(kl) operations. The difference in classification time for a new test vector is therefore O(m^2 + k^2) with local regularization compared to O((m + k)l) with uniform regularization.
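A minimal sketch of the rank-one update behind Equation (16), implemented with the Sherman–Morrison identity [12]: given the precomputed J_X = (XX^\top + ηI_m)^{-1}, the locally regularized inverse is obtained in O(m²) instead of O(m³). Variable and function names are ours, and the dimensions in the check are arbitrary.

```python
import numpy as np

def local_inverse(J_X, x, alpha):
    """(X X^T + eta*I - alpha*x*x^T)^{-1} from the precomputed J_X = (X X^T + eta*I)^{-1},
    via the Sherman-Morrison identity [12]; costs O(m^2) per test example."""
    b = J_X @ x
    beta = alpha / (1.0 - alpha * (x @ b))  # assumes alpha * x^T J_X x != 1
    return J_X + beta * np.outer(b, b)

# quick check against a direct O(m^3) inverse (arbitrary sizes)
rng = np.random.default_rng(2)
m, eta, alpha = 8, 0.5, 0.1
X = rng.normal(size=(m, 3 * m))
x = rng.normal(size=m)
x /= np.linalg.norm(x)
J_X = np.linalg.inv(X @ X.T + eta * np.eye(m))
direct = np.linalg.inv(X @ X.T + eta * np.eye(m) - alpha * np.outer(x, x))
assert np.allclose(local_inverse(J_X, x, alpha), direct)
```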
5 Experiments
We report results on 3 data sets: a new Dog Breed data set, the CalPhotos
Mammals collection [15], and the “Labeled Faces in the Wild” face recognition
data set [16]. These data sets exhibit a large amount of intraclass variation.
The experiments in all cases are similar and consist of multiclass classifica-
tion. We compare the following algorithms: Nearest Neighbor, Linear All-Vs-All
SVM (a.k.a. “pairwise”, “All-Pairs”), Multiclass CCA (the method of Section 2),
and Local Multiclass CCA (Section 3). The choice of using All-Vs-All SVM is
based on its simplicity and relative efficiency. A partial set of experiments ver-
ified that One-Vs-All SVM classifiers perform similarly. It is well established in
the literature that the performance of other multiclass SVM schemes is largely
similar [6,17]. Similar to other work in object recognition we found Gaussian-
kernel SVM to be ineffective, and to perform worse than Linear SVM for every
kernel parameter we tried. Evaluating the performance of non-linear versions of
Multiclass CCA and Local Multiclass CCA is left for future work.
We also compare against the conventional local learning scheme [5], which was developed further in [6]. In this scheme, the k nearest neighbors of each test point are used to train a classifier. In our experiments we have scanned over a large range of possible neighborhood sizes k to verify that this scheme does not outperform
our local learning method regardless of k. Due to the computational demands of
such tests, they were only performed on two out of the four data sets.
Each of the described experiments was repeated 20 times. In each repetition a
new split to training and testing examples was randomized, and the same splits
were used for all algorithms. Note that due to the large intraclass variation,
the standard deviation of the result is typically large. Therefore, we use paired
t-tests to verify that the reported results are statistically significant.
Parameter selection. The regularization parameter of the linear SVM algo-
rithm was selected by a 5-fold cross-validation. Performance, however, is pretty
stable with respect to this parameter. The regularization parameter of Multiclass
CCA and Local Multiclass CCA η was fixed at 0.1 times the leading eigenvalue
of XX
, a value which seems to be robust in a large variety of synthetic and real
Fig. 1. Sample images from the Dog Breed and CalPhoto Mammal data sets
data sets. The local regularization parameter β was set at 0.5η in all experiments,
except for the ones done to evaluate its effect on performance.
Image representation. The visual descriptors of the images in the Dog Breed and CalPhotos Mammals data sets are computed by the Bag-of-SIFT implementation of Andrea Vedaldi [18]. This implementation uses hierarchical K-means [19] for partitioning the descriptor space. Keypoints are selected at random locations [20]. Note that the dictionary for this representation was recomputed in each run in order to avoid the use of testing data during training. Using the default parameters, this representation results in vectors of length 11,111.
The images in the face data set are represented using the Local Binary Pattern (LBP) [21] image descriptor, which was adopted for face identification by [22]. An LBP is created at a particular pixel location by thresholding the 3 × 3 neighborhood surrounding the pixel with the central pixel's intensity value, and treating the subsequent pattern as a binary number. Following [22], we set a radius of 2 and sample at the boundaries of 5-pixel blocks, and bin all patterns for which there are more than 2 transitions from 0 to 1 into a single bin. LBP representations for a given image are generated by dividing the image into several windows and creating histograms of the LBPs within each window.
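For concreteness, the sketch below computes a basic 8-neighbour LBP at radius 1 and concatenates per-window histograms; the experiments use the radius-2, uniform-pattern variant of [22], so this is a simplified stand-in rather than the exact descriptor, and the function names are ours.

```python
import numpy as np

def basic_lbp(img):
    """Basic 8-neighbour LBP at radius 1 (simplified stand-in for the descriptor of [22])."""
    img = img.astype(np.int32)
    c = img[1:-1, 1:-1]                                        # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]               # clockwise neighbours
    code = np.zeros_like(c)
    h, w = img.shape
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (nb >= c).astype(np.int32) << bit              # threshold against the centre
    return code

def lbp_histograms(img, win=16, bins=256):
    """Concatenate per-window LBP histograms over non-overlapping win x win blocks."""
    code = basic_lbp(img)
    hists = []
    for y in range(0, code.shape[0] - win + 1, win):
        for x in range(0, code.shape[1] - win + 1, win):
            h, _ = np.histogram(code[y:y + win, x:x + win], bins=bins, range=(0, bins))
            hists.append(h)
    return np.concatenate(hists)
```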
Table 1. Mean (± standard deviation) recognition rates (in percents) for the Dog
Breed data set. Each column is for a different number of training and testing examples
per breed for the 34 dog breeds.
Fig. 2. Mean performance and standard deviation (normalized by √20) for additional experiments on the Dog Breed data set. (a) k-nearest neighbors based local learning. The x axis depicts k, the size of the neighborhood. Top line – the performance of the Multiclass CCA classifier; bottom dashed line – the performance of SVM. (b) Performance for various values of the local regularization parameter. The x axis depicts the ratio of β and η.
Table 2. Mean (± standard deviation) recognition rates (percents) for the Mammals
data set. Each column is for a different number of random classes per experiment. Each
experiment was repeated 20 times.
Table 3. Mean (± STD) recognition rates (percents) for “Labeled Faces in the Wild”.
Columns differ in the number of random persons per experiment.
Labeled Faces in the Wild. From the Labeled Faces in the Wild dataset [16], we filtered out all persons with fewer than four images; 610 persons and a total of 6,733 images remain. The images are partly aligned via funneling [23],
and all images are 256 × 256 pixels. We only use the center 100 × 100 sub-
image, and represent it by LBP features of a grid of non-overlapping 16 pixels
blocks.
The number of persons per experiment varies from 10 to 100. For each run, 10, 20, 50 or 100 random persons and 4 random images per person are selected; 2 are used for training and 2 for testing. Table 3 compares the classification results.
While the differences may seem small, they are significant (p < 0.01) and Local
Multiclass CCA leads the performance table followed by Multiclass CCA and
either NN or SVM. Additional experiments conducted for the 50 persons split
show that k-nearest neighbors based local learning hurts performance for all
values of k, for both SVM and Multiclass CCA.
Acknowledgments
This research is supported by the Israel Science Foundation (grants No. 1440/06,
1214/06), the Colton Foundation, and a Raymond and Beverly Sackler Career
Development Chair.
References
1. Fei-Fei, L., Fergus, R., Perona, P.: A bayesian approach to unsupervised one-shot
learning of object categories. In: ICCV, Nice, France, pp. 1134–1141 (2003)
2. Belkin, M., Niyogi, P.: Semi-supervised learning on riemannian manifolds. Machine
Learning 56, 209–239 (2004)
3. Bart, E., Ullman, S.: Cross-generalization: learning novel classes from a single ex-
ample by feature replacement. In: CVPR (2005)
4. Yamada, M., Pezeshki, A., Azimi-Sadjadi, M.: Relation between kernel cca and
kernel fda. In: IEEE International Joint Conference on Neural Networks (2005)
5. Bottou, L., Vapnik, V.: Local learning algorithms. Neural Computation 4 (1992)
6. Zhang, H., Berg, A.C., Maire, M., Malik, J.: Svm-knn: Discriminative nearest
neighbor classification for visual category recognition. In: CVPR (2006)
7. Hotelling, H.: Relations between two sets of variates. Biometrika 28, 321–377 (1936)
8. Akaho, S.: A kernel method for canonical correlation analysis. In: International
Meeting of Psychometric Society (2001)
9. Wolf, L., Shashua, A.: Learning over sets using kernel principal angles. J. Mach.
Learn. Res. 4, 913–931 (2003)
10. Neumaier, A.: Solving ill-conditioned and singular linear systems: A tutorial on
regularization (1998)
11. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data
mining, inference and prediction. Springer, Heidelberg (2001)
12. Sherman, J., Morrison, W.J.: Adjustment of an inverse matrix corresponding to
changes in the elements of a given column or a given row of the original matrix.
Annals of Mathematical Statistics 20, 621 (1949)
13. Golub, G.: Some modified eigenvalue problems. Technical report, Stanford (1971)
14. Bunch, J.R., Nielsen, C.P., Sorensen, D.C.: Rank-one modification of the symmetric
eigenproblem. Numerische Mathematik 31, 31–48 (1978)
15. CalPhotos: A database of photos of plants, animals, habitats and other natural
history subjects [web application], animal–mammals collection. bscit, University
of California, Berkeley,
http://calphotos.berkeley.edu/cgi/img_query?query_src=photos_index&where-lifeform=Animal--Mammal
16. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild:
A database for studying face recognition in unconstrained environments. University
of Massachusetts, Amherst, Technical Report 07-49 (2007)
17. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. Journal of Machine
Learning Research 5 (2004)
18. Vedaldi, A.: Bag of features: A simple bag of features classifier (2007),
http://vision.ucla.edu/~vedaldi/
19. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR
(2006)
20. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classi-
fication. In: European Conference on Computer Vision. Springer, Heidelberg (2006)
21. Ojala, T., Pietikainen, M., Harwood, D.: A comparative-study of texture measures
with classification based on feature distributions. Pattern Recognition 29 (1996)
22. Ahonen, T., Hadid, A., Pietikainen, M.: Face recognition with local binary pat-
terns. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024. Springer,
Heidelberg (2004)
23. Huang, G.B., Jain, V., Learned-Miller, E.: Unsupervised joint alignment of complex
images. ICCV (2007)
Saliency Based Opportunistic Search for Object Part
Extraction and Labeling
Abstract. We study the task of object part extraction and labeling, which seeks to understand objects beyond simply identifying their bounding boxes. We start
from bottom-up segmentation of images and search for correspondences between
object parts in a few shape models and segments in images. Segments comprising
different object parts in the image are usually not equally salient due to uneven
contrast, illumination conditions, clutter, occlusion and pose changes. Moreover,
object parts may have different scales and some parts are only distinctive and
recognizable in a large scale. Therefore, we utilize a multi-scale shape repre-
sentation of objects and their parts, figural contextual information of the whole
object and semantic contextual information for parts. Instead of searching over a
large segmentation space, we present a saliency based opportunistic search frame-
work to explore bottom-up segmentation by gradually expanding and bounding
the search domain. We tested our approach on a challenging statue face dataset
and 3 human face datasets. Results show that our approach significantly outper-
forms Active Shape Models using far fewer exemplars. Our framework can be
applied to other object categories.
1 Introduction
We are interested in the problem of object detection with object part extraction and la-
beling. Accurately detecting objects and labeling their parts requires going inside the
object’s bounding box to reason about object part configurations. Extracting object parts
with the right configuration is very helpful for recognizing object details. For example,
extracting facial parts helps with recognizing faces and facial expressions, while under-
standing human activities requires knowing the pose of a person.
A common approach to solve this problem is to learn specific features for object
parts [1][2]. We choose a different path which starts with bottom-up segmentation and
aligns shape models to segments in test images. Our observation is that starting from
salient segments, we are unlikely to accidentally align object parts to background edges.
Therefore, we can search efficiently and avoid accidental alignment.
Our approach includes three key components: correspondence, contextual informa-
tion and saliency of segments. There exist algorithms incorporating correspondence
and contextual information such as pictorial structures [3] and contour context selec-
tion [4], both showing good performance on some object categories. The disadvantage
Fig. 1. Saliency of contours and segments. The second image is a group of salient contours from
contour grouping [5] by setting a lower threshold to the average edge strength, while the third one
contains all the contours from contour grouping. It shows that by thresholding the saliency of con-
tour segments, we either get some foreground contours missing (under-segmented) or have a lot
of clutter come in (over-segmented). The same thing happens to image segmentation. Segments
comprising object parts pop out in different segmentation levels, representing different saliencies
(cut costs). The last three images show such a case.
is that these methods ignore image saliency. Therefore, they cannot tell accidental align-
ment of faint segments in the background from salient object part segments. However, it
is not easy to incorporate saliency. A naive way of using saliency is to find salient parts
first, and search for less salient ones depending on these salient ones. The drawback is
that a hard decision has to be made in the first step of labeling salient parts, and mistakes
arising from this step cannot be recovered later. Moreover, object parts are not equally
hard to find. Segments belonging to different object parts may pop out at different seg-
mentation levels (with different numbers of segments), as shown in Figure 1. One could
start with over-segmentation to cover all different levels. Unfortunately, by introducing
many small segments at the same time, segment saliency will be lost, which defeats the
purpose of image segmentation. Fake segmentation boundaries will also cause many
false positives of accidentally aligned object parts.
We build two-level contexts and shape representations for objects and their parts,
with the goal of high distinctiveness and efficiency. Some large object parts (e.g. facial
silhouettes) are only recognizable as a whole in a large scale, rather than as a sum of
the pieces comprising them. Moreover, hierarchical representation is more efficient for
modeling contextual relationships among model parts than a single level representation
which requires a large clique potential and long range connections. Two different levels
of contextual information is explored: figural context and semantic context. The former
captures the overall shape of the whole object, and the latter is formed by semantic
object parts.
In this paper, we propose a novel approach called Saliency Based Opportunistic
Search for object part extraction and labeling, with the following key contributions:
1. Different levels of context including both figural and semantic context are used.
2. Bottom-up image saliency is incorporated into the cost function.
3. We introduce an effective and efficient method of searching over different segmen-
tation levels to extract object parts.
2 Related Work
It has been shown that humans recognize objects by their components [6] or parts [7].
The main idea is that object parts should be extracted and represented together with
the relationships among them for matching to a model. This idea has been widely used
for the task of recognizing objects and their parts [8,9,3]. Figural and semantic contextual
information play an important role in solving this problem. Approaches that take ad-
vantage of figural context include PCA and some template matching algorithms such as
Active Shape Models (ASM) [10] and Active Appearance Models (AAM) [11]. Tem-
plate matching methods like ASM usually use local features (points or key points) as
searching cues, and constrain the search by local smoothness or acceptable variations
of the whole shape. However, these methods require good initialization. They are sen-
sitive to clutter and can be trapped in local minima. Another group of approaches are
part-based models, which focus on semantic context. A typical case is pictorial struc-
ture [3]. Its cost function combines both the individual part matching cost and pair-wise
inconsistency penalties. The drawback of this approach is that it has no figural context
measured by the whole object. It may end up with many “OK” part matches without a
global verification, especially when there are many faint object edges and occlusions in
the image. Recently, a multiscale deformable part model was proposed to detect objects
based on deformable parts [1], which is an example that uses both types of contextual
information. However, it focuses on training deformable local gradient-based features
for detecting objects, but not extracting object parts out of the images.
Problem definition. The problem we are trying to solve is to extract and label ob-
ject parts based on contextual information, given an image and its segmentations, as
shown in Figure 1. Multiple models are used to cover some variations of the object (see
Figure 2 for the models we have used on faces). Extracting and labeling object parts
requires finding the best matched model. The problem can be formulated as follows:
Input
Output
This can be formulated as a shape matching problem, which aims to find sets of
segments whose shapes match to part models. However, the segments comprising the
object parts are not equally hard to extract from the image, and grouping them to objects
Fig. 2. Different models for faces. They are hand designed models obtained from 7 real images,
each of them representing one pose. Facial features are labeled in different colors.
and object parts also requires checking the consistency among them. We call these ef-
forts “grouping cost”, which is not measured by shape but can be helpful to differentiate
segments belonging to object parts from those belonging to the background. Therefore,
we combine these two into such a cost function:
C^{shape} measures the shape matching cost between shape models and labeled segments in the image, and relies heavily on correspondence and context. C^{grouping} is the grouping cost, which can be measured in different ways; in this paper it is mainly the bottom-up saliency based editing cost.
The cost function above is based on the following three key issues.
Saliency based editing. Segmentation has problems when the image segments have
different saliencies. Under-segmentation could end up with unexpected leakages, while
over-segmentation may introduce clutter. A solution for this problem is to do some
local editings. For example, adding a small virtual edge at the leakage place can make
the segmentation much better without increasing the number of segments. Zoom-in in
a small area is also a type of editing that can be effective and efficient, as presented in
Figure 1. Small costs for editing can result in a large improvement of the shape matching cost. This is based on the shape integrity and the non-additive distance between shapes. However, editings need the contextual information from the model.
Suppose there are a set of possible editing operations z which might lead to better
segmentation. zk = 1 means that editing k is chosen, otherwise zk = 0. Note that
usually it is very hard to discover and precompute all the editings beforehand. Therefore, the editing index vector z is dynamic and is extended on the fly. After performing some editings, new segments (part hypotheses) emerge, while the original segments/parts are kept. Therefore, a new variable y_{edit} = y_{edit}(y, z) is used to denote all the available elements, which include both the original ones in y and the new ones induced by editing z. Let C^{edit}_k be the edit cost for editing k.
Our cost function (1) of object part labeling and extraction can be written as follows:

\min \; \sum_{i=1}^{N_a} \Big[ β \cdot \sum_{j=1}^{N_b} u_{ij} C^{M↔I}_{ij}(x, y_{edit}) + C^{F↔M}_{i}(x, u) \Big] + \sum_{k} C^{edit}_{k} z_k \qquad (2)

s.t. \quad \sum_j u_{ij} \le 1, \quad i = 1, . . . , N_a

The three summations in Equation (2) correspond to three different types of cost: the mismatch cost C^{M↔I}(x, y_{edit}, u), the miss cost C^{F↔M}(x, u) and the edit cost C^{edit}(z). The mismatch cost C^{M↔I}_{ij}(x, y_{edit}) = \|f_i(x) − f_j(y_{edit})\| denotes the feature dissimilarity between two corresponding control points. To prevent the cost function from biasing towards fewer matches, we add the miss cost C^{F↔M}_i(x) = f_i^{full} − (\sum_j u_{ij}) f_i(x) to denote how much of the model has not been matched by the image. It encourages more parts to be matched on the model side. There is a trade-off between C^{M↔I}_{ij} and C^{F↔M}_i, where β ≥ 0 is a controlling factor. Note that \|\cdot\| can be any norm function^1.
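A small sketch of how the total cost of Equation (2) is assembled from its three parts, assuming the mismatch, miss and edit costs have already been computed as arrays; the array layout and function name are our assumptions.

```python
import numpy as np

def total_cost(C_mismatch, C_miss, C_edit, u, z, beta):
    """Total cost of Equation (2).
    C_mismatch: Na x Nb mismatch costs C^{M<->I}_{ij};  C_miss: length-Na miss costs
    C^{F<->M}_i (already computed for the current u);  C_edit: length-K edit costs;
    u: Na x Nb 0/1 correspondences; z: length-K 0/1 editing choices."""
    assert (u.sum(axis=1) <= 1).all()              # each model part matches at most once
    match_term = beta * (u * C_mismatch).sum()     # beta * sum_i sum_j u_ij C^{M<->I}_ij
    miss_term = C_miss.sum()                       # sum_i C^{F<->M}_i(x, u)
    edit_term = (C_edit * z).sum()                 # sum_k C^{edit}_k z_k
    return match_term + miss_term + edit_term
```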
The rest of this section focuses on the two parts of our cost function. Shape matching
will be performed on two levels of contexts and saliency based editing will result in the
opportunistic search approach.
We extend the shape matching method called contour context selection in [4] to two
different contextual levels: “figural context selection” and “semantic context selection”.
^1 In our shape matching we used the L1 norm.
s.t. \quad \sum_{i,j,i',j'} u_{ij} u_{i'j'} C^{geo}_{i,j,i',j'} \le C_{tol} \qquad (3)

where SC^M_i(x) and SC^I_j(y) are defined as the Shape Context centered at model control point a_i and image control point b_j, respectively. C^{geo}_{i,j,i',j'} is the geometric inconsistency cost of correspondences, with [1/e^γ, 1], γ ∈ [0, 1] as its weight. Then the cost function for semantic context selection is:
Fig. 3. Semantic context selection. Left: Part hypothesizing. a) A local part region around the eye
in the image, with segments and control points. c) A model template of the eye with control points.
Selection result on the image is shown in b). Right: Consistent part grouping. Semantic-level
shape context centered on the left eye captures semantic contextual information of the image. A
subset of those parts form a mutually consistent context and we group them by matching with the
semantic-level shape context on the model shown in the middle.
(4)
The variable definitions are similar to figural context selection, except for two differ-
ences: 1) selection variables depend on the correspondences and 2) Shape Context no
longer counts edge points, but object part labels.
The desired output of labeling L(S) is implicitly given in the optimization variables.
During part hypothesis generation, we put labels of candidate parts onto the segments.
Then after semantic context selection, we confirm some labels and discard the others
using the correspondence uij between part candidates and object part models.
Labeling object parts using saliency based editing potentially requires searching over a
very large state space. Matching object shape and its part configuration requires com-
puting correspondences and non-local context. Both of them have exponentially many
choices. On top of that, we need to find a sequence of editings, such that the resulting
segments and parts produced by these editings are good enough for matching.
The key intuition of our saliency based opportunistic search is that we start from
coarse segmentations which produce salient segments and parts to guarantee low
saliency cost. We iteratively match the configuration of salient parts to give a sequence of bounds on the search zone of the space which needs to be explored. The possible
spatial extent of the missing parts is bounded by their shape matching cost and the
edit cost (equally, saliency cost). Once the search space has been narrowed down, we
“zoom-in” to the finer scale segmentation to rediscover missing parts (hence with lower
saliency). Then we “zoom-out” to do semantic context selection on all the part hy-
potheses. Adding these new parts improves the bound on the possible spatial extent
and might suggest new search zones. This opportunistic search allows both high effi-
ciency and high accuracy of object part labeling. We avoid extensive computation by
narrowing down the search zone. Furthermore, we only explore less salient parts if there
exist salient ones supporting them, which avoids producing many false positives from
non-salient parts.
Search Zone. In each step t of the search, given (x^{(t-1)}, y^{(t-1)}, z^{(t-1)}, u^{(t-1)}), we use ΔC^{M↔I}(x, y_{edit}) to denote the increment of C^{M↔I}(x, y_{edit}) (the first summation in Equation (2)). ΔC^{F↔M}(x, u) and ΔC^{edit}(z) are defined similarly. By finding missing parts we seek to decrease the cost (2); therefore, we introduce the following criterion for finding missing parts:

β ΔC^{M↔I}(x, y, z) + ΔC^{F↔M}(x, u) + ΔC^{edit}(z) \le 0 \qquad (5)

We write C^{M↔I}(x, y, z) = C^{M↔I}(x, y_{edit}), since y_{edit} depends on the editing vector z.
The estimation of bounds is based on the intuition that if all the missing parts can be found, then no miss cost needs to be paid any more. Therefore, according to Equation (4):

ΔC^{F↔M}(x) \ge − \sum_i C^{F↔M}_i(x, u). \qquad (6)

This is the upper bound for the increment of either one of the other two items in Equation (5) when any new object part is matched.
Suppose a new editing z^{(t)}_α = 1|_{z^{(t-1)}_α = 0} matches a new object part a_k to a part hypothesis b_ℓ in the image. Let k ↔ ℓ indicate u^{(t)}_{kℓ} = 1 and \sum_j u^{(t-1)}_{kj} = 0. Then this editing at least has to pay the cost of matching a_k to b_ℓ (we do not know whether others will also match or not):

C|_{k↔ℓ} = β ΔC^{M↔I}(x, y, z)|_{k↔ℓ} + C^{edit}_α. \qquad (7)

The first item on the right of Equation (7) is the increment of the mismatch cost ΔC^{M↔I}(x, y_{edit}) when a new object part a_k gets matched to b_ℓ. It can be computed based on the last state of the variables (x^{(t-1)}, y^{(t-1)}, z^{(t-1)}, u^{(t-1)}). According to the above equations, we get

β ΔC^{M↔I}(x, y, z)|_{k↔ℓ} + C^{edit}_α − \sum_i C^{F↔M}_i(x^{(t-1)}, u^{(t-1)}) \le 0 \qquad (8)

Since we use Shape Context for representation and matching, the mismatch cost is non-decreasing, and the editing cost is nonnegative, so we obtain the bounds for the new editing z^{(t)}_α = 1|_{z^{(t-1)}_α = 0}. Let Z(k) denote the search zone for object part k. Then we can compute two bounds for Z(k):
(Supremum) \quad Z^{sup}(k) = \{ z_α : ΔC^{M↔I}(x, y, z)|_{k↔ℓ} \le \frac{1}{β} \sum_i C^{F↔M}_i(x^{(t-1)}, u^{(t-1)}) \} \qquad (9)

(Infimum) \quad Z^{inf}(k) = \{ z_α : C^{edit}_α \le \sum_i C^{F↔M}_i(x^{(t-1)}, u^{(t-1)}) \} \qquad (10)

where Z^{sup} gives the supremum of the search zone, i.e., the upper bound of the zoom-in window size, and Z^{inf} gives the infimum of the search zone, i.e., the lower bound of the zoom-in window size. When the number of segments is fixed, the saliency of the segments decreases as the window size becomes smaller. Z^{sup} depends on the mismatch cost and Z^{inf} depends on the edit cost (i.e., saliency). In practice, one can sample the space of the search zone and check which candidates fall into these two bounds.
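One straightforward reading of this sampling step is sketched below: candidate editings are kept only if they satisfy both the mismatch bound of Equation (9) and the edit-cost bound of Equation (10); the tuple layout of the candidates and the function name are our assumptions.

```python
def editings_in_search_zone(candidates, miss_total, beta):
    """Keep candidate editings satisfying both bounds.
    candidates: iterable of (editing, delta_mismatch, edit_cost) tuples, where
    delta_mismatch estimates the mismatch increment if the missing part is matched
    through this editing; miss_total: sum_i C^{F<->M}_i at the previous state."""
    zone = []
    for editing, delta_mismatch, edit_cost in candidates:
        within_sup = delta_mismatch <= miss_total / beta   # Equation (9)
        within_inf = edit_cost <= miss_total               # Equation (10)
        if within_sup and within_inf:
            zone.append(editing)
    return zone
```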
Our opportunistic search is summarized in Algorithm 1.
4 Implementation
4.1 A Typical Example
We present more details on the opportunistic search using faces as an example in
Figure 4. We found that usually the whole shape of the face is more salient than individ-
ual facial parts. Therefore, the procedure starts with figural context and then switches to
semantic context. We concretize our algorithm for this problem in the following steps.
The same procedure can be applied to similar objects.
1. Initialization: Object Detection. Any object detection method can be used, but it
is not a necessary step2 . We used shape context voting [13] to do this task, which
can handle different poses using a small set of positive training examples.
2. Context Based Alignment. First, use C^{figural} in Equation (3) to select the best matched model M_k and generate the correspondences u^{figural} for rough alignment^3. When the loop comes back again, update the alignment based on u^{semantic}. Estimate locations for other, still missing parts.
3. Part Hypotheses Generation. Zoom in on these potential part locations by crop-
ping the regions and do Ncut segmentation to get finer scale segmentation. Then
match them to some predefined part models. The resulting matching score is used
to prune out unlikely part hypotheses, according to the bound of the cost function.
4. Part Hypotheses Grouping. Optimize C^{semantic} in Equation (4). Note that the best scoring group may consist of only a subset of the actual object parts.
5. Termination Checking. If no better results can be obtained, then we go to the next
step. Or else we update semantic context and go back to step 2.
6. Extracting Facial Contours. This is a special step for faces only. With the final set of facial parts, we optimize C^{figural} again to extract the segments that correspond to the face silhouette, which can be viewed as a special part of the face.
^2 Figural context selection can also be used to do that [4].
^3 In practice, we kept the best two model hypotheses.
Fig. 4. Saliency based opportunistic search, using faces as an example. Top: the flowchart. Bot-
tom: results of each step for 3 different examples. Typically the iteration converges after only one
or two rounds. Rectangles with different colors indicate the zoom-in search zones for different
parts. Note that when zoom-in is performed for the first time, two adjacent parts can be searched
together for efficiency. This figure is best viewed in color.
Fig. 5. Left: averaged models for ASM1. Right: averaged model for ASM3.
Method | No. of Poses | Silhouette | No. of Training | No. of Test | Average point error
ASM1   | 7 | with    | 138  | 86 | 0.2814
ASM2   | 5 | without | 127  | 81 | 0.2906
ASM3   | 3 | without | 102  | 70 | 0.3208
Ours   | 7 | with    | 7+16 | 86 | 0.1503
Table 3. Average error, normalized by distance between eyes for ASM vs. our method
consistent with each other. These constraints are summarized in Table 1. The cost function and constraints are linear; we relaxed the variables and solved the problem with linear programming (LP).
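As a generic sketch of such an LP relaxation (assuming the cost vector and inequality constraints of Table 1 have been assembled elsewhere; the concrete matrices are not reproduced here, and the function name is ours), one could proceed as follows.

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxed_selection(cost, A_ub, b_ub):
    """Relax the 0/1 selection variables to [0, 1], solve the LP, and round back."""
    n = np.asarray(cost).size
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, 1.0)] * n, method="highs")
    selection = (res.x > 0.5).astype(int)          # simple rounding of the relaxed solution
    return selection, res.fun
```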
Datasets. We tested our approach on both statue faces from the Emperor-I dataset [14]
and real faces from various widely used face databases (UMIST, Yale, and Caltech
Faces). Quantitative comparison was done on the Emperor-I dataset and we also show
some qualitative results on a sample set of all these datasets. The statue face dataset has
Fig. 6. Average point error vs. image number. All the values are normalized by the estimated
distance of two eyes in each image. The vertical dot-dash lines separate images of different poses.
some difficulties that normal faces do not have: lack of color cue, low contrast, inner
clutter, and great intra-subject variation.
Comparison measurement. The comparison is between Active Shape Models [10] and
our approach. Since we extract facial parts by selecting contours, our desired result is
that the extracted contours are all in the right places and correctly labeled. However,
ASM generates point-wise alignment between the image and a holistic model. Due to
the differences, we chose to use “normalized average point alignment error” measure-
ment for alignment comparison.
Since our results are just labeled contours, we do not have point correspondences for
computing the point alignment error. Therefore, we relaxed the measurement to the distance between each ground truth key point and its closest point on the contours belonging to the same part. To make the comparison fair, we apply exactly the same measurement to ASM by using spline interpolation to generate “contours” for its facial parts. As our normalizing factor we use 0.35 times the maximum height of the ground truth key points, an approximation of the distance between the two eyes that is invariant to pose changes.
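A short sketch of this measurement, assuming the ground-truth key points and the labeled contour points of a part are given as 2D arrays and the normalizing factor has been computed as described; the function name is ours.

```python
import numpy as np

def normalized_point_error(gt_points, contour_points, norm_factor):
    """Mean distance from each ground-truth key point to its closest labeled contour
    point of the same part, divided by the eye-distance proxy norm_factor."""
    gt_points = np.asarray(gt_points, dtype=float)
    contour_points = np.asarray(contour_points, dtype=float)
    dists = [np.linalg.norm(contour_points - p[None, :], axis=1).min() for p in gt_points]
    return float(np.mean(dists)) / norm_factor
```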
Experiments. There are two aspects of our Emperor-I dataset that may introduce dif-
ficulties for ASM: few training examples with various poses and dramatic face silhou-
ette changes. Therefore, we designed three variants of ASM to compensate for these
challenges, denoted in our plots as “ASM1”,“ASM2”,“ASM3”. Table 2 shows the dif-
ferences. Basically, ASM2 and ASM3 disregard face silhouette and work on fewer
poses that may have relatively more exemplars. Note that ASM3 even combined the
training data of the three near-frontal poses as a whole. We used “leave-one-out” cross-
validation for ASM. For our method, we picked 7 images of different poses (one for each pose), labeled them and extracted their contours to serve as our holistic models. Moreover, we chose facial part models (usually composed of 2 or 3 contours) from a
total of 23 images which also contained these 7 images. Our holistic models are shown
in Figure 2 and Figure 5 shows those averaged ones for ASM.
In Figure 6, we show the alignment errors for all the facial parts together and also
those only for the eyes. Other facial parts have similar results so we leave them out. In-
stead, we provide a summary in Table 3 and a comparison in the last column of Table 2,
where each entry is the mean error across the test set or test set fold, as applicable. We
(Figure 7 image rows, for each group: Rough Alignment, Contour Grouping, ASM, Our Result, Final Alignment.)
Fig. 7. A subset of the results. Upper group is on the Emperor-I dataset and the lower is for
real faces from various face databases (1-2 from UMIST, 3-4 from Yale, and 5-7 from Caltech).
Matched models, control points and labeled segments are superimposed on the images.
can see that our method performs significantly better than ASM on all facial parts with
significantly fewer training examples. We provide a qualitative evaluation of the results
in Figure 7, where we compare the result of ASM and our method on a variety of im-
ages containing both statue faces and real faces. These images show great variations,
especially of those statue faces. Note that the models are only trained on statue faces.
6 Conclusion
We proposed an object part extraction and labeling framework which incorporates two-
level contexts and saliency based opportunistic search. The combination of figural context on the whole object shape and semantic context on parts enables robust matching between object parts and image segments in cluttered images. Saliency further improves this search by gradually exploring salient bottom-up segmentations and bounding the search via the shape matching cost. Experimental results on several challenging face datasets demonstrate that our approach can accurately label object parts such as facial features and resists accidental alignment.
References
1. Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, de-
formable part model. In: CVPR (2008)
2. Ferrari, V., Jurie, F., Schmid, C.: Accurate object detection with deformable shape models
learnt from images. In: CVPR, pp. 1–8 (2007)
3. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition.
IJCV 61(1), 55–79 (2005)
4. Zhu, Q., Wang, L., Wu, Y., Shi, J.: Contour context selection for object detection: A single
exemplar suffices. In: ECCV (2008)
5. Zhu, Q., Shi, J.: Untangling cycles for contour grouping. In: ICCV (2007)
6. Biederman, I.: Recognition by components: A theory of human image understanding. Psy-
chR 94(2), 115–147 (1987)
7. Pentland, A.: Recognition by parts. In: ICCV, pp. 612–620 (1987)
8. Amit, Y., Trouve, A.: Pop: Patchwork of parts models for object recognition. IJCV 75(2),
267–282 (2007)
9. Sudderth, E., Torralba, A., Freeman, W., Willsky, A.: Learning hierarchical models of scenes,
objects, and parts. In: ICCV, pp. 1331–1338 (2005)
10. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Active shape models: Their training and ap-
plication. CVIU 61(1), 38–59 (1995)
11. Cootes, T., Edwards, G., Taylor, C.: Active appearance models. PAMI 23(6), 681–685 (2001)
12. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape con-
texts. IEEE Trans. Pattern Anal. Mach. Intell. (2002)
13. Wang, L., Shi, J., Song, G., fan Shen, I.: Object detection combining recognition and seg-
mentation. In: ACCV (1), pp. 189–199 (2007)
14. Chen, C.: The First Emperor of China. Voyager Company (1994)
Stereo Matching: An Outlier Confidence Approach
1 Introduction
One useful technique to reduce the matching ambiguity for stereo images is to incor-
porate the color segmentation into optimization [1,2,3,4,5,6]. Global segmentations im-
prove the disparity estimation in textureless regions, but most of them do not necessarily preserve accurate boundaries. Our experiments show that, when the ground truth occlusion information is taken into the optimization, very accurate disparity estimation can be achieved. This shows that partial occlusion is one major source of matching errors.
The main challenge of solving the stereo problems now is the appropriate outlier detec-
tion and handling.
In this paper, we propose a new stereo matching algorithm aiming to improve the
disparity estimation. Our algorithm does not assign each pixel a binary visibility value
indicating whether this pixel is partially occluded or not [7,4,8], but rather introduces
soft Outlier Confidence (OC) values to reflect how confident we regard one pixel as an
outlier. The OC values, in our method, are used as weights balancing two ways to infer
the disparities. The final energy function is globally optimized using Belief Propagation
(BP). Without directly labeling each pixel as “occlusion” or “non-occlusion”, our model
has considerable tolerance of errors produced in the occlusion detection process.
Another main contribution of our algorithm is the local disparity inference for out-
lier pixels, complementary to the global segmentation. Our method defines the disparity
similarity according to the color distance between pixels and naturally transforms color
sample selection to a general foreground or background color inference problem using
image matting. It effectively reduces errors caused by inaccurate global color segmen-
tation and gives rise to a reliable inference of the unknown disparity of the occluded
pixels.
2 Related Work
A comprehensive survey of the dense two-frame stereo matching algorithms was given
in [10]. Evaluations of almost all stereo matching algorithms can be found in [9]. Here
we review previous work dealing with outliers because, essentially, the difficulty of
stereo matching is to handle the ambiguities.
Efforts to deal with outliers are usually made at three stages of stereo matching: cost aggregation, disparity optimization, and disparity refinement.
Most approaches use outlier truncation or other robust functions for cost computation
in order to reduce the influence of outliers [2,11].
Window-based methods aggregate matching cost by summing the color differences
over a support region. These methods [12,13] prevent depth estimation from aggre-
gating information across different depth layers using the color information. Yoon and
Kweon [14] adjusted the support-weight of a pixel in a given window based on the
CIELab color similarity and its spatial distance to the center of the support window.
Zitnick et al. [12] partitioned the input image and grouped the matching cost in each
color segment. Lei et al. [15] used segmentation to form small regions in a region-tree
for further optimization.
In disparity optimization, outliers are handled in two ways in general. One is to
explicitly detect occlusions and model visibility [7,4,8]. Sun et al. [4] introduced the
visibility constraint by penalizing the occlusions and breaking the smoothness between
the occluded and non-occluded regions. In [8], Strecha et al. modeled the occlusion as
a random outlier process and iteratively estimated the depth and visibility in an EM
framework in multi-view stereo. Another class of methods suppresses outliers using
extra information, such as pixel colors, in optimization. In [16,6], a color weighted
smoothness term was used to control the message passing in BP. Hirschmuller [17] took
color difference as the weight to penalize large disparity differences and optimized the
disparities using a semi-global approach.
Post-processing has also been introduced to handle the remaining outliers after the global or
local optimization. Occluded pixels can be detected using a consistency check, which
validates the disparity correspondences in two views [10,4,17,6]. Disparity interpola-
tion [18] infers the disparities for the occluded pixels from the non-occluded ones by
setting the disparities of the mis-matched pixels to that of the background. In [1,3,4,5,6],
color segmentation was employed to partition images into segments, each of which is
refined by fitting a 3D disparity plane. Optimization such as BP can be further applied
after plane fitting [4,5,6] to reduce the possible errors.
Several disparity refinement schemes have been proposed for novel-view synthe-
sis. Sub-pixel refinement [19] enhances details for synthesizing a new view. In [12]
and [20], boundary matting for producing seamless view interpolation was introduced.
These methods only aim to synthesize natural and seamless novel-views, and cannot be
directly used in stereo matching to detect or suppress outliers.
3 Our Model
Denoting the input stereo images as Il and Ir , and the corresponding disparity maps as
Dl and Dr respectively, we define the matching energy as

E(Dl, Dr) = Ed(Dl; Il, Ir) + Ed(Dr; Il, Ir) + Es(Dl, Dr),   (1)

where Ed(Dl; Il, Ir) + Ed(Dr; Il, Ir) is the data term and Es(Dl, Dr) defines the smoothness term constructed on the disparity maps. In our algorithm, we not
only consider the spatial smoothness within one disparity map, but also model the con-
sistency of disparities between frames.
As the occluded pixels influence the disparity estimation, they should not be used in
stereo matching. In our algorithm, we do not distinguish between occlusion and image
noise, but rather treat all problematic pixels as outliers. Outlier Confidences (OCs) are
computed on these pixels, indicating how confident we regard one pixel as an outlier.
The outlier confidence maps Ul and Ur are constructed on the input image pair. The
confidence Ul (x) or Ur (x) on pixel x is a continuous variable with value between 0 and
1. Larger value indicates higher confidence that one pixel is an outlier, and vice versa.
Our model combines an initial disparity map and an OC map for two views. In the
following, we first introduce our data and smoothness terms. The construction of the
OC map will be described in Section 4.2.
where α and β are weights. f0 (x, d; Il , Ir ) denotes the color dissimilarity cost between
two views. f1 (x, d; Il ) is the term defined as the local color and disparity discontinuity
cost in one view. Ed (Dr ; Il , Ir ) on the right image can be defined in a similar way.
The above two terms, balanced by the outlier confidence Ul (x), model respectively
two types of processes in disparity computation. Compared to setting Ul (x) as a bi-
nary value and assigning pixels to either outliers or inliers, our cost terms are softly
combined, tolerating possible errors in pixel classification.
For result comparison, we give two definitions of f0 (x, dl ; Il , Ir ) respectively cor-
responding to whether the segmentation is incorporated or not. The first is to use the
color and distance weighted local window [14,6,19] to aggregate color difference be-
tween conjugate pixels:
f0^(1)(x, dl; Il, Ir) = min( g(‖Il(x) − Ir(x − dl)‖1), ϕ ),   (3)
where g(·) is the aggregate function defined similarly to Equation (2) in [6]. We use the
default parameter values (local window size 33 × 33, βcw = 10 for normalizing color
differences, γcw = 21 for normalizing spatial distances). ϕ determines the maximum
cost for each pixel, whose value is set as the average intensity of pixels in the correlation
volume.
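For concreteness, the following is a minimal numpy sketch of the color- and distance-weighted aggregation behind f0^(1). The exponential weight form is borrowed from the adaptive support-weight idea of [14] with the constants quoted above; the exact g(·) of [6] may differ in detail, so the function and its defaults are illustrative only.

```python
import numpy as np

def aggregated_cost(I_l, I_r, x, y, d, beta_cw=10.0, gamma_cw=21.0, win=33, phi=50.0):
    """Sketch of f0^(1): weighted aggregation of |I_l(p) - I_r(p - d)|_1 around (x, y).

    I_l, I_r are float arrays of shape (H, W, 3); d is an integer disparity.
    The truncation value phi would be set to the average of the correlation volume.
    """
    r = win // 2
    H, W, _ = I_l.shape
    num = den = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            px, py = x + dx, y + dy
            if not (0 <= py < H and 0 <= px < W and 0 <= px - d < W):
                continue
            # support weight: similar color and nearby pixels contribute more
            dc = np.abs(I_l[py, px] - I_l[y, x]).sum()
            ds = (dx * dx + dy * dy) ** 0.5
            w = np.exp(-dc / beta_cw - ds / gamma_cw)
            # raw per-pixel matching cost between conjugate pixels
            e = np.abs(I_l[py, px] - I_r[py, px - d]).sum()
            num += w * e
            den += w
    return min(num / max(den, 1e-8), phi)   # truncate at phi, as in (3)
```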
The second definition is given by incorporating the segmentation information.
Specifically, we use the Mean-shift color segmentation [21] with default parameters
(spatial bandwidth 7, color bandwidth 6.5, minimum region size 20) to generate color
segments. A plane fitting algorithm using RANSAC (similar to that in [6]) is then ap-
plied to produce the regularized disparity map dpf. We define
f0^(2)(x, dl; Il, Ir) = (1 − κ) f0^(1)(x, dl) + κ α |d − dpf|,   (4)
where δ(·) is the Dirac function, Ψ denotes the set of all disparity values between 0 and
N and ωi (x; Il ) is a weight function for measuring how disparity dl is likely to be i.
We omit subscript l in the following discussion of ωi (x; Il ) since both the left and right
views can use the similar definitions.
For ease of explanation, we first give a general definition of weight ωi (x; I), which,
in the following descriptions, will be slightly modified to handle two extreme situations
with values 0 and 1. We define
ωi(x; I) = 1 − L(I(x), Ii(Wx)) / [ L(I(x), Ii(Wx)) + L(I(x), I≠i(Wx)) ],   (6)
where I(x) denotes the color of pixel x and Wx is a window centered at x. Suppose that, after initialization, we have collected a set of pixels x′ detected as inliers within each Wx (i.e., U(x′) = 0), and have computed disparities for these inliers. We denote by Ii the set of inliers whose disparity values are computed as i. Similarly, I≠i are the inliers whose disparity values are not equal to i. L is a metric measuring the color difference between I(x) and its neighboring pixels Ii(Wx) and I≠i(Wx). One example is shown in Figure 1(a), where a window Wx is centered at an outlier pixel x. Within Wx, inlier pixels are clustered into I1 and I≠1. ω1(x; I) is computed according to the color similarity between x and other pixels in the two clusters.
Equation (6) is used to assign an outlier pixel x a disparity value, constrained by the color
similarity between x and the clustered neighboring pixels. By and large, if the color
distance between x and its inlier neighbors with disparity i is small enough compared
to the color distance to other inliers, ωi (x; I) should have a large value, indicating high
chance to let dl = i in (5).
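As a rough illustration of the ratio in (6), the sketch below uses the mean color of each inlier cluster inside Wx as a stand-in for the metric L; the paper instead selects representative colors by the optimized sampling described next, so this is only a simplified reading of the definition.

```python
import numpy as np

def omega_ratio(color_x, inlier_colors, inlier_disps, i):
    """Simplified omega_i of (6): 1 - L_i / (L_i + L_not_i).

    color_x        : color I(x) of the outlier pixel, shape (3,)
    inlier_colors  : colors of inlier pixels inside W_x, shape (m, 3)
    inlier_disps   : their initial disparities, shape (m,)
    """
    inlier_colors = np.asarray(inlier_colors, dtype=float)
    inlier_disps = np.asarray(inlier_disps)
    same = inlier_colors[inlier_disps == i]
    other = inlier_colors[inlier_disps != i]
    if len(same) == 0 or len(other) == 0:
        return 0.0   # no evidence in this window for (or against) disparity i
    L_i = np.linalg.norm(np.asarray(color_x, dtype=float) - same.mean(axis=0))
    L_not_i = np.linalg.norm(np.asarray(color_x, dtype=float) - other.mean(axis=0))
    return 1.0 - L_i / (L_i + L_not_i + 1e-8)
```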
Now the problem is how to compute a metric L that appropriately measures
the color distance between pixels. In our method, we abstract color sets Ii (Wx ) and
Fig. 1. Computing the disparity weight ω. (a) Within a neighborhood window Wx, inlier pixels are clustered into I1 and I≠1. (b)-(d) illustrate the color projection. (b) The projection of I(x) on the vector Ii(∗) − I≠i(∗) lies between the two ends. (c)-(d) The projections of I(x) fall out of range and are therefore treated as extreme cases.
I≠i(Wx) by two representatives Ii(∗) and I≠i(∗) respectively. Then L is simplified to a color metric between pixels. We adopt the color projection distance along the vector Ii(∗) − I≠i(∗) and define
where ⟨·, ·⟩ denotes the inner product of two color vectors and c can be either Ii(∗) or I≠i(∗). We regard Ii(∗) − I≠i(∗) as a projection vector because it measures the absolute difference between the two representative colors, or, equivalently, the distance between the sets Ii(Wx) and I≠i(Wx).
Projecting I(x) onto the vector Ii(∗) − I≠i(∗) also makes the assignment of the two extreme values 0 and 1 to ωi(x; I) easy. Taking Figure 1 as an example, if the projection of I(x) on the vector Ii(∗) − I≠i(∗) lies between the two ends, its value is obviously between 0 and 1, as shown in Figure 1(b). If the projection of I(x) falls beyond one end point, its value should be 0 if it is close to I≠i(∗) or 1 otherwise (Figure 1(c) and (d)). To handle the extreme
cases, we define the final ωi (x; I) as
ωi(x; I) =
  0            if ⟨I − I≠i(∗), Ii(∗) − I≠i(∗)⟩ < 0,
  1            if ⟨Ii(∗) − I, Ii(∗) − I≠i(∗)⟩ < 0,          (8)
  ωi(x; I)     otherwise,

where

  T(x) = 0 if x < 0;   1 if x > 1;   x otherwise.           (9)

Note that the term (I − I≠i(∗))ᵀ(Ii(∗) − I≠i(∗)) / ‖Ii(∗) − I≠i(∗)‖₂²
defined in (8) is quite similar to an alpha matte
model used in image matting [22,23] where the representative colors I i(∗) and I =i(∗)
are analogous to the unknown foreground and background colors. The image matting
problem is solved by color sample collection and optimization. In our problem, the
color samples are those clustered neighboring pixels Ii (Wx ) and I=i (Wx ).
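A small sketch of the projection-based weight with the clamping T(·) of (9) follows; it assumes the two representative colors Ii(∗) and I≠i(∗) have already been chosen (e.g., by the sample selection of robust matting [23] described next), and the clamping covers the two extreme cases of (8).

```python
import numpy as np

def omega_projection(I_x, rep_i, rep_not_i):
    """Project I(x) onto the vector between the two representative colors.

    Returns a weight in [0, 1]: 1 when I(x) coincides with the representative of
    disparity i, 0 when it coincides with the other representative.
    """
    I_x = np.asarray(I_x, dtype=float)
    rep_i = np.asarray(rep_i, dtype=float)
    rep_not_i = np.asarray(rep_not_i, dtype=float)
    v = rep_i - rep_not_i
    denom = float(np.dot(v, v))
    if denom < 1e-12:
        return 0.5   # the representatives coincide; no preference either way
    alpha = float(np.dot(I_x - rep_not_i, v)) / denom   # alpha-matte-like term
    return float(np.clip(alpha, 0.0, 1.0))              # T(.) of (9)
```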
With the above analysis, computing the weight ωi is naturally transformed to an
image matting problem where the representative color selection is handled by applying
an optimization algorithm. In our method, we employ the robust matting with optimal
color sample selection approach [23]. In principle, Ii(∗) and I≠i(∗) are respectively selected from Ii(Wx) and I≠i(Wx) based on a sample confidence measure combining two criteria. First, either Ii(∗) or I≠i(∗) should be similar to the color of the outlier pixel I, which makes the weight ωi approach either 0 or 1 and keeps the weight distribution from being uniform. Second, I is also expected to be a linear combination of Ii(∗) and I≠i(∗). This is useful for modeling color blending, since an outlier pixel can be an interpolation of color samples, especially on region boundaries.
Using the sample confidence definition, we get two weights and a neighborhood
term, similar to those in [23]. Then we apply the Random Walk method [24] to com-
pute the weight ωi. This process is repeated for all ωi's, where i = 0, · · · , N. The main benefit of employing this matting method is that it provides an optimal way to select representative colors while maintaining spatial smoothness.
where N1 (x) represents the N possible corresponding pixels of x in the other view and
N2 (x) denotes the 4-neighborhood of x in the image space. f2 is defined as
f2(x, x′, di) = min( |di(x) − di(x′)|, τ ),   i ∈ {l, r},   (11)

where τ is a threshold set to 2. For (11), we also experimented with color-weighted smoothness and observed that the results were not improved.
We define f3(·) as the disparity correlation between the two views:

f3(x, x′, dl, dr) = min( |dl(x) − dr(x′)|, ζ )   and   f3(x, x′, dr, dl) = min( |dr(x) − dl(x′)|, ζ ),   (12)
4 Implementation
The overview of our framework is given in Algorithm 1, which consists of an initial-
ization step and a global optimization step. In the first step, we initialize the disparity
maps by minimizing an energy with the simplified data and smoothness terms. Then we
compute the Outlier Confidence (OC) maps. In the second step, we globally refine the
disparities by incorporating the OC maps.
Algorithm 1. Overview of our framework
1. Initialization:
1.1 Initialize the disparity maps by minimizing the simplified energy (Section 4.1).
1.2 Estimate the Outlier Confidence (OC) maps (Section 4.2).
2. Global Optimization:
2.1 Compute data terms using the estimated outlier confidence maps.
2.2 Global optimization using BP.
Because we introduce the inter-frame disparity consistency (12), the Markov Random Field (MRF) defined by our energy is slightly different from the regular-grid MRFs used in other stereo approaches [2,25]. In our two-frame configuration, the
MRF is built on two images with (4 + N ) neighboring sites for each node. N is the
total number of the disparity levels. One illustration is given in Figure 2 where a pixel
x in Il not only connects to its 4 neighbors in the image space, but also connects to all
possible corresponding pixels in Ir .
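A small sketch of how the (4 + N)-connected neighborhood of Figure 2 can be enumerated for one pixel is given below; the site encoding and the sign convention for the corresponding pixels are illustrative assumptions.

```python
def neighbors(x, y, view, W, H, N):
    """Neighbor sites of pixel (x, y): 4 spatial neighbors in the same view plus
    the N candidate corresponding pixels (one per disparity level) in the other view.
    Sites are encoded as (view, x, y) with view in {'left', 'right'}."""
    nbrs = [(view, x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < W and 0 <= y + dy < H]
    other = 'right' if view == 'left' else 'left'
    step = -1 if view == 'left' else 1   # x - d in the right view, x + d in the left
    for d in range(N):
        xc = x + step * d
        if 0 <= xc < W:
            nbrs.append((other, xc, y))
    return nbrs
```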
We minimize the energy defined in (13) using Belief Propagation. The inter-frame
consistency constraint makes the estimated disparity maps contain less noise in two
frames. We show in Figure 3(a) the initialized disparity result using the standard 4-
connected MRF without defining f3 in (10). (b) shows the result using our (4 + N )-
connected MRF. The background disparity noise is reduced.
Fig. 2. In our dual view configuration, x (marked with the cross) is not only connected to 4
neighbors in one image, but also related to N possible corresponding pixels in the other image.
The total number of neighbors of x is 4 + N .
Depending on whether f0^(1) in (3) or f0^(2) in (4) is used in the data term definition, we obtain two sets of initializations, with and without global color segmentation. We will compare in the results how applying our OC model in the subsequent global optimization improves both sets of disparity maps.
We estimate the outlier confidence map U on the initial disparity maps. Our following
discussion focuses on estimating Ul on the left view. The right view can be handled in
a similar way. The outlier confidences, in our algorithm, are defined as
Ul(x) =
  1                                      if |dl(x) − dr(x − dl(x))| ≥ 1,
  T( (bx(d∗) − bmin) / (bo − bmin) )     if bx(d∗) > t ∧ |dl(x) − dr(x − dl(x))| = 0,     (14)
  0                                      otherwise,
considering 2 cases.
Case 1: Our MRF enforces the disparity consistency between two views. After dis-
parity initialization, the remaining pixels with inconsistent disparities are likely to be
occlusions. So we first set the outlier confidence Ul (x) = 1 for pixel x if the inter-frame
consistency is violated, i.e., |dl (x) − dr (x − dl (x))| ≥ 1.
Case 2: Besides the disparity inconsistency, pixel matching with large matching cost
is also unreliable. In our method, since we use BP to initialize the disparity maps, the
matching cost is embedded in the output disparity belief bx (d) for each pixel x. Here,
we introduce some simple operations to manipulate it. First, we extract bx (d∗ ), i.e.,
the smallest belief, for each pixel x. If bx (d∗ ) < t, where t is a threshold, the pixel
should be regarded as an inlier given the small matching cost. Second, a variable bo is
computed as the average of the minimal beliefs regarding all occluded pixels detected
in Case 1, i.e., bo = Σ_{Ul(x)=1} bx(d∗) / K, where K is the total number of occluded
pixels. Finally, we compute bmin as the average of top n% minimal beliefs among all
pixels. n is set to 10 in our experiments.
Using the computed bx(d∗), bo, and bmin, we estimate Ul(x̂) for each pixel x̂ neither detected as an occlusion nor treated as an inlier by setting

Ul(x̂) = T( (bx̂(d∗) − bmin) / (bo − bmin) ),   (15)
Fig. 3. Intermediate results for the “Tsukuba” example. (a) and (b) show our initial disparity maps
by the 4-connected and (4 + N )-connected MRFs respectively without using segmentation. The
disparity noise in (b) is reduced for the background. (c) Our estimated OC map. (d) A disparity
map constructed by combining the inlier and outlier information. The disparity of each outlier pixel is set to the level with the maximum weight ωi; the inlier pixels keep their initially computed disparity values.
where T is the function defined in (9), making the confidence value in range [0, 1]. (15)
indicates if the smallest belief bx (d∗ ) of pixel x is equal to or larger than the average
smallest belief of the occluded pixels detected in Case 1, the outlier confidence of x
will be high, and vice versa.
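A numpy sketch of the OC estimation in (14)-(15) is given below, assuming the minimal belief bx(d∗) of every pixel is available from the initial BP pass as an array; the variable names and the vectorized form are illustrative.

```python
import numpy as np

def outlier_confidence(d_l, d_r, beliefs_l, t, n_percent=10.0):
    """Sketch of the OC map (14)-(15) for the left view.

    d_l, d_r   : integer disparity maps of shape (H, W)
    beliefs_l  : minimal beliefs b_x(d*) from the initial BP pass, shape (H, W)
    t          : belief threshold below which a pixel is kept as an inlier
    """
    d_l = np.asarray(d_l); d_r = np.asarray(d_r)
    b = np.asarray(beliefs_l, dtype=float)
    H, W = d_l.shape
    cols = np.arange(W)[None, :]
    rows = np.arange(H)[:, None]
    xr = np.clip(cols - d_l, 0, W - 1)
    incons = np.abs(d_l - d_r[rows, xr]) >= 1            # Case 1: inconsistent disparities

    b_o = b[incons].mean() if incons.any() else float(b.max())
    k = max(1, int(b.size * n_percent / 100.0))
    b_min = np.sort(b, axis=None)[:k].mean()             # average of the top n% smallest beliefs

    U = np.zeros((H, W))
    U[incons] = 1.0
    unsure = (~incons) & (b > t)                         # Case 2: consistent but costly match
    U[unsure] = np.clip((b[unsure] - b_min) / (b_o - b_min + 1e-8), 0.0, 1.0)
    return U
```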
Figure 3(c) shows the estimated outlier confidence map for the "Tsukuba" example.
The pure black pixels represent inliers where Ul (x) = 0. Generally, the region consist-
ing of pixels with Ul (x) > 0 is wider than the ground truth occluded region. This is
allowed in our algorithm because Ul (x) is only a weight balancing pixel matching and
color smoothness. Even if pixel x is mistakenly labeled as an outlier, the disparity esti-
mation in our algorithm will not be largely influenced because large Ul (x) only makes
the disparity estimation of x rely more on neighboring pixel information, by which d(x)
still has a large chance to be correctly inferred.
To illustrate the efficacy of our OC scheme, we show in Figure 3(d) a disparity map
directly constructed with the following setting: each inlier pixel keeps its initially computed disparity value, and each outlier pixel takes the disparity i corresponding to the maximum weight ωi among all ωj's, where j = 0, · · · , N. It can be observed that
even without any further global optimization, this simple maximum-weight disparity
calculation already makes the object boundary smooth and natural.
With the estimated OC maps, we are ready to use global optimization to compute the
final disparity maps combining costs (2) and (10) in (1). Two forms of f0 (·) ((3) and
(4)) are independently applied in our experiments for result comparison.
The computation of f1 (x, d; I) in (5) is based on the estimated OC maps and the
initial disparities for the inlier pixels, which are obtained in the aforementioned steps.
To compute ωi for outlier pixel x with Ul (x) > 0, robust matting [23] is performed
as described in Section 3.1 for each disparity level. The involved color sampling is
performed in each local window with size 60 × 60. Finally, the smoothness terms are
embedded in the message passing of BP. An acceleration using distance transform [25]
is adopted to construct the messages.
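For the truncated-linear smoothness used here, the distance-transform trick of [25] builds each message in O(N) rather than O(N²); a minimal sketch of the per-message computation follows (lam stands for the smoothness weight and tau = 2 for the truncation; names are illustrative).

```python
import numpy as np

def truncated_linear_message(h, lam, tau):
    """O(N) message for a truncated linear smoothness cost:
    m(d) = min_{d'} ( h(d') + lam * min(|d - d'|, tau) ),
    computed with the two-pass distance transform of [25].
    h is the per-disparity cost vector (data term plus incoming messages)."""
    h = np.asarray(h, dtype=float)
    m = h.copy()
    for d in range(1, len(m)):            # forward pass
        m[d] = min(m[d], m[d - 1] + lam)
    for d in range(len(m) - 2, -1, -1):   # backward pass
        m[d] = min(m[d], m[d + 1] + lam)
    return np.minimum(m, h.min() + lam * tau)   # apply the truncation
```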
5 Experiments
In experiments, we compare the results using and without using the Outlier Confidence
maps. The performance is evaluated using the Middlebury dataset [10]. All parameters
used in implementation are listed in Table 1 where α, β and κ are the weights defined
in the data term. γ and λ are for intra-frame smoothness and inter-frame consistency
respectively. ϕ, τ , and ζ are the truncation thresholds for different energy terms. t is
the threshold for selecting possible outliers. As we normalize the messages after each
message passing iteration by subtracting the mean of the messages, the belief bmin is
negative, making t = 0.9bmin > bmin .
A comparison of the state-of-the-art stereo matching algorithms is shown in
Table 2 extracted from the Middlebury website [9]. In the following, we give detailed
explanations.
Table 1. The parameter values used in our experiments. N is the number of the disparity levels.
c is the average of the correlation volume. bmin is introduced in (15).
Parameter:  α    β    κ    γ    λ    ϕ    τ    ζ    t
Value:      ϕ    0.8  0.3  5.0  5N   c    2.0  1.0  0.9bmin
Table 2. Algorithm evaluation on the Middlebury data set. Our method achieves overall rank 2 at
the time of data submission.
Table 3. Result comparison on the Middlebury dataset using (1st and 3rd rows) and without using
(2nd and 4th rows) OC Maps. The segmentation information has been incorporated for the last
two rows.
Fig. 4. Disparity result comparison. (a) Disparity results of “SEG” (b) Our final disparity results
using the Outlier Confidence model (“SEG+OC”).
We show in the first row of Table 3 (denoted as “COLOR”) the statistics of the
initial disparities. The algorithm is detailed in Section 4.1. We set U (x) = 0 for all x’s
and minimize the energy defined in (13). Then we estimate the OC maps based on the
initial disparities and minimize the energy defined in (1). We denote the final results as
“COLOR+OC” in the second row of Table 3.
Comparing the two sets of results, one can observe that incorporating the outlier in-
formation significantly improves the quality of the estimated disparity maps. The over-
all rank jumps from initial No. 16 to No. 5, which is the highest position for all results
produced by the stereo matching algorithms without incorporating segmentation.
For the "Teddy" example, however, our final disparity estimate does not gain a large improvement over the initial one. This is because the remaining errors are mostly caused by matching large textureless regions, which can be addressed by color
segmentation.
Fig. 5. Error comparison on the “Cones” example. (a) shows the disparity error maps for “SEG”
and “SEG+OC” respectively. (b) Comparison of three magnified patches extracted from (a). The
“SEG+OC” results are shown on the right of each patch pair.
Finally, the framework of our algorithm is general. Many other existing stereo match-
ing methods can be incorporated into the outlier confidence scheme by changing f0 to
other energy functions.
6 Conclusion
In this paper, we have proposed an Outlier-Confidence-based stereo matching algo-
rithm. In this algorithm, the Outlier Confidence is introduced to measure how likely one pixel is to be an outlier. A model using the local color information is proposed for
inferring the disparities of possible outliers and is softly combined with other data terms
to dynamically adjust the disparity estimate. Complementary to global color segmenta-
tion, our algorithm locally gathers color samples and optimizes them using the matting
techniques in order to reliably measure how one outlier pixel can be assigned a disparity
value. Experimental results on the Middlebury data set show that our proposed method
is rather effective in disparity estimation.
Acknowledgements
This work was fully supported by a grant from the Research Grants Council of Hong
Kong (Project No. 412708) and is affiliated with the Microsoft–CUHK Joint Laboratory.
References
1. Tao, H., Sawhney, H.S., Kumar, R.: A global matching framework for stereo computation.
In: ICCV, pp. 532–539 (2001)
2. Sun, J., Zheng, N.N., Shum, H.Y.: Stereo matching using belief propagation. IEEE Trans.
Pattern Anal. Mach. Intell. 25(7), 787–800 (2003)
3. Hong, L., Chen, G.: Segment-based stereo matching using graph cuts. In: CVPR (1), pp.
74–81 (2004)
4. Sun, J., Li, Y., Kang, S.B.: Symmetric stereo matching for occlusion handling. In: CVPR (2),
pp. 399–406 (2005)
5. Klaus, A., Sormann, M., Karner, K.F.: Segment-based stereo matching using belief propaga-
tion and a self-adapting dissimilarity measure. In: ICPR (3), pp. 15–18 (2006)
6. Yang, Q., Wang, L., Yang, R., Stewénius, H., Nistér, D.: Stereo matching with color-weighted
correlation, hierarchical belief propagation and occlusion handling. In: CVPR (2), pp. 2347–
2354 (2006)
7. Kang, S.B., Szeliski, R.: Extracting view-dependent depth maps from a collection of images.
International Journal of Computer Vision 58(2), 139–163 (2004)
8. Strecha, C., Fransens, R., Van Gool, L.J.: Combined depth and outlier estimation in multi-
view stereo. In: CVPR (2), pp. 2394–2401 (2006)
9. Scharstein, D., Szeliski, R.: http://vision.middlebury.edu/stereo/eval/
10. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo corre-
spondence algorithms. International Journal of Computer Vision 47(1-3), 7–42 (2002)
11. Zhang, L., Seitz, S.M.: Parameter estimation for mrf stereo. In: CVPR (2), pp. 288–295
(2005)
12. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S.A.J., Szeliski, R.: High-quality video
view interpolation using a layered representation. ACM Trans. Graph. 23(3), 600–608 (2004)
13. Yoon, K.J., Kweon, I.S.: Stereo matching with the distinctive similarity measure. In: ICCV
(2007)
14. Yoon, K.J., Kweon, I.S.: Adaptive support-weight approach for correspondence search. IEEE
Trans. Pattern Anal. Mach. Intell. 28(4), 650–656 (2006)
15. Lei, C., Selzer, J.M., Yang, Y.H.: Region-tree based stereo using dynamic programming op-
timization. In: CVPR (2), pp. 2378–2385 (2006)
16. Strecha, C., Fransens, R., Gool, L.J.V.: Wide-baseline stereo from multiple views: A proba-
bilistic account. In: CVPR (1), pp. 552–559 (2004)
17. Hirschmüller, H.: Accurate and efficient stereo processing by semi-global matching and mu-
tual information. In: CVPR (2), pp. 807–814 (2005)
18. Hirschmüller, H., Scharstein, D.: Evaluation of cost functions for stereo matching. In: CVPR
(2007)
19. Yang, Q., Yang, R., Davis, J., Nistér, D.: Spatial-depth super resolution for range images. In:
CVPR (2007)
20. Hasinoff, S.W., Kang, S.B., Szeliski, R.: Boundary matting for view synthesis. Computer
Vision and Image Understanding 103(1), 22–32 (2006)
21. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE
Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
22. Chuang, Y.Y., Curless, B., Salesin, D., Szeliski, R.: A bayesian approach to digital matting.
In: CVPR (2), pp. 264–271 (2001)
23. Wang, J., Cohen, M.F.: Optimized color sampling for robust matting. In: CVPR (2007)
24. Grady, L.: Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. In-
tell. 28(11), 1768–1783 (2006)
25. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision. In:
CVPR (1), pp. 261–268 (2004)
Improving Shape Retrieval by Learning Graph
Transduction
Xingwei Yang¹, Xiang Bai²,³, Longin Jan Latecki¹, and Zhuowen Tu³
¹ Dept. of Computer and Information Sciences, Temple University, Philadelphia
{xingwei,latecki}@temple.edu
² Dept. of Electronics and Information Engineering, Huazhong University of Science and Technology, P.R. China
xiang.bai@gmail.com
³ Lab of Neuro Imaging, University of California, Los Angeles
zhuowen.tu@loni.ucla.edu
1 Introduction
Fig. 1. Existing shape similarity methods incorrectly rank shape (b) as more similar
to (a) than (c)
Fig. 2. A key idea of the proposed distance learning is to replace the original shape
distance between (a) and (e) with a geodesic path in the manifold of known shapes,
which is the path (a)-(e) in this figure
large if the distance measure cannot capture the intrinsic property of the shape.
It appears to us that all published shape distance measures [1,2,3,4,5,6,7] are
unable to address this issue. For example, based on the inner distance shape
context (IDSC) [3], the shape in Fig. 1(a) is more similar to (b) than to (c),
but it is obvious that shape (a) and (c) belong to the same class. This incorrect
result is due to the fact that the inner distance is unaware that the missing tail
and one front leg are irrelevant for this shape similarity judgment. On the other
hand, much smaller shape details like the dog’s ear and the shape of the head
are of high relevance here. No matter how good a shape matching algorithm is,
the problem of relevant and irrelevant shape differences must be addressed if we
want to obtain human-like performance. This requires having a model to capture
the essence of a shape class instead of viewing each shape as a set of points or a
parameterized function.
In this paper, we propose to use a graph-based transductive learning algo-
rithm to tackle this problem, and it has the following properties: (1) Instead
of focusing on computing the distance (similarity) for a pair of shapes, we take
advantage of the manifold formed by the existing shapes. (2) However, we do not
explicitly learn the manifold nor compute the geodesics [8], which are time con-
suming to calculate. A better metric is learned by collectively propagating the
similarity measures to the query shape and between the existing shapes through
graph transduction. (3) Unlike the label propagation [9] approach, which is semi-
supervised, we treat shape retrieval as an unsupervised problem and do not re-
quire knowing any shape labels. (4) We can build our algorithm on top of any
existing shape matching algorithm and a significant gain in retrieval rates can
be observed on well-known shape datasets.
Given a database of shapes, a query shape, and a shape distance function,
which does not need to be a metric, we learn a new distance function that is
expressed by shortest paths on the manifold formed by the known shapes and the
query shape. We can do this without explicitly learning this manifold. As we
will demonstrate in our experimental results, the new learned distance function
is able to incorporate the knowledge of relevant and irrelevant shape differences.
It is learned in an unsupervised setting in the context of known shapes. For
example, if the database of known shapes contains shapes (a)-(e) in Fig. 2, then
the new learned distance function will rank correctly the shape in Fig. 1(a) as
more similar to (c) than to (b). The reason is that the new distance function will replace the original distance from (a) to (c) in Fig. 1 with a distance induced by the shortest path between (a) and (e) in Fig. 2.
In more general terms, even if the difference between shape A and shape C
is large, but there is a shape B which has small difference to both of them, we
still claim that shape A and shape C are similar to each other. This situation is
possible for most shape distances, since they do not obey the triangle inequality,
i.e., it is not true that d(A, C) ≤ d(A, B) + d(B, C) for all shapes A, B, C [10].
We propose a learning method to modify the original shape distance d(A, C).
If we have the situation that d(A, C) > d(A, B) + d(B, C) for some shapes
A, B, C, then the proposed method is able to learn a new distance d (A, C) such
that d (A, C) ≤ d(A, B) + d(B, C). Further, if there is a path in the distance
space such that d(A, C) > d(A, B1 ) + . . . + d(Bk , C), then our method learns
a new d (A, C) such that d (A, C) ≤ d(A, B1 ) + . . . + d(Bk , C). Since this path
represents a minimal distortion morphing of shape A to shape C, we are able to
ignore irrelevant shape differences, and consequently, we can focus on relevant
shape differences with the new distance d .
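To make the path interpretation above concrete, here is a small sketch that computes path-induced distances from the query by explicit shortest paths (Dijkstra) over the matrix of pairwise shape distances. This is only an illustration of the inequality; the method actually proposed in this paper learns the new distance by graph transduction (Sections 3-5) rather than by explicit shortest-path computation.

```python
import heapq

def path_induced_distance(D, query):
    """Shortest-path ('geodesic') distances from a query over a distance matrix D.

    D is an n x n array (or list of lists) of pairwise shape distances;
    'query' is the row index of the query shape.
    """
    n = len(D)
    dist = [float('inf')] * n
    dist[query] = 0.0
    heap = [(0.0, query)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist[u]:
            continue
        for v in range(n):
            nd = d_u + D[u][v]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist   # rank database shapes by this path-induced distance
```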
Our experimental results clearly demonstrate that the proposed method can
improve the retrieval results of the existing shape matching methods. We ob-
tained the retrieval rate of 91% on part B of the MPEG-7 Core Experiment
CE-Shape-1 data set [11], which is the highest ever bull’s eye score reported in
the literature. As the input to our method we used the IDSC, which has the
retrieval rate of 85.40% on the MPEG-7 data set [3]. Fig. 3 illustrates the ben-
efits of the proposed distance learning method. The first row shows the query
shape followed by the first 10 shapes retrieved using IDSC only. Only two flies
are retrieved among the first 10 shapes. The results of the learned distance for
the same query are shown in the second row. All of the top 10 retrieval results
Fig. 3. The first column shows the query shape. The remaining 10 columns show the
most similar shapes retrieved from the MPEG-7 data set. The first row shows the
results of IDSC [3]. The second row shows the results of the proposed learned distance.
are correct. The proposed method was able to learn that the shape differences
in the number of fly legs and their shapes are irrelevant. The remainder of this
paper is organized as follows. In Section 2, we briefly review some well-known
shape matching methods and the semi-supervised learning algorithms. Section 3
describes the proposed approach to learning shape distances. Section 4 relates
the proposed approach to the class of machine learning approaches called label
propagation. The problem of the construction of the affinity matrix is addressed
in Section 5. Section 6 gives the experimental results to show the advantage of
the proposed approach. Conclusion and discussion are given in Section 7.
2 Related Work
convex optimization problem. Bar-Hillel et al. [23] also use a weight matrix W to
estimate the distance by relevant component analysis (RCA). Athitsos et al. [24]
proposed a method called BoostMap to estimate a distance that approximates a
certain distance. Hertz’s work [25] uses AdaBoost to estimate a distance function
in a product space, whereas the weak classifier minimizes an error in the original
feature space. The focus of all these methods is the selection of a suitable distance from a given set of distance measures. Our method, in contrast, aims at improving the retrieval performance of a given distance measure.
We iterate steps (1) and (2) until the step t = T for which the change is
below a small threshold. We then rank the similarity to the query x1 with simT .
Our experimental results in Section 6 demonstrate that the replacement of the
original similarity measure sim with simT results in a significant increase in the
retrieval rate.
The steps (1) and (2) are used in label propagation, which is described in
Section 4. However, our goal and our setting are different. Although label prop-
agation is an instance of semi-supervised learning, we stress that we remain in
the unsupervised learning setting. In particular, we deal with the case of only
one known class, which is the class of the query object. This means, in particular,
that label propagation has a trivial solution in our case limt→∞ ft (xi ) = 1 for all
i = 1, . . . , n, i.e., all objects will be assigned the class label of the query shape.
Since our goal is ranking of the database objects according to their similarity to
the query, we stop the computation after a suitable number of iterations t = T .
As is the usual practice with iterative processes that are guaranteed to converge,
the computation is halted if the difference ||ft+1 − ft || becomes very small; see
Section 6 for details.
If the database of known objects is large, the computation with all n objects
may become impractical. Therefore, in practice, we construct the matrix w using
only the first M < n most similar objects to the query x1 sorted according to
the original distance function sim.
where Pij is the probability of transition from node i to node j. Also define an l × C label matrix YL, whose ith row is an indicator vector for yi, i ∈ L: Yic = δ(yi, c).
The label propagation computes soft labels f for nodes, where f is a n×C matrix
whose rows can be interpreted as the probability distributions over labels. The
initialization of f is not important. The label propagation algorithm is as follows:
1. Initially, set f(xi) = yi for i = 1, . . . , l and set f(xj) arbitrarily (e.g., 0) for xj ∈ Xu.
2. Repeat until convergence: set f(xi) = Σ_{j=1}^{n} wij f(xj) / Σ_{j=1}^{n} wij for all xi ∈ Xu, and set f(xi) = yi for i = 1, . . . , l (the labeled objects should be fixed).
In step 1, all nodes propagate their labels to their neighbors for one step. Step 2 is
critical, since it ensures persistent label sources from the labeled data. Hence, instead of letting the initial labels fade away, we fix the labeled data. This constant push from labeled nodes helps to push the class boundaries through high density
regions so that they can settle in low density gaps. If this structure of data fits
the classification goal, then the algorithm can use unlabeled data to improve
learning.
Let f = (fL; fU), where fL stacks the rows of the labeled objects and fU the rows of the unlabeled ones. Since fL is fixed to YL, we are solely interested in fU. The matrix P is split into labeled and unlabeled sub-matrices:

P = [ PLL  PLU
      PUL  PUU ]                                   (5)
As proven in [9] the label propagation converges, and the solution can be com-
puted in closed form using matrix algebra:
fU = (I − PUU)^(−1) PUL YL   (6)
However, as the label propagation requires all classes be present in the labeled
data, it is not suitable for shape retrieval. As mentioned in Section 3, for shape
retrieval, the query shape is considered as the only labeled data and all other
shapes are the unlabeled data. Moreover, the graph among all of the shapes is
fully connected, which means the label could be propagated on the whole graph.
If we iterate the label propagation an infinite number of times, all of the data will have the
same label, which is not our goal. Therefore, we stop the computation after a
suitable number of iterations t = T .
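The update steps (1) and (2) referred to earlier are not reproduced in this excerpt, so the sketch below follows the label-propagation-style update just described: the query is kept clamped as the single label source and the iteration is stopped after T steps rather than run to its trivial fixed point. The exact normalization is an assumption of this sketch.

```python
import numpy as np

def rerank_by_transduction(W, T=5000):
    """Re-rank database objects against the query by iterated similarity propagation.

    W is an (M+1) x (M+1) affinity matrix over the query (index 0) and its M most
    similar database objects, e.g. built with the adaptive kernel of Section 5.
    """
    W = np.asarray(W, dtype=float)
    P = W / W.sum(axis=1, keepdims=True)    # row-normalized transition matrix
    f = np.zeros(len(W))
    f[0] = 1.0                              # the query is the single label source
    for _ in range(T):
        f = P @ f
        f[0] = 1.0                          # keep the query clamped, as in label propagation
    return f                                # f[i] plays the role of sim_T(x_1, x_i)
```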
Previous research has shown that the propagation results depend heavily on the selection of the kernel size σij [17]. In [15], a method to learn the proper σij for the kernel is introduced, which has excellent performance. However, it cannot be learned when only few labeled data are available. In shape retrieval, since only the query shape has a label, the learning of σij is not applicable. In our experiments, we use an
adaptive kernel size based on the mean distance to K-nearest neighborhoods [28]:
where mean({knnd(xi), knnd(xj)}) is the mean of the K-nearest-neighbor distances of the samples xi and xj, and C is an extra parameter. Both K and C are determined empirically.
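The displayed equation for the affinity is missing in this excerpt; the sketch below assumes the common Gaussian form wij = exp(−d(xi, xj)² / σij²) with σij = C · mean({knnd(xi), knnd(xj)}), which matches the surrounding description but should be checked against the original paper.

```python
import numpy as np

def adaptive_affinity(D, K=10, C=0.25):
    """Affinity matrix with an adaptive kernel size in the spirit of [28].

    D is an n x n matrix of pairwise distances with zero diagonal.
    sigma_ij = C * mean of the K-NN distances of x_i and x_j (assumed form).
    """
    D = np.asarray(D, dtype=float)
    knnd = np.sort(D, axis=1)[:, 1:K + 1].mean(axis=1)   # mean K-NN distance per sample
    sigma = C * 0.5 * (knnd[:, None] + knnd[None, :])    # mean({knnd(x_i), knnd(x_j)}) * C
    return np.exp(-(D ** 2) / (sigma ** 2 + 1e-12))
```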
6 Experimental Results
In this section, we show that the proposed approach can significantly improve
retrieval rates of existing shape similarity methods.
Fig. 4. (a) A comparison of retrieval rates between IDSC [3] (blue circles) and the
proposed method (red stars) for MPEG-7. (b) A comparison of retrieval rates between
visual parts in [4] (blue circles) and the proposed method (red stars) for MPEG-7. (c)
Retrieval accuracy of DTW (blue circles) and the proposed method (red stars) for the
Face (all) dataset.
Table 1. Retrieval rates (bull’s eye) of different methods on the MPEG-7 data set
simT for only those 300 shapes. Here we assume that all relevant shapes will be
among the 300 most similar shapes. Thus, by using a larger affinity matrix we
can improve the retrieval rate but at the cost of computational efficiency.
In addition to the statistics presented in Fig. 4, Fig. 5 illustrates also that
the proposed approach improves the performance of IDSC. A very interesting
case is shown in the first row, where for IDSC only one result is correct for the
query octopus. It instead retrieves nine apples as the most similar shapes. Since
the query shape of the octopus is occluded, IDSC ranks it as more similar to an
apple than to the octopus. In addition, since IDSC is invariant to rotation, it
confuses the tentacles with the apple stem. Even in the case of only one correct
shape, the proposed method learns that the difference between the apple stem is
relevant, although the tentacles of the octopuses exhibit a significant variation
in shape. We restate that this is possible because the new learned distances are
induced by geodesic paths in the shape manifold spanned by the known shapes.
Consequently, the learned distances retrieve nine correct shapes. The only wrong
result is the elephant, whose nose and legs are similar to the tentacles of
the octopus.
As shown in the third row, six of the top ten IDSC retrieval results for the lizard are wrong, since IDSC cannot capture the relevant differences between lizards and sea snakes. All retrieval results are correct for the new learned distances, since the proposed method is able to learn the irrelevant differences among lizards and the relevant differences between lizards and sea snakes. For the results of deer
(fifth row), three of the top ten retrieval results of IDSC are horses. Compared
Fig. 5. The first column shows the query shape. The remaining 10 columns show the
most similar shapes retrieved by IDSC (odd row numbers) and by our method (even
row numbers).
to it, the proposed method (sixth row) eliminates all of the wrong results so that
only deer are in the top ten results. It appears to us that our new method learned
to ignore the irrelevant small shape details of the antlers. Therefore, the presence
of the antlers became a relevant shape feature here. The situation is similar for
the bird and hat, with three and four wrong retrieval results respectively for
IDSC, which are eliminated by the proposed method.
An additional explanation of the learning mechanism of the proposed method
is provided by examining the count of the number of violations of the triangle
inequality that involve the query shape and the database shapes. In Fig. 6(a),
the curve shows the number of triangle inequality violations after each iteration
of our distance learning algorithm. The number of violations is reduced signif-
icantly after the first few hundred iterations. We cannot expect the number of
violations to be reduced to zero, since cognitively motivated shape similarity may
sometimes require triangle inequality violations [10]. Observe that the curve in
Fig. 6(a) correlates with the plot of differences ||ft+1 − ft || as a function of t
shown in (b). In particular, both curves decrease very slowly after about 1000
Fig. 6. (a) The number of triangle inequality violations per iteration. (b) Plot of dif-
ferences ||ft+1 − ft || as a function of t.
Table 2. Retrieval results on the Kimia data set [30]: the number of shapes from the same class among the top 1 to 10 retrieved shapes (the best possible result for each rank is 99).

Algorithm        1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
SC [30]           97  91  88  85  84  77  75  66  56  37
Shock Edit [30]   99  99  99  98  98  97  96  95  93  82
IDSC+DP [3]       99  99  99  98  98  97  97  98  94  79
Shape Tree [5]    99  99  99  99  99  99  99  97  93  86
Our method        99  99  99  99  99  99  99  99  97  99
iterations, and at 5000 iterations they are nearly constant. Therefore, we se-
lected T = 5000 as our stop condition. Since the situation is very similar in all
our experiments, we always stop after T = 5000 iterations.
Besides MPEG-7, we also present experimental results on the Kimia data set [30]. The database contains 99 shapes grouped into nine classes. As the database only contains 99 shapes, we calculate the affinity matrix based on all of the shapes in the database. The parameters used to calculate the affinity matrix
are: C = 0.25 and the neighborhood size is K = 4. We changed the neighborhood
size, since the data set is much smaller than the MPEG-7 data set. The retrieval
results are summarized as the number of shapes from the same class among the
first top 1 to 10 shapes (the best possible result for each of them is 99). Table 2
lists the numbers of correct matches of several methods. Again we observe that
our approach improves IDSC significantly, and it yields a nearly perfect
retrieval rate.
Besides the inner distance shape context [3], we also demonstrate that the pro-
posed approach can improve the performance of visual parts shape similarity [4].
We select this method since it is based on a very different approach from IDSC.
In [4], in order to compute the similarity between shapes, first the best possible
correspondence of visual parts is established (without explicitly computing the
visual parts). Then, the similarity between corresponding parts is calculated and
aggregated. The settings and parameters of our experiment are the same as for
IDSC as reported in the previous section except we set C = 0.4. The accuracy
of this method has been increased from 76.45% to 86.69% on the MPEG-7 data
set, which is more than 10%. This makes the improved visual part method one
of the top scoring methods in Table 1. A detailed comparison of the retrieval
accuracy is given in Fig. 4(b).
will focus on addressing this problem. We also observe that our method is not
limited to 2D shape similarity but can also be applied to 3D shape retrieval,
which will also be part of our future work.
Acknowledgements
We would like to thank Eamonn Keogh for providing us the Face (all) dataset.
This work was support in part by the NSF Grant No. IIS-0534929 and by the
DOE Grant No. DE-FG52-06NA27508.
References
1. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using
shape contexts. IEEE Trans. PAMI 24(4), 509–522 (2002)
2. Tu, Z., Yuille, A.L.: Shape matching and recognition - using generative models
and informative features. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS,
vol. 3024, pp. 195–209. Springer, Heidelberg (2004)
3. Ling, H., Jacobs, D.: Shape classification using the inner-distance. IEEE Trans.
PAMI 29, 286–299 (2007)
4. Latecki, L.J., Lakämper, R.: Shape similarity measure based on correspondence of
visual parts. IEEE Trans. PAMI 22(10), 1185–1190 (2000)
5. Felzenszwalb, P.F., Schwartz, J.: Hierarchical matching of deformable shapes. In:
CVPR (2007)
6. McNeill, G., Vijayakumar, S.: Hierarchical procrustes matching for shape retrieval.
In: Proc. CVPR (2006)
7. Bai, X., Latecki, L.J.: Path similarity skeleton graph matching. IEEE Trans.
PAMI 30, 1282–1292 (2008)
8. Srivastava, A., Joshi, S.H., Mio, W., Liu, X.: Statistic shape analysis: clustering,
learning, and testing. IEEE Trans. PAMI 27, 590–602 (2005)
9. Zhu, X.: Semi-supervised learning with graphs. In: Doctoral Dissertation. Carnegie
Mellon University, CMU–LTI–05–192 (2005)
10. Vleugels, J., Veltkamp, R.: Efficient image retrieval through vantage objects. Pat-
tern Recognition 35(1), 69–80 (2002)
11. Latecki, L.J., Lakämper, R., Eckhardt, U.: Shape descriptors for non-rigid shapes
with a single closed contour. In: CVPR, pp. 424–429 (2000)
12. Brefeld, U., Buscher, C., Scheffer, T.: Multiview dicriminative sequential learning.
In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML
2005. LNCS (LNAI), vol. 3720. Springer, Heidelberg (2005)
13. Lawrence, N.D., Jordan, M.I.: Semi-supervised learning via gaussian processes. In:
NIPS (2004)
14. Joachims, T.: Transductive inference for text classification using support vector
machines. In: ICML, pp. 200–209 (1999)
15. Zhu, X., Ghahramani, Z., Lafferty., J.: Semi-supervised learning using gaussian
fields and harmonic functions. In: ICML (2003)
16. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Scholkopf., B.: Learning with local
and global consistency. In: NIPS (2003)
17. Wang, F., Wang, J., Zhang, C., Shen., H.: Semi-supervised classification using
linear neighborhood propagation. In: CVPR (2006)
18. Zhou, D., Weston, J.: Ranking on data manifolds. In: NIPS (2003)
19. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear em-
bedding. Science 290, 2323–2326 (2000)
20. Fan, X., Qi, C., Liang, D., Huang, H.: Probabilistic contour extraction using hier-
archical shape representation. In: Proc. ICCV, pp. 302–308 (2005)
21. Yu, J., Amores, J., Sebe, N., Radeva, P., Tian, Q.: Distance learning for similarity
estimation. IEEE Trans. PAMI 30, 451–462 (2008)
22. Xing, E., Ng, A., Jordanand, M., Russell, S.: Distance metric learning with appli-
cation to clustering with side-information. In: NIPS, pp. 505–512 (2003)
23. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions
using equivalence relations. In: ICML, pp. 11–18 (2003)
24. Athitsos, V., Alon, J., Sclaroff, S., Kollios, G.: BoostMap: A method for efficient
approximate similarity rankings. In: CVPR (2004)
25. Hertz, T., Bar-Hillel, A., Weinshall, D.: Learning distance functions for image
retrieval. In: CVPR, pp. 570–577 (2004)
26. Zelnik-Manor, L., Perona, P.: Self-tuning spectral clustering. In: NIPS (2004)
27. Hein, M., Maier, M.: Manifold denoising. In: NIPS (2006)
28. Wang, J., Chang, S.F., Zhou, X., Wong, T.C.S.: Active microscopic cellular image
annotation by superposable graph transduction with imbalanced labels. In: CVPR
(2008)
29. Mokhtarian, F., Abbasi, F., Kittler, J.: Efficient and robust retrieval by shape
content through curvature scale space. In: Smeulders, A.W.M., Jain, R. (eds.)
Image Databases and Multi-Media Search, pp. 51–58 (1997)
30. Sebastian, T.B., Klein, P.N., Kimia, B.: Recognition of shapes by editing their
shock graphs. IEEE Trans. PAMI 25, 116–125 (2004)
31. Keogh, E.: UCR time series classification/clustering page,
http://www.cs.ucr.edu/~eamonn/time_series_data/
32. Ratanamahatana, C.A., Keogh, E.: Three myths about dynamic time warping. In:
SDM, pp. 506–510 (2005)
Cat Head Detection - How to Effectively Exploit Shape
and Texture Features
W. Zhang, J. Sun, and X. Tang
Abstract. In this paper, we focus on the problem of detecting the head of cat-like
animals, adopting cat as a test case. We show that the performance depends cru-
cially on how to effectively utilize the shape and texture features jointly. Specifi-
cally, we propose a two step approach for the cat head detection. In the first step,
we train two individual detectors on two training sets. One training set is normal-
ized to emphasize the shape features and the other is normalized to underscore
the texture features. In the second step, we train a joint shape and texture fusion
classifier to make the final decision. We demonstrate that a significant improve-
ment can be obtained by our two step approach. In addition, we also propose a set
of novel features based on oriented gradients, which outperforms existing leading
features, e. g., Haar, HoG, and EoH. We evaluate our approach on a well labeled
cat head data set with 10,000 images and PASCAL 2007 cat data.
1 Introduction
Automatic detection of all generic objects in a general scene is a long term goal in im-
age understanding and remains to be an extremely challenging problem duo to large
intra-class variation, varying pose, illumination change, partial occlusion, and cluttered
background. However, researchers have recently made significant progresses on a par-
ticularly interesting subset of object detection problems, face [14,18] and human detec-
tion [1], achieving near 90% detection rate on the frontal face in real-time [18] using
a boosting based approach. This inspires us to consider whether the approach can be
extended to a broader set of object detection applications.
Obviously it is difficult to use the face detection approach on generic object detection
such as tree, mountain, building, and sky detection, since they do not have a relatively
fixed intra-class structure like human faces. To go one step at a time, we need to limit the objects to ones that share somewhat similar properties with the human face. If we can succeed on such objects, we can then consider going beyond. Naturally, the closest thing to the human face on this planet is the animal head. Unfortunately, even for animal heads, given
the huge diversity of animal types, it is still too difficult to try on all animal heads. This
is probably why we have seen few works on this attempt.
In this paper, we choose to be conservative and limit our endeavor to only one type
of animal head detection, cat head detection. This is of course not a random selection.
Our motivations are as follows. First, cat can represent a large category of cat-like an-
imals, as shown in Figure 1 (a). These animals share similar face geometry and head
shape; Second, people love cats. A large amount of cat images have been uploaded and
shared on the web. For example, 2,594,329 cat images had been manually annotated
in flickr.com by users. Cat photos are among the most popular animal photos on the
internet. Also, cat as a popular pet often appears in family photos. So cat detection can
find applications in both online image search and offline family photo annotation, two
important research topics in pattern recognition. Third, given the popularity of cat pho-
tos, it is easy for us to get training data. The research community does need large and
challenging data set to evaluate the advances of the object detection algorithm. In this
paper, we provide 10,000 well-labeled cat images. Finally, and most importantly, the cat
head detection poses new challenges for object detection algorithms. Although the cat head shares some similar properties with the human face, so that we can utilize some existing techniques, it has much larger intra-class variation than the human face, as shown in Figure 1 (b), and is thus more difficult to detect.
Directly applying the existing face detection approaches to detect the cat head has
apparent difficulties. First, the cat face has larger appearance variations compared with
the human face. The textures on the cat face are more complicated than those on the
human face. It requires more discriminative features to capture the texture information.
Second, the cat head has a globally similar, but locally variant shape or silhouette. How
to effectively make use of both texture and shape information is a new challenging issue.
It requires a different detection strategy.
To deal with the new challenges, we propose a joint shape and texture detection ap-
proach and a set of new features based on oriented gradients. Our approach is a two step
approach. In the first step, we individually train a shape detector and a texture detector
to exploit the shape and appearance information respectively. Figure 2 illustrates our
basic idea. Figure 2 (a) and Figure 2 (c) are two mean cat head images over all training
images: one aligned by ears to make the shape distinct; the other is aligned to reveal the
texture structures. Correspondingly, the shape and texture detectors are trained on two
differently normalized training sets. Each detector can make full use of most discrimi-
native shape or texture features separately. Based on a detailed study of previous image
and gradient features, e.g., Haar [18], HoG [1], EOH [7], we show that a new set of
Fig. 2. Mean cat head images on all training data. (a) aligned by ears. More shape information is
kept. (b) aligned by both eyes and ears using an optimal rotation+scale transformation. (c) aligned
by eyes. More texture information is kept.
carefully designed Haar-like features on oriented gradients give the best performance
in both shape and texture detectors.
In the second step, we train a joint shape and texture detector to fuse the outputs
of the above two detectors. We experimentally demonstrate that the cat head detection
performance can be substantially improved by carefully separating shape and texture
information in the first step, and jointly training a fusion classifier in the second step.
normalization. In contrast, the gradient features are more robust to illumination changes.
The gradient features are extracted from the edge map [4,3] or oriented gradients, which
mainly include SIFT [8], EOH [7], HOG [1], covariance matrix[17], shapelet [15], and
edgelet [19]. Tuzel et al. [17] demonstrated very good results on human detection using
the covariance matrix of pixel’s 1st and 2nd derivatives and pixel position as features.
Shapelet [15] feature is a weighted combination of weak classifiers in a local region. It
is trained specifically to distinguish between the two classes based on oriented gradients
from the sub-window. We will give a detailed comparison of our proposed features with
HOG and EOH features in Section 3.1.
In the detection phase, we first run the shape and texture detectors independently.
Then, we apply the joint shape and texture fusion classifier to make the final decision.
Specifically, we denote {cs , ct } as output scores or confidences of the two detectors,
and {fs , ft } as extracted features in two detected sub-windows. The fusion classifier is
trained on the concatenated features {cs , ct , fs , ft }.
Using two detectors, there are three kinds of detection results: both detectors re-
port positive at roughly the same location, rotation, and scale; only the shape detector
reports positive; and only the texture detector reports positive. For the first case, we
directly construct the features {cs , ct , fs , ft } for the joint fusion classifier. In the sec-
ond case, we do not have {ct , ft }. To handle this problem, we scan the surrounding
locations to pick a sub-window with the highest scores by the texture detector, as il-
lustrated in Figure 3. Specifically, we denote the sub-window reported by the detector
as [x, y, w, h, s, θ], where (x, y) is window’s center, w, h are width and height, and s, θ
are scale and rotation level. We search sub-windows for the texture/shape detector in
the range [x ± w/4] × [y ± h/4] × [s ± 1] × [θ ± 1]. Note that we use the real-valued score of the texture detector and do not make a 0-1 decision. The score and features of the picked sub-window are used as {ct , ft }. For the last case, we compute {cs , fs } in a similar way.
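As a concrete illustration of this fusion step, the sketch below shows how the surrounding search and the concatenated fusion feature could be implemented. It is not the authors' code: the `detector.score` interface and the coarse three-sample scan of each search interval are assumptions made for this example.

```python
import numpy as np

def best_surrounding_window(detector, image, win, w, h):
    """Pick the sub-window with the highest real-valued detector score in the
    neighborhood [x +/- w/4] x [y +/- h/4] x [s +/- 1] x [theta +/- 1] of a
    detected window `win` = (x, y, s, theta). Only the ends and center of each
    interval are sampled here; `detector.score(image, window)` is an assumed API."""
    x, y, s, theta = win
    best_score, best_win = -np.inf, None
    for dx in (-w // 4, 0, w // 4):
        for dy in (-h // 4, 0, h // 4):
            for ds in (-1, 0, 1):
                for dt in (-1, 0, 1):
                    cand = (x + dx, y + dy, s + ds, theta + dt)
                    score = detector.score(image, cand)  # real-valued confidence, no 0-1 decision
                    if score > best_score:
                        best_score, best_win = score, cand
    return best_score, best_win

def fusion_feature(cs, fs, ct, ft):
    """Concatenate the two scores and the two HOG feature vectors {cs, ct, fs, ft}
    into the input of the joint shape-and-texture fusion classifier."""
    return np.concatenate(([cs, ct], np.asarray(fs), np.asarray(ft)))
```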
To train the fusion classifier, 2,000 cat head images in the validation set are used as
the positive samples, and 4,000 negative samples are bootstrapped from 10,000 non-cat
images. The positive samples are constructed as usual. The key is the construction of the
negative samples which consist of all incorrectly detected samples by either the shape
detector or the texture detector in the non-cat images. The co-occurrence relationship between the shape features and the texture features is learned through this kind of joint training. The learned fusion classifier is able to effectively reject many false alarms by using both shape and texture information. We use a support vector machine (SVM) as the fusion classifier and HOG descriptors as the representations of the features fs and ft.
The novelty of our approach lies in the discovery that the shape and texture features need to be separated, and in how to separate them effectively. The experimental results presented later clearly validate the superiority of our joint shape and texture detection. Although the fusion method may appear simple at first glance, this is exactly the strength of our approach: a simple fusion method already works far better than previous non-fusion approaches.
Fig. 3. Feature extraction for fusion. (a) Given a sub-window (left) detected by the shape detector, we search for the sub-window (right, solid line) with the highest score of the texture detector in the surrounding region (right, dashed line). The score and features {ct , ft } are extracted for the fusion classifier. (b) Similarly, we extract the score and features {cs , fs } for the fusion.
To effectively capture both shape and texture information, we propose a set of new
features based on oriented gradients.
where Gh and Gv are the horizontal and vertical derivative filters, and ⊗ is the convolution operator.
bank of oriented gradients {gok }K →
−
k=1 are constructed by quantifying the gradient g (x)
on a number of K orientation bins:
|−
→g (x)| θ(x) ∈ bink
go (x) =
k
, (2)
0 otherwise
where $\theta(x)$ is the orientation of the gradient $\vec{g}(x)$. We call the image $g_o^k$ an oriented gradients channel. Figure 4 shows the oriented gradients of a cat head image. In this example, we quantize the orientation into four directions. We also denote the sum of the oriented gradients over a given rectangular region R as:
\[
S^k(R) = \sum_{x \in R} g_o^k(x). \qquad (3)
\]
It can be very efficiently computed in a constant time using integral image technique [18].
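As an illustration of Eqs. (2) and (3), a minimal NumPy sketch is given below. It assumes unsigned orientations quantized into K bins and uses central differences in place of the unspecified Gh, Gv filters; it is not the authors' implementation.

```python
import numpy as np

def oriented_gradient_channels(image, K=4):
    """Quantize the per-pixel gradient into K orientation bins (Eq. 2).
    Returns a (K, H, W) array of oriented gradient channels g_o^k."""
    gv, gh = np.gradient(image.astype(np.float64))   # vertical, horizontal derivatives
    mag = np.hypot(gh, gv)
    theta = np.mod(np.arctan2(gv, gh), np.pi)        # unsigned orientation in [0, pi)
    bins = np.minimum((theta / np.pi * K).astype(int), K - 1)
    channels = np.zeros((K,) + image.shape)
    for k in range(K):
        channels[k][bins == k] = mag[bins == k]
    return channels

def integral_images(channels):
    """One integral image per channel, so S^k(R) (Eq. 3) costs constant time."""
    return np.cumsum(np.cumsum(channels, axis=1), axis=2)

def region_sum(ii_k, top, left, bottom, right):
    """S^k(R) for R = rows [top, bottom) x cols [left, right), from channel k's integral image."""
    total = ii_k[bottom - 1, right - 1]
    if top > 0:
        total -= ii_k[top - 1, right - 1]
    if left > 0:
        total -= ii_k[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii_k[top - 1, left - 1]
    return total
```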
Since the gradient information at an individual pixel is limited and sensitive to noise, most previous works aggregate the gradient information over a rectangular region to form more informative, mid-level features. Here, we review the two most successful features: HOG and EOH.
HOG-cell. The basic unit of the HOG descriptor is the weighted orientation histogram of a "cell", a small spatial region, e.g., 8 × 8 pixels, which can be represented as the vector of oriented gradient sums [S^1(C), S^2(C), ..., S^K(C)] over the cell C. The overlapping cells are grouped and normalized to form a larger spatial region called a "block". The concatenated histograms form the HOG descriptor.
In Dalal and Triggs’s human detection system [1], a linear SVM is used to classify
a 64 × 128 detection window consisting of multiple overlapped 16 × 16 blocks. To
achieve near real-time performance, Zhu et al. [21] used HOGs of variable-size blocks
in the boosting framework .
EOH. Levi and Weiss [7] proposed three kinds of features on the oriented gradients:
\[
\mathrm{EOH}_1(R, k_1, k_2) = \frac{S^{k_1}(R) + \epsilon}{S^{k_2}(R) + \epsilon}, \qquad
\mathrm{EOH}_2(R, k) = \frac{S^{k}(R) + \epsilon}{\sum_j \bigl(S^{j}(R) + \epsilon\bigr)},
\]
where $\bar{R}$ is the region symmetric to R with respect to the vertical center of the detection window, and $\epsilon$ is a small smoothing constant. The first two features capture whether one orientation is dominant or not, and the last feature is used to detect symmetry or the absence of symmetry. Note that using EOH features alone may be insufficient: in [7], good results are achieved by combining EOH features with Haar features on the image intensity.
Fig. 5. Haar of Oriented Gradients. Left: in-channel features. Right: orthogonal features.
Taking a close look at Figure 4, we may notice many local patterns in each oriented gradients channel, which is sparser and clearer than the original image. The gradient filter separates textures and edges of different orientations into several channels, greatly simplifying the pattern structure within each channel. It is therefore possible to extract Haar features from each channel to capture these local patterns. For example, in the horizontal gradient map in Figure 4, the vertical textures between the two eyes are effectively filtered out, so we can easily capture the two-eye pattern using Haar features. In addition to capturing local patterns within a channel, we can also capture local patterns across two different channels using Haar-like operations. In this paper, we propose two kinds of features as follows:
In-channel features
\[
\mathrm{HOOG}_1(R_1, R_2, k) = \frac{S^{k}(R_1) - S^{k}(R_2)}{S^{k}(R_1) + S^{k}(R_2)}. \qquad (5)
\]
These features measure the relative gradient strength between two regions R1 and R2 in the same orientation channel. The denominator plays a normalization role since we do not normalize S^k(R).
Orthogonal-channel features
\[
\mathrm{HOOG}_2(R_1, R_2, k, k^*) = \frac{S^{k}(R_1) - S^{k^*}(R_2)}{S^{k}(R_1) + S^{k^*}(R_2)}, \qquad (6)
\]
where k* is the orientation orthogonal to k, i.e., k* = k + K/2. These features are similar to the in-channel features but operate on two orthogonal channels. In principle, we could define these features on any two orientations, but we compute only the orthogonal-channel features based on two considerations: 1) orthogonal channels usually contain the most complementary information, while the information in two channels with similar orientations is mostly redundant; 2) we want to keep the size of the feature pool small. AdaBoost is a sequential, "greedy" algorithm for feature selection; if the feature pool contains too many uninformative features, the overall performance may suffer. In practice, all features have to be loaded into main memory for efficient training, so we must be very careful about enlarging the feature pool.
Considering all combinations of R1 and R2 would be intractable. Based on the success of Haar features, we use Haar patterns for R1 and R2, as shown in Figure 5. We call the features defined in (5) and (6) Haar of Oriented Gradients (HOOG).
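A minimal sketch of the two HOOG features in Eqs. (5) and (6) is shown below. It assumes a helper `S(R, k)` that returns the oriented-gradient sum S^k(R), for example via the integral images sketched earlier; the small constant added to the denominator is our own guard against empty regions, not part of the paper.

```python
def hoog1(S, R1, R2, k):
    """In-channel Haar-of-Oriented-Gradients feature (Eq. 5)."""
    a, b = S(R1, k), S(R2, k)
    return (a - b) / (a + b + 1e-12)

def hoog2(S, R1, R2, k, K):
    """Orthogonal-channel feature (Eq. 6); k* = k + K/2 is the orthogonal orientation."""
    k_star = (k + K // 2) % K
    a, b = S(R1, k), S(R2, k_star)
    return (a - b) / (a + b + 1e-12)
```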
4 Experimental Results
4.1 Data Set and Evaluation Methodology
Our evaluation data set consists of two parts: the first part is our own data, which includes 10,000 cat images mainly obtained from flickr.com; the second part is the PASCAL 2007 cat data, which includes 679 cat images. Most of our own cat data are in near-frontal view. Each cat head is manually labeled with 9 points: two for the eyes, one for the mouth, and six for the ears, as shown in Figure 6. We randomly divide our own cat face images
into three sets: 5,000 for training, 2,000 for validation, and 3,000 for testing. We follow the original PASCAL 2007 splits into training, validation, and testing sets for the cat data. Our cat images can be downloaded from http://mmlab.ie.cuhk.edu.hk/ for research purposes.
We use an evaluation methodology similar to that of the PASCAL challenge for object detection. Suppose the ground-truth rectangle and the detected rectangle are rg and rd, and their areas are Ag and Ad. We say a cat head is correctly detected only
when the overlap of rg and rd is larger than 50%:
\[
D(r_g, r_d) = \begin{cases} 1, & \text{if } \dfrac{A_g \cap A_d}{A_g \cup A_d} > 50\% \\ 0, & \text{otherwise} \end{cases} \qquad (7)
\]
where D(rg , rd ) is a function used to calculate detection rate and false alarm rate.
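For reference, Eq. (7) amounts to the familiar intersection-over-union test; a small sketch (with rectangles given as (x1, y1, x2, y2), an assumed convention) is:

```python
def detection_correct(rg, rd):
    """Eq. 7: a detection rd is correct if the intersection-over-union with the
    ground-truth rectangle rg exceeds 50%. Rectangles are (x1, y1, x2, y2)."""
    ix1, iy1 = max(rg[0], rd[0]), max(rg[1], rd[1])
    ix2, iy2 = min(rg[2], rd[2]), min(rg[3], rd[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((rg[2] - rg[0]) * (rg[3] - rg[1])
             + (rd[2] - rd[0]) * (rd[3] - rd[1]) - inter)
    if union <= 0:
        return 0
    return 1 if inter / union > 0.5 else 0
```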
Fig. 7. Recall versus false alarm count for the four kinds of features (Haar, Haar+EOH, HOG, and our feature), evaluated with the shape detector (left) and the texture detector (right).
Figure 7 shows the performances of the four kinds of features. The Haar feature on
intensity gives the poorest performance because of large shape and texture variations
of the cat head. With the help of oriented gradient features, Haar + EOH improves the
performance. As one can expect, the HOG features perform better on the shape detector
than on the texture detector. Using both in-channel and orthogonal-channel information,
the detectors based on our features produce the best results.
Fig. 8. Best features learned by AdaBoost. Left (shape detector): (a) best Haar feature on image intensity; (b) best in-channel feature; (c) best orthogonal-channel feature on orientations 60° and 150°. Right (texture detector): (d) best Haar feature on image intensity; (e) best in-channel feature; (f) best orthogonal-channel feature on orientations 30° and 120°.
In Figure 8, we show the best in-channel features in (b) and (e), and the best
orthogonal-channel features in (c) and (f), learned by two detectors. We also show the
best Haar features on image intensity in Figure 8 (a) and (d). In both detectors, the best
in-channel features capture the strength differences between a region with strongest
horizontal gradients and its neighboring region. The best orthogonal-channel features
capture the strength differences in two orthogonal orientations.
In the next experiment we investigate the role of in-channel features and orthogonal-
channel features. Figure 9 shows the performances of the detector using in-channel
features only, orthogonal-channel features only, and both kinds of features. Not surpris-
ingly, both features are important and complementary.
Fig. 9. Precision-recall curves of the (a) shape detector and (b) texture detector using in-channel features only, orthogonal-channel features only, and both kinds of features.
Fig. 10. Recall versus false alarm count for four configurations: Shape, Texture, Optimal Align, and Shape+Texture.
(a) Competition 3 (b) Competition 4
Fig. 11. Experiments on PASCAL 2007 cat data. (a) our approach and best reported method on
Competition 3 (specified training data). (b) four detectors on Competition 4 (arbitrary training
data).
Figure 12 gives some detection examples having variable appearance, head shape,
illumination, and pose.
Fig. 12. Detection results. The bottom row shows some detected cats in PASCAL 2007 data.
same training data. The APs of the four detectors (ours, HOG, Haar+EOH, Haar) are 0.632, 0.427, 0.401, and 0.357. Using the larger training data, the detection performance is significantly improved; for example, the precision improves from 0.40 to 0.91 at a fixed recall of 0.4. Note that the PASCAL 2007 cat data treat the whole cat body as the object, and only a small fraction of the data contain near-frontal cat faces. Nevertheless, our approach still achieves reasonably good results (AP = 0.632) on this very challenging data (the best reported method's AP = 0.24).
secondly. The texture and shape detectors also greatly benefit from the new set of oriented gradient features. Although we focus on the cat head detection problem in this paper, our approach can be extended to detect other categories of animals. In the future, we plan to extend our approach to multi-view cat head detection and to more animal categories. We are also interested in exploiting other contextual information, such as the presence of the animal body, to further improve the performance.
References
1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, vol. 1,
pp. 886–893 (2005)
2. Everingham, M., van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL Visual
Object Classes Challenge (VOC 2007) Results (2007),
http://www.pascal-network.org/challenges/VOC/voc2007/
workshop/index.html
3. Felzenszwalb, P.F.: Learning models for object recognition. In: CVPR, vol. 1, pp. 1056–1062
(2001)
4. Gavrila, D.M., Philomin, V.: Real-time object detection for smart vehicles. In: CVPR, vol. 1,
pp. 87–93 (1999)
5. Heisele, B., Serre, T., Pontil, M., Poggio, T.: Component-based face detection. In: CVPR,
vol. 1, pp. 657–662 (2001)
6. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: CVPR,
vol. 1, pp. 878–885 (2005)
7. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: the impor-
tance of good features. In: CVPR, vol. 2, pp. 53–60 (2004)
8. Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, vol. 2, pp.
1150–1157 (1999)
9. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic
assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS,
vol. 3021, pp. 69–82. Springer, Heidelberg (2004)
10. Mohan, A., Papageorgiou, C., Poggio, T.: Example-based object detection in images by com-
ponents. IEEE Trans. Pattern Anal. Machine Intell. 23(4), 349–361 (2001)
11. Munder, S., Gavrila, D.M.: An experimental study on pedestrian classification. IEEE Trans.
Pattern Anal. Machine Intell. 28(11), 1863–1868 (2006)
12. Papageorgiou, C., Poggio, T.: A trainable system for object detection. Intl. Journal of Com-
puter Vision 38(1), 15–33 (2000)
13. Ronfard, R., Schmid, C., Triggs, B.: Learning to parse pictures of people. In: ECCV, vol. 4,
pp. 700–714 (2004)
14. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Trans.
Pattern Anal. Machine Intell. 20(1), 23–38 (1998)
15. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: CVPR
(2007)
16. Schneiderman, H., Kanade, T.: A statistical method for 3d object detection applied to faces
and cars. In: CVPR, vol. 1, pp. 746–751 (2000)
17. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds.
In: CVPR (2007)
18. Viola, P., Jones, M.J.: Robust real-time face detection. Intl. Journal of Computer Vision 57(2),
137–154 (2004)
19. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by
Bayesian combination of edgelet part detectors. In: ICCV, vol. 1, pp. 90–97 (2005)
20. Xiao, R., Zhu, H., Sun, H., Tang, X.: Dynamic cascades for face detection. In: ICCV, vol. 1,
pp. 1–8 (2007)
21. Zhu, Q., Avidan, S., Yeh, M.-C., Cheng, K.-T.: Fast human detection using a cascade of
histograms of oriented gradients. In: CVPR, vol. 2, pp. 1491–1498 (2006)
Motion Context: A New Representation for
Human Action Recognition
1 Introduction
Fig. 1. Illustrations of the frame groups, motion images, and our motion context rep-
resentations on the KTH dataset. This figure is best viewed in color.
region around a reference point and thus summarize the local motion informa-
tion in a rich, local 3D MC descriptor. Fig.1 illustrates some MIs and their
corresponding MC representations using the video clips in the KTH dataset.
To describe an action, only one 3D descriptor is generated by summing up all
the MC descriptors of this action in the MIs. For action recognition, we employ
3 different approaches: pLSA [7], w3 -pLSA (a new direct graphical model by
extending pLSA) and SVM [8]. Our approach is tested on two human action
video datasets from KTH [2] and Weizmann Institute of Science [9], and the
performances are quite promising.
The rest of this paper is organized as follows: Section 2 reviews related work in human action recognition; Section 3 presents the details of our MC representation; Section 4 introduces the three recognition approaches; Section 5 reports our experimental results; and Section 6 concludes the paper.
2 Related Work
the optical flow from different frames to represent human actions. Recently,
Scovanner et al. [4] applied sub-histograms to encode local temporal and spatial
information to generate a 3D version of SIFT [13] (3D SIFT), and Savarese et al.
[14] proposed so-called “spatial-temporal correlograms” to encode flexible long
range temporal information into the spatial-temporal motion features.
However, a common issue with these interesting point detectors is that the detected points are sometimes too few to sufficiently characterize the human action, which reduces recognition performance. This issue is avoided in [6] by employing the separable linear filter method [3], rather than such space-time interesting point detectors, to obtain the motion features using a quadrature pair of 1D Gabor filters applied temporally.
Another way of using temporal information is to divide a video into smaller
groups of consecutive frames as the basic units and represent a human action as
a collection of the features extracted from these units. In [15], [5], every three consecutive frames in each video are grouped together as a node in a graphical model in order to learn the spatial-temporal relations among these nodes. Also, in [16], the authors took the average of a sequence of binary
silhouette images of a human action to create the “Average Motion Energy”
representation. Similarly, [17] proposed a concept of “Motion History Volumes”,
an extension of “Motion History Images” [18], to capture the motion information
from a sequence of video frames.
After the human action representations have been generated, both discrim-
inative approaches (e.g. kernel approaches [2]) and generative approaches (e.g.
pLSA [19], MRF [15], [5], semi-LDA [10], hierarchical graphical models [6]) can
be employed to recognize them.
Fig. 2. Illustration of the MI generation process for a frame group. The black dots
denote the pixel intensity values.
consecutive frames. Since the standard deviation measures the variation of the pixel intensity values over time, it provides a simple way to detect motion.
We would like to mention that the length of each group, V, should be long enough to capture sufficient motion information, but not too long. Fig. 3 illustrates the effect of different V on the MIs of human running and walking. If V = 5, the difference between the two actions is quite clear. With V increased to 60, the motion information of both actions spreads out in the MIs, making it difficult to distinguish them. A further investigation of V is therefore essential for our MC representation.
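A minimal sketch of the MI computation described above (per-pixel standard deviation over a frame group, V = 5 in the paper) might look as follows; it is an illustration, not the authors' code.

```python
import numpy as np

def motion_image(frames):
    """Motion Image (MI) for one frame group: the per-pixel standard deviation
    of intensity over the V consecutive frames. `frames` is a (V, H, W) array
    of grayscale frames."""
    return np.std(np.asarray(frames, dtype=np.float64), axis=0)
```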
Fig. 3. Illustration of effects of different lengths of frame groups on the MIs using
human running and walking
Fig. 4. Illustration of our MC representation (left) and its 3D descriptor (right). On
the left, P denotes a MW at an interesting point, O denotes the reference point, Θ and
S denote the relative angle and normalized distance between P and O in the support
region (the black rectangle), respectively, and the shaded sector (blue) denotes the
orientation of the whole representation. On the right, each MW is quantized into a
point to generate a 3D MC descriptor. This figure is best viewed in color.
Our MC representation is inspired by Shape Context (SC) [20], which has been widely used in object recognition. The basic idea of SC is to capture the distribution of the other shape points, over relative positions, in a region around a pre-defined reference point.
Subsequently, 1D descriptors are generated to represent the shapes of objects.
In our representation, we utilize the polar coordinate system to capture the
relative angles and distances between the MWs and the reference point (the
pole of the polar coordinate system) for each action in the MIs, similar to SC.
This reference point is defined as the geometric center of the human motion, and
the relative distances are normalized by the maximum distance in the support
region, which makes the MC insensitive to changes in scale of the action. Here,
the support region is defined as the area which covers the human action in the
MI. Fig.4 (left) illustrates our MC representation. Suppose that the angular coordinate is divided into M equal bins, the radial coordinate is divided into N equal bins, and there are K MWs in the dictionary; then each MW can be put into one of the M × N bins to generate a 3D MC descriptor for each MC representation, as illustrated in Fig.4 (right). To represent a human action in each video sequence, we sum up all the MC descriptors of this action to generate one 3D descriptor with the same dimensions.
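The accumulation of MWs into the M × N × K descriptor could be sketched as below. The bin counts shown are placeholders (not values stated here in the paper), the maximum distance is approximated by the farthest detected point, and the orientation-flipping step described next is omitted.

```python
import numpy as np

def mc_descriptor(points, word_ids, reference, M=8, N=3, K=100):
    """Accumulate motion words (MWs) into an M x N x K motion-context histogram.
    points    : (P, 2) interest-point coordinates (x, y) in the MI
    word_ids  : (P,) MW index of each point
    reference : (2,) geometric center of the motion (pole of the polar system)
    M, N, K   : angular bins, radial bins, dictionary size (illustrative values)."""
    d = np.asarray(points, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    radii = np.hypot(d[:, 0], d[:, 1])
    radii = radii / (radii.max() + 1e-12)   # normalize by the largest observed distance
    a_bin = np.minimum((angles / (2 * np.pi) * M).astype(int), M - 1)
    r_bin = np.minimum((radii * N).astype(int), N - 1)
    hist = np.zeros((M, N, K))
    for a, r, w in zip(a_bin, r_bin, word_ids):
        hist[a, r, w] += 1
    return hist
```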
When generating MC representations, another factor should also be considered: the direction of the action, because the same action may occur in different directions. For example, a person may be running in one direction or in the opposite direction. In such cases, the distributions of the interesting points in the two corresponding MIs should be roughly symmetric about the y-axis. Combining
the two distributions for the same action will reduce the discriminability of our
representation. To avoid this, we define the orientation of each MC representa-
tion as the sector where most interesting points are detected, e.g. the shaded
one (blue) in Fig. 4 (left). This sector can be considered to represent the main
characteristics of the motion in one direction. For the same action but in the
opposite direction, we then align all the orientations to the pre-defined side by
flipping the MC representations horizontally around the y-axis. Thus our repre-
sentation is symmetry-invariant. Fig.5 illustrates this process. Notice that this
process is done automatically without the need to know the action direction.
The entire process of modeling human actions using the MC representation
is summarized in Table 1.
Table 1. The main steps of modeling the human actions using the MC representation
4.1 pLSA
pLSA aims to introduce an aspect model, which builds an association between
documents and words through the latent aspects by probability. Here, we follow
the terminology of text classification where pLSA was used first. The graphical
model of pLSA is illustrated in Fig.6 (a).
Suppose D = {d1 , . . . , dI }, W = {w1 , . . . , wJ } and Z = {z1 , . . . , zK } denote a
document set, a word set and a latent topic set, respectively. pLSA models the
joint probability of documents and words as:
\[
P(d_i, w_j) = \sum_k P(d_i, w_j, z_k) = \sum_k P(w_j \mid z_k)\, P(z_k \mid d_i)\, P(d_i) \qquad (1)
\]
Fig. 6. Graphical models of (a) pLSA, with nodes d, z, and w, and (b) w3-pLSA, with nodes d, z, w, θ, and s.
where n(d_i, w_j) denotes the document-word co-occurrence table, in which the number of co-occurrences of d_i and w_j is recorded in each cell.
To learn the probability distributions involved, pLSA employs the Expecta-
tion Maximization (EM) algorithm shown in Table 2 and records P (wj |zk ) for
recognition, which is learned from the training data.
E-step:
\[
P(z_k \mid d_i, w_j) \propto P(w_j \mid z_k)\, P(z_k \mid d_i)\, P(d_i)
\]
M-step:
\[
P(w_j \mid z_k) \propto \sum_i n(d_i, w_j)\, P(z_k \mid d_i, w_j), \qquad
P(z_k \mid d_i) \propto \sum_j n(d_i, w_j)\, P(z_k \mid d_i, w_j), \qquad
P(d_i) \propto \sum_j n(d_i, w_j)
\]
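A compact, unoptimized NumPy version of this EM procedure (Table 2) is sketched below for illustration only; it keeps the full I × J × K posterior in memory, which is practical only for small problems.

```python
import numpy as np

def plsa_em(n_dw, K, iters=100, seed=0):
    """EM for pLSA on a document-word count table n_dw (I x J).
    Returns P(w|z) of shape (J, K), P(z|d) of shape (K, I), and P(d)."""
    rng = np.random.default_rng(seed)
    I, J = n_dw.shape
    p_w_z = rng.random((J, K))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((K, I))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    p_d = n_dw.sum(axis=1) / n_dw.sum()
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d) P(d), shape (I, J, K)
        post = p_w_z[None, :, :] * p_z_d.T[:, None, :] * p_d[:, None, None]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the factors from n(d,w) P(z|d,w)
        nw = n_dw[:, :, None] * post
        p_w_z = nw.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = nw.sum(axis=1).T
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
        p_d = n_dw.sum(axis=1) / n_dw.sum()
    return p_w_z, p_z_d, p_d
```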
4.2 w3 -pLSA
To bridge the gap between the human actions and our MC descriptors, we extend
pLSA to develop a new graphical model, called w3 -pLSA. See Fig.6 (b), where d
denotes human actions, z denotes latent topics, w, θ and s denote motion words,
and the indexes in the angular and radial coordinates in the polar coordinate
system, respectively.
Referring to pLSA, we model the joint probability of human actions, motion
words and their corresponding indices in the angular and radial coordinates as
\[
P(d_i, w_j, \theta_m, s_r) = \sum_k P(d_i, w_j, \theta_m, s_r, z_k) = \sum_k P(d_i)\, P(z_k \mid d_i)\, P(w_j, \theta_m, s_r \mid z_k) \qquad (3)
\]
E-step:
\[
P(z_k \mid d_i, w_j, \theta_m, s_r) \propto P(w_j, \theta_m, s_r \mid z_k)\, P(z_k \mid d_i)\, P(d_i)
\]
M-step:
\[
P(w_j, \theta_m, s_r \mid z_k) \propto \sum_i n(d_i, w_j, \theta_m, s_r)\, P(z_k \mid d_i, w_j, \theta_m, s_r),
\]
\[
P(z_k \mid d_i) \propto \sum_{j,m,r} n(d_i, w_j, \theta_m, s_r)\, P(z_k \mid d_i, w_j, \theta_m, s_r), \qquad
P(d_i) \propto \sum_{j,m,r} n(d_i, w_j, \theta_m, s_r)
\]
5 Experiments
Our approach has been tested on two human action video datasets from KTH
[2] and Weizmann Institute of Science (WIS) [9]. The KTH dataset is one of
the largest datasets for human action recognition containing six types of human
actions: boxing, handclapping, handwaving, jogging, running, and walking. For
each type, there are 99 or 100 video sequences of 25 different persons in 4 differ-
ent scenarios: outdoors (S1), outdoors with scale variation (S2), outdoors with
different clothes (S3) and indoors (S4), as illustrated in Fig.7 (left). In the WIS
dataset, there are altogether 10 types of human actions: walk, run, jump, gallop
sideways, bend, one-hand wave, two-hands wave, jump in place, jumping jack,
and skip. For each type, there are 9 or 10 video sequences of 9 different persons with similar backgrounds, as shown in Fig.7 (right).
Fig. 7. Some sample frames from the KTH dataset (left) and the WIS dataset (right)
5.1 Implementation
To generate MC representations for human actions, we need to locate the ref-
erence points and the support regions first. Some techniques in body tracking
(e.g. [21]) can be applied to locate the areas and the geometric centers of the
human bodies in each frame group of a video sequence. The integration of the
areas of a person can be defined as its support region and the mean of its centers
can be defined as the reference point for this action in the MI. However, this issue is beyond the scope of this paper. Considering that in our datasets each video sequence contains only one person, we simply assume that in each MI the support region of the human action covers the whole MI, and we adopt a simple method to roughly locate the reference points. First, we generated one MI
from every 5-frame group of each video sequence empirically. Then a Gaussian
filter was applied to denoise these MIs so that the motion information from the
background was suppressed. Next, we used the Canny edge detector to locate
the edges in each MI, and finally took the geometric center of the edge points as
the reference point for the action.
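The reference-point localization described above could be sketched with OpenCV as follows; the Gaussian sigma and Canny thresholds are illustrative values, not those used in the paper.

```python
import cv2
import numpy as np

def reference_point(mi, blur_sigma=1.5, canny_lo=50, canny_hi=150):
    """Roughly locate the reference point of an MI: Gaussian denoising, Canny edge
    detection, then the geometric center of the edge points."""
    mi8 = cv2.normalize(mi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    smooth = cv2.GaussianBlur(mi8, (0, 0), blur_sigma)
    edges = cv2.Canny(smooth, canny_lo, canny_hi)
    ys, xs = np.nonzero(edges)
    if len(xs) == 0:
        return mi.shape[1] / 2.0, mi.shape[0] / 2.0   # fall back to the image center
    return xs.mean(), ys.mean()
```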
After locating the reference points, we followed the steps in Table 1 to generate
the MC representations for human actions. The detector and descriptor involved
in Step 2 are the Harris-Hessian-Laplace detector [22] and the SIFT descriptor [13], and the clustering method used here is K-means.

Table 4. Comparison (%) between our approach and others on the KTH dataset

Then based on the MWs
and the MC descriptors of the training data, we trained pLSA, w3 -pLSA and
SVM for each type of actions separately, and a test video sequence was classified
to the type of actions with the maximum likelihood.
Table 5. Comparison (%) between our approach and others on the WIS dataset. Notice
that “✕” denotes that this type of actions was not involved in their experiments.
Rec.Con. bend jack jump pjump run side skip walk wave1 wave2 ave.
MW+pLSA 77.8 100.0 88.9 88.9 70.0 100.0 60.0 100.0 66.7 88.9 84.1
MW+SVM 100.0 100.0 100.0 77.8 30.0 77.8 40.0 100.0 100.0 100.0 81.44
MC+w3 -pLSA 66.7 100.0 77.8 66.7 80.0 88.9 100.0 100.0 100.0 100.0 88.0
MC+SVM 100.0 100.0 100.0 88.9 80.0 100.0 80.0 80.0 100.0 100.0 92.89
Wang et al. [16] 100.0 100.0 89.0 100.0 100.0 100.0 89.0 100.0 89.0 100.0 96.7
Ali et al. [26] 100.0 100.0 55.6 100.0 88.9 88.9 100.0 100.0 100.0 92.6
Scovanner [4] 100.0 100.0 67.0 100.0 80.0 100.0 50.0 89.0 78.0 78.0 84.2
Niebles et al. [6] 100.0 100.0 100.0 44.0 67.0 78.0 56.0 56.0 56.0 72.8
conclusions: (1) MWs without any spatial information are not discriminative
enough to recognize the actions. MW+pLSA returns the best performance
(84.65%) using MWs, which is lower than the state of the art. (2) MC repre-
sentation usually achieves better performances than MWs, which demonstrates
that the distributions of the MWs are quite important for action recognition.
MC+w3 -pLSA returns the best performance (91.33%) among all the approaches.
Unlike the KTH dataset, the WIS dataset has only 9 or 10 videos for each type of human action, which may result in underfitting when training the graphical models. To make full use of this dataset, we used only the LOO training strategy to learn the models for the human actions and tested on all the video sequences.
We compare our average recognition rates with others in Table 5. The experi-
mental configuration of the MC representation is kept the same as that used on
the KTH dataset, while the number of MWs used in the BOW model is modified
empirically to 300. The number of latent topics is unchanged. From this table,
we can see that MC+SVM still returns the best performance (92.89%) among
the different configurations, which is comparable to other approaches and higher
than the best performance (84.1%) using MWs. These results demonstrate that our MC representation can properly model human actions through the distributions of the MWs.
6 Conclusion
and Weizmann Institute of Science (WIS). The performances are promising. For
the KTH dataset, all configurations using MC outperform existing approaches
where the best performances are obtained using w3 -pLSA (88.67% for SDE and
91.33% for LOO). For the WIS dataset, our MC+SVM returns a comparable performance (92.89%) using the LOO strategy.
References
1. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
2. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local svm ap-
proach. In: ICPR 2004, vol. III, pp. 32–36 (2004)
3. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse
spatio-temporal features. In: VS-PETS (October 2005)
4. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application
to action recognition. ACM Multimedia, 357–360 (2007)
5. Wang, Y., Loe, K.F., Tan, T.L., Wu, J.K.: Spatiotemporal video segmentation
based on graphical models. Trans. IP 14, 937–947 (2005)
6. Niebles, J., Fei Fei, L.: A hierarchical model of shape and appearance for human
action classification. In: CVPR 2007, pp. 1–8 (2007)
7. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. In:
Mach. Learn., Hingham, MA, USA, vol. 42, pp. 177–196. Kluwer Academic Pub-
lishers, Dordrecht (2001)
8. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. In:
Data Mining and Knowledge Discovery, vol. 2, pp. 121–167 (1998)
9. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time
shapes. In: ICCV 2005, vol. II, pp. 1395–1402 (2005)
10. Wang, Y., Sabzmeydani, P., Mori, G.: Semi-latent dirichlet allocation: A hierar-
chical model for human action recognition. In: HUMO 2007, pp. 240–254 (2007)
11. Ikizler, N., Duygulu, P.: Human action recognition using distribution of oriented
rectangular patches. In: HUMO 2007, pp. 271–284 (2007)
12. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV
2003, pp. 726–733 (2003)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision 60(2), 91–110 (2004)
14. Savarese, S., Del Pozo, A., Niebles, J.C., Fei-Fei, L.: Spatial-temporal correlations for unsu-
pervised action classification. In: IEEE Workshop on Motion and Video Comput-
ing, Copper Mountain, Colorado (2008)
15. Wang, Y., Tan, T., Loe, K.: Video segmentation based on graphical models. In:
CVPR 2003, vol. II, pp. 335–342 (2003)
16. Wang, L., Suter, D.: Informative shape representations for human action recogni-
tion. In: ICPR 2006, vol. II, pp. 1266–1269 (2006)
17. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using mo-
tion history volumes. Computer Vision and Image Understanding 104 (Novem-
ber/December 2006)
18. Bobick, A., Davis, J.: The recognition of human movement using temporal tem-
plates. PAMI 23(3), 257–267 (2001)
19. Niebles, J., Wang, H., Wang, H., Fei Fei, L.: Unsupervised learning of human action
categories using spatial-temporal words. In: BMVC 2006, vol. III, p. 1249 (2006)
20. Belongie, S., Malik, J., Puzicha, J.: Shape context: A new descriptor for shape
matching and object recognition. In: NIPS, pp. 831–837 (2000)
21. Bissacco, A., Yang, M.H., Soatto, S.: Fast human pose estimation using appearance
and motion via multi-dimensional boosting regression. In: CVPR (2007)
22. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE
Transactions on Pattern Analysis & Machine Intelligence 27, 1615–1630 (2005)
23. Chang, C., Lin, C.: Libsvm: a library for support vector machines, Online (2001)
24. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volu-
metric features. In: International Conference on Computer Vision, vol. 1, p. 166
(October 2005)
25. Wong, S., Kim, T., Cipolla, R.: Learning motion categories using both semantic
and structural information. In: CVPR 2007, pp. 1–6 (2007)
26. Ali, S., Basharat, A., Shah, M.: Chaotic invariants for human action recognition.
In: ICCV 2007, pp. 1–8 (2007)
Temporal Dithering of Illumination
for Fast Active Vision
1 Introduction
DMD device has found applications in areas ranging widely from microscopy to
chemistry to holographic displays [13].
The operating principle of the DMD device has also been exploited in com-
puter vision and graphics. Nayar et al. [14] re-engineer a DLP projector into
a DMD-camera and demonstrate the notion of programmable imaging for ap-
plications including adaptive dynamic range and optical filtering and matching.
Based on the theory of compressive sampling, a single pixel camera has been
implemented where the DMD device used to compute optical projections of
scene radiance [15]. Jones et al. [16] modify a DLP projector using custom made
FPGA-based circuitry to obtain 1-bit projection at 4800Hz. Using this, they
generate high speed stereoscopic light field displays. McDowall and Bolas [17]
use a specially re-programmed high speed projector based on Multiuse Light
Engine (MULE) technology to achieve range finding at kilohertz rates.
In order to project a desired intensity value, the DLP projector emits a series of
light pulses of different time intervals [13]. A sensor aggregates the pulses of light
over the duration of its integration time (say, 1/30s in a video camera) to capture
the final gray-valued brightness. This pulse-width modulation (PWM) by the projector is unique for every input intensity and can be termed "temporal dithering" of the illumination. As we shall show, this dithering allows us to
encode scene illumination in novel ways to achieve significant speedup in the
performance of virtually any active vision technique.
But how do we capture this high speed dithering? The exposure time (1/30s)
of a video camera is too long to observe the temporal illumination dithering
clearly. One possibility is to precisely synchronize the camera with a DLP pro-
jector and to expose the camera only for the duration of a single projected light
pulse (a few microseconds). Raskar et al. [18] and Cotting et al. [19] use this
technique to embed illumination patterns in the scene that cannot be observed
with the naked eye. The focus of these works is on intelligent office applications
with 30-60Hz performance requirements.
In contrast, our work focuses on exploiting the temporal dithering for fast ac-
tive vision. For this, we use a novel combination of a high speed camera and an
off-the-shelf DLP projector. Figure 1 illustrates the dithering of an 8-bit InFocus
IN38 DLP projector as observed by a Photron PCI-1024 high speed camera. A cal-
ibration image composed of 5 × 5 pixel blocks each with a different intensity value
from 0 to 255 is input to the projector. Each intensity at a pixel C in this calibra-
tion image is projected onto a flat screen using a unique temporal dithering DC (t),
over discrete time frames t. The high speed camera observes the projected im-
ages at 10 kHz. Notice the significant variation in the images recorded. The plot in
Figure 1(d) shows the patterns emitted by the projector for 4 input brightnesses
(165, 187, 215, 255), as measured over 100 camera frames. The temporal ditherings
corresponding to all the 256 input intensities in the calibration image are collated
into a photograph for better visualization of this principle. The temporal dithering
is stable and repeatable but varies for each projector-camera system.
Fig. 1. Reverse engineering a DLP Projector: (a) A DLP projector converts the input
intensity received into a stream of light pulses that is then projected onto a screen. A
high speed camera viewing the screen aggregates the brightness over the duration of
its integration time. (b) A calibration image composed of 5 × 5 pixel blocks each with
a different intensity from 0 to 255 is input to the projector. (c) The camera records the
projector output at 10 kHz. In (d) we show gray-valued intensities measured over time
by the high speed camera for 4 example intensities input to the projector. Notice the
significant variations in the plots. In (e), the temporal dithering for all 256 projector
input intensities is collated into an image. This temporal dithering is repeatable and
can be used to encode illumination in a novel way, enabling fast active vision.
Fig. 2. Illumination and acquisition setup for structured light based 3D reconstruction:
The Photron high speed camera is placed vertically above the Infocus DLP projector.
A vertical plane is placed behind the scene (statue) for calibration.
In methods (a)-(d), the projector receives a single image as input via a com-
puter, whereas the high speed camera acquires a sequence of frames. The effective
speedup achieved depends on the task at hand and the quality of the result desired
given the signal-to-noise ratio in the captured images. In addition, the intensity
variation due to dithering can be observed reliably even with camera frame rates
as low as 300 fps enabling applications with slower performance requirements. Un-
like previous work, our techniques do not require any projector-camera synchro-
nization, hardware modification or re-programming of the DMD device, or the
knowledge of proprietary dithering coding schemes. Thus, we believe this work to
be widely applicable. Better visualizations of all our results are available through
our website (http://graphics.cs.cmu.edu/projects/dlp-dithering).
Fig. 3. Results of 3D reconstruction using the DLP projector for a moving statue:
(a) Three frames captured by the high speed camera illustrate the fast modulation of
illumination incident on the scene. 20 continuous frames are used to match the inten-
sity variation observed on the scene point against the normalized intensity variation
observed on the vertical plane behind the object. (b) The best match finds correspon-
dences between projector and camera pixels. The error map is shown in (c). The (d)
disparity and (e) recovered shape of the object is shown from different viewpoints.
laptop. Let I(t) be the vector of intensities observed, over a set of frames, at a
scene point P . The normalized correlation between I(t) and temporal dithering
function DC (t) for each C (Section 1.1) is computed to obtain the projector pixel
C corresponding to the image pixel P . But how do we synchronize the frames
from the projector and the camera? One approach is to include a small planar
patch in the scene where correspondence between the corners of the patch can be
easily established (say, manually). This correspondence allows us to synchronize
the measured intensity vector with the temporal dithering.
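A minimal sketch of this matching step is given below. It assumes the camera frames have already been synchronized to the dithering functions D_C(t) (e.g., via the planar patch mentioned above) and that `dithering` is a lookup from projector pixels C to their dithering vectors over the same frame window; it is not the authors' implementation.

```python
import numpy as np

def normalized(v):
    """Zero-mean, unit-norm version of an intensity vector."""
    v = np.asarray(v, dtype=np.float64)
    v = v - v.mean()
    return v / (np.linalg.norm(v) + 1e-12)

def match_projector_pixel(I_t, dithering):
    """Find the projector pixel C whose temporal dithering D_C(t) best matches the
    intensity vector I(t) observed at a camera pixel, by normalized correlation."""
    i_n = normalized(I_t)
    scores = {C: float(np.dot(i_n, normalized(D))) for C, D in dithering.items()}
    return max(scores, key=scores.get)
```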
We performed two experiments with a rotating statue and with a cloth waved
quickly in front of the high speed camera. For convenience, the camera and the
Fig. 5. Reconstructions obtained using videos captured at reduced frame rates. Even
at 300Hz, the quality of the reconstruction obtained remains acceptable indicating that
temporal dithering can be exploited at this frame rate.
300Hz and 120Hz. Figure 5 shows the reconstructions obtained. The frame rate
of 120Hz is too low to capture the required intensity variation and hence, the
projector-camera pixel correspondences are unreliable. However, at 300Hz, the
reconstruction quality is still acceptable indicating that the temporal dithering
can be exploited even at this frame rate.
where Dk(t) is the dithering intensity of projector k at time t, and Ek(t) is the irradiance due to the scene as if it were illuminated only by projector k but with
appear mixed in the multiplexed image I(t). For robustness, we use 10 frames
to solve the above linear system. Notice separation of the shadows in the de-
multiplexed images. As before, the effective rate of demultiplexing depends on
the SNR in the high speed camera. We have thus far ignored color information,
however, when the three DLP projectors emit intensities in different spectral
bands, the de-multiplexing algorithm can be used to colorize the acquired high
speed gray-scale video.
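A sketch of the per-pixel demultiplexing is shown below. It assumes the irradiances E_k are approximately constant over the short frame window and solves the resulting overdetermined system I(t) = sum_k D_k(t) E_k in the least-squares sense (10 frames in the paper); it is an illustration, not the authors' code.

```python
import numpy as np

def demultiplex(I_frames, D):
    """Per-pixel demultiplexing of K projectors from T >= K multiplexed frames.
    I_frames : (T, H, W) captured images,  D : (T, K) dithering intensities.
    Returns the (K, H, W) demultiplexed images E_k."""
    T, H, W = I_frames.shape
    E, _, _, _ = np.linalg.lstsq(D, I_frames.reshape(T, H * W), rcond=None)
    return E.reshape(D.shape[1], H, W)
```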
Photometric stereo is a widely used method to recover the surface normals and
albedos of objects that are photographed under different lighting directions.
There are many variants of this approach and we chose the one by Hertzmann
and Seitz [8] for its simplicity. In their work, the appearance of the scene under
varying lighting is matched with that of an example sphere made of the same
material (same BRDF) as the scene. The point on the sphere that produces the
best match is the normal of the scene point. We will extend this approach for
fast moving scenes that are simultaneously illuminated from different directions.
The scene in our experiments consists of a sphere and a falling pear both
painted in the same manner (Figure 7) and illuminated by three DLP projec-
tors simultaneously from different directions. The projectors and camera are far
enough away from the scene to assume orthographic viewing and distant lighting.
Since each projector must uniformly illuminate the scene, we provide a single
constant brightness image as input to each projector (with different brightness
values). The high speed camera records images at 3 kHz.
The projectors are de-synchronized and hence, the “multiplexed illumination”
results in significant variation in the observed intensities. The normalized inten-
sities at a scene point are compared to those observed on the sphere. The surface
normal of the scene point is that of the point on the sphere which produced the
best match. A matching length of 10 frames achieved robust results. A sliding
window of 10 frames can be used to generate the normals up to a rate of 3 kHz.
As before, the speed of the object determines the effective performance rate.
Figure 7 shows the normals of the pear as it falls and bounces on a table.
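The matching step of this example-based photometric stereo can be sketched as below: each scene point's normalized intensity vector over the frame window is compared against those of the reference sphere, and the best-matching sphere point lends its normal. The array shapes and interface are assumptions made for this example, in the spirit of Hertzmann and Seitz [8].

```python
import numpy as np

def normals_by_example(scene_vecs, sphere_vecs, sphere_normals):
    """scene_vecs     : (P, T) intensities of scene points over T frames
    sphere_vecs    : (Q, T) intensities of sphere points over the same frames
    sphere_normals : (Q, 3) known normals of the sphere points
    Returns a (P, 3) array with the matched normal for each scene point."""
    def unit_rows(a):
        a = np.asarray(a, dtype=np.float64)
        return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    s = unit_rows(scene_vecs)
    q = unit_rows(sphere_vecs)
    best = np.argmax(s @ q.T, axis=1)   # highest normalized correlation
    return sphere_normals[best]
```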
The radiance of a scene point can be divided into two components - (a) the direct
component Ld , due to the direct illumination from the light source and (b) the
global component Lg due to the illumination indirectly reaching the scene point
from other locations in the scene [10]. The global component Lg includes effects
like interreflections, subsurface and volumetric scattering and translucency. Na-
yar et al [10] demonstrated that using high frequency illumination, it is possible
to separate the two components and obtain novel visualizations of the compo-
nents for the first time. A particular choice for high frequency illumination is
Fig. 7. Photometric stereo by example: The scene consists of a fast moving pear and a
sphere that are both painted similarly. Three DLP projectors simultaneously illuminate
the scene and the camera operates at 3000Hz. The projectors and camera are far
enough away from the scene to assume orthographic viewing and distant lighting. The
surface normal at a point on the falling pear is computed by matching the normalized
observed intensities to those at the points on the sphere. Since the projectors are
not synchronized, the variation in multiplexed illumination from the 3 projectors is
significant enough to obtain good matches for surface normals. A matching length of
10 frames achieved robust results.
a checker board pattern and its complement (with alternate bright and dark
squares), both of which are projected sequentially for separation.
We exploit illumination dithering to obtain separation at video rates. However,
in our setup, it is possible to input only one image to the DLP projector in 1/60s
and we have no control over the temporal dithering. So, how do we project comple-
mentary patterns much faster than 1/60s? We selected two specific input bright-
nesses 113 and 116 whose dithered patterns are shown in the plot of Figure 8.
Notice how the two patterns “flip” from bright to dark and vice versa over time.
Hence, a checker pattern with these two brightnesses is input to the projector.
The dithering ensures that the two complementary patterns occur at high speeds.
Let the observed temporally dithered values for input values 113 and 116 be a and
b, respectively, and the fraction of pixels that correspond to the value a be α (0.5
in our experiments). The two captured images are [10]:
To solve the above equations, we need to know a and b in every frame. For this,
we place a white planar diffuse surface behind the scene of interest. For points on
this plane, Lg = 0 and Ld is a constant. This allows us to estimate a and b up to a
Fig. 8. Direct-Global Separation using DLP Dithering: (a) The DLP projector and
the camera are co-located using a beam splitter. A single checker pattern with two
intensities 113 and 116 is input to the projector. The plot shows how the input
intensities are dithered by the projector over time. Notice that at certain time instants,
the patterns flip between bright and dark. Thus, the projector emits complementary
checker patterns as in (b) onto the scene that are used to separate the direct and global
components (c). The flip occurs once in 1/100s.
single scale factor. Then, the above linear system can be solved at every pixel to
obtain the separation. There is one additional complication in our setup beyond
the method in [10]: it is hard to find out whether a scene point receives intensity
a or intensity b from just the observed appearance of the scene. To address this
problem, we co-locate the projector and the camera using a beam-splitter as
shown in Figure 8. The pixels of the projector are automatically corresponded
with those of the camera.
The scene in our experiment consists of a set of white ping-pong balls dropped
from a hand. The ping-pong balls are mostly diffuse. Notice that the direct
Fig. 9. Motion blurring under DLP illumination and fluorescent illumination: The
scene consists of a heavy brick falling rapidly and an image is captured with exposures
1/60s (a) and 1/125s (b). Under fluorescent illumination, the motion blur appears
as a smear across the image losing high frequencies. The temporal dithering in DLP
projectors acts as a high frequency modulator that convolves with the moving object.
The motion-blurred image still preserves some of the high spatial frequencies. Six copies
of the text “ECCV08” in (a) and 2 copies in (b) are clearly visible.
component for each ball looks like the shading on a sphere (with dark edges) and
the indirect component includes the interreflections between the balls (notice the
bright edges). For the hand, the direct component is only due to reflection by the
oils near the skin surface and is dark. The indirect component includes the effect
of subsurface scattering and dominates the intensity. The checker pattern “flips”
once in approximately 1/100s and hence we achieve separation at 100Hz. Due
to finite resolution of the camera and the narrow depth of field of the projector,
a 1-pixel blur is seen at the edges of the checker pattern. This results in the grid
artifacts seen in the results.
Motion-blur occurs when the scene moves more than a pixel within the inte-
gration time of a camera. The blur is computed as the convolution of the scene
motion with a box filter of width equal to the camera integration time. Thus,
images captured of fast moving objects cause a smear across the pixels losing
significant high frequencies. Deblurring images is a challenging task that many
works have addressed with limited success. A recent approach by Raskar et al.
[21] uses an electronically controlled shutter in front of the camera to modulate
the incoming irradiance at speeds far greater than the motion of the object. In
other words, the box filter is replaced by a series of short pulses of different
widths. The new convolution between the object motion and the series of short
pulses results in images that preserve more high frequencies as compared to the
box filter. This “Flutter Shutter” approach helps in making the problem better
conditioned. Our approach is similar in spirit to [21] with one difference: the fast
shutter is simulated by the temporal dithering of the DLP illumination. Note that
the DLP illumination dithering is significantly faster than mechanical shutters1 .
Figure 9 shows the images captured with a 1/60s exposure. The scene consists
of a brick with the writing “ECCV08” falling vertically. When illuminated by
a fluorescent source, the resulting motion-blur appears like a smear across the
image. On the other hand, when the scene is illuminated using a DLP projector,
we see 6 distinct copies of the text that are translated downward. A Canny edge
detector is applied to the captured image to illustrate the copies. If we knew
the extent of motion in the image, the locations of strong edges can be used as
a train of delta signals that can be used for deblurring the image. In Figure 9(b), we show an example of deblurring the image captured with a 1/125s exposure. As in the deblurred images obtained in the flutter-shutter case, the DLP illumination preserves more high frequencies in the motion-blurred image.
7 Discussion
Speed vs. accuracy trade-off. One limitation of our approach is the require-
ment of a high speed camera. The acquisition speed of the camera and the
effective speed of performance achieved depend on the task at hand and the
signal-to-noise ratio of the captured images. For instance, the decision to use
10 frames for demultiplexing illumination or photometric stereo, or to use 20
frames for structured light, was mainly influenced by the noise characteristics of
the camera. A more scientific exploration of this trade-off is required to better
understand the benefits of our approach to each technique. A future avenue of
research is to design 2D spatial intensity patterns that create temporal dithering
codes that are optimal for the task at hand.
Issues in reverse engineering. The images shown in Figure 1 are dark for the
input brightness range of 0 to 90. Despite the claim from manufacturers that the
projector displays 8-bits of information, only about 160 patterns are usable for
our experiments. To compensate for this, the projector performs spatial dithering
in addition to temporal dithering in a few pixel blocks. This is an almost random
effect that is not possible to reverse engineer without proprietary information
from the manufacturers. We simply average a small neighborhood or discard
such neighborhoods from our processing.
Other active vision techniques and illumination modulations. We be-
lieve that the temporal illumination dithering can be applied to a broader range
of methods including pixel-wise optical flow estimation and tracking, projector
defocus compensation and depth from defocus [12] and spectral de-multiplexing.
While we exploit the temporal dithering already built-in to the projector, we do
not have a way of controlling it explicitly. Better control is obtained by using
a more expensive and special high speed MULE projector [17]. Finally, strobe
lighting, fast LED [22] and flash modulation are also effective in temporally
varying (not dithering) the illumination.
1. Faster shutters can be realized by electronically triggering the camera.
Acknowledgements
This research was supported in parts by ONR grants N00014-08-1-0330 and
DURIP N00014-06-1-0762, and NSF CAREER award IIS-0643628. The authors
thank the anonymous reviewers for their useful comments.
References
1. Will, P.M., Pennington, K.S.: Grid coding: A preprocessing technique for robot
and machine vision. AI 2 (1971)
2. Zhang, L., Curless, B., Seitz, S.M.: Rapid shape acquisition using color structured
light and multi-pass dynamic programming. 3DPVT (2002)
3. Davis, J., Nehab, D., Ramamoorthi, R., Rusinkiewicz, S.: Spacetime stereo: A uni-
fying framework for depth from triangulation. In: IEEE CVPR (2003)
4. Curless, B., Levoy, M.: Better optical triangulation through spacetime analysis. In:
ICCV (1995)
5. Young, M., Beeson, E., Davis, J., Rusinkiewicz, S., Ramamoorthi, R.: Viewpoint-
coded structured light. In: IEEE CVPR (2007)
6. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light.
In: CVPR (2003)
7. Zickler, T., Belhumeur, P., Kriegman, D.J.: Helmholtz stereopsis: Exploiting reci-
procity for surface reconstruction. In: Heyden, A., Sparr, G., Nielsen, M., Johansen,
P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 869–884. Springer, Heidelberg (2002)
8. Hertzmann, A., Seitz, S.M.: Shape and materials by example: A photometric stereo
approach. In: IEEE CVPR (2003)
9. Wenger, A., Gardner, A., Tchou, C., Unger, J., Hawkins, T., Debevec, P.: Perfor-
mance relighting and reflectance transformation with time-multiplexed illumina-
tion. ACM SIGGRAPH (2005)
10. Nayar, S.K., Krishnan, G., Grossberg, M.D., Raskar, R.: Fast separation of di-
rect and global components of a scene using high frequency illumination. ACM
SIGGRAPH (2006)
11. Sen, P., Chen, B., Garg, G., Marschner, S.R., Horowitz, M., Levoy, M., Lensch,
H.P.A.: Dual photography. ACM SIGGRAPH (2005)
12. Zhang, L., Nayar, S.K.: Projection defocus analysis for scene capture and image
display. ACM SIGGRAPH (2006)
13. Dudley, D., Duncan, W., Slaughter, J.: Emerging digital micromirror device (dmd)
applications. In: Proc. of SPIE, vol. 4985 (2003)
14. Nayar, S.K., Branzoi, V., Boult, T.: Programmable imaging using a digital mi-
cromirror array. In: IEEE CVPR (2004)
15. Takhar, D., Laska, J., Wakin, M., Duarte, M., Baron, D., Sarvotham, S., Kelly,
K., Baraniuk, R.: A new compressive imaging camera architecture using optical-
domain compression. Computational Imaging IV at SPIE Electronic Imaging
(2006)
16. Jones, A., McDowall, I., Yamada, H., Bolas, M., Debevec, P.: Rendering for an
interactive 360 degree light field display. ACM SIGGRAPH (2007)
17. McDowall, I., Bolas, M.: Fast light for display, sensing and control applications.
In: IEEE VR Workshop on Emerging Display Technologies (2005)
18. Raskar, R., Welch, G., Cutts, M., Lake, A., Stesin, L., Fuchs, H.: The office of
the future: A unified approach to image-based modeling and spatially immersive
displays. ACM SIGGRAPH (1998)
19. Cotting, D., Naef, M., Gross, M., Fuchs, H.: Embedding imperceptible patterns
into projected images for simultaneous acquisition and display. In: ISMAR (2004)
20. Schechner, Y.Y., Nayar, S.K., Belhumeur, P.N.: A theory of multiplexed illumina-
tion. In: ICCV (2003)
21. Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: Motion de-
blurring using fluttered shutter. ACM SIGGRAPH (2006)
22. Nii, H., Sugimoto, M., Inami, M.: Smart light-ultra high speed projector for spatial
multiplexing optical transmission. In: IEEE PROCAMS (2005)
Compressive Structured Light for Recovering
Inhomogeneous Participating Media
1 Introduction
Structured light has a long history in the computer vision community [1]. It has
matured into a robust and efficient method for recovering the surfaces of objects.
By projecting coded light patterns on the scene, and observing it using a camera,
correspondences are established and the 3D structure of the scene is recovered
by triangulation. Over the years, researchers have developed various types of
coding strategies, such as binary codes, phase shifting, spatial neighborhood
coding, etc. All structured light range finding approaches are based on a common
assumption: Each point in the camera image receives light reflected from a single
surface point in the scene.
However, many real-world phenomena can only be described by volume den-
sities rather than boundary surfaces. Such phenomena are often referred to as
participating media. Examples include translucent objects, smoke, clouds, mix-
ing fluids, and biological tissues. Consider an image acquired by photographing
1 “sparse” does not necessarily imply that the volume density must be sparsely distributed in space. It means that the density can be represented with a few non-zero coefficients in an appropriately-chosen basis, such as wavelets, gradients, principal components, etc.
[Fig. 1: acquisition setup. A projector emits coded light L(x, y) into an inhomogeneous participating medium (milk drops) with density ρ(x, y, z); a camera measures the image irradiance I(y, z).]
2 Related Work
Compressive Sensing. Compressive sensing [6,7] is a nascent field of applied
mathematics with a variety of successful applications including imaging [9], med-
ical visualization [10], and face recognition [11]. It offers a theoretical framework
to reconstruct “sparse” signals from far fewer samples than required by the con-
ventional Shannon sampling theorem. Our work builds on the basic formulation
of compressive sensing, which we augment with auxiliary terms specific to the
reconstruction of volume density.
In the low density case, or when σt is relatively small compared with the scat-
tering, the effect of attenuation usually can be ignored [3,5], i.e., the exponential
term in the above equation is equal to 1. Equation (1) thus can be reduced to a
linear projection of the light and the volume density,
I(y, z) = \int_x \rho(x, y, z) \cdot L(x, y) \, dx .   (2)
For media where the attenuation cannot be ignored, we present a simple iterative method based on relinearization (see §5.3).
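To make the relinearization idea concrete, the following is a minimal sketch (not the authors' implementation) of such an iterative scheme; solve_linear and attenuation_from are hypothetical helpers standing in for the reconstruction step and the attenuation model.

```python
# Hedged sketch of iterative relinearization: first solve the attenuation-free
# linear system, then alternately (i) estimate per-voxel attenuation factors
# from the current density and (ii) re-solve the relinearized system.
# solve_linear() and attenuation_from() are hypothetical placeholders.
def relinearize(A, b, solve_linear, attenuation_from, n_iters=5):
    rho = solve_linear(A, b)          # attenuation term assumed equal to 1
    for _ in range(n_iters):
        w = attenuation_from(rho)     # per-voxel attenuation factors (length n)
        rho = solve_linear(A * w, b)  # each column of A scaled by its voxel's factor
    return rho
```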
In this section, we explain the idea of compressive structured light for recovering
inhomogeneous participating media. For participating media, each camera pixel
receives light from all points along the line of sight within the volume. Thus
each camera pixel is an integral measurement of one row of the volume density.
Whereas conventional structured light range finding methods seek to triangulate
the position of a single point, compressive structured light seeks to reconstruct
the 1D density “signal” from a few measured integrals of this signal.
This is clearly a more difficult problem. One way to avoid this problem is
to break the integrals into pieces which can be measured directly. The price,
however, is the deterioration of either spatial resolution or temporal resolution
of the acquisition. Existing methods either illuminate a single slice at a time and
scan the volume (see Fig. 2a and [4,3]), thus sacrificing temporal resolution, or
they illuminate a single pixel per row and use interpolation to reconstruct the
volume (e.g., Fig. 2b and [5]), sacrificing spatial resolution.
In contrast, the proposed compressive structured light method uses the light
much more efficiently, projecting coded light patterns that yield “signatures,” or
integral measurements, of the unknown volume density function.
The didactic illustration in Fig. 1a depicts a simple lighting/viewpoint geom-
etry under orthographic projection, with the camera viewpoint along the x-axis,
and the projector emitting along the z-axis. Consider various coding strategies
Fig. 2. Different coding strategies of the light L(x, y) at time t for recovering inho-
mogeneous participating media: (a) scan (one stripe turned on) [4,3]; (b) laser-lines
interpolation (one pixel turned on per one row) [5]; (c) Spatial coding of compressive
structured light (all pixels are turned on with random values per time frame); (d) Tem-
poral coding of compressive structured light (random binary stripes are turned on per
time frame). Compressive structured light, shown in (c) and (d), recovers the volume
by reconstructing the 1D signal along the x-axis from a few integral measurements.
of the 3D light function L(x, y, t): Spatial codes (Fig. 2c) recover the volume
from a single image by trading spatial resolution along one dimension; Tempo-
ral codes (Fig. 2d) trade temporal resolution by emitting a sequence of vertical
binary stripes (with no coding along the y-axis), so that full spatial resolution is
retained.2
We will see that these compressive structured light codes yield high efficiency
both in acquisition time and illumination power; this comes at the cost of a more
sophisticated reconstruction process, to which we now turn our attention.
5.1 Formulation
Consider first the case of spatial coding. Suppose we want to reconstruct a volume
at the resolution n×n×n (e.g., n = 100). The camera and the projector have the
resolution of M ×M pixels (e.g., M = 1024). Therefore, one row of voxels along
the x-axis (refer to the red line in Fig. 1a) will receive light from m = M/n (e.g.,
m = 1024/100 ≈ 10) rows of the projector’s pixels. The light scattered by these
voxels in the viewing direction will then be measured, at each z-coordinate,
by a vertical column of m camera pixels. Thus, using the fact that we have
greater spatial projector/camera resolution than voxel resolution, we can have
m measurements for the n unknowns. Similarly, we can also acquire these m measurements using temporal coding, i.e., changing the projected light patterns at
each of the m time frames.
Without loss of generality, we use l_1 = L(x, 1), …, l_m = L(x, m) to denote the m rows of pixels from the projector, and b_1 = I(1, z), …, b_m = I(m, z) to denote the image irradiance of the m pixels in the camera image. Let x = [ρ_1, …, ρ_n]^T be the vector of the voxel densities along the row. Assuming no attenuation, the image irradiance for each of these m pixels is a linear projection of the light and the voxels' density from (2): b_i = l_i^T x, i = 1, …, m. Rewriting these m equations in matrix form, we have Ax = b, where A = [l_1, …, l_m]^T is an m × n matrix and b = [b_1, …, b_m]^T is an m × 1 vector.
Thus, if attenuation is not considered, the problem of recovering the volume is
formulated as the problem of reconstructing the 1D signal x given the constraints
Ax = b. To retain high spatial and temporal resolution, we often can only afford
far fewer measurements than the number of unknowns, i.e., m < n, which means
the above equation is an underdetermined linear system and optimization is
required to solve for the best x according to certain priors.
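As an illustration of this optimization-based reconstruction, the sketch below recovers one row of voxel densities from m coded measurements by penalizing the l1 norm of both the signal and its gradient, in the spirit of the CS-Both variant; it is not the authors' code, and the convex solver (cvxpy) and the equal weighting of the two terms are assumptions.

```python
# Minimal sketch: recover a sparse 1D density x (n unknowns) from m < n coded
# measurements b = A x by minimizing the l1 norms of the signal and its gradient.
import numpy as np
import cvxpy as cp

def reconstruct_row(A, b, w_val=1.0, w_grad=1.0):
    """A: m x n matrix of light codes, b: m measurements for one camera column."""
    n = A.shape[1]
    x = cp.Variable(n)
    D = np.diff(np.eye(n), axis=0)                     # forward-difference operator
    objective = cp.Minimize(w_val * cp.norm1(x) + w_grad * cp.norm1(D @ x))
    constraints = [A @ x == b, x >= 0]                 # data fit, non-negative density
    cp.Problem(objective, constraints).solve()
    return x.value

# Example with measurement cost m/n = 1/4 (25 random binary stripe codes, n = 100)
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(25, 100)).astype(float)
x_true = np.zeros(100)
x_true[40:60] = 0.15                                   # a piecewise-constant density row
x_rec = reconstruct_row(A, A @ x_true)
```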
One benefit of this optimization-based reconstruction is high efficiency in
acquisition, which we quantify using the measurement cost, m/n, where m is
the number of the measurements and n is the number of unknowns (i.e., the
dimension of the signal). For example, the measurement cost of the scanning
method [4,3] is one. We show that by exploiting the sparsity of the signal, we
can reconstruct the volume with much lower measurement cost (about 1/8 to 1/4).
2 All of the 4 methods shown in Fig. 2 can be equally improved using color channels.
Fig. 3. Comparison of different reconstruction methods. The first column is the original signal. The remaining columns show reconstruction results (red dashed lines) for different methods, given that the measurement cost m/n is equal to 1/4. The value below each plot is the NRMSE (normalized root mean squared error) of the reconstruction.
prior—the sparsity on the signal value or on the signal gradient. The fact that
CS-Gradient is better than CS-Value indicates that the sparsity on the signal
gradient holds better than the sparsity on the signal value. Finally, as expected,
CS-Both outperforms other methods due to its adaptive ability. In our trials,
the favorable performance of CS-Both was not sensitive to changes of the rela-
tive weighting of the value and gradient terms. These observations carry over to
the 3D setting (see Fig. 4), where we reconstruct a 128³ volume; note that this
requires 128 × 128 independent 1D reconstructions.
[Figure panels: (a) Ground Truth, (c) Coded Image, (d) Reconstructed Slice, (e) Reconstructed Volume at 2 Views; plot of reconstruction error (NRMSE) vs. iterations.]
[Figure: Temporal Coding of Compressive Structured Light; legend: LS, NLS, CS-Value, CS-Gradient, CS-Both.]
largely outperform other methods, especially for low measurement cost, indicating strong sparsity in the signal's gradient. (4) CS-Both is better than
CS-Gradient, especially at low measurement cost (e.g., as shown in Fig. 5 at
m/n = 1/16). Based on these preliminary simulations, we chose to run our ac-
tual acquisition experiments with a measurement cost of 1/4 and the CS-Both
optimization functional.
7 Experimental Results
Fig. 6. Reconstruction results of LEFT: an object consisting of two glass slabs with
powder where the letters “EC” are on the back slab and “CV” on the front slab, and
RIGHT: point cloud of a face etched in a glass cube. Both examples show: (a) a
photograph of the objects, (b) one of the 24 images captured by the camera, and re-
constructed volumes at different views with (c) and without (d) attenuation correction.
Fig. 7. Reconstruction results of milk drops dissolving in water. 24 images are used
to reconstruct the volume at 128 × 128 × 250 at 15fps. The reconstructed volumes are
shown in three different views. Each row corresponds to one instance in time. The
leftmost column shows the corresponding photograph (i.e., all projector pixels emit
white) of the dynamic process.
8 Limitations
Multiple Scattering. Although utilizing more light elements increases the effi-
ciency of the acquisition, it will increase multiple scattering as well, which will
cause biased reconstruction, as shown by the artifacts in Fig. 6. One potential way
to alleviate this problem is to separate multiple/single scattering by using more
complex light codes in a similar way to Nayar et al. [24].
Calibration for the Spatial Coding Method. The spatial coding seems more desir-
able than the temporal coding due to its high temporal resolution (i.e., volume reconstruction from a single image) and the easy availability of high-spatial-resolution devices. However, it requires highly accurate calibration, both geometric and radiometric. The defocus of both the projector and the camera needs to be considered as well. In contrast, the temporal coding method is more robust to noise and defocus and easier to calibrate.
9 Conclusions
We proposed compressive structured light for recovering the volume densities of
inhomogeneous participating media. Unlike conventional structured light range
finding methods where coded light patterns are used to establish correspondence
for triangulation, compressive structured light uses coded light as a way to gen-
erate measurements which are line-integrals of volume density. By exploiting the
sparsity of the volume density, the volume can be accurately reconstructed from
a few measurements. This makes the acquisition highly efficient both in acquisi-
tion time and illumination power, and thus enables the recovery of time-varying
volumetric phenomena.
We view compressive structured light as a general framework for coding the
3D light function L(x, y, t) for reconstruction of signals from line-integral mea-
surements. In this light, existing methods such as laser sheet scanning and laser
line interpolation, as well as the spatial coding and temporal coding discussed in
this paper, can be considered as special cases. One interesting future direction is
to design more complex coding strategies to improve the performance or apply
the method to new problems.
References
1. Salvi, J., Pages, J., Batlle, J.: Pattern codification strategies in structured light
systems. Pattern Recognition 37, 827–849 (2004)
2. Narasimhan, S., Nayar, S., Sun, B., Koppal, S.: Structured light in scattering media.
In: ICCV 2005, pp. 420–427 (2005)
Passive Reflectometry
1 Introduction
Different surfaces modulate light in different ways, and this leads to distinc-
tive lightness, gloss, sheen, haze and so on. Thus, like shape and color, surface
reflectance can play a significant role in characterizing objects.
Computationally, surface reflectance is represented by the bi-directional re-
flectance distribution function, or BRDF; and the task of inferring the reflectance
of a surface is formulated as that of inferring a BRDF from radiometric mea-
surements. According to conventional methods, measuring surface reflectance
requires the use of controlled, active lighting to sample the double-hemisphere of
input and output directions that constitute the BRDF domain. These approaches
demand complex infrastructure, including mechanical rotation and translation
stages, digital cameras and projectors, and custom catadioptrics.
Perceptual studies suggest that humans can also infer reflectance information
from image data, but that they do so in a very different manner. While the
vast majority of machine measurement systems rely on illumination by a single
moving point source, humans rely on images captured under complex, natural
lighting [1]. The human approach has clear practical advantages: it is a passive
technique that eliminates the need for controlled lighting, and it substantially
reduces the measurement burden.
In this paper we present a passive system for inferring bi-directional surface
reflectance that also exploits natural lighting. The approach is general in that,
Fig. 1. Reflectometry using only a camera and a light probe (bottom left). Using a
bivariate representation of reflectance, the constraints induced by a single HDR image
(top left) of a known shape are sufficient to recover a non-parametric BRDF (mid-
dle). The recovered BRDF summarizes the object’s reflectance properties and is an
important material descriptor. Here, its accuracy is demonstrated through its use in
rendering a synthetic image of a novel shape (right).
For many materials, the dimension of the BRDF domain can be reduced with-
out incurring a significant loss of detail. The domain can be folded in half, for
example, because reciprocity ensures that BRDFs are symmetric about the di-
rections of incidence and reflection: f (u, v) = f (v, u). In many cases, the domain
(θu , φu , θv , φv ) can be further ‘projected’ onto the 3D domain (θu , θv , φu − φv )
and then folded onto (θu , θv , |φu − φv |). The projection is acceptable whenever
a BRDF exhibits little change for rotations of the input and output directions
(as a fixed pair) about the surface normal; and additional folding is acceptable
whenever there is little change when reflecting the output direction about the
incident plane. Materials that satisfy these two criteria—for some definition of
‘little change’—are said to satisfy isotropy and bilateral symmetry, respectively.
(It is also common to use the term isotropy to mean both.)
It is convenient to parameterize the BRDF domain in terms of halfway and
difference angles [13]. Accordingly, the complete 4D domain is written in terms
of the spherical coordinates of the halfway vector h = (u + v)/||u + v|| and
those of the input direction with respect to the halfway vector: (θh , φh , θd , φd ).
See Fig. 2. In this parameterization, the folding due to reciprocity corresponds
to φd → φd + π, and the projection due to isotropy (without bilateral symmetry)
is one onto (θh , θd , φd ) [13]. While undocumented in the literature, it is straight-
forward to show that bilateral symmetry enables the additional folding φd →
φd + π/2 which gives the 3D domain (θh , θd , φd ) ⊂ [0, π/2]3 .
Here, we consider an additional projection of the BRDF domain, one that
reduces it from three dimensions down to two. In particular, we project
(θh , θd , φd ) ⊂ [0, π/2]3 to (θh , θd ) ∈ [0, π/2]2 . A physical interpretation is de-
picted in Fig. 2, from which it is clear that the projection is acceptable whenever
a BRDF exhibits little change for rotations of the input and output directions
(as a fixed pair) about the halfway vector. This is a direct generalization of
isotropy, bilateral symmetry and reciprocity, which already restrict the BRDF
to be π/2-periodic for the same rotations. We refer to materials that satisfy this
requirement (again, for some definition of ‘little change’) as being bivariate.
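As an illustration, the snippet below computes the two angles used by such a bivariate representation from a pair of unit directions expressed in the local frame (surface normal along +z); it is a sketch, not the paper's code.

```python
# Sketch: the (theta_h, theta_d) coordinates of a bivariate BRDF lookup,
# assuming u (incident) and v (outgoing) are unit vectors in the local frame
# with the surface normal along +z.
import numpy as np

def bivariate_angles(u, v):
    h = u + v
    h = h / np.linalg.norm(h)                               # halfway vector
    theta_h = np.arccos(np.clip(h[2], -1.0, 1.0))           # angle between h and the normal
    theta_d = np.arccos(np.clip(np.dot(u, h), -1.0, 1.0))   # angle between u and h
    return theta_h, theta_d
```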
The accuracy of bivariate representations of the materials in the MERL BRDF
database [14] are shown in Fig. 3, where they are sorted by relative RMS BRDF
error:
E_{rms} = \left( \frac{ \sum_{\theta_h, \theta_d, \phi_d} \big( f(\theta_h, \theta_d, \phi_d) - \bar{f}(\theta_h, \theta_d) \big)^2 }{ \sum_{\theta_h, \theta_d, \phi_d} \big( f(\theta_h, \theta_d, \phi_d) \big)^2 } \right)^{1/2} ,   (1)
with
\bar{f}(\theta_h, \theta_d) = \frac{1}{|\Phi(\theta_h, \theta_d)|} \sum_{\phi_d \in \Phi(\theta_h, \theta_d)} f(\theta_h, \theta_d, \phi_d) .
Here, Φ(θh , θd ) is the set of valid φd values given fixed values of θh and θd .
The figure also shows synthetic images of materials that are more and less well-
represented by a bivariate BRDF. Overall, our tests suggest that the overwhelm-
ing majority of the materials in the database are reasonably well-represented by
bivariate functions. We even find that the bivariate reduction has positive effects
in some cases. For example, the original green-acrylic BRDF has lens flare ar-
tifacts embedded in its measurements2 , and these are removed by the bivariate
reduction (see Fig. 3).
Motivation for a bivariate representation is provided by the work of Stark et
al. [2] who show empirically that a carefully-selected 2D domain is often sufficient
2 W. Matusik, personal communication.
3 Passive Reflectometry
We assume that we are given one or more images of a known curved surface,
and that these images are acquired under known distant lighting, such as that
measured by an illumination probe. In this case, each pixel in the images pro-
vides a linear constraint on the BRDF, and our goal is to infer the reflectance
function from these constraints. While the constraints from a single image are
not sufficient to recover a general 3D isotropic BRDF [7], we show that they
often are sufficient to recover plausible bivariate reflectance.
To efficiently represent specular highlights, retro-reflections and Fresnel ef-
fects, we can benefit from a non-uniform sampling of the 2D domain. While
‘good’ sampling patterns can be learned from training data [16], this approach
may limit our ability to generalize to new materials. Instead, we choose to man-
ually design a sampling scheme that is informed by common observations of
reflectance phenomena. This is implemented by defining continuous functions
s(θh, θd) and t(θh, θd) and sampling uniformly in (s, t). Here we use s = 2θd/π and t = √(2θh/π), which increases the sampling density near specular reflections (θh ≈ 0). With this in mind, we write the rendering equation as
I(v, n) = \int_{\Omega} L(R_n^{-1} u) \, f\big(s(u, R_n v), t(u, R_n v)\big) \cos\theta_u \, du ,   (2)
Fig. 4. Constraints on bivariate reflectance from natural lighting. Each pixel of an input
image (middle) captured under distant illumination (left) gives a linear constraint that
can be interpreted as an inner product of the 2D BRDF (right, first argument) and
a visible hemisphere of lighting that is weighted, warped and folded across the local
view/normal plane (right, second argument).
where N_k is the set of the four BRDF grid points that are closest to (s(u_k, R_n v), t(u_k, R_n v)), and α_k^{i,j} is the coefficient of the bilinear interpolation associated with these coordinates and (s_i, t_j). (We find a piecewise linear approximation of the BRDF to be adequate.) This equation can be rewritten as
I(v, n) \approx \frac{2\pi}{|\Omega_d|} \sum_{(s_i, t_j) \in S} f(s_i, t_j) \sum_{u_k \in \mathrm{bin}_{ij}} \alpha_k^{i,j} L(R_n^{-1} u_k) \cos\theta_{u_k} ,   (4)
or, in matrix form,
I = L f .   (5)
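The following sketch assembles one such linear constraint (one row of L) by summation over hemisphere samples; the nearest-bin assignment (instead of the bilinear weights α), the 32 x 32 grid, and the (s, t) mapping reproduced from the partly garbled text above are simplifying assumptions rather than the authors' exact procedure.

```python
# Sketch of Eq. (4): accumulate one pixel's constraint row over uniformly
# distributed hemisphere samples u_k (local frame), with env_radiance[k] =
# L(R_n^{-1} u_k). Nearest-bin assignment is used instead of bilinear weights.
import numpy as np

def observation_row(v, sample_dirs, env_radiance, Rn, grid=32):
    row = np.zeros(grid * grid)
    weight = 2.0 * np.pi / len(sample_dirs)             # solid-angle weight per sample
    v_loc = Rn @ v                                       # view direction in the local frame
    for u, radiance in zip(sample_dirs, env_radiance):
        h = (u + v_loc) / np.linalg.norm(u + v_loc)      # halfway vector
        theta_h = np.arccos(np.clip(h[2], -1.0, 1.0))
        theta_d = np.arccos(np.clip(np.dot(u, h), -1.0, 1.0))
        s, t = 2.0 * theta_d / np.pi, np.sqrt(2.0 * theta_h / np.pi)
        i = min(int(s * grid), grid - 1)                 # nearest (s, t) bin
        j = min(int(t * grid), grid - 1)
        row[i * grid + j] += weight * radiance * max(u[2], 0.0)  # cos(theta_u)
    return row
```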
As with general 4D BRDFs, bivariate BRDFs vary slowly over much of their
domain. Regularization can therefore be implemented in the form of a smooth-
ness constraint in the st-plane. There are many choices here, and we have found
spatially-varying Tikhonov-like regularization to be especially effective. Accord-
ing to this design choice, the optimization becomes
\operatorname*{argmin}_{f} \; \|I - Lf\|_2^2 + \alpha \left( \|\Lambda_s^{-1} D_s f\|_2^2 + \|\Lambda_t^{-1} D_t f\|_2^2 \right)   (6)
subject to f ≥ 0,
where Ds and Dt are |S| × |S| derivative matrices, and α is a tunable scalar
regularization parameter. The matrices Λs and Λt are diagonal |S|× |S| matrices
that affect non-uniform regularization in the bivariate BRDF domain. Their
diagonal entries are learned from the MERL database by setting each to the
variance of the partial derivative at the corresponding st domain point, where
the variance is computed across all materials in the database. Probabilistically,
this approach can be interpreted as seeking the MAP estimate with independent,
zero-mean Gaussian priors on the bivariate BRDF’s partial derivatives.
There are many possible alternatives for regularization. For example, one
could learn a joint distribution over the entire bivariate domain, perhaps by
characterizing this distribution in terms of a small number of modes of variation.
However, we have found that the simple approach in Eq. 6 provides reasonable
results, does not severely ‘over-fit’ the MERL database, and is computationally
quite efficient (it is a constrained linear least squares problem).
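A brief sketch of how the optimization in Eq. (6) can be posed as a single non-negative linear least-squares problem by stacking the data term and the two weighted-derivative penalties; the use of scipy's lsq_linear is an implementation assumption, not the authors' solver.

```python
# Sketch of Eq. (6): stack [L; sqrt(alpha)*Lam_s^{-1} Ds; sqrt(alpha)*Lam_t^{-1} Dt]
# against [I_obs; 0; 0] and solve with a non-negativity bound on f.
import numpy as np
from scipy.optimize import lsq_linear

def recover_brdf(L, I_obs, Ds, Dt, Lam_s, Lam_t, alpha):
    sa = np.sqrt(alpha)
    A = np.vstack([L,
                   sa * np.linalg.solve(Lam_s, Ds),   # Lam_s^{-1} Ds (Lam_s is diagonal)
                   sa * np.linalg.solve(Lam_t, Dt)])  # Lam_t^{-1} Dt
    b = np.concatenate([I_obs, np.zeros(Ds.shape[0] + Dt.shape[0])])
    return lsq_linear(A, b, bounds=(0.0, np.inf)).x   # enforces f >= 0
```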
We begin with an evaluation that uses images synthesized with tabulated BRDF
data from the MERL database [14], measured illumination3 , and a physically
based renderer4 . Using these tools, we can render images for input to our al-
gorithm as well as images with the recovered BRDFs for direct comparison to
ground truth. In all cases, we use complete 3D isotropic BRDF data to create
the images for input and ground-truth comparison, since this is closest to a real-
world setting. Also, we focus our attention on the minimal case of a single input
image; with additional images, the performance can only improve. It is worth
emphasizing that this data is not free of noise. Sources of error include the fact
that the input image is rendered with a 3D BRDF as opposed to a bivariate one,
that normals are computed from a mesh and are stored at single precision, and
that a discrete approximation to the rendering equation is used.
Given a rendered input image of a defined shape (we use a sphere for sim-
plicity), we harvest observations from 8,000 normals uniformly sampled on the
visible hemisphere to create an observation vector I of length 8,000. We discard
normals that are at an angle of more than 80◦ from the viewing direction, since
the signal to noise ratio is very low at these points. The bivariate BRDF domain
is represented using a regular 32 × 32 grid on the st-plane, and our observation
matrix L is therefore M × 1024, where M is the number of useable normals. The
entries in L are computed using Eq. 4 with 32,000 points uniformly distributed
on the illumination hemisphere. With I and L determined, we can solve for the
unknown BRDF as described in the previous sections.
We find it beneficial to use a small variant of the optimization in Eq. 6: we
solve the problem twice using two separate pairs of diagonal weight matrices
(Λs , Λt ). One pair gives preference to diffuse reflectance, while the other gives
preference to gloss. This provides two solutions, and we choose the one with low-
est residual. Using this procedure, we were able to use the same weight matrices
and regularization parameter (α) for all results in this paper. In every case, the
optimizations were initialized with a Lambertian BRDF.
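A sketch of this two-pass variant, reusing a solver such as the recover_brdf sketch above; the interface and the residual criterion shown here are assumptions.

```python
# Sketch: solve once with diffuse-preferring weights and once with
# gloss-preferring weights, then keep the solution with the lower data residual.
import numpy as np

def recover_best(L, I_obs, Ds, Dt, weights_diffuse, weights_gloss, alpha, solve):
    """weights_* are (Lam_s, Lam_t) pairs; solve(...) returns a BRDF vector f."""
    best = None
    for Lam_s, Lam_t in (weights_diffuse, weights_gloss):
        f = solve(L, I_obs, Ds, Dt, Lam_s, Lam_t, alpha)
        residual = np.linalg.norm(L @ f - I_obs)
        if best is None or residual < best[0]:
            best = (residual, f)
    return best[1]
```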
Results are shown in Fig. 5. The two left columns show results using a single
input image synthesized with the Grace Cathedral environment. The recovered
bivariate BRDFs are compared to the (3D) ground truth by synthesizing images
in another setting (St. Peter’s Basilica). Close inspection reveals very little no-
ticeable difference between the two images, and the recovered BRDF is visually
quite accurate. There are numerical differences, however, and these have been
scaled by 100 for visualization. Note that some of this error is simply due to the
bivariate approximation (see Fig. 6). The next two columns similarly show the
recovery of the yellow-matte-plastic and green-acrylic materials, this time using
the Cafe environment and the St. Peter’s Basilica environment (alternately) for
input and comparison to ground truth.
3 Light probe image gallery: http://www.debevec.org/Probes/
4 PBRT: http://www.pbrt.org/
Fig. 5. Visual evaluation with MERL BRDF data. A bivariate BRDF is estimated
from a single input image (top), and this estimate is used to render a new image under
novel lighting (second row ). Ground truth images for the novel environments are shown
for comparison, along with difference images scaled by 100. Few noticeable differences
exist. Far right: Environment maps used in the paper, top to bottom: St. Peter’s
Basilica, Grace Cathedral, Uffizi Gallery, Cafe and Corner Office.
Fig. 6. Quantitative evaluation with MERL BRDF data. Top: Incident plane scatter-
plots for the four materials in Fig. 5, each showing: original 3D BRDF (blue); ‘ground
truth’ bivariate BRDF (green); and BRDF recovered from one input image (red ). Bot-
tom: Relative RMS BRDF errors for all materials in the MERL database when each is
recovered using a single image under the Grace Cathedral or St. Peter’s environments.
Vertical red lines match the scatterplots above.
purposes). The discrepancy between the results for the two different environ-
ments is expected in light of the discussion from Sect. 3.1. To further emphasize
this environment-dependence, Fig. 7 compares estimates of yellow-matte-plastic
using two different input images. The Uffizi Gallery environment (top left) does
not provide strong observations of grazing angle effects, so this portion of the
BRDF is not accurately estimated. This leads to noticeable artifacts near grazing
angles when the recovered BRDF is used for rendering, and it is clearly visible
in a scatter plot. When the Cafe environment is used as input, however, more
accurate behavior near grazing angles is obtained.
Fig. 7. Dependence on environment used for capture. An input image under the Uffizi
Gallery environment (top left) does not contain strong observations of grazing angle
effects, and as a result, the recovered BRDF is inaccurate. This is visible in a scatter
plot (bottom right, black curves) and causes noticeable artifacts when used to render in
a novel setting. If a different environment is used as input (bottom left) these artifacts
are largely avoided.
Fig. 8. Results using captured data. A BRDF is estimated from a single input image
(top) under a known environment. This recovered BRDF is used to render a synthetic
image for novel view within the same environment (middle). An actual image for the
same novel position is shown for comparison (bottom). Despite the existence of non-
idealities such as surface mesostructure and spatial inhomogeneity, plausible BRDFs
are recovered.
2) surface mesostructure (e.g., the green sphere); and 3) spatial reflectance vari-
ations (e.g., the grey sphere). Presently, surface shape is computed by assuming
the camera to be orthographic and estimating the center and radius of the sphere
in the camera’s coordinate system. Errors in this process, coupled with errors in
the alignment with the illumination probe, lead to structured measurement noise.
Despite this, our results suggest that plausible BRDFs can be recovered for a di-
versity of materials.
5 Discussion
Acknowledgements
References
1. Fleming, R., Dror, R.O., Adelson, E.H.: Real-world illumination and the perception
of surface reflectance properties. Journal of Vision 3 (2003)
2. Stark, M., Arvo, J., Smits, B.: Barycentric parameterizations for isotropic BRDFs.
IEEE Transactions on Visualization and Computer Graphics 11, 126–138 (2005)
3. Ngan, A., Durand, F., Matusik, W.: Experimental analysis of BRDF models. In:
Eurographics Symposium on Rendering, pp. 117–126 (2005)
4. Ward, G.: Measuring and modeling anisotropic reflection. Computer Graphics
(Proc. ACM SIGGRAPH) (1992)
5. Marschner, S., Westin, S., Lafortune, E., Torrance, K., Greenberg, D.: Image-based
BRDF measurement including human skin. In: Proc. Eurographics Symposium on
Rendering, pp. 139–152 (1999)
6. Ghosh, A., Achutha, S., Heidrich, W., O’Toole, M.: BRDF acquisition with basis
illumination. In: Proc. IEEE Int. Conf. Computer Vision (2007)
7. Ramamoorthi, R., Hanrahan, P.: A signal-processing framework for inverse render-
ing. In: Proceedings of ACM SIGGRAPH, pp. 117–128 (2001)
8. Boivin, S., Gagalowicz, A.: Image-based rendering of diffuse, specular and glossy
surfaces from a single image. In: Proceedings of ACM SIGGRAPH (2001)
9. Yu, Y., Debevec, P., Malik, J., Hawkins, T.: Inverse global illumination: recover-
ing reflectance models of real scenes from photographs. In: Proceedings of ACM
SIGGRAPH (1999)
10. Georghiades, A.: Incorporating the Torrance and Sparrow model of reflectance in
uncalibrated photometric stereo. In: Proc. IEEE Int. Conf. Computer Vision, pp.
816–823 (2003)
11. Hara, K., Nishino, K., Ikeuchi, K.: Mixture of spherical distributions for single-
view relighting. IEEE Trans. Pattern Analysis and Machine Intelligence 30, 25–35
(2008)
12. Patow, G., Pueyo, X.: A Survey of Inverse Rendering Problems. Computer Graph-
ics Forum 22, 663–687 (2003)
13. Rusinkiewicz, S.: A new change of variables for efficient BRDF representation. In:
Eurographics Rendering Workshop, vol. 98, pp. 11–22 (1998)
14. Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model.
ACM Transactions on Graphics (Proc. ACM SIGGRAPH) (2003)
15. Westin, S.: Measurement data, Cornell University Program of Computer Graphics
(2003), http://www.graphics.cornell.edu/online/measurements/
16. Matusik, W., Pfister, H., Brand, M., McMillan, L.: Efficient isotropic BRDF mea-
surement. In: Proc. Eurographics Workshop on Rendering, pp. 241–247 (2003)
17. Alldrin, N., Zickler, T., Kriegman, D.: Photometric stereo with non-parametric and
spatially-varying reflectance. In: Proc. CVPR (2008)
Fusion of Feature- and Area-Based Information for
Urban Buildings Modeling from Aerial Imagery
Abstract. Accurate and realistic building models of urban environments are increasingly important for applications like virtual tourism or city planning. Initiatives like Virtual Earth or Google Earth aim at offering virtual models of all major cities worldwide. The prohibitively high costs of manual generation of
such models explain the need for an automatic workflow.
This paper proposes an algorithm for fully automatic building reconstruction
from aerial images. Sparse line features delineating height discontinuities and
dense depth data providing the roof surface are combined in an innovative man-
ner with a global optimization algorithm based on Graph Cuts. The fusion pro-
cess exploits the advantages of both information sources and thus yields superior
reconstruction results compared to the individual sources. The nature of the algorithm also allows image-driven levels of detail of the geometry to be generated elegantly.
The algorithm is applied to a number of real world data sets encompassing
thousands of buildings. The results are analyzed in detail and extensively evalu-
ated using ground truth data.
1 Introduction
Algorithms for the semi- or fully automatic generation of realistic 3D models of urban
environments from aerial images have been a subject of research for many years. Such models were first needed for urban planning purposes or for virtual tourist guides. Since the advent
of web-based interactive applications like Virtual Earth and Google Earth and with the
adoption of 3D content for mashups the demand for realistic models has significantly
increased. The goal is to obtain realistic and detailed 3D models for entire cities.
This poses several requirements for the algorithm: First, it should not require any
manual interaction because this would induce high costs. This restriction also dissuades
the use of cadastral maps as they vary in accuracy, are not readily available everywhere and require careful registration to the aerial data. Additionally, such a dependency increases the cost of large-scale deployment. Second, the algorithm should be flexible
enough to generate accurate models for common urban roof structures without limiting
itself to one specific type, like gabled roofs or rectangular outlines, for example. (This work has been supported by the FFG project APAFA (813397) under the FIT-IT program.) This
also includes the requirement to be able to deal with complex compositions of roof
shapes if those happen to be adjacent. Third, the algorithm should have a certain degree
of efficiency as it is targeted at thousands of cities with millions of buildings in total.
Last, the algorithm should be robust: the visual appearance should degrade gracefully in the presence of noise or bad input data quality.
In the following, a survey and assessment of existing algorithms is given; these algorithms fail to meet one or more of the above-mentioned requirements.
Among the early approaches are feature based modelling methods ([1,2,3,4,5])
which show very good results for suburban areas. The drawback of those methods is
their reliance on sparse line features to describe the complete geometry of the build-
ing. The fusion of those sparse features is very fragile as there is no way to obtain the
globally most consistent model.
The possibility of using additional data (cadastral maps and other GIS data in most
cases) to help in the reconstruction task is apparent and already addressed in many
publications ([6,7,8]). Such external data, however, is considered manual intervention
in our work and thus not used.
A different group of algorithms concentrates on the analysis of dense altimetry data
obtained from laser scans or dense stereo matching ([9,10]). Such segmentation ap-
proaches based solely on height information, however, are prone to failure if buildings
are surrounded by trees and require a constrained model to overcome the smoothness
of the data at height discontinuities. Guhno and Downman ([11]) combined the eleva-
tion data from a LIDAR scan with satellite imagery using rectilinear line cues. Their
approach was, however, limited to determining the outline of a building. In our work
we develop this approach further and embed it into a framework which overcomes the
problems described above.
In [12] we have proposed a workflow to automatically derive the input data used
in this paper. The typical aerial images used in the workflow have 80% along-strip
overlap and 60% across-strip overlap. This highly redundant data is utilized in this
paper. Similar approaches have been proposed by others ([13,14]), which demonstrate
that it is possible to automatically derive a digital terrain model, digital elevation model,
land use classification and orthographic image from aerial images. Figure 1 illustrates
the available data which is used for the reconstruction algorithm and also shows the
result of the proposed algorithm.
Our proposed method does not need any manual intervention and uses only data
derived from the original aerial imagery. It combines dense height data with feature matching to overcome the problem of precise localization of height discontinu-
ities. The nature of this fusion process separates discovery of geometric primitives from
the generation of the building model in the spirit of the recover-and-select paradigm
([15]), thus lending robustness to the method as the globally optimal configuration is chosen. The integration of the theory of instantaneous kinematics ([16]) makes it possible to elegantly detect and estimate surfaces of revolution, which describe a much broader family of roof shapes. A major feature of the proposed method is the possibility to generate various
levels of geometric detail.
The rest of the paper is structured as follows: Chapter 2 gives a general overview of
the method. In Chapter 3 we will describe the discovery of geometric primitives which
are used to approximate the roof shape, whereas Chapter 4 discusses the building seg-
mentation. Chapter 5 gives details about the fusion process which combines line fea-
tures and dense image data. Results and experiments are outlined in Chapter 6. Finally,
conclusions and further work are described in Chapter 7.
Fig. 2. Illustration of the individual steps of the proposed method: height data and building mask are used to obtain a set of geometric primitives; in parallel, the 3D lines are used to generate a
segmentation of the building. Finally, a labeled segmentation is produced.
The building mask is combined with the dense height data, thus filtering out all 3D
points which do not belong to the building. Afterwards the remaining points are grouped
into geometric primitives. The geometric primitives are the basic building blocks for
assembling the roof shape.
The 3D line segments are projected into the height field and used to obtain a line-
based segmentation of the building. The 2D lines of the segmentation form polygons
which are then assigned to one of the geometric primitives. Therefore, it is important
that the 3D lines capture the location of the height discontinuities as each polygon is
treated as one consistent entity which can be described by one geometric primitive. By
extruding each of the 2D polygons to the assigned geometric primitive a 3D model of
the building is generated.
Note that the algorithm presented in this paper makes no assumptions about the roof
shape. Façades are modeled as vertical planes, because the oblique angle of the aerial
images does not allow a precise reconstruction of any details.
3 Geometric Primitives
Geometric primitives form the basic building blocks which are used to describe the roof
shape of a building. Currently two types of primitives, namely planes and surfaces of
revolution, are used, but the method can be trivially extended to support other primitives.
It is important to note that the detection of geometric primitives is independent of the composition of the model. This means that an arbitrary number of hypotheses can
be collected and fed into later stages of the algorithm. As the order of discovery of the
primitives is not important, weak and improbable hypotheses are also collected as they
will be rejected later in the fusion step. If a primitive is missed, the algorithm selects
another detected primitive instead which minimizes the incurred reconstruction error.
3.1 Planes
Efficiently detecting planes in point clouds for urban reconstruction is well studied and
robust algorithms are readily available ([9]). Thanks to the independence of hypothesis
discovery and model selection, a region growing process is sufficient in our workflow
for the discovery of planes. Depending on the size of the building a number of ran-
dom seed points are selected, for which the normal vector is estimated from the local
neighbourhood. Starting from the seed points, neighbours are added which fit the ini-
tial plane estimate. This plane is regularly refined from the selected neighbours. Small
regions are rejected to improve the efficiency of the optimization phase. Due to their
frequency, close-to-horizontal planes are modified to make them exactly horizontal; the other, oblique planes are left unchanged.
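An illustrative sketch of this region-growing step (not the authors' implementation); the neighbourhood function and the distance threshold are assumptions.

```python
# Sketch: grow a planar region from a seed point, regularly refitting the plane
# to the currently selected points; neighbours(i) is an assumed helper that
# returns the indices of points near point i.
import numpy as np

def fit_plane(pts):
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)           # smallest singular vector = normal
    return centroid, vt[-1]

def grow_plane(points, seed, neighbours, dist_thresh=0.1):
    region = {seed}
    frontier = [seed]
    centroid, normal = fit_plane(points[list(neighbours(seed)) + [seed]])
    while frontier:
        i = frontier.pop()
        for j in neighbours(i):
            if j not in region and abs(np.dot(points[j] - centroid, normal)) < dist_thresh:
                region.add(j)
                frontier.append(j)
        centroid, normal = fit_plane(points[list(region)])   # regular refinement
    return region, (centroid, normal)
```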
Planar approximations of certain roof shapes (domes and spires for example) obtained
from plane fitting algorithms, however, are not robust, visually displeasing and do not
take the redundancy provided by the symmetrical shape into account. Therefore it is
necessary to be able to deal with other shapes as well and combine them seamlessly to
obtain a realistic model of the building.
Surfaces of revolution are a natural description of domes and spires and can be ro-
bustly detected. Mathematically such surfaces can be described by a 3D curve which
moves in space according to a Euclidean motion. Instantaneous kinematics gives a relationship ([19]) between the Euclidean motion parameters and the corresponding ve-
locity vector field. Using that connection it is possible to estimate the parameters of
the Euclidean motion in a least squares sense given the normal vectors of the resulting
surface.
The equation
v(x) = c̄ + c × x (1)
describes a velocity vector field with a constant rotation and constant translation defined
by the two vectors c, c̄ ∈ R3 . If a curve sweeps along that vector field, the normal
vectors of all points on the resulting surface have to be perpendicular to the velocity
vector at the associated point. Thus
n(x) · v(x) = n(x) · (c̄ + c × x) = 0   (2)
holds, where n(x) gives the normal vector at point x. With equation (2) it is possi-
ble to estimate the motion parameters given at least six point and normal vector pairs
(x, n(x)) lying on the same surface generated by such a sweeping curve. In the case
of point clouds describing an urban scene the parameters can be constrained by requiring the rotation axis to be vertical. This already reduces the degrees of freedom to two
(assuming that z is vertical) and makes the problem easily solvable:
c̄ = (0, x, y)T c = (0, 0, 1)T
Fig. 3. Illustrations how starting with the dense height data the 3D curve is derived which gen-
erates the dome if it rotates around a vertical axis. (a) Raw height field with the detected axis,
(b) all inliers are projected into the halfplane formed by axis and a radial vector, (c) the moving
average algorithm produces a smooth curve.
where c̄ gives the position of the axis and c denotes the vertical rotation axis. The re-
maining two unknown parameters are estimated by transforming each 3D point with
the estimated normal vector (x, n(x)) into a Hough space ([20]). Local maxima in the
accumulation space indicate axes for surfaces of revolution. For each axis all inliers
are computed and projected into the halfplane spanned by the rotation axis and an ar-
bitrary additional radial vector. The redundancy of the symmetrical configuration can
be exploited by a moving average algorithm in order to estimate a smooth curve which
generates the surface containing the inliers. Figure 3 illustrates those steps with a point
cloud describing the shape of a spire.
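To make the constrained estimation concrete, the sketch below accumulates, for every point-and-normal pair, the line of admissible axis positions (a, b) implied by n(x) · (c̄ + c × x) = 0 with c = (0, 0, 1), and returns the strongest Hough peak; the search bounds and resolution are assumptions, not values from the paper.

```python
# Sketch: Hough voting for the horizontal position (a, b) of a vertical axis of
# revolution. For c = (0, 0, 1) and an axis through (a, b, 0), the constraint
# n . v = 0 reduces to the line  ny * a - nx * b = ny * x - nx * y.
import numpy as np

def hough_vertical_axis(points, normals, bounds, res=200):
    (amin, amax), (bmin, bmax) = bounds
    acc = np.zeros((res, res))
    a_vals = np.linspace(amin, amax, res)
    for (x, y, _), (nx, ny, _) in zip(points, normals):
        if abs(nx) < 1e-9:
            continue                                   # skip near-degenerate votes
        b_vals = (ny * a_vals - (ny * x - nx * y)) / nx
        j = ((b_vals - bmin) / (bmax - bmin) * (res - 1)).astype(int)
        ok = (j >= 0) & (j < res)
        acc[np.arange(res)[ok], j[ok]] += 1
    i, j = np.unravel_index(acc.argmax(), acc.shape)
    return a_vals[i], bmin + (bmax - bmin) * j / (res - 1)
```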
4 Segmentation
The goal of the segmentation is to represent the general building structure - not only a
rectangular shape - as a set of 2D polygons.
The approach of Schmid and Zisserman ([21]) is used for the generation of the 3D
line set that is then used for the segmentation of the building into 2D polygons. A 3D
line segment must have observations in at least four images in order to be a valid hypoth-
esis. This strategy ensures that the reliability and geometric accuracy of the reported 3D
line segments is sufficiently high. The presence of outliers is tolerable since the purpose
of the 3D lines is to provide a possible segmentation of the building. Any 3D line that
does not describe a depth discontinuity can be considered as an unwanted outlier which
will contribute to the segmentation, but will be eliminated in the fusion stage.
The matched 3D line segments are used to obtain a 2D segmentation of the building
into polygons by applying an orthographic projection. The 2D lines cannot be used di-
rectly to segment the building, however, as the matching algorithm often yields many
short line segments describing the same height discontinuity. A grouping mechanism
merges those lines to obtain longer and more robust lines. A weighted orientation
Fig. 4. Segmentation into polygons: (a) The matched 3D lines are projected into the 2½D height
field, (b) outliers are eliminated by a weighted orientation histogram which helps to detect princi-
pal directions of the building. (c) Along those directions lines are grouped, merged and extended
to span the whole building.
histogram - the weights correspond to the length of each line - is created. The prin-
cipal orientations are detected by finding local maxima in the histogram. Along those
directions quasi parallel lines are grouped and merged thus refining their position.
Each grouped line is extended to span the whole building in order to simplify the
segmentation process. The lines are splitting the area into a number of polygons. Each
polygon is considered to be one consistent entity where the 3D points can be approxi-
mated by one geometric primitive.
Figure 4 illustrates this concept. The advantage of this approach is that no assumption or constraint on the shape, angles and connectivity of the building is necessary.
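A small sketch of the length-weighted orientation histogram used to find the principal directions; the 5 degree bin width is an illustrative assumption.

```python
# Sketch: length-weighted orientation histogram of the projected 2D line
# segments; local maxima give the building's principal directions.
import numpy as np

def principal_directions(segments, bin_deg=5):
    """segments: iterable of ((x1, y1), (x2, y2)) endpoints of 2D lines."""
    bins = 180 // bin_deg
    hist = np.zeros(bins)
    for (x1, y1), (x2, y2) in segments:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180.0
        hist[int(angle // bin_deg) % bins] += np.hypot(x2 - x1, y2 - y1)
    peaks = [i * bin_deg for i in range(bins)
             if hist[i] > 0 and hist[i] >= hist[i - 1] and hist[i] >= hist[(i + 1) % bins]]
    return peaks
```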
5 Information Fusion
Each polygon resulting from the segmentation is assigned to one geometric primitive
(plane or surface of revolution, see Chapter 3). This labeling yields a piecewise planar reconstruction of the building - surfaces of revolution are approximated by
a rotating polyline and therefore also yield piecewise planar surfaces in the polyhedral
model.
The goal of the fusion step is to approximate the roof shape by the geometric prim-
itives in order to fulfill an optimization criterion. In this paper we use the Graph Cuts
algorithm with alpha-expansion moves ([22,23]), but other techniques like belief propa-
gation are suited as well. The goal of this optimization is to select a geometric primitive
for each polygon of the segmentation and to find an optimal trade-off between data
fidelity and smoothness, by minimizing an energy of the form
E(f) = \sum_{p \in P} D_p(f_p) + \sum_{(p,q) \in N} V_{p,q}(f_p, f_q) ,
where V_{p,q}(f_p, f_q) is called the smoothness term for the connected nodes p and q which are labeled f_p and f_q, and D_p(f_p) is called the data term which measures the data fidelity obtained by assigning the label f_p to node p.
In our approach the segmentation induces a set P of polygons, where each polygon
represents a node of the graph. The neighbourhood relationship is reflected by the set N, which contains pairs of adjacent polygons, i.e. polygons sharing an edge. The set of
labels used in the optimization process represent the geometric primitives (planes and
surfaces of revolution):
Thus fp ∈ L reflects the label (current geometric primitive) assigned to node (polygon)
p ∈ P.
The optimization using polygons is much faster than optimizing for each individual
pixel because there are far fewer polygons than pixels. On the other hand, it also
exploits the redundancy of the height data because it is assumed that all pixels in one
polygon belong to the same geometric primitive.
In our context the smoothness term measures the length of the border between two
polygons and the data term measures the deviation between the observed surface (ob-
tained from the dense image matching algorithm) and the fitted primitive. The following
formulae are used to calculate those two terms:
D_p(f_p) = \sum_{x \in p} \big| \mathrm{height}_{obs}(x) - \mathrm{height}_{f_p}(x) \big|   (5)

V_{p,q}(f_p, f_q) = \begin{cases} \lambda \cdot \mathrm{length}(\mathrm{border}(p, q)) & \text{if } f_p \neq f_q \\ 0 & \text{if } f_p = f_q \end{cases}   (6)
where p and q denote two polygons and fp is the current label of polygon p. The preset
constant λ can be used to weight the two terms in the energy functional. The data term
Dp calculates an approximation of the volume between the point cloud (heightobs(x))
and primitive fp (heightfp (x)) by sampling points x which lie within the polygon p.
This sampling strategy makes it possible to treat all geometric primitives in the same way, because each is reduced to the incurred difference in volume and the border induced with polygons assigned to another geometric primitive. The smoothness term Vp,q penalizes neigh-
bouring polygons with different labels depending on their common border, thus favour-
ing homogeneous regions.
The alpha-expansion move is used in order to efficiently optimize the labeling of
all polygons with respect to all discovered primitives. The initial labeling can either be
random or a labeling which minimizes only the data term for each individual polygon.
After a few iterations (usually less than 5), the optimization converges and all 2D poly-
gons can be extruded to the respective height of the assigned primitive to generate a
polyhedral model of the building.
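As a small illustration of what the alpha-expansion moves are minimizing, the sketch below evaluates the labeling energy assembled from the data term (5) and the smoothness term (6); the helper functions are assumptions, and a real implementation would hand this energy to a Graph Cuts library rather than evaluate it directly.

```python
# Sketch: energy of a labeling f (polygon -> primitive id), combining the
# per-polygon volume deviation (data term) with the border-length penalty
# between differently labeled neighbours (smoothness term, weight lam).
def labeling_energy(f, sample_heights, primitive_height, neighbours, border_length, lam):
    """sample_heights[p] maps sample points inside polygon p to observed heights;
    primitive_height(label, x) is the assigned primitive's height at x."""
    data = sum(abs(h_obs - primitive_height(f[p], x))
               for p, samples in sample_heights.items()
               for x, h_obs in samples.items())
    smooth = sum(border_length(p, q) for p, q in neighbours if f[p] != f[q])
    return data + lam * smooth
```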
Table 1. The impact of the smoothness parameter λ on the reconstructed model. The number of
unique labels used after the Graph Cuts optimization iterations decreases as well as the number of
triangles in the polygonal model. Δ Volume denotes the estimated difference in volume between
the surface obtained by dense image matching and the reconstructed model (data term). The last
column refers to the accumulated length of all borders in the final labeling (smoothness term).
between observed height values and reconstructed models. This feature can be used to
generate different models with varying smoothness, trading data fidelity for geometric simplification as smaller details of the building are omitted. An example of such a
simplification is shown in Figure 6. The relevant numbers for that building are given in
Table 1.
6 Experiments
The first illustrative experiment was conducted on a test data set of Graz. The ground
sampling distance of the aerial imagery is 8cm. The examined building features four
small cupolas at the corners. Additionally one façade is partially occluded by trees.
Figure 5 shows the results of the reconstruction process. The texture of the façades is
well aligned, implying that their orientation was accurately estimated by the 3D line
matching. The domes are smoothly integrated into the otherwise planar reconstruction.
Even the portion occluded by the tree has been straightened by the extension of the
matched 3D lines.
The next example is taken from a data set of Manhattan, New York. This building
shows that the reconstruction algorithm is not limited to façades perpendicular or par-
allel to each other. Figure 6 illustrates the effect of the smoothness term in the global
optimization energy function. Various runs with different values for λ yield a reduced
triangle count as the geometry is progressively simplified. Table 1 gives details about
the solution for different values of λ. The Graph Cuts algorithm finds a globally optimal tradeoff between data fidelity and generalization. Those properties are ex-
pressed by the decreased length of borders and number of labels (which translate in
general to fewer triangles) at the cost of an increase of the average difference between
reconstructed and observed surface.
Fig. 5. The stages of the reconstruction are illustrated by means of the building of the Graz Uni-
versity of Technology: (a) Segmented height field, (b) labeled polygons after the Graph Cuts
optimization, (c) screenshot of the reconstructed model (λ = 5)
Fig. 6. Levels of Detail: The same building was reconstructed with different values for λ. The
number of geometric primitives used to approximate the shape of the roof is decreasing with
higher values for λ. In the upper row a screenshot of the reconstruction is depicted, below are
illustrations of the matching labeling obtained by the Graph Cuts optimization.
Apart from judging the visual appearance of the resulting models, we assess the
quality of the reconstructed models by comparing them to a ground truth which was
obtained manually from the same imagery. For this purpose we use a stereoscopic de-
vice to trace the roof lines in 3D. Those roof lines are connected to form polygons and
then extruded to the ground level. Those manually reconstructed models are considered
ground truth data in this paper. Using this procedure the whole data set from Manhattan
(consisting of 1419 aerial images at 15cm ground sampling distance) was processed
yielding 1973 buildings.
A comparison of manual and automatic reconstruction for one building is illustrated
in Figure 7. Both building models are converted into a height field with a ground sam-
pling distance of 15cm. This makes it easy to determine and illustrate their differences.
Figure 8 gives a breakdown of the height differences as a cumulative probability distribution. Those graphs give the percentage of pixels where the height difference be-
tween manual and automatic reconstruction is lower than a certain threshold. Analysis
of this chart shows that for the whole data set of Manhattan (1973 buildings) 67.51%
of the pixels have a height difference smaller than 0.5m, 72.85% differ by less than
1m and 86.91% are within 2m. There are two main reasons for discrepancies of height
values: On the one hand there are displacement errors of roof edges which lead to large
height differences, depending on the height of the adjacent roof. On the other hand the
human operator is able to recognize small superstructural details on the roofs, like elevator shafts and air conditioning units, which cause height differences usually below
2m. Those small features are sometimes missed by the automatic reconstruction.
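The evaluation can be summarized by the following sketch, which assumes that both models have already been rasterized into height fields on the same grid and reports the fraction of pixels below each difference threshold:

```python
# Sketch: cumulative distribution of absolute height differences between the
# manually and automatically reconstructed height fields (same grid, e.g. 15 cm).
import numpy as np

def height_difference_cdf(h_manual, h_auto, thresholds=(0.5, 1.0, 2.0)):
    diff = np.abs(h_manual - h_auto)
    return {t: float(np.mean(diff < t)) for t in thresholds}
```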
Fig. 7. Quality assessment with manually generated ground truth: (a) and (b) show the height fields of the manually and the automatically reconstructed building, (c) shows the height differences. The largest difference in the placement of edges is about two pixels, i.e. roughly 30 cm.
Fig. 8. Cumulative probability distribution of the height difference between manual and automatic reconstruction. The graphs show the error distribution for 1973 buildings from a data set of Manhattan, New York. The left plot covers height differences up to 100 meters; the right plot zooms in on differences up to five meters.
Detailed views of typical results from the Manhattan data set are shown in Figure 9.
The reconstruction of rectangular buildings is very successful, even though large portions
of their façades are occluded by trees. The integration of surfaces of revolution
realistically models domes and spires (see Figure 9b and 9d). Note that, for the purpose
of visualization, the surfaces of revolution are converted to triangle meshes by sampling
them regularly (every 2 m radially, with 45 degrees of angular separation); a sketch of
this conversion is given after Figure 9.
Fig. 9. Four detailed views of typical results for different types of buildings from the Manhattan data set: (a) rectangular buildings, (b) a rectangular building with a nicely integrated dome, (c) skyscrapers in the downtown area and (d) a skyscraper with a spire
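For illustration, the following sketch (not the authors' code) shows one way such a regular sampling could be implemented, assuming the surface of revolution is given as a radial profile sampled roughly every 2 m along its axis; the function name and data layout are assumptions of this sketch.

import math

def revolve_profile(profile, angular_step_deg=45.0):
    # profile: list of (radius, z) samples along the axis of revolution,
    #          assumed to be spaced about 2 m apart
    # returns (vertices, triangles) of the regularly sampled surface
    steps = int(round(360.0 / angular_step_deg))
    verts, tris = [], []
    for r, z in profile:                     # one ring of vertices per sample
        for k in range(steps):
            a = math.radians(k * angular_step_deg)
            verts.append((r * math.cos(a), r * math.sin(a), z))
    for i in range(len(profile) - 1):        # connect consecutive rings
        for k in range(steps):
            a, b = i * steps + k, i * steps + (k + 1) % steps
            c, d = a + steps, b + steps
            tris.append((a, b, d))           # two triangles per quad
            tris.append((a, d, c))
    return verts, tris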
The presented fusion of feature- and area-based algorithms using a global optimization
technique is very promising and is not restricted to the reconstruction of urban scenes
from aerial imagery. In addition, it allows the generation of different, globally optimal
levels of detail.
Future work will involve the investigation of further geometric primitives and of methods
to exploit the symmetries encountered in common roof shapes such as gabled roofs. Further
research will also be needed to evaluate the potential of this approach in other
applications such as street-side imagery.
Author Index