Object Detection
Abstract: In this paper, we present a general example-based framework for detecting objects in static images by components. The
technique is demonstrated by developing a system that locates people in cluttered scenes. The system is structured with four distinct
example-based detectors that are trained to separately find the four components of the human body: the head, legs, left arm, and right
arm. After ensuring that these components are present in the proper geometric configuration, a second example-based classifier
combines the results of the component detectors to classify a pattern as either a "person" or a "nonperson." We call this type of
hierarchical architecture, in which learning occurs at multiple stages, an Adaptive Combination of Classifiers (ACC). We present results
that show that this system performs significantly better than a similar full-body person detector. This suggests that the improvement in
performance is due to the component-based approach and the ACC data classification architecture. The algorithm is also more robust
than the full-body person detection method in that it is capable of locating partially occluded views of people and people whose body
parts have little contrast with the background.
Index Terms: Object detection, people detection, pattern recognition, machine learning, components.
1 INTRODUCTION
Fig. 1. These images demonstrate some of the challenges involved with detecting people in still images with cluttered backgrounds. People are
nonrigid objects and dress in a wide variety of colors and garment types. Additionally, people may be rotated in depth, partially occluded, or in motion
(i.e., running or walking).
features of a class from sets of labeled positive and negative examples. Example-based techniques have also been successfully used in other areas of computer vision, including object recognition [13].

People Detection in Images. Most people detection systems reported on in the literature either use motion information, explicit models, or a static camera, assume a single person in the image, or implement tracking rather than pure detection; relevant work includes [8], [10], [7]. Papageorgiou et al. have successfully employed example-based learning techniques to detect people in complex static scenes without assuming any a priori scene structure or using any motion information. Their system detects the full body of a person. Haar wavelets [12] are used to represent the images and Support Vector Machine (SVM) classifiers [25] are used to classify the patterns. Details are presented in [16], [15], and [14].

Papageorgiou's system has reported successful results detecting frontal, rear, and side views of people, indicating that the wavelet-based image representation scheme and the SVM classifier are well-suited to this particular application. However, the system's ability to detect partially occluded people or people whose body parts have little contrast with the background is limited.

Component-Based Object Detection Systems. Previous research suggests that some of these problems associated with Papageorgiou's full-body detection system may be addressed by taking a component-based approach to detecting objects. A component-based object detection system is one that searches for an object by looking for its identifying components rather than the whole object. An example of such a system is a face detection system that finds a face when it locates a pair of eyes, a nose, and a mouth in the proper configuration.

Component-based approaches to object detection have been described in the past but their application to the problem of locating people in images is fairly limited. For component-based face detection systems see [20], [11], and [26]. Systems in [11] and [26] have the ability to explicitly deal with partial occlusions. These systems have two common features: They all have component detectors that identify candidate components in an image and they all have a means to integrate these components and determine if together they define a face.

In [4] and [5], the authors describe a system that uses color, texture, and geometry to localize horses and naked people in images. The system can be used to retrieve images satisfying certain criteria from image databases but is mainly targeted towards images containing one object. Methods of learning these "body plans" from examples are described in [4].

It is worth mentioning that a component-based object detection system for people is harder to realize than one for faces because the geometry of the human body is less constrained than that of the human face. This means that not only is there greater intraclass variation concerning the configuration of body parts, but also that it is more difficult to detect body parts in the first place since their appearance can change significantly when a person moves.

1.1.2 Classifier Combination Algorithms
Recently, a great deal of interest has been shown in hierarchical classification structures, i.e., data classification devices that are a combination of several other classifiers. In particular, two methods have received considerable attention: bagging and boosting. Both of these algorithms have been shown to increase the performance of certain classifiers for a variety of data sets [2], [6], [17], [1]. Despite the well-documented practical success of these algorithms, the reasons why they work so well are still open to debate.

1.2 Component-Based People Detection: Our Approach
The approach we take to detecting people in static images borrows ideas from the fields of object detection in images and data classification. In particular, the system detects the components of a person's body in an image, i.e., the head, the left and right arms, and the legs, instead of the full body. The
MOHAN ET AL.: EXAMPLE-BASED OBJECT DETECTION IN IMAGES BY COMPONENTS 351
system then checks to ensure that the detected components are in the proper geometric configuration and then combines them using a classifier. This approach of integrating components using a classifier promises to increase accuracy based on the results of previous work in the field.

We introduce a new hierarchical classification architecture where example-based learning is conducted at multiple levels, called an Adaptive Combination of Classifiers (ACC). Specifically, it is composed of distinct example-based component classifiers trained to detect different object parts, i.e., heads, legs, and left and right arms, at one level and a similar example-based combination classifier at the next. The combination classifier takes the output of the component classifiers as its input and classifies the entire pattern under examination as either a "person" or a "nonperson." It bears repeating that since the classifiers are example-based, this system can easily be modified to detect objects other than people.

A component-based approach to detecting people is appealing and has the following advantages over existing techniques:

• It allows for the use of the geometric information concerning the human body to supplement the visual information present in an image and thereby improve the overall performance of the system. More specifically, the visual data in an image is used to detect body components and knowledge of the structure of the human body allows us to determine if the detected components are proportioned correctly and arranged in a permissible configuration. In contrast, a full-body person detector relies solely on visual information and does not take full advantage of the known geometric properties of the human body. In particular, it employs an implicit and fixed representation of the human form and does not explicitly allow for variations in limb positions [16], [15], [14].
• Sometimes it is difficult to detect the human body pattern as a whole due to variations in lighting and orientation. The effect of uneven illumination and varying viewpoint on body components (like the head, arms, and legs) is less pronounced and, hence, they are comparatively easier to identify.
• The component-based framework directly addresses the issue of detecting people that are partially occluded or whose body parts have little contrast with the background. This is accomplished by designing the system, using an appropriate classifier combination algorithm, so that it detects people even if not all of their components are detected.
• The structure of the component-based solution allows for the convenient use of hierarchical classification machines to classify patterns; such machines have been shown to perform better than similar single-layer devices for certain data classification tasks [2], [6], [17], [1].

The rest of the paper is organized as follows: Section 2 describes the system in detail. Section 3 reports on the performance of our system. In Section 4, we present conclusions along with suggestions for future research in this area.

2 SYSTEM DETAILS
2.1 Overview of System Architecture
This section explains the overall architecture and operation of the system by tracing the detection process when the system is applied to an image; Fig. 2 is a graphical representation of this procedure.

The system starts detecting people in images by selecting a 128 × 64 pixel window from the top left corner of the image as an input. This input is then classified as either a "person" or a "nonperson," a process which begins by determining where and at which scales the components of a person, i.e., the head, legs, left arm, and right arm, may be found within the window. All of these candidate regions are processed by the respective component detectors to find the strongest candidate components.

The component detectors process the candidate regions by applying the Haar wavelet transform to them and then classifying the resultant data vector. The component classifiers are quadratic Support Vector Machines (SVM) which are trained prior to use in the detection process (see Section 2.2). The strongest candidate component is the one that produces the highest positive raw output, referred to in this paper as the component score, when classified by the component classifiers. If the highest component score for a particular component is negative, i.e., the component detector in question did not find a component in the geometrically permissible area, then a component score of zero is used instead. The raw output of an SVM is a rough measure of how well a classified data point fits in with its designated class and is defined in Section 2.2.1. The highest component score for each component is fed into the combination classifier, which is a linear SVM. The combination classifier processes the scores to determine if the pattern is a person.

This process of classifying patterns is repeated at all locations in an image by shifting the 128 × 64 pixel window across and down the image. The image itself is processed at several sizes, ranging from 0.2 to 1.5 times its original size. This allows the system to detect various sizes of people at any location in an image.

2.2 Details of System Architecture
2.2.1 First Stage: Identifying Components of People in an Image
When a 128 × 64 pixel window is evaluated by the system, the individual component detectors are applied only to specific areas of the window and only at particular scales, since the relative proportions must match and the approximate configuration of body parts is known a priori. This is necessary because even though a component detection is the strongest in a particular window under examination (it has the highest component score), it does not imply that it is in the correct position, as illustrated in Fig. 3. The centroid and boundary of the allowable rectangular area for a component detection (relative to the upper left-hand corner of the 128 × 64 pattern) determine the location of the component, and the width of the rectangle is a measure of a component's scale.

We calculated the geometric constraints for each component from a sample of the training images, tabulated in
352 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 23, NO. 4, APRIL 2001
Table 1 and shown in Fig. 4, by taking the means of the centroid and top and bottom boundary edges of each component over positive detections in the training set. The tolerances were set to include all positive detections in the training set. Permissible scales were also estimated from the training images. There are two sets of constraints for the arms, one intended for extended arms and the other for bent arms.

Wavelet functions are used to represent the components in the images. Wavelets are a type of multiresolution function approximation that allows for the hierarchical decomposition of a signal [12]. When applied at different scales, wavelets encode information about an image from the coarse approximation all the way down to the fine details. The Haar basis is the simplest wavelet basis and provides a mathematically sound extension to an image
invariance scheme [21]. Haar wavelets of two different scales (16 × 16 pixels and 8 × 8 pixels) are used to generate a multiscale representation of the images. The wavelets are applied to the image such that they overlap 75 percent with the neighboring wavelets in the vertical and horizontal directions; this is done to increase the spatial resolution of our system and to yield a richer representation. At each scale, three different orientations of Haar wavelets are used, each of which responds to differences in intensities across different axes. In this manner, information about how intensity varies in each color channel (red, green, and blue) in the horizontal, vertical, and diagonal directions is obtained. The information streams from the three color channels are combined and collapsed into one by taking the wavelet coefficient for the color channel that exhibits the greatest variation in intensity at each location and for each orientation. At these scales of wavelets there are 582 features for the 32 × 32 pixel window for the head and shoulders and 954 features for the 48 × 32 pixel windows representing the lower body and the left and right arms. This method results in a thorough and compact representation of the components.

In (2), $K$ is one of many possible kernel functions, $y_i \in \{-1, 1\}$ is the class label of the data point $x_i$, and $\{x_i\}_{i=1}^{l}$ is a subset of the training data set. The $x_i$ are called support vectors and are the points from the data set that define the separating hyperplane. Finally, the coefficients $\alpha_i$ and $b$ are determined by solving a large-scale quadratic programming problem. One of the appealing characteristics of SVMs is that there are just two tunable parameters, Cpos and Cneg, which are penalty terms for positive and negative pattern misclassifications, respectively. The kernel function $K$ that is used in the component classifiers is a quadratic polynomial and is $K(x, x_i) = (x \cdot x_i + 1)^2$.

In (1), $f(x) \in \{-1, 1\}$ is referred to as the binary class of the data point $x$ which is being classified by the SVM. As (1) shows, the binary class of a data point is the sign of the raw output $g(x)$ of the SVM classifier. The raw output of an SVM classifier is the distance of a data point from the decision hyperplane. In general, the greater the magnitude of the raw output, the more likely a classified data point belongs to the binary class it is grouped into by the SVM classifier.
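The displayed forms of (1) and (2) do not survive in this copy of the text. The symbol definitions above match the standard SVM decision function, so the following is offered as a reconstruction under that assumption, not as the authors' exact typesetting:

```latex
\begin{align}
f(x) &= \operatorname{sgn}\bigl(g(x)\bigr) \tag{1} \\
g(x) &= \sum_{i=1}^{l} y_i \,\alpha_i \, K(x, x_i) + b \tag{2}
\end{align}
% Component classifiers (quadratic kernel, as stated in the text):
%   K(x, x_i) = (x \cdot x_i + 1)^2
```

Read this way, the component score used throughout the paper is the value $g(x)$ before the sign is taken.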
TABLE 1
Geometric Constraints Placed on Each Component
All coordinates are relative to the upper left-hand corner of a 128 × 64 rectangle.
Fig. 4. Geometric constraints that are placed on the different components. All coordinates are relative to the upper left-hand corner of a
128 × 64 rectangle. (a) Illustrates the geometric constraints on the head, (b) the lower body, (c) an extended right arm, and (d) a bent right arm.
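The geometric check described in Section 2.2.1 can be sketched as a simple test on a candidate rectangle. This is a minimal illustration only: the `Constraint` fields, the `is_permissible` helper, and the numeric values for `HEAD` are hypothetical placeholders and are not the values from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Permissible area for one component, in pixels, relative to the
    upper left-hand corner of the 128 x 64 window (hypothetical layout)."""
    centroid_row: float   # mean centroid row over training detections
    centroid_col: float   # mean centroid column over training detections
    centroid_tol: float   # allowed deviation of the centroid, either axis
    min_width: int        # smallest permissible scale (rectangle width)
    max_width: int        # largest permissible scale (rectangle width)


def is_permissible(top, left, bottom, right, c):
    """Return True if the candidate rectangle satisfies constraint c."""
    width = right - left
    row = (top + bottom) / 2
    col = (left + right) / 2
    return (c.min_width <= width <= c.max_width
            and abs(row - c.centroid_row) <= c.centroid_tol
            and abs(col - c.centroid_col) <= c.centroid_tol)


# Placeholder numbers for illustration; NOT taken from Table 1.
HEAD = Constraint(centroid_row=23, centroid_col=32, centroid_tol=5,
                  min_width=28, max_width=42)

print(is_permissible(8, 16, 40, 48, HEAD))    # candidate near the top of the window
print(is_permissible(80, 16, 112, 48, HEAD))  # same size, but far too low
```

Only candidates that pass such a test would be handed to the corresponding component classifier; the arms would need two such constraint sets, one for extended and one for bent arms.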
The component classifiers are trained on positive images and negative images for their respective classes. The positive examples are of arms, legs, and heads of people in various environments, both indoors and outdoors and under various lighting conditions. The negative examples are taken from scenes that do not contain any people. Examples of positive images used to train the component classifiers are shown in Fig. 5.

2.2.2 Second Stage: Combining the Component Classifiers
Once the component detectors have been applied to all geometrically permissible areas within the 128 × 64 pixel window, the highest component score for each component type is entered into a data vector that serves as the input to the combination classifier. The component score is the raw output of the component classifier and is the distance of the test point from the decision hyperplane, a rough measure of how "well" a test point fits into its designated class. If the component detector does not find a component in the designated area of the 128 × 64 pixel window, then zero is placed in the data vector.

The combination classifier is a linear SVM classifier. The kernel $K$ that is used in the SVM classifier and shown in (2) has the form $K(x, x_i) = (x \cdot x_i + 1)$. This type of hierarchical classification architecture where learning occurs at multiple stages is termed an Adaptive Combination of Classifiers (ACC). Positive examples were generated by processing 128 × 64 pixel images of people at one scale and taking the highest component score (from detections that are geometrically allowed) for each component type.
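The second stage can be sketched as follows, assuming the raw outputs of the component classifiers are already available for every geometrically permissible candidate. The function names, the weights, and the bias are hypothetical. With a linear kernel the decision function reduces to a weighted sum of the four scores plus a constant, which is the form used here; the numbers are not trained values.

```python
COMPONENTS = ("head", "legs", "left_arm", "right_arm")


def component_score(raw_outputs):
    """Highest raw output among the permissible candidates.

    Zero is used when there are no candidates or when the best raw
    output is negative, as described in the text.
    """
    return max(max(raw_outputs, default=0.0), 0.0)


def score_vector(candidates):
    """Build the four-element input of the combination classifier."""
    return [component_score(candidates.get(name, [])) for name in COMPONENTS]


def combine(scores, weights, bias):
    """Linear combination classifier: the sign of the raw output decides."""
    raw = sum(w * s for w, s in zip(weights, scores)) + bias
    return ("person" if raw > 0 else "nonperson"), raw


# Raw outputs per component for one window (made-up numbers).
candidates = {
    "head": [0.8, -0.2],
    "legs": [-0.5],        # best candidate is negative, so the score is zero
    "left_arm": [],        # nothing found in the permissible area
    "right_arm": [1.2, 0.4],
}
scores = score_vector(candidates)
label, raw = combine(scores, weights=[1.0, 1.0, 1.0, 1.0], bias=-1.5)
print(scores, label, raw)
```

In this made-up example the window is still labeled a person even though two components contribute a score of zero, which is the behavior the combination stage is meant to permit.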
Fig. 5. The top row shows examples of "heads and shoulders" and "lower bodies" of people that were used to train the respective component
detectors. Similarly, the bottom row shows examples of "left arms" and "right arms" that were used for training purposes.
Fig. 6. ROC curves illustrating the ability of the component detectors to correctly identify a person in an image. The positive detection rate is plotted
as a percentage against the false alarm rate, which is measured on a logarithmic scale. The false alarm rate is the number of false positive detections
per window inspected.
systems. For each pattern that the system classifies, the system must evaluate the logic presented below:

$$\text{Person} = \text{Head} \;\&\; \text{Legs} \;\&\; \text{Left arm} \;\&\; \text{Right arm}, \qquad (3)$$

where a state of true indicates that a pattern belonging to the class in question has been detected.

The detection threshold of the VCC-based system is determined by selecting appropriate thresholds for the component detectors. The thresholds for the component detectors are chosen such that they all correspond to approximately the same positive detection rate, estimated from the ROC curves of each of the component detectors shown in Fig. 6. These ROC curves were calculated in a manner similar to the procedure described earlier in Section 3.1.1. A point of interest is that these ROC curves indicate how discriminating the individual components of a person are in detecting the full body. The legs perform the best, followed by the arms and the head. The superior performance of the legs may be due to the fact that the background of the lower body in images is usually either the street, pavement, or grass and, hence, is relatively clutter free compared to the background of the head and arms.

3.1.3 Baseline System
The system that is used as the "baseline" for this comparison is a full-body person detector. Details of this system, which was created by Papageorgiou et al., are presented in [16], [14], and [15]. It has the same architecture as the individual component detectors used in our system, described in Section 2.2.1, but is trained to detect full-body patterns and not separate components. The quadratic SVM classifier was trained on 869 positive and 9,225 negative examples.

3.2 Experimental Results
We compare the ACC-based system, the VCC-based system, and the full-body detection system. The ROC curves of the person detection systems are shown in Fig. 7 and explicitly capture the tradeoff between accuracy and false detections that is inherent to every detector. An analysis of the ROC curves suggests that a component-based person detection system performs very well and significantly better than the baseline system at all thresholds. It should be emphasized that the baseline system uses the same image representation scheme (Haar wavelets) and classifier (SVM) as the component detectors used in the component-based systems. Thus, the improvement in performance is due to the component-based approach and the algorithm used for combining the component classifiers.

For the component-based systems, the ACC approach produces better results than VCC. In particular, the ACC-based system that uses a linear SVM to combine the component classifiers is the most accurate. This is related to the fact that higher degree polynomial classifiers require more training examples in proportion with the higher dimensionality of the feature space to perform at the same level as the linear SVM. During the course of the experiment, the linear SVM-based system displayed a superior ability to detect people even when one of the components was not detected, in comparison to the higher degree polynomial SVM-based systems. A possible explanation for this observation may be that the higher degree polynomial classifiers place a stronger emphasis on the presence of combinations of components, due to the structure of their kernels. The second, third, and fourth degree polynomial kernels include terms that are products of up to two, three, and four elements (which are component scores).
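The difference between the voting rule in (3) and a linear combination of component scores can be illustrated with a small sketch. The thresholds, weights, and bias below are made-up values chosen only to show the behavior when one component is missing; they are not the trained classifiers from the experiments.

```python
def vcc_is_person(scores, thresholds):
    """Rule (3): every component detector must exceed its threshold."""
    return all(s > t for s, t in zip(scores, thresholds))


def linear_acc_is_person(scores, weights, bias):
    """Linear combination: a weighted sum of the component scores decides."""
    return sum(w * s for w, s in zip(weights, scores)) + bias > 0


# Order: head, legs, left arm, right arm. The legs were not detected,
# so their component score is zero.
scores = [0.9, 0.0, 1.1, 0.7]

vcc = vcc_is_person(scores, thresholds=[0.0, 0.0, 0.0, 0.0])
acc = linear_acc_is_person(scores, weights=[1.0, 1.0, 1.0, 1.0], bias=-2.0)
print(vcc, acc)
```

Under these assumed numbers the voting rule rejects the pattern because a single component is absent, while the linear combination accepts it because the remaining three scores are strong. A higher degree polynomial kernel adds product terms between scores, and any product involving the zero score vanishes, which is consistent with the explanation offered above.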
Fig. 7. ROC curves comparing the performance of various component-based people detection systems using different methods of combining the
classifiers that detect the individual components of a person's body. The positive detection rate is plotted as a percentage against the false alarm
rate, which is measured on a logarithmic scale. The false alarm rate is the number of false positive detections per window inspected. The curves
indicate that the system in which a linear SVM combines the results of the component classifiers performs best. The baseline system is a full-body
person detector similar to the component detectors used in the component-based system.
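The two quantities plotted in these ROC curves can be computed from raw classifier outputs by sweeping a detection threshold. The sketch below assumes two lists of raw outputs, one for windows that contain a person and one for windows that do not; the function name and the sample numbers are illustrative and are not the experimental data.

```python
def roc_points(positive_scores, negative_scores, thresholds):
    """For each threshold, return (threshold, detection rate in percent,
    false alarms per window inspected)."""
    points = []
    for t in thresholds:
        detected = sum(1 for s in positive_scores if s > t)
        false_alarms = sum(1 for s in negative_scores if s > t)
        points.append((t,
                       100.0 * detected / len(positive_scores),
                       false_alarms / len(negative_scores)))
    return points


# Made-up raw outputs of a detector on person and nonperson windows.
positives = [1.2, 0.4, -0.1, 0.9]
negatives = [-1.0, -0.3, 0.2, -2.0, -0.7]

for t, rate, far in roc_points(positives, negatives, thresholds=[0.0, 0.5]):
    print(f"threshold={t:.1f}  detection rate={rate:.1f}%  false alarms/window={far:.3f}")
```

Raising the threshold lowers both the detection rate and the false alarm rate, which is the tradeoff the curves make explicit; plotting the false alarm rate on a logarithmic axis requires handling thresholds at which it reaches zero.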
It is also worth mentioning that the database of test images that were used to generate the ROC curves did not just include frontal views of people, but also contained a variety of challenging images. Included are pictures of people walking and running, occluded people, people with portions of their bodies that have little contrast with the background, and people slightly rotated in depth. Fig. 8 is a selection of these images.

Fig. 9 shows the results obtained when the system was applied to images of people who are partially occluded or whose body parts blend in with the background. In these examples, the system detects the person while running at a threshold that, according to the ROC curve shown in Fig. 7, corresponds to a false detection rate of less than one false alarm for every 796,904 patterns inspected. Fig. 10 shows
Fig. 8. Samples from the test image database. These images demonstrate the capability of the system. It can detect running people, people who are
slightly rotated, people whose body parts blend into the background (bottom row, second from right: the person is detected even though the legs are
not), and people under varying lighting conditions (top row, second from left: one side of the face is light and the other dark).
Fig. 9. Results of the system's application to images of partially occluded people and people whose body parts have little contrast with the
background. In the first image, the person's legs are not visible; in the second image, her hair blends in with the curtain in the background; and in the
last image, her right arm is hidden behind the column.
the result of applying the system to sample images with clutter in the background.

3.3 Extension of the System
In the component-based object detection system presented in this paper, the constraints that are placed on the size and relative location of the components of an object are determined manually. As explained in Section 2.2.1, the constraints were calculated from the training examples. While this method produced excellent results, it is possible that it may suffer from a bias introduced by the designer. Therefore, it is desirable for the system to learn the geometric constraints to be placed on the components of an object from examples. This would make it easier to apply this system to other objects of interest. Also, such an object detection system would be an initial step toward a more sophisticated component-based object detection system in which the components of an object are not predefined.

We created a component-based object detection system that learns the relative location and size of an object's components from examples in order to explore the viability and performance of such a system. In the new system, the geometrically permissible areas are learned by SVM classifiers from training examples. Thus, instead of checking the candidate coordinates of a window against the constraints listed in Table 1, the coordinates are fed into an SVM classifier. The output of each geometric classifier determines whether the window is permissible for the particular component. The coordinates that are fed into the geometric classifiers are the location of the top left corner and bottom right corner of the window, relative to the top left corner of the 128 × 64 pixel window, i.e., four-dimensional feature vectors.

The kernel function $K$ in (2) that is used in the geometric classifiers is a fourth degree polynomial and has the form $K(x, x_i) = (x \cdot x_i + 1)^4$. We trained the geometric classifiers for each component on 855 positive and 9,000 negative examples, from the same databases of images used to train the component classifiers.

This new system was tested on the same database as the system presented earlier. Fig. 11 compares the ROC curves for the two systems. While the performance of the two systems is very similar, the system that learns the geometry of an object performs better at higher thresholds. An added advantage of the system that learns the relative location and size of the components of an object is that one can change the size of the geometrically permissible area by varying the penalty parameters, Cpos and Cneg, for the misclassification of positive and negative examples during training [25], [3]. This results in different geometric classifiers and, hence, different geometrically permissible areas. ROC curves corresponding to different penalty terms are shown in Fig. 11.

4 CONCLUSIONS AND FUTURE WORK
In this paper, we have presented a component-based person detection system for static images that is able to detect frontal, rear, slightly rotated (in depth), and partially occluded people in cluttered scenes without assuming any a priori knowledge concerning the image. The framework described here is applicable to other domains besides people, including faces and cars.

A component-based approach handles variations in lighting and noise in an image better than a full-body person detector and is able to detect partially occluded people and people who are rotated in depth, without any additional modifications to the system. A component-based detector looks for the constituent components of a person and if one of these components is not detected, due to an occlusion or because the person is rotated into the plane of the image, the system can still detect the person if the component detections are combined using an appropriate hierarchical classifier.
Fig. 10. Results from the component-based person detection system. The solid boxes outline the complete person and the dashed rectangles
identify the individual components. People may be missed by the system because they are either too large or too small to be processed by the
system (top right: person on the right), because several parts of their body may have very little contrast with the background (bottom left: person on
the left), or because several parts of their body may be occluded (bottom right: person second from the left).
The hierarchical classifier that is implemented in this system uses four distinct component detectors at the first level that are trained to find, independently, components of the "person" object, i.e., heads, legs, and left and right arms. These detectors use Haar wavelets to represent the images and Support Vector Machines (SVM) to classify the patterns. The four component detectors are combined at the next level by another SVM. We call this type of hierarchical classification architecture, in which learning occurs at more than one level, an Adaptive Combination of Classifiers (ACC). It is worth mentioning that one may use classification devices other than SVMs in this system; a comparative study in this area to determine the performance of such implementations would be of interest.

The system is very accurate and performs significantly better than a full-body person detector designed along similar lines. This suggests that the improvement in performance is due to the component-based approach and the ACC classification architecture we employed. (Further work in this area to quantitatively determine how much of the improvement can be attributed to the component-based approach and how much is due to the ACC classification architecture would be useful.) The superior performance of the component-based approach can be attributed to the fact that it operates with more information about the object class than the full-body person detection method. Specifically, while both systems are trained on positive examples of the human body (or human body parts in the case of the component-based system), the component-based algorithm incorporates explicit knowledge about the geometric properties of the human body and explicitly allows for variations in the human form.

This paper presents a valuable first step but there are several directions in which this work could be extended. It
Fig. 11. ROC curves comparing the performance of various methods of defining and placing geometric constraints on components of objects. The
core component-based object detection algorithm is the same for all the systems tested here: a linear SVM-based ACC system. The positive
detection rate is plotted as a percentage against the false alarm rate, which is measured on a logarithmic scale. The false alarm rate is the number of
false positive detections per window inspected. The curves indicate that the systems that learn the geometric constraints perform slightly better
than the one that uses manually determined values. The graph also shows that changing the penalty parameters for misclassifications of the
geometry (Cpos and Cneg) alters the overall system's performance.
would be useful to test the system described here in other domains, such as cars and faces. Since the component-based systems described in this paper were implemented as prototypes, we could not gauge the speeds of the various algorithms accurately. It would be interesting to learn how the different algorithms compare with each other in terms of speed. It would also be interesting to study how the performance of the system depends on the choice of the SVM kernels and the number of training examples. While this paper establishes that this system can detect people who are slightly rotated in depth, it does not determine, quantitatively, the extent of this capability; further work in this direction would be of interest. Along similar lines, it would be useful to investigate if the approach described in this paper could be extended to detect objects from an arbitrary viewpoint. In order to accomplish this, the system would have to have a richer understanding of the geometric properties of an object, that is to say, it would have to be capable of learning how the various components of an object change in appearance with a change in viewpoint and also how the change in viewpoint affects the geometric configuration of the components.

ACKNOWLEDGMENTS
The research described in this paper was conducted within
and the US National Science Foundation under contract No. IIS-9800032 and contract No. DMS-9872936. Additional support is provided by: AT&T, Central Research Institute of Electric Power Industry, Eastman Kodak Company, Daimler-Chrysler, Digital Equipment Corporation, Honda R&D Co., Ltd., NEC Fund, Nippon Telegraph & Telephone, and Siemens Corporate Research, Inc.

REFERENCES
[1] E. Bauer and R. Kohavi, "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants," Machine Learning, 1998.
[2] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[3] C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Proc. Data Mining and Knowledge Discovery, U. Fayyad, ed., pp. 1-43, 1998.
[4] D. Forsyth and M. Fleck, "Body Plans," Computer Vision and Pattern Recognition, pp. 678-683, 1997.
[5] D. Forsyth and M. Fleck, "Finding Naked People," Int'l J. Computer Vision, 1998. (pending publication.)
[6] Y. Freund and R. Schapire, "Experiments with a New Boosting Algorithm," Machine Learning: Proc. 13th Nat'l Conf., 1996.
[7] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People," Face and Gesture Recognition, pp. 222-227, 1998.
[8] D. Hogg, "Model-Based Vision: A Program to See a Walking Person," Image and Vision Computing, vol. 1, no. 1, pp. 5-20, 1983.
[9] T. Joachims, "Text Categorization with Support Vector Machines:
Learning with Many Relevant Features,º Proc. 10th European Conf.
the Center for Biological and Computational Learning in the Machine Learning (ECML), 1998.
Department of Brain and Cognitive Sciences and in the [10] M.K. Leung and Y.-H. Yang, ªA Region Based Approach for
Artificial Intelligence Laboratory at the Massachusetts Human Body Analysis,º Pattern Recognition, vol. 20, no. 3, pp. 321-
Institute of Technology. The research is sponsored by 39, 1987.
[11] T. Leung, M. Burl, and P. Perona, ªFinding Faces in Cluttered
grants from the US Office of Naval Research under contract Scenes Using Random Labeled Graph Matching,º Proc. Fifth Int'l
no. N00014-93-1-3085 and contract no. N00014-95-1-0600 Conf. Computer Vision, pp. 637-644, June 1995.
[12] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[13] H. Murase and S. Nayar, "Visual Learning and Recognition of 3D Objects from Appearance," Int'l J. Computer Vision, vol. 14, no. 1, pp. 5-24, 1995.
[14] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian Detection Using Wavelet Templates," Proc. Computer Vision and Pattern Recognition, pp. 193-199, June 1997.
[15] C. Papageorgiou, M. Oren, and T. Poggio, "A General Framework for Object Detection," Proc. Int'l Conf. Computer Vision, Jan. 1998.
[16] C. Papageorgiou and T. Poggio, "A Trainable System for Object Detection," Int'l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[17] J. Quinlan, "Bagging, Boosting, and C4.5," Proc. 13th Nat'l Conf. Artificial Intelligence, 1996.
[18] H. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23-38, Jan. 1998.
[19] H. Rowley, S. Baluja, and T. Kanade, "Rotation Invariant Neural Network-Based Face Detection," Proc. Computer Vision and Pattern Recognition, pp. 38-44, June 1998.
[20] L. Shams and J. Spoelstra, "Learning Gabor-Based Features for Face Detection," Proc. World Congress in Neural Networks, Int'l Neural Network Soc., pp. 15-20, Sept. 1996.
[21] P. Sinha, "Object Recognition via Image Invariants: A Case Study," Investigative Ophthalmology and Visual Science, vol. 35, pp. 1735-1740, May 1994.
[22] K.-K. Sung and T. Poggio, "Example-Based Learning for View-Based Human Face Detection," Proc. Image Understanding Workshop, Nov. 1994.
[23] K.-K. Sung and T. Poggio, "Example-Based Learning for View-Based Human Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, Jan. 1998.
[24] R. Vaillant, C. Monrocq, and Y. Le Cun, "Original Approach for the Localisation of Objects in Images," IEE Proc. Vision Image Signal Processing, vol. 141, no. 4, pp. 245-50, Aug. 1994.
[25] V. Vapnik, The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[26] K. Yow and R. Cipolla, "Feature-Based Human Face Detection," Image and Vision Computing, vol. 15, no. 9, pp. 713-35, Sept. 1997.
[27] A. Yuille, "Deformable Templates for Face Recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 59-70, 1991.
Anuj Mohan received the SB degree in electrical engineering and the MEng degree in electrical engineering and computer science from Massachusetts Institute of Technology in 1998 and 1999, respectively. His research interests center around engineering applications of machine learning and data classification algorithms. He is currently a software engineer at Kana Communications where he is working on applications of text classification and natural language processing algorithms.
Constantine Papageorgiou received the BS degree in mathematics/computer science from Carnegie Mellon University in 1992. After receiving his degree, he worked in the Speech and Language Processing Department at BBN until starting graduate school in 1995. He received the doctorate in electrical engineering and computer science from Massachusetts Institute of Technology in December 1999. His research focused on developing trainable systems for object detection. He has also done research in image compression, reconstruction, and superresolution, and financial time series analysis. Currently, he is working as a research scientist at Kana Communications where his focus is on natural language understanding and text classification.
Tomaso Poggio received the doctorate degree in theoretical physics from the University of Genoa in 1970. From 1971 to 1981, he held a tenured research position at the Max Planck Institute, after which he became a professor at Massachusetts Institute of Technology (MIT). Currently, he is the Uncas and Helen Whitaker Professor in the Department of Brain and Cognitive Sciences at MIT and a member of the Artificial Intelligence Laboratory. He is doing research in computational learning and vision at the MIT Center for
Biological and Computational Learning, of which he is co-director. He
has authored more than 200 papers in areas ranging from psychophysics
and biophysics to information processing in man and machine,
artificial intelligence, machine vision, and learning. His main research
activity at present is learning from the perspective of statistical learning
theory, engineering applications, and neuroscience. He has received a
number of distinguished international awards in the scientific community,
is on the editorial board of a number of interdisciplinary journals, a fellow
of the American Association for Artificial Intelligence as well as the
American Academy of Arts and Sciences, and an Honorary Associate of
the Neuroscience Research Program at Rockefeller University.
Dr. Poggio is a member of the IEEE.