[Figure 5 and Figure 6 graphics: normalized AUC plotted against the number of training images on a log scale (20 to 480), with the saturated, smooth, and failure regimes marked (legend: cv=025 sts=06, +/- 95% sig, +/- std dev), alongside the Figure 6 image montages.]
Figure 5. Normalized AUC performance of the detector plotted against training set size on a log scale with the three regimes of operation labeled. The inner error bars indicate the 95% significance interval, and the outer error bars indicate the standard deviation of the mean.

Figure 6. Comparison of the training images selected at each iteration for the confidence and the MSE selection metrics. The initial training set of 40 images is the same for both metrics and is 1/12 of the full training set size.
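Throughout these plots, detector performance is the area under the ROC curve (AUC) normalized by that of the base detector trained on all of the labeled data (the "Full Data Normalized Area Under the ROC Curve" of Figure 7). In our notation (the symbols below are not the paper's), this is simply

\[ \mathrm{AUC}_{\mathrm{norm}} = \frac{\mathrm{AUC}(\text{detector under evaluation})}{\mathrm{AUC}(\text{base detector trained on the full labeled set})}, \]

so a value of 1 means the semi-supervised detector matches the fully supervised baseline.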
help. In order to establish a baseline for the typical number of examples needed to train the detector, we ran the detector with different training set sizes and recorded the AUC performance (Figure 5). Our interpretation of this data is that there are three regimes in which the training process operates. We call the first the "saturated" regime, which in this case appears to be from approximately 160 to 480 training examples. In this regime, 160 examples are sufficient for the detector to learn the requisite parameters; more data does not result in better performance. Similarly, variation in performance is relatively constant and small in this range. We call the second regime the "smooth" regime, which appears in this case to be between 35 and 160 training examples. In this regime, performance decreases and variation increases relatively smoothly as training set size decreases. In the third regime, the "failure" regime, there is both a precipitous drop in performance and a very large increase in performance variation. This third regime occurs when the training algorithm does not have sufficient data to estimate some set of parameters. An extreme case of this would be when the parameter estimation problem is ill conditioned. Based on this set of experiments, we chose the size of the labeled training set to be in the smooth regime for the experiments with weakly-labeled data.

The comparison between the two selection metrics is summarized in Figure 7. In these plots, the horizontal axis indicates the frequency at which the training data is sampled in order to select the initial labeled training set for each run ("8" means that 1/8th of the full training data was used as initial labeled training data, while the rest was used as unlabeled data). The plots show that performance is improved by the addition of weakly labeled data over the range of data set sizes. However, the improvements are not significant at the 95% level for the confidence metric. For the MSE metric, however, the improvement in performance is significant for all the data set sizes. This observation is supported by other experimental variations in which the MSE metric consistently outperforms the confidence metric. Figure 6 shows montages of the weakly labeled training images selected at each iteration using the confidence metric and the MSE metric for a single run. The performance of the detector trained with the MSE metric improves with each iteration, whereas the performance of the confidence-based one decreases. For the confidence metric, there are clearly incorrect detections included in the training set past the first iteration. In contrast, all of the images that the MSE metric selects are valid except for one outlier at iteration 4.
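To make the selection step concrete, the sketch below runs one complete pass of the incremental self-training loop with either selection metric. Everything in it is a stand-in of our own (train_detector, detector_confidence, and mse_score are hypothetical helpers, not the detector or the exact scores used here; those are defined in [16]); the MSE metric is written in the nearest-neighbor reading discussed later in the paper.

```python
import numpy as np

def train_detector(features, labels):
    """Hypothetical stand-in for training the appearance-based detector:
    it simply stores per-class feature means so that the sketch runs."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predicted_label(model, x):
    """Class whose stored mean is closest to x."""
    return min(model, key=lambda c: np.linalg.norm(x - model[c]))

def detector_confidence(model, x):
    """Stand-in confidence score: higher when x is close to some class mean."""
    return -min(np.linalg.norm(x - m) for m in model.values())

def mse_score(x, labeled_features):
    """Sketch of the MSE selection metric in its nearest-neighbor reading:
    the (negated) minimum mean-squared distance to the current labeled set."""
    return -np.min(((labeled_features - x) ** 2).mean(axis=1))

def self_train(X_l, y_l, X_u, metric="mse", per_iter=5, iterations=5):
    """Incremental self-training: score the unlabeled pool, move the
    top-scoring candidates (with their predicted labels) into the labeled
    set, retrain, and repeat."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(iterations):
        model = train_detector(X_l, y_l)
        if metric == "confidence":
            scores = np.array([detector_confidence(model, x) for x in X_u])
        else:  # the MSE selection metric
            scores = np.array([mse_score(x, X_l) for x in X_u])
        pick = np.argsort(scores)[-per_iter:]      # best-scoring candidates
        new_y = np.array([predicted_label(model, x) for x in X_u[pick]])
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, new_y])
        X_u = np.delete(X_u, pick, axis=0)
    return train_detector(X_l, y_l)
```

The two variants differ only in the line that computes the scores, which is precisely where the confidence and MSE metrics diverge in the experiments above.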
[Figure 7 plots (a) and (b): Full Data Normalized Area Under the ROC Curve versus Training Set Sampling Rate (22 down to 6); legend: cv025 sts06 - fully labeled data only (+/- 95% sig), cv025 sts06 - best weakly labeled performance (+/- 95% sig).]

Figure 7. Normalized performance of the detector, incorporating weakly labeled data by using the confidence metric (a) or the MSE metric (b), as the fully labeled training set size varies. The bottom plot line is the performance with labeled data only and the top plot line is the performance with the addition of weakly labeled data. Error bars indicate the 95% significance interval of the mean value.

[Figure 8 plot: Weakly versus Fully Labeled Training Set Size with Confidence Score over Sampling Rate; ratio of weakly to fully labeled training set size (+/- 95% sig) versus Training Set Sampling Rate.]

Figure 8. Ratio of weakly labeled to fully labeled data as the fully labeled training set size increases.

Figure 8 plots the ratio of weakly labeled to fully labeled training data against the size of the initial training set (or, more precisely, the sampling rate that was used for generating the initial training set). This data shows that, as expected, the ratio increases as the size of the initial training set decreases, since more weakly labeled examples are needed to compensate for smaller training sets. More importantly, the total size of the training set (initial labeled training images plus examples added during training) is within the "saturated" operating regime identified in Figure 5. This is important because it shows that, even for small initial training sets, the total number of examples is on the same order as the number that would be needed to train the detector with a single set of labeled examples. In other words, using a small set of labeled examples does not cause us to pay a penalty in terms of a greater size of the total training set.
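Written out in our own notation (these symbols do not appear in the paper), the argument is that

\[ N_{\mathrm{total}} = N_{\ell} + N_{w} = N_{\ell}\,(1 + r), \qquad r = N_{w}/N_{\ell}, \]

where $N_{\ell}$ is the number of initially labeled training images, $N_{w}$ is the number of weakly labeled examples added during training, and $r$ is the ratio plotted in Figure 8. The smaller the initial labeled set, the larger $r$ must be for $N_{\mathrm{total}}$ to reach the saturated regime of Figure 5, which is exactly the trend the figure shows.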
3.6. Discussion

These experiments lead us to several observations that will be useful in developing future detection systems based on weakly-labeled training. First, the results show that it is possible to achieve detection performance that is close to the performance obtained with the full training set, even when a small fraction of the training data is used in the initial training set, taking into account the high degree of variability in performance across different choices of initial training sets (as illustrated by the error bars in the graphs presented, and by the fact that we normalize the detector performance with respect to the base detector trained with all the labeled data). Second, as a practical matter, the experiments show that the self-training approach to semi-supervised training can be applied to an existing detector that was originally designed for supervised training. In fact, in our case, we used a detector that was already highly optimized and we were able to integrate it in the training framework. This suggests a general procedure for using semi-supervised training with existing detectors.

Finally, a more fundamental observation is that the MSE selection metric consistently outperforms the confidence metric. Experiments with simulated data and with other, filter-based detectors (from [16], not reported here for reasons of space) show the same behavior, not only for the incremental self-training used here but also for batch EM approaches. These results bring to light an important aspect of the self-training process which is often overlooked. The issue is that, during the training process, the distribution of the labeled data at any particular iteration may not match the actual underlying distribution of the data. As a result, confidence metrics may perform poorly because the labeled data distribution created by this metric is quite different from the underlying distribution, even when all of the weakly labeled data selected by the metric is correctly labeled. To illustrate this observation, Figure 9 shows a simple simulated example in which the labeled and unlabeled examples are drawn from two Gaussian distributions in the plane. Comparing the labels obtained after five iterations by using the confidence metric (Figure 9(c)) and the Euclidean metric (Figure 9(d)), we see that the labeled points cluster around existing data points. We believe a closer examination of this issue, from both a theoretical and a practical standpoint, is an important and interesting topic for future research toward the effective application of semi-supervised approaches to object detection problems.

[Figure 9 plots (a)-(d): two-dimensional data (Feature 1 versus Feature 2) with Class 1/Class 2, unlabeled, and labeled points marked.]

Figure 9. (a) Original unlabeled data and labeled data; (b) plot of the true labels for the unlabeled data; (c), (d) the points labeled by the incremental self-training algorithm after 5 iterations using the confidence metric and the Euclidean metric, respectively.
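This is easy to probe with a toy version of the Figure 9 setup. The sketch below is ours and only loosely mirrors the paper's experiment: the classifier is a simple class-mean model, and the "confidence" and "Euclidean" scores are hand-rolled stand-ins for the metrics being compared, so it should be read as an illustration of how each selection rule shapes the labeled set over iterations rather than as a reproduction of Figure 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in the plane; means and spread are illustrative choices.
X1 = rng.normal(loc=[-4.0, 0.0], scale=2.0, size=(200, 2))
X2 = rng.normal(loc=[4.0, 0.0], scale=2.0, size=(200, 2))

# A handful of labeled seeds per class; the rest is treated as unlabeled.
X_lab = np.vstack([X1[:3], X2[:3]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unl = np.vstack([X1[3:], X2[3:]])

def class_means(Xl, yl):
    return np.array([Xl[yl == c].mean(axis=0) for c in (0, 1)])

def step(Xl, yl, Xu, metric, k=20):
    """One self-training iteration with a stand-in class-mean classifier."""
    means = class_means(Xl, yl)
    d = np.linalg.norm(Xu[:, None, :] - means[None, :, :], axis=2)   # (n, 2)
    pred = d.argmin(axis=1)
    if metric == "confidence":
        score = np.abs(d[:, 0] - d[:, 1])      # large margin = high confidence
    else:  # "euclidean": closeness to the nearest currently labeled point
        score = -np.min(np.linalg.norm(Xu[:, None, :] - Xl[None, :, :], axis=2), axis=1)
    pick = np.argsort(score)[-k:]              # move the k best-scoring points
    return (np.vstack([Xl, Xu[pick]]),
            np.concatenate([yl, pred[pick]]),
            np.delete(Xu, pick, axis=0))

for metric in ("confidence", "euclidean"):
    Xl, yl, Xu = X_lab.copy(), y_lab.copy(), X_unl.copy()
    for _ in range(5):                         # five iterations, as in Figure 9
        Xl, yl, Xu = step(Xl, yl, Xu, metric)
    print(metric, "labeled-set size:", len(Xl))
    # Plotting Xl coloured by yl for each metric gives pictures analogous
    # to panels (c) and (d) of Figure 9.
```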
4. Summary and Conclusions

The goal of this work was to explore and evaluate approaches to semi-supervised training using weakly labeled data for appearance-based object detection. We conducted extensive experiments with a state-of-the-art detector that led to several important conclusions, including a quantitative evaluation of the performance gained by adding weakly labeled data to an initial small set of labeled data; a demonstration of the feasibility of modifying an existing detector to use weakly labeled data; and insights into the choice of selection metric used for training.

Many important issues that are critical to practical applications of these training ideas remain to be explored. First, it might be important to use a different version of the detector for initial training and for actual use on test images. For example, we found that the position and scale accuracy of the detector are important for semi-supervised training, whereas they may be less important when the detector is used in an application. Second, one alternative explanation for the success of the nearest neighbor approach (based on the appropriate selection metric) is that it is performing a type of co-training [4], [13], [10]. It would be interesting to study the relation between the semi-supervised training approach evaluated here and the co-training approaches. As shown in the experiments, the choice of the initial training set has a large effect on performance. Although we have performed experiments that compare different selections of the initial training set, it would be useful to develop more precise guidelines for selecting it. Finally, the approach could be extended to training examples that are labeled in different ways. For example, some images may be provided with scale information and nothing else. Additional information may be provided, such as the rough shape of the object or a prior distribution over its location in the image.

References

[1] S. Baluja. Probabilistic modeling for face orientation discrimination: Learning from labeled and unlabeled data. NIPS, 1998.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. ICML, 2001.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT, 1998.
[5] A. Corduneanu and T. Jaakkola. On information regularization. UAI, 2003.
[6] F. Cozman, I. Cohen, and M. Cirelo. Semi-supervised learning and model search. ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.
[7] ... 2003.
[8] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. CVPR, 2003.
[9] T. Joachims. Transductive inference for text classification using support vector machines. ICML, 1999.
[10] A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. ICCV, 2003.
[11] A. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. KDD, 2000.
[12] K. Nigam. Using Unlabeled Data to Improve Text Classification. PhD thesis, Carnegie Mellon University Computer Science Dept., 2001. CMU-CS-01-126.
[13] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. CIKM, 2000.
[14] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. AAAI, 1998.
[15] Y. Rachlin. A general algorithmic framework for discovering discriminative and generative structure in data. Master's thesis, ECE Dept., Carnegie Mellon University, 2002.
[16] C. Rosenberg. Semi-Supervised Training of Models for Appearance-Based Statistical Object Detection Methods. PhD thesis, Carnegie Mellon University Computer Science Dept., May 2004. CMU-CS-04-150.
[17] B. Schiele and J. Crowley. Recognition without correspondence using multidimensional receptive field histograms. IJCV, 36(1):31-52, 2000.
[18] H. Schneiderman. Feature-centric evaluation for efficient cascaded object detection. CVPR, 2004.
[19] H. Schneiderman. Learning a restricted Bayesian network for object detection. CVPR, 2004.
[20] A. Selinger. Minimally supervised acquisition of 3D recognition models from cluttered images. CVPR, 2001.
[21] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. NIPS, 2001.
[22] M. Szummer and T. Jaakkola. Information regularization with partially labeled data. NIPS, 2002.
[23] P. Viola and M. J. Jones. Robust real-time object detection. Technical report, Compaq Cambridge Research Lab, 2001.
[24] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. ECCV, 2000.
[25] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML, 2003.