On Discriminative and Semi-Supervised
Dimensionality Reduction
Chris Pal, Michael Kelm, Xuerui Wang, Greg Druck and Andrew McCallum
Department of Computer Science,
University of Massachusetts, Amherst, MA 01003
Abstract
We are interested in using the goal of making predictions to influence dimensionality reduction procedures. A number of new methods are emerging aimed at
combining attributes of generative and discriminative approaches to data modeling. New approaches to semi-supervised learning have also been emerging. We
present and apply some new methods to non-linear and richly structured problems, comparing and contrasting models designed for computer vision with those designed for text processing, and discuss essential properties that need to be preserved when reducing dimensionality.
Overview
Recently there has been a flurry of interest in exploring new techniques combining generative and
discriminative methods through novel model structures and objective functions [6, 1, 7]. As well,
new and related semi-supervised methods are emerging such as: entropy regularization [4], which
aims to avoid violations of clustering assumptions, information regularization [2], which aims to put
decision boundaries in low density areas and [13], an approach based on graph Laplacian methods
which aims to achieve label smoothness. We focus here on obtaining dimensionality reductions that
are altered by labeled data and which improve classification performance in the context of computer vision and text processing applications.
Figure 1: Interleaved Spirals and Dimensionality Reduction. (Left) An MCL based locally linear dimensionality reduction (LLDR). (Middle) Semi-supervised LLDR where grey points illustrate unlabeled data; here 90% of labels are unobserved. (Right) Here 99% of labels are unobserved. Please note: the figure is best viewed in color, as circles at ellipse extremes indicate subspace class membership; a 3D example is also available.
We begin with an intuitive two class interleaved spiral example, a surrogate problem with structure
similar to a variety of computer vision tasks ranging from pixel classification to face manifolds[10].
For example, in [11] mixtures of locally linear models for dimensionality reduction were used for
image compression. In the experiments of figure 1 we used a mixture of locally linear, factor analysis
models [3] with a single latent dimension or factor, indicated by the dominant axis of the ellipse.
The joint distribution for a mixture of factor analyzers can be written as p(x, z, c, s) = exp{θ^T c + c^T Θ s} N(x, μ_s + Λ_s z, Ψ) N(z, 0, I), with cluster mean μ_s, factor matrix Λ_s, latent subspace z, indexed by s, diagonal covariance matrix Ψ and identity latent space covariance I. The model is
an exponential family mixture and can be illustrated with the factor graph shown in fig. 2 (Left).
We use these models partly because of their similarities to other locally linear techniques such as
locally linear embedding [9]. However, our probabilistic formulation allows us to optimize the
model using multi-conditional learning (MCL) [7] – under an objective based on α log P (c|x) +
β log P (x|c), where x is a continuous input vector, c is a multinomial class label and α and β are
weights selected by hand or using cross-validation methods, typically leading to α > β. Fig. 1
(Left) illustrates a model obtained using α = 1, β = .05, and shows how most of the model's power is devoted to creating a high-fidelity decision boundary, with more components in boundary regions. The marginal density of x is captured more coarsely. To use this underlying model for
semi-supervised learning we have experimented with the objective α log P (c|x) + β log P (x). For
both we use expected gradient based optimization. In fig. 1 (Middle) and (Right) we obtained
models using this objective with α = 1, β = .1 and with 90% and 99% of the data unlabeled. Under this objective, even with 90% of the labels missing, the model does a good job of recovering complex
non-linear latent spaces and discriminative boundaries. Our quantitative experiments also confirm
that both these objectives produce models with superior classification performance compared to
Maximum Likelihood. Using this approach, we therefore achieve locally linear but globally non-linear discriminative dimensionality reductions, using labels to directly improve our model of p(c|x).
Next, we explore a dimensionality reduction for text where cosine comparisons in the latent space
must be meaningful for predictions.
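To make the semi-supervised objective α log P(c|x) + β log P(x) concrete, the sketch below evaluates it for a mixture of factor analyzers with the factors integrated out. This is a minimal illustration under our own naming assumptions (`mcl_objective`, the parameter layout, and the `-1` convention for unlabeled points are ours), not the authors' implementation; the expected-gradient optimization itself is omitted.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mcl_objective(X, labels, log_prior, mus, Lambdas, Psi, alpha=1.0, beta=0.1):
    """Evaluate alpha * log P(c|x) + beta * log P(x) for a mixture of
    factor analyzers with the factors z integrated out, so that
    x | s ~ N(mu_s, Lambda_s Lambda_s^T + Psi).

    labels[i] == -1 marks an unlabeled point, which contributes only
    to the beta * log P(x) term.  log_prior[c, s] holds log p(c, s).
    """
    n_comp = log_prior.shape[1]
    # log p(x | s) for every data point and mixture component
    log_px_s = np.stack(
        [multivariate_normal.logpdf(
            X, mean=mus[s], cov=Lambdas[s] @ Lambdas[s].T + np.diag(Psi))
         for s in range(n_comp)], axis=1)                      # (N, S)
    # log p(x, c) = logsumexp_s [ log p(c, s) + log p(x | s) ]
    log_pxc = logsumexp(log_prior[None, :, :] + log_px_s[:, None, :], axis=2)
    log_px = logsumexp(log_pxc, axis=1)                        # (N,)
    objective = beta * log_px.sum()
    labeled = labels >= 0
    # log P(c|x) = log p(x, c) - log p(x), summed over labeled points only
    objective += alpha * (log_pxc[labeled, labels[labeled]]
                          - log_px[labeled]).sum()
    return objective
```

Setting α = 0 recovers a purely generative (marginal likelihood) criterion on all points, which makes the role of the labeled term easy to inspect.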
MCL & New Graphical Models
Figure 2: (Left) A graphical model for a mixture of linear subspaces. (Middle) Classifying the year of a NIPS paper using a latent space found under CL, MCL, and ML optimization. We see that CL and MCL have clearly superior performance to traditional ML objectives. (Right) The model as a factor graph.
Exponential family models [12] and factor graphs [5] are extremely flexible and allow us to create
a low dimensional, continuous latent space z for a variety of richly structured inputs. Consider now
creating a latent space that views documents and authors (from NIPS papers) as input and helps
us predict publication time (volume number). Consider an input space consisting of a multivariate
Bernoulli (binary) random variable xb for author identities, a single discrete label xd for time and
Mn draws from a discrete or multinomial random variable for words. If we define the sum of the word multinomials as x_m, the complete composite input space can be written x = [x_b^T x_m^T x_d^T]^T.
If we integrate out the latent space z, we can then write the probability model of fig. 2 (Right) as
P(x|θ, Λ) = exp{θ^T x + x^T Λ x − A(θ, Λ)}, where Λ = (1/2) W W^T, W^T = [W_b^T W_m^T W_d^T] and
A(θ, Λ) is the partition function. To find W, we use Gibbs sampling in an approximate expected
gradient based optimization with momentum terms and annealing to speed up convergence. We
test our model on the NIPS Conference Papers data set from Roweis [8]. We processed this data
set so that: 1) only authors who published (by themselves or co-authored with someone else) more
than 5 NIPS papers are used; giving us 125 authors, 2) only papers authored by one or more of the
125 authors are considered, leading to 873 papers, then 3) we select the top 150 words in terms of
mutual information for authors. Papers are labeled by the NIPS volume number in which they were
published. We retrieve documents that have the same volume label as a test document based on the
cosine coefficient between them in the latent space. For evaluation, we score papers as relevant if
they are within ±3 years of when the test document was published. We show precision and recall
results in fig. 2. We experiment with Conditional Likelihood (CL) based optimization, using the
probability of the volume given authors and words and MCL, where we also use the reverse with
α = 1, β = .001. ML-1 is Maximum Likelihood optimization with no volume label; ML-2 uses the
volume label. We find that CL and MCL derived latent spaces show marked improvement.
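The retrieval evaluation described above can be sketched as follows. This is a hypothetical illustration under our own assumptions: we summarize quality as precision at a fixed cutoff k (one point on the precision-recall curve reported in fig. 2), and the function name and signature are ours, not the authors' evaluation code.

```python
import numpy as np

def retrieval_eval(Z, years, k=10, tol=3):
    """Rank documents by cosine similarity in the latent space Z
    (one row per document) and score a retrieved paper as relevant
    if its volume label is within +/- tol years of the query's.
    Returns precision@k averaged over all queries."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize rows
    sims = Zn @ Zn.T                                    # pairwise cosine coefficients
    np.fill_diagonal(sims, -np.inf)                     # exclude the query itself
    precisions = []
    for i in range(len(years)):
        top = np.argsort(-sims[i])[:k]                  # k nearest by cosine
        relevant = np.abs(years[top] - years[i]) <= tol
        precisions.append(relevant.mean())
    return float(np.mean(precisions))
```

Because only cosine angles in the latent space matter here, a latent representation is useful for this task exactly when directions in z encode publication time, which is what the CL and MCL objectives encourage.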
Acknowledgements
This work was supported in part by the Center for Intelligent Information Retrieval, in part by The
Central Intelligence Agency, the National Security Agency and National Science Foundation under
NSF grant #IIS-0326249, and in part by the Defense Advanced Research Projects Agency (DARPA),
through the Department of the Interior, NBC, Acquisition Services Division, under contract number
NBCHD030010. This work is also supported in part by Microsoft Research under the eScience
and Memex funding programs and by Kodak. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsors.
References
[1] G. Bouchard and B. Triggs. The tradeoff between generative and discriminative classifiers. In J. Antoch,
editor, Proceedings in Computational Statistics, 16th Symposium of IASC, volume 16, Prague, 2004.
Physica-Verlag.
[2] A. Corduneanu and T. Jaakkola. On information regularization. In Proceedings of the 19th UAI, 2003.,
2003.
[3] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report
CRG-TR-96-1, University of Toronto, 1996.
[4] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural
Information Processing Systems 17 (NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada), 2004.
[5] F. R. Kschischang, B. Frey, and H. A. Loeliger. Factor graphs and the sum-product algorithm. IEEE
Trans. Inform. Theory, 47(2):498–519, 2001.
[6] J. A. Lasserre, C. M. Bishop, and T. P. Minka. Principled hybrids of generative and discriminative models.
In CVPR ’06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pages 87–94, Washington, DC, USA, 2006. IEEE Computer Society.
[7] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative
training for clustering and classification. In AAAI ’06: American Association for Artificial
Intelligence National Conference on Artificial Intelligence, 2006.
[8] S. Roweis. NIPS Conference Papers data set. http://www.cs.toronto.edu/~roweis/data.html.
[9] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science,
290(5500):2323–2326, Dec. 22 2000.
[10] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(5500):2268–2269, Dec. 2000.
[11] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural
Computation, 11(2):443–482, 1999.
[12] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, pages 1481–1488, 2005.
[13] X. Zhu and J. Lafferty. Harmonic mixtures: combining mixture models and graph-based methods for
inductive and scalable semi-supervised learning. In ICML ’05: Proceedings of the 22nd international
conference on Machine learning, pages 1052–1059, New York, NY, USA, 2005. ACM Press.