Introduction to Pattern Recognition and Machine Learning
Published:
Vol. 1: Introduction to Algebraic Geometry and Commutative Algebra
by Dilip P Patil & Uwe Storch
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.
Printed in Singapore
IISc Press and WSPC are co-publishing books authored by world-renowned scientists and engineers. This collaboration, started in 2008 during IISc’s centenary year under a Memorandum of Understanding between IISc and WSPC, has resulted in the establishment of three Series: IISc Centenary Lectures Series (ICLS), IISc Research Monographs Series (IRMS), and IISc Lecture Notes Series (ILNS).
The “IISc Lecture Notes Series” will consist of books that are reasonably self-contained and can be used either as textbooks or for self-study at the postgraduate level in science and engineering. The books will be based on material that has been class-tested for the most part.
Table of Contents

1. Introduction
   1. Classifiers: An Introduction
   2. An Introduction to Clustering
   3. Machine Learning
2. Types of Data
   1. Features and Patterns
   2. Domain of a Variable
   3. Types of Features
      3.1. Nominal data
      3.2. Ordinal data
      3.3. Interval-valued variables
      3.4. Ratio variables
      3.5. Spatio-temporal data
   4. Proximity measures
      4.1. Fractional norms
      4.2. Are metrics essential?
      4.3. Similarity between vectors
      4.4. Proximity between spatial patterns
      4.5. Proximity between temporal patterns
5. Classification
   1. Classification Without Learning
   2. Classification in High-Dimensional Spaces
      2.1. Fractional distance metrics
      2.2. Shrinkage–divergence proximity (SDP)
   3. Random Forests
      3.1. Fuzzy random forests
   4. Linear Support Vector Machine (SVM)
      4.1. SVM–kNN
      4.2. Adaptation of cutting plane algorithm
      4.3. Nystrom approximated SVM
   5. Logistic Regression
   6. Semi-supervised Classification
      6.1. Using clustering algorithms
      6.2. Using generative models
      6.3. Using low density separation
      6.4. Using graph-based methods
      6.5. Using co-training methods
      6.6. Using self-training methods
      6.7. SVM for semi-supervised classification
      6.8. Random forests for semi-supervised classification
   7. Classification of Time-Series Data
      7.1. Distance-based classification
      7.2. Feature-based classification
      7.3. Model-based classification
Index
Preface
M. Narasimha Murty
V. Susheela Devi
Bangalore, India
Chapter 1
Introduction
This book deals with machine learning (ML) and pattern recognition
(PR). Even though humans can deal with both physical objects and
abstract notions in day-to-day activities while making decisions in
various situations, it is not possible for the computer to handle them
directly. For example, in order to discriminate between a chair and
a pen, using a machine, we cannot directly deal with the physical
objects; we abstract these objects and store the corresponding rep-
resentations on the machine. For example, we may represent these
objects using features like height, weight, cost, and color. We will
not be able to reproduce the physical objects from the respective
representations. So, we deal with the representations of the patterns,
not the patterns themselves. In the literature, it is not uncommon to refer to both the patterns and their representations as patterns.
So, the input to a machine learning or pattern recognition system consists of abstractions of the input patterns/data. The output of the
system is also one or more abstractions. We explain this process
using the tasks of pattern recognition and machine learning. In pat-
tern recognition there are two primary tasks:
1. Classification: This problem may be defined as follows:
• There are C classes; these are Class1 , Class2, . . . , ClassC .
• Given a set Di of patterns from Classi for i = 1, 2, . . . , C, let
D = D1 ∪ D2 ∪ · · · ∪ DC . D is called the training set and mem-
bers of D are called labeled patterns because each pattern
has a class label associated with it. If each pattern Xj ∈ D is
Figure 1.1. A two-dimensional dataset with features F1 and F2: patterns X1–X9 belong to one class and O1–O10 to the other, t1 and t2 are test points, and a and b mark values on the F1 and F2 axes.
1. Classifiers: An Introduction
In order to get a feel for classification we use the same data points
shown in Figure 1.1. We also considered two test points labeled t1
and t2 . We briefly illustrate some of the prominent classifiers.
Similarly both SVM and DTC are linear in the example as the
decision boundaries are linear. In general, NNC and KNNC are
nonlinear. Even though Kernel SVM can be nonlinear, it may be
viewed as a linear classifier in a high-dimensional space and DTC
may be viewed as a piecewise linear classifier. There are other lin-
ear classifiers like the Naive Bayes Classifier (NBC ) and Logistic
Regression-based classifiers which are discussed in the later chap-
ters.
• Classification in High-dimensional Spaces: Most of the
current applications require classifiers that can deal with high-
dimensional data; these applications include text classification,
genetic sequence analysis, and multimedia data processing. It is
difficult to get discriminative information using conventional distance-based classifiers; this is because the nearest and farthest neighbors of a pattern tend to be at nearly the same distance from it in a high-dimensional space (a small numerical sketch of this effect is given after this list). So, NNC and KNNC
are not typically used in high-dimensional spaces. Similarly, it is
difficult to build a decision tree when there are a large number of
features; this is because starting from the root node of a possibly
tall tree we have to select a feature and its value for the best split
out of a large collection of features at every internal node of the
decision tree. Similarly, it becomes difficult to train a kernel SVM
in a high-dimensional space.
Some of the popular classifiers in high-dimensional spaces are
linear SVM , NBC , and logistic regression-based classifier. Classi-
fier based on random forest seems to be another useful classifier
in high-dimensional spaces; random forest works well because each
tree in the forest is built based on a low-dimensional subspace.
• Numerical and Categorical Features: In several practical
applications we have data characterized by both numerical and cat-
egorical features. SVM s can handle only numerical data because
they employ dot product computations. Similarly, NNC and
KNNC work with numerical data where it is easy to compute
neighbors based on distances. These classifiers require conversion
of categorical features into numerical features appropriately before
using them.
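The distance-concentration behavior mentioned above can be checked with a small simulation. The sketch below is illustrative and not from the book: it draws random points in increasing dimensions and prints the ratio of the farthest to the nearest neighbor distance from a query point, which approaches 1 as the dimensionality grows.

```python
import numpy as np

# Illustrative sketch: the ratio of the farthest to the nearest neighbor
# distance shrinks towards 1 as the dimensionality grows.
rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    data = rng.uniform(size=(1000, d))    # 1000 random patterns
    query = rng.uniform(size=d)           # one query point
    dists = np.linalg.norm(data - query, axis=1)
    print(f"d = {d:4d}   farthest/nearest = {dists.max() / dists.min():.2f}")
```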
over the entire training dataset will double the number of training
patterns.
Bootstrapping may be explained using the data in Figure 1.1.
Let us consider three nearest neighbors for each pattern. Let us consider X1; its 3 neighbors from the same class are X2, X4, and X3. Let X1′, the bootstrapped pattern corresponding to X1, be the centroid of these three points. In a similar manner we can compute bootstrapped patterns X2′, X3′, . . . , X9′ corresponding to X2, X3, . . . , X9 respectively, and bootstrapped patterns corresponding to the Os can also be computed. For example, O2, O3, O6 are the three neighbors of O1 and their centroid gives the bootstrapped pattern O1′. In a similar set-
ting we may have to obtain bootstrap patterns corresponding to
both the classes; however to deal with the class imbalance problem,
we need to bootstrap only the minority class patterns. There are
several other ways to synthesize patterns in the minority class.
Preprocessing may be carried out either by decreasing the size
of the training data of the majority class or by increasing the size
of training data of the minority class or both.
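A minimal sketch of this oversampling idea (each synthetic pattern is the centroid of a few same-class nearest neighbors), assuming the minority-class patterns are the rows of a NumPy array; the function name, the neighbor count k, and the example data are illustrative, not from the book.

```python
import numpy as np

def bootstrap_minority(X_minority, k=3):
    """For each minority pattern, generate a synthetic pattern as the
    centroid of its k nearest neighbors from the same class."""
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        d[i] = np.inf                          # exclude the pattern itself
        nearest = np.argsort(d)[:k]            # k nearest same-class neighbors
        synthetic.append(X[nearest].mean(axis=0))
    return np.vstack([X, np.array(synthetic)])  # doubles the minority class size

# Five illustrative minority patterns in two dimensions.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1], [1.4, 1.0]])
print(bootstrap_minority(X_min, k=3).shape)    # (10, 2)
```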
• Training and Classification Time: Most of the classifiers
involve a training phase; they learn a model and use it for classi-
fication. So, computation time is required to learn the model and
for classification of the test patterns; these are called training time
and classification/test time respectively. We give the details below:
− Training: It is done only once using the training data. So, for
real time classification applications classification time is more
important than the training time.
∗ NNC : There is no training done; in this sense it is the
simplest model. However, in order to simplify testing/
classification a data structure is built to store the training
data in a compact/compressed form.
∗ KNNC : Here also there is no training time. However, using a part of the training data for training and the remaining part for validation, we need to fix a suitable value for K (a small validation sketch is given below). Basically KNNC
is more robust to noise compared to NNC as it considers
more neighbors. So, smaller values of K make it noise-prone
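A sketch of fixing K by hold-out validation, assuming a plain Euclidean-distance KNNC implemented from scratch; the data, the split, and the candidate values of K are illustrative.

```python
import numpy as np

def knnc(X_train, y_train, X_test, k):
    """Classify each test pattern by majority vote among its k nearest
    training patterns (Euclidean distance)."""
    preds = []
    for x in X_test:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

# Hold out part of the training data for validation and pick the K with
# the best validation accuracy.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
order = rng.permutation(len(y))
train, val = order[:70], order[70:]

best_k = max((1, 3, 5, 7, 9),
             key=lambda k: (knnc(X[train], y[train], X[val], k) == y[val]).mean())
print("chosen K =", best_k)
```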
2. An Introduction to Clustering
In clustering we group a collection, D, of patterns into some K clus-
ters; patterns in each cluster are similar to each other. There are a
variety of clustering algorithms. Broadly they may be characterized
in the following ways:
For example, in Figure 1.1, for the points X6, X7, X8, X9 in one of
the clusters, the centroid is located inside the circle having these
four patterns. The advantage of representing a cluster using its centroid is that it is centrally located: it is the point that minimizes the sum of the squared distances to all the points in the cluster. However, the centroid is not helpful in achieving robust
clustering; this is because if there is an outlier in the dataset then
the centroid may be shifted away from a majority of the points in
the cluster. The centroid may shift further as the outlier becomes
more and more prominent. So, centroid is not a good representative
in the presence of outliers. Another representative that could be
used is the medoid of the cluster; the medoid is the most centrally located point that belongs to the cluster. So, the medoid is not significantly affected by a small number of points in the cluster, whether they are outliers or not.
Another issue that emerges in this context is to decide whether
each cluster has a single representative or multiple representatives.
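An illustrative comparison (not from the book) of the two representatives on a small cluster with one outlier: the centroid is pulled towards the outlier while the medoid stays with the bulk of the points.

```python
import numpy as np

# Four nearby points plus one outlier.
cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.2], [10.0, 10.0]])

centroid = cluster.mean(axis=0)

# Medoid: the member of the cluster with minimum total distance to the others.
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)   # shifted towards the outlier
print("medoid:  ", medoid)     # one of the four nearby points
```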
• Dynamic Clustering: Here, we obtain a partition of the data
using a set, Dn , of patterns. Let the partition be πn , where n is the number of patterns in Dn . Now we would like to add or delete a pattern from Dn . So, the possibilities are:
− Addition of a Pattern: Now the question is whether we can
reflect the addition of a pattern to Dn in the resulting cluster-
ing by updating the partition πn to πn+1 without re-clustering
the already clustered data. In other words, we would like to use
the (n + 1)th pattern and πn to get πn+1 ; this means we generate πn+1 without reexamining the patterns in Dn . Such a clustering paradigm may be called incremental clustering (a leader-style sketch is given after this list). This paradigm
is useful in stream data mining. One problem with incremen-
tal clustering is order dependence; for different orderings of the
input patterns in D, we obtain different partitions.
− Deletion of a Pattern: Even though incremental clustering
where additional patterns can be used to update the current
partition without re-clustering the earlier seen data is popular,
deletion of patterns from the current set and its impact on the
partition is not examined in a detailed manner in the literature.
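A minimal sketch of incremental clustering in the leader style, under an assumed distance threshold: each incoming pattern joins the cluster of the first leader within the threshold, or starts a new cluster, so earlier patterns are never re-examined. The threshold and the data stream are illustrative.

```python
import numpy as np

def leader_clustering(stream, threshold):
    """Incremental (leader) clustering: assign each incoming pattern to the
    first leader within `threshold`, else make it the leader of a new cluster.
    Earlier patterns are never revisited."""
    leaders, labels = [], []
    for x in stream:
        for cid, leader in enumerate(leaders):
            if np.linalg.norm(x - leader) <= threshold:
                labels.append(cid)
                break
        else:
            leaders.append(x)                  # x starts a new cluster
            labels.append(len(leaders) - 1)
    return leaders, labels

rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(4, 0.3, (5, 2))])
leaders, labels = leader_clustering(stream, threshold=1.5)
print(len(leaders), "clusters; labels:", labels)
# Presenting the stream in a different order may give a different partition,
# which is the order dependence mentioned above.
```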
Then x_{ij} = x_{qj} + δ.
Term Probability
Cricket 0.4
Ball 0.2
Bat 0.15
Umpire 0.12
Wicket 0.08
Run 0.05
\|A − BC\|_F^2 is minimized.
3. Machine Learning
We have seen the two important pattern recognition tasks which are
classification and clustering. Another task is dimensionality reduc-
tion which is required in both classification and clustering. Some of
the other tasks in machine learning are regression or curve fitting,
ranking, and summarization. We deal with them in this section.
• Dimensionality reduction
Research Ideas
1. What are the primary differences between pattern recognition, machine learning, and data mining? Which tasks are important in each of these areas?
Relevant References
(a) J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques,
Third Edition. New York: Morgan Kauffmann, 2011.
(b) K. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge,
MA: MIT Press, 2012.
(c) M. N. Murty and V. Susheela Devi, Pattern Recognition: An Algorithmic
Approach. London: Springer, 2011.
2. How do we evaluate classifier performance? What is the best validation scheme? How do we deal with class imbalance?
Relevant References
(a) V. Lopez, A. Fernandez and F. Herrera, On the importance of the vali-
dation technique for classification with imbalanced datasets: Addressing
covariate shift when data is skewed. Information Sciences, 257:1–13, 2014.
(b) Q.-Y. Yin, J.-S. Zhang, C.-X. Zhang and N.-N. Ji, A novel selective ensem-
ble algorithm for imbalanced data classification based on exploratory
under-sampling. Mathematical Problems in Engineering, 2014.
(c) Y. Sun, A. K. C. Wong and M. S. Kamel, Classification of imbalanced
data: A review. International Journal of Pattern Recognition and Artificial
Intelligence, 23:687–719, 2009.
3. How do we represent clusters? Is it essential that there should be one repre-
sentative per cluster? Is it possible to solve it using an optimization scheme?
Relevant References
(a) D. Bhattacharya, S. Seth and T.-H. Kim, Social network analysis to detect
inherent communities based on constraints. Applied Mathematics and
Information Sciences, 8:1–12, 2014.
(b) W. Hamalainen, V. Kumpulainen and M. Mozgovoy, Evaluation of clus-
tering methods for adaptive learning systems. In Artificial Intelligence in
Distance Education, U. Kose and D. Koc (eds.). Hershey, PA: IGI Global,
2014, pp. 237–260.
(c) P. Franti, M. Rezaei and Q. Zhao, Centroid index: Cluster level similarity
measure. Pattern Recognition, 47:3034–3045, 2014.
4. How do we compute similarity between a pair of patterns that employ both
numerical and categorical features? Can distance/dot product based methods
work well with such patterns?
Relevant References
(a) Y.-M. Cheung and H. Jia, Categorical-and-numerical-attribute data clus-
tering based on a unified similarity metric without knowing cluster number.
Pattern Recognition, 46:2228–2238, 2013.
(b) I. W. Tsang, J. T. Kwok and P.-M. Cheung, Core vector machines: Fast
SVM training on very large data sets. JMLR, 6:363–392, 2005.
(c) A. Ahmad and G. Brown, Random projection random discretization
ensembles — ensembles of linear multivariate decision trees. IEEE
Transactions on Knowledge Data and Engineering, 26:1225–1239,
2014.
5. In the context of so-called generative models, why should one synthesize
patterns? Which classifiers exploit the synthetic patterns better?
Relevant References
(a) L. Plonsky, J. Egbert and G. T. Laflair, Bootstrapping in applied linguistics:
Assessing its potential using shared data. Applied Linguistics, 2014.
(b) P. S. Gromski, Y. Xu, E. Correa, D. I. Ellis, M. L. Turner and R. Goodacre,
A comparative investigation of modern feature selection and classification
approaches for the analysis of mass spectrometry data. Analytica Chimica
Acta, 829:1–8, 2014.
(c) H. Seetha, R. Saravanan and M. N. Murty, Pattern synthesis using multiple
Kernel learning for efficient SVM classification. Cybernetics and Infor-
mation Technologies, 12:77–94, 2012.
6. Is it meaningful to combine several binary classifiers to realize multi-class
classification?
Relevant References
(a) V. Sazonova and S. Matwin, Combining binary classifiers for a multi-class
problem with differential privacy. Transactions on Data Privacy, 7:51–70,
2014.
(b) A. Kontorovich and R. Weiss, Maximum margin multiclass nearest neigh-
bors, arXiv:1401.7898, 2014.
(c) T. Takenouchi and S. Ishii, A unified framework of binary classifiers
ensemble for multi-class classification. Proceedings of ICONIP, 2012.
(d) K. Hwang, K. Lee, C. Lee and S. Park, Multi-class classification using
a signomial function. Journal of the Operational Research Society,
doi:10.1057/jors.2013.180, Published online on 5 March 2014.
7. Which classifier is ideally suited to deal with a large number of classes, say 1000? Can one design such a classifier?
Relevant References
(a) K. Mei, P. Dong, H. Lei and J. Fan, A distributed approach for large-scale
classifier training and image classification. Neurocomputing, 144:304–
317, 2014.
(b) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun,
OverFeat: Integrated recognition, localization and detection using convo-
lutional networks, arXiv:1312.6229, 2014.
(c) T.-N. Doan, T.-N. Do and F. Poulet, Large scale visual classification with
many classes. Proceedings of MLDM, 2013.
8. How do we perform multi-label classification, where a single pattern may be associated with more than one class label?
Relevant References
(a) B. Akhand and V. S. Devi, Multi-label classification of discrete data, IEEE
International conference on fuzzy systems. FUZZ-’13, 2013.
(b) J. Xu, Fast multi-label core vector machine. Pattern Recognition, 46:885–
898, 2013.
(c) J. Read, L. Martino and D. Luengo, Efficient Monte Carlo methods for
multi-dimensional learning with classifier chains. Pattern Recognition,
47:1535–1546, 2014.
(d) J. Lee and D.-W. Kim, Feature selection for multi-label classification using
multivariate mutual information. Pattern Recognition Letters, 34:349–357,
2013.
9. It is possible to show equivalence between threshold-based algorithms like the leader algorithm and number-of-clusters-based algorithms like the K-means algorithm.
Is it possible to axiomatize clustering to derive such equivalences?
Relevant References
(a) R. Chitta and M. N. Murty, Two-level k-means clustering algorithm for
k-tau relationship establishment and linear-time classification. Pattern
Recognition, 43:796–804, 2010.
(b) M. Ackerman, S. Ben-David and D. Loker, Towards property-based clas-
sification of clustering paradigms. Proceedings of NIPS, 2010.
(c) M. Meila, Comparing clusterings — An axiomatic view. Proceedings of
ICML, 2005.
10. In social networks that evolve over time the notion of outlier may have to be
redefined. How do we achieve it?
Relevant References
(a) N. N. R. R. Suri, M. N. Murty and G. Athithan, Characterizing temporal
anomalies in evolving networks. Proceedings of PAKDD, 2014.
(b) M. Gupta, J. Gao, C. Aggarwal and J. Han, Outlier Detection for Temporal
Data. San Rafael: Morgan and Claypool Publishers, 2014.
(c) L. Peel and A. Clauset, Detecting change points in the large-scale structure
of evolving networks, arXiv:1403.0989, 2014.
11. What is the most appropriate scheme for clustering labeled data?
Relevant References
(a) V. Sridhar and M. N. Murty, Clustering algorithms for library comparison.
Pattern Recognition, 24:815–823, 1991.
(b) Q. Qiu and G. Sapiro, Learning transformations for clustering and classi-
fication, arXiv:1309.2074, 2014.
(c) A. Kyriakopoulou and T. Kalamboukis, Using clustering to enhance
text classification. Proceedings of SIGIR, 2007.
12. How do we exploit Map-Reduce framework to design efficient clustering
schemes?
Relevant References
(a) A. Ene, S. Im and B. Moseley, Fast clustering using MapReduce. Proceed-
ings of KDD, 2011.
(b) R. L. F. Cordeiro, C. Traina Jr., A. J. M. Traina, J. Lopez, U. Kang
and C. Faloutsos, Clustering very large multi-dimensional datasets with
MapReduce. Proceedings of KDD, 2011.
(c) S. Fries, S. Wels and T. Seidl, Projected clustering for huge data sets in
MapReduce. Proceedings of EDBT, 2014.
13. What is the best way to exploit knowledge in clustering? Which components
of the clustering system are more sensitive to the usage of knowledge?
Relevant References
(a) A. Srivastava and M. N. Murty, A comparison between conceptual clus-
tering and conventional clustering. Pattern Recognition, 23:975–981,
1990.
(b) X. Hu, X. Zhang, C. Lu, E. K. Park and X. Zhou, Exploiting Wikipedia as
external knowledge for document clustering. Proceedings of KDD, 2009.
(c) W. Pedrycz, Knowledge-Based Clustering: From Data to Information
Granules. New Jersey: John Wiley & Sons, 2005.
14. How do we automatically generate a summary of a single document or of a collection of documents?
Relevant References
(a) N. Karthik and M. N. Murty, Obtaining single document summaries using
latent Dirichlet allocation. Proceedings of ICONIP, 2012.
(b) B. Piwowarski, M. R. Amini and M. Lalmas, On using a quantum physics
formalism for multi-document summarization. Journal of the American
Society for Information Science and Technology, 63:865–888, 2012.
Chapter 2
Types of Data
Figure: Chairs (marked X) and humans (marked O) plotted in a two-dimensional feature space with axes X1 and X2.
Table 2.1.
Pattern number Weight (in kgs) Height (in feet) Class label
1 10 3.5 Chair
2 63 5.4 Human
3 10.4 3.45 Chair
4 10.3 3.3 Chair
5 73.5 5.8 Human
6 81 6.1 Human
7 10.4 3.35 Chair
8 71 6.4 Human
2. Domain of a Variable
Typically, each feature is assigned a number or a symbol as its
value. For example, the weight of the chair shown in the first row of Table 2.1 is 10 kgs, where 10 is a real number; the weights of the other objects in the Table are also numbers. However, the color of a chair could
be black, red, or green which we are not using in the representation
used in Table 2.1. So, color assumes symbolic values. It is possible
for the features to assume values that could be trees or graphs and
other structures. Specifically, a document collection is popularly rep-
resented as a document–term matrix, where each document (pattern)
corresponds to a row and each term in the collection corresponds to a
column. It is also possible to represent a document collection using an
inverted index as is done by search engines for information retrieval.
In the inverted index, a list of the documents in which a term occurs
is stored for each term in the collection. Even though such an index
is primarily used for information retrieval, it is possible to use the
index structure in other machine learning tasks like classification,
clustering, ranking and prediction in general.
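A minimal sketch of an inverted index of the kind described above, assuming documents are plain strings of terms; the two documents are illustrative.

```python
from collections import defaultdict

# Two illustrative documents represented as strings of terms.
docs = {
    "D1": "the good old teach teach several course",
    "D2": "in the big old college",
}

# Inverted index: for every term, the list of documents in which it occurs.
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for term in sorted(set(text.split())):   # record each term once per document
        inverted_index[term].append(doc_id)

print(dict(inverted_index))                  # e.g. 'old' -> ['D1', 'D2']
```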
A feature assumes values from a set: For example, the feature
weight has its values from the set of positive reals in general and an
appropriate subset in particular. Such a set is called the domain of
the feature. It is possible that weight can be measured in terms of
either kilograms or pounds. In both the cases, the domain of weight
is the set of positive reals. Also, it is possible that different features
can use the same domain of values. For example, a feature like height
also has the set of positive reals as its domain. For some features, these assignments are artificial or man-made; for others they are natural. For example, it is possible to use either a number or a symbol as the value of the feature ID or Identification of a person; the only requirement here is being able to discriminate between two individuals based on the values assigned to them.
1 10 12.5 light
2 15 47 light
3 60 93 heavy
4 85 120 heavy
values of these two objects is 93/47 ≈ 1.98. So, such a transformed
data may be adequate for pattern classification based on choosing or
learning the appropriate threshold value either in the input or in the
transformed domain. This example illustrates the property that the
way we measure and represent an attribute may not match its pro-
perties. However, classification is still possible. We discuss properties
associated with different types of variables in the next section.
3. Types of Features
There are different types of features or variables. We may categorize
them as follows:
R = {(the, 2), (good, 1), (old, 2), (teach, 2), (several, 1), (course, 1),
(in, 2), (big, 1), (college, 1)}
Histogram(D1 ): {(the, 1), (good, 1), (old, 1), (teach, 2), (several, 1),
(course, 1)}.
Histogram(D2 ): {(in, 1), (the, 1), (big, 1), (old, 1), (college, 1)}.
D = {D1 , D2 , . . . , Dn }.
{(obj1, blue), (obj2, blue), (obj3, red), (obj4, green), (obj5, blue),
(obj6, green), (obj7, blue), (obj8, red), (obj9, blue), (obj10, green)}
Considering the set, we can say that the domain of the nomi-
nal variable color is Dcolor = {blue, red, green}; we can discriminate
objects based on the value the variable takes from the domain. For
example, obj1 and obj2 are identical and are different from obj3. In
a similar manner we can compare any pair of objects in the set based
on the value assumed. Further, note that there are 5 blue, 3 green
and 2 red objects in the collection, which means that the set can be
represented by a histogram given by {(blue, 5), (red, 2), (green, 3)}.
Once we have the histogram, we can obtain the mode and entropy
as follows:

Entropy(D) = -\sum_{i=1}^{d} p_i \log p_i,
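A small illustrative sketch of how the histogram, mode, and entropy of a nominal variable can be computed for the ten colored objects above; the base of the logarithm is an assumption.

```python
from collections import Counter
import math

# The ten colored objects listed above.
colors = ["blue", "blue", "red", "green", "blue",
          "green", "blue", "red", "blue", "green"]

histogram = Counter(colors)                  # {'blue': 5, 'green': 3, 'red': 2}
mode = histogram.most_common(1)[0][0]        # most frequent value: 'blue'

n = len(colors)
entropy = -sum((c / n) * math.log2(c / n) for c in histogram.values())

print(histogram, "mode:", mode, "entropy:", round(entropy, 3))
```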
Histogram(D) = {(in, 2), (old, 2), (teach, 2), (the, 2), (big, 1), (col-
lege, 1), (course, 1), (good, 1), (several, 1)}.
D1 0 0 1 1 0 1 1 1 1
D2 1 1 0 0 1 1 0 0 1
Mean = \frac{1}{n} \sum_{i=1}^{n} x_i,

Standard deviation = \left( \frac{1}{n} \sum_{i=1}^{n} (x_i - Mean)^2 \right)^{\frac{1}{2}}.
4. Proximity measures
Matching is of paramount importance in several areas of computer
science and engineering. Matching trees, graphs, sequences, strings,
and vectors is routinely handled by a variety of algorithms. Match-
ing is either exact or approximate. Exact matching is popular in
3. Triangular Inequality
d(x, y) = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}.
d(D_1, D_2) = \frac{7}{9}.
(iii) Similarity between binary strings
s(D_1, D_2) = \frac{2}{9}, as d(D_1, D_2) = \frac{7}{9}.

Note that

s(x, y) = 1 - \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}} = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}.

J(x, y) = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}.

Note that J(D_1, D_2) = \frac{2}{9}. Also, s(x, y) and J(x, y) need not be the same. For example, if there are two 5-bit binary strings x and y given by x = 1 0 0 0 1 and y = 0 1 0 0 1, then s(x, y) = \frac{3}{5} = 0.6 and J(x, y) = \frac{1}{3} = 0.33.
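A short illustrative check of the matching-based similarity s and the Jaccard coefficient J for the two 5-bit strings above.

```python
def binary_similarities(x, y):
    """Simple matching similarity s and Jaccard coefficient J for two
    equal-length binary sequences."""
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    n00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    n10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    n01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    s = (n11 + n00) / (n11 + n10 + n01 + n00)
    j = n11 / (n11 + n10 + n01)
    return s, j

print(binary_similarities([1, 0, 0, 0, 1], [0, 1, 0, 0, 1]))   # (0.6, 0.333...)
```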
d_1(X_3, X_4) = 2 + 2 = 4 and d_2(X_3, X_4) = [2^2 + 2^2]^{0.5} = \sqrt{8} = 2.828.
KL(p, q) = -\sum_{i=1}^{d} p_i \log_2 \frac{q_i}{p_i}.
Note that the squared euclidean distance does not satisfy the triangular inequality because d_2^2(x, y) + d_2^2(y, z) = 13 and d_2^2(x, z) = 16, which means the sum of the (squared) lengths of two sides of the triangle (13) is less than the (squared) length of the third side (16), violating the triangular inequality.
However, the squared euclidean distance is symmetric. This is because, basically, the euclidean distance is a metric and so it satisfies symmetry. As a consequence, for any two points u and v, d_2(u, v) = d_2(v, u), which means d_2^2(u, v) = d_2^2(v, u), ensuring that the squared euclidean distance is symmetric. Similarly, one can show that the squared euclidean distance satisfies positive reflexivity.
In this case

x · y = 3 × 1 + 1 × 0 + 5 × 1 + 1 × 0 + 0 × 0 + 0 × 2 = 8,

\|x\| = (3 × 3 + 1 × 1 + 5 × 5 + 1 × 1 + 0 × 0 + 0 × 0)^{1/2} = \sqrt{36} = 6,

\|y\| = (1 × 1 + 0 × 0 + 1 × 1 + 0 × 0 + 0 × 0 + 2 × 2)^{1/2} = \sqrt{6} = 2.449.

So, cos(x, y) = \frac{8}{6 × 2.449} = 0.544.
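A quick numerical check (illustrative) of the dot product, norms, and cosine similarity for x = (3, 1, 5, 1, 0, 0) and y = (1, 0, 1, 0, 0, 2).

```python
import numpy as np

x = np.array([3.0, 1.0, 5.0, 1.0, 0.0, 0.0])
y = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 2.0])

dot = x @ y
cos = dot / (np.linalg.norm(x) * np.linalg.norm(y))
print(dot, np.linalg.norm(x), round(np.linalg.norm(y), 3), round(cos, 3))
# 8.0 6.0 2.449 0.544
```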
Some of the issues associated with computing the cosine values are:
1. The vectors can be normalized to have unit length beforehand; for the above vectors,

\frac{1}{6}(3, 1, 5, 1, 0, 0) = (0.5, 0.167, 0.833, 0.167, 0, 0),

\frac{1}{2.449}(1, 0, 1, 0, 0, 2) = (0.408, 0, 0.408, 0, 0, 0.816).

So, it can simplify the cosine computation if the data are initially normalized. Data is routinely normalized when classifiers based on neural networks and support vector machines are used.
2. It is possible to approximate the cosine computation by ignoring
the smaller values in a normalized vector; we replace such small
1. Minkowski distance
2. Cosine similarity
3. KL distance: This is an asymmetric distance function. The symmetric version is

D(a, b) = \frac{d(a, b) + d(b, a)}{2},

where d(a, b) is the conventional KL distance between a and b.
d_m = \frac{1}{d} \sum_{k=1}^{d} dissim(p_k, q_k),

where dissim(p_k, q_k) = \frac{|p_k - q_k|}{|p_k| + |q_k|}.

d_p(p_k, q_k) = \frac{|p_k - q_k|}{2 \cdot \max(|p_k|, |q_k|)}.

d_p(P, Q) = \frac{1}{d} \sum_{k=1}^{d} d_p(p_k, q_k).

\mu_P = \frac{1}{d} \sum_{i=1}^{d} P_i,

d_{CC1}(P, Q) = \left( \frac{1 - CC(P, Q)}{1 + CC(P, Q)} \right)^{\beta},

where an appropriate value of β is to be chosen; β is a value greater than zero, typically β = 2.
Another distance measure is
d(P_i, Q_j) = |P_i - Q_j|.
This means that the cumulative distance DST (i, j) is the sum of
the distance in the current cell d(i, j) and the minimum of the cumu-
lative distances of the adjacent elements, DST (i−1, j), DST (i, j−1),
and DST (i − 1, j − 1).
To decrease the number of paths considered and to speed up the
calculations, some constraints are considered. The paths considered
1. Sakoe–Chiba Band
2. Itakura Parallelogram
Figure: the Sakoe–Chiba band and the Itakura parallelogram shown as allowed warping regions in the P–Q plane.
Ri = d, 0 ≤ d ≤ m,
1. Initialize:
   for i = 1 to dP
     for j = 1 to dQ
       DTW(i, j) = ∞
   DTW(0, 0) = 0
2. Update:
   for i = 1 to dP
     for j = max(1, i − w) to min(dQ, i + w)
       DTW(i, j) = |Pi − Qj| + min(DTW(i − 1, j), DTW(i, j − 1), DTW(i − 1, j − 1))
j \i 0 1 2 3 4 5
0 0 ∞ ∞ ∞ ∞ ∞
1 ∞ 0 0 2 3 5
2 ∞ 1 1 0 0 1
3 ∞ 3 3 1 1 0
4 ∞ 5 5 2 2 0
j \i 0 1 2 3 4 5
0 0 ∞ ∞ ∞ ∞ ∞
1 ∞ 0 0 ∞ ∞ ∞
2 ∞ 1 1 0 ∞ ∞
3 ∞ ∞ 3 1 1 ∞
4 ∞ ∞ ∞ 1 2 1
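A runnable sketch of the windowed DTW recurrence above; the sequences P and Q below are illustrative, not the ones used in the tables.

```python
import numpy as np

def dtw_distance(P, Q, w):
    """Dynamic time warping with a warping window of half-width w,
    following the initialize/update steps given above."""
    dP, dQ = len(P), len(Q)
    D = np.full((dP + 1, dQ + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, dP + 1):
        for j in range(max(1, i - w), min(dQ, i + w) + 1):
            cost = abs(P[i - 1] - Q[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[dP, dQ]

P = [1, 2, 3, 4, 3]        # illustrative sequences
Q = [1, 3, 4, 3]
print(dtw_distance(P, Q, w=2))
```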
Research Ideas
1. It is pointed out in Section 1 that pattern classification can be carried out
using approximate values of features. Is it possible to work out bounds on the
approximation for an acceptable level of classification?
Relevant References
(a) L. E. Ghaoui, G. R. G. Lanckriet and G. Natsoulis, Robust classification
with interval data. Technical Report UCB/CSD-03-1279, Computer Sci-
ence Division, University of California, Berkeley, 2003.
(b) Robust Classification, www.ims.nus.edu.sg/Programs/semidefinite/files/
IMS2006 Lect2.ppt [accessed on 25 October 2014].
(c) A. Ben-Tal, S. Bhadra, C. Bhattacharyya and J. Saketha Nath, Chance
constrained uncertain classification via robust optimization. Mathematical
Program, 127(1):145–173, 2011.
(d) A. Takeda, H. Mitsugi and T. Kanamori, A unified classification model
based on robust optimization. Neural Computation, 25:759–804, 2013.
2. Even though the domain of a variable is a large or/and possibly infinite set,
it may make sense to restrict the domain size to build a variety of classifiers.
For example, in document classification it is possible to ignore some terms as
illustrated by stemming in Section 3.1. How to exploit such a reduction in the
domain size in classification?
Relevant References
(a) A. Globerson and N. Tishby, Sufficient dimensionality reduction. Journal
of Machine Learning Research, 3:1307–1331, 2003.
(b) C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information
Retrieval. Cambridge: Cambridge University Press, 2008.
(c) T. Berka, web.eecs.utk.edu/events/tmw11/slides/Berka.pdf, Dimensional-
ity reduction for information retrieval using vector replacement of rare
terms, 2011.
(d) D. Wang and H. Zhang, Inverse-category-frequency based supervised term
weighting schemes for text categorization. Journal of Information Science
and Engineering, 29:209–225, 2013.
3. Given the training dataset, how do we learn an appropriate distance/similarity
function that could be used in classification? Is it possible to use different
similarity functions in different regions of the feature space?
Relevant References
(a) M. Gonen and E. Alpaydn, Multiple Kernel learning algorithms. Journal
of Machine Learning Research, 12:2211–2268, 2011.
Relevant References
(a) C.-M. Hsu and M.-S. Chen, On the design and applicability of distance
functions in high-dimensional data space. IEEE Transactions Knowledge
and Data Engineering, 21(4):523–536, 2009.
(b) C. C. Aggarwal, A. Hinneburg and D. A. Keim, On the surprising behavior
of distance metrics in high dimensional spaces. ICDT, 420–434, 2001.
Relevant References
(a) N. Lee and J. Kim, Conversion of categorical variables into numerical
variables via Bayesian network classifiers for binary classifications. Com-
putational Statistics & Data Analysis, 54:1247–1265, 2010.
(b) Natural Language Understanding, www.cs.stonybrook.edu/ ychoi/cse507/
slides/04-ml.pdf.
(c) I. W. Tsang, J. T. Kwok and P.-M. Cheung, Core vector machines: Fast
SVM training on very large data sets. JMLR, 6:363–392, 2005.
7. In Section 4.2, a variety of distance functions that violate one or more of
the metric properties have been detailed. Analyze further to rank the distance
functions based on the type of property violated.
Relevant References
(a) T. Skopal and J. Bustos, On nonmetric similarity search problems in com-
plex domains. ACM Computing Surveys, 43:34–50, 2011.
(b) M. Li, X. Chen, X. Li, B. Ma and P. Vitanyi, The similarity metric. IEEE
Transactions on Information Theory, 50:3250–3264, 2004.
8. Give example algorithms where the triangular inequality satisfied by the dis-
tance measure is exploited to simplify the learning algorithm.
Relevant References
(a) S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O’Callaghan, Clus-
tering data streams: Theory and practice. IEEE Transactions on Knowledge
Data and Engineering, 15:515–528, 2003.
(b) D. Arthur and S. Vassilvitskii, k-means++: The advantages of careful seed-
ing. SODA: 1027–1035, 2007.
Relevant References
(a) S. V. Dongen and A. J. Enright, Metric distances derived from cosine
similarity and Pearson and Spearman correlations, arXiv:1208.3145v1,
2012.
(b) C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information
Retrieval. Cambridge: Cambridge University Press, 2008.
10. How do we speed up the computation of DTW values further?
Relevant References
(a) T. Giorgino, Computing and visualizing dynamic time warping alignments
in R: The dtw package, cran.r-project.org/web/packages/dtw/vignettes/
dtw.pdf.
(b) Y. Sakurai, M. Yoshikawa and C. Faloutsos, FTW: Fast similarity search
under the time warping distance. PODS: 326–337, 2005.
Chapter 3
1. Filter methods
2. Wrapper methods
3. Embedded methods
The filter methods compute a score for each feature and then
select features according to the score. The wrapper methods score
feature subsets by seeing their performance on a dataset using a clas-
sification algorithm. The embedded methods select features during
the process of training.
The wrapper method finds the feature subset by searching over candidate subsets. For every subset generated, its performance is evaluated on a validation dataset using a classification algorithm, and the feature subset giving the best performance on the validation dataset is selected. Some of the methods used are listed below (a small sketch of sequential forward selection is given after the list):
1. Exhaustive enumeration
2. Branch and bound technique
3. Sequential selection
(a) Sequential Forward Selection
(b) Sequential Backward Selection
(c) Sequential Floating Forward Selection
(d) Sequential Floating Backward Selection
4. Min–max approach
5. Stochastic techniques like genetic algorithms (GA)
6. Artificial Neural Networks
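A minimal sketch of a wrapper based on sequential forward selection, assuming a simple 1-nearest-neighbor classifier as the evaluator; the data, split, and stopping rule (a fixed number of features) are illustrative choices.

```python
import numpy as np

def nn_accuracy(Xtr, ytr, Xva, yva, feats):
    """Validation accuracy of a 1-NN classifier restricted to the features in `feats`."""
    correct = 0
    for x, y in zip(Xva, yva):
        d = np.linalg.norm(Xtr[:, feats] - x[feats], axis=1)
        correct += int(ytr[d.argmin()] == y)
    return correct / len(yva)

def sequential_forward_selection(Xtr, ytr, Xva, yva, n_select):
    selected, remaining = [], list(range(Xtr.shape[1]))
    while len(selected) < n_select:
        # Add the feature whose inclusion gives the best validation accuracy.
        scores = {f: nn_accuracy(Xtr, ytr, Xva, yva, selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)      # only features 0 and 2 carry information
Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]
print(sequential_forward_selection(Xtr, ytr, Xva, yva, n_select=2))
```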
for j ≠ i, subject to x_i ∈ {0, 1} for i = 1, . . . , d, and \sum_{i} x_i = k,
N_{\bar{u}_t \bar{u}_l} = number of documents where both the term and the class are absent,
N = total number of documents.
If the distribution of the term in the whole document is the
same as its distribution in the class then MI = 0. If MI is large,
it means the term is in a document if and only if the document is in
the class. It makes sense to keep only informative terms and elimi-
nate non-informative terms so that the performance of the classifier
improves.
In the filter approach, a filter is used to discard features having a low value of MI. We can also use a backward filter, which discards a feature if its MI with the class falls below some threshold with probability p. A forward filter can also be used, which includes a feature if its MI with the class exceeds the threshold with probability p.
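A sketch of the MI filter idea for text, assuming binary term occurrence and document counts as the sufficient statistics; the counts below are made up for illustration.

```python
import math

def term_class_mi(n11, n10, n01, n00):
    """Mutual information between a binary term indicator and a binary class
    indicator, from four document counts: n11 = term present and document in
    the class, n10 = term present and not in the class, n01 = term absent and
    in the class, n00 = term absent and not in the class."""
    n = n11 + n10 + n01 + n00
    cells = [(n11, n11 + n10, n11 + n01),
             (n10, n11 + n10, n10 + n00),
             (n01, n01 + n00, n11 + n01),
             (n00, n01 + n00, n10 + n00)]
    return sum((c / n) * math.log2(n * c / (t * k))
               for c, t, k in cells if c > 0)

# A term that co-occurs strongly with the class scores much higher than a
# term that is distributed independently of the class (counts are made up).
print(round(term_class_mi(90, 10, 10, 890), 4))     # informative term
print(round(term_class_mi(50, 450, 50, 450), 4))    # independent term, MI = 0
```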
3. Chi-square Statistic
The chi-square statistic is used to determine if a distribution
of observed frequencies differs from the theoretical expected fre-
quencies. This non-parametric statistical technique uses frequencies
instead of using the mean and variances, since it uses categorical
data.
The chi-square statistic is given by

χ^2 = \sum_{i} \frac{(N_i - E_i)^2}{E_i},

where N_i is the observed frequency and E_i the expected frequency of the ith category.
If the χ2 value is larger than the one in the table giving χ2 distri-
butions for the degree of freedom, then it means that we need to reject
the hypothesis that they are independent. This means that since the
two are dependent, the occurrence of the term makes the occurrence
of the class more likely. This means the term is useful as a feature.
4. Goodman–Kruskal Measure
The Goodman–Kruskal measure λ measures the interaction between
a feature and a class. If there are two classes + and −, the measure
for a feature i is
λ_i = \frac{\sum_{j=1}^{v} \max(n_{j+}, n_{j-}) - \max(n_{+}, n_{-})}{n - \max(n_{+}, n_{-})},

where
n_{j+} = number of instances for which the feature takes the value V_j and the class is ‘+’,
n_{j-} = number of instances for which the feature takes the value V_j and the class is ‘−’,
n_{+} (n_{-}) = total number of instances in class ‘+’ (‘−’),
n = total number of instances, and
v = number of discrete values taken by the feature.
5. Laplacian Score
The Laplacian Score measures importance of a feature by its ability
for locality preserving. The features are given a score depending on
their locality preserving power. The algorithm is based on finding a
nearest neighbor graph for the set of nodes and finding the Laplacian
of the graph. Using the Laplacian, the Laplacian score is calculated
for every feature.
The algorithm is as follows:
1. Construct the nearest neighbor graph G for the set of points. For every pair of points i and j, an edge is placed between them if x_i is one of the k-nearest neighbors of x_j or x_j is one of the k-nearest neighbors of x_i.
D = diag(WI ).
The Laplacian L = D − W .
Then

\tilde{f}_i = f_i - \frac{f_i^T D I}{I^T D I} I.

The Laplacian score of the ith feature is

L_i = \frac{\tilde{f}_i^T L \tilde{f}_i}{\tilde{f}_i^T D \tilde{f}_i}. (1)

Subtracting \frac{f_i^T D I}{I^T D I} I from f_i removes the degree-weighted mean of the feature, so that

var(f_i) = \sum_{j} \tilde{f}_{ij}^2 D_{jj} = \tilde{f}_i^T D \tilde{f}_i.
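A compact sketch of the Laplacian score computation described above, assuming a symmetrized binary k-nearest-neighbor adjacency matrix as the weight matrix W (a heat-kernel weighting is also common); the data are illustrative.

```python
import numpy as np

def laplacian_scores(X, k=5):
    """Laplacian score of each feature (column of X), using a symmetrized
    binary k-nearest-neighbor graph as the weight matrix W."""
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[1:k + 1]] = 1.0   # skip the point itself
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))                     # D = diag(W I)
    L = D - W                                      # graph Laplacian
    ones = np.ones(n)

    scores = []
    for j in range(d):
        f = X[:, j]
        f_tilde = f - (f @ D @ ones) / (ones @ D @ ones) * ones
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ D @ f_tilde))
    return np.array(scores)

rng = np.random.default_rng(0)
base = rng.normal(size=(60, 1))
X = np.hstack([base + rng.normal(0, 0.1, (60, 1)),     # two features that follow
               base + rng.normal(0, 0.1, (60, 1)),     # the same local structure
               rng.normal(size=(60, 1))])              # one pure-noise feature
print(laplacian_scores(X).round(3))
```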
P = AΣB T ,
The k1 singular values which are larger are chosen. High singular
values correspond to dimensions which have more variability. The
dimensions with lower singular values correspond to dimensions with
less variability which may not be discriminative features for learning.
Classification which has to be performed on P which is an m × n
matrix can now be performed on the matrix ΣB T which is a k1 × n
matrix where k1 < m.
It is also possible to find a basis vector A which will transform
any vector from the original vector space to the new vector space.
For a training dataset T , the resulting SVD T = ÃΣ̃B̃ T will yield a
set of basis vectors which can be used for the range of P . Here T is
m × r where r is the number of points in the training dataset. To use
the resulting SVD for T to transform P , it is necessary to project
the columns of P onto a subspace spanned by the first k1 columns of
Ã. A is transformed by computing (Ã.1 , Ã.2 , . . . , Ã.k1 )T A. Hence the
original patterns can be transformed using (Ã.1 , Ã.2 , . . . , Ã.k1 )T and
classification can be carried out on the transformed patterns.
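A small sketch of the SVD-based reduction described above, assuming the data matrix P has patterns as columns (so P is m × n, as in the text); the matrix and the choice of k1 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k1 = 20, 100, 3                      # m features, n patterns, keep k1 dimensions
P = rng.normal(size=(m, 3)) @ rng.normal(size=(3, n)) + 0.01 * rng.normal(size=(m, n))

A, sigma, Bt = np.linalg.svd(P, full_matrices=False)   # P = A Sigma B^T

# Keep the k1 largest singular values; classification can then work on the
# k1 x n matrix Sigma_k1 B_k1^T instead of the m x n matrix P.
reduced = np.diag(sigma[:k1]) @ Bt[:k1, :]
print(reduced.shape)                       # (3, 100)

# New patterns (columns of P_new) are projected with the first k1 columns of A.
P_new = rng.normal(size=(m, 5))
print((A[:, :k1].T @ P_new).shape)         # (3, 5)
```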
\min_{B,H} D(X \| BH),

such that B, H ≥ 0.

In this formulation, B and H have to be non-negative. D(X \| BH) denotes the divergence of X from BH and is the cost function of the problem. If BH is denoted by Y, then

D(X \| Y) = \sum_{i,j} \left( x_{ij} \log \frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij} \right),
where the value Xij is generated by adding Poisson noise to the prod-
uct (BH )ij . The objective function O is subject to the non-negativity
constraint i.e. all the non-zero elements of B and H are positive. An
iterative procedure is used to modify the initial values of B and H
so that the product approaches X. This procedure terminates when
the approximation error converges or after a user-defined number of
iterations. The update formula for B and H is as follows:
B_{ik} \leftarrow B_{ik} \sum_{j} \frac{X_{ij}}{(BH)_{ij}} H_{kj},

B_{ik} \leftarrow \frac{B_{ik}}{\sum_{l} B_{lk}},

H_{kj} \leftarrow H_{kj} \sum_{i} \frac{X_{ij}}{(BH)_{ij}} B_{ik}.
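A runnable sketch of the multiplicative update rules above (divergence-based NMF); the matrix sizes, the rank k, the iteration count, and the small epsilon added for numerical safety are illustrative choices.

```python
import numpy as np

def nmf_divergence(X, k, iters=200, eps=1e-9):
    """Factorize a non-negative matrix X (m x n) as B (m x k) times H (k x n)
    using the multiplicative updates given above."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    B = rng.uniform(0.1, 1.0, (m, k))
    H = rng.uniform(0.1, 1.0, (k, n))
    for _ in range(iters):
        R = X / (B @ H + eps)                     # ratios X_ij / (BH)_ij
        B *= R @ H.T                              # B_ik <- B_ik * sum_j R_ij H_kj
        B /= B.sum(axis=0, keepdims=True) + eps   # B_ik <- B_ik / sum_l B_lk
        R = X / (B @ H + eps)
        H *= B.T @ R                              # H_kj <- H_kj * sum_i B_ik R_ij
    return B, H

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(12, 3))) @ np.abs(rng.normal(size=(3, 30)))
B, H = nmf_divergence(X, k=3)
print(np.abs(X - B @ H).mean())                   # reconstruction error, should be small
```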
The learned bases using NMF are not orthonormal to each other.
This is because to satisfy the non-negativity constraint the bases
cannot be orthonormal. One way of handling this is to consider the
\sqrt{\frac{d}{k}} \, \|P x_1 - P x_2\|.

The scaling fraction \sqrt{d/k} is required due to the decrease in the dimensionality of the data. This fraction is called the Johnson–Lindenstrauss (J–L) scaling term.
In many cases where RP is used, the elements pij are Gaussian
distributed. Using a simpler distribution, according to Achlioptas the
m_1 = \|a_1\|^2 = \sum_{i=1}^{d} a_{1,i}^2; \qquad m_2 = \|a_2\|^2 = \sum_{i=1}^{d} a_{2,i}^2, (5)

u = a_1^T a_2 = \sum_{i=1}^{d} a_{1,i} a_{2,i}; \qquad d = \|a_1 - a_2\|^2 = m_1 + m_2 - 2u. (6)
Then

E(\|b_1\|^2) = \|a_1\|^2 = m_1; \qquad var(\|b_1\|^2) = \frac{2}{k} m_1^2, (7)

E(\|b_1 - b_2\|^2) = d; \qquad var(\|b_1 - b_2\|^2) = \frac{2}{k} d^2, (8)

E(b_1^T b_2) = u; \qquad var(b_1^T b_2) = \frac{1}{k} (m_1 m_2 + u^2). (9)
This shows that distances between two points and inner products can be computed in k dimensions. If k ≪ d, then there is a lot of saving in time and space. Looking at var(\|b_1\|^2), var(\|b_1 - b_2\|^2) and var(b_1^T b_2) in Eqs. (7), (8) and (9), it can be seen that Random Projections preserve the pairwise distances in the expected sense.
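An illustrative check of this distance-preserving behavior, assuming a k × d projection matrix with i.i.d. N(0, 1/k) entries (one common convention); the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10000, 200
a1, a2 = rng.normal(size=d), rng.normal(size=d)

# Random projection with i.i.d. N(0, 1/k) entries: b = R a.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
b1, b2 = R @ a1, R @ a2

print("original squared distance :", round(np.sum((a1 - a2) ** 2), 1))
print("projected squared distance:", round(np.sum((b1 - b2) ** 2), 1))
print("original inner product    :", round(a1 @ a2, 1))
print("projected inner product   :", round(b1 @ b2, 1))
```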
For points p and q in R^d that are far apart, there is a low probability P2 that they fall into the same bin, where s gives the number of bins. When the CS_j values are sorted in decreasing order, if the class separabilities of two features differ by a very small value, say 0.001, then the feature having the smaller distance is eliminated.
where
e = error rate,
t = feasibility threshold,
m = scale factor.
This penalty function is monotonic with respect to e. If e < t, then
p(e) is negative and, as e approaches zero, p(e) slowly approaches its
minimal value.
If e = t, then p(e) = 0 and
if e = t + m, then p(e) = 1.
For greater values of the error rate the penalty function quickly rises
towards infinity.
written as:

c(s) = n_{s1} + p,

cr(s) = \frac{1}{n_{s1}} \sum_{i=1}^{|A|} nu_1(s, S_A(i)),
Most approaches use the same training set both for constructing
the predictor and for evaluating its error.
An evaluation measure such as mean average precision (MAP) or
Normalized discounted cumulative gain (NDCG) is used. The features
are then sorted according to the score and this ordered list of features
is considered for feature selection.
Some feature ranking algorithms are discussed here.
1. MAP
MAP measures the precision of the ranking results. If there are
two classes, the positive and negative class, precision measures the
accuracy of the top n results to a query. It is given by:

P(n) = \frac{\text{number of positive instances within the top } n}{n}.
Average precision of a query is

P_{av} = \frac{\sum_{n=1}^{N} P(n) \times pos(n)}{\text{number of positive instances}},

where pos(n) is 1 if the nth result is a positive instance and 0 otherwise.
2. F-score
The F-score of the jth feature is given by

F(j) = \frac{\left( \bar{x}_j^{(+)} - \bar{x}_j \right)^2 + \left( \bar{x}_j^{(-)} - \bar{x}_j \right)^2}{\frac{1}{n_+ - 1} \sum_{i=1}^{n_+} \left( x_{i,j}^{(+)} - \bar{x}_j^{(+)} \right)^2 + \frac{1}{n_- - 1} \sum_{i=1}^{n_-} \left( x_{i,j}^{(-)} - \bar{x}_j^{(-)} \right)^2}, (11)

where
n_+ = number of positive instances,
n_- = number of negative instances,
\bar{x}_j = average of the jth feature in the whole dataset,
\bar{x}_j^{(+)} = average of the jth feature in the positively labeled instances,
\bar{x}_j^{(-)} = average of the jth feature in the negatively labeled instances,
x_{i,j}^{(+)} = jth feature of the ith positive instance,
x_{i,j}^{(-)} = jth feature of the ith negative instance.
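A small sketch computing this F-score for each feature of a two-class dataset; the data are synthetic and the formula follows the standard F-score definition assumed in the reconstruction above.

```python
import numpy as np

def f_scores(X, y):
    """F-score of every feature of a binary labeled dataset (y in {0, 1})."""
    Xp, Xn = X[y == 1], X[y == 0]
    mean_all, mean_p, mean_n = X.mean(0), Xp.mean(0), Xn.mean(0)
    numer = (mean_p - mean_all) ** 2 + (mean_n - mean_all) ** 2
    denom = Xp.var(0, ddof=1) + Xn.var(0, ddof=1)   # the two (n-1)-normalized variances
    return numer / denom

rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 50)
X = rng.normal(size=(100, 4))
X[:, 0] += 2 * y                        # feature 0 separates the classes well
print(f_scores(X, y).round(3))          # feature 0 gets by far the largest score
```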
\min_{w,b} \frac{1}{2} w^T w + C \sum_{i=1}^{l} E(w, b; X_i, y_i), (12)
or

\bar{x}_i = \sum_{j = \frac{d}{k}(i-1) + 1}^{\frac{d}{k} i} x_j.
Research Ideas
1. It is possible to view feature selection as a specialization of either linear or
nonlinear feature extraction. Under what conditions can feature extraction be
preferred over feature selection? The improved performance could be in terms
of space, time and/or accuracy.
Relevant References
(a) V. S. Devi and M. N. Murty, Pattern Recognition: An Introduction.
Hyderabad, India: Universities Press, 2012.
(b) M. N. Murty and V. S. Devi, Pattern Recognition: An Algorithmic
Approach. New York: Springer, 2012.
(c) J. Kersten, Simultaneous feature selection and Gaussian mixture model
estimation for supervised classification problems. Pattern Recognition,
47:2582–2595, 2014.
(d) D. Zhang, J. He, Y. Zhao, Z. Luo and M. Du, Global plus local: A complete
framework for feature extraction and recognition. Pattern Recognition,
47:1433–1442, 2014.
(e) G. Wang, Q. Song, H. Sun, X. Zhang, B. Xu and Y. Zhou, A feature subset
selection algorithm automatic recommendation method. JAIR, 47:1–34,
2013.
2. MI has been popularly exploited in feature selection. How can we reduce the
number of features selected by such a method to get better accuracy? Will it
help in improving the scalability of the feature selection scheme?
Relevant References
(a) G. Herman, B. Zhang, Y. Wang, G. Ye and F. Chen, Mutual information-
based method for selecting informative feature sets. Pattern Recognition,
46(12):3315–3327, 2013.
(b) H. Liu, J. Sun, L. Liu and H. Zhang, Feature selection with dynamic mutual
information. Pattern Recognition, 42:1330–1339, 2009.
(c) P. M. Chinta and M. N. Murty, Discriminative feature analysis and selection
for document classification. Proceedings of ICONIP, 2012.
(d) J. Dai and Q. Xu, Attribute selection based on information gain ratio in
fuzzy rough set theory with application to tumor classification. Applied
Soft Computing, 13:211–221, 2013.
3. Feature selection based on MI and feature selection based on the chi-square test both perform reasonably well on large datasets. How do the two compare?
Relevant References
(a) C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information
Retrieval. Cambridge: Cambridge University Press, 2008.
(b) P. M. Chinta and M. N. Murty, Discriminative feature analysis and selection
for document classification. Proceedings of ICONIP, 2012.
(c) S. R. Singh, H. A. Murthy and T. A. Gonsalves, Feature selection for text
classification based on gini coefficient of inequality. JMLR Workshop and
Conference Proceedings, 10:76–85, 2010.
4. Principal components are the eigenvectors of the covariance matrix of the data.
The first principal component is in the maximum variance direction; the second component is orthogonal to the first and corresponds to the next variance direction, and so on. Show using a simple two-dimensional example that the
second principal component is better than the first principal component for
discrimination. However, there are popular schemes like the latent semantic
indexing which use the principal components of the high-dimensional data
successfully. What could be the reason behind this?
Relevant References
(a) M. N. Murty and V. S. Devi, Pattern recognition, Web course, NPTEL,
2012, http://nptel.iitm.ac.in/courses.php [accessed on 2 November 2014].
(b) S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and
R. Harshman, Indexing by latent semantic analysis. Journal of the Amer-
ican Society for Information Science, 41:391–407, 1990.
(c) M. Prakash and M. N. Murty, A genetic approach for selection of (near-)
optimal subsets of principal components for discrimination. PR Letters,
16:781–787, 1995.
Relevant References
(a) D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative
matrix factorization. Nature, 401:788–791, 1999.
(b) D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factoriza-
tion. Advances in Neural Information Processing Systems, 13:556–562,
2001.
(c) C. C. Aggarwal and C. K. Reddy, Data Clustering: Algorithms and Appli-
cations. New York: CRC Press, 2014.
(d) C. Thurau, K. Kersting, M. Wahabzada and C. Bauckhage, Convex non-
negative matrix factorization for massive datasets. Knowledge and Infor-
mation Systems, 29:457–478, 2011.
6. One of the issues with NMF is that the resulting factorization may lead to a
local minimum of the optimization function considered. What are the different
ways of improving upon this?
Relevant References
(a) A. Korattikara, L. Boyles, M. Welling, J. Kim and H. Park, Statistical
optimization of non-negative matrix factorization. Proceedings of The
Fourteenth International Conference on Artificial Intelligence and Statis-
tics, JMLR: W&CP 15, 2011.
(b) F. Pompili, N. Gillis, P.-A. Absil and F. Glineur, Two algorithms for ortho-
gonal nonnegative matrix factorization with application to clustering.
CoRR abs/1201.0901, 2014.
(c) V. Bittorf, B. Recht, C. Re and J. A. Tropp, Factoring nonnegative matrices
with linear programs. CORR abs/1206.1270, 2013.
7. It is possible to project d-dimensional patterns to k-dimensional patterns using random projections where the random entries come from a Gaussian with zero mean and unit variance. Then if Xi and Xj are a pair of patterns in the d space and the corresponding patterns after projection into the k space are Xi′ and Xj′, then it is possible to show that, with probability greater than or equal to 1 − n^{−β},

(1 − ε) \|Xi − Xj\|^2 ≤ \|Xi′ − Xj′\|^2 ≤ (1 + ε) \|Xi − Xj\|^2,

for positive ε and β, provided k is any number greater than k_{min} = \frac{4 + 2β}{ε^2/2 − ε^3/3} \log n.
How do we appreciate the role of the various quantities β, ε, n, and k? What can happen to the bounds when these parameters are varied within their legal ranges?
Relevant References
(a) A. K. Menon, Random projections and applications to dimensionality
reduction. BS (advanced) thesis, School of Info. Tech., University of Syd-
ney, 2007.
(b) P. Li, T. J. Hastie and K. W. Church, Very sparse random projections.
Proceedings of KDD, 2006.
(c) R. J. Durrant and A. Kaban, Sharp generalization error bounds for
randomly-projected classifiers. Proceedings of ICML, 2013.
8. Even though GAs have been used in feature selection and extraction, algo-
rithms based on GAs cannot scale up well; specifically, a steady-state GA may be very slow in converging. How can they be made to scale up well?
Relevant References
(a) I. Rejer and K. Lorenz, Genetic algorithm and forward method for feature
selection in EEG feature space. JTACS, 7:72–82, 2013.
(b) A. Ekbal and S. Saha, Stacked ensemble coupled with feature selection
for biomedical entity extraction. Knowledge-Based Systems, 46:22–32,
2013.
(c) D. Dohare and V. S. Devi, Combination of similarity measures for time
series classification using genetic algorithms. IEEE Congress on Evolu-
tionary Computation, 2011.
(d) D. Anand, Article: Improved collaborative filtering using evolutionary
algorithm based feature extraction. International Journal of Computer
Applications, 64:20–26, 2013.
9. For a linear SVM, the weight vector can be written as W = \sum_{i \in S} \alpha_i y_i X_i,
where S is the set of support vectors, yi is the class label of Xi which is either
+1 or −1, and αi is the Lagrange variable associated with Xi . So, how can
such a weight vector W be useful in ranking features?
Relevant References
(a) Y.-W. Chang and C.-J. Lin, Feature ranking using linear SVM. JMLR
Workshop and Conference Proceedings, pp. 53–64, 2008.
(b) Y.-W. Chen and C.-J. Lin, Combining SVMs with various feature selec-
tion strategies. In Feature Extraction, Foundations and Applications,
I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh (eds.). New York: Springer,
2006.
(c) J. Wang, S. Zhou, Y. Yi and J. Kong, An improved feature selection based
on effective range for classification. The Scientific World Journal, 1–8,
2014.
(d) H. Li, C.-J. Li, X.-J. Wu and J. Sun, Statistics-based wrapper for fea-
ture selection: An implementation on financial distress identification with
support vector machine. Applied Soft Computing, 19:57–67, 2014.
10. Feature selection based on F-score has been effectively used in several practical applications. What is the reason for its success?
Relevant References
(a) H.-Y. Lo et al., An ensemble of three classifiers for KDD cup 2009:
Expanded linear model, heterogeneous boosting, and selective naive bayes.
JMLR: Workshop and Conference Proceedings, 7:57–64, 2009.
(b) Y.-W. Chen and C.-J. Lin, Combining SVMs with Various Feature selection
strategies. In Feature Extraction, Foundations and Applications. Berlin:
Springer, 2006, pp. 315–324.
(c) J. Xie, J. Lei, W. Xie, Y. Shi and X. Liu, Two-stage hybrid feature selection
algorithms for diagnosing erythemato-squamous diseases. Health Infor-
mation Science and Systems, 1:10, 2013.
11. Time series data can be large in several applications. How does one extract features
for meaningful classification?
Relevant References
(a) P. K. Vemulapalli, V. Monga and S. N. Brennan, Robust extrema features
for time–series data analysis. IEEE Transactions on PAMI, 35:1464–1479,
2013.
(b) M. G. Baydogan, G. Runger and E. Tuv, A bag-of-features framework to
classify time series. IEEE Transactions on PAMI, 35:2796–2802, 2013.
(c) Q. Wang, X. Li and Q. Qin, Feature selection for time series model-
ing. Journal of Intelligent Learning Systems and Applications, 5:152–164,
2013.
(d) B. D. Fulcher and N. S. Jones, Highly comparative, feature-based time-
series classification, CoRR abs/1401.3531, 2014.
Chapter 4
Bayesian Learning
1. Document Classification
In the Bayesian approach, typically we exploit the Bayes rule to
convert the prior probabilities to posterior probabilities based on the
data under consideration. For example, let us consider a collection
of documents
D = {(d1, C), (d2, C), . . . , (d_{n1}, C), (d_{n1+1}, C̄), . . . , (d_n, C̄)},

where we have n1 documents from class C and n − n1 documents
from class C̄. Now a new document d can be classified using the Bayes rule.
Noting that out of the five labeled patterns, three are from sports,
an estimate for P(Sports) is 3/5. Similarly, P(Politics) = 2/5. However,
the frequency-based estimate of P(d6|Sports) is zero because none of the
training documents in the class sports matches d6; similarly, the estimate
of P(d6|Politics) becomes zero. So, it becomes difficult to obtain a
meaningful estimate of either P(d6|Sports) or P(d6|Politics).
Now we can use these estimates and the prior probabilities given
by P(Sports) = 3/5 and P(Politics) = 2/5 to obtain the posterior
probabilities using the following:

P(Sports|d6) = P(d6|Sports)P(Sports) / [P(d6|Sports)P(Sports) + P(d6|Politics)P(Politics)] = 0.24,

P(Politics|d6) = P(d6|Politics)P(Politics) / [P(d6|Sports)P(Sports) + P(d6|Politics)P(Politics)] = 0.76.
Example 2. Let a coin be tossed n times and let the number of times a head
shows up be nh; then the probability of the coin showing a head, P(h), is
estimated using

P(h) = nh / n.

Specifically, let the coin be tossed six times with a head showing up four times,
so P(h) = 4/6 = 2/3. However, in another coin tossing experiment, if there are
0 (zero) heads out of five tosses of the coin, then the estimate is P(h) = 0/5 = 0.
This is the problem with the frequency-based estimation scheme; the estimate
may not be accurate when the experiment is conducted a small number of times
or, equivalently, when the dataset size is small.
One way to improve the quality of the estimate is to integrate
any prior knowledge we have into the process of estimation. A simple
way of doing this is to place a prior distribution on the parameter being
estimated, as discussed in the following sections.
4. Posterior Probability
The most important feature of Bayesian learning is to exploit Bayes
rule to convert the prior probability into the posterior probability.
Specifically let C be the class label and d be the observed document.
Then
• Prior probability: P (C)
• Posterior probability: P (C|d); it is the probability of the class after
observing d.
• Using Bayes rule, we have

P(C|d) = P(d|C) × P(C) / [P(d|C) × P(C) + P(d|C̄) × P(C̄)].
Once we have the posterior probabilities, we can assign d to class C
if P(C|d) > P(C̄|d); else assign d to C̄. Equivalently, we assign d
to class C if P(C|d)/P(C̄|d) > 1. We can simplify the expressions, when P(C|d)
and P(C̄|d) are exponential functions, by assigning d to class C if
log P(C|d) > log P(C̄|d). We consider an example involving univariate
normal densities next; note that the univariate normal is a member
of the exponential family of distributions.
Note that P(C|d) has the same denominator as P(C̄|d); they differ
only in their numerator values. So,

P(C|d) / P(C̄|d) = exp(−(1/2)(x − µ)^2) / exp(−(1/2)(x − µ̄)^2),

where µ and µ̄ are the means associated with classes C and C̄ respectively.
In this ratio we have exponentials in both the numerator and the denominator;
so, it is convenient to compare log(P(C|d)) and log(P(C̄|d)). We assign d to class C if

−(1/2)(x − µ)^2 > −(1/2)(x − µ̄)^2,

or equivalently if

(x − µ̄)^2 > (x − µ)^2.
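As an illustration of the comparison just described, the following is a minimal sketch (not from the text) that compares the log-posterior numerators of two univariate normal class-conditionals; the means, variance, priors and data value are made-up values.

    import math

    def log_posterior_numerator(x, mu, sigma, prior):
        """log of P(x|class) * P(class) for a univariate normal class-conditional."""
        log_likelihood = -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        return log_likelihood + math.log(prior)

    # Hypothetical parameters: class C has mean 1.0, class C-bar has mean 3.0,
    # both with unit variance; priors 0.6 and 0.4.
    x = 1.8
    score_C = log_posterior_numerator(x, mu=1.0, sigma=1.0, prior=0.6)
    score_Cbar = log_posterior_numerator(x, mu=3.0, sigma=1.0, prior=0.4)

    # Assign x to C if its log-posterior (up to the shared denominator) is larger.
    label = "C" if score_C > score_Cbar else "C-bar"
    print(label)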
5. Density Estimation
We have noted earlier that in order to use the Bayes classifier, it is
important to have the prior probabilities and the probability density
function of each class. One of the simplest ways is to assume that the
form of the density function is known and the parameter underlying
the density function is unknown. Estimation of the density function
under these conditions is called parametric estimation. In the frequentist
approach, the parameters are assumed to be unknown
but deterministic. On the contrary, in the Bayesian approach the
parameters are assumed to be random variables. We examine para-
metric estimation using these schemes in this section.
Consider n independent Bernoulli observations

D = {X1, X2, . . . , Xn}, with Xi ∈ {0, 1},

where p1 is the probability of observing 1. The likelihood of the data is

P(D|p1) = Π_{i=1}^{n} p1^{Xi} (1 − p1)^{1−Xi}.

Taking the logarithm, differentiating with respect to p1 and equating to zero gives

(Σ_{i=1}^{n} Xi) / p1 − (n − Σ_{i=1}^{n} Xi) / (1 − p1) = 0.
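Solving the above equation gives the maximum likelihood estimate p1 = Σ_i Xi / n. A minimal sketch (the function name and data are mine):

    def bernoulli_mle(samples):
        """Maximum likelihood estimate of p1 from 0/1 observations: sum(X_i) / n."""
        return sum(samples) / len(samples)

    # Hypothetical data: six tosses with four heads, as in Example 2.
    print(bernoulli_mle([1, 1, 0, 1, 1, 0]))  # 0.666...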
Assuming a uniform prior P(p1) over [0, 1], the posterior density of p1 is

P(p1|D) = P(D|p1)P(p1) / ∫ P(D|p1)P(p1) dp1

= p1^{ΣXi} (1 − p1)^{n−ΣXi} / ∫_0^1 p1^{ΣXi} (1 − p1)^{n−ΣXi} dp1

= p1^{ΣXi} (1 − p1)^{n−ΣXi} × (n + 1)! / [(ΣXi)! (n − ΣXi)!],

where all the sums run over i = 1, . . . , n. The probability of a new observation
X ∈ {0, 1} given the data D is then

P(X|D) = ∫_0^1 p1^X (1 − p1)^{1−X} P(p1|D) dp1

= (n + 1)! / [(ΣXi)! (n − ΣXi)!] × ∫_0^1 p1^{X+ΣXi} (1 − p1)^{n+1−(X+ΣXi)} dp1

= (n + 1)! / [(ΣXi)! (n − ΣXi)!] × (X + ΣXi)! (n + 1 − (X + ΣXi))! / (n + 2)!.

By simplifying the above expression we get

P(X|D) = (X + ΣXi)! (n + 1 − (X + ΣXi))! / [(ΣXi)! (n − ΣXi)! (n + 2)].
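As a quick numerical check of the final expression, the sketch below evaluates P(X|D) for a small set of hypothetical observations; for X = 1 it agrees with (ΣXi + 1)/(n + 2), the Laplace "rule of succession" form implied by the formula.

    from math import factorial

    def predictive(X, xs):
        """P(X|D) for a Bernoulli model with a uniform prior, using the closed form above."""
        n, s = len(xs), sum(xs)
        return (factorial(X + s) * factorial(n + 1 - (X + s))
                / (factorial(s) * factorial(n - s) * (n + 2)))

    xs = [1, 0, 1, 1, 0]           # hypothetical observations
    print(predictive(1, xs))        # equals (sum(xs) + 1) / (n + 2) = 4/7
    print((sum(xs) + 1) / (len(xs) + 2))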
f_i = C / r_i,   (4)

where r_i is the rank of the ith term; the most frequent term has
the smallest rank of 1 and hence its frequency is the maximum,
given by C, and the least frequent term has rank value C and hence
its frequency is C/C = 1.
n_k = M / k^α,

where n_k is the number of nodes of degree k and M and α are
some constants. So, using an appropriate prior density that
may not be uniform may make sense based on the context.
In such a case it may be difficult to get simple closed form
expressions for the densities in general. However, it is theo-
retically possible to simplify the analysis by assuming suitable
complementary or conjugate forms for the prior density. Even
when closed form expression is not obtained for the density
P (X|D), it is possible to consider some simplifying scenarios
where mode, mean or maximum values of the posterior densi-
ties are used. In order to explain the notion of conjugate prior
we consider another popular distribution.
• Binomial distribution: It is one of the popularly encountered distributions.
For example, consider tossing a coin n times out of which
we get n_h heads and n_t tails; let the probability of a head in a single
toss be p_h. Then the probability of this event is given by

Bin(n_h heads out of n tosses) = (n choose n_h) p_h^{n_h} (1 − p_h)^{n−n_h}.
6. Conjugate Priors
Note that both the Bernoulli and the Binomial distributions have likelihood
functions, based on a parameter q ∈ [0, 1], proportional to
q^a (1 − q)^b, where a and b are constants. This suggests that we choose
a prior density that is also proportional to q^a (1 − q)^b so as to ensure
that the posterior has a similar form; the Beta density has such a form.
Its probability density is given by

Beta(q|a, b) = [Γ(a + b) / (Γ(a)Γ(b))] q^{a−1} (1 − q)^{b−1},

where

Γ(a) = ∫_0^∞ v^{a−1} e^{−v} dv = (a − 1) ∫_0^∞ v^{a−2} e^{−v} dv = (a − 1)Γ(a − 1).
Note that any probability density function p(x) satisfies the following
properties:
1. p(x) ≥ 0 for all x,
2. ∫ p(x) dx = 1.
So, for the Beta density,

∫_0^1 [Γ(a + b) / (Γ(a)Γ(b))] q^{a−1} (1 − q)^{b−1} dq = 1,
or equivalently

[Γ(a + b) / (Γ(a)Γ(b))] ∫_0^1 q^{a−1} (1 − q)^{b−1} dq = 1.

The mean of the Beta density is

E[q] = ∫_0^1 q [Γ(a + b) / (Γ(a)Γ(b))] q^{a−1} (1 − q)^{b−1} dq
     = [Γ(a + b) / (Γ(a)Γ(b))] ∫_0^1 q^{a} (1 − q)^{b−1} dq
     = [Γ(a + b) / (Γ(a)Γ(b))] × [Γ(a + 1)Γ(b) / Γ(a + b + 1)]
     = [Γ(a + b) / (Γ(a)Γ(b))] × [a Γ(a)Γ(b) / ((a + b)Γ(a + b))].

By canceling terms that are present in both the numerator and the
denominator, we get

E[q] = a / (a + b).

Using the Beta density as the prior for q and the Bernoulli likelihood P(D|q),
the posterior is

P(q|D) = P(D|q)P(q) / ∫_0^1 P(D|q)P(q) dq.
P(q|D) = C1 q^{ΣXi} (1 − q)^{n−ΣXi} · C2 q^{a−1} (1 − q)^{b−1},

where C2 is Γ(a + b)/(Γ(a)Γ(b)); by using C for C1 × C2, we get

P(q|D) = C q^{a+ΣXi−1} (1 − q)^{n+b−ΣXi−1},

where the sums run over i = 1, . . . , n.
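The posterior therefore has the same Beta form as the prior, with parameters a + ΣXi and b + n − ΣXi. A minimal sketch of this conjugate update (the function name and prior values are mine), which also shows how the zero-heads problem of Example 2 is avoided:

    def beta_posterior(a, b, xs):
        """Conjugate update for Bernoulli data under a Beta(a, b) prior.

        The posterior is Beta(a + sum(xs), b + n - sum(xs)); its mean follows
        from E[q] = a / (a + b) applied to the updated parameters.
        """
        n, s = len(xs), sum(xs)
        a_post, b_post = a + s, b + n - s
        return a_post, b_post, a_post / (a_post + b_post)

    # Hypothetical prior Beta(2, 2) and five tosses with zero heads:
    print(beta_posterior(2, 2, [0, 0, 0, 0, 0]))  # (2, 7, 0.222...) -- no longer estimated as 0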
For the multinomial distribution, it is possible to show that the mean of m_i, the
number of times outcome i occurs in m trials with probability p_i, is E[m_i] = m p_i.
Observe that this result is similar to the mean of the binomial;
similarly, it is possible to show that the variance of m_i is m p_i (1 − p_i).
We discuss next how the Dirichlet prior is the conjugate to multi-
nomial; this result has been significantly exploited in the machine
learning literature during the past decade in the form of soft cluster-
ing based on latent Dirichlet allocation and its variants. It is possible
to show that the likelihood function corresponding to multinomial is
given by
P(D|p) ∝ Π_{j=1}^{k} p_j^{M_j},

and the Dirichlet prior density is

P(p) = [Γ(a_1 + a_2 + · · · + a_k) / (Γ(a_1) · · · Γ(a_k))] Π_{j=1}^{k} p_j^{a_j − 1}.
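Since the Dirichlet is conjugate to the multinomial, multiplying the likelihood and the prior above gives a Dirichlet posterior with parameters a_j + M_j. A minimal sketch of the update (names and numbers are mine):

    def dirichlet_posterior(alphas, counts):
        """Conjugate update: Dirichlet(a_1,...,a_k) prior + multinomial counts M_j
        gives a Dirichlet(a_1 + M_1, ..., a_k + M_k) posterior."""
        post = [a + m for a, m in zip(alphas, counts)]
        total = sum(post)
        means = [a / total for a in post]   # posterior mean of each p_j
        return post, means

    # Hypothetical symmetric prior over k = 3 outcomes and observed counts M = (5, 2, 1):
    print(dirichlet_posterior([1.0, 1.0, 1.0], [5, 2, 1]))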
Research Ideas
1. In Section 2, we have discussed the Naive Bayes classifier (NBC) which assumes
that the terms are independent of each other given the class. It is not difficult
to realize that this could be a gross simplification. Then why should NBC
work well?
Relevant References
(a) I. Rish, An empirical study of the Naive Bayes classifier. International Joint
Conferences on Artificial Intelligence workshop on empirical methods in
artificial intelligence, 2001.
(b) L. Jiang, D. Wang and Z. Cai, Discriminatively weighted Naive Bayes and
its application in text classification. International Journal of Artificial Intel-
ligence Tools, 21(1), 2012.
(c) C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information
Retrieval. Cambridge: Cambridge University Press, 2008.
2. It is possible to use classifiers in feature selection. How does one use the NBC
in feature selection?
Relevant References
(a) C.-H. Lee, F. Gutierrez and D. Dou, Calculating feature weights in Naive
Bayes with Kullback–Leibler Measure. IEEE International Conference on
Data Mining, 2011.
(b) J. Chen, H. Huang, S. Tian and Y. Qu, Feature selection for text classification
with Naive Bayes. Expert Systems with Applications, 36(3):5432–5435,
2009.
(c) Z. Zeng, H. Zhang, R. Zhang and Y. Zhang, A hybrid feature selection
method based on rough conditional mutual information and Naive Bayesian
classifier. ISRN Applied Mathematics, 2014:1–11, 2014.
Relevant References
(a) S. Dey and M. N. Murty, Using discriminative phrases for text catego-
rization. 20th International Conference on Neural Information Processing,
2013.
(b) M. Yuan, Y. X. Ouyang and Z. Xiong, A text categorization method using
extended vector space model by frequent term sets. Journal of Information
Science and Engineering, 29:99–114, 2013.
(c) D. Gujraniya and M. N. Murty, Efficient classification using phrases gener-
ated by topic models. In Proceedings of International Conference on Pattern
Recognition, 2012.
4. It is possible to extend the MDC discussed in Section 4 to deal with more
than two classes. Specifically, if there are n classes corresponding to the n
training patterns, then each class may be viewed as drawn from a normal density
with mean at the point and the covariance matrix is of the form 0I. In such a
case the MDC converges to Nearest Neighbor Classifier. Is this interpretation
meaningful?
Relevant References
(a) R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Second
Edition. New York: Wiley Interscience, 2000.
(b) J. Ye, Multiple Closed-Form local metric learning for K-nearest neighbor
classifier. CoRR abs/1311.3157, 2013.
(c) J. Liu, X. Pan, X. Zhu and W. Zhu, Using phenological metrics and the
multiple classifier fusion method to map land cover types. Journal of Applied
Remote Sensing, 8: 2014.
5. We have discussed conjugate priors in Section 6. There could be other kinds
of priors that may help in getting closed form expressions. How can they be
used?
Relevant References
(a) M. I. Jordan, Jeffrey’s Priors and Reference Priors, Lecture 7, 2010.
(www.cs.berkeley.edu/jordan/courses/260-spring10/.../lecture7.pdf)
(b) R. Yang and J. O. Berger, Estimation of a covariance matrix using the
reference prior. The Annals of Statistics, 22(3):1195–1211, 1994.
(c) M. D. Branco, M. G. Genton and B. Liseo, Objective Bayesian analysis of
skew-t distributions. Scandinavian Journal of Statistics Theory and Appli-
cations, 40(1):63–85, 2013.
(d) C. Hu, E. Ryu, D. Carlson, Y. Wang and L. Carin, Latent Gaussian models
for topic modeling. JMLR Workshop and Conference Proceedings, 2014.
6. It is analytically convenient to assume that the prior is Dirichlet and the likelihood
is multinomial when finding clusters of documents. However, because
we know that the frequency distribution of terms satisfies a power law as specified
by Zipf, does it make sense to consider other forms of prior densities?
Relevant References
(a) D. M. Blei, Probabilistic topic models. Communications of the ACM,
55(4):77–84, 2012.
(b) C. Wang and D. M. Blei, Variational inference in non-conjugate models.
Journal of Machine Learning Research, 14(1):1005–1031, 2013.
(c) D. Newman, E. V. Bonilla and W. Buntine, Improving topic coherence with
regularized topic models. Proceedings of Neural Information Processing
Systems, 2011.
7. Most of the time, priors are selected based on analytical tractability rather than
semantic requirements. For example, the Dirichlet is mathematically convenient
for dealing with the frequency distribution of terms in a document collection where
the likelihood is characterized by a multinomial. However, Zipf's curve based
on empirical studies gives a better prior in this case. Similarly Wikipedia offers
a rich semantic input to fix the prior in case of clustering and classification of
documents. How does such empirical data help in arriving at more appropriate
Bayesian schemes?
Relevant References
(a) C. M. Bishop, Pattern Recognition and Machine Learning. Singapore:
Springer, 2008.
Chapter 5
Classification
The test pattern is given a class label which is the class label of the
pattern closest to it in the training data. If there are n training pat-
terns (X1 , w1 ), (X2 , w2 ), . . . , (Xn , wn ), we need to classify a pattern
P , and if DP i is the distance between P and pattern Xi , then if
DP k = min DP i where i = 1, 2, . . . , n, then P is assigned the class of
pattern Xk which will be wk .
Other algorithms based on the NN rule are the k-nearest
neighbor (kNN) algorithm, the modified k-nearest neighbor
(mkNN) algorithm, and the r-nearest neighbor (rNN) algorithm.
None of these algorithms needs to build a model for
classification from the training data. Hence, no learning takes place
except for fixing the parameter k, whose value is crucial to the
performance of the classifier. It can therefore be seen that these
algorithms do not need any time for learning a classification model;
classification algorithms which carry out classification without going
through a learning phase have no design time (or training time).
These algorithms are robust. Asymptotically, the 1NN classifier has an
error rate less than twice the Bayes error rate, which is the optimal
error rate; similarly, the kNN classifier attains the optimal (Bayes)
error rate asymptotically.
Consider the set of points in Figure 5.1. It is a two-dimensional
dataset with two features f1 and f2 . The nearest neighbor or the 1NN
classifier assigns the label of the closest neighbor to a test point.
Test data P will be classified as belonging to class ‘square’ as its
closest neighbor belongs to that class. Q will be classified as belonging
to the class ‘circle’ as it is closest to point 7. R will be classified as
belonging to class ‘circle’ as it is closest to point 7. In the case of
point P , there is no ambiguity and the 1NN classifier works well. In
the case of point Q, even though it is closest to class ‘circle’, since it
is on the boundary of the class ‘cross’ and class ‘circle’, there is some
ambiguity. If kNN is used with k = 5, the points Q and R are labeled
as belonging to class ‘cross’ and not class ‘circle’. In the case of
mkNN, the distances of the test pattern from the k neighbors is also
taken into account. Out of the k neighbors, if dmin is the distance of
the closest neighbor and dmax is the distance of the furthest neighbor
out of the k neighbors, then the weight given to the class of neighbor
[Figure 5.1. A two-dimensional dataset with features f1 and f2; the numbered training points belong to the classes 'cross' (x), 'circle' (o) and 'square', and P, Q and R are test points.]
i is

w_i = (d_max − d_i) / (d_max − d_min).
The NN weight w1 is set to 1. The score of every class is initialized
to 0, i.e. scorei = 0, i = 1, . . . , c.
For every neighbor i out of the k neighbors, if the point belongs
to class j, scorej is incremented by wi . After doing this for the k
neighbors, the test pattern belongs to the class having the largest
score. In the case of pattern R, using kNN it is given the class label
‘cross’. Using mkNN, since 7 is the closest neighbor, class ‘circle’
is given a weightage of 1. The other points 4, 2, and 5 belong to
class ‘cross’. The 5th neighbor is quite far away from R. If the score
aggregated to class ‘cross’ is more than the score for class ‘circle’, R
will be assigned to class 'cross'. Thus, kNN and mkNN may classify
the same test pattern differently.
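The following is a minimal sketch of the mkNN weighting and voting just described (the function name and the example distances are mine):

    from collections import defaultdict

    def mknn_classify(distances, labels, k):
        """Modified kNN: weight each of the k nearest neighbors by
        (d_max - d_i) / (d_max - d_min) and pick the class with the largest score."""
        order = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
        d_min, d_max = distances[order[0]], distances[order[-1]]
        scores = defaultdict(float)
        for i in order:
            w = 1.0 if d_max == d_min else (d_max - distances[i]) / (d_max - d_min)
            scores[labels[i]] += w
        return max(scores, key=scores.get)

    # Hypothetical neighbor distances and labels for a test point:
    print(mknn_classify([0.5, 0.9, 1.0, 1.1, 2.4], ['o', 'x', 'x', 'x', 'x'], k=5))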
These classifiers require time linear in the training set size for classification,
and this grows as the training data size grows. In this context,
if the training dataset size can be reduced, the time required for classification
can be reduced. This reduction can be accomplished by reducing the
number of training patterns, reducing the number of features, or both.
Reducing the number of training patterns can be done by carrying out
prototype selection, which includes condensation algorithms and editing
algorithms. There are a number of such algorithms, including the popular
Condensed Nearest Neighbor (CNN) algorithm and the Modified Condensed
Nearest Neighbor (MCNN) algorithm. The CNN is an order-dependent
algorithm: different orderings of the input data give different condensed
sets. As a result, there is no guarantee of obtaining the optimal condensed
set. MCNN mitigates this problem by suitably modifying the algorithm to
make it order independent.
The CNN starts with the set Data of all patterns and a condensed
set Condensed which is empty. The first pattern in Data is put into
Condensed. After this, a set of statements is repeated
till there is no change in Condensed in an iteration.
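In the standard CNN procedure, the repeated step is to classify every pattern in Data with the NN rule using the current Condensed set and to add any misclassified pattern to Condensed. A minimal sketch of that loop (helper names and toy data are mine), consistent with the later remark that the final condensed set classifies the training data with 100% accuracy:

    def nn_label(x, prototypes, dist):
        """1NN label of x using the current condensed set of (pattern, label) pairs."""
        return min(prototypes, key=lambda pl: dist(x, pl[0]))[1]

    def cnn_condense(data, dist):
        """Standard CNN loop (a sketch): data is a list of (pattern, label) pairs."""
        condensed = [data[0]]
        changed = True
        while changed:
            changed = False
            for x, y in data:
                if nn_label(x, condensed, dist) != y:   # misclassified by Condensed
                    condensed.append((x, y))
                    changed = True
        return condensed

    # Hypothetical one-dimensional example with two classes:
    euclid = lambda a, b: abs(a - b)
    data = [(0.1, 'A'), (0.2, 'A'), (0.9, 'B'), (1.1, 'B'), (0.15, 'A')]
    print(cnn_condense(data, euclid))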
Since different orderings of the data give different condensed sets, the CNN
algorithm does not give an optimal condensed set.
The MCNN algorithm is a modification of the CNN algorithm
making it an order-independent algorithm. In this algorithm,
in each iteration, one pattern which is a typical pattern of a class
is added to Condensed. So in each iteration, c patterns are added
to Condensed, one from each class. The condensed set is used to
classify the training set. The misclassified patterns of each class are
used to find the next typical pattern for each class which is added to
Condensed. This is continued till there are no misclassified patterns.
It is to be noted that in a particular iteration, a class which has no
misclassified patterns will not have any pattern added to Condensed.
It can be seen that in the above algorithm, all the patterns are con-
sidered. Finding the typical pattern of a class and classification of
the training set using the condensed set do not depend on the order
in which the patterns are presented to the algorithm. MCNN is an
order-independent algorithm which gives better results than CNN.
However, the MCNN has a higher time complexity. Since it needs to
be run only once for a dataset to get the condensed set which can
be used for 1NN classification, the time taken should not matter if
it gives a better condensed set. It is to be noted that both CNN
and MCNN work by classifying the training dataset by using the
condensed dataset. Using the final Condensed set obtained, both
the algorithms result in 100% classification accuracy on the training
dataset.
Equations (1)–(7) bound the ratio Dnn/Dmax of the nearest neighbor distance
Dnn to the farthest distance Dmax in terms of a constant C, the number of
points n, the dimensionality d, and the norm parameter: an integer k for the
Lk norms, or a fraction f for the fractional norms.
From Eqs. (1)–(3), (6), and (7), it can be seen that fractional
distance metrics provide better contrast than the integer-valued
distance metrics.
SDP_{s1,s2}(p, q) = Σ_{i=1}^{d} f_{s1,s2}(D(p_i, q_i)).   (10)
3. Random Forests
Random forests or decision tree forests consist of a number of deci-
sion trees. It is an ensemble classifier. The prediction of the decision
The entropy impurity i(N) of a node N is given by

i(N) = − Σ_{i=1}^{c} f_i log f_i,

where f_i = N_i / Σ_{j=1}^{c} N_j and N_i is the number of patterns of class i
at node N, i = 1, . . . , c.
This is called the Gini impurity. In the case of the two-class prob-
lem, we get
i(N ) = fp ∗ fn .
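A minimal sketch computing the entropy impurity defined above and the two-class product form mentioned for the Gini impurity (function names and counts are mine):

    from math import log2

    def entropy_impurity(counts):
        """i(N) = -sum_i f_i log f_i, with f_i = N_i / sum_j N_j."""
        total = sum(counts)
        fs = [c / total for c in counts if c > 0]
        return -sum(f * log2(f) for f in fs)

    def gini_two_class(counts):
        """Two-class form mentioned in the text: i(N) = f_p * f_n."""
        total = sum(counts)
        return (counts[0] / total) * (counts[1] / total)

    print(entropy_impurity([8, 2]))   # impure node with classes split 8:2
    print(gini_two_class([8, 2]))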
1. Forest size F .
2. Number of patterns considered for each tree N 1 which gives the
randomness parameter ρ.
3. Number of random features (d1 ) considered for each tree.
y(x) = Σ_{i=1}^{N1} a_i x^i.
x1 − x2 − 1 = 1.
Note that both the points (2, 0)t and (3, 1)t satisfy this equation.
Also if we consider the point (1, −1)t , it satisfies the above equation
as it falls on the corresponding line. The second line is characterized
[Figure: a two-dimensional example with points from two classes, marked 'X' and 'O'; the X points include (2, 0) and (3, 1), and the O points include (0, 0).]
by
x1 − x2 − 1 = −1.
The point (0, 0)t satisfies this equation. Also, point (1, 1)t satisfies
this equation as it falls on this line. The first line may be represented
as W t X + b = 1 where W and X are vectors and b is a scalar. In the
two-dimensional case
W = (w1 , w2 )t ; X = (x1 , x2 )t .
In this example, W = (1, −1)t and b = −1. The second line is of the
form W t X + b = − 1. In the two-dimensional case, these lines are
called support lines. In the high-dimensional case, we have support
planes characterized by W t X+b = 1 and W t X+b = −1. It is possible
to show that the normal distance or margin between these two planes
2
is ||W || . In classification based on SVM, we use the patterns from two
different classes, positive and negative classes, to learn W and b.
Any point X from the positive class satisfies the property that
W t X + b ≥ 1 with support vectors (some kind of boundary vectors)
from the positive class satisfying W t X +b = 1. Similarly, points from
the negative class satisfy W t X + b ≤ −1 with support vectors from
the negative class satisfying W^t X + b = −1. So, the margin is the
distance between the two support planes; we would like to maximize the
margin, that is, to maximize 2/||W||. Equivalently, we can minimize ||W||^2/2.
We assume that a point Xi in the positive class has the label yi = 1 and
a point Xj in the negative class has the label yj = −1. The constrained
optimization problem is then

Minimize ||W||^2 / 2,

such that yi(W^t Xi + b) ≥ 1 for all the patterns Xi, i = 1, 2, . . . , n.

The Lagrangian associated with the optimization problem is

L(W, b) = (1/2) ||W||^2 − Σ_{i=1}^{n} αi [yi(W^t Xi + b) − 1].
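For the two-dimensional example above, W = (1, −1)^t and b = −1; the short sketch below checks the constraints y_i(W^t X_i + b) ≥ 1 for the points mentioned (assuming the 'X' points form the positive class and the 'O' points the negative class) and evaluates the margin 2/||W||.

    import math

    W = (1.0, -1.0)   # from the worked example: x1 - x2 - 1 = +/-1
    b = -1.0

    def decision(x):
        return W[0] * x[0] + W[1] * x[1] + b

    # Points from the figure; labels assumed: 'X' class -> +1, 'O' class -> -1.
    points = [((2, 0), +1), ((3, 1), +1), ((0, 0), -1), ((1, 1), -1)]
    for x, y in points:
        print(x, y, y * decision(x) >= 1)     # constraint y*(W^t x + b) >= 1

    print("margin =", 2 / math.hypot(*W))     # 2 / ||W|| = sqrt(2)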
4.1. SVM–kNN
A hybrid method is used here where SVM is not applied to the entire
training data. A coarse and quick categorization is done by finding
the kNN. Then SVM is performed on the smaller set of examples
which are more relevant.
The algorithm is as follows: for a test pattern, find its k nearest neighbors
in the training data, and then train an SVM on only those neighbors to
classify the pattern.
min_{w, ξ≥0} (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i,   (14)
max_{α≥0} Σ_{c∈{0,1}^n} (||c||_1 / n) α_c − (1/2) Σ_{c∈{0,1}^n} Σ_{c'∈{0,1}^n} α_c α_{c'} x_c^T x_{c'}   (16)

s.t. Σ_{c∈{0,1}^n} α_c ≤ C,

where

x_c = (1/n) Σ_{i=1}^{n} c_i y_i x_i.
Using Eq. (15), the algorithm given below can be used for training
the SVM.
min_{w,b} (1/2) w^T w + C Σ_{i=1}^{n} η(w; φ(x_i), y_i).   (17)
5. Logistic Regression
There are several linear models that have been successful in classi-
fying large datasets. We have seen one such model that is the linear
SVM. Another popular model that has been used on large datasets
is the logistic regression model. Here, we model the ratio of the like-
lihoods. Specifically, we assume that
ln [P(X|C1) / P(X|C2)] is linear in X.
Observing that X is a vector, our assumption would result in the
log-likelihood ratio being equal to some scalar of the form W t X + q,
where W is a vector and q is a scalar; this expression is linear in X.
This may be easily achieved in the following case.
Then the likelihood ratio P(X|C1)/P(X|C2) is given by

exp(−(X − µ1)^2/(2σ^2)) / exp(−(X − µ2)^2/(2σ^2)) = exp(−(1/(2σ^2)){(X − µ1)^2 − (X − µ2)^2})
= exp((µ1 − µ2)/(2σ^2) {2X − (µ1 + µ2)}).

So,

ln [P(X|C1)/P(X|C2)] = (µ1 − µ2)/(2σ^2) {2X − (µ1 + µ2)},

which is of the form

ln [P(X|C1)/P(X|C2)] = W^t X + q,

with W = (µ1 − µ2)/σ^2 and q = −(µ1^2 − µ2^2)/(2σ^2). Since
P(C1|X)/P(C2|X) = [P(X|C1)P(C1)] / [P(X|C2)P(C2)], we also have

ln [P(C1|X)/P(C2|X)] = W^t X + b,
where b = q + ln [P(C1)/P(C2)]. This gives us a linear form for both the log-
arithm of the likelihood ratio and the logarithm of the ratio of the
posterior probabilities. We can simplify further to write P (C1 |X) in
terms of W t X + b as follows. We have
ln [P(C1|X) / P(C2|X)] = W^t X + b.
In a two-class problem P(C2|X) = 1 − P(C1|X). So,

ln [P(C1|X) / (1 − P(C1|X))] = W^t X + b.
This implies, by taking the exponential on both sides, that

P(C1|X) / (1 − P(C1|X)) = exp(W^t X + b).

By simplifying further, we get

P(C1|X) = exp(W^t X + b) / (1 + exp(W^t X + b)).
The parameters W and b can be learnt, for example, by minimizing the squared error

(1/2) Σ_{i=1}^{n} {s(W^t X_i + b) − y_i}^2,

where s(·) is the logistic (sigmoid) function above and y_i is the target for pattern X_i.
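A minimal sketch of this model: the sigmoid s(W^t X + b) and one gradient-descent step on the squared error above, for one-dimensional patterns with targets coded as 0/1 (the variable names, learning rate, and data are mine):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def squared_error_step(w, b, xs, ys, lr=0.1):
        """One gradient step on (1/2) * sum_i (s(w*x_i + b) - y_i)^2 for scalar x."""
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            s = sigmoid(w * x + b)
            common = (s - y) * s * (1.0 - s)   # chain rule through the sigmoid
            gw += common * x
            gb += common
        return w - lr * gw, b - lr * gb

    # Hypothetical one-dimensional data with labels in {0, 1}:
    xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
    w, b = 0.0, 0.0
    for _ in range(200):
        w, b = squared_error_step(w, b, xs, ys)
    print(w, b, [round(sigmoid(w * x + b), 2) for x in xs])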
6. Semi-supervised Classification
It is generally assumed that there is a large amount of labeled training
data. In reality, this may not be true. Often, there is some labeled
data and a large amount of unlabeled data, because, when preparing the
training data, obtaining labels may be a difficult task or may be very
costly. If l of the n available patterns are labeled, then typically the
number of unlabeled patterns far exceeds the number of labeled ones,
i.e. n − l >> l.
To find such low density regions, a graph is drawn using the data points
in the dataset. Edges are drawn between nodes which are nearest neighbors.
This graph is then used to identify the low density regions.
and a normalized Laplacian D^{−1/2} Δ D^{−1/2} is used in the regularizer,
giving the function

f^T D^{−1/2} Δ D^{−1/2} f.   (25)

4. Tikhonov Regularizer
This algorithm uses the loss function and the Tikhonov regularizer, giving

(1/l) Σ_i (f_i − y_i)^2 + β f^T Δ f.   (26)
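A minimal sketch of these regularizers on a tiny made-up graph: it builds Δ = D − W from an adjacency matrix, forms the normalized Laplacian of Eq. (25), and evaluates the Tikhonov objective of Eq. (26).

    import numpy as np

    # Hypothetical symmetric adjacency matrix of a 4-node nearest-neighbor graph.
    W = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized Laplacian (Delta)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt        # normalized Laplacian, Eq. (25)

    f = np.array([1.0, 0.9, 0.2, 0.1])          # candidate label scores
    y = np.array([1.0, 0.0])                    # labels for the first l = 2 nodes
    l, beta = 2, 0.5

    smoothness = f @ L @ f                      # f^T Delta f
    tikhonov = np.mean((f[:l] - y) ** 2) + beta * smoothness   # Eq. (26)
    print(f @ L_norm @ f, smoothness, tikhonov)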
1. Co-EM:
• For every pattern x, each classifier probabilistically gives a label.
• Add (x, y) (where y is the label) with the weight P (y|x).
2. For different feature splits
• Create random feature splits.
• Apply co-training.
3. Use multiple classifiers
• Train multiple classifiers using labeled data.
• Classify unlabeled data with all the classifiers.
• Unlabeled data is labeled according to majority vote.
This method is simple and can be used with any existing classifier
but it is possible that the mistakes made in classification will keep
reinforcing themselves.
The last term arises from assigning the label sign f (x) to the
unlabeled points.
3. Classify a test pattern x by sign(f (x)).
Here, I_j^l depends only on the labeled data and I_j^u depends on both
the labeled and unlabeled data; β is a weighting factor.

I_j^l = E(Y_j) − Σ_{i=1,2,...,s} (|Y_j^i| / |Y_j|) E(Y_j^i),   (29)
Some of the statistical models that can be used are the Naive Bayes,
Gaussian, Poisson, Markov, and hidden Markov model (HMM).
When using HMM, training examples are used to learn the tran-
sition probabilities between the states. An HMM consists of a set of
states, an alphabet, a probability transition matrix T = (tij ) and a
probability emission matrix M = (mik ). In state i, the system has a
probability of tij of moving to state j and a probability mik of emit-
ting symbol k. For each class, an HMM is built using the training
data. A new pattern is given the class label of the model which fits
the data the best.
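A minimal sketch of this scheme: score a symbol sequence with the forward algorithm under each class's HMM (transition matrix T and emission matrix M as described above) and pick the best-fitting model; all matrices, the initial probabilities, and the sequence are made-up values.

    import numpy as np

    def sequence_likelihood(seq, pi, T, M):
        """Forward algorithm: P(seq | HMM with initial probs pi, transitions T, emissions M)."""
        alpha = pi * M[:, seq[0]]
        for symbol in seq[1:]:
            alpha = (alpha @ T) * M[:, symbol]
        return alpha.sum()

    # Hypothetical two-state, two-symbol HMMs, one per class.
    pi = np.array([0.6, 0.4])
    hmm_A = (np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.8, 0.2], [0.3, 0.7]]))
    hmm_B = (np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([[0.2, 0.8], [0.7, 0.3]]))

    seq = [0, 0, 1, 0]
    scores = {c: sequence_likelihood(seq, pi, T, M) for c, (T, M) in {"A": hmm_A, "B": hmm_B}.items()}
    print(max(scores, key=scores.get), scores)   # class whose model fits the sequence best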
An ANN can also be used. Two types of ANN used are MLP
and RNN.
Research Ideas
1. Is it possible to design better condensation algorithms compared to CNN and
MCNN in terms of space and condensation time requirements?
Relevant References
(a) V. S. Devi and M. N. Murty, An incremental prototype set building tech-
nique. Pattern Recognition, 35:505–513, 2002.
(b) V. S. Devi and M. N. Murty, Pattern Recognition: An Introduction.
Hyderabad, India: Universities Press, 2012.
(c) M. N. Murty and V. S. Devi, Pattern Recognition: An Algorithmic
Approach. London: Springer, 2012.
(d) S. Garcia, J. Derrac, J. R. Cano and F. Herrera, Prototype selection for near-
est neighbor classification: Taxonomy and empirical study. IEEE Trans-
actions on PAMI, 34:417–435, 2012.
2. The usual distance metrics such as ED do not work well in high-dimensional
spaces. Can we find metric or non-metric distance functions such as fractional
norms that work well in high-dimensional spaces?
Relevant References
(a) C. C. Aggarwal, Re-designing distance functions and distance-based appli-
cations for high dimensional data. SIGMOD Record, 30:13–18, 2001.
Relevant References
(a) L. Breiman, Random forests. Machine Learning, 45(1):5–32, 2001.
(b) X. Z. Fern and C. E. Brodley, Random projection for high-dimensional
data clustering: A cluster ensemble approach. Proceedings of ICML, 2003.
(c) A. Andoni and P. Indyk, Near-optimal hashing algorithms for approxi-
mate nearest neighbors in high dimensions. Communications of the ACM,
51:117–122, 2008.
(d) Y. Ye, Q. Wu, H. Z. Huang, M. K. Ng and X. Li, Stratified sampling for
feature subspace selection in random forests for high dimensional data.
Pattern Recognition, 46:769–787, 2013.
4. Bagging and boosting are two useful techniques to improve classifier perfor-
mance. How can one combine them in classification using random forests?
Relevant References
(a) L. Breiman, Bagging predictors. Machine Learning, 24:123–140, 1996.
(b) T. K. Ho, The random subspace method for constructing decision forests.
IEEE Transactions on PAMI, 20:832–844, 1998.
(c) P. J. Tan and D. L. Dowe, Decision forests with oblique decision trees.
Proceedings of MICAI, 2006.
5. Like the fuzzy random forests, is it possible to consider random forests based
on other soft computing tools?
Relevant References
(a) Q.-H. Hu, D.-R. Yu and M.-Y. Wang, Constructing rough decision forests,
in D. Slezak et al. (eds.). Berlin, Heidelberg: Springer-Verlag, 2005,
pp. 147–156. LNAI 3642.
(b) H. Shen, J. Yang, S. Wang and X. Liu, Attribute weighted Mercer kernel-
based fuzzy clustering algorithm for general non-spherical data sets. Soft
Computing, 10:1061–1073, 2006.
(c) A. Verikas, A. Gelzinis and M. Bacauskiene, Mining data with random
forests: A survey and results of new tests. Pattern Recognition, 44:2330–
2349, 2011.
6. What is the reason behind the success of linear SVM classifier in dealing with
classification in high-dimensional spaces?
Relevant References
(a) D. Liu, H. Qian, G. Dai and Z. Zhang, An iterative SVM approach to
feature selection and classification in high-dimensional datasets. Pattern
Recognition, 46:2531–2537, 2013.
(b) M.-H. Tsai, Y.-R. Yeh, Y.-J. Lee and Y.-C. Frank Wang, Solving nonlinear
SVM in linear time? A Nystrom approximated SVM with applications to
image classification. IAPR Conference on Machine Vision Applications,
2013.
(c) T. Joachims, Training linear SVMs in linear time. Proceedings of KDD,
2006.
(d) G.-X. Yuan, C.-H. Ho and C.-J. Lin, Recent advances of large-scale linear
classification. Proceedings of the IEEE, 100:2584–2603, 2012.
7. The so-called nonlinear SVM employs the kernel trick to obtain a linear deci-
sion boundary in a higher-dimensional space, thus effectively increasing the
dimensionality of the patterns. However, the random forest classifier consid-
ers a random subspace at a time to construct a decision tree which forms a
part of the forest. Also, there are plenty of other dimensionality reduction
techniques that perform well in classification. How can one reconcile the
fact that both an increase in the dimensionality (kernel SVM) and a decrease in the
dimensionality (random forests and other classifiers) can improve the classification
performance?
Relevant References
(a) M.-H. Tsai, Y.-R. Yeh, Y.-J. Lee and Y.-C. F. Wang, Solving nonlinear
SVM in linear time? A Nystrom approximated SVM with applications to
image classification. IAPR Conference on Machine Vision Applications,
2013.
(b) S. Haykin, Neural Networks and Learning Machines, Vol. 3. Upper Saddle
River: Pearson Education, 2009.
(c) G. Seni and J. F. Elder, Ensemble methods in data mining: Improving
accuracy through combining predictions. Synthesis Lectures on Data
Mining Knowledge Discovery, 2:1–126, 2010.
(d) X. Hu, C. Caramanis and S. Mannor, Robustness and regularization of
support vector machines. JMLR, 10:1485–1510, 2009.
(e) N. Chen, J. Zhu, J. Chen and B. Zhang, Dropout training for support vector
machines. arXiv:1404.4171v1, 16th April 2014.
8. Can we pose the semi-supervised classification problem as a simpler optimiza-
tion problem?
Relevant References
(a) I. S. Reddy, S. K. Shevade and M. N. Murty, A fast quasi-Newton method
for semi-supervised support vector machine. Pattern Recognition, 44:
2305–2313, 2011.
(b) X. Chen, S. Chen, H. Xue and X. Zhou, A unified dimensionality reduction
framework for semi-paired and semi-supervised multi-view data. Pattern
Recognition, 45:2005–2018, 2012.
(c) X. Ren, Y. Wang and X.-S. Zhang, A flexible convex optimization model
for semi-supervised clustering with instance-level constraints. Proceed-
ings of ISORA, 2011.
9. Is it possible to consider semi-supervised dimensionality reduction which can
help in efficient and effective classification?
Relevant References
(a) K. Kim and J. Lee, Sentiment visualization and classification via semi-
supervised nonlinear dimensionality reduction. Pattern Recognition, 47:
758–768, 2014.
Relevant References
(a) S. Laxman and P. Sastry, A survey of temporal data mining. Sadhana,
31:173–198, 2006.
(b) Z. Xing, J. Pei and E. Keogh, A brief survey on sequence classification.
SIGKDD Explorations, 12:40–48, 2010.
(c) N. Piatkowski, S. Lee and K. Morik, Spatio-temporal random fields: Com-
pressible representation and distributed estimation. Machine Learning,
93:115–139, 2013.
11. It is possible to view patterns as transactions and use frequent itemset-based
classifiers. What is the role of frequent itemsets in classification?
Relevant References
(a) H. Cheng, X. Yan, J. Han and P. S. Yu, Direct discriminative pattern mining
for effective classification. Proceedings of ICDE, 2008.
(b) M. N. Murty and V. Susheela Devi, NPTEL Lecture Notes on Pat-
tern Recognition, http://nptel.ac.in/courses.php [accessed on 2 November
2014].
(c) B. Fernando, E. Fromont and T. Tuytelaars, Mining mid-level features for
image classification. International Journal of Computer Vision, 108:186–
203, 2014.
12. One way to reduce space and time requirements in classification is to compress
the data and design classifiers in the compressed domain. How to realize such
classifiers in practice?
Relevant References
(a) D. Xin, J. Han, X. Yan and H. Cheng, Mining compressed frequent pattern
sets. Proceedings of VLDB, 2005.
Chapter 6
1. Introduction
Hard classifiers or the classical classification techniques make a hard
or definite decision on the class label of the test patterns. The
classifiers we have discussed so far fall under this category. However, several
present-day applications require each pattern to belong to one or
more classes. For example, a document may belong to both sports
and politics. Such applications motivate the need for soft classifica-
tion. A soft classifier either gives the degree of classification of the
test pattern to every class label, or may classify the test pattern
as belonging to more than one class. Classifiers based on genetic
algorithms (GAs) or neural networks start with some random values
(for the candidate solution or the weights) and depending on the
performance on training patterns, these values are adapted.
Some of these methods which we will discuss in this chapter are
as follows:
1. Fuzzy Classifier: In this classifier, each pattern belongs to every
class with a membership value. To predict the class label of a
test pattern, its fuzzy membership to every class is determined
and the class to which its membership is maximum is the class
chosen. Further, in a multi-label scenario, classes could be ranked
based on the respective membership values and more than one
class label could be assigned to the test pattern based on the
ranking.
2. Rough Classifier: Here, every pattern belongs to the lower
approximation of one class or to the upper approximation of more
than one class. This type of classifier is suitable when the patterns
in the domain can belong to more than one class.
3. Genetic Algorithms (GAs) for Classification: GAs work
with a population of candidate solutions which are initialized ran-
domly. Each chromosome in the population is evaluated to see
how well it carries out the problem to be solved. This is called the
fitness function. The advantage of GAs is that they are generally
used for optimization and can handle problems which are complex,
non-differentiable, multi-modal and multi-objective. Local
minima are more easily avoided because the search works with a population
of chromosomes. GAs for classification usually attempt to find
a dividing hyperplane between classes, or a set of rules for
classification.
4. Neural Networks: The neural network is inspired by the neural
system in human beings and the neurons in the brain. The neu-
ral network consists of the input layer, 0–2 hidden layers and an
output layer. The weights in the network are adjusted so as to get
the correct class label when the training patterns are input to the
network.
5. Multi-label Classification: The patterns in this scenario can belong
to more than one class. If we have a label set L = {c1 , . . . , ck } then
each pattern belongs to a subset of L. One such application is when
the news items in the newspaper have to be classified. A news item
can belong to say both politics and movies if a movie star is in
politics. The task is much more complex here as it is necessary to
not only predict the subset of labels but also the ranking of the
labels.
2. Fuzzy Classification
In the conventional classification algorithms, each pattern to be clas-
sified belongs to one class. This is a crisp classification paradigm. In
fuzzy classification, each pattern belongs to each class with a mem-
bership value. If we consider a pattern P , µP 1 , µP 2 , . . . , µP C are the
membership values of pattern P to classes 1, 2, . . . , C. This can be
converted into crisp classification by assigning pattern P to the class
to which its membership value is highest.
where R(x, y) gives the similarity between x and y and is defined as:

R(x, y) = ||y − x||^{−2/(m−1)} / Σ_{j∈N} ||y − j||^{−2/(m−1)}.
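A minimal sketch of the similarity R(x, y) above for scalar patterns, with a made-up neighborhood N and fuzzifier m; the memberships it produces sum to 1 over N.

    def similarity(x, y, neighborhood, m=2.0):
        """R(x, y) = ||y - x||^(-2/(m-1)) / sum_{j in N} ||y - j||^(-2/(m-1)) for scalars."""
        num = abs(y - x) ** (-2.0 / (m - 1))
        den = sum(abs(y - j) ** (-2.0 / (m - 1)) for j in neighborhood)
        return num / den

    # Hypothetical neighborhood of a test point y and the points x within it:
    N = [1.0, 1.5, 4.0]
    y = 2.0
    print([round(similarity(x, y, N), 3) for x in N])   # memberships sum to 1 over N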
3. Rough Classification
Here, we use the notion of an upper and lower approximation of a set
to define approximate classification. An approximation space A con-
sists of A = (U, R) where U is the universe and R is a binary equiv-
alence relation over U , which is called the indiscernibility relation. If
(p, q)εR, then p and q are indiscernible in A. Equivalence classes of
If C = {P_1, P_2, . . . , P_n} is a partition of U, then

P_i ∩ P_j = Φ for every i ≠ j, 1 ≤ i, j ≤ n, and ∪_{i=1}^{n} P_i = U.
U a b c
p1 0 4 1
p2 1 2 0
p3 1 1 3
p4 0 0 2
p5 1 3 2
p6 0 4 3
p7 0 3 0
Once the decision rules in the form of Eq. (2) are generated,
simplification is done so that as many of the condition attribute values
as possible are removed without losing the required information.
This process is called value reduction. All the rules are kept in a
set Rule. One rule at a time is taken and copied to r. A condition is
removed from r and rule r is checked for decision consistency with
every rule belonging to Rule. If r is inconsistent, the dropped
condition is restored. This is repeated for every condition of the rule.
The rule r after all these steps is the generalized rule. If r is
included in any rule belonging to Grule, r is discarded. If any rules in
Grule are included in r, those rules are removed from Grule. After all
the rules in Rule are processed, we get the rules in Grule. These rules
are called maximally general or minimal-length rules.
4. GAs
GAs are a robust method of carrying out optimization based on
the principles of natural selection and genetics. It is a search pro-
cess where candidate solutions are evaluated using an evaluation function.
A candidate solution is a chromosome or string, and there is a popula-
tion of strings. These are evaluated and the next generation of strings
is generated using the operations of selection, crossover and
mutation. After repeating this procedure for a number of iterations,
the candidate solution which gives the best evaluation is taken as
the final solution.
The GA has been used for carrying out classification. The fol-
lowing sections discuss some algorithms using GA for classification.
[0.45 0.20 0.93 0.11 0.56 0.77 0.32 0.45 0.69 0.85].
fitness = (total − correct) / total,
where correct refers to the number of patterns correctly classified
and total refers to the total number of patterns.
Another fitness function that can be used gives weights which
result in the maximum class separation. This is given by:
fitness = a × (total − correct)/total + b × (n_m/k)/total,
condition : class-label.
The condition part would give values for each dimension and for
each class. So the rule would be of the form
(y11 , . . . , y1i , . . . , yid ), . . . , (yj1 , . . . , yji , . . . , yjd ), . . . ,
(yc1 , . . . , yci , . . . , ycd ): ω.
fitness = correct/total + a × invalid/total,
where invalid gives the number of attributes which have the same
value in the rule for all classes, i.e. if y1i = y2i . . . = yci , then the
attribute i is invalid. After running the GA for a number of iterations,
the rule which gives the best fitness is chosen.
1. Error rate: Each training pattern is classified by using all the rules
in an individual and the majority class label is assigned to the
pattern. This is compared with the class label given and the error
rate of the individual or string is the percent of training patterns
which are misclassified.
2. Entropy: In the training patterns that a rule R matches, if pi is
the percent of patterns belonging to class i, then
Entropy(R) = − Σ_{i=1}^{n} p_i log_2 p_i.
Rule-consistency(individual) = −p_corr log_2 p_corr − (1 − p_corr) log_2(1 − p_corr),
Fitness = (1/n) Σ_{i=1}^{C} n_i a_i,

fitness_cv = (1/N) Σ_{i=1}^{N} a(P_i, P_i),
where µ(∅) = 0.

Here

c_i = c if (c)∫ (p + qf) dµ < A, and c_i = 1 otherwise,

for i = 1, 2, . . . , m, and

c_i = c if (c)∫ (p + qf) dµ > A, and c_i = 1 otherwise.
where label is the class label (meaningful only if the node is a terminal
node), par is a pointer to the parent node, and l and r are the
pointers to the left and right children respectively. S is an array
where S[0] stores the attribute id and S[1] stores the threshold for
that attribute; the test 'feature S[0] < S[1]' is boolean, giving the
outcomes "yes" or "no".
In another tree-based formulation, the nodes are either terminal
or function nodes. Terminal nodes consist of an attribute id, attribute
value or a class label. Function nodes consist of a 4-tuple given by:
where value gives the attribute value, l and r are children nodes.
Fixed length encoding is difficult to use for decision trees which
are non-binary. Generally only binary trees are considered for this.
One method divides the genes into caltrops. Each caltrop consists
of the subtree formed using a node as the root and its two children.
A non-terminal node is identified by an attribute index and a
terminal node is identified by the value zero.
Another method of using fixed length encoding encodes each node
by two values. A node is represented by:
J = Σ_{i=1}^{k} p(ω_i | Y_i) p(Y_i) log [p(ω_i | Y_i) / p(ω_i)],
b_j = Σ_{i=1}^{4} X_i w_{ij}.
[Figure: a single neuron j with inputs X1, X2, X3, X4 and weights W1j, W2j, W3j, W4j; the node computes b_j = Σ_i X_i W_{ij} and outputs f(b_j).]

[Figure: a feedforward network with input units I_1, . . . , I_d, hidden units H_1, . . . , H_h and output units O_1, . . . , O_o; U = (u_{ij}), i = 1, . . . , d, j = 1, . . . , h, are the input-to-hidden weights and V = (v_{ij}), i = 1, . . . , h, j = 1, . . . , o, are the hidden-to-output weights.]
Δv_{ij} = α · δ_j · H_i and v_{ij} = v_{ij} + Δv_{ij},

where δ_j is the error term of output unit j. The weights between the
input layer and the hidden layer are updated as follows:

Δu_{ij} = α · δ_j · I_i and u_{ij} = u_{ij} + Δu_{ij},

where, for hidden unit j,

δ_j = f'_j(a_j) Σ_{k=1}^{K} δ_k v_{jk}.

Here I_i is the ith input; a_j is the input of hidden unit j; f'_j(a_j)
is the derivative of the activation function evaluated at a_j; K is
the number of neurons in the next layer.
The learning rate α plays a critical role in the training. If α is
too low, the convergence of the weights to the optimum is very slow
and if α is too high, the weight values oscillate or get stuck in a local
minimum. To tackle this problem, a momentum term β can be added
to make the updation equation as:
Δv_{ij}^{t} = α · δ_j · H_i + β · Δv_{ij}^{t−1},

where Δv_{ij}^{t−1} is the incremental change in v_{ij} made in the previous
iteration. α and β are values between 0 and 1. Usually α is a small
value like say 0.01 and β will have a larger value like 0.7 or 0.8.
Looking at the error in classification, the two values can be adjusted.
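A minimal sketch of the momentum update Δv^t = α·δ_j·H_i + β·Δv^{t−1} for a single weight (the α, β and the error terms are made-up values):

    def momentum_update(v, prev_dv, delta_j, h_i, alpha=0.01, beta=0.8):
        """One weight update with momentum: dv_t = alpha*delta_j*H_i + beta*dv_{t-1}."""
        dv = alpha * delta_j * h_i + beta * prev_dv
        return v + dv, dv

    # Hypothetical error terms and hidden activations over a few iterations:
    v, dv = 0.0, 0.0
    for delta_j, h_i in [(0.5, 1.0), (0.4, 1.0), (0.3, 1.0)]:
        v, dv = momentum_update(v, dv, delta_j, h_i)
    print(v, dv)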
As mentioned earlier, the number of inputs to the input layer
is equal to the number of features and the output is the number
of classes. For example, in a digit recognition problem, there are
10 classes which consist of the digits 0–9. If the digits are represented
as an image with 8 × 8 pixels, then there are 64 features being input
to the neural network, the number of inputs being 64.
If the outputs of the network do not match the actual class label, the
weights in the network are updated using the backpropagation algorithm.
Instead of using the backpropagation algorithm, a GA can be used to find
the best weights that fit the network.
When the algorithm is started, all the above operators are given
equal probabilities. In the course of a run, the performance of the
operators is observed and the probability of these operators being
selected is increased if the operator is doing well and decreased if the
operator is doing poorly.
6. Multi-label Classification
In this type of problem, each instance is associated with a set of
labels. It is necessary to predict the set of labels of a test pattern using
the training instances with known label sets. For example, when we
are classifying the news items in a newspaper, the same article may
belong to both Politics and Entertainment if, say, a film star is standing
for election. When classifying a scene, the same image may have a
set of labels, say hill, river, and tree. When carrying out sentiment
analysis of a document, the same document may express the sentiments
of sadness, anger and interest.
ment sadness, anger and interest. There are a number of applications
such as annotation of images and video, text classification, functional
genomics and music categorization into emotions.
Let X ⊆ R^d be the d-dimensional instance domain and let
L = {l1, . . . , lq} be the set of labels or classes. Multi-label learning
entails learning a function h : X → 2^L which maps each instance
x ∈ X to a set of labels. When the order of the labels is also important,
the task is called multi-label ranking. In other words, we need to predict
the labels with their ranking.
Given in Table 6.2 is an example dataset where each pattern
can belong to more than one class. The dataset has four features
fi , i = 1, . . . , 4 and labels lj , 1 ≤ j ≤ 5. The features have normalized
values. The labels for each instance gives the subset of classes to
which the instance belongs. The ranking of the class labels is also
[Table 6.2. An example multi-label dataset with four features f1, f2, f3, f4 and five labels l1, . . . , l5; the label columns indicate, with their ranking, the subset of classes each instance belongs to.]
taken into account. This means that for the first pattern, the first
label is 2 and the second label is 1. For a new instance, it is necessary
to predict the subset of class labels to which the instance belongs
ranked in the correct order.
So the category vector l_p can be determined using the prior
probabilities P(E_b^c), where c is the label and b ∈ {0, 1}, and the posterior
probabilities P(I_j^c | E_b^c) for j = 0, 1, . . . , k. These values can be directly
estimated from the training set.
of the given labels and S_i. So D_i = {(x_j, Y_j ∩ S_i), j = 1, . . . , n},
where n is the total number of patterns. Some patterns may have
the empty set as the label set. Given a new multi-label instance p,
the binary prediction h_i of all classifiers for all labels ω_j ∈ S_i is found
and used to find the multi-class classification vector.
In the case of overlapping labelsets, C^k is the set of all distinct
k-labelsets of C. The size of C^k is |C^k| = (|C| choose k). If we require l
classifiers and the labelset size is k, we need to first select l k-labelsets
S_i, i = 1, . . . , l, from the set C^k via random sampling without replacement.
Here the labelsets may overlap. Then l multi-label classifiers
h_i, i = 1, . . . , l, are learnt using LP. To classify a new instance p, every
classifier h_i gives a binary prediction for each label in the corresponding
labelset S_i. Taking all the decisions of the l models, the mean of
the predictions is computed for each label ω_j ∈ C, and the label is assigned
if this mean is greater than 0.5.
[Figure: a neural network for multi-label classification with inputs a_0, a_1, . . . , a_d, hidden units b_0, b_1, . . . , b_m and outputs c_1, c_2, . . . , c_l; U = (u_{s,h}) are the input-to-hidden weights and V = (v_{h,t}) the hidden-to-output weights.]
Here, e_i = Σ_{j=1}^{l} (o_j^i − d_j^i)^2, where o_j^i is the actual output of the
network on pattern x_i for the jth class and d_j^i is the desired output
on pattern x_i for the jth class. The desired output is +1 (if
j ∈ Y_i) or −1 (if j ∉ Y_i).
Another formulation for the overall error is:

E = Σ_{i=1}^{n} e_i = Σ_{i=1}^{n} (1/(|Y_i||Ȳ_i|)) Σ_{(p,q) ∈ Y_i × Ȳ_i} exp(−(o_p^i − o_q^i)).
e_g = −∂E_i / ∂(Σ_{s=1}^{d} a_s v_{sg}) = (Σ_{t=1}^{l} d_t v_{gt}) (1 + b_g)(1 − b_g).
The predicted label set of an instance is Y = {t | c_t > threshold, t ∈ C}. The total number of weights
and biases in the network is given by:

N = (d + 1) × g + (g + 1) × C.
The Hamming loss is

HL = (1/v) Σ_{i=1}^{v} (1/|Y|) |h(X_i) Δ Y_i|,

where Δ denotes the symmetric difference between the predicted label set h(X_i)
and the true label set Y_i. The one-error is

E_one = (1/v) Σ_{i=1}^{v} [[ arg max_{y∈Y} h(X_i, y) ∉ Y_i ]].

The coverage is

C = (1/v) Σ_{i=1}^{v} max_{y∈Y_i} rank_h(X_i, y) − 1.

The ranking loss is

RL = (1/v) Σ_{i=1}^{v} (1/(|Y_i||Ȳ_i|)) |{(y1, y2) ∈ Y_i × Ȳ_i | h(X_i, y1) ≤ h(X_i, y2)}|.

The average precision is given by:

P = (1/v) Σ_{i=1}^{v} (1/|Y_i|) Σ_{y∈Y_i} |{y' ∈ Y_i | rank_h(X_i, y') ≤ rank_h(X_i, y)}| / rank_h(X_i, y).
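A minimal sketch of the Hamming loss above on a tiny hypothetical test set (label names and predictions are mine):

    def hamming_loss(predicted, true, all_labels):
        """HL = (1/v) * sum_i |h(X_i) symmetric-difference Y_i| / |Y|."""
        v = len(predicted)
        return sum(len(p ^ t) for p, t in zip(predicted, true)) / (v * len(all_labels))

    labels = {"l1", "l2", "l3", "l4", "l5"}
    predicted = [{"l1", "l2"}, {"l3"}]            # hypothetical predictions
    true = [{"l1"}, {"l3", "l5"}]                 # hypothetical ground-truth label sets
    print(hamming_loss(predicted, true, labels))  # (1 + 1) / (2 * 5) = 0.2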
Research Ideas
1. Can we apply rough-fuzzy approach and the fuzzy-rough approach to pattern
classification?
Relevant References
(a) N. Verbiest, C. Cornelis and F. Herrera, FRPS: A rough-fuzzy approach
for generating classification rules. Pattern Recognition, 46(10):2770–2782,
2013.
(b) S. K. Pal, S. K. Meher and S. Dutta, Class-dependent rough-fuzzy granular
space, dispersion index and classification. Pattern Recognition, 45(7):2690–
2707, 2012.
(c) R. Jensen and C. Cornelis, Fuzzy-rough nearest neighbor classification.
Transactions on Rough Sets, LNCS, 6499:56–72, 2011.
(d) Y. Qu et al., Kernel-based fuzzy-rough nearest neighbor classification. Inter-
national Conference on Fuzzy Systems, FUZZ:1523–1529, 2011.
2. A neuro-fuzzy system (or a fuzzy neural network) is a fuzzy system which uses
the neural network to learn the parameters of the fuzzy system. How do we use
the neuro-fuzzy system for classification?
Relevant References
(a) A. Ghosh, B. U. Shankar and S. K. Meher, A novel approach to neuro-fuzzy
classification. Neural Networks, 22:100–109, 2009.
(b) R.-P. Li, M. Mukaidono and I. B. Turksen, A fuzzy neural network for pat-
tern classification and feature selection. Fuzzy Sets and Systems, 130:101–
108, 2002.
3. Instead of GAs, other stochastic search techniques such as simulated annealing
or Tabu search can be used. How do these techniques compare with the GA?
Relevant References
(a) D. Glez-Pena, M. Reboiro-Jato, F. Fdez-Riverola and F. Diaz, A simulated
annealing-based algorithm for iterative class discovery using fuzzy logic for
informative gene selection. Journal of Integrated Omics, 1:66–77, 2011.
(b) J. Pacheco, S. Casado and L. Nunez, A variable selection method based in
Tabu search for logistic regression models. European Journal of Operations
Research, 199:506–511, 2009.
4. Hybrid GAs combine GAs with operators from other search algorithms like
simulated annealing, local search etc. Can we improve the performance of GAs
by hybridizing them?
Relevant References
(a) W. Wan and J. B. Birch, An improved hybrid GAs with a new local search
procedure. Journal of Applied Mathematics, 2013.
(b) D. Molina, M. Lozano and F. Herrera, MA-SW-Chains: Memetic algorithm
based on local search chains for large scale continuous global optimization.
Proceedings of the 6th IEEE World Congress on Computational Intelligence
(WCCI’10), 2010.
(c) C. Grosan and A. Abraham, Hybrid evolutionary algorithms: Methodo-
logies, architectures, and reviews. Studies in Computational Intelligence
(SCI), 75:1–17, 2007.
5. A number of algorithms exist which mimic the behavior of a swarm of animals
such as Particle Swarm Optimization, Ant Colony Optimization etc. How do
we adapt these algorithms for pattern classification?
Relevant References
(a) B. Xue, M. Zhang and W. N. Browne, Particle swarm optimization for
feature selection in classification: A multi-objective approach. IEEE Trans-
actions on Cybernetics, 43(6):1656–1671, 2013.
(b) H. Dewan and V. S. Devi, A peer-peer particle swarm optimizer. 6th Interna-
tional Conference on Genetic and Evolutionary Computing, pp. 140–144,
2012.
(c) D. Martens, M. De Backer, R. Haesen, J. Vanthienen, M. Snoeck and
B. Baesens, Classification with ant colony optimization. IEEE Transactions
on Evolutionary Computation, 11(5):651–665, 2007.
Relevant References
(a) N. Spolaor, E. A. Cherman, M. C. Monard and H. D. Lee, A comparison
of multi-label feature selection methods using the problem transformation
approach. Electronic Notes in Theoretical Computer Science, 292:135–151,
2013.
(b) X. Kong, N. Ng and Z. Zhou, Multi-label feature selection for graph classi-
fication. IEEE 10th International Conference on Data Mining (ICDM),
pp. 274–283, 2010.
(c) M. L. Zhang, J. M. Pena and V. Robles, Feature selection for multi-
label naive Bayes classification. Information Sciences, 179(19):3218–3229,
2009.
Chapter 7
Data Clustering
1. Number of Partitions
It is possible to depict onto functions from the set of data points
X to the set of clusters C as shown in Figure 7.1 where |X | ≥ |C|.
Onto functions are important in the context of counting the number
of partitions of a dataset.
Figure 7.1. An onto function from the set of data points {X1, X2, ..., Xn} to the set of clusters {C1, C2, ..., CK}.
So, the value of Nv can be written in a compact form using (2) and
(3) as
$$N_v = \binom{K}{1}(K-1)^n - \binom{K}{2}(K-2)^n + \cdots + (-1)^{K}\binom{K}{K-1}1^n.$$
Hence,
$$N_{onto}(n, K) = K^n - N_v = K^n - \binom{K}{1}(K-1)^n + \binom{K}{2}(K-2)^n - \cdots + (-1)^{K-1}\binom{K}{K-1}1^n = \sum_{i=1}^{K} (-1)^{K-i}\binom{K}{i}\, i^n.$$
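To make the counting concrete, the following small Python sketch (not part of the original text) evaluates N_onto(n, K) and, dividing by the K! orderings of the clusters, the number of K-cluster partitions (the Stirling number of the second kind).

```python
from math import comb, factorial

def n_onto(n, K):
    """Number of onto functions from an n-element set to a K-element set,
    using the inclusion-exclusion expression derived above."""
    return sum((-1) ** (K - i) * comb(K, i) * i ** n for i in range(1, K + 1))

def n_partitions(n, K):
    """Number of K-cluster partitions of n patterns: onto functions divided
    by the K! possible orderings of the clusters."""
    return n_onto(n, K) // factorial(K)

# For example, 10 patterns into 3 clusters:
print(n_onto(10, 3))        # 55980
print(n_partitions(10, 3))  # 9330
```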
2. Clustering Algorithms
Conventionally clustering algorithms are either partitional or hier-
archical. Partitional algorithms generate a partition of the set of
data points and represent or abstract each cluster using one or more
patterns or representatives of the cluster. Consider the data points
shown in Figure 7.2. There is a singleton cluster and two other dense
clusters. Here, centroid of a cluster of patterns is used to represent
the cluster as depicted in the figure.
On the other hand, hierarchical clustering algorithms generate
a hierarchy of partitions. Such a hierarchy is typically generated by
either splitting bigger clusters into smaller ones (divisive clustering)
or by merging smaller clusters to form bigger clusters (agglomera-
tive clustering). Figure 7.3 shows a hierarchy which is also called
as dendrogram. There are two clusters at the top level; these are
Figure 7.2. A two-dimensional dataset with two dense clusters, each represented by its centroid, and an outlier forming a singleton cluster.
Figure 7.3. A dendrogram over the patterns A–H; cutting it at different levels yields 1 to 8 clusters.
Table 7.1. An example dataset.

Pattern number   Feature1   Feature2   Feature3
1                10         3.5        2.0
2                63         5.4        1.3
3                10.4       3.5        2.1
4                10.3       3.3        2.0
5                73.5       5.8        1.2
6                81         6.1        1.3
7                10.4       3.3        2.3
8                71         6.4        1.0
9                10.4       3.5        2.3
10               10.5       3.3        2.1
K-means Algorithm
$$X_{q+1} = \underset{X \in \mathcal{X} - \{X_1, X_2, \ldots, X_q\}}{\arg\max}\ \big(d(X_1, X) + \cdots + d(X_q, X)\big).$$
This is one scheme for initial centroid selection. One problem with this
initialization is that it can lead to empty clusters. For example,
consider the dataset shown in Table 7.1. Let the three initial
centroids be (10, 3.5, 2.0), (81, 6.1, 1.3), and (40, 4.8, 1.7) where
the first two are the two extreme points in the dataset and the
third one is approximately at the middle of the line joining the
other two.
Using these three centroids, the clusters obtained are:
Cluster 1: {(10, 3.5, 2.0), (10.4, 3.5, 2.1), (10.3, 3.3, 2.0),
(10.4, 3.3, 2.3), (10.4, 3.5, 2.3), (10.5, 3.3, 2.1)}
Cluster 2: {(63, 5.4, 1.3), (71, 6.4, 1.0), (73.5, 5.8, 1.2),
(81, 6.1, 1.3)}
Cluster 3: { }
Note that based on the patterns, the minimum and maximum
values of feature1 are 10 and 81. So, the range is 10 · · · 81. Sim-
ilarly, for feature2 the range is 3.3 · · · 6.1 and for feature3 it is
1.0 · · · 2.3. The range box in this example is a hypercube based
on these three range values. So, even though all the three initial
centroids are legal and fall in the range box, one of the clusters
obtained using the K-means algorithm is empty in this exam-
ple. In general, one or more clusters could be empty when such
a scheme is used.
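The empty-cluster effect described above can be reproduced with a short sketch (not from the text) that runs one K-means assignment step on the data of Table 7.1 with the three initial centroids mentioned above.

```python
# One K-means assignment step on the data of Table 7.1.
data = [(10, 3.5, 2.0), (63, 5.4, 1.3), (10.4, 3.5, 2.1), (10.3, 3.3, 2.0),
        (73.5, 5.8, 1.2), (81, 6.1, 1.3), (10.4, 3.3, 2.3), (71, 6.4, 1.0),
        (10.4, 3.5, 2.3), (10.5, 3.3, 2.1)]
centroids = [(10, 3.5, 2.0), (81, 6.1, 1.3), (40, 4.8, 1.7)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

clusters = {0: [], 1: [], 2: []}
for x in data:
    nearest = min(range(3), key=lambda k: sq_dist(x, centroids[k]))
    clusters[nearest].append(x)

for k in range(3):
    print("Cluster", k + 1, ":", clusters[k])   # Cluster 3 comes out empty
```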
• Time and Space Requirements:
Each iteration of the K-means algorithm requires computation of
distance between every data point and each of the K centroids. So,
the number of distance computations per iteration is O(nK). If the
algorithm takes l iterations to converge then it is O(nKl). Further
if each data point and each centroid are p-dimensional, then it is
O(nKlp). Also, it needs to store the K centroids in memory;
so, the space requirement is of O(Kp).
Leader Algorithm
Input: Dataset, X ; Distance threshold, T
Output: A Partition of X , ΠnK
1. Select the first point as the leader of the first cluster. Set K = 1.
2. Consider the next point in X and assign it to the cluster whose
leader is at a distance less than the user-specified threshold T from it.
If there is no such leader, increment the value of K and start the Kth
cluster with the current point as its leader.
3. Repeat step 2 till all the points in X are considered for clustering.
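A minimal Python sketch of the Leader algorithm follows; the distance function and the threshold value T = 3 used in the call are illustrative assumptions (the threshold used in the text is not shown in this excerpt).

```python
def leader(points, T, dist):
    """Sketch of the Leader algorithm: one pass over the data, starting a new
    cluster whenever no existing leader is within threshold T."""
    leaders = []          # one leader per cluster
    clusters = []         # clusters[i] holds the points assigned to leader i
    for x in points:
        for i, l in enumerate(leaders):
            if dist(x, l) < T:         # assign to the first close-enough leader
                clusters[i].append(x)
                break
        else:                          # no leader within T: start a new cluster
            leaders.append(x)
            clusters.append([x])
    return leaders, clusters

# The dataset of Table 7.2 with Euclidean distance and an assumed T = 3:
pts = [(1, 1), (2, 1), (2, 2), (3, 3), (6, 6), (7, 6), (7, 7), (8, 8)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(leader(pts, 3, euclid)[1])       # two clusters of four points each
```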
Note that
• Threshold Size: The value of T decides the number of clusters
generated; for a given X and a small value of T the algorithm
generates a large number of small size clusters and for a larger
value of T , the algorithm generates a small number of large size
clusters.
• Order Dependence: The order in which points in X are con-
sidered plays an important role; for different orders the resulting
partitions could be different.
We illustrate with a two-dimensional example using the dataset
shown in Table 7.2.
Table 7.2. A two-dimensional dataset.

Pattern number   Feature1   Feature2
1                1          1
2                2          1
3                2          2
4                3          3
5                6          6
6                7          6
7                7          7
8                8          8
• the first cluster is {(1, 1)t , (2, 1)t , (2, 2)t , (3, 3)t } and
• the second cluster is {(6, 6)t , (7, 6)t , (7, 7)t , (8, 8)t }.
If C = {X1, X2, ..., Xp}, then
$$\text{Centroid of } C = Centroid_C = \frac{ls}{p} = \frac{\sum_{j=1}^{p} X_j}{p},$$
where ls is the linear sum of the points in C.
Figure 7.5. Growth of the CF entries at the leaf nodes as the points are inserted; parts (a)–(d) show the partial CF-vectors (cluster size, linear sum) after successive insertions.
The points are inserted into the CF-tree as follows:
1. We consider the first data point (2, 2). It forms a cluster and
the corresponding part of the CF-vector is (1, (2, 2)) as shown
in Figure 7.5(a).
2. Now we consider the next pattern (6, 3); the only neighbor is
(2, 2) (the centroid of the cluster represented by (1, (2, 2))), which
is at a distance of approximately 4.1 units, greater than T (= 1).
So, a new cluster has to be initiated; further, the leaf node can
accommodate one more CF entry (cluster) as L = 2. So, we
create a new cluster and the corresponding partial CF-vector
(1, (6, 3)) is inserted into the leaf node as shown in Figure 7.5(b).
3. Now we consider the point (1, 2); the nearest centroid (2, 2) is at
a distance of 1 unit. So, we insert (1, 2) into the cluster with the
centroid (2, 2); the updated part of the CF-vector is (2, (3, 4))
as shown in Figure 7.5(c). Note that after the update, the
current centroid is (1.5, 2).
4. We consider the pattern (2, 1) next; it is at a distance of approx-
imately 1.1 units from (1.5, 2) (one centroid) and at a distance
of 4.5 units from the other centroid, (6, 3). So, we need to start
a new cluster; it cannot be accommodated in the existing leaf
as L = 2 and already 2 CF entries (clusters) are present in the
leaf node. So, a new leaf node is added. Next, we consider (6, 4)
which is inserted into the same cluster as (6, 3) leading to the
updated CF-vector (2, (12, 7)) as shown in Figure 7.5(d). Next
insert (7, 3) into a new cluster as none of the three existing
centroids is at a distance of less than or equal to T from (7, 3);
the new CF-vector is (1, (7, 3)) which is shown in Figure 7.5(d).
Next, we consider (1, 1) which is assigned to the same cluster
as (2, 1) and the corresponding CF-vector becomes (2, (3, 2));
this is also depicted in Figure 7.5(d).
5. Now by adding the remaining three points in the order (14, 2),
(14, 3), (2, 2), we get the final tree shown in Figure 7.4.
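The bookkeeping in the steps above can be sketched as follows. This is a simplification (not the full BIRCH algorithm): each cluster is summarized only by its CF pair (count, linear sum), and the CF-tree structure and the leaf capacity L are ignored.

```python
def insert(points, T=1.0):
    """Each cluster is a CF pair (count, linear_sum); a point joins the nearest
    cluster whose centroid is within T, otherwise it starts a new CF entry."""
    cfs = []                                   # list of (count, (sx, sy)) entries
    for (x, y) in points:
        best, best_d = None, None
        for i, (cnt, (sx, sy)) in enumerate(cfs):
            cx, cy = sx / cnt, sy / cnt        # centroid from the CF entry
            d = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= T:   # update the chosen CF entry
            cnt, (sx, sy) = cfs[best]
            cfs[best] = (cnt + 1, (sx + x, sy + y))
        else:                                   # start a new cluster
            cfs.append((1, (x, y)))
    return cfs

# The insertion order described in the steps above:
pts = [(2, 2), (6, 3), (1, 2), (2, 1), (6, 4), (7, 3), (1, 1), (14, 2), (14, 3), (2, 2)]
print(insert(pts))
```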
• Order Dependence
Like the Leader algorithm, BIRCH also suffers from order depen-
dence. Note that two copies of the point (2, 2) are assigned to
(Figure: a set of points A–H in the two-dimensional feature space (F1, F2).)
(Figure: the corresponding graph on the six nodes A–F used in the mincut example below.)
The matrix W (which is diagonal in this case) is
$$W = \begin{bmatrix} 3&0&0&0&0&0\\ 0&4&0&0&0&0\\ 0&0&3&0&0&0\\ 0&0&0&4&0&0\\ 0&0&0&0&3&0\\ 0&0&0&0&0&3 \end{bmatrix}.$$
where C1∗ and C2∗ (= V − C1∗ ) are the optimal values of C1 and C2 .
Such a C1∗ and its complement C2∗ correspond to the two required
clusters of the partition.
It is possible to abstract the mincut expression in a form suitable
for optimization by considering the following.
$$\frac{1}{2}\sum_{I_i \neq I_j} s_{ij}\,(I_i - I_j)^2. \qquad (8)$$
(Table: the eigenvectors listed in the original include (1, 1), (1, (−3 + √17)/2), (1, (3 − √17)/2), and (1, (−7 − √17)/2).)
TID i1 i2 i3 i4 i5 i6 i7 i8 i9
T1 0 0 1 0 0 1 0 0 1
T2 0 0 1 1 0 1 0 0 1
T3 0 0 1 0 0 1 1 0 1
T4 1 0 1 0 0 1 0 0 1
T5 0 0 1 0 1 1 0 0 1
T6 0 0 1 0 0 1 0 1 1
mij = (i − 1) ∗ 3 + j for i, j = 1, 2, 3.
By scanning the dataset once, we can observe the frequencies of the nine items in
the collection of patterns.
TID i1 i2 i3 i4 i5 i6 i7 i8 i9
T7 1 1 1 0 0 1 0 0 1
T8 1 1 1 1 0 1 0 0 1
T9 1 1 1 0 0 1 1 0 1
T10 1 1 1 1 0 1 0 1 1
T11 1 1 1 0 1 1 0 0 1
T12 1 1 1 0 0 1 0 1 1
(Figure: the FP-tree with root, the shared path i3: 12 → i6: 12 → i9: 12 associated with cluster C1, and the extension i1: 6 → i2: 6 associated with cluster C2.)
FIT1:  i3, i6, i9
FIT2:  i3, i6, i9
FIT3:  i3, i6, i9
FIT4:  i3, i6, i9
FIT5:  i3, i6, i9
FIT6:  i3, i6, i9
FIT7:  i3, i6, i9, i1, i2
FIT8:  i3, i6, i9, i1, i2
FIT9:  i3, i6, i9, i1, i2
FIT10: i3, i6, i9, i1, i2
FIT11: i3, i6, i9, i1, i2
FIT12: i3, i6, i9, i1, i2
3. Why Clustering?
Clustering is useful in several machine learning and data mining
tasks including data compression, outlier detection, and pat-
tern synthesis.
Note that the outlier shown in Figure 7.2 is within the range. Typical data mining
schemes for outlier detection are based on clustering or on the density of
the data in the vicinity. Once the data is clustered, one needs to
examine the small clusters for possible outliers; typically, singleton
clusters are highly likely to contain outliers.
(Figure: two clusters of patterns, marked Cluster 1 and Cluster 2, plotted in the (feature1, feature3) space.)
with (10.3, 2.1)t and (72.1, 1.2)t as the centroids. One can see two
clusters in the collection: Cluster 1 and Cluster 2. Note that pattern
3 is in Cluster 1. Now a simple and good estimate for the missing
value of feature2 of pattern 3 is the sample mean of the values of fea-
ture2 of the remaining patterns in Cluster 1. The values of feature2
of patterns falling in Cluster 1 are: 3.5, 3.3, 3.3, 3.5, 3.3; the average
value is approximately 3.4. Even though the estimated value of 3.4 is different
from the actual value of 3.5, it is close, and so this
simple scheme gives acceptable estimates.
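A small sketch (not from the text) of this imputation scheme; the pattern numbers and feature values are those of Table 7.1, with feature2 of pattern 3 treated as missing.

```python
def impute(pattern_id, feature, clusters, data):
    """Estimate a missing feature value by the mean of that feature over the
    other members of the pattern's cluster."""
    cluster = next(c for c in clusters if pattern_id in c)
    values = [data[p][feature] for p in cluster
              if p != pattern_id and data[p][feature] is not None]
    return sum(values) / len(values)

data = {1: [10, 3.5, 2.0], 3: [10.4, None, 2.1], 4: [10.3, 3.3, 2.0],
        7: [10.4, 3.3, 2.3], 9: [10.4, 3.5, 2.3], 10: [10.5, 3.3, 2.1]}
cluster1 = {1, 3, 4, 7, 9, 10}                    # Cluster 1 found earlier
print(round(impute(3, 1, [cluster1], data), 2))   # about 3.38, i.e. ~3.4
```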
In the case of missing values, we are synthesizing a part or esti-
mating the feature value of a pattern. However, there could be appli-
cations where the entire pattern has to be synthesized. Classification
based on a small set of training patterns requires such synthesis. This
is because the number of training patterns required increases with
the dimensionality of the dataset for a good classification.
Clustering could be used for such synthesis; specifically, cluster
representatives could be used to synthesize patterns as follows.
For example, consider the two clusters shown in Figure 7.11. The cen-
troid of Cluster 1 is (10.33, 2.15)t and that of Cluster 2 is (72.1, 1.2)t .
Now we can perturb these centroids to generate new patterns. For
example, by adding a small value of 0.2 to the value of feature1 and
−0.1 to the value of feature3, we obtain a new pattern (10.53, 2.05)t
from the first centroid. In a similar manner, we can generate patterns
of Cluster 2 by randomly perturbing the values of its centroid.
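A minimal sketch (not from the text) of this perturbation-based synthesis; the perturbation range of ±0.2 is an illustrative assumption.

```python
import random

def synthesize(centroid, n, scale=0.2):
    """Generate n synthetic patterns by adding small uniform noise to a centroid."""
    return [tuple(c + random.uniform(-scale, scale) for c in centroid)
            for _ in range(n)]

print(synthesize((10.33, 2.15), 3))   # three synthetic patterns near Cluster 1
print(synthesize((72.1, 1.2), 3))     # three synthetic patterns near Cluster 2
```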
It is possible to combine patterns in a cluster to generate addi-
tional samples. An algorithm for this is:
Pattern number   Feature1   Feature2   Feature3   Feature4   Cluster number
1                1          1          1          1          1
2                2          2          2          2          1
3                6          6          6          6          2
4                7          7          7          7          2
5                1          1          2          2          1
6                2          2          1          1          1
7                6          6          7          7          2
8                7          7          6          6          2
Pattern number   feature1   feature2   feature3   Class label
1                10         3.5        2.0        1
2                63         5.4        1.3        2
3                10.4       3.5        2.1        1
4                10.3       3.3        2.0        1
5                73.5       5.8        1.2        2
6                81         6.1        1.3        2
7                10.4       3.3        2.3        1
8                71         6.4        1.0        2
9                10.4       3.5        2.3        1
10               10.5       3.3        2.1        1
When there are a large number of training patterns, the time taken to compute all the distances will be large.
We can reduce this effort by clustering the datasets of each class sep-
arately and use the prototypes or representatives of clusters instead
of the entire training data.
In this example, by clustering patterns in each class separately
using the K-means algorithm with K = 2, we get the following
clusters:
• Class 1:
– Cluster11 = {1}; Centroid11 = (10, 3.5, 2.0)t
– Cluster12 = {3, 4, 7, 9, 10}; Centroid12 = (10.4, 3.4, 2.2)t
• Class 2:
– Cluster21 = {2, 8}; Centroid21 = (67, 5.9, 1.2)t
– Cluster22 = {5, 6}; Centroid22 = (77.2, 6, 1.2)t
Now the cluster centroid nearest to T is Centroid21 which is at a
squared Euclidean distance of 9.4. So, we assign T to Class 2 as the
nearest cluster centroid is from class 2. Here, we need to compute
only four distances from the test pattern T as there are only four
centroids. Note that clustering of the training data needs to be done
only once and it can be done beforehand (offline). Also note that
clustering is done once and centroids of the clusters are obtained. The
same centroids could be used to classify any number of test patterns.
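A sketch (not from the text) of classification by the nearest cluster centroid using the four centroids above; the test pattern T used here is hypothetical, since the actual test pattern is not reproduced in this excerpt.

```python
def classify(test, centroids):
    """centroids: list of (centroid, class_label); returns the class of the
    centroid with the smallest squared Euclidean distance to `test`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sq_dist(test, c[0]))[1]

centroids = [((10.0, 3.5, 2.0), 1), ((10.4, 3.4, 2.2), 1),
             ((67.0, 5.9, 1.2), 2), ((77.2, 6.0, 1.2), 2)]
T = (64.0, 5.5, 1.2)            # hypothetical test pattern
print(classify(T, centroids))   # 2 (nearest centroid is Centroid21)
```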
Di − Ri < Ds , (11)
where
Di = Distance from the centroid of Ci to the boundary of the SVM
obtained,
Ri = Radius of cluster Ci ,
Ds = Distance from support centroid to the boundary of the SVM.
• Obtain the SVM using the patterns that are accumulated.
• Repeat this expansion and SVM training till no additional patterns
are accumulated.
Negative (−ve) Class: C1− = {(−1, 3)t, (1, 3)t}; C2− = {(2, 1)t};
C3− = {(−3, −2)t, (−1, −2)t}.
Pattern number   x1    x2    Class
1                −1    3     −
2                1     3     −
3                2     1     −
4                4     8     +
5                4     10    +
6                5     5     +
7                6     3     +
8                −3    −2    −
9                −1    −2    −
10               10    4     +
11               10    6     +
The centroids of these clusters are: (0, 3)t , (2, 1)t , and (−2, −2)t
respectively.
Positive (+ve) Class: C1+ = {(4, 8)t, (4, 10)t}; C2+ = {(5, 5)t};
C3+ = {(6, 3)t}; C4+ = {(10, 4)t, (10, 6)t}.
The corresponding centroids, respectively, are (4, 9)t, (5, 5)t, (6, 3)t,
and (10, 5)t.
• Obtain the Linear SVM using the seven centroids. The support
centroids are (2, 1)t , (5, 5)t , (6, 3)t . Expanding them will not add
any more patterns as in this simple case each of these clusters is a
singleton cluster. The corresponding W and b of the SVM are:
$$W = \left(\frac{2}{5}, \frac{1}{5}\right)^t, \quad \text{and} \quad b = -2.$$
• The distance of a point X = (x1, x2)t from the decision boundary
is given by $\frac{|W^t X + b|}{\|W\|}$. So, the distance of the support centroid (2, 1)t
from the decision boundary is √5, and similarly the distances of the
remaining two support centroids are also √5 each.
• For the remaining cluster centroids the distances are:
1. (0, 3)t: distance 7/√5
2. (4, 9)t: distance 7/√5
3. (−2, −2)t: distance 16/√5
4. (10, 5)t: distance 3√5.
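These distances can be checked with a few lines of Python (not from the text), using W = (2/5, 1/5)t and b = −2 as above.

```python
import math

W = (2 / 5, 1 / 5)
b = -2
norm_W = math.sqrt(W[0] ** 2 + W[1] ** 2)          # = 1/sqrt(5)

def distance(x):
    """Distance of x from the decision boundary |W.x + b| / ||W||."""
    return abs(W[0] * x[0] + W[1] * x[1] + b) / norm_W

for c in [(2, 1), (5, 5), (6, 3), (0, 3), (4, 9), (-2, -2), (10, 5)]:
    print(c, round(distance(c), 3))
# (2,1), (5,5), (6,3) -> sqrt(5) ~ 2.236;  (0,3), (4,9) -> 7/sqrt(5) ~ 3.130;
# (-2,-2) -> 16/sqrt(5) ~ 7.155;  (10,5) -> 3*sqrt(5) ~ 6.708
```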
Pattern   f1   f2
P1        0    0
P2        0    1
P3        1    1
P4        1    0

Table 7.13. All eight boolean functions over the first three object types.

Pattern   f1   f2   f1∧¬f2   ¬f2   ¬f1∧f2 (g2)   ¬f1 (g1)   f1∨¬f2   ¬f1∨f2
P1        0    0    0        1     0             1          1        1
P2        0    1    0        0     1             1          0        1
P3        1    1    0        0     0             0          1        1
For the sake of simplicity we consider the first three object types.
Now considering all possible boolean functions we have the data
shown in Table 7.13. If we had considered all the four object types
we would have got 16 boolean functions; instead we considered only
three types to have only eight boolean functions. We have chosen the
first three types; it will lead to a similar argument ultimately even if
we consider any other three types.
In Table 7.13, we have considered all possible boolean functions
which are eight in this case. In this representation, between any pair
of patterns exactly four predicates (boolean functions) differ. So,
distance or similarity based on this matching between any pair of
patterns is the same. One may argue that f1 and f2 are primitive
because they are given and the others are derived. However, it is possible to
argue that g1 = ¬f1 (the negation of f1) and g2 = ¬f1 ∧ f2
can be considered primitive, and then f1 = ¬g1 and f2 = ¬g1 ∨ g2, which means
we can derive f1 and f2 from g1 and g2. So, it is not possible to fix
some features as more primitive than others.
some as more primitive than others. Further, for the machine it does
not matter which is more primitive. This means we need to consider
all possible predicates (boolean functions) and the discrimination is
$$f_i = \frac{C}{r_i}, \qquad (12)$$
Proximity(Xi , Xj ) = f (Xi , Xj ),
5. Combination of Clusterings
One of the more recent trends in clustering is to obtain a single
partition by using multiple partitions of the data. Here, a generic
framework is:
Pattern   Feature1   Feature2
P1        1          1
P2        2          2
P3        3          3
P4        4          4
P5        5          5
P6        5          1
P7        6          2
P8        7          3
P9        8          4
P10       9          5
Figure 7.12. The ten two-dimensional patterns; they form two chain-like clusters.
K   Initial centroids            Resulting partition
2   (1,1), (9,5)                 {{(1,1),(2,2),(3,3),(4,4),(5,1)}, {(5,5),(6,2),(7,3),(8,4),(9,5)}}
2   (3,3), (7,3)                 {{(1,1),(2,2),(3,3),(4,4),(5,5)}, {(5,1),(6,2),(7,3),(8,4),(9,5)}}
2   (4,4), (6,2)                 {{(1,1),(2,2),(3,3),(4,4),(5,5)}, {(5,1),(6,2),(7,3),(8,4),(9,5)}}
2   (1,1), (5,1)                 {{(1,1),(2,2),(3,3)}, {(4,4),(5,5),(5,1),(6,2),(7,3),(8,4),(9,5)}}
4   (1,1), (5,1), (5,5), (9,5)   {{(1,1),(2,2),(3,3)}, {(4,4),(5,5)}, {(5,1),(6,2),(7,3)}, {(8,4),(9,5)}}
3   (1,1), (6,2), (9,5)          {{(1,1),(2,2),(3,3)}, {(5,1),(6,2),(4,4),(5,5),(7,3)}, {(8,4),(9,5)}}
3   (3,3), (6,2), (9,5)          {{(1,1),(2,2),(3,3),(4,4),(5,5)}, {(5,1),(6,2),(7,3)}, {(8,4),(9,5)}}
Pattern (1,1) (2,2) (3,3) (4,4) (5,5) (5,1) (6,2) (7,3) (8,4) (9,5)
(1,1) 7 7 7 4 3 1 0 0 0 0
(2,2) 7 7 7 4 3 1 0 0 0 0
(3,3) 7 7 7 4 3 1 0 0 0 0
(4,4) 4 4 4 7 6 3 2 2 1 1
(5,5) 3 3 3 6 7 2 2 2 1 1
(5,1) 1 1 1 3 2 7 6 6 4 4
(6,2) 0 0 0 2 2 6 7 7 5 5
(7,3) 0 0 0 2 2 6 7 7 4 4
(8,4) 0 0 0 1 1 4 5 4 7 7
(9,5) 0 0 0 1 1 4 5 4 7 7
For example, an entry of 3 indicates that the corresponding pair of points is placed in the same cluster in three out of the seven partitions. Using these counts we compute Sij
for all possible pairs and show the resultant matrix S in Table 7.16.
Using the SLA we get a two-partition based on the following steps:
• Merge pairs of points and form clusters based on the largest simi-
larity value of 7 between each pair of points. The clusters are:
{(1, 1), (2, 2), (3, 3)}, {(6, 2), (7, 3)}, {(8, 4), (9, 5)}, {(5, 1)},
{(4, 4)}, {(5, 5)}.
• Merging clusters based on the next largest similarity value of 6 gives:
{(1, 1), (2, 2), (3, 3)}, {(5, 1), (6, 2), (7, 3)}, {(8, 4), (9, 5)}, {(4, 4), (5, 5)}.
• Merging further at the similarity value of 5 gives:
{(1, 1), (2, 2), (3, 3)}, {(5, 1), (6, 2), (7, 3), (8, 4), (9, 5)}, {(4, 4), (5, 5)}.
• Finally, merging at the similarity value of 4 gives the two-partition:
{(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}, {(5, 1), (6, 2), (7, 3), (8, 4), (9, 5)}.
Now we have two clusters. So, we stop here. Note that the base
algorithm used here in the form of K-means cannot generate the
resulting clusters using a single application because clusters in
Figure 7.12 are chain like clusters.
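A compact sketch (not from the text) of this combination scheme: accumulate co-association counts Sij over several partitions and then merge clusters single-link style on those counts. The tiny dataset in the usage example is illustrative.

```python
from itertools import combinations

def co_association(partitions, points):
    """Count, for every pair of points, how many partitions place them together."""
    S = {frozenset(p): 0 for p in combinations(points, 2)}
    for partition in partitions:
        for cluster in partition:
            for p in combinations(cluster, 2):
                S[frozenset(p)] += 1
    return S

def single_link(points, S, k):
    """Merge the pair of clusters with the largest member-to-member count
    until only k clusters remain."""
    clusters = [{p} for p in points]
    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: max(S[frozenset((a, b))]
                                      for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

pts = ['A', 'B', 'C', 'D']
parts = [[{'A', 'B'}, {'C', 'D'}], [{'A', 'B', 'C'}, {'D'}], [{'A'}, {'B', 'C', 'D'}]]
S = co_association(parts, pts)
print(single_link(pts, S, 2))   # two clusters: {A, B, C} and {D}
```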
Research Ideas
1. In Section 1, the number of hard partitions of a set of n patterns into K clusters
is discussed. How do we control the number of such partitions? Do divide-
and-conquer based algorithms help?
Relevant References
(a) M. R. Anderberg, Cluster Analysis for Applications. New York: Academic
Press, 1973.
(b) M. N. Murty and G. Krishna, A computationally efficient technique for
data-clustering. Pattern Recognition, 12(3):153–158, 1980.
(c) S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O'Callaghan, Clus-
tering data streams: Theory and practice. IEEE Transactions on Knowledge
and Data Engineering, 15(3):515–528, 2003.
(d) C.-J. Hsieh, S. Si and I. Dhillon, A divide-and-conquer solver for kernel
support vector machines. In Proceedings of ICML, 2014.
2. In addition to divide-and-conquer which other approaches help in reducing
the number of partitions being considered? Is it good to consider incremental
algorithms?
Relevant References
(a) H. Spath, Cluster Analysis Algorithms for Data Reduction and Classifica-
tion of Objects. London: E. Horwood, 1980.
(b) T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An efficient data clus-
tering method for very large databases. Proceedings of SIGMOD, 1996.
(c) V. Garg, Y. Narahari and M. N. Murty, Novel biobjective clustering (BIGC)
based on cooperative game theory. IEEE Transactions on Knowledge and
Data Engineering, 25(5):1070–1082, 2013.
3. Incremental algorithms are order-dependent. Which properties does an incre-
mental algorithm need to satisfy so as to be order-independent?
Relevant References
(a) B. Shekar, M. N. Murty and G. Krishna, Structural aspects of semantic-
directed clusters. Pattern Recognition, 22(1):65–74, 1989.
(b) L. Rokach and O. Maimon, Clustering methods. In Data Mining and
Knowledge Discovery Handbook, O. Z. Maimon and L. Rokach (eds.).
New York: Springer, 2006.
4. Is it possible to characterize the order-dependence property of the Leader
algorithm as follows?
Relevant References
(a) B. Shekar, M. N. Murty and G. Krishna, A knowledge-based clustering
scheme. Pattern Recognition Letters, 5(4):253–259, 1987.
(b) J. Kleinberg, An impossibility theorem for clustering. Proceedings of
NIPS, 2002.
(c) S. Ben-David and M. Ackerman, Measures of clustering quality: A work-
ing set of axioms for clustering. Proceedings of NIPS, 2008.
6. In Section 2, both partitional and hierarchical clustering algorithms are con-
sidered. How does one hybridize these approaches?
Relevant References
(a) M. N. Murty and G. Krishna, A hybrid clustering procedure for concentric
and chain-like clusters. International Journal of Parallel Programming,
10(6):397–412, 1981.
(b) S. Zhong and J. Ghosh, A unified framework for model-based clustering.
Journal of Machine Learning Research, 4:1001–1037, 2003.
(c) L. Kankanala and M. N. Murty, Hybrid approaches for clustering. Pro-
ceedings of PREMI, LNCS 4815:25–32, 2007.
Relevant References
(a) N. Mishra, R. Schreiber, I. Stanton and R. E. Tarjan, Clustering social
networks. Proceedings of WAW, LNCS 4863:56–67, 2007.
(b) U. von Luxburg, A tutorial on spectral clustering. Statistics and Computing,
17(4):395–416, 2007.
(c) M. C. V. Nascimento and A. C. P. L. F. de Carvalho, Spectral methods
for graph clustering – A survey. European Journal of Operational Research,
211:221–231, 2011.
8. Frequent itemsets have been used successfully in both classification and clus-
tering. What is the reason for frequent itemsets to be useful in clustering?
Relevant References
(a) S. Mimaroglu and D. A. Simovic, Clustering and approximate identifica-
tion of frequent item sets. Proceedings of FLAIRS, 2007.
(b) H. Cheng, X. Yan, J. Han and P. S. Yu, Direct discriminative pattern mining
for effective classification. Proceedings of ICDE, 2008.
(c) G. V. R. Kiran and V. Pudi, Frequent itemset based hierarchical document
clustering using Wikipedia as external knowledge. LNCS, 6277:11–20,
2010.
(d) A. Kiraly, A. Gyenesei and J. Abonyi, Bit-table based biclustering and
frequent closed itemset mining in high-dimensional binary data. The Sci-
entific World Journal, 2014. http://dx.doi.org/10.1155/2014/870406.
9. Clustering is a data compression tool. How to exploit this feature? Are there
better schemes for compression? Can we cluster compressed data?
Relevant References
(a) R. Cilibrasi and P. M. B. Vitányi, Clustering by compression. IEEE Trans.
on Information Theory, 51(4):1523–1545, 2005.
(b) T. R. Babu, M. N. Murty and S. V. Subrahmanya, Compression Schemes
for Mining Large Datasets: A Machine Learning Perspective. New York:
Springer, 2013.
Relevant References
(a) P. Viswanath, M. N. Murty and S. Bhatnagar, Pattern synthesis for non-
parametric pattern recognition. Encyclopedia of Data Warehousing and
Mining: 1511–1516, 2009.
(b) M. Agrawal, N. Gupta, R. Shreelekshmi and M. N. Murty, Efficient
pattern synthesis for nearest neighbour classifier. Pattern Recognition,
38(11):2200–2203, 2005.
(c) H. Seetha, R. Saravanan and M. N. Murty, Pattern synthesis using multiple
Kernel learning for efficient SVM classification. Cybernetics and Infor-
mation Technologies, 12:77–94, 2012.
11. Clustering is usually associated with grouping unlabeled patterns. What is the
advantage in clustering labeled patterns?
Relevant References
(a) V. Sridhar and M. N. Murty, A knowledge-based clustering algorithm.
Pattern Recognition Letters, 12(8):511–517, 1991.
(b) M. Grbovic, N. Djuric, S. Guo and S. Vucetic, Supervised clustering of
label ranking data using label preference information. Machine Learning,
93(2–3):191–225, 2013.
12. How to incorporate knowledge from multiple sources to perform clustering
better?
Relevant References
(a) M. N. Murty and A. K. Jain, Knowledge-based clustering scheme for
collection management and retrieval of library books. Pattern Recognition,
28(8):949–963, 1995.
(b) X. Hu, X. Zhang, C. Lu, E. K. Park and X. Zhou, Exploiting Wikipedia as
external knowledge for document clustering. Proceedings of KDD, 2009.
Relevant References
(a) A. K. Jain, Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31(8):651–666, 2010.
(b) S. Vega-Pons and J. Ruiz-Shulcloper, A survey of clustering ensemble
algorithms. International Journal of Pattern Recognition and Artificial
Intelligence, 25(3):337–372, 2011.
(c) T. R. Babu, M. N. Murty and V. K. Agrawal, Adaptive boosting with leader
based learners for classification of large handwritten data. Proceedings of
HIS, 2004.
Chapter 8
Soft Clustering
Clustering has been one of the most popular tools in data mining.
Even though hard clustering has been widely studied and traditionally
used, there are several important applications where soft
partitioning is essential. Some of these applications include text min-
ing and social networks. Soft partitioning is concerned with assigning
a document in text mining or a person in a social network to more
than one cluster. For example, the same document may belong to
both sports and politics. In several cases softness has to be appro-
priately interpreted to make a hard decision; such a process permits
us to delay the decision making. So, one of the major characteristics
of softness is in delaying decision making as far as possible. Another
notion is to acknowledge the assignment of a pattern to more than
one category.
We can work out the number of soft partitions of n patterns into
K clusters as follows:
• So for each row in CIM we have 2^K − 1 choices; for all the n rows
the number of possibilities is (2^K − 1)^n, which is of O(2^{Kn}). This
is an upper bound because no column in CIM can be empty.
• Instead of storing a 1 or a 0, if we store one of P possible val-
ues to indicate the extent to which Xi belongs to Cj, then the
number of possibilities is bounded by (P^K − 1)^n, as the number
of distinct values each entry in CIM can assume is P. In some of
the soft clustering algorithms the value of P could be very large;
theoretically it could be infinite.
2. Fuzzy Clustering
Typically each cluster is abstracted using one or more representa-
tive patterns. For example, centroid and leader are popularly used
representatives of a cluster. Let Ri be the set of representatives of
where the quantity being minimized is
$$\sum_{i=1}^{K}\sum_{j=1}^{n} (\mu_{ij})^{M}\, \|X_j - R_i\|^2.$$
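As an illustration, here is a minimal sketch (not from the text) of the standard fuzzy c-means updates that minimize an objective of this form, assuming Euclidean distance, single-point representatives Ri and fuzzifier M = 2; the two-dimensional data are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, K, M=2.0, iters=100, eps=1e-9):
    """Alternate the membership and centroid updates that minimize
    sum_i sum_j (mu_ij)^M ||X_j - R_i||^2."""
    n = X.shape[0]
    mu = np.random.dirichlet(np.ones(K), size=n).T          # K x n memberships
    for _ in range(iters):
        R = (mu ** M) @ X / (mu ** M).sum(axis=1, keepdims=True)   # centroids
        d = np.linalg.norm(X[None, :, :] - R[:, None, :], axis=2) + eps
        mu = 1.0 / (d ** (2 / (M - 1)))
        mu /= mu.sum(axis=0, keepdims=True)                 # normalize over clusters
    return R, mu

X = np.array([[1., 1.], [2., 1.], [2., 2.], [3., 3.],
              [6., 6.], [7., 6.], [7., 7.], [8., 8.]])
R, mu = fuzzy_c_means(X, K=2)
print(np.round(R, 2))        # two fuzzy cluster centroids
```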
Figure 8.1. Examples of patterns from two classes. (a) A Collection of Ones.
(b) A Collection of Sevens.
3. Rough Clustering
A rough set is characterized using the notion of the indiscernibility
relation, which is defined based on equivalence classes of objects.
Here, some patterns are known to definitely belong to a cluster.
For example consider the patterns shown in Figure 8.1. There are
two classes of character patterns in the figure; some are of charac-
ter 1 (shown in Figure 8.1(a)) and others are of character 7 (shown
in Figure 8.1(b)). There are four patterns that definitely belong to
1 and four patterns that definitely belong to 7. In the case of the ones there
are three equivalence classes bounded by circles; the characters in each
circle are similar to each other. Elements in the first two equivalence
classes (considered from left to right) shown in Figure 8.1(a) definitely
belong to the class of 1s. Similarly, in Figure 8.1(b) there are three
equivalence classes of sevens and these are bounded by rectangles
in the figure; all of them definitely belong to the class of 7s. However,
there is one equivalence class (the third) in Figure 8.1(a) where the patterns
are either ones or sevens; these two patterns are possibly ones
or sevens.
Patterns in these equivalence classes are indiscernible; note that
the class of 1s consists of the union of the two indiscernible equivalence classes
and similarly all the three equivalence classes in Figure 8.1(b) are also
$$\underline{F}S = \bigcup_i E_i, \ \text{ where } E_i \subseteq S, \qquad (1)$$
$$\overline{F}S = \bigcup_i E_i, \ \text{ where } E_i \cap S \neq \phi. \qquad (2)$$
$$d(X, R_j) = \min_{1 \le i \le K} d(X, R_i).$$
The above two steps are iteratively used to realize the Rough
K-means algorithm.
We briefly explain below how various steps listed above are realized
in clustering a collection of n patterns. For the sake of illustration
we consider a two-dimensional dataset: X = A : (1, 1)t ; B : (1, 2)t ;
C : (2, 2)t ; D : (6, 2)t ; E : (7, 2)t ; F : (6, 6)t ; G : (7, 6)t .
1. Initialization:
1 1 1 0 0 0 0; 0 0 0 1 1 0 0; 0 0 0 0 0 1 1
1 1 1 0 0 0 0; 1 0 0 1 1 0 0; 0 0 0 0 1 1 1
1 1 1 0 0 0 0; 1 0 0 1 1 0 0; 0 0 0 0 1 1 1,
Fitness = 1/(squared error).

String    Centroid1    Centroid2    Centroid3    Squared error    Fitness
Input     (1.5, 1.5)   (6.5, 3.0)   (6.5, 6.0)   4.25             4/17
Output    (1.5, 1.5)   (6.5, 2.0)   (6.5, 6.0)   2.25             4/9
6. Statistical Clustering
K-means algorithm is a well-known hard partitional clustering algo-
rithm where we use the winner-take-all strategy. It is possible to have
an overlapping version of the K-means algorithm that generates a
covering instead of a partition. In a covering we have the clusters
C1 , C2 , . . . , CK of the given dataset X satisfying the following:
$$\bigcup_{i=1}^{K} C_i = \mathcal{X}, \quad C_i \neq \phi \ \ \forall i, \quad \text{and} \quad \forall X_i \in \mathcal{X}\ \exists\, C_j \ \text{s.t.}\ X_i \in C_j.$$
(d) Similarly for the other patterns we have the lists as follows:
$$M_k^* = \frac{1}{\sum_{X_i \in C_k} w_i} \sum_{X_i \in C_k} w_i \cdot C_k^i,$$
where $w_i = \frac{1}{|L_i|^2}$ and $C_k^i = |L_i|\,X_i - \sum_{M_j \in L_i - \{M_k\}} M_j$.
5. Now the updated centroids are
$$M_1^{(1)} = \left(\frac{4}{3}, \frac{5}{3}\right), \quad M_2^{(1)} = \left(\frac{19}{3}, \frac{4}{3}\right), \quad M_3^{(1)} = \left(\frac{23}{3}, \frac{8}{3}\right).$$
{(1, 1), (1, 2), (2, 2)}; {(6, 1), (6, 3), (8, 1)}; {(8, 3), (6, 3), (8, 1)}.
Note that (6,3) and (8,1) belong to two clusters leading to a soft
partition.
9. However, initial centroid selection is important here. If the initial
centroids chosen are (1, 2); (6, 1); (8, 3) then the clusters obtained
using this algorithm are:
{(1, 1), (1, 2), (2, 2)}; {(6, 1), (6, 3)}; {(8, 3), (8, 1)} which is a hard
partition.
(7)–(6) gives us
left-hand side of (6) is ≤ left-hand side of (7) which shows that f (t)
is convex.
for $1 \le i \le K$ and $\sum_{i=1}^{K} \alpha_i = 1$. (9)
Using the base case we can show that the expression in (13) is
$$\le (1 - \alpha_{l+1})\, f\!\left(\sum_{i=1}^{l} \frac{\alpha_i}{(1-\alpha_{l+1})}\, t_i\right) + \alpha_{l+1} f(t_{l+1})$$
$$\le (1 - \alpha_{l+1}) \sum_{i=1}^{l} \frac{\alpha_i}{(1-\alpha_{l+1})}\, f(t_i) + \alpha_{l+1} f(t_{l+1})$$
$$= \sum_{i=1}^{l+1} \alpha_i f(t_i) = \text{right-hand side of (12)},$$
thus proving the result.
Note that if f(x) = ln(x) then f''(x) = −1/x², which is negative for
x > 0. So, ln(x) is concave and −ln(x) is convex.
We can rewrite Eq. (4) by multiplying and dividing by p(z|X , θi ),
where θi is the estimate of θ at the ith step of the iterative scheme
that we need to use, as
$$l(\theta) = \ln \sum_{z} p(z|\mathcal{X}, \theta_i)\, \frac{p(\mathcal{X}, z|\theta)}{p(z|\mathcal{X}, \theta_i)}. \qquad (14)$$
By using Jensen's inequality seen earlier, we can show that the
above is
$$\ge \sum_{z} p(z|\mathcal{X}, \theta_i) \ln \frac{p(\mathcal{X}, z|\theta)}{p(z|\mathcal{X}, \theta_i)}. \qquad (15)$$
6.2.2. An example
Let us consider a one-dimensional example with two clusters. Let the
dataset be
(18)
$$\mu_j = \frac{\sum_{i=1}^{4} P_{ij} X_i}{\sum_{i=1}^{4} P_{ij}}. \qquad (19)$$
So, in this case, Eqs. (20) and (19) characterize the Expectation
and Maximization steps and they are repeated iteratively till
convergence.
Let θ0 = (2, 4)t be the initial selection of the µs. Let us consider
the computation of P11. By using (20) and σ = 1, it is given by
$$P_{11} = \frac{\exp\left(-\frac{1}{2}(2.1-2)^2\right)}{\exp\left(-\frac{1}{2}(2.1-2)^2\right) + \exp\left(-\frac{1}{2}(2.1-4)^2\right)} = 0.728.$$
(Tables 8.4–8.6: the values of P1j, P2j, P3j, P4j and µj over the first three iterations.)
The resulting values are shown in Table 8.4. So, at the end of the first iteration we get µ1 = 2.76 and
µ2 = 4.53. So, θ1 = (2.76, 4.53)t . Using this value of θ, the correspond-
ing values of the parameters are given in Table 8.5. Table 8.6 shows
the parameters in the third iteration. After some more iterations
we expect the value of θ = (2, 5) to be reached. The corresponding
densities are shown in Figure 8.2.
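A sketch (not from the text) of these EM iterations for a one-dimensional two-component mixture with unit variance and equal priors; the data values are hypothetical, since the dataset of Eq. (18) is not reproduced in this excerpt.

```python
import numpy as np

X = np.array([2.1, 2.5, 4.2, 5.1])            # hypothetical one-dimensional data
mu = np.array([2.0, 4.0])                     # initial theta_0 = (2, 4)

for step in range(25):
    # E-step: responsibilities P_ij = P(cluster j | X_i), unit variance, equal priors
    w = np.exp(-0.5 * (X[:, None] - mu[None, :]) ** 2)
    P = w / w.sum(axis=1, keepdims=True)
    # M-step: mu_j = sum_i P_ij X_i / sum_i P_ij   (Eq. (19))
    mu = (P * X[:, None]).sum(axis=0) / P.sum(axis=0)

print(np.round(mu, 2))                         # the two estimated means
```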
7. Topic Models
One of the important areas of current interest is large-scale document
processing. It has a variety of applications including:
• Web pages: These are easily the most popular type of semi-
structured documents. Every search engine crawls and indexes web
pages for possible information retrieval; typically search engines
return, for a query input by the user, a collection of documents
in a ranked manner as output. Here, in addition to the content
Pattern number   Feature1   Feature2   Feature3   Feature4   Label
1                1          1          1          1          1
2                1          2          1          1          1
3                1          1          2          2          1
4                1          2          2          2          1
5                6          6          6          6          2
6                6          7          6          6          2
7                6          6          7          7          2
8                6          7          7          7          2

Pattern number   Feature1   Feature2   Feature3   Feature4   Label
1                1          1          1          1          1
2                1          2          2          2          1
1                6          6          6          6          2
2                6          7          7          7          2
$$XX^t\beta = \lambda\beta,$$
$$X^t(XX^t\beta) = X^t(\lambda\beta).$$
So,
$$X^tX\gamma = \lambda\gamma,$$
$$X = B\,D\,C.$$
The eigenvectors of $X^tX$ are $\binom{1}{-1}$ and $\binom{1}{1}$.
The corresponding normalized vectors are
$$\begin{pmatrix}\frac{1}{\sqrt{2}}\\ -\frac{1}{\sqrt{2}}\end{pmatrix} \quad \text{and} \quad \begin{pmatrix}\frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}}\end{pmatrix}.$$
In a similar manner it is possible to observe that the matrix
$XX^t$ is
$$\begin{pmatrix} 2 & 1 & -1\\ 1 & 1 & 0\\ -1 & 0 & 1\end{pmatrix}.$$
The corresponding eigenvalues are 3, 1, and 0, and the eigenvectors
corresponding to the non-zero eigenvalues, after normalization, are
$$\begin{pmatrix}\frac{2}{\sqrt{6}}\\ \frac{1}{\sqrt{6}}\\ -\frac{1}{\sqrt{6}}\end{pmatrix} \quad \text{and} \quad \begin{pmatrix}0\\ \frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}}\end{pmatrix}.$$
The entries of the diagonal matrix D are the square roots of the
eigenvalues, that is, the singular values √3 and 1. So, the various matrices are
$$B = \begin{pmatrix} \frac{2}{\sqrt{6}} & 0\\[2pt] \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}}\\[2pt] -\frac{1}{\sqrt{6}} & \frac{1}{\sqrt{2}} \end{pmatrix}, \quad D = \begin{pmatrix} \sqrt{3} & 0\\ 0 & 1 \end{pmatrix}, \quad C = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}}\\[2pt] \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}.$$
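The factorization can be checked numerically. The matrix consistent with the factors above (an inference, since X itself is not shown in this excerpt) is X = [[1, −1], [1, 0], [0, 1]]; the sketch below (not from the text) verifies its singular values and the reconstruction X = BDC.

```python
import numpy as np

X = np.array([[1., -1.], [1., 0.], [0., 1.]])      # inferred from the factors
B, d, C = np.linalg.svd(X, full_matrices=False)
print(np.round(d, 4))                      # [1.7321 1.    ]  i.e. sqrt(3) and 1
print(np.allclose(B @ np.diag(d) @ C, X))  # True: the product reconstructs X
print(np.round(X @ X.T, 2))                # [[2, 1, -1], [1, 1, 0], [-1, 0, 1]]
```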
The sample mean of the 4 data points is $\binom{4}{4}$. The zero-mean normalized set of points is
$$\begin{pmatrix} -3 & -2 & 2 & 3\\ -2 & -3 & 3 & 2 \end{pmatrix}.$$
The sample covariance matrix is
$$\frac{1}{4}\left[(-3,-2)^t(-3,-2) + (-2,-3)^t(-2,-3) + (2,3)^t(2,3) + (3,2)^t(3,2)\right] = \frac{1}{4}\begin{pmatrix} 26 & 24\\ 24 & 26 \end{pmatrix} = \begin{pmatrix} 6.5 & 6\\ 6 & 6.5 \end{pmatrix}.$$
The eigenvalues of the matrix are 12.5 and 0.5 and the respective eigenvectors are $\binom{1}{1}$ and $\binom{1}{-1}$. After normalization we get the unit-norm orthogonal eigenvectors
$$\begin{pmatrix} \frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}} \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} \frac{1}{\sqrt{2}}\\ -\frac{1}{\sqrt{2}} \end{pmatrix}.$$
The four data points and the corresponding principal components are shown in Figure 8.3.
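A short numerical check (not from the text): adding the mean (4, 4) back to the zero-mean points gives the data points (1, 2), (2, 1), (6, 7), (7, 6); the sketch below recomputes the covariance matrix, its eigenvalues and eigenvectors.

```python
import numpy as np

X = np.array([[1., 2.], [2., 1.], [6., 7.], [7., 6.]])
Z = X - X.mean(axis=0)                      # zero-mean data
cov = Z.T @ Z / 4                           # sample covariance (divide by n)
vals, vecs = np.linalg.eigh(cov)
print(np.round(cov, 2))                     # [[6.5 6. ] [6.  6.5]]
print(np.round(vals, 2))                    # [ 0.5 12.5]
print(np.round(vecs, 3))                    # columns proportional to (1,-1) and (1,1)
```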
(Figure 8.3: the four data points and the principal component directions PC1 and PC2.)
Note that the rows of the matrix C correspond to the principal com-
ponents of the data. Further, the covariance matrix is proportional
to X t X given by
$$X^tX = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\[2pt] \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 50 & 0\\ 0 & 2 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\[2pt] \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix}.$$
In most of the document analysis applications, it is not uncommon
to view a document collection as a document-term matrix, X as
specified earlier. Typically such a matrix is large in size; in a majority
of the practical applications the number of documents is large and
the number of terms in each document is relatively small. However,
such data is high-dimensional and so the matrix can be very sparse.
This is because even though the number of terms in a document is
small, the total number of distinct terms in the collection could be
very large; out of which a small fraction of terms appear in each of the
documents which leads to sparsity. This means that dimensionality
reduction is essential for applying various classifiers effectively.
Observe that the eigenvectors of the covariance matrix are the
principal components. Each of these is a linear combination of the
terms in the given collection. So, instead of considering all possible
eigenvectors, only a small number of the principal components are
considered to achieve dimensionality reduction. Typically the num-
ber of terms could vary between 10,000 and 1 million, whereas
the number of principal components considered could be between
10 and 100. One justification is that people use a small number of
topics in a given application context; they do not use all the terms
in the given collection.
Latent semantic analysis involves obtaining topics that are latent
and possibly semantic. It is based on obtaining latent variables in
the form of linear combinations of the original terms. Note that the
terms are observed in the given documents; however the topics are
latent which means topics are not observed. Principal components are
such linear combinations. The eigenvalues of the covariance matrix
represent variances in the directions of the eigenvectors. The first
$$X_K = B\,D_K\,C, \qquad \|X - X_K\|_F = \lambda_{K+1},$$
where λK+1 is the largest of the r − K smallest singular values
that are ignored.
There are claims that the resulting reduced dimensional represen-
tation is semantic and can handle both synonymy and polysemy in
information retrieval and text mining. Here, synonymy refers to
two different words having the same meaning, and polysemy to
the same word having multiple meanings. This
will have impact on the similarity between two documents. Because
of synonymy the similarity value computed using dot product type
of functions will be less than the intended. For example, if car and
automobile are used interchangeably (synonymously) in a document
and only car is used in the query document then the similarity mea-
sure will fail to take into account the occurrences of automobile. In
a similar manner because of polysemy it is possible that similarity
between a pair of documents is larger than what it should be. For
example, if tiger is used in a document both to mean an animal and,
say, an airline, and the query document uses tiger in only one sense,
then the dot product could be larger than the intended value.
An important feature of the SVD is that it is a deterministic
factorization scheme and the factorization is unique. Here each row
of the matrix C is a topic and it is an assignment of weights to each
of the terms. The entry Cij is the weight or importance of the jth
term (j = 1, . . . , l) to the ith topic. The entry Dii in the diagonal
matrix indicates some kind of weight assigned to the entire ith topic.
$$P(t) = \sum_{i=1}^{K} P(C_i)\,P(t|C_i), \ \text{ where } \sum_{i=1}^{K} P(C_i) = 1,$$
$$P(d, t_j) = P(d)\,P(t_j|d), \ \text{ where } P(t_j|d) = \sum_{i=1}^{K} P(t_j|C_i)\,P(C_i|d).$$
$$L(\theta) = \prod_{i=1}^{n}\prod_{j=1}^{l} P(d_i, t_j)^{n(d_i, t_j)}.$$
$$l(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{l} n(d_i, t_j)\,\log P(d_i, t_j),$$
Hence,
$$l(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{l} n(d_i, t_j)\,\log\!\left[P(d_i)\sum_{k=1}^{K} P(t_j|C_k)\,P(C_k|d_i)\right].$$
So, the expected value of l(θ) with respect to the latent cluster Ck con-
ditioned on di and tj is given by
$$E_{C_k|d_i,t_j}(l(\theta)) = \sum_{i=1}^{n}\sum_{j=1}^{l} n(d_i, t_j) \sum_{k=1}^{K} P(C_k|t_j, d_i)\,\log\!\left[P(d_i)\,P(t_j|C_k)\,P(C_k|d_i)\right],$$
subject to the constraints
$$\sum_{i=1}^{n} P(d_i) = 1; \quad \sum_{j=1}^{l} P(t_j|C_k) = 1; \quad \sum_{k=1}^{K} P(C_k|d_i) = 1.$$
The factors are obtained using the following multiplicative update rules:
$$B_{ij} \leftarrow B_{ij}\,\frac{(XC^t)_{ij}}{(BCC^t)_{ij}}, \qquad C_{ij} \leftarrow C_{ij}\,\frac{(B^tX)_{ij}}{(B^tBC)_{ij}}.$$
It is possible to show some kind of equivalence between NMF and
the K-means algorithm, and also between NMF and PLSA.
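A minimal sketch (not from the text) of these multiplicative updates on a small non-negative matrix; the matrix, rank and iteration count are illustrative.

```python
import numpy as np

def nmf(X, k, iters=500, eps=1e-9):
    """Multiplicative updates for X ~ B C with non-negative factors."""
    n, l = X.shape
    rng = np.random.default_rng(0)
    B = rng.random((n, k))
    C = rng.random((k, l))
    for _ in range(iters):
        B *= (X @ C.T) / (B @ C @ C.T + eps)
        C *= (B.T @ X) / (B.T @ B @ C + eps)
    return B, C

X = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
B, C = nmf(X, 2)
print(np.round(B @ C, 2))    # close to X; the rows of C act as the two "topics"
```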
7.7. LDA
Even though PLSA offers an excellent probabilistic model in charac-
terizing the latent topics in a given collection of documents, it is not
a fully generative model. It cannot explain documents which are not
part of the given collection. Some of the important features of the
PLSA are:
By using Bayes rule we can write $P(t_j, C_k|d_i)$ as $P(t_j, d_i|C_k)\,\frac{P(C_k)}{P(d_i)}$.
This will mean that
$$P(d_i, t_j) = P(d_i)\sum_{k=1}^{K} P(t_j, d_i|C_k)\,\frac{P(C_k)}{P(d_i)} = P(d_i)\sum_{k=1}^{K} P(t_j|C_k)\,P(d_i|C_k)\,\frac{P(C_k)}{P(d_i)}.$$
where
$$B(\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)},$$
where Γ stands for the Gamma function and α is the input param-
eter vector.
$$p(\theta, \phi, C|d, \alpha, \beta) = \frac{p(\theta, \phi, C, d|\alpha, \beta)}{p(d|\alpha, \beta)}.$$
where C−i means all cluster associations other than Ci . We can show
using Bayes Rule
Both the integrands have Dirichlet priors in the form of p(θ|α) and
p(φ|β); also there are multinomial terms, one in each integrand. The
conjugacy of Dirichlet to the multinomial helps us here to simplify
the product to a Dirichlet with appropriate parameter setting. So, it
is possible to get the result as
$$p(D, C|\alpha, \beta) = \prod_{i=1}^{n} \frac{B(\alpha)}{B(n_{d_i,\cdot} + \alpha)}\; \prod_{k=1}^{K} \frac{B(\beta)}{B(n_{\cdot,k} + \beta)},$$
$$p(C_m|C_{-m}, D, \alpha, \beta) = \frac{p(D, C|\alpha, \beta)}{p(D, C_{-m}|\alpha, \beta)} \propto \left(n_{d,k}^{(-m)} + \alpha_k\right) \frac{n_{t,k}^{(-m)} + \beta_t}{\sum_{t'}\left(n_{t',k}^{(-m)} + \beta_{t'}\right)}.$$
Here the superscript (−m) corresponds to not using the mth token in
counting; note that nd,k and nt,k are the counts where k is the topic,
d is the document, and t and t′ are terms. So, Gibbs sampling based
LDA essentially maintains various counters to store these count val-
ues. Basically, it randomly initializes the K clusters/topics and iterates,
updating the probabilities specified by the above equation using these
counters; in every iteration the counters are suitably updated.
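A compact sketch (not from the text) of such a counter-based collapsed Gibbs sampler; the toy corpus, the number of topics K and the hyperparameters α and β are illustrative assumptions.

```python
import random

docs = [["ball", "game", "team"], ["game", "score", "team"],
        ["vote", "party", "poll"], ["party", "vote", "election"]]
K, alpha, beta = 2, 0.5, 0.1
vocab = sorted({t for d in docs for t in d})
V = len(vocab)

n_dk = [[0] * K for _ in docs]                 # document-topic counts
n_tk = {t: [0] * K for t in vocab}             # term-topic counts
n_k = [0] * K                                  # tokens per topic
z = []                                          # topic of every token
for d, doc in enumerate(docs):
    z.append([])
    for t in doc:
        k = random.randrange(K)                # random initialization
        z[d].append(k)
        n_dk[d][k] += 1; n_tk[t][k] += 1; n_k[k] += 1

for _ in range(200):                            # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, t in enumerate(doc):
            k = z[d][i]                         # remove the token's current topic
            n_dk[d][k] -= 1; n_tk[t][k] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][j] + alpha) *
                       (n_tk[t][j] + beta) / (n_k[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k                         # add the token back with its new topic
            n_dk[d][k] += 1; n_tk[t][k] += 1; n_k[k] += 1

print([[round((n_dk[d][j] + alpha) / (len(docs[d]) + K * alpha), 2)
        for j in range(K)] for d in range(len(docs))])   # per-document topic mixtures
```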
Concept i1 i2 i3 i4 i5 i6 i7 i8 i9
Class 1 1 0 0 1 0 0 1 0 0
Class 7 1 1 1 0 0 1 0 0 1
• Character 1: i1 ∧ i4 ∧ i7 → Class1
• Character 7: i1 ∧ i2 ∧ i3 ∧ i6 ∧ i9 → Class7
Research Ideas
1. Derive an expression for the number of soft clusterings of n patterns into K soft
clusters.
2. Discuss rough-fuzzy clustering in the case of the Leader algorithm. What can
happen with other algorithms?
Relevant References
(a) S. Asharaf and M. N. Murty, An adaptive rough fuzzy single pass algorithm
for clustering large data sets. Pattern Recognition, 36:3015–3018, 2003.
(b) V. S. Babu and P. Viswanath, Rough-fuzzy weighted k-nearest leader clas-
sifier for large data sets. Pattern Recognition, 42:1719–1731, 2009.
(c) P. Maji and S. Paul, Rough-fuzzy clustering for grouping functionally simi-
lar genes from microarray data. IEEE/ACM Transactions on Computational
Biology Bioinformatics, 10:286–299, 2013.
3. A difficulty with the use of GAs is that they are not scalable. The problem gets
complicated further when one considers MOOP. How to design scalable GAs?
Relevant References
(a) T. Hoffman, Latent semantic models for collaborative filtering. ACM Trans-
actions on Information Systems, 22(1):89–115, 2004.
(b) D. Sontag and D. M. Roy, Complexity of inference in latent Dirichlet allo-
cation. Proceedings of NIPS, 2011.
(c) S.-K. Ng, Recent developments in expectation-maximization methods for
analyzing complex data. Wiley Interdisciplinary Reviews: Computational
Statistics, 5:415–431, 2013.
6. Matrix factorization is useful in clustering. It is possible to show equivalence
between PLSA and NMF; similarly between K -Means and NMF. Is it possible
to unify clustering algorithms through matrix factorization?
Relevant References
(a) A. Roy Chaudhuri and M. N. Murty, On the relation between K-Means and
PLSA. Proceedings of ICPR, 2012.
(b) C. Ding, T. Li and W. Peng, On the equivalence between non-negative
matrix factorization and probabilistic latent semantic indexing. Computa-
tional Statistics and Data Analysis, 52:3913–3927, 2008.
(c) J. Kim and H. Park, Sparse nonnegative matrix factorization for clustering.
Technical Report GT-CSE-08-01, Georgia Institute of Technology, 2008.
7. Is it possible to view clustering based on frequent itemsets as a matrix factor-
ization problem?
Relevant References
(a) B. C. M. Fung, K. Wang and M. Ester, Hierarchical document clustering
using frequent itemsets. Proceedings of SDM, 2003.
(b) G. V. R. Kiran, R. Shankar and V. Pudi, Frequent itemset-based hierarchical
document clustering using Wikipedia as external knowledge. Proceedings
Relevant References
(a) D. M. Blei, Probabilistic topic models. Communications of the ACM, 55:77–
84, 2012.
(b) D. Newman, E. V. Bonilla and W. L. Buntine, Improving topic coherence
with regularized topic models. Proceedings of NIPS, 2011.
(c) H. M. Wallach, D. M. Mimno and A. McCallum: Rethinking LDA: Why
priors matter. Proceedings of NIPS, 2009.
Chapter 9
1. Introduction
Social networks characterize different kinds of interactions among
individuals. Typically, a social network is abstracted using a
network/graph. The individuals are represented as nodes in a net-
work and interaction between a pair of individuals is represented
using an edge between the corresponding pair of nodes. Usually a
social network is represented as a graph. The nodes represent indi-
viduals or entities with attributes such as interests, profile, etc. The
interactions among the entities could be one of friendship, business
relationship, communication, etc.
These graphs could be either directed or undirected. Typically
friendship between two individuals is mutual; so, edges in a friendship
network are undirected. However, in influence networks the relation
may not be symmetric; a person A may influence B, but B may not
be able to influence A. So, in influence networks, the edges could
be directed. Note that author–co-author relation is symmetric and
network of authors is undirected whereas the citation network/graph
is directed.
Such a graphical representation can help in analyzing not only
the social networks but also other kinds of networks including
document/term networks. For example, in information retrieval typ-
ically a bag of words paradigm is used to represent document col-
lections. Here, each document is viewed as a vector of terms in the
collection; it only captures the frequency of occurrence of terms not
the co-occurrence of terms in the document collection. It is possible
2. Patterns in Graphs
It is found that networks have some distinguishing features. The main
features associated with them are
$$y(x) = Cx^{-\gamma},$$
(Figure: log–log plot of out-degree versus rank.)
we get
$$d_i \propto r_i^{R}, \qquad f_d \propto d^{O}.$$
(Figure: log–log plot of out-degree versus frequency.)
(Figure: log–log plot of eigenvalue versus order.)
$$\lambda_i \propto i^{E}.$$
Figure 9.4. Log–log plot of number of pairs of nodes within h hops versus h.
$$N(h) = h^{H}.$$
(Figure: the example network on eight nodes whose adjacency lists are given in Table 9.1.)
Table 9.1. Adjacency lists and clustering coefficients.

Node   Neighbors          Clustering coefficient
1      {2, 4, 5, 7, 8}    0.3
2      {1, 3, 4}          0.66
3      {2, 4, 5}          0.66
4      {1, 2, 3, 5}       0.75
5      {1, 3, 4}          0.66
6      {7, 8}             1
7      {1, 6, 8}          0.66
8      {1, 6, 7}          0.66
The neighbors of each node in the network and the clustering coefficient
of each node are given in Table 9.1. Note that the average clustering
coefficient is 0.67; this gives a measure of how well the nodes can form
clusters.
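A short sketch (not from the text) that computes the clustering coefficient of each node directly from the adjacency lists of Table 9.1 (edges among a node's neighbors divided by the number of possible neighbor pairs).

```python
from itertools import combinations

adj = {1: {2, 4, 5, 7, 8}, 2: {1, 3, 4}, 3: {2, 4, 5}, 4: {1, 2, 3, 5},
       5: {1, 3, 4}, 6: {7, 8}, 7: {1, 6, 8}, 8: {1, 6, 7}}

def clustering_coefficient(i):
    """Fraction of pairs of neighbors of i that are themselves connected."""
    pairs = list(combinations(adj[i], 2))
    if not pairs:
        return 0.0
    links = sum(1 for u, v in pairs if v in adj[u])
    return links / len(pairs)

cc = {i: round(clustering_coefficient(i), 2) for i in adj}
print(cc)
print("average:", round(sum(cc.values()) / len(cc), 2))
```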
Modularity:
1. Graph partitioning
2. Spectral clustering
3. Linkage-based clustering
4. Hierarchical clustering
5. Random walks
$$\sigma_k = \frac{1}{n}\sum_{C \in P_k}\sum_{i \in C} r_{iC}^2.$$
$$Q = \frac{1}{4m}\sum_{ij} B_{ij}\,u_i u_j = \frac{1}{4m}\,u^T B\,u.$$
$$Q = \frac{1}{4m}\sum_{i} (v_i\cdot u)\,v_i^T\, B \sum_{j} (v_j\cdot u)\,v_j = \frac{1}{4m}\sum_{i=1}^{n} (v_i\cdot u)^2\,\beta_i,$$
$$L_Q X = X\cdot C, \qquad (2)$$
$$Q = \frac{1}{4m}\sum_{ij} B_{ij}\,(1 + u_i u_j),$$
4. Link Prediction
There are several applications associated with networks/graphs
where predicting whether a pair of nodes X and Y which are not
currently connected will get connected (or have a link) in the future
is important. Based on the type of objects the nodes represent in
a network, we may have either a homogeneous or a heterogeneous
network.
(Figure: the example network on ten nodes used to illustrate the link-prediction similarity functions.)
• Spa (1, 3) = 6 × 4 = 24
• Spa (1, 6) = Spa (1, 9) = 12
• Spa (3, 7) = Spa (3, 8) = Spa (3, 9) = Spa (3, 10) = 8
• Spa (2, 4) = Spa (2, 5) = Spa (4, 5) = Spa (2, 6) = 4
• Spa (2, 7) = Spa (2, 8) = Spa (2, 9) = Spa (2, 10) = 4
• Spa (4, 6) = Spa (4, 7) = Spa (4, 8) = Spa (4, 9) = Spa (4, 10) = 4
• Spa (5, 6) = Spa (5, 7) = Spa (5, 8) = Spa (5, 9) = Spa (5, 10) = 4
• Spa (6, 8) = Spa (6, 9) = Spa (6, 10) = 4
• Spa (7, 8) = Spa (7, 9) = Spa (7, 10) = Spa (8, 10) = 4
So, the link (1, 3) has the largest similarity value; this is followed
by (1, 6) and (1, 9). Also note that this function considers possible
links between every pair of nodes that are not linked currently.
This captures the notion that two people will become friends if
they share a large number of friends or they have a large num-
ber of common friends. Note that for the missing links (currently
unconnected nodes) in the example graph
• Scn (5, 8) = Scn (5, 10) = Scn (4, 6) = Scn (4, 7) = Scn (4, 8) =
Scn (4, 10) = 1
• Scn (5, 6) = Scn (5, 7) = Scn (5, 8) = Scn (5, 10) = Scn (7, 8) =
Scn (7, 10) = 1
Observe that the node pairs with zero similarity value are not
shown; for example, Scn (6, 8) = 0. Based on these similar-
ity values, we can make out that the pair of nodes 1 and 3
has the largest similarity value of 3; this is followed by pairs
(1, 9), (2, 4), (2, 5), (2, 8), (2, 10), and so on. Here, the link (1, 9)
has a larger similarity value compared to the link (1, 6). Note
that the similarity function is symmetric because the graph is
undirected.
Note that unlike the common neighbors function which ranks the
link (1, 3) above the others the Jaccard coefficient ranks the links
(2, 4), (2, 5), (4, 5), (8, 10) (with similarity 1) above the pair (1, 3)
(similarity value is 0.5). Also links between 1 and 9 and 1 and 6
have lesser similarity values.
4. Adamic–Adar: This similarity function Saa may be viewed as
a weighted version of the common neighbors similarity function.
It gives less importance to high degree common neighbors and
more importance to low degree common neighbors. The similarity
function is given by
$$S_{aa}(X, Y) = \sum_{z \in N(X)\cap N(Y)} \frac{1}{\log|N(z)|}.$$
where $path_{X,Y}^{(l)}$ is the set of all paths of length l between X and
Y. The similarity values for some of the pairs in the example graph are:
• Sks(1, 3) = 2β² + β³ = 0.02 + 0.001 = 0.021
• Sks(1, 9) = 2β² = 0.02
• Sks(1, 6) = β² + 3β³ = 0.013
• Sks(7, 8) = β² + β³ = 0.011
Note that the similarity values are computed using a value of 0.1
for β; the similarity between 1 and 6 is different from that between
1 and 9.
3. SimRank: It may be viewed as a recursive version of simple sim-
ilarity. It is given by
$$S_{sr}(X, Y) = \frac{\sum_{P \in N(X)}\sum_{Q \in N(Y)} S_{sr}(P, Q)}{|N(X)|\cdot|N(Y)|}.$$
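The local similarity functions discussed above can be sketched in a few lines (not from the text); the small graph used here is hypothetical, since the edge list of the example network is not reproduced in this excerpt.

```python
import math
from itertools import combinations

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 5}, 4: {1, 5}, 5: {3, 4}}   # hypothetical graph

def common_neighbors(x, y):
    return len(adj[x] & adj[y])

def jaccard(x, y):
    return len(adj[x] & adj[y]) / len(adj[x] | adj[y])

def adamic_adar(x, y):
    return sum(1 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def preferential_attachment(x, y):
    return len(adj[x]) * len(adj[y])

for x, y in combinations(adj, 2):
    if y not in adj[x]:                          # score only the missing links
        print((x, y), common_neighbors(x, y), round(jaccard(x, y), 2),
              round(adamic_adar(x, y), 2), preferential_attachment(x, y))
```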
5. Information Diffusion
Any information communicated in a social network spreads in the
network. The study of how this information gets disseminated in the
network is called information diffusion. The shape of the network is
Two popular diffusion models are (i) independent cascade (IC) and
(ii) linear threshold (LT). Both these models are synchronous
along the time-axis. IC is sender-centric whereas LT is receiver-centric.
$$P(u, t) = \frac{|V_u^t|}{|V_u|},$$
where Vu is the set of messages of user u and V_u^t is the set of messages
of user u during time period t.
For every pair of nodes u and v, the above parameters are used
to find 13 features. These features are Ac(u), Ac(v), S(u, i), S(v, i),
F (u, v), F (v, u), R(u), R(v), T (u), T (v), P (u, t), P (v, t) and H(u, v).
For a month, the data of the social network is used and according
to the spreading cascades available, each link with its 13 features is
classified as “diffusion” or “non-diffusion”. This data is then used to
predict the diffusion for the next month.
• Closeness Centrality
Closeness centrality of a node i is the inverse of the total shortest-path
distance from i to every connected node j. It can be written as
$$C_c(i) = \frac{1}{\sum_{j=1}^{n} d_{ij}},$$
where dij gives the distance between nodes i and j. In case the
network is not strongly connected, the closeness centrality also depends
on the number of nodes, Ri, reachable from i. It can then be written as
$$C_c(i) = \frac{R_i/(n-1)}{\left(\sum_{j=1}^{n} d(i,j)\right)/R_i}.$$
• Graph Centrality
For this measure, it is necessary to find the node k that is farthest
in terms of distance from the node i. Graph centrality is the
inverse of the distance from i to k. It can be written as
$$C_g(i) = \frac{1}{\max_{j\in V(i)} d(i, j)},$$
where V(i) is the set of nodes reachable from i.
• Betweenness Centrality
Considering the shortest paths between all pairs of nodes in the graph G,
betweenness centrality counts the number of such paths that pass through
the node i. It can be written as
$$C_b(i) = \sum_{i \neq j \neq k} \frac{N_{jk}(i)}{N_{jk}},$$
where Njk is the number of shortest paths between j and k and Njk(i) gives
the number of shortest paths between j and k passing through i.
• Page rank Centrality
The Page rank is given by
$$C_{pr}(i) = (1 - \alpha) + \alpha \sum_{j \in out(i)} \frac{C_{pr}(j)}{outdeg_j},$$
where α is the damping factor and out(i) gives the out-incident
nodes of i. In matrix form, this can be written as
$$C_{pr} = (1 - \alpha)\,e + \alpha\, C_{pr}\, P,$$
where e is the (1 × n) unit vector, Cpr is the (1 × n) Page rank
vector and P is the (n × n) transition matrix.
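A sketch (not from the text) of PageRank centrality computed by iterating the matrix form Cpr = (1 − α)e + αCprP; the small directed graph and α = 0.85 are illustrative.

```python
import numpy as np

edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 1)]      # hypothetical directed edges
nodes = sorted({u for e in edges for u in e})
n = len(nodes)
idx = {v: i for i, v in enumerate(nodes)}

P = np.zeros((n, n))                                   # row-stochastic transition matrix
for u, v in edges:
    P[idx[u], idx[v]] = 1.0
P /= P.sum(axis=1, keepdims=True)

alpha = 0.85
e = np.ones(n)
C = np.ones(n)                                         # 1 x n PageRank vector
for _ in range(100):                                   # iterate the matrix form
    C = (1 - alpha) * e + alpha * C @ P
print(dict(zip(nodes, np.round(C, 3))))
```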
• Degree Centrality
This measure uses the topology of the graph. The in-degree cen-
trality of a node i is the in-degree of node i and can be written as
Cd (i) = indegi .
Similarly, the out-degree centrality of a node i is the out-degree of
the node i and can be written as
Cd (i) = outdegi .
• α-Centrality
This is a path-based measure of centrality. It can be defined as
$$C_\alpha = v \sum_{t=0}^{\infty} \alpha^t A^t.$$
This converges only if α < 1/|λ1|. In this method, α is called the
attenuation factor.
• Katz Score
The Katz score is the α-Centrality obtained when v = αeA.
7. Topic Models
Topic modeling is a method used to analyze large documents. A topic
is a probability distribution over a collection of words and a topic
model is a statistical relationship between a group of observed and
unknown(latent) random variables which specifies a generative model
for generating the topics. Generative models for documents are used
to model topic-based content representation. Each document is mod-
eled as a mixture of probabilistic topics.
Then
$$P(w, d) = \sum_{z \in Z} P(z)\,P(d|z)\,P(w|z).$$
A = L · U · R.
$$p(\theta, z|w, \alpha, \beta) = \frac{p(\theta, z, w|\alpha, \beta)}{p(w|\alpha, \beta)}, \qquad (4)$$
Taking the right-hand side of Eq. (4), the numerator can be
written as
$$p(\theta, z, w|\alpha, \beta) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\theta_i^{\alpha_i - 1} \prod_{n=1}^{N}\prod_{i=1}^{k}\prod_{j=1}^{V} (\theta_i\,\beta_{i,j})^{w_n^j z_n^i}. \qquad (5)$$
The denominator is
$$p(w|\alpha, \beta) = \int \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\theta_i^{\alpha_i - 1} \prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V} (\theta_i\,\beta_{ij})^{w_n^j}\, d\theta. \qquad (6)$$
$$p(\theta, z|\gamma, \phi) = p(\theta|\gamma)\prod_{n=1}^{N} p(z_n|\phi_n). \qquad (7)$$
(Figure: plate diagrams of the LDA model, with hyperparameters α and β1:k, per-document topic proportions θd, per-word topic assignments zd,n and observed words ωd,n, and of the simplified variational distribution with parameters γ and φd,n.)
$$P(z_i = j|w_i = m, z_{-i}, w_{-i}) \propto \frac{P_{mj} + \beta}{\sum_{m'} P_{m'j} + V\beta}\cdot\frac{Q_{dj} + \alpha}{\sum_{j'} Q_{dj'} + T\alpha},$$
Research Ideas
1. The similarity measure SimRank may be viewed as a recursive version of simple
similarity. It is given by
$$S_{sr}(X, Y) = \frac{\sum_{P \in N(X)}\sum_{Q \in N(Y)} S_{sr}(P, Q)}{|N(X)|\cdot|N(Y)|}.$$
Note that it is a global measure of similarity. How do you justify its need against
its computational cost?
Relevant References
(a) G. Jeh and J. Widom, SimRank: A measure of structural-context similarity.
Proceedings of the ACM SIGKDD International Conference on KDD, July
2002.
(b) D. Liben-Nowell and J. Kleinberg, The link prediction problem for social
networks. Proceedings of CIKM, 2003.
(c) L. Lu and T. Zhou, Link prediction in complex networks: A survey. Phys-
ica A, 390:1150–1170, 2011.
(d) M. A. Hasan and M. J. Zaki, A survey of link prediction in social networks.
Social Network Data Analysis:243–275, 2011.
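As mentioned in Research Idea 1, the following is a minimal sketch (not the book's code) of the SimRank recursion computed by fixed-point iteration on a toy undirected graph; the graph, the number of iterations and the initialization S_sr(X, X) = 1 are assumptions, and the standard formulation additionally multiplies by a decay constant C < 1.

import numpy as np

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # toy undirected graph
n = len(graph)

S = np.eye(n)                          # base case: S_sr(X, X) = 1
for _ in range(20):                    # fixed-point iteration of the recursion
    S_new = np.eye(n)
    for x in graph:
        for y in graph:
            if x == y:
                continue
            nx, ny = graph[x], graph[y]
            total = sum(S[p, q] for p in nx for q in ny)
            S_new[x, y] = total / (len(nx) * len(ny))
    S = S_new

print(np.round(S, 3))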
2. It has been observed that the Adamic–Adar and Resource Allocation indices perform best among the local similarity measures. Why?
Relevant References
(a) L. Adamic and E. Adar, Friends and neighbors on the web. Social Networks, 25:211–230, 2003.
(b) T. Zhou, L. Lu, and Y.-C. Zhang, Predicting missing links via local information. European Physical Journal B, 71:623–630, 2009.
(c) Z. Liu, W. Dong and Y. Fu, Local degree blocking model for missing link
prediction in complex networks, arXiv:1406.2203 [accessed on 29 October
2014].
(d) N. Rosenfeld, O. Meshi, D. Tarlow and A. Globerson, Learning structured
models with the AUC loss and its generalizations. Proceedings of AISTATS,
2014.
3. What is the relevance of the power-law degree distribution in link prediction?
Relevant References
(a) Y. Dong, J. Tang, S. Wu, J. Tian, N. V. Chawla, J. Rao and H. Cao, Link
prediction and recommendation across heterogeneous social networks. Pro-
ceedings of ICDM, 2012.
(b) S. Virinchi and P. Mitra, Similarity measures for link prediction using power
law degree distribution. Proceedings of ICONIP, 2013.
4. How can one exploit community detection in link prediction?
Relevant References
(a) S. Soundararajan and J. E. Hopcroft, Using community information to
improve the precision of link prediction methods. Proceedings of WWW
(Companion Volume), 2012.
(b) B. Yan and S. Gregory, Detecting community structure in networks using
edge prediction methods, Journal of Statistical Mechanics: Theory and
Experiment, P09008, 2012.
5. In link prediction we deal with adding links to the existing network. However,
in a dynamically changing network it makes sense to delete some of the links.
How to handle such deletions?
Relevant References
(a) J. Preusse, J. Kunegis, M. Thimm, S. Staab and T. Gottron, Structural
dynamics of knowledge networks. Proceedings of ICWSM, 2013.
Relevant References
(a) M. E. J. Newman, Community detection and graph partitioning. CoRR
abs/1305.4974 [accessed on 29 October 2014].
(b) M. W. Mahoney, Community structure in large social and information net-
works. Workshop on Algorithms for Modern Massive Data Sets (MMDS),
2008.
(c) U. von Luxburg, A tutorial on spectral clustering. Statistics and Computing,
17(4):395–416, 2007.
(d) S. White and P. Smyth, A spectral clustering approach to finding communities in graphs. Proceedings of SDM, pp. 76–84, 2005.
(e) M. E. J. Newman and M. Girvan, Finding and evaluating community struc-
ture in networks. Physical Review E, 69(2):56–68, 2004.
7. Can we use the Modularity matrix to design better classifiers?
Relevant References
(a) P. Schuetz and A. Caflisch, Efficient modularity optimization: Multi-step
greedy algorithm and vertex mover refinement. CoRR abs/0712.1163,
2007.
(b) P. Schuetz and A. Caflisch, Multi-step greedy algorithm identifies com-
munity structure in real-world and computer-generated networks. CoRR
abs/0809.4398, 2008.
8. How does diffusion help in community detection?
Relevant References
(a) A. Guille, H. Hacid and C. Favre, Predicting the temporal dynamics of
information diffusion in social networks. Social and Information Networks,
2013.
(b) F. Wang, H. Wang and K. Xu, Diffusive logistic model towards predict-
ing information diffusion in online social networks. ICDCS’ Workshops,
pp. 133–139, 2012.
Relevant References
(a) V. D. Blondel, J.-L. Guillaume, R. Lambiotte and E. Lefebvre, Fast unfold-
ing of communities in large networks. Journal of Statistical Mechanics,
10:P10008, 2008.
(b) K. Wakita and T. Tsurumi, Finding community structure in mega-scale social networks. arXiv:cs/0702048, 2007.
(c) L. Danon, A. Diaz-Guilera and A. Arenas, The effect of size heterogeneity
on community identification in complex networks. Journal of Statistical
Mechanics, P11010, 2006.
(d) P. Pons and M. Latapy, Computing communities in large networks using
random walks. Computer and Information Sciences, ISCIS 2005, Springer
Berlin Heidelberg, pp. 284–293, 2005.
(e) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto and D. Parisi, Defining and
identifying communities in networks. Proceedings of the National Academy
of Science USA, 101:2658–2663, 2004.
(f) A. Clauset, M. E. J. Newman and C. Moore, Finding community structure
in very large networks. Physical Review E, 70:066111, 2004.
Index
CF-tree, 225, 227, 228, 248 classification rule, 211, 213, 317
CF-tree construction, 228 classification task, 180
chain-like clusters, 258, 259 classification technique, 177
character, 236, 238, 317 classification time, 9, 13, 20, 196
character pattern, 269 classification using ANN, 196
character set, 238 classification using GAs, 187
characteristic equation, 300 classification/test time, 9
chi-square statistic, 79, 102 classifier, 2, 165, 204, 205, 206, 209,
child chromosome, 201 296
children nodes, 192 classifiers in the compressed domain,
children strings, 279 175
Choquet distance, 188 clique, 322
Choquet Hyperplane, 188, 191 closed world assumption, 348
Choquet Hyperplane H, 187 closeness centrality, 353
Choquet integral, 188 cluster, 217, 219, 225, 227–229,
chromosome, 182, 183, 187, 189–191, 244–247, 249, 250, 252, 256, 258,
200, 340 263, 266, 270, 271, 281, 284, 286,
circuit, 331, 333 296, 297, 313–316
citation network, 321, 340 cluster analysis, 258
city-block distance, 55, 253, 299 cluster assignment, 220
class, 1, 185, 207, 246, 247, 298 cluster assumption, 160, 161
class descriptions, 317 cluster based support vector machine,
class feature vector, 185 248
class imbalance, 8, 12, 30 cluster centers, 274, 277
class label, 2, 37, 99, 102, 109, 112, cluster centroid, 247, 249
135, 136, 147, 149, 161, 168, 171, cluster compressed data, 260
177–179, 184, 186, 192, 194, cluster ensemble, 172
196–198, 205, 207, 246, 252, 286, Cluster Indicator Matrix, 263
298, 299, 317 cluster labels, 286
class separability, 90 cluster numbers, 274
class-conditional independence, 113, cluster representative, 5, 17, 242, 246,
114, 132 267, 296–298
classical logic, 316 cluster structure, 313, 314
classification, 1, 135, 139, 159, 160, cluster validity, 25
167, 168, 173–178, 180, 182, 183, clustering, 3, 148, 172, 174, 215, 224,
185, 187, 188, 196, 197, 200, 209, 230, 236, 238, 241–244, 247, 248,
211, 212, 242, 244, 260, 262, 293, 252, 259–261, 263, 265, 266, 272,
316, 326 273, 281, 282, 285, 286, 293, 316,
classification accuracy, 75, 139, 142, 318, 319, 326, 330, 336
160, 185, 191, 193, 194, 209 clustering algorithms, 160, 218, 223,
classification algorithm, 76, 136, 140, 261, 267, 318, 319
168, 169, 175, 178 clustering by compression, 260
classification model, 136 clustering coefficient, 326, 327
classification of the time series, 170 clustering ensemble algorithms, 262
classification performance, 173 Clustering Feature tree, 225
entropy, 44, 45, 167, 186, 255 external knowledge, 134, 260, 261,
equivalence class, 179, 269, 270, 275 319
equivalence relation, 270, 275 external node, 348
error coding length, 194 extremal optimization, 336
error function, 281
error in classification, 209 F -score, 99, 100, 109
error rate, 92, 112, 136, 186 factorization, 295, 299, 310
estimate, 116, 244, 291, 309, 316, 350, false positives, 195
359 farthest neighbor, 295
estimation of parameters, 111 feasible solution, 278
estimation of probabilities, 115 feature, 183, 192, 230, 243, 296
estimation scheme, 266 feature elimination, 94, 96
Euclidean distance, 55, 64, 86, 119, feature extraction, 27–29, 75, 86, 105,
142, 168, 230, 246, 253, 267, 268, 108, 109, 169
282, 283 feature ranking, 99–103, 109
evaluating community structure, 363 feature selection, 26, 29, 75–78 80, 83,
evaluation function, 182, 200 91, 92, 96, 97, 103, 105, 108–110,
131, 139, 169, 172, 173, 175, 187,
evaluation metric, 209
211–213
evaluation of a solution, 336
feature set, 299
evaluation of the strings, 184
feature subset, 91, 92, 297
evolutionary algorithm, 23, 91, 102,
feature subspace selection, 172
108, 191, 193, 266, 272, 273, 279
feature vector, 168, 170, 187, 207
evolutionary algorithms for
feature weights, 131
clustering, 264
feature-based classification, 168, 169
evolutionary computation, 108
feedforward neural network, 199
evolutionary operator, 272
filter method, 26, 76, 77
evolutionary programming, 264, 280
final solution, 182
evolutionary schemes, 264
finding communities, 363
evolutionary search, 279 fingerprint, 293, 294
exhaustive enumeration, 76, 218 first principal component, 303
exhaustive search, 99 first-order feature, 170
expectation, 141, 290 Fisher’s kernel, 169
expectation maximization, 161, 266 Fisher’s linear discriminant, 139
expectation step, 308 fitness calculation, 184
expectation-maximization, 319, 356 fitness evaluation, 183
expected frequency, 80 fitness function, 91, 102, 183, 185,
expected label, 161 190, 191, 200, 272, 340
expected value, 308, 309 fitness value, 193, 272, 273, 276, 278,
explanation ability, 196 336
exploitation operator, 273 Fixed length encoding, 192
exploration, 273 fixed length representation, 195
exponential family of distributions, flow-based methods, 328
117 forensics, 293
external connectivity, 326 forest size, 147
graph, 230, 231, 253, 260, 321, 325, hierarchical clustering, 14, 15, 218,
326, 328, 331–333, 340, 343–346, 225, 328
353, 354, 363 hierarchical document clustering, 260,
graph centrality, 353 319
graph classification, 175, 213 hierarchy of partitions, 218, 219
graph clustering algorithms, 328 high degree common neighbor, 345
graph data, 230 high degree nodes, 343
graph distance, 346 high density region, 160, 161
graph energy, 162 high frequency, 252
graph Laplacian, 329, 330 high-dimensional, 139, 142, 151, 172,
graph partitioning, 363 260, 281, 305–307
graph partitioning algorithms, 328 high-dimensional data, 86
Graph-based approach, 348 high-dimensional space, 7, 29, 89,
graph-based method, 162 171, 173, 295, 296
graphical representation, 321 hinge loss, 162
greedy algorithm, 363 HMM, 171, 286
greedy optimization, 335 Hole ratio, 186
homogeneity, 350, 351
greedy search algorithm, 99
homogeneous network, 340, 341
grouping, 318
hop-plot exponent, 325
grouping of data, 215
hybrid clustering, 21, 259
grouping phase, 252, 254
hybrid evolutionary algorithm, 212
grouping unlabeled patterns, 246, 261
hybrid feature selection, 110, 131
groups, 325
hybrid GA, 212
grow method, 192
hyper-parameter, 128
hyper-rectangular region, 145
Hamming loss, 209, 210
hyperlinks, 294
handwritten data, 262
hyperplane, 150, 189, 339
handwritten text, 294 hypothesis, 81
hard clustering, 263, 265, 267
hard clustering algorithm, 268, 285 identifying communities, 326, 364
hard decision, 263 identity matrix, 296, 310
hard partition, 215, 258, 274, 276, 285 implicit networks, 364
hashing algorithm, 172 impossibility of clustering, 259
health record, 294 impurity, 45, 145, 146
Hessian, 155 in-degree of node, 354
heterogeneous network, 340, 341 incident nodes, 354
heterogeneous social networks, 362 incomplete data, 285
heuristic technique, 336 incomplete knowledge, 71
hidden layer, 178, 197–199, 207 incremental algorithm, 224, 230, 258
Hidden Markov Model, 20, 168, 171, incremental clustering, 18, 21
285 independent attributes, 181
hidden node, 201 independent cascade model, 349
hidden unit, 197, 199, 200, 207, 208 index, 293
hidden variables, 313, 355 index vector, 233, 234
modularity matrix, 336, 337, 339, 363 nearest neighbor classification, 171,
modularity maximization, 338 211
modularity optimization, 336, 339 nearest neighbor classifier, 5, 53, 132,
momentum term, 199 248, 296
MST, 230, 254 Needleman–Wunsch algorithm, 168
multi-class classification, 14, 178, 206 negative class, 151, 188
multi-class problem, 11, 150, 153 negative example, 146
multi-graph, 341 neighboring communities, 335
multi-label classification, 202, 206, neighboring community, 335
209, 213 neighbors, 326
multi-label kNN, 203 nervous system, 196
multi-label naive Bayes classification, network, 200, 207, 326, 327, 331, 340,
213 342, 347, 348, 352, 353, 355,
multi-label problem, 14 362–364
multi-label ranking, 202 neural network, 23, 169, 174, 177,
multi-layer feed forward network, 195, 178, 196, 199, 206, 208, 211, 266,
197 281
Multi-level recursive bisection, 328 neural networks for classification, 195
multi-lingual document classification, neural networks for clustering, 265
293 neuro-fuzzy classification, 211
multi-media document, 294 newspaper article, 294
multi-objective approach, 212 NMF, 310, 311, 319
multi-objective fitness function, 194, NN classifier, 140, 168
195 NNC, 246
Multi-objective Optimization, 277 node, 207, 230, 321, 326, 327, 341, 348
multinomial, 129, 312, 313, 315, 320, nominal feature, 41
356, 357 non-differentiable, 178
multinomial distribution, 313 non-dominated, 278
multinomial random variable, 129 non-isotropic, 231
multinomial term, 315 non-metric distance function, 171
multiobjective evolutionary non-negative matrix factorization, 24,
algorithm, 319 84, 310, 107, 319
multiple kernel learning, 261 non-parametric, 79, 135, 350
multivariate data, 167 non-spherical data, 173
multivariate split, 145 non-terminal node, 191, 192
mutation, 94, 182, 190, 191, 195, 200, non-zero eigenvalue, 295, 300
273, 279, 280 nonlinear classifier, 6
mutual information, 26, 78, 105, 106, nonlinear dimensionality reduction,
131, 252, 295 174
nonlinear SVM, 173, 174
Naive Bayes classifier, 7, 29, 113, 131 nonmetric similarity, 72
natural evolution, 272 normal distance, 151
natural selection, 91, 182 normalization, 301, 303
nearest neighbor, 19, 76, 135, 172, normalized cut metric, 325
183, 246, 253, 295, 297, 298 normalized gradient, 202
pattern, 2, 149, 181, 186, 206, 207, precision, 97, 195, 362
223, 228–230, 242–248, 250, 252, predicate, 250, 316
253, 263, 265, 269, 271, 276, 281, predict, 167, 203, 204, 209, 210
282, 284, 286, 297 predictive model, 350, 356, 364
pattern classification, 70, 184, 195, preferential-attachment, 343, 363
211, 212 preprocessing, 8
pattern matrix, 2, 37 primal form, 154, 155
pattern recognition, 1, 19, 30, 261 primitive, 251
pattern synthesis, 8, 243, 245, 261 principal component analysis, 88,
patterns in graphs, 322 107, 302
PCA, 302 principle of inclusion and exclusion,
peak dissimilarity, 63 217
peaking phenomenon, 75 prior density, 121, 122, 124–126, 129,
penalty coefficient, 189 312
penalty function, 91 prior knowledge, 111, 115, 116
performance evaluation, 209 prior probabilities, 111, 113, 115, 119,
performance of GA, 212 158, 204, 255, 266, 357
performance of the classifier, 79 probabilistic assignment, 312
piecewise aggregate approximation,
probabilistic clustering, 265
103
probabilistic clustering algorithm, 266
Pin-Code recognition, 293
probabilistic convergence, 282
PLSA, 307, 311, 319
probabilistic latent semantic analysis,
PLSI, 266
307, 355
polynomial kernel, 169
probabilistic latent semantic
polynomial time, 339
indexing, 24, 28, 266, 319
polysemy, 306
probabilistic model, 311, 265, 266,
population, 91, 190, 265, 272, 273,
285
278
probabilistic selection, 336
population of chromosomes, 178, 191
probabilistic topic model, 312
population of strings, 182, 184
probability density function, 119
positive class, 12, 146, 151, 188
positive eigenvalue, 337 probability distribution, 23, 170, 201,
313, 322, 355
positive reflexivity, 51, 57
positive semi-definite, 329 probability emission matrix, 171
possibilistic clustering, 264, 267 probability mass function, 289
posterior distribution, 313, 314, 357, probability of error, 13
359 probability of selection, 278
posterior probabilities, 13, 111, 114, probability transition matrix, 171
116, 158, 266 product rule, 204
posterior probability, 112, 117, 147, protein sequence, 168
204 prototype, 8, 247, 281, 296
power law, 124, 133, 322 prototype selection, 138, 171
power law distribution, 125, 322 proximity, 61, 62, 253, 341
power law prior, 130 proximity function, 341, 342
power-law degree distribution, 362 Proximity measures, 50
training example, 148, 171, 186, 205, validation set, 3, 91, 92, 102, 191,
209 193, 209
training loss, 155 variance, 108, 120, 125, 127, 129, 141,
training pattern, 2, 13, 132, 152, 159, 221, 290
177, 178, 185, 186, 199, 206, 207, variance impurity, 146
246 variational inference, 133
training phase, 9 vector, 265, 274
training time, 9, 75, 136, 196 vector space, 132, 265
transaction, 175, 237, 239, 240 vector space model, 265
verification, 293
transaction dataset, 236
visualization, 265, 326
transition matrix, 332, 338, 339, 354
vocabulary, 114, 308
translation invariant, 55
tree, 191, 227 warping window, 66
tree-based encoding, 192 wavelet transform, 104
tree-based representation, 195 web, 362
triangle inequality, 52, 53 web page, 293
tweet, 294, 352 weight, 183, 196, 200, 207–209, 277
two-class classification, 248 weight functions, 162
two-class problem, 11, 147, 188 weight matrix, 232, 236, 329, 330, 338
two-partition, 233, 234, 257 weight vector, 100, 109, 152
types of data, 29 weighted average, 284
weighted edge, 338
typical pattern, 139
weighted Euclidean distance, 183
weighted graph, 353
unary operator, 279
weighted kNN classifier, 182
unconstrained optimization, 100 Wikipedia, 134, 252, 254, 260, 261,
undirected, 232, 321, 344 294, 319
undirected graph, 341 Wilcoxon statistic, 102
uniform density, 124 winner node, 282
uniform Dirichlet distribution, 357 winner-take-all, 267, 281
uniform distribution, 141, 142 winner-take-most, 281
uniform prior, 116, 124 winning node, 282
unit resistance, 331, 333 with-in-group-error-sum-of-squares,
unit step function, 197, 198 276
unit vector, 354 Wolfe dual, 155
word probabilities, 357
unit-norm, 303
working set, 155
unit-sphere, 339
wrapper method, 26, 76
univariate, 157
WTM, 282
unlabeled data, 135, 159–161,
165–167 X-rays, 294
unsupervised feature extraction, 295
unsupervised learning, 24, 27, 135 zero-mean normalization, 304
upper approximation, 22, 177, 180, Zipf’s curve, 133
264, 270–272 Zipf’s law, 47, 124