An Improved Co-Similarity Measure for Document Clustering
Syed Fawad Hussain, Gilles Bisson
Laboratoire TIMC-IMAG, UMR 5525
University of Grenoble, France
{Fawad.Hussain,Gilles.Bisson}@imag.fr

Clément Grimal
Laboratoire d'Informatique de Grenoble, UMR 5217
University of Grenoble, France
Clement.Grimal@imag.fr
Abstract—Co-clustering has been defined as a way to simultaneously organize subsets of instances and subsets of features in order to improve the clustering of both. In previous work [1], we proposed an efficient co-similarity measure that simultaneously computes two similarity matrices, one between objects and one between features, each built on the basis of the other. Here we propose a generalization of this approach by introducing a notion of pseudo-norm and a pruning algorithm. Our experiments show that this new algorithm significantly improves the accuracy of the results with both supervised and unsupervised feature selection, and that it outperforms other algorithms on various corpora.
Keywords-co-clustering; similarity measure; text mining
I. INTRODUCTION
The clustering task is used to organize or summarize data coming from databases. Classically, the data are described as a set of instances characterized by a set of features. In some cases, these features are homogeneous enough to allow us to cluster them in the same way as we do for the instances. For example, when using the Vector Space Model introduced by [2], text corpora are represented by a matrix whose rows represent the document vectors and whose columns represent the word vectors. The similarity between two documents obviously depends on the similarity between the words they contain, and vice versa. Classical clustering methods do not exploit such dependencies. The purpose of co-clustering is to take this duality between rows and columns into account in order to identify the relevant clusters. In this regard, co-clustering has been widely studied in recent years, both in document clustering [3-6] and in bioinformatics [7-9].
In text analysis, the advantage of co-clustering is related to the well-known problem that document and word vectors tend to be highly sparse and suffer from the curse of dimensionality [10]. In such a scenario, traditional metrics such as the Euclidean distance or the Cosine similarity do not always make much sense [11]. Several methods have been proposed to overcome these limitations by exploiting the dual relationship between documents and words to extract semantic knowledge from the data. Consequently, the concept of higher-order co-occurrences has been investigated in [12], [13], among others, as a measure of the semantic relationship between words; one of the best known approaches to acquire such knowledge is Latent Semantic Analysis [14].
The underlying analogy is that humans do not necessarily
use the same vocabulary when writing about the same topic.
For example, let us consider a corpus in which a subset of
documents contains a significant number of co-occurrences
between the words sea and waves and another subset in
which the words ocean and waves co-occur. A human could infer that the words ocean and sea are conceptually related even if they do not directly co-occur in any document. Such a relationship between waves and ocean (or sea and waves) is termed a first-order co-occurrence, while the conceptual association between sea and ocean is called a second-order relationship. This concept can be generalized to higher-order (3rd, 4th, 5th, etc.) co-occurrences.
In this context, we recently introduced an algorithm, called χ-Sim [1], that exploits the duality between words and documents in a corpus as well as their respective higher-order co-occurrences. While most authors have focused on directly co-clustering the data, in χ-Sim we simply build two similarity matrices, one for the rows and one for the columns, each being built iteratively on the basis of the other. We call this process the co-similarity measure. Hence, once the two similarity matrices are built, each of them contains all the information needed to perform a 'separate' co-clustering of the data (documents and words) using any classical clustering algorithm (K-means, Hierarchical Clustering, etc.). In this way, the final user can choose the algorithm best suited to co-cluster their data.
In this paper, we further analyze the behavior of χ-Sim and propose some ideas that dramatically improve the quality of the co-similarity measures. First, we introduce a new normalization schema for this measure that is more consistent with the framework of the algorithm and that offers new perspectives for research. Second, we propose an efficient way to deal with noise in the data and thus to improve the accuracy of the clustering.
The rest of the paper is organized as follows: in section II,
we explain a generalized version of χ-Sim. In section III,
we discuss the shortcomings of this algorithm and propose
some improvements, then a new algorithm is described.
Experimental results on some classical datasets are presented in sections IV and V, using supervised and unsupervised feature selection methods respectively. Finally, in section VI we present our conclusions and future work.
II. THE χ-SIM SIMILARITY MEASURE
In this paper, we will use the following classical notations:
matrices (in capital letters) and vectors (in small letters) are
in bold and all variables are in italic.
Data matrix: let M be the data matrix representing a
corpus having r rows (documents) and c columns (words);
mij corresponds to the ‘intensity’ of the link between the
ith row and the j th column (for a document-word matrix,
it can be the number of occurrences of the j th word in
the ith document); mi: = [mi1 · · · mic ] is the row vector
representing the document i and m:j = [m1j · · · mrj ] is the
column vector corresponding to word j. We will refer to a
document as di when talking about documents casually and
refer to it as mi: when specifying its (row) vector in the
matrix M. Similarly, we will casually refer to a word as wj
and use the notation m:j when emphasizing the vector.
Similarity matrices: SR and SC represent the square and
symmetrical row similarity and column similarity matrices
of size r × r and c × c respectively, with ∀i, j = 1..r, srij ∈
[0, 1] and ∀i, j = 1..c, scij ∈ [0, 1].
Similarity function: function Fs (·, ·) is a generic function
that takes two elements mil and mjn of M and returns a
measure of the similarity Fs (mil , mjn ) between them.
A. Similarity measures
The χ-Sim algorithm is a co-similarity based approach
which builds on the idea of simultaneously generating the
similarity matrices SR (between documents) and SC (between words), each of them built on the basis of the other.
Similar ideas have also been used for supervised learning in [6] and for image retrieval in [15]. First, we present how to
compute the co-similarity matrix SR. Usually, the similarity
(or distance) measure between two documents mi: and mj:
is defined as a function – denoted here as Sim(mi: , mj: )
– that is more or less the sum of the similarities between
words occurring in both mi: and mj: :
Sim(mi: , mj: ) = Fs (mi1 , mj1 ) + · · · + Fs (mic , mjc ) (1)
Now let’s suppose we already know a matrix SC whose
entries provide a measure of similarity between the columns
(words) of the corpus. In parallel, let’s introduce, by analogy
to the norm Lk (Minkowski distance), the notion of a
pseudo-norm k. Then, (1) can be re-written as follows
without changing its meaning if scll = 1 and if k = 1:
Sim(mi:, mj:) = [ Σ_{l=1..c} (Fs(mil, mjl))^k × sc_ll ]^(1/k)    (2)
Now the idea is to generalize (2) in order to take into
account all the possible pairs of features (words) occurring
in documents mi: and mj: . In this way, we “capture”
not only the similarity between their common words but
also the similarity coming from words that are not directly
shared by the two documents. Of course, for each pair of
words not directly shared by the documents, we weight
their contribution to the document similarity srij by their
own similarity scln . Thus, the overall similarity between
documents mi: and mj: is defined in (3), in which the terms for l = n are those occurring in (2):
Sim^k(mi:, mj:) = [ Σ_{l=1..c} Σ_{n=1..c} (Fs(mil, mjn))^k × sc_ln ]^(1/k)    (3)
Assuming that our function Fs (mil , mjn ) is defined as a
product (see [1] for further details) of the elements mil and
mjn , i.e. Fs (mil , mjn ) = mil × mjn (as with the cosine
similarity measure), we can rewrite Equation (3) as:
Sim^k(mi:, mj:) = [ (mi:)^k × SC × ((mj:)^k)^T ]^(1/k)    (4)

where (mi:)^k = [(mi1)^k · · · (mic)^k] is the element-wise exponentiation of mi: and v^T denotes the
transpose of the vector v. Finally, let's introduce the term N(mi:, mj:), a normalization function depending on mi: and mj: that allows us to map the similarity to [0, 1]. We obtain the following equation, in which srij denotes an element of the SR matrix:
srij = [ (mi:)^k × SC × ((mj:)^k)^T ]^(1/k) / N(mi:, mj:)    (5)
Equation (5) is a classic generalization of several well-known similarity measures. For example, with k = 1, the Jaccard index can be obtained by setting SC to I and N(mi:, mj:) to ||mi:||_1 + ||mj:||_1 − mi: mj:^T, while the Dice coefficient can be obtained by setting SC to 2I and N(mi:, mj:) to ||mi:||_1 + ||mj:||_1. Furthermore, if SC is set to a positive semi-definite matrix A, one can define the inner product <mi:, mj:>_A = mi: × A × mj:^T, along with the associated norm ||mi:||_A = sqrt(<mi:, mi:>_A). Then, by setting N(mi:, mj:) to ||mi:||_A × ||mj:||_A, we obtain the Generalized Cosine similarity [16], as it corresponds to the Cosine measure in the underlying inner product space. Of course, by binding A to I, this similarity becomes the standard Cosine measure.
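To make these special cases concrete, here is a small numerical sketch (illustrative code, with hypothetical helper names) showing that equation (5) with k = 1 recovers the Jaccard, Dice and Cosine measures for binary vectors, depending on the choice of SC and N:

```python
import numpy as np

def gen_sim(mi, mj, SC, k, norm):
    """Equation (5): the k-th root of (mi^k) x SC x (mj^k)^T,
    divided by a normalization term N(mi, mj)."""
    num = ((mi ** k) @ SC @ (mj ** k)) ** (1.0 / k)
    return num / norm(mi, mj)

# Two binary document vectors over a 4-word vocabulary.
mi = np.array([1., 1., 0., 1.])
mj = np.array([0., 1., 1., 1.])
I = np.eye(4)

# Jaccard: SC = I, N = |mi|_1 + |mj|_1 - mi.mj^T
jac = gen_sim(mi, mj, I, 1, lambda a, b: a.sum() + b.sum() - a @ b)
# Dice: SC = 2I, N = |mi|_1 + |mj|_1
dice = gen_sim(mi, mj, 2 * I, 1, lambda a, b: a.sum() + b.sum())
# Cosine: SC = I (i.e. A = I), N = ||mi||_2 * ||mj||_2
cos = gen_sim(mi, mj, I, 1, lambda a, b: np.sqrt(a @ a) * np.sqrt(b @ b))
print(jac, dice, cos)  # 0.5, 0.666..., 0.666...
```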
B. The χ-Sim Co-Similarity Measure
The χ-Sim co-similarity measure, as defined in [1], can also be reformulated with (5). We set k to 1, and since the maximum value of the function sc_ij is 1, it follows from (3) that the upper bound of Sim(mi:, mj:), for 1 ≤ i, j ≤ r, is given by the product of the sums of the elements (assumed to be positive numbers) of mi: and mj:, denoted by |mi:| × |mj:| (product of L1 norms). This normalization seems well suited for textual datasets, since it allows us to take into account the actual length of the document and word vectors when dealing with pairs of documents or words of uneven length, which is common in text corpora. Therefore, we can rewrite (5) as:
∀i, j ∈ 1..r,  srij = (mi: × SC × mj:^T) / (|mi:| × |mj:|)    (6a)

Similarly, the elements scij of SC correspond to:

∀i, j ∈ 1..c,  scij = (m:i^T × SR × m:j) / (|m:i| × |m:j|)    (6b)
Thus, equations (6a) and (6b) define a system of equations whose solutions correspond to the (co-)similarities between two documents and between two words. The χ-Sim algorithm is therefore classically based on an iterative approach, i.e. we alternately compute the values scij and srij. However, before detailing this algorithm for a more generic case in section III-C, we explain the meaning of these iterations by considering the associated bipartite graph.
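This iterative scheme can be sketched as follows; it is an illustrative re-implementation of equations (6a)/(6b), not the authors' original JAVA code, and the toy corpus (the sea/ocean/waves example of the introduction) is chosen so that a second-order similarity appears between words that never co-occur:

```python
import numpy as np

def chi_sim(M, iterations=4):
    """Sketch of the basic chi-Sim iteration of Eqs. (6a)/(6b): SR and SC
    are alternately rebuilt from one another, each entry normalized by the
    product of the L1 norms of the two vectors involved."""
    r, c = M.shape
    row_norm = np.outer(M.sum(axis=1), M.sum(axis=1))  # |mi:| * |mj:|
    col_norm = np.outer(M.sum(axis=0), M.sum(axis=0))  # |m:i| * |m:j|
    SR, SC = np.eye(r), np.eye(c)                      # SR(0), SC(0)
    for _ in range(iterations):
        # both updates use the matrices from the previous iteration
        SR, SC = (M @ SC @ M.T) / row_norm, (M.T @ SR @ M) / col_norm
    return SR, SC

# Toy corpus: d1 = {sea, waves}, d2 = {ocean, waves}; words: sea, waves, ocean
M = np.array([[1., 1., 0.],
              [0., 1., 1.]])
SR, SC = chi_sim(M, iterations=2)
# 'sea' and 'ocean' never co-occur, yet a second-order similarity appears:
print(SC[0, 2])   # > 0
```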
C. Graph Theoretical Interpretation

The graphical interpretation of the method helps to understand the working of the algorithm and provides some intuition on how to improve it. Let's consider the bipartite graph representation of a sample data matrix in Fig. 1. Documents and words are represented by square and circle nodes respectively, and an edge between a document di and a word wj corresponds to a non-zero entry mij in the document-word matrix.

Figure 1. A bi-partite graph view of a sample document-word matrix (documents d1 to d4, words w1 to w6).

There is only one order-1 path between documents d1 and d2, given by d1 --m12--> w2 --m22--> d2. If we consider that the SC matrix is initialized with the identity matrix I, at the first iteration Sim(m1:, m2:) corresponds to the inner product between m1: and m2: as given by (6a), and equals m12 × m22. Omitting the normalization for the sake of clarity, the matrix SR(1) = M × M^T thus represents all order-1 paths between all the possible pairs of documents di and dj. Similarly, each element of SC(1) = M^T × M represents all order-1 paths between all the possible pairs of words wi and wj.

Now, documents d1 and d4 do not have an order-1 path, but they are linked together through d2 (bold paths in Fig. 1) and d3 (dotted paths in Fig. 1). Such paths with one intermediate vertex are called order-2 paths, and they appear during the second iteration. The similarity value contributed via the document d2 can be explicitly represented as d1 --m12--> w2 --m22--> d2 --m24--> w4 --m44--> d4. The sub-sequence w2 --m22--> d2 --m24--> w4 represents an order-1 path between the words w2 and w4, which is the same as sc24^(1). The contribution of d2 to the similarity sr14^(1) can thus be re-written as m12 × sc24^(1) × m44. This is a partial similarity measure, since d2 is not the only document that provides a link between d1 and d4: the similarity via d3 is equal to m13 × sc35^(1) × m55. To find the overall similarity between documents d1 and d4, we add these partial similarity values: m12 × sc24^(1) × m44 + m13 × sc35^(1) × m55. Hence, the similarity matrix SR(2) at the second iteration corresponds to paths of order 2 between documents. It can be shown similarly that the matrices SR(t) and SC(t) represent order-t paths between documents and between words respectively.

Consequently, at each iteration t, when we compute the values of equations (6a) and (6b), one or more new links may be found between previously disjoint objects (documents or words), corresponding to paths of length t, and existing similarity values may be strengthened. It has been shown that "in the long run" the ending point of a random walk does not depend on its starting point [17], and hence it is possible to find a path (and hence a similarity) between any pair of nodes in a connected graph [18] by iterating a sufficiently large number of times. However, co-occurrences beyond the 3rd and 4th order have little semantic relevance [1], [13]. Therefore, the number of iterations is usually limited to 4 or less.

III. DISCUSSION AND IMPROVEMENTS OF χ-SIM

In this section, we first discuss a new normalization schema for χ-Sim, in order to (partially) satisfy the maximality property of a similarity measure (Sim(a, a) = 1); then we propose a pruning method allowing us to remove the 'noisy' similarity values created during the iterations.
A. Normalization
In this paper, we investigate extensions of the Generalized
Cosine measure, by relaxing the positive semi-definiteness
of the matrix, and by adding a pseudo-norm parameter k.
Henceforth, using equation (4), we define the elements of the matrices SR and SC with the two new equations (7a) and (7b):
∀i, j ∈ 1..r,  srij = Sim^k(mi:, mj:) / sqrt( Sim^k(mi:, mi:) × Sim^k(mj:, mj:) )    (7a)

∀i, j ∈ 1..c,  scij = Sim^k(m:i, m:j) / sqrt( Sim^k(m:i, m:i) × Sim^k(m:j, m:j) )    (7b)
However, this normalization is what we will call a pseudo-normalization: while it guarantees that srii = 1, it does not guarantee that ∀i, j ∈ 1..r, srij ∈ [0, 1] (and the same holds for scij). Consider for example a corpus having, among many other documents, a document d1 containing the word orange (w1) and a document d2 containing the words red (w2) and banana (w3), along with SC (the similarity matrix of all the words of the corpus) indicating that the similarity between orange and red is 1, the similarity between orange and banana is 1, and the similarity between red and banana is 0. Thus, Sim^1(d1, d1) = 1, Sim^1(d2, d2) = 2 and Sim^1(d1, d2) = 2. Consequently, sr12 = 2/sqrt(1 × 2) = sqrt(2) > 1.
One can notice that this problem arises from the polysemic nature of the word orange. Indeed, the similarity between these two documents is overemphasized because of the double analogy between orange (the color) and red, and between orange (the fruit) and banana. It is possible to correct this problem by setting k = +∞, since the pseudo-norm-k then becomes max_{1≤l,n≤c} {mil × sc_ln × mjn} and thus Sim^∞(d1, d1) = Sim^∞(d2, d2) = Sim^∞(d1, d2) = 1, implying sr12 = 1. Of course, k = +∞ is not necessarily a good setting for real tasks, and experimentally we observed that the values of srij and scij generally remain smaller than or equal to 1.
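The orange/red/banana example can be checked numerically; the sketch below (illustrative code, with Fs taken as the product, as in the paper) computes sr12 from equations (3) and (7a) for k = 1 and for a large k approximating k = +∞:

```python
import numpy as np

def sim_k(mi, mj, SC, k):
    """Sim^k of Eqs. (3)-(4), with Fs(x, y) = x * y as in the paper."""
    return ((np.outer(mi, mj) ** k * SC).sum()) ** (1.0 / k)

def sr(mi, mj, SC, k):
    """Pseudo-normalized similarity of Eq. (7a)."""
    return sim_k(mi, mj, SC, k) / np.sqrt(sim_k(mi, mi, SC, k) *
                                          sim_k(mj, mj, SC, k))

# d1 = {orange}, d2 = {red, banana}; orange~red = orange~banana = 1,
# red~banana = 0, as in the example above (words: orange, red, banana).
SC = np.array([[1., 1., 1.],
               [1., 1., 0.],
               [1., 0., 1.]])
d1 = np.array([1., 0., 0.])
d2 = np.array([0., 1., 1.])

print(sr(d1, d2, SC, k=1))    # 2 / sqrt(2) = 1.414..., i.e. above 1
print(sr(d1, d2, SC, k=100))  # close to 1: a large k damps the excess
```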
In this framework, it is nevertheless very interesting to investigate the different results one can obtain by varying k, including values lower than 1, as suggested in [19] for the norm Lk to deal with high-dimensional spaces. The resulting χ-Sim algorithm will be denoted by χ-Sim^k. However, the situation is different from the norm Lk in the sense that our method does not define a proper Normed Vector Space. To understand the problem, it is worth looking closely at the simple case k = 1, where Sim^1(mi:, mj:) = mi: × SC × mj:^T is the general form of an inner product, with the condition that SC is symmetric positive semi-definite (PSD). Unfortunately, in our case, due to the normalization steps, SC is not necessarily PSD, as the condition ∀i, j ∈ 1..c, |scij| ≤ sqrt(scii × scjj) = 1 is not verified (cf. the previous example). Thus, our similarity measure is just a bilinear form in a degenerate inner product space (as the conjugate symmetry and linearity axioms are trivially satisfied), in which it corresponds to the 'cosine'.
A straightforward solution would be to project SC (and SR) after each iteration onto the set of PSD matrices [16]. By constraining the similarity matrices to be PSD, we would ensure that the new space remains a proper inner product space. However, we experimentally verified that such an additional step does not improve the results, as on real datasets the similarity matrices are already very close to the set of PSD matrices. In addition, the projection step is very time-consuming. For these reasons, we do not use it in the remainder of this paper.
B. Dealing with 'noise' in the SC and SR matrices

As explained in section II-C, the elements of the SR matrix after the first iteration are the weighted order-1 paths in the graph: the diagonal elements srii correspond to the paths from each document to itself, while the non-diagonal terms srij count the number of order-1 paths between a document i and a neighbour j, which is based on the number of words they have in common. SR(1) is thus the adjacency matrix of the document graph, and iteration t amounts to counting the number of order-t paths between nodes. However, in a corpus, we can observe a large number of words, with a small number of occurrences, that are not really relevant to the topic of the document or, to be more precise, that are not specific to any family of semantically related documents. These words act as 'noise' in the dataset. Thus, during the iterations, these noisy words allow the algorithm to create new paths between the different families of documents; of course, these paths have very small similarity values, but they are numerous, and we make the assumption that they blur the similarity values between the classes of documents (and likewise for the words). Based on this observation, we introduce into the χ-Sim algorithm a parameter, termed the pruning threshold and denoted by p, allowing us to set to zero the lowest p % of the similarity values in the matrices SR and SC at each iteration.
In the following, we will refer to this algorithm as χ-Sim_p when using the previous normalization factor described in (6a) and (6b), and as χ-Sim_p^k when using the new pseudo-normalization factor described in (7a) and (7b).
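The pruning step can be sketched as follows; this is a hypothetical implementation, and details such as tie handling or whether the diagonal is protected are not specified in the paper:

```python
import numpy as np

def prune(S, p):
    """Set to zero the lowest p (fraction) of the similarity values in S.
    Hypothetical sketch: the threshold is taken as the p-th percentile of
    all entries; ties and the diagonal are not treated specially."""
    out = S.copy()
    if p > 0:
        out[out < np.percentile(S, p * 100)] = 0.0
    return out

S = np.array([[1.00, 0.05, 0.60],
              [0.05, 1.00, 0.02],
              [0.60, 0.02, 1.00]])
# Pruning 40% of the values removes the weakest links (0.02 and 0.05 here)
# while leaving the strong ones (0.60) and the diagonal intact.
print(prune(S, 0.4))
```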
C. A Generic χ-Sim Algorithm for χ-Sim_p^k

Equations (7a) and (7b) allow us to compute the similarities between two rows and between two columns. The extension to all pairs of rows and all pairs of columns can be expressed as a simple matrix multiplication. We need to introduce a new notation here: M^(∘k) = ((mij)^k)_{i,j} is the element-wise exponentiation of M to the power k. The algorithm follows:
1) We initialize the similarity matrices SR (documents) and SC (words) with the identity matrix I since, at the first iteration, the similarity between a row (resp. column) and itself equals 1, while all other similarities are zero. We denote these matrices by SR(0) and SC(0), where the superscript denotes the iteration.
2) At each iteration t, we calculate the new similarity matrix between documents SR(t) by using the similarity matrix between words SC(t−1):

SR(t) = M^(∘k) × SC(t−1) × (M^(∘k))^T    (8a)

and ∀i, j ∈ 1..r,  srij(t) ← (srij(t))^(1/k) / (srii(t) × srjj(t))^(1/(2k))    (8b)

We do the same thing for the column similarity matrix SC(t):

SC(t) = (M^(∘k))^T × SR(t−1) × M^(∘k)    (9a)

and ∀i, j ∈ 1..c,  scij(t) ← (scij(t))^(1/k) / (scii(t) × scjj(t))^(1/(2k))    (9b)
3) We set to 0 the lowest p % of the similarity values in the similarity matrices SR and SC.
4) Steps 2 and 3 are repeated t times (typically, as we saw in section II, t = 4 is enough) to iteratively update SR(t) and SC(t).
It is worth noting that even though χ-Sim_p^k computes the similarity between each pair of documents using all pairs of words, the complexity of the algorithm remains comparable to that of classical similarity measures such as the cosine or the Euclidean distance. Given that, for a square matrix of size n × n, the complexity of matrix multiplication is in O(n^3) and the complexity of computing M^(∘k) is in O(n^2), the overall complexity of χ-Sim is in O(tn^3).
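Steps 1 to 4 above can be sketched as follows (an illustrative re-implementation of χ-Sim_p^k, not the authors' code; the pruning uses a simple percentile threshold as one possible reading of step 3):

```python
import numpy as np

def chi_sim_kp(M, k=0.8, p=0.4, iterations=4):
    """Illustrative chi-Sim_p^k following Eqs. (8a)-(9b): element-wise
    exponentiation M^(ok), alternate SR/SC updates with the pseudo-
    normalization, then pruning of the lowest p fraction of the values.
    Assumes a non-negative M with no all-zero row or column."""
    r, c = M.shape
    Mk = M ** k                                   # M^(ok)
    SR, SC = np.eye(r), np.eye(c)                 # SR(0), SC(0)
    for _ in range(iterations):
        SR_new = Mk @ SC @ Mk.T                                       # (8a)
        SR_new = (SR_new / np.sqrt(np.outer(np.diag(SR_new),
                                            np.diag(SR_new)))) ** (1.0 / k)  # (8b)
        SC_new = Mk.T @ SR @ Mk                                       # (9a)
        SC_new = (SC_new / np.sqrt(np.outer(np.diag(SC_new),
                                            np.diag(SC_new)))) ** (1.0 / k)  # (9b)
        for S in (SR_new, SC_new):                # pruning, step 3
            S[S < np.percentile(S, p * 100)] = 0.0
        SR, SC = SR_new, SC_new
    return SR, SC

# Tiny example; with such a small corpus, aggressive pruning may zero
# the few weak links, but the diagonals stay at 1 by construction.
M = np.array([[1., 1., 0.],
              [0., 1., 1.]])
SR, SC = chi_sim_kp(M, k=0.8, p=0.4, iterations=2)
print(np.diag(SR))   # [1. 1.]
```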
IV. EXPERIMENTS
Here, to evaluate our system, we cluster the documents of the well-known 20-Newsgroup dataset (NG20) using the document similarity matrices SR generated by χ-Sim. We chose this dataset since it has been widely used as a benchmark for document classification and co-clustering [3], [4], [20], [21], thus allowing us to compare our results with those reported in the literature.
A. Preprocessing and Methodology

We replicate the experimental procedures used by previous authors in [3], [4], [21]: 10 different samples of each of the 6 subsets described in Table I are generated; we ignored the subject lines, removed stop words, and selected the top 2,000 words based on supervised mutual information [22]. We will discuss this last preprocessing step further in section V. On these six benchmarks, we compared our co-similarity measures based on χ-Sim with three similarity measures (Cosine, LSA [14] and SNOS [6]) as well as three co-clustering methods (ITCC [3], BVD [4] and RSN [21]).
Creation of the clusters. For the 'similarity based' algorithms (χ-Sim, Cosine, LSA and SNOS), the clusters are generated by an Agglomerative Hierarchical Clustering (AHC) method applied to the similarity matrices with Ward's linkage. We then cut the clustering tree at the level corresponding to the number of document clusters we are expecting (two for subset M2, five for subset M5, etc.).
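This clustering step can be sketched as follows; the conversion from similarity to distance (here d = 1 − s) is an assumption, since the paper does not specify it, and SciPy's Ward linkage formally expects Euclidean distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_similarity(SR, n_clusters):
    """Hypothetical sketch: turn a similarity matrix into distances,
    run AHC with Ward's linkage, and cut the tree at the expected
    number of document clusters."""
    D = 1.0 - SR / SR.max()          # map similarities to distances
    np.fill_diagonal(D, 0.0)
    D = (D + D.T) / 2.0              # enforce symmetry before condensing
    Z = linkage(squareform(D, checks=False), method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust')

# Two obvious groups of documents.
SR = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.8],
               [0.1, 0.1, 0.8, 1.0]])
labels = cluster_from_similarity(SR, 2)
print(labels)   # e.g. [1 1 2 2]
```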
Implementations. The χ-Sim algorithms, as well as SNOS and AHC, have been implemented in JAVA; Cosine and LSA have been implemented in MatLab. For ITCC, we used the implementation provided by the authors and the parameters reported in [3]. For BVD and RSN, as we do not have a running implementation, we directly quote the best values from [4] and [21] respectively.
B. Experimental Measures
We used the classical micro-averaged precision (Pr) [3] to compare the accuracy of the document classification; the Normalized Mutual Information (NMI) [23] is also used to compare χ-Sim with RSN. For SNOS, we perform four iterations and set the λ parameter to the value proposed by the authors [6]. For LSA, we tested the algorithm while keeping the h highest singular values, for h = 10 to 200 by steps of 10; we use the value of h providing, on average, the highest micro-averaged precision. For ITCC, we ran the algorithm three times using the different numbers of word clusters suggested in [3] for each dataset. For χ-Sim_p, we performed the pruning step described in section III-C, varying the value of p from 0 to 0.9 by steps of 0.1. For each subset, we report the best micro-averaged precision obtained over p.
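As an illustration, the micro-averaged precision can be computed by mapping each cluster to its majority class; this is a standard definition, and the exact variant used in [3] may differ in details such as tie handling:

```python
from collections import Counter

def micro_precision(clusters, labels):
    """Each cluster votes for its majority class; return the overall
    fraction of correctly assigned documents."""
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i in range(len(labels)) if clusters[i] == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

clusters = [0, 0, 0, 1, 1, 1]
labels   = ['a', 'a', 'b', 'b', 'b', 'a']
print(micro_precision(clusters, labels))   # (2 + 2) / 6 = 0.666...
```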
The experimental results are summarized in Table II. In all its versions, χ-Sim performs better than all the other tested algorithms. Moreover, the new normalization schema proposed in section III clearly improves the results of our algorithm over the previous normalization based on the length of the documents. The SNOS algorithm performs poorly in spite of being very close to χ-Sim, probably because it uses a different normalization.
It is interesting to notice that the gain obtained with pruning when using the previous version of χ-Sim on M10 and NG3 (the two hardest problems) is reduced to almost negligible levels with the new algorithm. Finally, the impact of the parameter k is small for all the subsets but M10 and NG3. On these more complex datasets, we observe that setting k to a value lower than 1 slightly improves the clustering. This seems to show that the results provided by [19], suggesting the use of a value of k lower than 1 with the norm Lk when dealing with high-dimensional spaces, are also relevant in our framework.
V. DISCUSSION ABOUT THE PREPROCESSING
The feature selection step aims at improving the results by removing words that are not useful to separate the different clusters of documents. Moreover, this step is also clearly needed due to the space and time complexity of the algorithms, in O(n^3). Nevertheless, we are performing an unsupervised learning task; thus, using a supervised feature selection method, i.e. selecting the top 2,000 words based on how much information they bring to one class of documents or another, may introduce an unwanted bias, since it eases the problem by building well-separated clusters. In real applications, it is impossible to use this kind of preprocessing for unsupervised learning. Thus, to explore the potential effects of this bias, we propose to generate similar subsets of the NG20 dataset, but this time using an unsupervised feature selection method.
Table I
Subsets of the NG20 dataset used to evaluate our approach. For each subset, we provide the number of clusters (topics) it describes and the total number of documents it contains.

| Name | Newsgroups included | #clusters | #docs |
|------|---------------------|-----------|-------|
| M2 | talk.politics.mideast, talk.politics.misc | 2 | 500 |
| M5 | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast | 5 | 500 |
| M10 | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.gun | 10 | 500 |
| NG1 | rec.sports.baseball, rec.sports.hockey | 2 | 400 |
| NG2 | comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, sci.crypt, sci.space | 5 | 1000 |
| NG3 | comp.os.ms-windows.misc, comp.windows.x, misc.forsale, rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast, talk.religion.misc | 8 | 1600 |
Table II
Micro-averaged precision (and NMI for χ-Sim based algorithms and RSN) along with standard deviation for the various subsets of the Newsgroup dataset (NG20). For the NG subsets, the NMI is given in parentheses after the precision.

| Method | M2 | M5 | M10 | NG1 | NG2 | NG3 |
|---|---|---|---|---|---|---|
| Cosine (Pr) | 0.60 ± 0.00 | 0.63 ± 0.07 | 0.49 ± 0.06 | 0.90 ± 0.11 | 0.60 ± 0.10 | 0.59 ± 0.04 |
| LSA (Pr) | 0.92 ± 0.02 | 0.87 ± 0.06 | 0.59 ± 0.07 | 0.96 ± 0.01 | 0.82 ± 0.03 | 0.74 ± 0.03 |
| ITCC (Pr) | 0.79 ± 0.06 | 0.49 ± 0.10 | 0.29 ± 0.02 | 0.69 ± 0.09 | 0.63 ± 0.06 | 0.59 ± 0.05 |
| BVD (Pr) | best: 0.95 | best: 0.93 | best: 0.67 | - | - | - |
| RSN (NMI) | - | - | - | 0.64 ± 0.16 | 0.75 ± 0.07 | 0.70 ± 0.04 |
| SNOS (Pr) | 0.55 ± 0.02 | 0.25 ± 0.02 | 0.24 ± 0.06 | 0.51 ± 0.01 | 0.24 ± 0.02 | 0.22 ± 0.05 |
| χ-Sim | 0.91 ± 0.09 | 0.96 ± 0.00 | 0.69 ± 0.05 | 0.96 ± 0.01 (0.76 ± 0.06) | 0.92 ± 0.01 (0.79 ± 0.02) | 0.79 ± 0.06 (0.72 ± 0.03) |
| χ-Sim_p | 0.94 ± 0.01 | 0.96 ± 0.00 | 0.73 ± 0.03 | 0.97 ± 0.01 (0.78 ± 0.05) | 0.92 ± 0.01 (0.79 ± 0.02) | 0.84 ± 0.05 (0.73 ± 0.02) |
| χ-Sim^1 | 0.95 ± 0.00 | 0.96 ± 0.02 | 0.78 ± 0.03 | 0.97 ± 0.02 (0.85 ± 0.07) | 0.94 ± 0.01 (0.83 ± 0.03) | 0.86 ± 0.05 (0.79 ± 0.03) |
| χ-Sim^1_p | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.78 ± 0.03 | 0.98 ± 0.01 (0.86 ± 0.04) | 0.94 ± 0.01 (0.83 ± 0.03) | 0.87 ± 0.05 (0.80 ± 0.02) |
| χ-Sim^0.8 | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.79 ± 0.02 | 0.98 ± 0.01 (0.87 ± 0.05) | 0.94 ± 0.01 (0.84 ± 0.02) | 0.90 ± 0.01 (0.81 ± 0.02) |
| χ-Sim^0.8_p | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.80 ± 0.04 | 0.98 ± 0.00 (0.88 ± 0.03) | 0.94 ± 0.01 (0.85 ± 0.02) | 0.90 ± 0.02 (0.81 ± 0.03) |
A. Unsupervised Feature Selection

To reduce the number of words in the learning set, we used an approach consisting of selecting a representative subset (a sampling) of the words with the help of the Partitioning Around Medoids (PAM) algorithm [24]. This algorithm is quite simple, more robust to noise and outliers than the k-means algorithm, and less sensitive to initial values as well. The procedure is the following: first, we remove from the corpus the words appearing in just one document, as they do not provide information to build the clusters; then, we run PAM to get 2,000 classes corresponding to a selection of 2,000 words. We used the implementation of PAM provided in the R project [25], with the Euclidean distance.
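The word-sampling idea can be sketched as follows. Note that this is a simplified k-medoids (Voronoi iteration) rather than the full PAM swap procedure of the R package used in the paper:

```python
import numpy as np

def kmedoids_sample(X, n_medoids, n_iter=10, seed=0):
    """Simplified k-medoids to pick representative feature vectors.
    X holds one row per word (e.g. word-by-document counts); the
    returned indices are the selected words."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    medoids = rng.choice(n, size=n_medoids, replace=False)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # Euclidean
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for m in range(n_medoids):
            members = np.where(assign == m)[0]
            if len(members):
                # medoid = member minimizing total distance to its cluster
                new[m] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

# Toy example: 6 'words' in two obvious groups; select 2 representatives.
X = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
              [5., 5.], [5.1, 5.], [5., 5.1]])
reps = kmedoids_sample(X, 2)
print(reps)   # one representative from each group
```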
B. Results with PAM
Here, we use the same methodology as described in
section IV-A except for the feature selection step which
is now done with PAM instead of the supervised mutual
information. The new experimental results are summarized
in Table III.
We can observe that the results are very different. The version of χ-Sim using the previous normalization method obtains more or less the same results as LSA: a little worse without pruning and slightly better with pruning. With the new normalization, the results are more contrasted and now, unlike in the first experiments, the impact of the pruning factor p becomes very strong: without pruning, the new method performs poorly on several problems (M2, M10, NG2, NG3), the results being lower than those of the Cosine similarity; but when the smallest values of the similarity matrices are pruned, the situation is completely reversed and the χ-Sim_p based algorithms obtain the best results on all the datasets. As in section IV, we observe again that setting a value of k lower than 1 improves the clustering on all the datasets but one. It is now interesting to look in more detail at the impact of the different values of k and p on a given dataset.
Table III
Micro-averaged precision along with standard deviation for the various subsets of the Newsgroup dataset (NG20), pre-processed using the PAM feature selection.

| Method | M2 | M5 | M10 | NG1 | NG2 | NG3 |
|---|---|---|---|---|---|---|
| Cosine | 0.61 ± 0.04 | 0.54 ± 0.08 | 0.39 ± 0.03 | 0.52 ± 0.01 | 0.60 ± 0.05 | 0.49 ± 0.02 |
| LSA | 0.79 ± 0.09 | 0.66 ± 0.05 | 0.44 ± 0.04 | 0.56 ± 0.05 | 0.61 ± 0.06 | 0.52 ± 0.03 |
| ITCC | 0.70 ± 0.05 | 0.54 ± 0.05 | 0.29 ± 0.05 | 0.61 ± 0.06 | 0.44 ± 0.08 | 0.49 ± 0.07 |
| SNOS | 0.51 ± 0.01 | 0.26 ± 0.04 | 0.20 ± 0.02 | 0.51 ± 0.00 | 0.24 ± 0.01 | 0.22 ± 0.02 |
| χ-Sim | 0.58 ± 0.07 | 0.62 ± 0.12 | 0.43 ± 0.04 | 0.54 ± 0.03 | 0.60 ± 0.12 | 0.47 ± 0.05 |
| χ-Sim_p | 0.65 ± 0.09 | 0.68 ± 0.06 | 0.47 ± 0.04 | 0.62 ± 0.12 | 0.63 ± 0.14 | 0.57 ± 0.04 |
| χ-Sim^1 | 0.54 ± 0.06 | 0.62 ± 0.13 | 0.36 ± 0.04 | 0.53 ± 0.02 | 0.35 ± 0.09 | 0.30 ± 0.05 |
| χ-Sim^1_p | 0.80 ± 0.13 | 0.77 ± 0.08 | 0.53 ± 0.05 | 0.75 ± 0.07 | 0.73 ± 0.06 | 0.61 ± 0.03 |
| χ-Sim^0.8 | 0.54 ± 0.05 | 0.66 ± 0.07 | 0.37 ± 0.06 | 0.52 ± 0.02 | 0.38 ± 0.08 | 0.36 ± 0.04 |
| χ-Sim^0.8_p | 0.81 ± 0.10 | 0.79 ± 0.05 | 0.55 ± 0.04 | 0.81 ± 0.02 | 0.72 ± 0.02 | 0.64 ± 0.04 |
Figure 2 shows the evolution of the accuracy on the NG1
subset according to the value of p. When the words are
selected by supervised mutual information, the curve is quite
flat, but when the words are selected with PAM, we see
a different behavior: the accuracy first increases with the
pruning level, the best value being about 60% (it is worth
noticing that this value is very stable across the datasets).
This reinforces our assumption that pruning the similarity
matrices can be a good way of dealing with 'noise'. Indeed,
when the features are selected with Mutual Information,
the classes are relatively well separated; thus, similarity
propagation resulting from higher-order co-occurrences between documents (or words) of different categories has little
influence. However, with the unsupervised feature selection,
there is more 'noise' in the data and the pruning process
helps significantly to alleviate this problem.
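One simple reading of such a pruning step is to zero out, in each row of the similarity matrix, the fraction p of smallest entries, so that only the strongest similarities keep propagating. The sketch below illustrates this idea only; it is not the exact pruning algorithm of the paper:

```python
import numpy as np

def prune_similarity(sim, p):
    """Zero out, in each row, the fraction p of smallest entries.
    Illustrative sketch of similarity-matrix pruning; the exact
    scheme used by the chi-Sim algorithm may differ."""
    pruned = sim.astype(float).copy()
    n = pruned.shape[1]
    k = int(np.floor(p * n))                 # entries dropped per row
    if k == 0:
        return pruned
    # indices of the k smallest entries in each row
    idx = np.argsort(pruned, axis=1)[:, :k]
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned
```

With p = 0.6 (the level found to work well here) and rows of length 3, each row loses its single smallest similarity while the dominant ones are preserved.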
Figure 3 shows the evolution of the accuracy, again on the
NG1 subset, according to the value of k. As we can see, on
this dataset, where the document and word vectors tend to
be highly sparse, the best values for this parameter seem to
lie between 0.5 and 1, as in the case of the norm Lk
[19]; we chose the value 0.8 in the results tables. However,
this effect can only be seen when the pruning parameter is
activated (plain line on the figure).
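The Lk pseudo-norm referred to here, following [19], raises the absolute coordinates to the power k before summing; for k < 1 the triangle inequality no longer holds (hence "pseudo"), but the resulting distances are known to be more discriminative in high-dimensional sparse spaces. A one-line sketch:

```python
def lk_pseudo_norm(x, k):
    """L_k 'pseudo-norm' (sum_i |x_i|^k)^(1/k).
    For k >= 1 this is the usual L_k norm; for 0 < k < 1 it is
    only a pseudo-norm (triangle inequality fails)."""
    return sum(abs(v) ** k for v in x) ** (1.0 / k)
```

For example, with k = 2 the vector (3, 4) has norm 5, while with k = 0.5 the vector (1, 1) gets the value (1 + 1)^2 = 4, showing how values of k below 1 amplify the contribution of having many small non-zero coordinates.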
Finally, it is worth noting that in these experiments with
PAM, the difference between LSA and Cosine strongly
decreases. All these results demonstrate that preprocessing
the data with a supervised feature selection approach totally
changes (unsurprisingly) the behavior of the clustering methods by simplifying the problem too much.
VI. C ONCLUSION
In this paper, we proposed two empirical improvements
of the χ-Sim co-similarity measure. The new normalization
we presented for this measure is more consistent with the
framework of the algorithm and also (partially) satisfies the
reflexivity property.

Figure 2. Evolution of the precision for NG1 (using χ-Sim_p^0.6) against p,
along with error bars representing the standard deviation over the 10 folds.
The dotted line represents the supervised feature selection data, and the
plain one the unsupervised feature selection data.

Figure 3. Evolution of the precision for NG1 (using χ-Sim_0^k for the
dotted line, and χ-Sim_0.6^k for the plain one) against k, along with error
bars representing the standard deviation over the 10 folds.

Furthermore, we showed that the χ-Sim
similarity measure is susceptible to noise and proposed
a way to alleviate this susceptibility and to improve the
precision. On the experimental side, our co-similarity based
approach performs significantly better than the other co-clustering algorithms we tested for the task of document
clustering. In contrast to [3], [4], our algorithm does not
need to cluster the words (columns) in order to cluster the documents (rows), thus avoiding the need to know the number of
word clusters; moreover, the learning parameters p and k introduced
here seem relatively easy to tune. Nevertheless, we will investigate how to automatically find the best values for these
parameters, using the similarity matrix analysis from [19]. It is
also worth noting that our co-similarity measure performs
better than LSA. Unfortunately, as we saw in section III-A,
the current method is not well-defined from the theoretical
point of view and we need to analyze its behavior in order
to understand the role of the pseudo-normalization and to
see if it is possible to turn it into a real normalization.
ACKNOWLEDGMENT
This work is part of a PhD thesis funded by the Higher
Education Commission, Government of Pakistan. This work
is partially supported by the French ANR project FRAGRANCES under grant 2008-CORD 00801.
R EFERENCES
[1] G. Bisson and F. Hussain, "Chi-Sim: A new similarity measure for the co-clustering task," in Proceedings of the Seventh ICMLA. IEEE Computer Society, Dec. 2008, pp. 211–217.

[2] G. Salton, The SMART Retrieval System—Experiments in Automatic Document Processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1971.

[3] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD, 2003, pp. 89–98.

[4] B. Long, Z. M. Zhang, and P. S. Yu, "Co-clustering by block value decomposition," in Proceedings of the Eleventh ACM SIGKDD. New York, NY, USA: ACM, 2005, pp. 635–640.

[5] M. Rege, M. Dong, and F. Fotouhi, "Bipartite isoperimetric graph partitioning for data co-clustering," Data Min. Knowl. Discov., vol. 16, no. 3, pp. 276–312, 2008.

[6] N. Liu, B. Zhang, J. Yan, Q. Yang, S. Yan, Z. Chen, F. Bai, and W.-Y. Ma, "Learning similarity measures in non-orthogonal space," in Proceedings of the 13th ACM CIKM. ACM Press, 2004, pp. 334–341.

[7] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," 2004.

[8] N. Speer, C. Spieth, and A. Zell, "A memetic clustering algorithm for the functional partition of genes based on the gene ontology," 2004.

[9] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, Boston, 2000, pp. 93–103.

[10] N. Slonim and N. Tishby, "The power of word clusters for text classification," in 23rd European Colloquium on Information Retrieval Research, 2001.

[11] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217–235.

[12] K. Livezay and C. Burgess, "Mediated priming in high-dimensional meaning space: What is "mediated" in mediated priming?" in Proceedings of the Cognitive Science Society, 1998.

[13] B. Lemaire and G. Denhière, "Effects of high-order co-occurrences on word semantic similarities," Current Psychology Letters – Behaviour, Brain and Cognition, vol. 18(1), 2008.

[14] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391–407, 1990.

[15] X.-J. Wang, W.-Y. Ma, G.-R. Xue, and X. Li, "Multi-model similarity propagation and its application for web image retrieval," in Proceedings of the 12th Annual ACM MULTIMEDIA. New York, NY, USA: ACM, 2004, pp. 944–951.

[16] A. M. Qamar and E. Gaussier, "Online and batch learning of generalized cosine similarities," in Proceedings of the Ninth IEEE ICDM. Washington, DC, USA: IEEE Computer Society, 2009, pp. 926–931.

[17] E. Seneta, Non-Negative Matrices and Markov Chains. Springer, 2006.

[18] S. Zelikovitz and H. Hirsh, "Using LSI for text classification in the presence of background text," in Proceedings of the 10th ACM CIKM. ACM Press, 2001, pp. 113–118.

[19] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space," in Lecture Notes in Computer Science. Springer, 2001, pp. 420–434.

[20] D. Zhang, Z.-H. Zhou, and S. Chen, "Semi-supervised dimensionality reduction," in Proceedings of the SIAM ICDM, 2007.

[21] B. Long, X. Wu, Z. M. Zhang, and P. S. Yu, "Unsupervised learning on k-partite graphs," in Proceedings of the 12th ACM SIGKDD. New York, NY, USA: ACM, 2006, pp. 317–326.

[22] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in ICML, 1997, pp. 412–420.

[23] A. Banerjee and J. Ghosh, "Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres," 2002.

[24] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.

[25] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org