An Improved Co-Similarity Measure for Document Clustering

Syed Fawad Hussain, Gilles Bisson
Laboratoire TIMC-IMAG, UMR 5525, University of Grenoble, France
{Fawad.Hussain,Gilles.Bisson}@imag.fr

Clément Grimal
Laboratoire d'Informatique de Grenoble, UMR 5217, University of Grenoble, France
Clement.Grimal@imag.fr

Abstract—Co-clustering has been defined as a way to organize simultaneously subsets of instances and subsets of features in order to improve the clustering of both of them. In previous work [1], we proposed an efficient co-similarity measure allowing the simultaneous computation of two similarity matrices, one between objects and one between features, each built on the basis of the other. Here we propose a generalization of this approach by introducing a notion of pseudo-norm and a pruning algorithm. Our experiments show that this new algorithm significantly improves the accuracy of the results when using either supervised or unsupervised feature selection, and that it outperforms other algorithms on various corpora.

Keywords—co-clustering; similarity measure; text mining

I. INTRODUCTION

The clustering task is used to organize or summarize data coming from databases. Classically, the data are described as a set of instances characterized by a set of features. In some cases, these features are homogeneous enough to allow us to cluster them, in the same way as we do for the instances. For example, when using the Vector Space Model introduced by [2], text corpora are represented by a matrix whose rows represent document vectors and whose columns represent word vectors. The similarity between two documents obviously depends on the similarity between the words they contain and vice versa. In classical clustering methods, such dependencies are not exploited. The purpose of co-clustering is to take this duality between rows and columns into account to identify the relevant clusters. In this regard, co-clustering has been widely studied in recent years, both in document clustering [3]-[6] and in bioinformatics [7]-[9].

In text analysis, the advantage of co-clustering is related to the well-known problem that document and word vectors tend to be highly sparse and suffer from the curse of dimensionality [10]. In such a scenario, traditional metrics such as the Euclidean distance or the Cosine similarity do not always make much sense [11]. Several methods have been proposed to overcome these limitations by exploiting the dual relationship between documents and words to extract semantic knowledge from the data. Consequently, the concept of higher-order co-occurrences has been investigated in [12], [13], among others, as a measure of the semantic relationship between words; one of the best-known approaches to acquire such knowledge is Latent Semantic Analysis [14].

The underlying analogy is that humans do not necessarily use the same vocabulary when writing about the same topic. For example, let us consider a corpus in which a subset of documents contains a significant number of co-occurrences between the words sea and waves and another subset in which the words ocean and waves co-occur. A human could infer that the words ocean and sea are conceptually related even if they do not directly co-occur in any document. Such a relationship between waves and ocean (or sea and waves) is termed a first-order co-occurrence, and the conceptual association between sea and ocean is called a second-order relationship. This concept can be generalized to higher-order (3rd, 4th, 5th, etc.) co-occurrences.
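To make the notion of higher-order co-occurrences concrete, here is a minimal sketch on a toy document-word matrix (hypothetical counts, not taken from any corpus used in this paper): sea and ocean never co-occur directly, yet they become linked at the second order through waves.

```python
import numpy as np

# Toy document-word matrix: rows = documents, columns = [sea, waves, ocean]
# (hypothetical counts, only for illustration)
M = np.array([
    [2, 1, 0],   # document about "sea" and "waves"
    [1, 2, 0],   # document about "sea" and "waves"
    [0, 1, 2],   # document about "waves" and "ocean"
    [0, 2, 1],   # document about "waves" and "ocean"
])

# First-order word co-occurrences: C1[i, j] > 0 iff words i and j
# appear together in at least one document.
C1 = M.T @ M
print(C1[0, 2])   # sea/ocean never co-occur directly -> 0

# Second-order co-occurrences: paths of length 2 in the word graph.
C2 = C1 @ C1
print(C2[0, 2])   # sea and ocean are now linked through "waves" -> positive
```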
In this context, we recently introduced an algorithm, called χ-Sim [1], exploiting the duality between words and documents in a corpus as well as their respective higher-order co-occurrences. While most authors have focused on directly co-clustering the data, in χ-Sim we simply build two similarity matrices, one for the rows and one for the columns, each being built iteratively on the basis of the other. We call this process the co-similarity measure. Hence, once the two similarity matrices are built, each of them contains all the information needed to do a 'separate' co-clustering of the data (documents and words) by using any classical clustering algorithm (K-means, Hierarchical Clustering, etc.). In this way, the final user can choose the algorithm that is best suited to co-cluster his data.

In this paper we further analyze the behavior of χ-Sim and we propose some ideas that dramatically improve the quality of the co-similarity measures. First, we introduce a new normalization schema for this measure that is more consistent with the framework of the algorithm and that offers new research perspectives. Second, we propose an efficient way to deal with noise in the data and thus to improve the accuracy of the clustering.

The rest of the paper is organized as follows: in section II, we explain a generalized version of χ-Sim. In section III, we discuss the shortcomings of this algorithm, propose some improvements and describe a new algorithm. Experimental results on some classical datasets are presented in sections IV and V using, respectively, supervised and unsupervised feature selection methods. Finally, in section VI we present conclusions and future work.

II. THE χ-SIM SIMILARITY MEASURE

In this paper, we will use the following classical notations: matrices (in capital letters) and vectors (in small letters) are in bold and all variables are in italic.

Data matrix: let M be the data matrix representing a corpus having r rows (documents) and c columns (words); m_ij corresponds to the 'intensity' of the link between the i-th row and the j-th column (for a document-word matrix, it can be the number of occurrences of the j-th word in the i-th document); m_{i:} = [m_{i1} ... m_{ic}] is the row vector representing the document i and m_{:j} = [m_{1j} ... m_{rj}] is the column vector corresponding to the word j. We will refer to a document as d_i when talking about documents casually and refer to it as m_{i:} when specifying its (row) vector in the matrix M. Similarly, we will casually refer to a word as w_j and use the notation m_{:j} when emphasizing the vector.

Similarity matrices: SR and SC represent the square and symmetrical row similarity and column similarity matrices of size r × r and c × c respectively, with ∀i, j = 1..r, sr_ij ∈ [0, 1] and ∀i, j = 1..c, sc_ij ∈ [0, 1].

Similarity function: the function F_s(·, ·) is a generic function that takes two elements m_il and m_jn of M and returns a measure of the similarity F_s(m_il, m_jn) between them.

A. Similarity measures

The χ-Sim algorithm is a co-similarity based approach which builds on the idea of simultaneously generating the similarity matrices SR (between documents) and SC (between words), each of them built on the basis of the other. Similar ideas have also been used for supervised learning in [6] and for image retrieval in [15]. First, we present how to compute the co-similarity matrix SR.
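Before doing so, here is a minimal restatement of the above notation in code; the matrix values below are arbitrary and only illustrate the shapes involved.

```python
import numpy as np

r, c = 4, 6                     # r documents (rows), c words (columns)
rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(r, c)).astype(float)   # data matrix: m_ij = word counts

m_row = M[0, :]   # row vector m_{1:}  -> document d_1
m_col = M[:, 2]   # column vector m_{:3} -> word w_3

# Square, symmetric similarity matrices, initialized to the identity
SR = np.eye(r)    # document-document similarities (r x r)
SC = np.eye(c)    # word-word similarities (c x c)
```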
Usually, the similarity (or distance) measure between two documents m_{i:} and m_{j:} is defined as a function – denoted here as Sim(m_{i:}, m_{j:}) – that is more or less the sum of the similarities between the words occurring in both m_{i:} and m_{j:}:

    Sim(m_{i:}, m_{j:}) = F_s(m_{i1}, m_{j1}) + ... + F_s(m_{ic}, m_{jc})    (1)

Now let's suppose we already know a matrix SC whose entries provide a measure of similarity between the columns (words) of the corpus. In parallel, let's introduce, by analogy with the norm L_k (Minkowski distance), the notion of a pseudo-norm k. Then, (1) can be re-written as follows without changing its meaning if sc_ll = 1 and if k = 1:

    Sim(m_{i:}, m_{j:}) = \sqrt[k]{ \sum_{l=1}^{c} (F_s(m_{il}, m_{jl}))^k \times sc_{ll} }    (2)

Now the idea is to generalize (2) in order to take into account all the possible pairs of features (words) occurring in documents m_{i:} and m_{j:}. In this way, we "capture" not only the similarity between their common words but also the similarity coming from words that are not directly shared by the two documents. Of course, for each pair of words not directly shared by the documents, we weight their contribution to the document similarity sr_ij by their own similarity sc_ln. Thus, the overall similarity between documents m_{i:} and m_{j:} is defined in (3), in which the terms for l = n are those occurring in (2):

    Sim^k(m_{i:}, m_{j:}) = \sqrt[k]{ \sum_{l=1}^{c} \sum_{n=1}^{c} (F_s(m_{il}, m_{jn}))^k \times sc_{ln} }    (3)

Assuming that our function F_s(m_il, m_jn) is defined as a product (see [1] for further details) of the elements m_il and m_jn, i.e. F_s(m_il, m_jn) = m_il × m_jn (as with the cosine similarity measure), we can rewrite Equation (3) as:

    Sim^k(m_{i:}, m_{j:}) = \sqrt[k]{ (m_{i:})^k \times SC \times ((m_{j:})^k)^T }    (4)

where (m_{i:})^k = [(m_{i1})^k ... (m_{ic})^k] denotes the element-wise k-th power of the row vector and (·)^T denotes the transpose. Finally, let's introduce the term N(m_{i:}, m_{j:}), a normalization function depending on m_{i:} and m_{j:} and allowing us to map the similarity to [0, 1]. We obtain the following equation, in which sr_ij denotes an element of the SR matrix:

    sr_ij = \frac{ \sqrt[k]{ (m_{i:})^k \times SC \times ((m_{j:})^k)^T } }{ N(m_{i:}, m_{j:}) }    (5)

Equation (5) is a classic generalization of several well-known similarity measures. For example, with k = 1, the Jaccard index can be obtained by setting SC to I and N(m_{i:}, m_{j:}) to ||m_{i:}||_1 + ||m_{j:}||_1 − m_{i:} m_{j:}^T, while the Dice coefficient can be obtained by setting SC to 2I and N(m_{i:}, m_{j:}) to ||m_{i:}||_1 + ||m_{j:}||_1. Furthermore, if SC is set to a positive semi-definite matrix A, one can define the inner product <m_{i:}, m_{j:}>_A = m_{i:} × A × m_{j:}^T, along with the associated norm ||m_{i:}||_A = \sqrt{<m_{i:}, m_{i:}>_A}. Then, by setting N(m_{i:}, m_{j:}) to ||m_{i:}||_A × ||m_{j:}||_A, we obtain the Generalized Cosine similarity [16], as it corresponds to the Cosine measure in the underlying inner product space. Of course, by binding A to I, this similarity becomes the standard Cosine measure.

B. The χ-Sim Co-Similarity Measure

The χ-Sim co-similarity measure, as defined in [1], can also be reformulated with (5). We set k to 1, and since the maximum value of sc_ij is 1, it follows from (3) that the upper bound of Sim(m_{i:}, m_{j:}), for 1 ≤ i, j ≤ r, is given by the product of the sums of the elements (assumed to be positive numbers) of m_{i:} and m_{j:}, denoted |m_{i:}| × |m_{j:}| (product of L_1-norms). This normalization seems well-suited for textual datasets since it allows us to take into consideration the actual length of the document and word vectors when dealing with pairs of documents or words of uneven length, which is common in text corpora.
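Here is a small sketch of equations (4)-(5), under the assumption F_s(m_il, m_jn) = m_il × m_jn; with SC = I, k = 1 and N set to the product of L_2 norms, it reduces to the ordinary cosine similarity, which gives a quick sanity check. The function names are ours, not those of the paper's implementation.

```python
import numpy as np

def sim_k(mi, mj, SC, k=1.0):
    """Equation (4): k-th root of (mi^k) * SC * (mj^k)^T (element-wise powers)."""
    return (np.power(mi, k) @ SC @ np.power(mj, k)) ** (1.0 / k)

def sr_ij(mi, mj, SC, k=1.0, N=None):
    """Equation (5): generalized similarity with a user-chosen normalization N."""
    if N is None:                        # default: product of L2 norms (cosine-like)
        N = np.linalg.norm(mi) * np.linalg.norm(mj)
    return sim_k(mi, mj, SC, k) / N

# Sanity check with SC = I and k = 1: we recover the standard cosine similarity.
mi = np.array([1.0, 2.0, 0.0])
mj = np.array([2.0, 1.0, 1.0])
I = np.eye(3)
cosine = mi @ mj / (np.linalg.norm(mi) * np.linalg.norm(mj))
assert np.isclose(sr_ij(mi, mj, I, k=1.0), cosine)
```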
With this normalization, we can therefore rewrite (5) as:

    ∀i, j ∈ 1..r,  sr_ij = \frac{ m_{i:} \times SC \times m_{j:}^T }{ |m_{i:}| \times |m_{j:}| }    (6a)

Similarly, the elements sc_ij of SC correspond to:

    ∀i, j ∈ 1..c,  sc_ij = \frac{ m_{:i}^T \times SR \times m_{:j} }{ |m_{:i}| \times |m_{:j}| }    (6b)

Thus, equations (6a) and (6b) define a system of linear equations whose solutions correspond to the (co-)similarities between two documents and two words. The χ-Sim algorithm is therefore classically based on an iterative approach – i.e. we alternately compute the values sc_ij and sr_ij. However, before detailing this algorithm for a more generic case in section III-C, we explain the meaning of these iterations by considering the associated bipartite graph.

C. Graph Theoretical Interpretation

The graphical interpretation of the method helps to understand the working of the algorithm and provides some intuition on how to improve it. Let's consider the bipartite graph representation of a sample data matrix in Fig. 1. Documents and words are represented by square and circle nodes respectively, and an edge (of any kind) between a document d_i and a word w_j corresponds to a non-zero entry m_ij in the document-word matrix.

Figure 1. A bipartite graph view of a sample document-word matrix, with documents d_1..d_4 and words w_1..w_6.

There is only one order-1 path between documents d_1 and d_2, given by d_1 -m_12-> w_2 -m_22-> d_2. If we consider that the SC matrix is initialized with the identity matrix I, at the first iteration Sim(m_{1:}, m_{2:}) corresponds to the inner product between m_{1:} and m_{2:} as given by (6a) and equals m_12 × m_22. Omitting the normalization for the sake of clarity, the matrix SR^(1) = M × M^T thus represents all order-1 paths between all the possible pairs of documents d_i and d_j. Similarly, each element of SC^(1) = M^T × M represents all order-1 paths between all the possible pairs of words w_i and w_j.

Now, documents d_1 and d_4 do not have an order-1 path but are linked together through d_2 (bold paths in Fig. 1) and d_3 (dotted paths in Fig. 1). Such paths with one intermediate vertex are called order-2 paths, and they appear during the second iteration. The similarity value contributed via the document d_2 can be explicitly represented as d_1 -m_12-> w_2 -m_22-> d_2 -m_24-> w_4 -m_44-> d_4. The sub-sequence w_2 -m_22-> d_2 -m_24-> w_4 represents an order-1 path between the words w_2 and w_4, which is the same as sc^(1)_24. The contribution of d_2 to the similarity sr^(1)_14 can thus be re-written as m_12 × sc^(1)_24 × m_44. This is a partial similarity measure since d_2 is not the only document that provides a link between d_1 and d_4. The similarity via d_3 is equal to m_13 × sc^(1)_35 × m_45. To find the overall similarity measure between documents d_1 and d_4, we need to add these partial similarity values, given by m_12 × sc^(1)_24 × m_44 + m_13 × sc^(1)_35 × m_45. Hence, the similarity matrix SR^(2) at the second iteration corresponds to paths of order 2 between documents. It can be shown similarly that the matrices SR^(t) and SC^(t) represent order-t paths between documents and between words respectively.

Consequently, at each iteration t, when we compute the values of equations (6a) and (6b), one or more new links may be found between previously disjoint objects (documents or words), corresponding to paths of length t, and existing similarity measures may be strengthened. It has been shown that "in the long run" the ending point of a random walk does not depend on its starting point [17], and hence it is possible to find a path (and hence a similarity) between any pair of nodes in a connected graph [18] by iterating a sufficiently large number of times. However, co-occurrences beyond the 3rd and 4th order have little semantic relevance [1], [13]. Therefore, the number of iterations is usually limited to 4 or less.
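The path-counting interpretation can be checked numerically. The matrix below is only loosely inspired by Fig. 1 (its exact values are not given in the text, so these are assumed): d_1 and d_4 share no word, hence their unnormalized similarity is zero after the first iteration and becomes positive at the second one, through the two order-2 paths described above.

```python
import numpy as np

# Rows d1..d4, columns w1..w6 (an assumed instance of the bipartite graph of Fig. 1).
M = np.array([
    [0, 1, 1, 0, 0, 0],   # d1: w2, w3
    [0, 1, 0, 1, 0, 0],   # d2: w2, w4
    [0, 0, 1, 0, 1, 0],   # d3: w3, w5
    [0, 0, 0, 1, 1, 1],   # d4: w4, w5, w6
], dtype=float)

# Iteration 1 (normalization omitted for clarity): order-1 paths.
SR1 = M @ M.T          # document-document
SC1 = M.T @ M          # word-word
print(SR1[0, 3])       # d1 and d4 share no word -> 0

# Iteration 2: order-2 paths, obtained by re-injecting SC1.
SR2 = M @ SC1 @ M.T
print(SR2[0, 3])       # now > 0: paths d1-w2-d2-w4-d4 and d1-w3-d3-w5-d4
```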
III. DISCUSSION AND IMPROVEMENTS OF χ-SIM

In this section, we first discuss a new normalization schema for χ-Sim in order to (partially) satisfy the maximality property of a similarity measure (Sim(a, b) ≤ Sim(a, a) = 1); then we propose a pruning method allowing us to remove the 'noisy' similarity values created during the iterations.

A. Normalization

In this paper, we investigate extensions of the Generalized Cosine measure, by relaxing the positive semi-definiteness of the matrix and by adding a pseudo-norm parameter k. Henceforth, using equation (4), we define the elements of the matrices SR and SC with the two new equations (7a) and (7b):

    ∀i, j ∈ 1..r,  sr_ij = \frac{ Sim^k(m_{i:}, m_{j:}) }{ \sqrt{ Sim^k(m_{i:}, m_{i:}) \times Sim^k(m_{j:}, m_{j:}) } }    (7a)

    ∀i, j ∈ 1..c,  sc_ij = \frac{ Sim^k(m_{:i}, m_{:j}) }{ \sqrt{ Sim^k(m_{:i}, m_{:i}) \times Sim^k(m_{:j}, m_{:j}) } }    (7b)

However, this normalization is what we will call a pseudo-normalization since, although it guarantees that sr_ii = 1, it does not guarantee that ∀i, j ∈ 1..r, sr_ij ∈ [0, 1] (and the same holds for sc_ij). Consider for example a corpus having, among many other documents, the document d_1 containing the word orange (w_1) and the document d_2 containing the words red (w_2) and banana (w_3), along with SC – the similarity matrix of all the words of the corpus – indicating that the similarity between orange and red is 1, the similarity between orange and banana is 1, and the similarity between red and banana is 0. Thus, Sim^1(d_1, d_1) = 1, Sim^1(d_2, d_2) = 2 and Sim^1(d_1, d_2) = 2. Consequently, sr_12 = 2 / \sqrt{1 × 2} = \sqrt{2} > 1.

One can notice that this problem arises from the polysemic nature of the word orange. Indeed, the similarity between these two documents is overemphasized because of the double analogy between orange (the color) and red, and between orange (the fruit) and banana. It is possible to correct this problem by setting k = +∞, since the pseudo-norm k then reduces to max_{1 ≤ l,n ≤ c} { m_il × sc_ln × m_jn } and thus Sim^∞(d_1, d_1) = Sim^∞(d_2, d_2) = Sim^∞(d_1, d_2) = 1, implying sr_12 = 1. Of course, k = +∞ is not necessarily a good setting for real tasks, and experimentally we observed that the values of sr_ij and sc_ij generally remain smaller than or equal to 1. In this framework it is nevertheless very interesting to investigate the different results one can obtain by varying k, including values lower than 1, as suggested in [19] for the norm L_k, to deal with high-dimensional spaces. The resulting χ-Sim algorithm will be denoted by χ-Sim^k.

However, the situation is different from the norm L_k in the sense that our method does not define a proper Normed Vector Space. To understand the problem, it is worth looking closely at the simple case k = 1, where Sim^1(m_{i:}, m_{j:}) = m_{i:} × SC × m_{j:}^T is the general form of an inner product, with the condition that SC is symmetric positive semi-definite (PSD). Unfortunately, in our case, due to the normalization steps, SC is not necessarily PSD, as the condition ∀i, j ∈ 1..c, |sc_ij| ≤ \sqrt{sc_ii × sc_jj} = 1 is not verified (cf. the previous example). Thus, our similarity measure is just a bilinear form in a degenerated inner product space (as the conjugate and linearity axioms are trivially satisfied) in which it corresponds to the 'cosine'. A straightforward solution would be to project SC (and SR) after each iteration onto the set of PSD matrices [16]. By constraining the similarity matrices to be PSD, we would ensure that the new space remains a proper inner product space. However, we experimentally verified that such an additional step does not improve the results since, when testing on real datasets, the similarity matrices are already very close to the set of PSD matrices. In addition, the projection step is very time consuming; for these reasons, we will not use it in the remainder of this paper.
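The orange/red/banana example discussed above can be replayed numerically with the sketch below (binary occurrences assumed); it shows sr_12 > 1 for k = 1 and sr_12 coming back towards 1 as k grows, as argued above.

```python
import numpy as np

def sim_k(m_i, m_j, SC, k):
    """Sim^k of equation (4), with element-wise powers of the document vectors."""
    return (np.power(m_i, k) @ SC @ np.power(m_j, k)) ** (1.0 / k)

def sr(m_i, m_j, SC, k):
    """Pseudo-normalized similarity of equation (7a)."""
    return sim_k(m_i, m_j, SC, k) / np.sqrt(sim_k(m_i, m_i, SC, k) * sim_k(m_j, m_j, SC, k))

# Words: w1 = orange, w2 = red, w3 = banana (binary occurrences assumed).
d1 = np.array([1.0, 0.0, 0.0])          # d1 contains 'orange'
d2 = np.array([0.0, 1.0, 1.0])          # d2 contains 'red' and 'banana'
SC = np.array([[1.0, 1.0, 1.0],         # orange~red = 1, orange~banana = 1,
               [1.0, 1.0, 0.0],         # red~banana = 0
               [1.0, 0.0, 1.0]])

print(sr(d1, d2, SC, k=1))              # sqrt(2) ~ 1.41 : exceeds 1
print(sr(d1, d2, SC, k=50))             # ~ 1.01 : approaches 1 as k grows
```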
B. Dealing with 'noise' in the SC and SR matrices

As explained in section II-C, the elements of the SR matrix after the first iteration are the weighted order-1 paths in the graph: the diagonal elements sr_ii correspond to the paths from each document to itself, while the non-diagonal terms sr_ij count the number of order-1 paths between a document i and a neighbour j, which is based on the number of words they have in common. SR^(1) is thus the adjacency matrix of the document graph, and iteration t amounts to counting the number of order-t paths between nodes.

However, in a corpus we can observe a large number of words, having a small number of occurrences, that are not really relevant to the topic of the documents or, to be more precise, that are not specific to any family of semantically related documents. These words act as 'noise' in the dataset. Thus, during the iterations, these noisy words allow the algorithm to create new paths between the different families of documents; of course these paths have very small similarity values, but they are numerous, and we make the assumption that they blur the similarity values between the classes of documents (and likewise for the words). Based on this observation, we therefore introduce in the χ-Sim algorithm a parameter, termed the pruning threshold and denoted by p, allowing us to set to zero the lowest p % of the similarity values in the matrices SR and SC at each iteration. In the following, we will refer to this algorithm as χ-Sim_p when using the previous normalization factor described in (6a) and (6b), and as χ-Sim^k_p when using the new pseudo-normalization factor described in (7a) and (7b).

C. A Generic χ-Sim Algorithm for χ-Sim^k_p

Equations (7a) and (7b) allow us to compute the similarities between two rows and between two columns. The extension to all pairs of rows and all pairs of columns can be expressed as a simple matrix multiplication. We need to introduce a new notation here: M^(∘k) = ((m_ij)^k)_{i,j}, which is the element-wise exponentiation of M to the power k. The algorithm follows:

1) We initialize the similarity matrices SR (documents) and SC (words) with the identity matrix I since, at the first iteration, only the similarity between a row (resp. column) and itself equals 1, and it is zero for all other rows (resp. columns). We denote these matrices as SR^(0) and SC^(0), where the superscript denotes the iteration.

2) At each iteration t, we calculate the new similarity matrix between documents SR^(t) by using the similarity matrix between words SC^(t−1):

    SR^(t) = M^(∘k) × SC^(t−1) × (M^(∘k))^T    (8a)

    and ∀i, j ∈ 1..r,  sr^(t)_ij ← \frac{ \sqrt[k]{ sr^{(t)}_{ij} } }{ \sqrt[2k]{ sr^{(t)}_{ii} \times sr^{(t)}_{jj} } }    (8b)

We do the same thing for the column similarity matrix SC^(t):

    SC^(t) = (M^(∘k))^T × SR^(t−1) × M^(∘k)    (9a)

    and ∀i, j ∈ 1..c,  sc^(t)_ij ← \frac{ \sqrt[k]{ sc^{(t)}_{ij} } }{ \sqrt[2k]{ sc^{(t)}_{ii} \times sc^{(t)}_{jj} } }    (9b)

3) We set to 0 the lowest p % of the similarity values in the similarity matrices SR and SC.

4) Steps 2 and 3 are repeated t times (typically, as we saw in section II-C, the value t = 4 is enough) to iteratively update SR^(t) and SC^(t).
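The following is a compact sketch of the procedure of section III-C (equations (8a)-(9b) plus the pruning step). The function and parameter names, and the percentile-based implementation of the pruning, are our assumptions; the authors' reference implementation is in Java and is not reproduced here.

```python
import numpy as np

def chi_sim(M, k=1.0, p=0.0, n_iter=4):
    """Iterative co-similarity (chi-Sim^k_p sketch): returns (SR, SC)."""
    Mk = np.power(M, k)                      # element-wise exponentiation M^(ok)
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)            # step 1: identity initialization

    def normalize(S):
        # steps (8b)/(9b): k-th root, then divide by sqrt[2k]{s_ii * s_jj}
        S = np.power(S, 1.0 / k)
        d = np.sqrt(np.diag(S))
        return S / np.outer(d, d)

    def prune(S, p):
        # step 3: zero out the lowest p % of the off-diagonal similarities
        if p <= 0.0:
            return S
        off_diag = S[~np.eye(len(S), dtype=bool)]
        threshold = np.percentile(off_diag, 100 * p)
        return np.where(S < threshold, 0.0, S)

    for _ in range(n_iter):                  # step 4: iterate (t = 4 is enough)
        SR_new = normalize(Mk @ SC @ Mk.T)   # (8a) + (8b)
        SC_new = normalize(Mk.T @ SR @ Mk)   # (9a) + (9b), using SR^(t-1)
        SR, SC = prune(SR_new, p), prune(SC_new, p)
    return SR, SC

# Usage on a random toy matrix (hypothetical data):
M = np.random.default_rng(1).poisson(1.0, size=(6, 10)).astype(float)
SR, SC = chi_sim(M, k=0.8, p=0.6)
```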
It is worth noting here that even though χ-Sim^k_p computes the similarity between each pair of documents using all pairs of words, the complexity of the algorithm remains comparable to that of classical similarity measures like the cosine or Euclidean distances. Given that – for a square matrix of size n × n – the complexity of matrix multiplication is in O(n^3) and the complexity of computing M^(∘k) is in O(n^2), the overall complexity of χ-Sim is in O(t·n^3).

IV. EXPERIMENTS

Here, to evaluate our system, we cluster the documents coming from the well-known 20-Newsgroup dataset (NG20) by using the document similarity matrices SR generated by χ-Sim. We chose this dataset since it has been widely used as a benchmark for document classification and co-clustering [3], [4], [20], [21], thus allowing us to compare our results with those reported in the literature.

A. Preprocessing and Methodology

We replicate the experimental procedures used by previous authors in [3], [4], [21]: 10 different samples of each of the 6 subsets described in Table I are generated, we ignored the subject lines, we removed stop words and we selected the top 2,000 words based on supervised mutual information [22]. We will discuss this last preprocessing step further in section V. With these six benchmarks, we compared our co-similarity measures based on χ-Sim with three similarity measures: Cosine, LSA [14] and SNOS [6], as well as three co-clustering methods: ITCC [3], BVD [4] and RSN [21].

Creation of the clusters. For the 'similarity based' algorithms χ-Sim, Cosine, LSA and SNOS, the clusters are generated by an Agglomerative Hierarchical Clustering (AHC) method applied to the similarity matrices with Ward's linkage. Then we cut the clustering tree at the level corresponding to the number of document clusters we are expecting (two for subset M2, five for subset M5, etc.).

Implementations. The χ-Sim algorithms, as well as SNOS and AHC, have been implemented in JAVA, and Cosine and LSA have been implemented in MatLab. For ITCC, we used the implementation provided by the authors and the parameters reported in [3]. For BVD and RSN, as we do not have a running implementation, we directly quote the best values from [4] and [21] respectively.

B. Experimental Measures

We used the classical micro-averaged precision (Pr) [3] for comparing the accuracy of the document classification; the Normalized Mutual Information (NMI) [23] is also used to compare χ-Sim with RSN. For SNOS, we perform four iterations and set the λ parameter to the value proposed by the authors [6]. For LSA, we tested the algorithm iteratively, keeping the h highest singular values for h = 10..200 by steps of 10; we use the value of h providing, on average, the highest micro-averaged precision. For ITCC, we ran the algorithm three times, using the different numbers of word clusters suggested in [3] for each dataset. For χ-Sim_p, we performed the pruning step as described in section III-C, varying the value of p from 0 to 0.9 by steps of 0.1. For each subset, we report the best micro-averaged precision obtained with p.
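Before turning to the results, here is a rough sketch of the 'similarity based' pipeline described above (AHC with Ward's linkage on the co-similarity matrix, then micro-averaged precision and NMI). Converting the similarity into a distance as 1 − SR and the purity-style precision are assumptions on our side; the metric implementations come from scipy and scikit-learn rather than from the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import normalized_mutual_info_score

def cluster_from_similarity(SR, n_clusters):
    """Agglomerative Hierarchical Clustering (Ward) on a precomputed similarity matrix."""
    D = 1.0 - SR                               # assumed similarity-to-distance conversion
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def micro_averaged_precision(y_true, y_pred):
    """Each cluster votes for its majority class; report the fraction of documents
    assigned to the right class (purity-style micro-averaged precision)."""
    y_true = np.asarray(y_true)
    correct = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]          # integer gold labels of this cluster
        correct += np.bincount(members).max()
    return correct / len(y_true)

# Hypothetical usage, given SR from chi_sim() and integer gold labels y_true:
# y_pred = cluster_from_similarity(SR, n_clusters=2)
# print(micro_averaged_precision(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred))
```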
The experimental results are summarized in Table II. In all its versions, χ-Sim performs better than all the other tested algorithms. Moreover, the new normalization schema proposed in section III clearly improves the results of our algorithm over the previous normalization based on the length of the documents. The SNOS algorithm performs poorly in spite of the fact that it is very close to χ-Sim, probably because it uses a different normalization.

It is interesting to notice that the gain obtained with the pruning when using the previous version of χ-Sim on M10 and NG3 (the two hardest problems) is reduced to almost negligible levels with the new algorithm. Finally, the impact of the parameter k is small for all the subsets but M10 and NG3. On these more complex datasets, we observe that setting k to a value lower than 1 slightly improves the clustering. This seems to show that the results provided by [19], suggesting the use of a value of k lower than 1 with the norm L_k when dealing with high-dimensional spaces, are also relevant in our framework.

V. DISCUSSION ABOUT THE PREPROCESSING

The feature selection step aims at improving the results by removing words that are not useful to separate the different clusters of documents. Moreover, this step is also clearly needed due to the time and space complexity of the algorithms in O(n^3). Nevertheless, we are performing an unsupervised learning task; thus, using a supervised feature selection method, i.e. selecting the top 2,000 words based on how much information they bring to one class of documents or another, may introduce an annoying bias, since it eases the problem by building well-separated clusters. In real applications, it is impossible to use this kind of preprocessing for unsupervised learning. Thus, to explore the potential effects of this bias, we propose to generate similar subsets of the NG20 dataset, but this time using an unsupervised feature selection method.

Table I. Subsets of the NG20 dataset used to evaluate our approach. For each subset, we provide the number of clusters (topics) it describes and the total number of documents it contains.

Name  Newsgroups included                                                           #clusters  #docs
M2    talk.politics.mideast, talk.politics.misc                                     2          500
M5    comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space,
      talk.politics.mideast                                                         5          500
M10   alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos,
      rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space,
      talk.politics.gun                                                             10         500
NG1   rec.sports.baseball, rec.sports.hockey                                        2          400
NG2   comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, sci.crypt,
      sci.space                                                                     5          1000
NG3   comp.os.ms-windows.misc, comp.windows.x, misc.forsale, rec.motorcycles,
      sci.crypt, sci.space, talk.politics.mideast, talk.religion.misc               8          1600

Table II. Micro-averaged precision (and NMI for the χ-Sim based algorithms and RSN) along with standard deviation for the various subsets of the Newsgroup dataset (NG20).
Method        Metric  M2           M5           M10          NG1          NG2          NG3
Cosine        Pr      0.60 ± 0.00  0.63 ± 0.07  0.49 ± 0.06  0.90 ± 0.11  0.60 ± 0.10  0.59 ± 0.04
LSA           Pr      0.92 ± 0.02  0.87 ± 0.06  0.59 ± 0.07  0.96 ± 0.01  0.82 ± 0.03  0.74 ± 0.03
ITCC          Pr      0.79 ± 0.06  0.49 ± 0.10  0.29 ± 0.02  0.69 ± 0.09  0.63 ± 0.06  0.59 ± 0.05
BVD           Pr      best: 0.95   best: 0.93   best: 0.67   -            -            -
RSN           NMI     -            -            -            0.64 ± 0.16  0.75 ± 0.07  0.70 ± 0.04
SNOS          Pr      0.55 ± 0.02  0.25 ± 0.02  0.24 ± 0.06  0.51 ± 0.01  0.24 ± 0.02  0.22 ± 0.05
χ-Sim         Pr      0.91 ± 0.09  0.96 ± 0.00  0.69 ± 0.05  0.96 ± 0.01  0.92 ± 0.01  0.79 ± 0.06
              NMI     -            -            -            0.76 ± 0.06  0.79 ± 0.02  0.72 ± 0.03
χ-Sim_p       Pr      0.94 ± 0.01  0.96 ± 0.00  0.73 ± 0.03  0.97 ± 0.01  0.92 ± 0.01  0.84 ± 0.05
              NMI     -            -            -            0.78 ± 0.05  0.79 ± 0.02  0.73 ± 0.02
χ-Sim^1       Pr      0.95 ± 0.00  0.96 ± 0.02  0.78 ± 0.03  0.97 ± 0.02  0.94 ± 0.01  0.86 ± 0.05
              NMI     -            -            -            0.85 ± 0.07  0.83 ± 0.03  0.79 ± 0.03
χ-Sim^1_p     Pr      0.95 ± 0.00  0.97 ± 0.01  0.78 ± 0.03  0.98 ± 0.01  0.94 ± 0.01  0.87 ± 0.05
              NMI     -            -            -            0.86 ± 0.04  0.83 ± 0.03  0.80 ± 0.02
χ-Sim^0.8     Pr      0.95 ± 0.00  0.97 ± 0.01  0.79 ± 0.02  0.98 ± 0.01  0.94 ± 0.01  0.90 ± 0.01
              NMI     -            -            -            0.87 ± 0.05  0.84 ± 0.02  0.81 ± 0.02
χ-Sim^0.8_p   Pr      0.95 ± 0.00  0.97 ± 0.01  0.80 ± 0.04  0.98 ± 0.00  0.94 ± 0.01  0.90 ± 0.02
              NMI     -            -            -            0.88 ± 0.03  0.85 ± 0.02  0.81 ± 0.03

A. Unsupervised Feature Selection

To reduce the number of words in the learning set, we used an approach consisting in selecting a representative subset (sampling) of the words with the help of the Partitioning Around Medoids (PAM) algorithm [24]. This algorithm is quite simple, more robust to noise and outliers than the k-means algorithm, and less sensitive to initial values as well. The procedure is the following: first, we remove from the corpus the words appearing in just one document, as they do not provide information to build the clusters; then, we run PAM to obtain 2,000 clusters, whose medoids provide a selection of 2,000 words. We used the implementation of PAM provided in the R project [25] with the Euclidean distance.
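The paper uses the R implementation of PAM [25]; as a rough, self-contained stand-in, the sketch below selects representative words with a simple k-medoids heuristic over Euclidean distances between word vectors (naive implementation, only suitable for small toy matrices). It merely illustrates the selection scheme of section V-A, not the R package itself.

```python
import numpy as np

def select_words_kmedoids(M, n_words=2000, n_iter=10, seed=0):
    """Pick n_words representative columns of M via a simple k-medoids heuristic."""
    # Drop words appearing in a single document only.
    keep = np.flatnonzero((M > 0).sum(axis=0) > 1)
    W = M[:, keep].T                              # one row per remaining word
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(W), size=n_words, replace=False)

    for _ in range(n_iter):
        # Assign every word to its nearest medoid (Euclidean distance, dense/naive).
        d = np.linalg.norm(W[:, None, :] - W[medoids][None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Re-pick each medoid as the member minimizing the within-cluster distance sum.
        for c in range(n_words):
            members = np.flatnonzero(assign == c)
            if len(members) == 0:
                continue
            intra = np.linalg.norm(W[members][:, None, :] - W[members][None, :, :], axis=2)
            medoids[c] = members[intra.sum(axis=1).argmin()]
    return keep[medoids]                          # column indices of the selected words

# Hypothetical usage: indices = select_words_kmedoids(M, n_words=2000); M_reduced = M[:, indices]
```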
B. Results with PAM

Here, we use the same methodology as described in section IV-A, except for the feature selection step, which is now done with PAM instead of the supervised mutual information. The new experimental results are summarized in Table III, and we can observe that they are very different. The version of χ-Sim using the previous normalization method obtains more or less the same results as LSA: a little worse without pruning and slightly better with pruning. With the new normalization the results are more contrasted and, differently from the first experiments, the impact of the pruning factor p now becomes very strong: without pruning, the new method performs poorly on several problems (M2, M10, NG2, NG3), the results being lower than those of the Cosine similarity; but when the smallest values of the similarity matrices are pruned, the situation is completely reversed and the χ-Sim_p based algorithms obtain the best results on all the datasets. As in section IV, we observe again that setting a value of k lower than 1 improves the clustering on all the datasets but one. It is now interesting to look in more detail at the impact of the different values of k and p on a given dataset.

Table III. Micro-averaged precision along with standard deviation for the various subsets of the Newsgroup dataset (NG20), preprocessed using the PAM feature selection.

Method        M2           M5           M10          NG1          NG2          NG3
Cosine        0.61 ± 0.04  0.54 ± 0.08  0.39 ± 0.03  0.52 ± 0.01  0.60 ± 0.05  0.49 ± 0.02
LSA           0.79 ± 0.09  0.66 ± 0.05  0.44 ± 0.04  0.56 ± 0.05  0.61 ± 0.06  0.52 ± 0.03
ITCC          0.70 ± 0.05  0.54 ± 0.05  0.29 ± 0.05  0.61 ± 0.06  0.44 ± 0.08  0.49 ± 0.07
SNOS          0.51 ± 0.01  0.26 ± 0.04  0.20 ± 0.02  0.51 ± 0.00  0.24 ± 0.01  0.22 ± 0.02
χ-Sim         0.58 ± 0.07  0.62 ± 0.12  0.43 ± 0.04  0.54 ± 0.03  0.60 ± 0.12  0.47 ± 0.05
χ-Sim_p       0.65 ± 0.09  0.68 ± 0.06  0.47 ± 0.04  0.62 ± 0.12  0.63 ± 0.14  0.57 ± 0.04
χ-Sim^1       0.54 ± 0.06  0.62 ± 0.13  0.36 ± 0.04  0.53 ± 0.02  0.35 ± 0.09  0.30 ± 0.05
χ-Sim^1_p     0.80 ± 0.13  0.77 ± 0.08  0.53 ± 0.05  0.75 ± 0.07  0.73 ± 0.06  0.61 ± 0.03
χ-Sim^0.8     0.54 ± 0.05  0.66 ± 0.07  0.37 ± 0.06  0.52 ± 0.02  0.38 ± 0.08  0.36 ± 0.04
χ-Sim^0.8_p   0.81 ± 0.10  0.79 ± 0.05  0.55 ± 0.04  0.81 ± 0.02  0.72 ± 0.02  0.64 ± 0.04

Figure 2 shows the evolution of the accuracy on the NG1 subset according to the value of p. When the words are selected by supervised mutual information, the curve is quite flat, but when the words are selected with PAM we see a different behavior: the accuracy first increases with the pruning level, the best pruning level being about 60 % (it is worth noticing that this value is very stable across the datasets). This reinforces our assumption that pruning the similarity matrices can be a good way of dealing with 'noise'. Indeed, when the features are selected with mutual information, the classes are relatively well separated; thus, the similarity propagation resulting from higher-order co-occurrences between documents (or words) of different categories has little influence. However, with the unsupervised feature selection there is more 'noise' in the data, and the pruning process helps significantly to alleviate this problem.

Figure 2. Evolution of the precision for NG1 (using χ-Sim^0.6_p) against p, along with error bars representing the standard deviation over the 10 folds. The dotted line represents the supervised feature selection data, and the plain one the unsupervised feature selection data.

Figure 3 shows the evolution of the accuracy, again on the NG1 subset, according to the value of k. As we can see, on this dataset, where the document and word vectors tend to be highly sparse, the best values for this parameter seem to lie between 0.5 and 1, as in the case of the norm L_k [19]; we chose the value 0.8 in the results tables. However, this effect can only be seen when the pruning parameter is activated (plain line on the figure). Finally, it is worth noting that in these experiments with PAM, the difference between LSA and Cosine strongly decreases. All these results demonstrate that preprocessing the data with a supervised feature selection approach (unsurprisingly) totally changes the behavior of the clustering methods by oversimplifying the problem.

Figure 3. Evolution of the precision for NG1 (using χ-Sim^k_0 for the dotted line, and χ-Sim^k_{0.6} for the plain one) against k, along with error bars representing the standard deviation over the 10 folds.

VI. CONCLUSION

In this paper, we proposed two empirical improvements of the χ-Sim co-similarity measure. The new normalization we presented for this measure is more consistent with the framework of the algorithm and also (partially) satisfies the reflexivity property. Furthermore, we showed that the χ-Sim similarity measure is susceptible to noise, and we proposed a way to alleviate this susceptibility and to improve the precision.
On the experimental side, our co-similarity based approach performs significantly better than the other co-clustering algorithms we tested for the task of document clustering. In contrast to [3], [4], our algorithm does not need to cluster the words (columns) in order to cluster the documents (rows), thus avoiding the need to know the number of word clusters, and the learning parameters p and k introduced here seem relatively easy to tune. However, we will investigate how to automatically find the best values for these parameters, using similarity matrix analysis as in [19]. It is also worth noting that our co-similarity measure performs better than LSA. Unfortunately, as we saw in section III-A, the current method is not well-defined from the theoretical point of view, and we need to analyze its behavior in order to understand the role of the pseudo-normalization and to see whether it is possible to turn it into a real normalization.

ACKNOWLEDGMENT

This work is part of a PhD thesis funded by the Higher Education Commission, Government of Pakistan. This work is partially supported by the French ANR project FRAGRANCES under grant 2008-CORD 00801.

REFERENCES

[1] G. Bisson and F. Hussain, "Chi-Sim: A new similarity measure for the co-clustering task," in Proceedings of the Seventh ICMLA. IEEE Computer Society, Dec. 2008, pp. 211–217.
[2] G. Salton, The SMART Retrieval System—Experiments in Automatic Document Processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1971.
[3] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD, 2003, pp. 89–98.
[4] B. Long, Z. M. Zhang, and P. S. Yu, "Co-clustering by block value decomposition," in Proceedings of the Eleventh ACM SIGKDD. New York, NY, USA: ACM, 2005, pp. 635–640.
[5] M. Rege, M. Dong, and F. Fotouhi, "Bipartite isoperimetric graph partitioning for data co-clustering," Data Min. Knowl. Discov., vol. 16, no. 3, pp. 276–312, 2008.
[6] N. Liu, B. Zhang, J. Yan, Q. Yang, S. Yan, Z. Chen, F. Bai, and W.-Y. Ma, "Learning similarity measures in non-orthogonal space," in Proceedings of the 13th ACM CIKM. ACM Press, 2004, pp. 334–341.
[7] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," 2004.
[8] N. Speer, C. Spieth, and A. Zell, "A memetic clustering algorithm for the functional partition of genes based on the gene ontology," 2004.
[9] Y. Cheng and G. M. Church, "Biclustering of expression data," in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, Boston, 2000, pp. 93–103.
[10] N. Slonim and N. Tishby, "The power of word clusters for text classification," in 23rd European Colloquium on Information Retrieval Research, 2001.
[11] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is "nearest neighbor" meaningful?" in Int. Conf. on Database Theory, 1999, pp. 217–235.
[12] K. Livezay and C. Burgess, "Mediated priming in high-dimensional meaning space: What is "mediated" in mediated priming?" in Proceedings of the Cognitive Science Society, 1998.
[13] B. Lemaire and G. Denhière, "Effects of high-order co-occurrences on word semantic similarities," Current Psychology Letters - Behaviour, Brain and Cognition, vol. 18(1), 2008.
[14] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391–407, 1990.
[15] X.-J. Wang, W.-Y. Ma, G.-R. Xue, and X. Li, "Multi-model similarity propagation and its application for web image retrieval," in Proceedings of the 12th Annual ACM MULTIMEDIA. New York, NY, USA: ACM, 2004, pp. 944–951.
[16] A. M. Qamar and E. Gaussier, "Online and batch learning of generalized cosine similarities," in Proceedings of the Ninth IEEE ICDM. Washington, DC, USA: IEEE Computer Society, 2009, pp. 926–931.
[17] E. Seneta, Non-Negative Matrices and Markov Chains. Springer, 2006.
[18] S. Zelikovitz and H. Hirsh, "Using LSI for text classification in the presence of background text," in Proceedings of the 10th ACM CIKM. ACM Press, 2001, pp. 113–118.
[19] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the surprising behavior of distance metrics in high dimensional space," in Lecture Notes in Computer Science. Springer, 2001, pp. 420–434.
[20] D. Zhang, Z.-H. Zhou, and S. Chen, "Semi-supervised dimensionality reduction," in Proceedings of the SIAM ICDM, 2007.
[21] B. Long, X. Wu, Z. M. Zhang, and P. S. Yu, "Unsupervised learning on k-partite graphs," in Proceedings of the 12th ACM SIGKDD. New York, NY, USA: ACM, 2006, pp. 317–326.
[22] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in ICML, 1997, pp. 412–420.
[23] A. Banerjee and J. Ghosh, "Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres," 2002.
[24] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[25] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org