In this paper, we present a combiner for multiple data clusterings, based on a proposed Weighted Shared nearest neighbors Graph (WSnnG). While combining multiple classifiers (supervised learners) is now an active and mature area, only a limited body of contemporary research on combining multiple data clusterings (unsupervised learners) appears in the literature. The problem addressed in this paper is that of generating a reliable clustering that represents the natural cluster structure of a set of patterns, when a number of different clusterings of the data are available or can be generated. The underlying model of the proposed shared-nearest-neighbors-based combiner is a weighted graph whose vertices correspond to the set of patterns and are assigned relative weights equal to the ratio of a balancing factor to the size of their shared nearest neighbors population. Edges exist only between patterns that share a pre-specified portion of their nearest neighborhood. The graph can then be partitioned into a desired number of clusters. Preliminary experiments show promising results, and a comparison with a recent study supports the combiner's suitability to the defined problem domain.
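A minimal sketch (in Python, using numpy, networkx and scikit-learn) of how such a weighted shared nearest neighbors graph could be assembled from an ensemble of clusterings. The function name build_wsnng, the co-clustering-based notion of neighborhood, and the threshold and balance parameters are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

def build_wsnng(labelings, threshold=0.5, balance=1.0):
    """labelings: list of label arrays (one per base clustering), each of shape (n,)."""
    n = len(labelings[0])
    # Neighborhood of a pattern: all patterns co-clustered with it in any base clustering.
    neighbors = [set() for _ in range(n)]
    for labels in labelings:
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            for i in members:
                neighbors[i].update(members)

    g = nx.Graph()
    for i in range(n):
        # Vertex weight: balancing factor over the size of the shared-neighbor population.
        g.add_node(i, weight=balance / max(len(neighbors[i]), 1))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])
            smaller = min(len(neighbors[i]), len(neighbors[j]))
            # Keep an edge only if the shared portion of the neighborhood is large enough.
            if smaller and shared / smaller >= threshold:
                g.add_edge(i, j, weight=shared)
    return g

# Example: three k-means runs on random data form the ensemble.
X = np.random.rand(200, 2)
ensemble = [KMeans(n_clusters=k, n_init=5).fit_predict(X) for k in (2, 3, 4)]
graph = build_wsnng(ensemble)
# The graph can then be partitioned into the desired number of clusters,
# e.g. with METIS or a spectral partitioner.
```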
Cluster analysis is an unsupervised learning technique that is widely used in the process of topic discovery from text. The research presented here proposes a novel unsupervised learning approach based on the aggregation of clusterings produced by different clustering techniques. By examining and combining two different clusterings of a document collection, the aggregation aims at revealing a better structure of the data rather than one imposed or constrained by the clustering method itself. When clusters of documents are formed, a process called topic extraction picks terms from the feature space (i.e., the vocabulary of the whole collection) to describe the topic of each cluster. It is proposed at this stage to re-compute term weights according to the revealed cluster structure. The work further investigates the adaptive setup of the parameters required for the clustering and aggregation techniques. Finally, a topic accuracy measure is developed and used, along with the F-measure, to evaluate and compare the extracted topics and the clustering quality, respectively, before and after the aggregation. Experimental evaluation shows that the aggregation can successfully improve clustering quality and topic accuracy over individual clustering techniques.
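A minimal sketch of the re-weighting step described above: once the aggregated clusters are formed, term weights are recomputed per cluster and the top-weighted terms are reported as each cluster's topic. The specific weighting used here (cluster-level term frequency scaled by an inverse cluster frequency) and the helper extract_topics are illustrative assumptions, not the paper's formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def extract_topics(docs, labels, top_k=5):
    vec = CountVectorizer(stop_words="english")
    tf = vec.fit_transform(docs).toarray()          # (n_docs, n_terms)
    vocab = np.array(vec.get_feature_names_out())
    clusters = np.unique(labels)

    # Cluster-level term frequencies.
    ctf = np.vstack([tf[labels == c].sum(axis=0) for c in clusters])
    # Penalize terms that appear across many clusters.
    icf = np.log((len(clusters) + 1) / (1 + (ctf > 0).sum(axis=0)))
    weights = ctf * icf

    return {c: list(vocab[np.argsort(weights[i])[::-1][:top_k]])
            for i, c in enumerate(clusters)}

docs = ["stocks rallied on earnings", "the team won the final match",
        "central bank raises rates", "player scores winning goal"]
labels = np.array([0, 1, 0, 1])                     # aggregated cluster labels
print(extract_topics(docs, labels, top_k=3))
```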
We recently introduced the idea of solving cluster ensembles using a Weighted Shared nearest neighbors Graph (WSnnG). Preliminary experiments have shown promising results in terms of integrating different clusterings into a combined one, such that the natural cluster structure of the data can be revealed. In this paper, we further study and extend the basic WSnnG. First, we introduce the use of a fixed number of nearest neighbors in order to reduce the size of the graph. Second, we use refined weights on the edges and vertices of the graph. Experiments show that it is possible to capture the similarity relationships between the data patterns on a compact, refined graph. Furthermore, the quality of the combined clustering based on the proposed WSnnG surpasses both the average quality of the ensemble and that of an alternative combining method based on partitioning the patterns' co-association matrix.
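For reference, a minimal sketch of the co-association baseline mentioned above: each pair of patterns is scored by the fraction of base clusterings that place them in the same cluster, and the resulting matrix is partitioned into the final clusters. Average-linkage agglomerative clustering on 1 − co-association is an illustrative choice of partitioning step, not necessarily the method used in the cited comparison.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def coassociation_consensus(labelings, n_clusters):
    labelings = np.asarray(labelings)               # (n_runs, n_patterns)
    n = labelings.shape[1]
    coassoc = np.zeros((n, n))
    for labels in labelings:
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(labelings)

    # 'metric' is named 'affinity' in scikit-learn < 1.2.
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed",
                                    linkage="average")
    return model.fit_predict(1.0 - coassoc)

ensemble = [[0, 0, 1, 1, 2, 2],
            [1, 1, 0, 0, 0, 2],
            [0, 0, 0, 1, 1, 1]]
print(coassociation_consensus(ensemble, n_clusters=3))
```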
Over the past few years, there has been a renewed interest in the consensus clustering problem. Several new methods have been proposed for finding a consensus partition for a set of n data objects that optimally summarizes an ensemble. In this paper, we propose new consensus clustering algorithms with linear computational complexity in n. We consider clusterings generated with a random number of clusters, which we describe by categorical random variables. We introduce the idea of cumulative voting as a solution to the problem of cluster label alignment, where, unlike the common one-to-one voting scheme, a probabilistic mapping is computed. We seek a first summary of the ensemble that minimizes the average squared distance between the mapped partitions and the optimal representation of the ensemble, where the reference clustering is selected by maximizing the information content as measured by the entropy. We describe cumulative vote weighting schemes and corresponding algorithms to compute an empirical probability distribution summarizing the ensemble. Given the arbitrary number of clusters of the input partitions, we formulate the problem of extracting the optimal consensus as that of finding a compressed summary of the estimated distribution that preserves the maximum relevant information. An efficient solution is obtained using an agglomerative algorithm that minimizes the average generalized Jensen-Shannon divergence within clusters. The empirical study demonstrates significant gains in accuracy and superior performance compared to several recent consensus clustering algorithms.
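A minimal sketch of the cumulative-voting idea: each input partition's labels are mapped probabilistically onto a reference partition (chosen here as the one with maximum label entropy), and the mapped soft labels are accumulated into an empirical distribution per object. The plain unweighted accumulation below is an assumption; the paper's vote weighting schemes and the final agglomerative compression that minimizes the generalized Jensen-Shannon divergence are not reproduced here.

```python
import numpy as np

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def cumulative_voting(partitions):
    partitions = [np.asarray(p) for p in partitions]
    ref = max(partitions, key=entropy)               # reference clustering
    k_ref, n = ref.max() + 1, len(ref)
    votes = np.zeros((n, k_ref))
    for labels in partitions:
        k = labels.max() + 1
        # Contingency table between this partition and the reference.
        cont = np.zeros((k, k_ref))
        np.add.at(cont, (labels, ref), 1.0)
        mapping = cont / cont.sum(axis=1, keepdims=True)  # P(ref cluster | cluster)
        votes += mapping[labels]                     # soft vote for every object
    votes /= len(partitions)
    return votes                                     # empirical distribution per object

partitions = [[0, 0, 1, 1, 2, 2],
              [1, 1, 1, 0, 0, 0],
              [2, 2, 0, 0, 1, 1]]
print(cumulative_voting(partitions).round(2))
```

The per-object distributions returned here would then be compressed into the final consensus partition, for example by agglomeratively merging clusters of similar distributions.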