Efficient Data Clustering with Link Approach
1 Y. Sireesha, 2 CH. Srinivas, 3 K.C. Ravi Kumar
1 PG Scholar, Department of Computer Science and Engineering, Sridevi Womens Engineering College Hyderabad, A.P, India
2 Associate Professor, Department of Computer Science and Engineering, Sridevi Womens Engineering College Hyderabad, A.P, India
3 Head of the Department, Department of Computer Science and Engineering, Sridevi Womens Engineering College Hyderabad, A.P, India
Abstract: Data clustering has been the subject of extensive research, and the resulting ensemble techniques are competitive with conventional algorithms; nevertheless, these techniques ultimately operate on incomplete information. The existing partition-information matrix records only particular cluster-data point relations, leaving many entries unknown. This paper examines how this problem degrades the quality of the clustering result and presents a new link-based approach that improves the conventional matrix by revealing the unknown entries from the similarity shared by clusters in an ensemble. In particular, an efficient link-based algorithm is proposed for the underlying similarity assessment. Afterward, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph formulated from the refined matrix. Results on various real data sets suggest that the proposed link-based method usually outperforms both conventional clustering algorithms for categorical data and the most common cluster ensemble techniques.

I. INTRODUCTION

Among the approaches available for examining a data set and getting down to its structure, data clustering is one of the most efficient, which is why it is widely used in data mining, machine learning, and pattern recognition. Clustering groups similar data objects together: the similar objects fall into a group, or cluster. There are several algorithms, such as k-means and PAM, for clustering numerical data; these compute the distance between feature vectors.
The drawback is that these distance measures cannot be inherited directly for clustering categorical data, where domain values are discrete and have no ordering defined. As a result, many categorical data clustering algorithms have been introduced in recent years, with applications to interesting domains such as protein interaction data. The initial method was developed by making use of Gower's similarity coefficient. Following that, the k-modes algorithm extended the conventional k-means with a simple matching dissimilarity measure and a frequency-based method to update centroids. A brief sketch of these two ingredients follows.
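Both ingredients of the k-modes algorithm are easy to state in code. The following minimal Python sketch (illustrative only, not taken from the paper) shows the simple matching dissimilarity, which counts mismatched attributes, and the frequency-based centroid update, which takes the per-attribute mode:

    from collections import Counter

    def matching_dissimilarity(a, b):
        # Simple matching: the number of attributes on which two
        # categorical records disagree.
        return sum(ai != bi for ai, bi in zip(a, b))

    def update_mode(records):
        # Frequency-based centroid update: the most frequent value
        # (mode) of each attribute across the cluster's records.
        return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

    x = ("red", "small", "round")
    y = ("red", "large", "square")
    print(matching_dissimilarity(x, y))                        # -> 2
    print(update_mode([x, y, ("red", "small", "square")]))     # -> ['red', 'small', 'square']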
Fig 1. The basic process of cluster ensembles.

As a single-pass algorithm, Squeezer uses a prespecified similarity threshold to determine to which of the existing clusters a data point under examination is assigned. LIMBO is a hierarchical clustering algorithm that uses the Information Bottleneck (IB) framework to define a distance measure for categorical tuples. The concepts of evolutionary computing and genetic algorithms have also been adopted by a partitioning method for categorical data, GAClust. Cobweb is a model-based method primarily exploited for categorical data sets. Different graph models have also been investigated by the STIRR, ROCK, and CLICK techniques. In addition, several density-based algorithms have been devised for this purpose, for instance CACTUS, COOLCAT, and CLOPE. Although a large number of algorithms have been introduced for clustering categorical data, the No Free Lunch theorem suggests that there is no single clustering algorithm that performs best for all data sets and can discover all types of cluster shapes and structures present in data. Each algorithm has its own strengths and weaknesses. For a particular data set, different algorithms, or even the same algorithm with different parameters, usually provide distinct solutions. Therefore, it is difficult for users to decide which algorithm is the proper alternative for a given set of data. Recently, cluster ensembles have emerged as an effective solution that is able to overcome these limitations and improve the robustness as well as the quality of clustering results. The main objective of cluster ensembles is to combine different clustering decisions in such a way as to achieve accuracy superior to that of any individual clustering. Examples of well-known ensemble methods are:

1. The feature-based approach, which transforms the problem of cluster ensembles into clustering categorical data.
2. The direct approach, which finds the final partition by relabeling the base clustering results.

3. Graph-based algorithms, which employ a graph partitioning methodology.

4. The pairwise-similarity approach, which makes use of co-occurrence relations between data points.

Despite notable success, these methods generate the final data partition based on incomplete information of a cluster ensemble. The underlying ensemble-information matrix presents only cluster-data point relationships while completely ignoring those among clusters. As a result, the performance of existing cluster ensemble techniques may be degraded, as many matrix entries are left unknown. This paper introduces a link-based approach to refining the aforementioned matrix, leaving substantially fewer unknown entries. A link-based similarity measure is exploited to estimate unknown values from a link network of clusters. This research uniquely bridges the gap between the task of data clustering and that of link analysis. It also enhances the capability of ensemble methodology for categorical data, which has not received much attention in the literature. In addition to the problem of clustering categorical data that is investigated herein, the proposed framework is generic, so it can also be applied effectively to other data types.

II. DISCUSSION

The difficulty of categorical data analysis is characterized by the fact that there is no inherent distance (or similarity) between attribute values. The RM matrix that is generated within the LCE approach allows such a measure between values of the same attribute to be systematically quantified. The concept of link analysis [34], [35], [36] is uniquely applied to discover the similarity among attribute values, which are modeled as vertices in an undirected graph. In particular, two vertices are similar if the neighboring contexts in which they appear are similar. In other words, their similarity is justified by the values of other attributes with which they co-occur. While the LCE methodology is novel for the problem of cluster ensembles, the concept of defining similarity among attribute values (especially in the case of the direct ensemble, Type-I) has been adopted analogously by several categorical data clustering algorithms. Initially, the problem of defining a context-based similarity measure was investigated in [61] and [62]. In particular, an iterative algorithm, called Iterated Contextual Distances (ICD), was introduced to compute the proximity between two values. Similar to LCE, the underlying distance metric is based on the occurrence statistics of attribute values. The WTQ algorithm is summarized below.
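Since the original summary figure is not reproduced here, the following minimal Python sketch (illustrative, not the authors' code) captures the link-based weight that Section III formulates: the weight between two clusters is the Jaccard overlap of their link sets, i.e., of the sets of clusters each is connected to in the cluster network.

    def wtq_weight(links_x, links_y):
        # W_xy = |L_x intersect L_y| / |L_x union L_y|, where L_x and
        # L_y are the sets of clusters linked to x and y, respectively.
        union = links_x | links_y
        return len(links_x & links_y) / len(union) if union else 0.0

    # Clusters x and y share two of their three distinct neighbors.
    print(wtq_weight({"C3", "C4"}, {"C3", "C4", "C5"}))  # -> 0.666...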
However, the fundamental information models that ICD and LCE use to capture the associations between data points and attribute values are notably different: a sequential probabilistic chain for ICD and a link network for LCE. Note that LCE makes use of WTQ, a single-pass similarity algorithm, while ICD requires the chain model to be randomly initialized and iteratively updated to a fixed point.

III. SYSTEM DEVELOPMENT

The link-based approach proceeds in the following steps (a sketch of the pipeline appears after the list):

1. K-means with Euclidean distance. Transform the categorical (text) values into numerical values, then apply k-means with the Euclidean distance measure to the numerical data. The outcome of step 1 is {C11, C12, C13, ..., C1n}, where C1j is cluster j produced by the first clustering algorithm.

2. K-means with cosine similarity. Treat the numerical values as categorical data and apply clustering with the cosine similarity measure. The outcome of step 2 is {C21, C22, C23, ..., C2n}.

3. K-means with Jaccard's coefficient. Treat the numerical values as categorical data and apply clustering with Jaccard's coefficient. The outcome of step 3 is {C31, C32, C33, ..., C3n}.

4. Cluster ensemble. Combine the base clusterings as a direct ensemble, a full-space ensemble, or a subspace ensemble.

5. Generating the refined matrix (RM). Prepare the RM as a matrix in which each record of the original data set D appears as {x1, x2, x3, ..., xm}.

6. Weighted triple quality (WTQ). Compute the link weight between clusters x and y as

   Wxy = |Lx ∩ Ly| / |Lx ∪ Ly|,

   where Lx and Ly denote the link sets of clusters x and y in the cluster network. Apply WTQ to the RM; the output of this module is the set of refined clusters.
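As an illustration of steps 1-5, the following self-contained Python sketch (all names and the toy data are assumptions, not the paper's implementation) generates three base clusterings with a pluggable distance function and assembles the binary ensemble-information matrix from them:

    import numpy as np

    def kmeans(X, k, distance, n_iter=20, seed=0):
        # Generic k-means-style loop with a pluggable distance function.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to its nearest center under `distance`.
            d = np.array([[distance(x, c) for c in centers] for x in X])
            labels = d.argmin(axis=1)
            # Recompute each non-empty cluster's center as its mean.
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    def euclidean(a, b):
        return np.linalg.norm(a - b)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b) or 1.0
        return 1.0 - a.dot(b) / denom

    def jaccard(a, b):
        # Jaccard distance over binarized (categorical-coded) vectors.
        a1, b1 = a > 0, b > 0
        union = np.logical_or(a1, b1).sum()
        return 1.0 - (np.logical_and(a1, b1).sum() / union if union else 1.0)

    # Toy numerical data standing in for the encoded categorical records.
    X = np.array([[1., 0., 1.], [1., 0., 0.], [0., 1., 1.], [0., 1., 0.]])
    base = [kmeans(X, 2, d, seed=s) for s, d in enumerate((euclidean, cosine, jaccard))]

    # Binary ensemble-information matrix BM: one column per cluster of
    # each base clustering, 1 where the record belongs to that cluster.
    BM = np.hstack([(labels[:, None] == np.arange(2)[None, :]).astype(int)
                    for labels in base])
    print(BM)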
IV. RELATED WORK

Despite pursuing an objective analogous to that of the LCE approach, several categorical data clustering methods have been developed using different mechanisms to specify a distance between attribute values: STIRR, ROCK, and CACTUS, for instance. STIRR is an iterative algorithm based on nonlinear dynamical systems. A database is encoded into a graph structure, where each weighted node stands for a specific attribute value. STIRR iteratively updates the weight configuration until a stable point (called a basin) is reached. This is achieved using a user-defined combiner function to estimate a node's weight from the weights of the other nodes associated with the same data records. Unlike in LCE, the similarity between any node pair cannot be explicitly measured here. In fact, STIRR only divides the nodes of each attribute into two groups (one with large positive weights and the other with small negative weights) that correspond to projections of clusters on the attribute. Yet, the postprocessing required to generate the actual clusters is nontrivial and not addressed in the original work. While LCE is generally robust to parameter settings, it is hard to analyze the stability of the STIRR system for any useful combiner function [63]. Rigorous experimentation and fine tuning of parameters are needed to generate a meaningful clustering [64]. ROCK [14] makes use of a link graph, in which nodes and links represent data points (or tuples) and their similarity, respectively. Two tuples are similar if they share a large number of attribute values. Note that the link connecting two nodes is included only when the corresponding similarity exceeds a user-defined threshold. With tuples initially regarded as singleton clusters, ROCK merges clusters in an agglomerative hierarchical fashion, while optimizing a cluster quality that is defined in terms of the number of links across clusters. Note that the graph models
used by ROCK and LCE are dissimilar: the graph of data points and that of attribute values (or clusters), respectively. Since the number of data points is normally much greater than the number of attribute values, ROCK is less efficient than LCE. As a result, it is unsuitable for large data sets. Also, the selection of the smooth function used to estimate cluster quality is a delicate and difficult task for average users. CACTUS also relies on the co-occurrence among attribute values. In essence, two attribute values are strongly connected if their support (i.e., the proportion of tuples in which the values co-occur) exceeds a prespecified value. By extending this concept to all attributes, CACTUS searches for distinguishing sets, which are attribute-value sets that occur uniquely within only one cluster. These sets correspond to cluster projections that can be combined to formulate the final clusters. Unlike LCE, the underlying problem is not designed using a graph-based concept. It is also noteworthy that CACTUS and its recent extension assume each cluster to be identified by a set of attribute values that occur in no other cluster. While such a conjecture may hold true for some data sets, it is unnatural and unnecessary for the clustering process. This rigid constraint is not imposed by the LCE method. Besides these approaches, traditional categorical data analysis also utilizes the market-basket numerical representation of the nominal data matrix. This transformed matrix is similar to the BM, which LCE refines into its RM counterpart. A similar attempt identifies the connection between the category utility of conceptual clustering (Cobweb) and the classical objective function of k-means.
Fig 2. An example of a cluster network, where each edge is marked with its weight.
As a result, the so-called market-basket matrix used by the former is transformed into a variation that can be utilized efficiently by the latter. The intuitions behind creating this rescaled matrix and the RM are fairly similar. However, the methods used to generate them are totally different. LCE discovers the unknown entries (i.e., the 0s) in the original BM from the known entries (the 1s), which are preserved and left unchanged; a sketch of this refinement idea follows. The alternative method, in contrast, maps the attribute-value-specific 1 and 0 entries to unique standardized values. Unlike the RM, the resulting matrix does not conserve the known facts (the 1 entries), whose values now differ from one attribute to another.
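To make the distinction concrete, here is a minimal Python sketch (an illustration under stated assumptions, not the authors' implementation) of the refinement idea: the known 1 entries of BM are preserved, while each unknown 0 entry is filled with the strongest link-based similarity between that entry's cluster and any cluster the record actually belongs to. How cluster_sim is built (e.g., from WTQ) is assumed, not prescribed.

    import numpy as np

    def refine_matrix(BM, cluster_sim):
        # BM[i, j] = 1 if record i was assigned to cluster j, else 0.
        # cluster_sim[j, k] is a link-based similarity in [0, 1] between
        # clusters j and k of the ensemble.
        RM = BM.astype(float).copy()
        n, m = BM.shape
        for i in range(n):
            assigned = np.flatnonzero(BM[i])   # clusters containing record i
            for j in range(m):
                if BM[i, j] == 0 and len(assigned):
                    # Fill the unknown entry; known 1 entries are untouched.
                    RM[i, j] = cluster_sim[j, assigned].max()
        return RM

    # Toy example: 3 records, 4 clusters drawn from two base clusterings.
    BM = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])
    sim = np.array([[1.0, 0.2, 0.7, 0.1],
                    [0.2, 1.0, 0.3, 0.8],
                    [0.7, 0.3, 1.0, 0.2],
                    [0.1, 0.8, 0.2, 1.0]])
    print(refine_matrix(BM, sim))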
Despite the fact that many clustering algorithms, LCE included, are developed with the capability of comparing attribute values in mind, they achieve the desired metric differently, using specific information models. LCE uniquely and explicitly models the underlying problem as the evaluation of link-based similarity among graph vertices, which stand for specific attribute values (for a Type-I ensemble) or generated clusters (for Type-II and Type-III). The resulting system is more efficient and robust than the other clustering techniques emphasized thus far. In addition to SPEC, many other classical clustering techniques, k-means and PAM among others, can be used directly to generate the final data partition from the proposed RM. The LCE framework is generic, so it can be adopted for analyzing other types of data.

V. CONCLUSION

This paper presents a novel, highly effective link-based cluster ensemble approach to categorical data clustering. It transforms the original categorical data matrix into an information-preserving numerical variation (the RM), to which an effective graph partitioning technique can be directly applied. The problem of constructing the RM is efficiently resolved through the similarity among categorical labels (or clusters), using the Weighted Triple-Quality similarity algorithm. The empirical study, with different ensemble types, validity measures, and data sets, suggests that the proposed link-based method usually achieves clustering results superior to those of traditional categorical data algorithms and benchmark cluster ensemble techniques. Prominent future work includes an extensive study of the behavior of other link-based similarity measures within this problem context. The new method will also be applied to specific domains, including tourism and medical data sets.

REFERENCES

[1] D.S. Hochbaum and D.B. Shmoys, "A Best Possible Heuristic for the K-Center Problem," Mathematics of Operations Research, vol. 10, no. 2, pp. 180-184, 1985.
[2] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[3] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, 1988.
[4] P. Zhang, X. Wang, and P.X. Song, "Clustering Categorical Data Based on Distance Vectors," J. Am. Statistical Assoc., vol. 101, no. 473, pp. 355-367, 2006.
[5] J. Grabmeier and A. Rudolph, "Techniques of Cluster Algorithms in Data Mining," Data Mining and Knowledge Discovery, vol. 6, pp. 303-360, 2002.
[6] K.C. Gowda and E. Diday, "Symbolic Clustering Using a New Dissimilarity Measure," Pattern Recognition, vol. 24, no. 6, pp. 567-578, 1991.
[7] J.C. Gower, "A General Coefficient of Similarity and Some of Its Properties," Biometrics, vol. 27, pp. 857-871, 1971.
[8] Z. Huang, "Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values," Data Mining and Knowledge Discovery, vol. 2, pp. 283-304, 1998.
[9] Z. He, X. Xu, and S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data," J. Computer Science and Technology, vol. 17, no. 5, pp. 611-624, 2002.
[10] P. Andritsos and V. Tzerpos, "Information-Theoretic Software Clustering," IEEE Trans. Software Eng., vol. 31, no. 2, pp. 150-165, Feb. 2005.
[11] D. Cristofor and D. Simovici, "Finding Median Partitions Using Information-Theoretical-Based Genetic Algorithms," J. Universal Computer Science, vol. 8, no. 2, pp. 153-172, 2002.
[12] D.H. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering," Machine Learning, vol. 2, pp. 139-172, 1987.