International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.2, March 2013
ABSTRACT
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high dimensional data. Many significant subspace clustering algorithms exist, each with different characteristics arising from the techniques, assumptions and heuristics it uses. A comprehensive classification scheme is essential that considers all such characteristics to divide subspace clustering approaches into various families. The algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers to better understand the quality criteria to be used, and the similar algorithms against which the results of their proposed clustering algorithms should be compared. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms Family). The characteristics of a SCAF are based on classes such as cluster orientation, overlap of dimensions, etc. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the axis parallel, overlapping, density based SCAF.
KEY WORDS
Axis parallel clusters, Density based clustering, High dimensional data, Subspace clustering
1. INTRODUCTION
Clustering is the most common data mining task, aiming at dividing datasets into subsets or clusters in such a way that the objects in one subset are similar to each other with respect to a given similarity measure, while objects in different subsets are dissimilar [1]. Clustering is commonly and heavily used in a variety of application areas such as medical science, environmental science, astronomy, geology, business intelligence and so on [2]. It helps users understand the natural grouping or structure of a dataset, and can also be treated as a form of data compression. Thus, in general, clustering can be treated as a first step in various data processing methods such as classification, indexing, data compression, etc. While a lot of work has already been done in the area of clustering [1, 3, 4], new approaches need to be proposed to cope with the modern capabilities of huge data generation. Clustering real world datasets requires dealing with objects modeled by high dimensional data, where each object is described by hundreds or thousands of attributes. For instance, there are many computer vision applications, such as motion segmentation, face clustering with varying illumination, pattern classification and temporal video segmentation, in which image data is very high dimensional. Other examples of high dimensional data can be found in the area of molecular biology [5] and CAD (Computer Aided Design) databases. However, such high dimensional data poses different challenges for conventional clustering approaches [6]. In particular, the
traditional clustering algorithms [7, 8] fail in such cases due to the inherent sparsity of the data objects, and do not produce meaningful clusters. In high dimensional data, clusters are embedded in various subsets of the entire dimension space [9]. The new research area of high dimensional data clustering detects such clusters embedded in different, variable length subspaces. There exist many approaches for subspace clustering, and new algorithms are being proposed nearly every day. However, as this field is still emerging, there is no common ground on the basis of which all the algorithms can be brought onto a universal platform so as to compare their results. Hence, we need to classify all these approaches effectively so that their various characteristics and features can be compared.

A few surveys of high dimensional data clustering approaches are available in the literature [5, 10, 11, 12, 13]. An extremely comprehensive survey [10] illustrates the different terminologies used and discusses the various assumptions, heuristics and intuitions forming the basis of different high dimensional data clustering approaches. However, there is a need to classify all subspace clustering approaches using multiple parameters, grouping algorithms with similar characteristics into one family. Thus, the purpose of this paper is to provide the concept of the clustering family SCAF (Subspace Clustering Algorithms Family). For example, axis parallel, overlapping, bottom up, density based subspace clustering algorithms will form one family. A researcher who uses the same techniques to develop his or her clustering algorithm will compare it only with the algorithms belonging to this family.

This paper is structured as follows. The remaining part of this section provides a short lead up to the challenges involved in high dimensional data clustering and the traditional methods of dimensionality reduction. Section 2 presents a detailed survey of various existing classification schemes for subspace clustering approaches, followed by an introduction to the notion of SCAF. For ready reference, Section 3 presents a comparative study of a few significant algorithms belonging to the axis parallel, overlapping, density based SCAF. A comparative chart indicates the working principles, heuristics used, shape and size of the clusters, run time, accuracy and limitations of the different algorithms. Section 4 concludes the paper.
Figure 1. Dividing data into two clear groups using the distance between data points (axes: Dimension 1 and Dimension 2)

1.1.1 The Curse of Dimensionality

Determination of simple Euclidean distance may not be useful in clustering high dimensional data [6]. Traditional clustering algorithms consider all the dimensions of an input dataset to measure such a distance between any two data points. While dealing with high dimensional data, clustering faces the problems of the curse of dimensionality [15] and the related sparsity effects. These problems result from the fact that a fixed number of data points becomes increasingly sparse as dimensionality increases. In effect, the amount of data needed to sustain a given density increases exponentially with the dimensionality of the input space. Conversely, for a constant amount of data, sparsity increases exponentially with dimensionality, with data points tending to become equidistant from one another. This badly affects any clustering method based on either density or the distance between data points. The curse of dimensionality is illustrated in a simple way in Figure 2.
Figure 2. The curse of dimensionality

In Figure 2, 200 randomly generated data points have been used, and the difference between the maximum and the minimum distance over every pair of points is computed. Refs. [6, 16] show that, for certain data distributions, the relative difference between the distances to the closest and the farthest data point tends to 0 as the number of dimensions increases (for the $L_k$ metric with $k \ge 3$), i.e.

$$\lim_{d \to \infty} \frac{D_{\max}^{d} - D_{\min}^{d}}{D_{\min}^{d}} \to 0 \qquad (1)$$

where $d$ is the number of dimensions [17]. Thus, eq. (1) shows the potential problems in high dimensional data clustering in those cases where the data distribution generates relatively uniform distances between data points.
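As an illustration, the following minimal Python sketch (our own hypothetical reconstruction of the experiment behind Figure 2, with 200 uniformly distributed random points) shows how the relative contrast of eq. (1) collapses as the dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((200, dim))   # 200 uniform random points in [0, 1]^dim
    d = pdist(points)                 # all pairwise Euclidean distances
    # relative contrast (Dmax - Dmin) / Dmin shrinks as dimensionality grows
    print(f"dim = {dim:4d}   contrast = {(d.max() - d.min()) / d.min():.3f}")
```

For uniform data, the printed contrast drops from well above 1 in two dimensions to a small fraction in a thousand dimensions, which is exactly the behaviour that makes distance based clustering unreliable.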
1.1.2 Irrelevant Dimensions

Apart from the curse of dimensionality, high dimensional data contains many dimensions that are often irrelevant to clustering or data processing [9]. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In such a case, the common approach is to reduce the dimensionality of the data, of course without losing important information. Thus, clustering is often preceded by a feature selection step which attempts to remove irrelevant features from the data. However, in high dimensional data, clusters are embedded in various subspaces. One dimension may be useful in some subspace combination for clustering the data, while being irrelevant in some other subspace. Thus, a global filtering approach for feature selection is not feasible, as the sketch after the next subsection illustrates.

1.1.3 Correlations among Dimensions

As there is a large number of dimensions, there are usually correlations among attributes. Consequently, clusters may not be axis parallel, but arbitrarily oriented. These problems make the average density in the data space quite low. Not only is the density in the data space low, but noise values are also uniformly distributed in high dimensional space [18]. Thus, it is not effective to search for clusters in high dimensional data using the traditional clustering approaches.

There are two ways to deal with the problem of high dimensionality. The first is to apply one of a variety of techniques performing dimensionality reduction prior to clustering. Such techniques reduce the number of dimensions in the given data, so that existing clustering approaches can be used without changing the meaning of the data. The other way is known as subspace clustering, which solves the problems of high dimensional data clustering by building clusters hidden in lower dimensional subspaces of the original dimension space.
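As promised above, here is a small synthetic sketch (entirely hypothetical data and values) of why a global feature filter fails when relevance is subspace local: each of the two clusters below is compact in a different attribute, so no attribute can be discarded globally, yet each cluster treats one attribute as pure noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Cluster A is compact in attribute 0 and uniform noise in attribute 1;
# cluster B is the opposite, so neither attribute is globally irrelevant.
cluster_a = np.column_stack([rng.normal(0.2, 0.01, 100), rng.random(100)])
cluster_b = np.column_stack([rng.random(100), rng.normal(0.8, 0.01, 100)])
data = np.vstack([cluster_a, cluster_b])

# A global filter that scores attributes over the whole dataset sees
# comparable variance on both attributes and cannot drop either one.
print(data.var(axis=0))
```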
While all these dimensionality reduction techniques are quite successful in a large set of database applications, they face difficulty when clusters exist in various subspaces of the dimension space.
2. CLASSIFICATION SCHEMES
Existing subspace clustering approaches can be categorized using various classification schemes, and a few survey papers classify the approaches in different ways. The well-known survey by Lance et al. [13] divides subspace clustering into two major types, top down and bottom up, based on the search strategy used. The top down approaches are further classified into per-cluster weighting methods and per-instance weighting methods. However, in this classification, the authors have considered only grid based approaches
and have classified them further, based on the size of the grid, into static grid and adaptive grid approaches. The authors state no clear division of bottom up approaches into grid based or density based; we show this classification of bottom up approaches in the following sections.

Ilango et al. [11] classify high dimensional clustering approaches as partitioning approaches, hierarchical approaches, density based approaches, grid based approaches and model based approaches, and further present a survey of various grid based approaches. However, there is no specific categorization of subspace clustering approaches versus traditional clustering approaches.

Karlton et al. [33] classify subspace clustering approaches into two categories: density based clustering and projected clustering. As per the authors, density based clustering approaches such as CLIQUE (Clustering In QUEst) [9], MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [28] and SUBCLU (density connected SUBspace CLUstering) [14] are based on the density of the data. Projected clustering is observed in approaches such as PROCLUS (PROjected CLUStering) [30], CLARANS [34], ORCLUS (arbitrarily ORiented projected CLUStering) [35], DOC (Density based Optimal projective Clustering) [31], etc. However, the authors do not clearly state on which basis the density based approaches differ from the projected approaches.

Kriegel et al. [10] classify high dimensional data clustering approaches as subspace clustering (or axis parallel clustering), correlation clustering (arbitrarily oriented clustering) and pattern based clustering. Correlation clustering approaches aim at finding clusters which may exist in any arbitrarily oriented subspace, e.g. ORCLUS [35]. Pattern based clustering aims at grouping objects into clusters exhibiting a similar trend in a subset of attributes, e.g. p-Cluster [36]. Axis parallel subspace clustering algorithms are further classified, using a problem oriented categorization, into subspace clustering, projected clustering, soft projected clustering and hybrid algorithms. Projected clustering approaches aim at finding a unique assignment of each object to exactly one subspace cluster or to noise, e.g. PreDeCon (subspace PREference weighted DEnsity CONnected clustering) [32]. In soft projected clustering algorithms, the number k of clusters is known in advance and an objective function is optimized to generate the k clusters, e.g. COSA (Clustering Objects on Subsets of Attributes) [37]. Subspace clustering algorithms aim at finding all subspaces where clusters can be identified, e.g. SUBCLU [14]; and hybrid algorithms aim at finding something in between, i.e. they may find overlapping clusters but do not claim to search every possible subspace and every possible cluster, e.g. FIRES (FIlter REfinement Subspace clustering) [29].

One more simple classification is stated in [38]. Depending on the underlying cluster definition and the parameterization of the resulting clusters, the authors classify subspace clustering approaches as cell based, density based and clustering oriented approaches. Cell based approaches search for sets of fixed or variable grid cells containing more than a threshold number of objects, e.g. CLIQUE [9]. Density based approaches define clusters as dense regions separated by sparse regions, e.g. SUBCLU [14], and clustering oriented approaches define properties of the entire set of clusters, such as the number of clusters, their average dimensionality or statistically oriented properties, e.g. PROCLUS [30]. In fact, there exist other classifications which these authors have not considered.
For example, axis parallel, overlapping, bottom up, density based subspace clustering algorithms form one family. This concept helps categorize the various existing approaches not only into different classes but also into different families. Thus, a new algorithm proposed by a researcher can be compared only with the algorithms belonging to the respective family. We propose that the various classes decide the characteristics of a common family, so that similar algorithms belong to that family. With this view, the characteristics of a family are based on the classes listed in Table 1.

With reference to Table 1, combinations of the different classes build different SCAFs, such as:

i. Axis parallel, overlapping, bottom up, grid based SCAF
ii. Axis parallel, overlapping, bottom up, density based SCAF
iii. Axis parallel, non-overlapping, bottom up, grid based SCAF
iv. Axis parallel, non-overlapping, bottom up, density based SCAF
v. Arbitrarily oriented, overlapping, bottom up, grid based SCAF
vi. Arbitrarily oriented, overlapping, bottom up, density based SCAF
vii. Arbitrarily oriented, non-overlapping, bottom up, grid based SCAF
viii. Arbitrarily oriented, non-overlapping, bottom up, density based SCAF

Table 1. SCAF classes and corresponding characteristics

Sr. No.   Class                               Characteristics of the family
1         Cluster orientation                 Axis parallel / Arbitrarily oriented
2         Overlap of dimensions or objects    Overlapping / Non-overlapping
3         Search methods                      Bottom up / Top down
4         Use of grid                         Grid based / Density based
Similarly, different combinations of top down approaches could be identified. These classes building various SCAFs are discussed in the following sections.
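The eight families listed above are simply the Cartesian product of the class values in Table 1 with the search method fixed to bottom up. A tiny sketch (ours, purely illustrative) that enumerates them:

```python
from itertools import product

# Class values taken from Table 1; fixing the search method to "bottom up"
# reproduces families i-viii, and swapping in "top down" yields the rest.
orientations = ["Axis parallel", "Arbitrarily oriented"]
overlaps = ["overlapping", "non-overlapping"]
searches = ["bottom up"]
density_notions = ["grid based", "density based"]

for family in product(orientations, overlaps, searches, density_notions):
    print(", ".join(family) + " SCAF")
```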
Figure 3. Overlapping and non-overlapping clusters

Overlapping cluster algorithms aim at finding every possible cluster in every possible subspace of the feature space. Major examples of such algorithms include CLIQUE [9], ENCLUS (ENtropy based subspace CLUStering) [27], MAFIA [28], SUBCLU [14] and FIRES [29]. Significant examples of non-overlapping approaches are PROCLUS [30], DOC [31] and PreDeCon [32].
when it cannot find any more quality subspaces. A few significant algorithms of this category are CLIQUE [9], DOC [31] and MAFIA [28].
Figure 5. Top down / bottom up search based subspace clustering

Both top down and bottom up approaches are commonly used in the data mining domain, and both attempt to detect subspace clusters efficiently. Top down approaches first identify cluster members and then determine the related subspaces of these clusters, whereas bottom up approaches first predict interesting subspaces and then search for the clusters in those subspaces. Figure 5 presents a classification scheme based on the search techniques employed in subspace clustering approaches.
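To make the bottom up strategy concrete, here is a hedged sketch (our own simplification, not any specific published algorithm) of the Apriori-style search used by CLIQUE-like methods: (k+1)-dimensional candidate subspaces are generated only from interesting k-dimensional ones. The predicate `is_interesting` is a hypothetical placeholder for a density or coverage test.

```python
import numpy as np

def bottom_up_subspaces(data, is_interesting):
    """Generate interesting subspaces bottom up: a (k+1)-dimensional
    candidate is built only from interesting k-dimensional subspaces,
    exploiting the monotonicity assumption of CLIQUE-like algorithms."""
    current = [(d,) for d in range(data.shape[1]) if is_interesting(data, (d,))]
    found = list(current)
    while current:
        # join pairs of k-dimensional subspaces sharing all but one dimension
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a in current for b in current
                      if len(set(a) | set(b)) == len(a) + 1}
        current = [s for s in sorted(candidates) if is_interesting(data, s)]
        found.extend(current)
    return found

# Toy usage with a hypothetical test: keep a subspace if the data variance
# in each of its dimensions stays below a threshold (compact dimensions).
rng = np.random.default_rng(3)
data = rng.random((300, 6))
data[:, 0] = rng.normal(0.5, 0.05, 300)   # dimension 0 is compact
data[:, 1] = rng.normal(0.5, 0.05, 300)   # dimension 1 is compact
print(bottom_up_subspaces(
    data, lambda X, S: X[:, list(S)].var(axis=0).max() < 0.01))
```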
Figure 6. Axis parallel clusters and arbitrarily oriented clusters

Figure 6 shows that clusters can sometimes be expressed better in arbitrarily oriented subspaces. However, if we consider arbitrarily oriented clusters, computational efficiency becomes quite low compared to axis parallel approaches, as the number of possible subspaces becomes infinite (a d-dimensional space has only $2^d$ axis parallel subspaces, but uncountably many arbitrarily oriented ones). Thus, depending on the application, it is sometimes reasonable to simply locate axis parallel clusters, as finding such clusters is more efficient than finding correlation clusters.
Figure 7. Classification based on use of grid and density notion

Density connected clustering approaches are based on the clustering paradigm specified by DBSCAN [42]. They compute the density around a certain point by searching its neighborhood. Two input parameters, a density threshold $\mu$ and a radius $\varepsilon$, define whether an area is dense or not. A cluster is then defined as a set of dense objects, each having at least $\mu$ objects in its $\varepsilon$-neighborhood. In a density based subspace cluster (O, S) with respect to the density threshold $\mu$ and radius $\varepsilon$, an object o is dense iff $o \in DB$ and $|N_{\varepsilon}^{S}(o)| \ge \mu$, where the $\varepsilon$-neighborhood of object o in subspace S is

$$N_{\varepsilon}^{S}(o) = \{\, p \in DB \mid Dist_S(o, p) \le \varepsilon \,\}$$

$Dist_S$ designates the distance function applied to the set of dimensions S. Thus, all core objects which share a common neighborhood together define the outline of a cluster. Non-core objects within the neighborhood of core objects form the boundary of the cluster. Objects which do not belong to any cluster are regarded as noise points. Density based approaches can thus find clusters of any shape and size and are noise tolerant.

The first density based subspace clustering algorithm was SUBCLU [14]. Other examples are FIRES [29], PreDeCon [32], DUSC (Dimensionality Unbiased Subspace Clustering) [43] and SURFING (SUbspace Relevant For ClusterING) [44]. Density based subspace clustering approaches can detect clusters of any size and shape, positioned arbitrarily, thus eliminating the problems associated with grid based approaches. However, as the density measurement is again based on distance, density based clustering approaches compute distances by considering only the relevant dimensions. Until now, we have discussed the individual classes classifying various subspace clustering algorithms. SCAF helps classify these algorithms into families, making the classification friendlier to researchers.
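The definition above translates almost directly into code. The following sketch (ours; `mu`, `eps` and the subspace `S` are illustrative parameters, not values from any cited paper) checks whether an object is a core object in a given subspace:

```python
import numpy as np

def subspace_neighborhood(db, o, S, eps):
    """N_eps^S(o) = { p in db : Dist_S(o, p) <= eps }, using Euclidean
    distance restricted to the dimensions in the subspace S."""
    dists = np.sqrt(((db[:, S] - db[o, S]) ** 2).sum(axis=1))
    return np.where(dists <= eps)[0]

def is_core(db, o, S, eps, mu):
    """An object is dense (a core object) if its eps-neighborhood in
    subspace S contains at least mu objects."""
    return len(subspace_neighborhood(db, o, S, eps)) >= mu

# Toy usage on hypothetical 5-dimensional data, with subspace S = {0, 2}
db = np.random.default_rng(2).random((500, 5))
print(is_core(db, o=0, S=[0, 2], eps=0.1, mu=10))
```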
distances to objects in its area of influence sum up to more than a given density threshold, i.e. the density $\varphi(o)$ of object o exceeds the threshold. It further uses the Epanechnikov kernel estimator [43] to estimate the density value at any position in the data space, assigning decreasing weights to objects with increasing distance. The density of an object is then measured with respect to the expected density $\varepsilon(S)$ of uniformly distributed data. Thus, an object o is dense in subspace S according to the expected density $\varepsilon(S)$ if and only if

$$\frac{1}{\varepsilon(S)}\,\varphi^{S}(o) \ge F$$
where F denotes the density threshold. F is independent of the dimensionality and the data set size, and is much easier to specify than traditional density thresholds. DUSC also combines the major subspace clustering paradigms to improve its runtime. Experiments on large high dimensional synthetic and real world data sets show that DUSC outperforms other subspace clustering algorithms in terms of accuracy and runtime.
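As a hedged illustration of this dimensionality-unbiased density notion (our own simplification of DUSC, with the expected density passed in as a precomputed parameter rather than derived analytically as in [43]):

```python
import numpy as np

def epanechnikov_density(db, o, S, h):
    """Kernel density at object o in subspace S: every object within
    bandwidth h contributes 1 - (dist/h)^2, so closer objects receive
    larger weights and objects farther than h contribute nothing."""
    u = np.sqrt(((db[:, S] - db[o, S]) ** 2).sum(axis=1)) / h
    return np.maximum(0.0, 1.0 - u ** 2).sum()

def is_dense(db, o, S, h, expected_density, F):
    """Object o is dense in S if its kernel density, normalized by the
    expected density eps(S) of uniformly distributed data, reaches F."""
    return epanechnikov_density(db, o, S, h) / expected_density >= F
```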
Section 3 thus covered a reasonable description of these state of the art algorithms, all belonging to the same subspace clustering algorithms family. Further, Table 2 summarizes the important aspects and characteristics of all these algorithms.
4. CONCLUSION
In this paper, a detailed introduction to cluster analysis of high dimensional data and the challenges faced by such clustering approaches were presented. The major challenges in high dimensional data clustering are the curse of dimensionality, irrelevant dimensions and correlations among dimensions. We briefly described traditional approaches, such as feature selection and feature transformation, to solve the problem of high dimensional data clustering. We then presented details of subspace clustering, the most commonly used high dimensional data clustering approach. Many approaches exist for subspace clustering, and new algorithms are being proposed nearly every day. Proper selection of a clustering approach to suit a particular application and dataset should be based on:

i. an understanding of the exact requirement of the clustering application, and
ii. the working principles of the available approaches.

Hence, an attempt was made to present various classification schemes for existing subspace clustering algorithms, to better understand the group characteristics of various families of algorithms. The concept of SCAF, Subspace Clustering Algorithms Family, was presented to help solve the problem of building a uniform platform on which to classify, and hence test, new subspace clustering algorithms. Examples of a few families were created by assigning different values to the classes which define a SCAF. A comparative study of a few specific algorithms belonging to the axis parallel, overlapping, density based SCAF was presented for ready reference, including a comparison based on parameters such as run time and the shape and size of the clusters. We implemented a few of these techniques on OpenSubspace (Weka) to better understand their working, strengths and limitations, although there is a need for more extensive testing and a comparative study of all these techniques. Finally, no single clustering approach will suit every type of data. We limited the scope of this paper to continuous valued data, though there exist many clustering algorithms specially designed for stream data, graph data, spatial data, text data, heterogeneous data, etc. We hope to stimulate further research in these areas.
ACKNOWLEDGEMENT
We would like to thank our student Shweta Daptari for her help in implementing and testing the algorithms.
REFERENCES
[1] Kaufman, L. & Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York.
[2] Han, J. & Kamber, M. (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA.
[3] Jain, A. & Dubes, R. (1988) Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ.
[4] Xu, R. (2005) Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, Vol. 16, Issue 3, pp. 645-678.
[5] Daxin, J., Tang, C. & Zhang, A. (2004) Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, Issue 11, pp. 1370-1386.
[6] Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1998) When is Nearest Neighbor Meaningful?, In Proceedings of the 7th International Conference on Database Theory (ICDT-1999), Jerusalem, Israel, pp. 217-235.
[7] Cutting, D., Karger, D., Pedersen, J. & Tukey, J. (1992) Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections, In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, pp. 318-329.
[8] Frank, I.E. & Roberto, T. (1994) Data Analysis Handbook, Elsevier Science Inc., New York, pp. 227-228.
[9] Agrawal, R., Gehrke, J., Gunopulos, D. & Raghavan, P. (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 27, Issue 2, pp. 94-105.
[10] Kriegel, H.P., Kroger, P. & Zimek, A. (2009) Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering, ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 3, Issue 1, Article 1.
[11] Ilango, M.R. & Mohan, V. (2010) A Survey of Grid Based Clustering Algorithms, International Journal of Engineering Science and Technology, Vol. 2(8), pp. 3441-3446.
[12] Patrikainen, A. & Meila, M. (2006) Comparing Subspace Clusterings, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, Issue 7, pp. 902-916.
[13] Lance, P., Haque, E. & Liu, H. (2004) Subspace Clustering for High Dimensional Data: A Review, ACM SIGKDD Explorations Newsletter, Vol. 6, Issue 1, pp. 90-105.
[14] Kailing, K., Kriegel, H.P. & Kroger, P. (2004) Density-Connected Subspace Clustering for High Dimensional Data, In Proceedings of the 4th SIAM International Conference on Data Mining, Orlando, FL, pp. 246-257.
[15] Friedman, J. (1994) An Overview of Computational Learning and Function Approximation, In: From Statistics to Neural Networks. Theory and Pattern Recognition Applications (Cherkassky, Friedman, Wechsler, eds.), Springer-Verlag.
[16] Strang, G. (1986) Linear Algebra and its Applications, Harcourt Brace Jovanovich, third edition.
[17] Hinneburg, A., Aggarwal, C. & Keim, D. (2000) What is the Nearest Neighbor in High Dimensional Spaces?, In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB-2000), Cairo, Egypt, pp. 506-515.
[18] Berchtold, S., Bohm, C., Keim, D. & Kriegel, H.-P. (1997) A Cost Model for Nearest Neighbour Search in High Dimensional Data Space, In Proceedings of the 16th Symposium on Principles of Database Systems (PODS), pp. 78-86.
[19] Gao, J., Kwan, P.W. & Guo, Y. (2009) Robust Multivariate L1 Principal Component Analysis and Dimensionality Reduction, Neurocomputing, Vol. 72, pp. 1242-1249.
[20] Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition, Academic Press, New York.
[21] Blum, A. & Langley, P. (1997) Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, Vol. 97, pp. 245-271.
[22] Liu, H. & Motoda, H. (1998) Feature Selection for Knowledge Discovery & Data Mining, Kluwer Academic Publishers, Boston.
[23] Pena, J.M., Lozano, J.A., Larranaga, P. & Inza, I. (2001) Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23(6), pp. 590-603.
[24] Yu, L. & Liu, H. (2003) Feature Selection for High Dimensional Data: A Fast Correlation Based Filter Solution, In Proceedings of the Twentieth International Conference on Machine Learning, pp. 856-863.
[25] Kohavi, R. & John, G. (1997) Wrappers for Feature Subset Selection, Artificial Intelligence, Vol. 97(1-2), pp. 273-324.
[26] Blum, A. & Rivest, R. (1992) Training a 3-node Neural Network is NP-complete, Neural Networks, Vol. 5, pp. 117-127.
[27] Cheng, C.H., Fu, A.W. & Zhang, Y. (1999) Entropy-Based Subspace Clustering for Mining Numerical Data, In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pp. 84-93.
[28] Goil, S., Nagesh, H. & Choudhary, A. (1999) MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets, Technical Report CPDC-TR-9906-010, Northwestern University.
[29] Kriegel, H.P., Kroger, P., Renz, M. & Wurst, S. (2005) A Generic Framework for Efficient Subspace Clustering of High Dimensional Data, In Proceedings of the 5th International Conference on Data Mining (ICDM), Houston, TX, pp. 250-257.
[30] Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C. & Park, J.S. (1999) Fast Algorithms for Projected Clustering, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 61-72.
[31] Procopiuc, C., Jones, M., Agarwal, P.K. & Murali, T.M. (2002) A Monte Carlo Algorithm for Fast Projective Clustering, In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 418-427.
[32] Bohm, C., Kailing, K., Kriegel, H.-P. & Kroger, P. (2004) Density Connected Clustering with Local Subspace Preferences, In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM-04), Washington, DC, USA, pp. 27-34.
[33] Karlton, S. & Zaki, M. (2005) SCHISM: A New Approach to Interesting Subspace Mining, International Journal of Business Intelligence and Data Mining, Vol. 1, No. 2, pp. 137-160.
[34] Ng, R.T. & Han, J. (2002) CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, Issue 5, pp. 1003-1016.
[35] Aggarwal, C. & Yu, P. (2000) Finding Generalized Projected Clusters in High Dimensional Spaces, In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 70-81.
[36] Wang, H., Wang, W., Yang, J. & Yu, P. (2002) Clustering by Pattern Similarity in Large Data Sets, In Proceedings of the ACM International Conference on Management of Data (SIGMOD-02), pp. 394-405.
[37] Friedman, J. & Meulman, J. (2004) Clustering Objects on Subsets of Attributes, Journal of the Royal Statistical Society, Series B, Vol. 66, pp. 815-849.
[38] Muller, E., Gunnemann, S., Assent, I. & Seidl, T. (2009) Evaluating Clustering in Subspace Projections of High Dimensional Data, In Proceedings of the VLDB Endowment, Vol. 2, Issue 1, pp. 1270-1281.
[39] Woo, K., Lee, J., Kim, M. & Lee, Y. (2004) FINDIT: A Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting, Information and Software Technology, Vol. 46, Issue 4, pp. 255-271.
[40] Wang, W., Yang, J. & Muntz, R. (1997) STING: A Statistical Information Grid Approach to Spatial Data Mining, In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pp. 186-195.
[41] Liao, W.K., Liu, Y. & Choudhary, A. (2004) A Grid-Based Clustering Algorithm using Adaptive Mesh Refinement, In Proceedings of the 7th Workshop on Mining Scientific and Engineering Data Sets, Lake Buena Vista, FL, USA.
[42] Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, pp. 226-231.
[43] Assent, I., Krieger, R., Muller, E. & Seidl, T. (2007) DUSC: Dimensionality Unbiased Subspace Clustering, In Proceedings of the IEEE International Conference on Data Mining (ICDM 2007), Omaha, Nebraska, pp. 409-414.
[44] Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P. & Kroger, P. (2004) Subspace Selection for Clustering High-Dimensional Data, In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 04), Brighton, UK, pp. 11-18.
[45] Assent, I., Krieger, R., Muller, E. & Seidl, T. (2008) INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy, In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), pp. 414-425.
[46] Muller, E., Assent, I., Gunnemann, S. & Seidl, T. (2011) Scalable Density-based Subspace Clustering, In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1076-1086.
[47] Chu, Y.H., Huang, J.W., Chuang, K.T., Yang, D.N. & Chen, M.S. (2010) Density Conscious Subspace Clustering for High-Dimensional Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, pp. 16-30.
Authors
Sunita Jahirabadkar has been working as an Assistant Professor in the Computer Department, Cummins College of Engineering, Pune (India) for more than 14 years. She has 10 research publications in various national and international conferences and journals. She is a co-author of the book "e-business" by Oxford Publications. Her areas of interest and research include Data Mining, Artificial Intelligence, Machine Learning, Computer Architectures, etc.
Dr. Parag Kulkarni holds PhD from IIT Kharagpur. UGSM Monarch Business School - Switzerland conferred DSc - Higher Doctorate on him. He is the founder and Chief Scientist of EKLat Research where he has empowered businesses through machine learning, knowledge management, and systemic management. He has been working within the IT industry for over twenty years. The recipient of several awards, Dr. Kulkarni is a pioneer in the field of Systemic Machine Learning. He has over 120 research publications including more than half a dozen books and 3 patents. His areas of research and product development include Mmaps, intelligent systems, text mining, image processing, decision systems, forecasting, IT strategy, artificial intelligence, and machine learning.