Fast algorithms for projected clustering

Authors:

Charu C. Aggarwal,

Cecilia Procopiuc,

Jong Soo Park

ACM SIGMOD Record, Volume 28, Issue 2

Pages 61 - 72

https://doi.org/10.1145/304181.304188

Published: 01 June 1999

Abstract

The clustering problem is well known in the database literature for its numerous applications in problems such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore discuss a generalization of the clustering problem, referred to as the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves. We develop an algorithmic framework for solving the projected clustering problem, and test its performance on synthetic data.
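The idea of cluster-specific dimension subsets can be illustrated with a minimal sketch. This is not the algorithm developed in the paper; it only shows, on made-up data, how two groups of points can each be tight in a different pair of dimensions. The function names `relevant_dimensions` and `projected_distance`, the spread measure (mean absolute deviation from the centroid), and the sample points are all assumptions chosen for illustration.

```python
from statistics import mean


def relevant_dimensions(points, num_dims):
    """Pick the num_dims dimensions in which the points are most tightly packed.

    Tightness is measured (as an illustrative assumption) by the mean
    absolute deviation of the points from their centroid in each dimension.
    """
    dims = range(len(points[0]))
    centroid = [mean(p[d] for p in points) for d in dims]
    spread = [mean(abs(p[d] - centroid[d]) for p in points) for d in dims]
    tightest = sorted(dims, key=lambda d: spread[d])[:num_dims]
    return sorted(tightest)


def projected_distance(a, b, dims):
    """Average per-dimension Manhattan distance, restricted to dims."""
    return sum(abs(a[d] - b[d]) for d in dims) / len(dims)


# Hypothetical 4-dimensional data: group_a is tight only in dimensions 0 and 1,
# group_b is tight only in dimensions 2 and 3.
group_a = [(1.0, 2.0, 50.0, 9.0), (1.1, 2.1, 10.0, 70.0), (0.9, 1.9, 90.0, 35.0)]
group_b = [(40.0, 5.0, 7.0, 3.0), (80.0, 60.0, 7.1, 3.1), (10.0, 30.0, 6.9, 2.9)]

dims_a = relevant_dimensions(group_a, 2)
dims_b = relevant_dimensions(group_b, 2)

print("group_a dimensions:", dims_a)
print("group_b dimensions:", dims_b)
print("distance in own subspace:", projected_distance(group_a[0], group_a[1], dims_a))
print("distance in other subspace:", projected_distance(group_a[0], group_a[1], dims_b))
```

In this toy data no single pair of dimensions serves both groups, which is the situation the abstract describes: a global feature selection step would have to discard dimensions that one of the clusters depends on.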

Index Terms

Fast algorithms for projected clustering
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

ACM SIGMOD Record Volume 28, Issue 2

June 1999

599 pages

ISSN:0163-5808

DOI:10.1145/304181

Chairmen: Susan Davidson (Univ. of Pennsylvania), Christos Faloutsos (Carnegie Mellon Univ.)

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN:1581130848
DOI:10.1145/304182
Chairmen: Susan B. Davidson (Univ. of Pennsylvania, Philadelphia), Christos Faloutsos (Carnegie Mellon Univ., Pittsburgh)

Copyright © 1999 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
