article

Free access

Automatic subspace clustering of high dimensional data for data mining applications

Authors:

Rakesh Agrawal,

Johannes Gehrke,

Dimitrios Gunopulos, and

Prabhakar RaghavanAuthors Info & Claims

ACM SIGMOD Record, Volume 27, Issue 2

Pages 94 - 105

https://doi.org/10.1145/276305.276314

Published: 01 June 1998 Publication History

Abstract

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate cluster in large high dimensional datasets.

References

[1]

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages 307-328. AAAI/MIT Press, 1996.]]

Digital Library

[2]

A. Aho, 2. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Welsley, 1974.]]

Digital Library

[3]

P. Arabic and L. J. Hubert. An overview of combinatorial data analyis, in P. Arabic, L. Hubert, and G. D. Sorts, editors, Clustering and Classification, pages 5-63. World Scientific Pub., New Jersey, 1996.]]

[4]

Arbor Software Corporation. Application Manager User's Guide, Essbase Version 4.0 edition.]]

[5]

R. Bayardo. Efficiently mining long patterns from databases. of Data, Seattle, Washington, 1998.]]

Digital Library

[6]

S. Berchtold, C. Bohm, D. Keim, and H.-P. Kriegel. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings o} the 16th Symposium on Principles of Database Systems (PODS), pages 78-86, 1997.]]

Digital Library

[7]

M. Berger and I. Regoutsos. An algorithm for point clustering and grid generation. IEEE Transactions on Systems, Man and Cybernetics, 21(5):1278-86, 1991.]]

[8]

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the A CM SIGMOD Conference on Management of Da~a, May 1997.]]

Digital Library

[9]

P. Cheeseman and J. Stuff. Bayesian classification (autoclass): Theory and results. In U. M. Fayyad, O. Piatetsky- Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discove~'y and Data Mining, chapter 6, pages 153-180. AAAI/MIT Press, 1996.]]

Digital Library

[10]

R. Chhikara and D. Register. A numerical classification method for partitioning of a large multidimensional mixed data set. Technometrics, 21:531-537, 1979.]]

[11]

R. O. Duds and P. E. Hart. Pattern Classification and Scene Analysis. john Wiley and Sons, 1973.]]

[12]

R. J. Earle. Method and apparatus for storing and retrieving multi-dimensional data in computer memory. U.S. Patent No. 5359724, October 1994.]]

[13]

M. Ester, H.-P. Kriegel, 2. Sander, and X. Xu. A densitybased algorithm for discovering clusters in large spatial databases with noise, in Proc. of the ~nd Int'l Cort/erenee on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.]]

[14]

M. Ester, H.-P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. of the I st Int'l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995.]]

Digital Library

[15]

U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.]]

Digital Library

[16]

U. Feige. A threshold of In n for approximating set cover. In Proceedings of the Twenty-Eighth Annual A CM Symposium on Theory of Computing, pages 314-318, 1996.]]

Digital Library

[17]

D. Franzblau. Performance guarantees on a sweep-line heuristic for covering rectilinear polygons with rectangles. SIAM Y. Disc. Math, 2:307-321, 3 (1989).]]

Digital Library

[18]

J. Friedman. Optimizing a noisy function of many variables with application to data mining. In UW/MSR Summer Research Institute in Data Mining, July 1997.]]

[19]

K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.]]

Digital Library

[20]

D. Gunopulos, R. Khardon, H. Mannila, and S. Saluja. Data mining, hypergraph transversals, and machine learning. In P~oc. of the 16th ACId Syrup. on Principles of Database Systems, pages 209-216, 1997.]]

Digital Library

[21]

C.-T. Ho, Ft. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. In Proc. o~ ~h,e A UM SIGMOD Conference on Management o~ Data, Tucson, Arizona, May 1997.]]

Digital Library

[22]

S. J. Hong. MINI: A heuristic algorithm for two-level logic minimization. In R. Newton, editor, Selected Papers on Logic Synthesis/or Integrated Circuit Design. IEEE Press, 1987.]]

[23]

Internationl Business Machines. IBM Intelligent Miner User's Guide, Version 1 Release 1, SH12-6213-00 edition, July 1996.]]

[24]

A.K. Jain and R. C. Dubes. Algorithms lot Clustering Data. Prentice Hall, 1988.]]

Digital Library

[25]

L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.]]

[26]

D.-I. Lin and Z. M. Kedem. Pincer search: A new algorithm for discovering the maximum frequent sets. In Pr0c. o} the 6th Int'I Conference on Eztending Database Technology (EDBT), Valencia, Spain, 1998.]]

Digital Library

[27]

L. Lovasz. On the ratio of the optimal integral and fractional covers. Discrete Mathematics, 13:383-390, 1975.]]

Digital Library

[28]

C. Lund nnd M. Yannakakis. On the hardness of approximating minimization problems. In Proceedings o} the A CM Symposium on Theory o} Computing, pages 286-293, 1993.]]

Digital Library

[29]

W. Masek. Some NP-comptete set covering problems. M.S. Thesis, MIT, 1978.]]

[30]

M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. o} the Filth Int'l Conference or, Extending Database Technology (EDBT), Avignon, France, March 1996.]]

Digital Library

[31]

R. S. Michalski and It. E. Stepp. Learning from observation: Conceptual clustering. In It. S. Michalski, 3. G. Carbonell, nnd T. M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume I, pages 331-363. Morgan Kaufmann, 1983.]]

[32]

R. Miller and Y. Yang. Association rules over interval data. In Proc. A CM SIGMOD international Con}. on Management of Data, pages 452-461, 1997'.]]

Digital Library

[33]

R.T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. o} the VLDB Conference, Santiago, Chile, September 1994.]]

Digital Library

[34]

It. A. Reckhow and J. Culberson. Covering simple orthogohal polygon with a minimum number of orthogonally convex polygons. In Proc. o} the ACId 3rd Annual Computational Geometry Conference, pages 268-277, 1987.]]

Digital Library

[35]

J. Rissanen. Stochastic Complexly in Statistical Inquiry. World Scientific Publ. Co., 1989.]]

Digital Library

[36]

P. Schroeter and J. Bigun. Hierarchical image segmentation by multi-dimensional clustering and orientation-adaptive boundary refinement. Pattern Recognition, 25(5):695-709, May 1995.]]

[37]

J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 2$nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.]]

Digital Library

[38]

A. Shoshani. Personal communication. 1997.]]

[39]

P. Sneath and R. Sokal. Numerical Tazonomy. Freeman, 1973.]]

[40]

Ft. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. In Proc. of the A C'M SIGMOD Conference on Management o} Data, Montreal, Canada, june 1996.]]

Digital Library

[41]

H. Toivonen. Sampling large databases for association rules. In Proc. of the $$nd Int't Conference on Very Large Databases, pages 134-145, Mumbai (Bombay), India, September 1996.]]

Digital Library

[42]

S. Wharton. A generalized histogram clustering for multidimensional image data. Pattern Recognition, 16(2):193-199, 1983.]]

[43]

M. Zait and H. Messatfa. A comparative study of clustering methods. Future Generation Computer Systems, 13(2- 3):149-159, November 1997.]]

Digital Library

[44]

D. Zhang and A. Bowyer. CSG set-theoretic solid modelling and NC machining of blend surfaces. In Proceedings o} the Second Annual A CM Symposium on Computational Geometry, pages 314-318, 1986.]]

Digital Library

[45]

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clusterinK method for very large datnbales. In Proc. of the A CM SIGMOD ConJerence on Management o} Da~a, Montreal, Canada, June 1996.]]

Digital Library

Cited By

ELFarra BSalaha MAshour W(2024)Black Hole Clustering: Gravity-Based Approach with No Predetermined ParametersData Science Journal10.5334/dsj-2024-02723Online publication date: 7-May-2024
https://doi.org/10.5334/dsj-2024-027
Nikolaeva O(2024)Алгоритм сегментации многоспектральных изображенийMultispectral images segmentation algorithmМатематическое моделированиеMatematicheskoe modelirovanie10.20948/mm-2024-01-0236:1(25-40)Online publication date: 29-Jan-2024
https://doi.org/10.20948/mm-2024-01-02
Nikolaeva O(2024)Segmentation Algorithm of Multispectral ImagesMathematical Models and Computer Simulations10.1134/S207004822470002916:3(340-351)Online publication date: 25-Jun-2024
https://doi.org/10.1134/S2070048224700029
Show More Cited By

Index Terms

Automatic subspace clustering of high dimensional data for data mining applications

Recommendations

Subspace clustering for high dimensional data: a review
Special issue on learning from imbalanced datasets

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature ...
Read More
Automatic subspace clustering of high dimensional data for data mining applications
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical ...
Read More
Automatic Subspace Clustering of High Dimensional Data

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical ...
Read More

Comments

Information & Contributors

Information

Published In

cover image ACM SIGMOD Record

ACM SIGMOD Record Volume 27, Issue 2

June 1998

595 pages

ISSN:0163-5808

DOI:10.1145/276305

Chairmen:
Laura Haas
IBM Almaden Research Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editor:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle

Issue’s Table of Contents

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Chairmen:
Laura Haas
IBM AlmadenResearch Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editors:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
,
Michael Franklin
Univ. of Maryland, College Park

Copyright © 1998 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1998

Published in SIGMOD Volume 27, Issue 2

Check for updates

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1,826
Total Citations
View Citations
8,520
Total Downloads

Downloads (Last 12 months)838
Downloads (Last 6 weeks)147

Other Metrics

View Author Metrics

Citations

Cited By

ELFarra BSalaha MAshour W(2024)Black Hole Clustering: Gravity-Based Approach with No Predetermined ParametersData Science Journal10.5334/dsj-2024-02723Online publication date: 7-May-2024
https://doi.org/10.5334/dsj-2024-027
Nikolaeva O(2024)Алгоритм сегментации многоспектральных изображенийMultispectral images segmentation algorithmМатематическое моделированиеMatematicheskoe modelirovanie10.20948/mm-2024-01-0236:1(25-40)Online publication date: 29-Jan-2024
https://doi.org/10.20948/mm-2024-01-02
Nikolaeva O(2024)Segmentation Algorithm of Multispectral ImagesMathematical Models and Computer Simulations10.1134/S207004822470002916:3(340-351)Online publication date: 25-Jun-2024
https://doi.org/10.1134/S2070048224700029
Miller JHuroyan VNavarrete RHossain MKobourov S(2024)ENS-t-SNE: Embedding Neighborhoods Simultaneously t-SNE2024 IEEE 17th Pacific Visualization Conference (PacificVis)10.1109/PacificVis60374.2024.00032(222-231)Online publication date: 23-Apr-2024
https://doi.org/10.1109/PacificVis60374.2024.00032
Datta JBanerjee SDunson D(2024)Nonparametric Bayes multiresolution testing for high-dimensional rare eventsJournal of Nonparametric Statistics10.1080/10485252.2024.2309978(1-15)Online publication date: Feb-2024
https://doi.org/10.1080/10485252.2024.2309978
Challa JGoyal NSharma ASreekumar NBalasubramaniam SGoyal P(2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-0Online publication date: 26-Jun-2024
https://doi.org/10.1007/s11390-024-2700-0
Motallebi HMalakoutifar N(2024)An efficient clustering algorithm based on searching popularity peaksPattern Analysis and Applications10.1007/s10044-024-01261-427:2Online publication date: 21-May-2024
https://doi.org/10.1007/s10044-024-01261-4
Xu ZLiu ZLi JZhang XZheng Y(2024)Quantitative Identification and Application of Rock Types in Complex Carbonate Reservoirs in the Middle EastProceedings of the International Field Exploration and Development Conference 202310.1007/978-981-97-0464-4_28(333-345)Online publication date: 19-Mar-2024
https://doi.org/10.1007/978-981-97-0464-4_28
Kishore Veparala VKalpana V(2023)Big Data y diferentes enfoques de clustering subespacial: De la promoción en redes sociales al mapeo genómicoSalud, Ciencia y Tecnología10.56294/saludcyt20234133(413)Online publication date: 19-Jun-2023
https://doi.org/10.56294/saludcyt2023413
Suchy DSiminski K(2023)GrDBSCANInternational Journal of Applied Mathematics and Computer Science10.34768/amcs-2023-002233:2(297-312)Online publication date: 23-Jun-2023
https://dl.acm.org/doi/10.34768/amcs-2023-0022
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents