Automatic subspace clustering of high dimensional data for data mining applications

Published: 01 June 1998

    Abstract

    Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
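
The idea of finding dense regions in subspaces can be illustrated with a small grid-based sketch. This is not the authors' implementation. It assumes that each dimension of data scaled to [0, 1) is cut into `xi` equal-width intervals, that a grid unit is "dense" when it holds more than a fraction `tau` of all points, and that candidates are grown bottom-up from lower-dimensional dense units. The function name `dense_units` and both parameter names are hypothetical, chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi, tau):
    """Bottom-up search for dense grid units in every subspace.

    points: list of equal-length tuples with coordinates in [0, 1).
    xi:     intervals per dimension (assumed parameter).
    tau:    fraction of points a unit must exceed to be dense (assumed).
    Returns {k: set of dense k-dimensional units}; each unit is a
    frozenset of (dimension, interval_index) pairs.
    """
    n = len(points)
    d = len(points[0])
    # Map every point to its interval index in each dimension.
    cells = [tuple(min(int(x * xi), xi - 1) for x in p) for p in points]

    def keep_dense(candidates):
        # Count the points falling inside each candidate unit.
        counts = Counter()
        for cell in cells:
            for unit in candidates:
                if all(cell[dim] == idx for dim, idx in unit):
                    counts[unit] += 1
        return {u for u, c in counts.items() if c / n > tau}

    # Level 1: every interval of every single dimension.
    level = keep_dense(
        {frozenset([(dim, idx)]) for dim in range(d) for idx in range(xi)}
    )
    result = {}
    k = 1
    while level:
        result[k] = level
        # Join pairs of dense k-units into (k+1)-dimensional candidates and
        # keep a candidate only if all of its k-dimensional projections are
        # dense (a unit cannot be dense if one of its projections is not).
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            spans_new_dim = len({dim for dim, _ in u}) == k + 1
            if len(u) == k + 1 and spans_new_dim:
                if all(u - {pair} in level for pair in u):
                    candidates.add(u)
        level = keep_dense(candidates)
        k += 1
    return result


# Four points share one grid cell; two outliers are scattered elsewhere.
pts = [(0.05, 0.05), (0.1, 0.15), (0.15, 0.1), (0.12, 0.18),
       (0.9, 0.3), (0.6, 0.7)]
res = dense_units(pts, xi=4, tau=0.5)
```

Because the result depends only on per-unit counts, reordering `pts` does not change it, which matches the order-insensitivity requirement stated in the abstract. Connecting adjacent dense units into clusters and producing minimized DNF descriptions are separate steps that this sketch leaves out.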

    Published In

    ACM SIGMOD Record, Volume 27, Issue 2
    June 1998
    595 pages
    ISSN: 0163-5808
    DOI: 10.1145/276305

    SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data
    June 1998
    599 pages
    ISBN: 0897919955
    DOI: 10.1145/276304
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States
