Article

Navigating massive data sets via local clustering

Author:

Michael E. Houle

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 547 - 552

https://doi.org/10.1145/956750.956817

Published: 24 August 2003

Abstract

This paper introduces a scalable method for feature extraction and navigation of large data sets by means of local clustering, where clusters are modeled as overlapping neighborhoods. Under the model, intra-cluster association and external differentiation are both assessed in terms of a natural confidence measure. Minor clusters can be identified even when they appear in the intersection of larger clusters. Scalability of local clustering derives from recent generic techniques for efficient approximate similarity search. The cluster overlap structure gives rise to a hierarchy that can be navigated and queried by users. Experimental results are provided for two large text databases.
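
As a rough illustration of the neighborhood-based model described in the abstract, the following Python sketch treats each item's k-nearest-neighbor set as a candidate cluster and scores it with an association-rule style confidence, |A ∩ B| / |A|. The abstract does not define the confidence measure, so this formula, the brute-force neighbor search (the paper relies on approximate similarity search instead), and the names `knn`, `confidence`, and `self_confidence` are assumptions made for illustration only.

```python
from math import dist


def knn(points, q, k):
    """Indices of the k points nearest to points[q], including q itself.

    Brute-force search; ties are broken by index. This stands in for the
    approximate similarity search mentioned in the abstract.
    """
    order = sorted(range(len(points)), key=lambda i: (dist(points[q], points[i]), i))
    return set(order[:k])


def confidence(a, b):
    """Assumed measure: the fraction of set a that also lies in set b."""
    return len(a & b) / len(a) if a else 0.0


def self_confidence(points, q, k):
    """Mean confidence between q's neighborhood and its members' neighborhoods.

    A value near 1 suggests the neighborhood is internally cohesive;
    lower values suggest it spills across natural groupings.
    """
    hood = knn(points, q, k)
    return sum(confidence(hood, knn(points, v, k)) for v in hood) / len(hood)
```

On two well-separated groups of three points each, a neighborhood of size 3 scores 1.0 under this sketch, while a neighborhood of size 4, which must reach into the other group, scores lower. Comparing such scores across neighborhood sizes is one plausible way to pick out cohesive local clusters.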

Cited By

Yan Z, Luo W, Bu C, Ni L, Anjum A, Zhao X (2016). Clustering spatial data by the neighbors intersection and the density difference. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, pp. 217-226. DOI: 10.1145/3006299.3006332. Online publication date: 6-Dec-2016
https://dl.acm.org/doi/10.1145/3006299.3006332
Zamora J (2016). Recent Advances in High-Dimensional Clustering for Text Data. Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing, pp. 323-337. DOI: 10.1007/978-3-319-48317-7_20. Online publication date: 21-Oct-2016
https://doi.org/10.1007/978-3-319-48317-7_20
Marinchev I, Agre G (2015). On speeding up the implementation of nearest neighbour search and classification. Proceedings of the 16th International Conference on Computer Systems and Technologies, pp. 207-213. DOI: 10.1145/2812428.2812464. Online publication date: 25-Jun-2015
https://dl.acm.org/doi/10.1145/2812428.2812464

Index Terms

Navigating massive data sets via local clustering
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2003

736 pages

ISBN: 1581137370

DOI: 10.1145/956750

Conference Chair: Lise Getoor (University of Maryland, College Park)
General Chair: Ted Senator (DARPA)
Program Chairs: Pedro Domingos (University of Washington), Christos Faloutsos (Carnegie Mellon University)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD03: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24-27, 2003

Washington, D.C.

Acceptance Rates

KDD '03 Paper Acceptance Rate: 46 of 298 submissions, 15%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 23
Total Downloads: 886
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 22 Sep 2024

Cited By

Xiucai Ye, Sakurai T (2015). Spectral clustering using robust similarity measure based on closeness of shared Nearest Neighbors. 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. DOI: 10.1109/IJCNN.2015.7280495. Online publication date: Jul-2015
https://doi.org/10.1109/IJCNN.2015.7280495
Kundu P, Mitra S (2015). Multi-objective optimization of shared nearest neighbor similarity for feature selection. Applied Soft Computing 37:C, pp. 751-762. DOI: 10.1016/j.asoc.2015.08.042. Online publication date: 1-Dec-2015
https://dl.acm.org/doi/10.1016/j.asoc.2015.08.042
Bernecker T, Houle M, Kriegel H, Kröger P, Renz M, Schubert E, Zimek A (2011). Quality of similarity rankings in time series. Proceedings of the 12th international conference on Advances in spatial and temporal databases, pp. 422-440. DOI: 10.5555/2035253.2035285. Online publication date: 24-Aug-2011
https://dl.acm.org/doi/10.5555/2035253.2035285
Hamzaoui A, Joly A, Boujemaa N (2011). Multi-source shared nearest neighbours for multi-modal image clustering. Multimedia Tools and Applications 51:2, pp. 479-503. DOI: 10.1007/s11042-010-0637-5. Online publication date: 1-Jan-2011
https://dl.acm.org/doi/10.1007/s11042-010-0637-5
Bernecker T, Houle M, Kriegel H, Kröger P, Renz M, Schubert E, Zimek A (2011). Quality of Similarity Rankings in Time Series. Advances in Spatial and Temporal Databases, pp. 422-440. DOI: 10.1007/978-3-642-22922-0_25. Online publication date: 2011
https://doi.org/10.1007/978-3-642-22922-0_25
Houle M, Kriegel H, Kröger P, Schubert E, Zimek A (2010). Can shared-neighbor distances defeat the curse of dimensionality? Proceedings of the 22nd international conference on Scientific and statistical database management, pp. 482-500. DOI: 10.5555/1876037.1876078. Online publication date: 30-Jun-2010
https://dl.acm.org/doi/10.5555/1876037.1876078
Morimoto Y (2010). Co-location pattern mining for unevenly distributed data: algorithm, experiments and applications. International Journal of Computational Science and Engineering 5:3/4, pp. 185-196. DOI: 10.1504/IJCSE.2010.037674. Online publication date: 1-Dec-2010
https://dl.acm.org/doi/10.1504/IJCSE.2010.037674
