DOI: 10.1145/956750.956817
Article

Navigating massive data sets via local clustering

Published: 24 August 2003

Abstract

This paper introduces a scalable method for feature extraction and navigation of large data sets by means of local clustering, where clusters are modeled as overlapping neighborhoods. Under the model, intra-cluster association and external differentiation are both assessed in terms of a natural confidence measure. Minor clusters can be identified even when they appear in the intersection of larger clusters. Scalability of local clustering derives from recent generic techniques for efficient approximate similarity search. The cluster overlap structure gives rise to a hierarchy that can be navigated and queried by users. Experimental results are provided for two large text databases.
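
As a rough illustration of the ideas in the abstract, the sketch below treats each candidate cluster as the k-nearest-neighbor set of a point and scores overlap between two neighborhoods with an association-rule style confidence, |A ∩ B| / |A|. This reading of "confidence", the function names `knn`, `confidence`, and `self_confidence`, and the brute-force neighbor search are all assumptions made for illustration. The abstract states that scalability comes from approximate similarity search, which this sketch does not implement.

```python
from math import dist


def knn(points, i, k):
    """Indices of the k points closest to points[i], including i itself."""
    order = sorted(range(len(points)), key=lambda j: dist(points[i], points[j]))
    return set(order[:k])


def confidence(a, b):
    """Fraction of neighborhood a that also lies in neighborhood b."""
    return len(a & b) / len(a) if a else 0.0


def self_confidence(points, i, k):
    """Mean confidence between the neighborhood of i and those of its members.

    A value near 1 suggests the members agree on who their neighbors are,
    which is one (assumed) way to read "intra-cluster association".
    """
    hood = knn(points, i, k)
    return sum(confidence(hood, knn(points, j, k)) for j in hood) / len(hood)


# Hypothetical toy data: two tight, well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(knn(points, 0, 3))
print(self_confidence(points, 0, 3))
```

Because neighborhoods are plain sets, two of them may overlap freely, which is the property the abstract relies on when it describes minor clusters appearing in the intersection of larger ones.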

Published In

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN:1581137370
DOI:10.1145/956750
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. association
  2. confidence
  3. nearest neighbor
  4. soft clustering

Qualifiers

  • Article

Conference

KDD03
Acceptance Rates

KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2016) Clustering spatial data by the neighbors intersection and the density difference. Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, pp. 217-226. DOI: 10.1145/3006299.3006332. Online publication date: 6-Dec-2016
  • (2016) Recent Advances in High-Dimensional Clustering for Text Data. Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing, pp. 323-337. DOI: 10.1007/978-3-319-48317-7_20. Online publication date: 21-Oct-2016
  • (2015) On speeding up the implementation of nearest neighbour search and classification. Proceedings of the 16th International Conference on Computer Systems and Technologies, pp. 207-213. DOI: 10.1145/2812428.2812464. Online publication date: 25-Jun-2015
  • (2015) Spectral clustering using robust similarity measure based on closeness of shared Nearest Neighbors. 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. DOI: 10.1109/IJCNN.2015.7280495. Online publication date: Jul-2015
  • (2015) Multi-objective optimization of shared nearest neighbor similarity for feature selection. Applied Soft Computing 37:C, pp. 751-762. DOI: 10.1016/j.asoc.2015.08.042. Online publication date: 1-Dec-2015
  • (2011) Quality of similarity rankings in time series. Proceedings of the 12th international conference on Advances in spatial and temporal databases, pp. 422-440. DOI: 10.5555/2035253.2035285. Online publication date: 24-Aug-2011
  • (2011) Multi-source shared nearest neighbours for multi-modal image clustering. Multimedia Tools and Applications 51:2, pp. 479-503. DOI: 10.1007/s11042-010-0637-5. Online publication date: 1-Jan-2011
  • (2011) Quality of Similarity Rankings in Time Series. Advances in Spatial and Temporal Databases, pp. 422-440. DOI: 10.1007/978-3-642-22922-0_25. Online publication date: 2011
  • (2010) Can shared-neighbor distances defeat the curse of dimensionality? Proceedings of the 22nd international conference on Scientific and statistical database management, pp. 482-500. DOI: 10.5555/1876037.1876078. Online publication date: 30-Jun-2010
  • (2010) Co-location pattern mining for unevenly distributed data: algorithm, experiments and applications. International Journal of Computational Science and Engineering 5:3/4, pp. 185-196. DOI: 10.1504/IJCSE.2010.037674. Online publication date: 1-Dec-2010
