Article
DOI: 10.1145/584792.584877

Evaluation of hierarchical clustering algorithms for document datasets

Published: 04 November 2002

Abstract

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.

In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well suited for clustering large document datasets because of not only their relatively low computational requirements but also their comparable or even better clustering performance. We present a new class of clustering algorithms, called constrained agglomerative algorithms, that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
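The abstract does not spell out how a constrained agglomerative algorithm is built, so the following is only a minimal sketch of one plausible reading: a partitional step first splits the documents into k groups, agglomeration then runs inside each group, and a final agglomeration joins the per-group trees into one hierarchy. The choice of spherical k-means with evenly spaced seeds, the average-link (UPGMA) merge rule on cosine similarity, and every function name here are assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def partition(vectors, k, iters=20):
    """Assumed partitional step: spherical k-means with deterministic seeds."""
    n = len(vectors)
    centroids = [vectors[(i * n) // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: dot(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                total = members[0]
                for m in members[1:]:
                    total = add(total, m)
                centroids[c] = normalize(total)
    return labels


def agglomerate(nodes):
    """Merge (tree, vector_sum, size) nodes by average link until one remains.

    For unit vectors, the average pairwise cosine between two clusters is
    dot(sum_a, sum_b) / (size_a * size_b).
    """
    nodes = list(nodes)
    while len(nodes) > 1:
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda p: dot(nodes[p[0]][1], nodes[p[1]][1])
            / (nodes[p[0]][2] * nodes[p[1]][2]),
        )
        a, b = nodes[i], nodes[j]
        merged = ((a[0], b[0]), add(a[1], b[1]), a[2] + b[2])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)]
        nodes.append(merged)
    return nodes[0]


def constrained_agglomerative(docs, k):
    """Build one tree: agglomerate within each partition, then across them."""
    vectors = [normalize(d) for d in docs]
    labels = partition(vectors, k)
    roots = []
    for c in range(k):
        leaves = [(i, v, 1)
                  for i, (v, lab) in enumerate(zip(vectors, labels))
                  if lab == c]
        if leaves:
            roots.append(agglomerate(leaves))
    return agglomerate(roots)[0]


# Hypothetical toy data: two documents about each of two topics.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = constrained_agglomerative(docs, k=2)
print(tree)
```

In this sketch the partitional step acts as a constraint: two documents placed in different partitions can only be joined near the top of the tree, which is one way the two families of algorithms might be combined.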


Published In

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN: 1581134924
DOI: 10.1145/584792
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. agglomerative clustering
  2. hierarchical clustering
  3. partitional clustering

Qualifiers

  • Article

Conference

CIKM02

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
