Article

Evaluation of hierarchical clustering algorithms for document datasets

Authors:

George Karypis

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management

Pages 515 - 524

https://doi.org/10.1145/584792.584877

Published: 04 November 2002

Abstract

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.

In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
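The abstract does not spell out how a constrained agglomerative algorithm combines the two families, so the following is only a minimal sketch of one plausible reading: a partitional step first splits the documents into k groups, agglomeration then runs inside each group, and the resulting subtrees are finally merged into one hierarchy. The function names, the cosine-based k-means, the average-link (UPGMA) merging rule, and the toy vectors are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def partition(vectors, k, iters=10, seed=0):
    """Partitional step (assumption: a simple cosine-based k-means)."""
    rng = random.Random(seed)
    centroids = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            nearest = max(range(k), key=lambda c: cosine(v, centroids[c]))
            groups[nearest].append(i)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(len(vectors[0]))
                ]
    return [g for g in groups if g]


def agglomerate(nodes, vectors):
    """Agglomerative step (assumption: average-link merging, i.e. UPGMA).

    Each node is a (tree, member_indices) pair; returns the single merged node.
    """
    nodes = list(nodes)
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                pairs = [(a, b) for a in nodes[i][1] for b in nodes[j][1]]
                sim = sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)] + [merged]
    return nodes[0]


def constrained_agglomerative(vectors, k):
    """Partition first, build a subtree per partition, then merge the subtrees."""
    parts = partition(vectors, k)
    subtrees = [agglomerate([(i, [i]) for i in part], vectors) for part in parts]
    return agglomerate(subtrees, vectors)[0]


def leaves(tree):
    """Flatten a nested-tuple tree into its list of document indices."""
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])


# Hypothetical toy "documents": two well-separated groups of term vectors.
docs = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 1], [0.1, 0.9], [0.05, 1]]
tree = constrained_agglomerative(docs, k=2)
top_level = {frozenset(leaves(child)) for child in tree}
print(tree)
```

In this reading, the partitional step acts as a constraint: documents placed in different groups can only be joined near the top of the tree, which limits how far early merging mistakes can propagate. The pairwise average-link search is quadratic per merge and is only meant for small illustrative inputs.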

Index Terms

Evaluation of hierarchical clustering algorithms for document datasets
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management

November 2002

704 pages

ISBN:1581134924

DOI:10.1145/584792

General Chair: Charles Nicholas (University of Maryland Baltimore County)

Program Chairs: David Grossman (Illinois Institute of Technology), Konstantinos Kalpakis (University of Maryland Baltimore County), Sajda Qureshi (Erasmus University, Rotterdam), Han van Dissel (Erasmus University, Rotterdam), Len Seligman (The MITRE Corporation)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM02: Eleventh ACM International Conference on Information and Knowledge Management

November 4 - 9, 2002

McLean, Virginia, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 251
Total Downloads: 4,911
Downloads (Last 12 months): 179
Downloads (Last 6 weeks): 9

Reflects downloads up to 17 Oct 2024

Cited By

Ge T, Luo X, Wang Y, Sedlmair M, Cheng Z, Zhao Y, Liu X, Deussen O, Chen B (2024). Optimally Ordered Orthogonal Neighbor Joining Trees for Hierarchical Cluster Analysis. IEEE Transactions on Visualization and Computer Graphics 30:8 (5034-5046). DOI: 10.1109/TVCG.2023.3284499. Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.1109/TVCG.2023.3284499
Drogkoula M, Kokkinos K, Samaras N (2023). A Comprehensive Survey of Machine Learning Methodologies with Emphasis in Water Resources Management. Applied Sciences 13:22 (12147). DOI: 10.3390/app132212147. Online publication date: 8-Nov-2023
https://doi.org/10.3390/app132212147
Dhulipala L, Łącki J, Lee J, Mirrokni V (2023). TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs. Proceedings of the ACM on Management of Data 1:3 (1-27). DOI: 10.1145/3617341. Online publication date: 13-Nov-2023
https://dl.acm.org/doi/10.1145/3617341
Akyol H, Preist C, Schien D (2023). Avoiding Overconfidence in Predictions of Residential Energy Demand Through Identification of the Persistence Forecast Effect. IEEE Transactions on Smart Grid 14:1 (228-238). DOI: 10.1109/TSG.2022.3198326. Online publication date: Jan-2023
https://doi.org/10.1109/TSG.2022.3198326
Zhou K, Sisman B, Rana R, Schuller B, Li H (2023). Emotion Intensity and its Control for Emotional Voice Conversion. IEEE Transactions on Affective Computing 14:1 (31-48). DOI: 10.1109/TAFFC.2022.3175578. Online publication date: 1-Jan-2023
https://doi.org/10.1109/TAFFC.2022.3175578
Palacios Gutiérrez A, Valencia Delfa J, Villeta López M (2023). Time series clustering using trend, seasonal and autoregressive components to identify maximum temperature patterns in the Iberian Peninsula. Environmental and Ecological Statistics 30:3 (421-442). DOI: 10.1007/s10651-023-00572-9. Online publication date: 15-Jul-2023
https://doi.org/10.1007/s10651-023-00572-9
Deo N, Basak J, Soliman A, Weinberg D, Steorts R, Rajasekaran S (2023). Novel Blocking Techniques and Distance Metrics for Record Linkage. Information Integration and Web Intelligence (431-446). DOI: 10.1007/978-3-031-48316-5_40. Online publication date: 22-Nov-2023
https://doi.org/10.1007/978-3-031-48316-5_40
Suris F, Bakar M, Ariff N, Mohd Nadzir M, Ibrahim K (2022). Malaysia PM10 Air Quality Time Series Clustering Based on Dynamic Time Warping. Atmosphere 13:4 (503). DOI: 10.3390/atmos13040503. Online publication date: 22-Mar-2022
https://doi.org/10.3390/atmos13040503
Nish Chandran S, Durgaprasad Gangodkar (2022). Scalable Semi-Supervised Clustering for Face Recognition with Insufficient Labelled Samples. Pattern Recognition and Image Analysis 32:2 (373-383). DOI: 10.1134/S1054661822020055. Online publication date: 6-Jul-2022
https://doi.org/10.1134/S1054661822020055
Andrei A, Grigore O (2022). Combating Deforestation Using Different AGNES Approaches. 2022 14th International Conference on Communications (COMM) (1-5). DOI: 10.1109/COMM54429.2022.9817217. Online publication date: 16-Jun-2022
https://dl.acm.org/doi/10.1109/COMM54429.2022.9817217
