research-article

Document clustering method using dimension reduction and support vector clustering to overcome sparseness

Authors:

Sang-Sung Park,

Dong-Sik Jang

Expert Systems with Applications: An International Journal, Volume 41, Issue 7

Pages 3204 - 3212

https://doi.org/10.1016/j.eswa.2013.11.018

Published: 01 June 2014

Abstract

Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, documents are difficult to analyze directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The resulting structured data are very sparse and hence difficult to analyze. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering (SVC) and the Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine (UCI). Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
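The sparseness the abstract refers to can be seen in a toy example. The sketch below is illustrative only and is not the authors' implementation: it builds a raw term-count matrix from three made-up documents and reports the fraction of zero entries. The documents, the whitespace tokenizer, and the function names `document_term_matrix` and `sparsity` are all assumptions introduced here.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row of raw term counts per document."""
    # Assumption: lowercase whitespace tokenization stands in for real text mining.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical patent-like snippets that share no vocabulary.
toy_docs = [
    "battery electrode coating method",
    "wireless antenna signal receiver",
    "engine fuel injection valve",
]
vocab, dtm = document_term_matrix(toy_docs)
print(f"{len(toy_docs)} documents x {len(vocab)} terms, sparsity = {sparsity(dtm):.2f}")
```

Because the three toy documents share no terms, two thirds of the entries are zero. Real collections with thousands of distinct terms are typically far sparser, which is the motivation for reducing dimension before clustering.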

Cited By

Mustafi D, Mustafi A (2023). A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points. Multimedia Tools and Applications 82:21 (32177-32201). doi:10.1007/s11042-023-14716-3. Online publication date: 2-Mar-2023
https://dl.acm.org/doi/10.1007/s11042-023-14716-3
Phan C, Chiang J (2022). COVID-19 Deep Clustering: An Ontology construction clustering method with dynamic medical labeling. Proceedings of the 11th International Symposium on Information and Communication Technology (216-222). doi:10.1145/3568562.3568564. Online publication date: 1-Dec-2022
https://dl.acm.org/doi/10.1145/3568562.3568564
Geng Q, Chuai Z, Jin J (2022). Webpage retrieval based on query by example for think tank construction. Information Processing and Management: an International Journal 59:1. doi:10.1016/j.ipm.2021.102767. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1016/j.ipm.2021.102767

Index Terms

Document clustering method using dimension reduction and support vector clustering to overcome sparseness
1. Applied computing
  1. Document management and text processing
2. Information systems

Reviews

Reviewer: Fazli Can

In this paper, the authors aim to address three problems associated with document clustering: determining the number of clusters, structuring the collection description matrix into a form suitable for statistical analysis, and overcoming the collection description matrix sparseness problem. For determining the number of clusters, they employ support vector clustering (SVC) and a measure called Silhouette. To overcome sparseness and make data more suitable for statistical analysis, they combine singular value decomposition (SVD) and principal component analysis (PCA). The authors perform experiments using two document collections: a set of 159 news articles, and 98 patent documents. In the first set of experiments, the goal is to show the efficacy of their approach. In patent data tests, their aim is measuring the success of their method in predicting research and development trends. The results of the experiments are inconclusive. In both cases, the experimental collections are too small. In the trend analysis, the authors hypothesize and show that, in a research field with a small number of patents, it is expected that there would be a greater number of patents in later years. The authors provide only one observation to support their claim. This paper would have been better if they had provided several observations with more data covering a wider time window.

Online Computing Reviews Service
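As a rough illustration of how a Silhouette score can help compare candidate clusterings, the following sketch computes the mean silhouette width for two labelings of the same points. It is a minimal pure-Python version written for illustration, not the authors' code; the toy points, the choice of Euclidean distance, and the names `euclidean` and `mean_silhouette` are assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean silhouette width over all points for a given labeling."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])  # groups nearby points
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # splits nearby points apart
print(f"good labeling: {good:.3f}, bad labeling: {bad:.3f}")
```

A labeling that keeps nearby points together scores close to 1, while one that separates them scores below 0. Comparing such scores across candidate numbers of clusters is one common way to pick that number.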

Information & Contributors

Information

Published In

Expert Systems with Applications: An International Journal Volume 41, Issue 7

June, 2014

443 pages

ISSN:0957-4174

Copyright © Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Feb 2025

Citations

Cited By

Sandhiya R, Sundarambal M (2022). Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Computing 22:2 (3213-3230). doi:10.1007/s10586-018-2023-4. Online publication date: 11-Mar-2022
https://dl.acm.org/doi/10.1007/s10586-018-2023-4
Ben HajKacem M, Ben N’cir C, Essoussi N (2021). A parallel text clustering method using Spark and hashing. Computing 103:9 (2007-2031). doi:10.1007/s00607-021-00932-y. Online publication date: 1-Sep-2021
https://dl.acm.org/doi/10.1007/s00607-021-00932-y
Kim J, Yoon J, Park E, Choi S (2020). Patent document clustering with deep embeddings. Scientometrics 123:2 (563-577). doi:10.1007/s11192-020-03396-7. Online publication date: 1-May-2020
https://dl.acm.org/doi/10.1007/s11192-020-03396-7
Feng X, Lai Z, Yu H (2020). A novel parallel object-tracking behavior algorithm based on dynamics for data clustering. Soft Computing - A Fusion of Foundations, Methodologies and Applications 24:3 (2265-2285). doi:10.1007/s00500-019-04058-4. Online publication date: 1-Feb-2020
https://dl.acm.org/doi/10.1007/s00500-019-04058-4
Herrero Á, Jiménez A, Bayraktar S, Garcia-Rodriguez J (2019). Hybrid Unsupervised Exploratory Plots. Complexity 2019. doi:10.1155/2019/6271017. Online publication date: 1-Jan-2019
https://dl.acm.org/doi/10.1155/2019/6271017
Fraj M, Ben Hajkacem M, Essoussi N (2019). Ensemble Method for Multi-view Text Clustering. Computational Collective Intelligence (219-231). doi:10.1007/978-3-030-28377-3_18. Online publication date: 4-Sep-2019
https://dl.acm.org/doi/10.1007/978-3-030-28377-3_18
Khennak I, Drias H, Kechid A, Moulai H (2019). Clustering Algorithms for Query Expansion Based Information Retrieval. Computational Collective Intelligence (261-272). doi:10.1007/978-3-030-28374-2_23. Online publication date: 4-Sep-2019
https://dl.acm.org/doi/10.1007/978-3-030-28374-2_23