
Document clustering method using dimension reduction and support vector clustering to overcome sparseness

Published: 01 June 2014

Abstract

  • This study proposes a new method to overcome the sparsity problem in document clustering.
  • We build a combined method using dimension reduction, K-means clustering, and support vector clustering (SVC).
  • In particular, we attempt to overcome the sparseness in patent document clustering.
  • First, we conduct an experiment using news data from the UCI Machine Learning Repository.
  • Second, we carry out patent clustering using retrieved patent documents.

Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, documents are difficult to analyze directly because raw document data are not suitable for statistical and machine learning methods. Therefore, we have to transform document data into structured data for analytical purposes, and we use text mining techniques for this process. The resulting structured data are very sparse and hence difficult to analyze. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and the Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.
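To make the sparseness problem concrete, the following minimal Python sketch builds a document-term matrix from a hypothetical four-document toy corpus (not data from the paper) and measures the fraction of zero cells. The function names `document_term_matrix` and `sparsity` and the whitespace tokenizer are assumptions made for illustration; the paper's own text mining pipeline is not reproduced here.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


# Hypothetical toy corpus, chosen only to illustrate the effect.
docs = [
    "patent claims battery electrode",
    "battery electrode coating method",
    "news report election results",
    "election results campaign news",
]
vocabulary, matrix = document_term_matrix(docs)
print(f"{len(vocabulary)} terms, sparsity = {sparsity(matrix):.2f}")
```

Even in this tiny corpus most cells are zero, because each document uses only a few of the terms in the shared vocabulary. With realistic vocabularies the proportion of zeros grows further, which is the situation that dimension reduction is meant to address.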

Reviews

Fazli Can

In this paper, the authors aim to address three problems associated with document clustering: determining the number of clusters, structuring the collection description matrix into a form suitable for statistical analysis, and overcoming the collection description matrix sparseness problem. For determining the number of clusters, they employ support vector clustering (SVC) and a measure called Silhouette. To overcome sparseness and make data more suitable for statistical analysis, they combine singular value decomposition (SVD) and principal component analysis (PCA). The authors perform experiments using two document collections: a set of 159 news articles, and 98 patent documents. In the first set of experiments, the goal is to show the efficacy of their approach. In patent data tests, their aim is measuring the success of their method in predicting research and development trends. The results of the experiments are inconclusive. In both cases, the experimental collections are too small. In the trend analysis, the authors hypothesize and show that, in a research field with a small number of patents, it is expected that there would be a greater number of patents in later years. The authors provide only one observation to support their claim. This paper would have been better if they had provided several observations with more data covering a wider time window.

Online Computing Reviews Service
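The Silhouette measure mentioned in the review can be sketched in a few lines of Python. The block below computes the mean silhouette width, s(i) = (b - a) / max(a, b), for candidate labelings of a hypothetical set of 2-D points and keeps the number of clusters with the highest score. The points, the candidate labelings, and the function name `silhouette_score` are assumptions for illustration; in the paper the candidate clusterings come from SVC and K-means on dimension-reduced document data, which this sketch does not reproduce.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # s(i) is taken as 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1)
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidate_labelings = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: silhouette_score(points, labels)
          for k, labels in candidate_labelings.items()}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Splitting one of the two natural groups lowers the mean silhouette width, so the two-cluster labeling is preferred here. On collections as small as those in the reviewed experiments, such scores can be sensitive to a handful of documents.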

Information & Contributors

Information

Published In

Expert Systems with Applications: An International Journal, Volume 41, Issue 7
June, 2014
443 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Dimension reduction
  2. Document clustering
  3. K-means clustering based on support vector clustering
  4. Patent clustering
  5. Silhouette measure
  6. Sparseness problem

Qualifiers

  • Research-article

Cited By

  • (2023) A differential evolution based algorithm to cluster text corpora using lazy re-evaluation of fringe points. Multimedia Tools and Applications, 82:21 (32177-32201). DOI: 10.1007/s11042-023-14716-3. Online publication date: 2-Mar-2023
  • (2022) COVID-19 Deep Clustering: An Ontology construction clustering method with dynamic medical labeling. Proceedings of the 11th International Symposium on Information and Communication Technology (216-222). DOI: 10.1145/3568562.3568564. Online publication date: 1-Dec-2022
  • (2022) Webpage retrieval based on query by example for think tank construction. Information Processing and Management: an International Journal, 59:1. DOI: 10.1016/j.ipm.2021.102767. Online publication date: 1-Jan-2022
  • (2022) Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Computing, 22:2 (3213-3230). DOI: 10.1007/s10586-018-2023-4. Online publication date: 11-Mar-2022
  • (2021) A parallel text clustering method using Spark and hashing. Computing, 103:9 (2007-2031). DOI: 10.1007/s00607-021-00932-y. Online publication date: 1-Sep-2021
  • (2020) Patent document clustering with deep embeddings. Scientometrics, 123:2 (563-577). DOI: 10.1007/s11192-020-03396-7. Online publication date: 1-May-2020
  • (2020) A novel parallel object-tracking behavior algorithm based on dynamics for data clustering. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 24:3 (2265-2285). DOI: 10.1007/s00500-019-04058-4. Online publication date: 1-Feb-2020
  • (2019) Hybrid Unsupervised Exploratory Plots. Complexity, 2019. DOI: 10.1155/2019/6271017. Online publication date: 1-Jan-2019
  • (2019) Ensemble Method for Multi-view Text Clustering. Computational Collective Intelligence (219-231). DOI: 10.1007/978-3-030-28377-3_18. Online publication date: 4-Sep-2019
  • (2019) Clustering Algorithms for Query Expansion Based Information Retrieval. Computational Collective Intelligence (261-272). DOI: 10.1007/978-3-030-28374-2_23. Online publication date: 4-Sep-2019
