research-article

Document classification based on web search hit counts

Authors:

Masaya Kaneko,

Shusuke Okamoto,

Masaki Kohana,

You Inayoshi

IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services

Pages 223 - 228

https://doi.org/10.1145/2428736.2428772

Published: 03 December 2012

Abstract

This paper describes a web mining method for automatically classifying research documents. Each document vector is built from the web hit counts returned by AND-searches on pairs of words. The target documents are then classified according to the result of k-means clustering, which uses cosine similarity as its distance measure.
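The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how word pairs are chosen or how hit counts are combined, so the `HIT_COUNTS` table, the basis words, the summing rule in `document_vector`, and every function name here are assumptions made for illustration. A real system would obtain hit counts from a search engine; the sketch uses a small made-up lookup table instead.

```python
import math
import random

# Hypothetical AND-search hit counts, keyed by an unordered word pair.
# These numbers are invented for illustration, not taken from the paper.
HIT_COUNTS = {
    frozenset(("sql", "database")): 1000,
    frozenset(("sql", "network")): 50,
    frozenset(("index", "database")): 800,
    frozenset(("index", "network")): 100,
    frozenset(("router", "database")): 40,
    frozenset(("router", "network")): 900,
    frozenset(("packet", "database")): 30,
    frozenset(("packet", "network")): 1200,
}

# Assumed basis words (one vector component each) and per-document keywords.
BASIS_WORDS = ["database", "network"]
DOCUMENTS = [["sql", "index"], ["sql"], ["router", "packet"], ["packet"]]


def hit_count(a, b):
    """Look up the (made-up) hit count for the query 'a AND b'."""
    return HIT_COUNTS.get(frozenset((a, b)), 0)


def document_vector(keywords, basis_words):
    """Assumed rule: component i sums the hits of each keyword AND basis word i."""
    return [float(sum(hit_count(k, b) for k in keywords)) for b in basis_words]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale v to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def kmeans_cosine(vectors, k, iterations=20, seed=0):
    """k-means where the distance is 1 - cosine similarity."""
    rng = random.Random(seed)
    units = [normalize(v) for v in vectors]
    centroids = [list(v) for v in rng.sample(units, k)]
    labels = None
    for _ in range(iterations):
        # Assign each vector to the centroid with the smallest cosine distance.
        new_labels = [
            min(range(k), key=lambda c: 1.0 - cosine_similarity(v, centroids[c]))
            for v in units
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each non-empty centroid to the mean of its member vectors.
        for c in range(k):
            members = [v for v, lab in zip(units, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    vectors = [document_vector(kw, BASIS_WORDS) for kw in DOCUMENTS]
    print(kmeans_cosine(vectors, k=2))
```

Normalizing the vectors before averaging keeps the centroid update consistent with a cosine-based distance, since only direction, not magnitude, affects the assignment step. That choice belongs to this sketch and is not stated in the abstract.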

Cited By

Kohana M, Okamoto S, Kaneko M (2013). A Clustering Algorithm Using Twitter User Biography. Proceedings of the 2013 16th International Conference on Network-Based Information Systems, pages 432-435. DOI: 10.1109/NBiS.2013.70. Online publication date: 4-Sep-2013.
https://dl.acm.org/doi/10.1109/NBiS.2013.70

Yadav S, Sain D (2013). An efficient technique for finding semantic similarity and their frequency between words. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), pages 159-163. DOI: 10.1109/ICGCE.2013.6823420. Online publication date: Dec-2013.
https://doi.org/10.1109/ICGCE.2013.6823420

Index Terms

Document classification based on web search hit counts
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services

December 2012

432 pages

ISBN:9781450313063

DOI:10.1145/2428736

General Chair:
Eric Pardede
La Trobe University, Australia

In-Cooperation

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

IIWAS '12

Sponsor:

@WAS

IIWAS '12: The 14th International Conference on Information Integration and Web-based Applications & Services

December 3 - 5, 2012

Bali, Indonesia
