DOI: 10.1145/2428736.2428772
research-article

Document classification based on web search hit counts

Published: 03 December 2012

Abstract

This paper describes a web mining method for classifying research documents automatically. Web hit counts from AND-searches on pairs of words are used to form a document vector. Target documents are then classified using the result of k-means clustering, with cosine similarity as the distance measure.
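
The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which word pairs are searched or how the counts are combined, so the fixed list of basis words, the summing of counts over a document's keywords, the hard-coded HIT_COUNTS table (standing in for real search engine queries), and all function names are hypothetical.

```python
import math

# Hypothetical hit counts for AND-searches of (keyword, basis word).
# A real system would query a web search engine; these numbers are made up.
HIT_COUNTS = {
    ("clustering", "algorithm"): 900, ("clustering", "protein"): 100,
    ("sorting", "algorithm"): 800,    ("sorting", "protein"): 10,
    ("genome", "algorithm"): 80,      ("genome", "protein"): 700,
    ("enzyme", "algorithm"): 20,      ("enzyme", "protein"): 950,
}
BASIS_WORDS = ["algorithm", "protein"]  # assumed vector dimensions


def hit_count(word, basis):
    """Return the (stubbed) hit count of the query 'word AND basis'."""
    return HIT_COUNTS.get((word, basis), 0)


def document_vector(keywords, basis_words):
    """One component per basis word: summed hit counts over the keywords."""
    return [sum(hit_count(w, b) for w in keywords) for b in basis_words]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def kmeans_cosine(vectors, k, iterations=20):
    """k-means that assigns each vector to its most cosine-similar centroid.

    Centroids start as the first k vectors (a deterministic simplification).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four toy documents, each reduced to a few keywords.
docs = [["clustering", "sorting"], ["genome", "enzyme"], ["sorting"], ["enzyme"]]
vectors = [document_vector(d, BASIS_WORDS) for d in docs]
labels = kmeans_cosine(vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Because cosine similarity ignores vector length, the third document (far fewer total hits than the first) still joins the first document's cluster: only the proportions of hits across the basis words matter.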

Cited By

  • (2013) A Clustering Algorithm Using Twitter User Biography. Proceedings of the 2013 16th International Conference on Network-Based Information Systems, pages 432-435. DOI: 10.1109/NBiS.2013.70. Online publication date: 4-Sep-2013
  • (2013) An efficient technique for finding semantic similarity and their frequency between words. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), pages 159-163. DOI: 10.1109/ICGCE.2013.6823420. Online publication date: Dec-2013

Published In

IIWAS '12: Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
December 2012
432 pages
ISBN: 9781450313063
DOI: 10.1145/2428736

Sponsors

  • @WAS: International Organization of Information Integration and Web-based Applications and Services

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AND-search
  2. clustering
  3. document classification
  4. web hit counts
  5. web mining

Qualifiers

  • Research-article

Conference

IIWAS '12
Sponsor:
  • @WAS
