DOI: 10.1145/319950.319965

Classification algorithms for NETNEWS articles

Published: 01 November 1999

Abstract

We propose several algorithms that use the vector space model to classify news articles posted on NETNEWS according to newsgroup categories. The baseline method combines the terms of all the articles of each newsgroup in the training set, so that each newsgroup is represented by a single vector. After training, incoming news articles are classified based on their similarity to the existing newsgroup categories. We propose the following techniques to improve the classification performance of the baseline method: (1) use routing (classification) accuracy and the similarity values to refine the training set; (2) update the underlying term structures periodically during testing; and (3) apply k-means clustering to partition the articles of each newsgroup and represent each newsgroup by k vectors. Our test collection consists of real news articles from the 519 subnewsgroups under the REC newsgroup of NETNEWS, collected over a period of 3 months. Our experimental results demonstrate that refining the training set reduces storage by one-third to two-thirds. Periodic updates improve routing accuracy by 20% to 100% but incur runtime overhead. Finally, representing each newsgroup by k vectors (with k = 2 or 3) using clustering yields the most significant improvement in routing accuracy, ranging from 60% to 100%, at the cost of only slightly higher storage requirements.
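As a rough illustration of the baseline method and the k-vector variant described in the abstract, the following Python sketch may help. It is not the authors' implementation: the whitespace tokenizer, raw term-frequency weights, cosine similarity, the deterministic k-means seeding, and all function names (`term_vector`, `train_baseline`, `kmeans_vectors`, `classify`) are assumptions made for this example, and the abstract does not specify any of them.

```python
import math
from collections import Counter, defaultdict

def term_vector(text):
    # Assumption: lowercase whitespace tokens with raw term frequencies.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term vectors (assumed measure).
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_baseline(training_set):
    # Baseline: combine the terms of all articles of each newsgroup
    # into a single vector (stored as a one-element list).
    groups = defaultdict(Counter)
    for newsgroup, text in training_set:
        groups[newsgroup].update(term_vector(text))
    return {g: [v] for g, v in groups.items()}

def kmeans_vectors(texts, k, iterations=10):
    # Variant: partition one newsgroup's articles into up to k clusters
    # and return one summed vector per cluster. Summed vectors stand in
    # for means because cosine similarity ignores vector length.
    vectors = [term_vector(t) for t in texts]
    centroids = [Counter(v) for v in vectors[:k]]  # assumed seeding: first k articles
    for _ in range(iterations):
        clusters = [Counter() for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
            clusters[best].update(v)
        # Keep the previous centroid if a cluster received no articles.
        centroids = [new if new else old for new, old in zip(clusters, centroids)]
    return centroids

def classify(article, group_vectors):
    # Route the article to the newsgroup whose closest vector is most similar.
    vec = term_vector(article)
    return max(group_vectors,
               key=lambda g: max(cosine(vec, c) for c in group_vectors[g]))

training = [
    ("rec.autos", "engine car tires brakes"),
    ("rec.autos", "car engine oil"),
    ("rec.music", "guitar chords song"),
    ("rec.music", "song album guitar"),
]
single = train_baseline(training)
by_group = defaultdict(list)
for group, text in training:
    by_group[group].append(text)
multi = {g: kmeans_vectors(texts, k=2) for g, texts in by_group.items()}
```

Both representations share one `classify` function because the baseline is stored as the special case of a single vector per newsgroup. With this toy data, `classify("new car engine", single)` and `classify("guitar song", multi)` route to `rec.autos` and `rec.music` respectively.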


Published In

CIKM '99: Proceedings of the eighth international conference on Information and knowledge management
November 1999
564 pages
ISBN: 1581131461
DOI: 10.1145/319950
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM99: Conference on Information and Knowledge Management
November 2 - 6, 1999
Kansas City, Missouri, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2021) An intelligent model based on integrated inverse document frequency and multinomial Naive Bayes for current affairs news categorisation. International Journal of System Assurance Engineering and Management, 13:3 (1341-1355). DOI: 10.1007/s13198-021-01471-7. Online publication date: 7-Nov-2021
  • (2011) Automated Profiling of the Balance of Optimism and Pessimism in Online News Content. Proceedings of the 2011 IEEE Fifth International Conference on Semantic Computing (1-6). DOI: 10.1109/ICSC.2011.85. Online publication date: 18-Sep-2011
  • (2010) An Adaptive Ontology Based Hierarchical Browsing System for CiteSeerx. Proceedings of the 2010 Second International Conference on Knowledge and Systems Engineering (203-208). DOI: 10.1109/KSE.2010.32. Online publication date: 7-Oct-2010
  • (2010) Adaptation of RSS feeds based on the user profile and on the end device. Journal of Network and Computer Applications, 33:4 (410-421). DOI: 10.1016/j.jnca.2010.02.004. Online publication date: 1-Jul-2010
  • (2008) Parsimonious concept modeling. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (815-816). DOI: 10.1145/1390334.1390519. Online publication date: 20-Jul-2008
  • (2008) Mitigating media bias. Proceedings of the hypertext 2008 workshop on Collaboration and collective intelligence (47-51). DOI: 10.1145/1379157.1379169. Online publication date: 19-Jun-2008
  • (2008) PeRSSonal's core functionality evaluation. Data & Knowledge Engineering, 64:1 (330-345). DOI: 10.1016/j.datak.2007.07.007. Online publication date: 1-Jan-2008
  • (2007) Ontology-Based User Profiles for Personalized Search. Ontologies (665-694). DOI: 10.1007/978-0-387-37022-4_24. Online publication date: 2007
  • (2003) Ontology-based personalized search and browsing. Web Intelligence and Agent Systems, 1:3-4 (219-234). DOI: 10.5555/1016416.1016421. Online publication date: 1-Dec-2003
  • (2003) Life cycle modeling of news events using aging theory. Proceedings of the 14th European Conference on Machine Learning (47-59). DOI: 10.1007/978-3-540-39857-8_7. Online publication date: 22-Sep-2003
