DOI: 10.3115/976909.979623

Document classification using a finite mixture model

Published: 07 July 1997

Abstract

We propose a new method of classifying documents into categories. We define for each category a finite mixture model based on soft clustering of words. We treat the problem of classifying documents as that of conducting statistical hypothesis testing over finite mixture models, and employ the EM algorithm to efficiently estimate parameters in a finite mixture model. Experimental results indicate that our method outperforms existing methods.
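The abstract gives no formulas, so the following is only a minimal sketch of the general idea: a per-category mixture over word clusters, fitted with EM and compared by likelihood. It assumes the cluster word distributions P(w | cluster) are already given and fixed, and that EM estimates only the mixing weights; the paper's actual parameterization and test statistic may differ. All names (`em_mixture_weights`, `log_likelihood`, `classify`) and the toy data are hypothetical.

```python
import math


def em_mixture_weights(counts, cluster_word_probs, iterations=50):
    """Estimate mixing weights P(cluster | category) by EM.

    counts: dict mapping word -> count in the category's training text.
    cluster_word_probs: list of dicts; entry k maps word -> P(word | cluster k).
    Assumption: the component distributions are fixed; only weights are learned.
    """
    k = len(cluster_word_probs)
    weights = [1.0 / k] * k  # uniform start
    for _ in range(iterations):
        expected = [0.0] * k
        for word, n in counts.items():
            # E-step: joint probability of (cluster j, word) under current weights.
            joint = [weights[j] * cluster_word_probs[j].get(word, 0.0)
                     for j in range(k)]
            z = sum(joint)
            if z == 0.0:
                continue  # word not covered by any cluster; ignored in this sketch
            for j in range(k):
                expected[j] += n * joint[j] / z
        total = sum(expected)
        if total == 0.0:
            break  # nothing to learn from; keep the current weights
        # M-step: renormalize expected cluster counts into weights.
        weights = [e / total for e in expected]
    return weights


def log_likelihood(counts, weights, cluster_word_probs, floor=1e-12):
    """Log-likelihood of a bag of words under one category's mixture."""
    ll = 0.0
    for word, n in counts.items():
        p = sum(w * probs.get(word, 0.0)
                for w, probs in zip(weights, cluster_word_probs))
        ll += n * math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return ll


def classify(doc_counts, category_weights, cluster_word_probs):
    """Pick the category whose mixture gives the document the highest likelihood.

    Taking the argmax is a simplification of a likelihood-ratio style test.
    """
    return max(category_weights,
               key=lambda c: log_likelihood(doc_counts, category_weights[c],
                                            cluster_word_probs))


# Toy data (invented for illustration): two word clusters, two categories.
clusters = [
    {"ball": 0.5, "goal": 0.5},
    {"stock": 0.5, "bank": 0.5},
]
training = {
    "sports": {"ball": 3, "goal": 2, "stock": 1},
    "finance": {"stock": 4, "bank": 3, "goal": 1},
}
models = {c: em_mixture_weights(counts, clusters) for c, counts in training.items()}
predicted = classify({"ball": 2, "goal": 1}, models, clusters)
```

Because each toy word belongs to exactly one cluster, the weights here reduce to the fraction of each category's tokens falling in each cluster. With overlapping clusters, which is what soft clustering implies, the E-step would split each word's count across clusters in proportion to the current weights.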


Published In

ACL '98/EACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
July 1997
543 pages

Sponsors

  • Directorate General XIII (European Commission)
  • Universidad Complutense de Madrid
  • Universidad Autónoma de Madrid
  • Universidad Nacional de Educación a Distancia
  • Universidad Politécnica de Madrid

Publisher

Association for Computational Linguistics

United States

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%

