research-article

Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Published: 01 January 2002

Abstract

In certain classification problems there is a strong asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the unrepresented class is desired. An important representative of this group of problems is user interest/preference modeling, where there may be a large number of examples of what the user likes with essentially no counterexamples.
Recently, there has been much interest in applying the EM algorithm to incomplete-data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, together with a large corpus of unlabeled documents. Here, user modeling is equivalent to user-specific document ranking. EM is used in conjunction with the Naive Bayes model, and its output is also used by a Support Vector Machine and Rocchio's technique.
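To make the positive-only setting concrete, the following is a minimal Python sketch of EM wrapped around a multinomial Naive Bayes model. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Laplace smoothing constant `alpha`, the fixed iteration count, and the choice to initialize every unlabeled document as negative are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_nb(docs, resp, vocab, alpha=1.0):
    """M-step: fit priors and word probabilities from soft labels.

    docs[i] is a list of tokens; resp[i] is the current P(positive | doc i).
    alpha is a Laplace smoothing constant (an assumption of this sketch).
    """
    prior_pos = sum(resp) / len(docs)
    counts = {0: Counter(), 1: Counter()}
    for tokens, r in zip(docs, resp):
        for w in tokens:
            counts[1][w] += r          # fractional count for the positive class
            counts[0][w] += 1.0 - r    # remainder goes to the negative class
    model = {"prior": {1: prior_pos, 0: 1.0 - prior_pos}, "logp": {}}
    for c in (0, 1):
        total = sum(counts[c].values()) + alpha * len(vocab)
        model["logp"][c] = {
            w: math.log((counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def posterior_pos(model, tokens):
    """E-step for one document: P(positive | tokens) under the model."""
    score = {}
    for c in (0, 1):
        # Clamp the prior so that log(0) cannot occur.
        score[c] = math.log(max(model["prior"][c], 1e-12))
        score[c] += sum(
            model["logp"][c][w] for w in tokens if w in model["logp"][c]
        )
    m = max(score.values())  # subtract the max for numerical stability
    e0, e1 = math.exp(score[0] - m), math.exp(score[1] - m)
    return e1 / (e0 + e1)


def em_positive_only(positives, unlabeled, n_iter=10):
    """Run EM with labeled positives held fixed at 1.0.

    Assumed initialization: every unlabeled document starts as negative.
    Returns the final model and P(positive) for each unlabeled document.
    """
    docs = positives + unlabeled
    vocab = sorted({w for d in docs for w in d})
    n_pos = len(positives)
    resp = [1.0] * n_pos + [0.0] * len(unlabeled)
    model = train_nb(docs, resp, vocab)
    for _ in range(n_iter):
        resp = [1.0] * n_pos + [posterior_pos(model, d) for d in unlabeled]
        model = train_nb(docs, resp, vocab)
    return model, resp[n_pos:]
```

Calling `em_positive_only(positives, unlabeled)` with tokenized documents returns one posterior per unlabeled document; values near zero mark the documents the model treats as likely negatives.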
Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.
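The recommendation to keep only the negative label estimates can be sketched for a Rocchio-style ranker as follows. This is a hedged illustration rather than the procedure used in the article: the 0.5 threshold, the `beta` and `gamma` weights, the raw term-frequency vectors, and all function names are assumptions introduced here.

```python
from collections import Counter


def select_negatives(unlabeled, pos_probs, threshold=0.5):
    """Keep unlabeled documents estimated as negative; drop the rest.

    pos_probs[i] is an estimated P(positive | unlabeled[i]), for example
    from EM. Positive estimates are deliberately discarded.
    """
    return [d for d, p in zip(unlabeled, pos_probs) if p < threshold]


def rocchio_prototype(positives, negatives, beta=1.0, gamma=0.25):
    """Build a prototype: beta * mean(positives) - gamma * mean(negatives).

    Documents are token lists, represented by raw term frequencies.
    beta and gamma are illustrative values, not tuned settings.
    """
    proto = Counter()
    for group, weight in ((positives, beta), (negatives, -gamma)):
        if not group:
            continue
        for tokens in group:
            for w, tf in Counter(tokens).items():
                proto[w] += weight * tf / len(group)
    return proto


def score(proto, tokens):
    """Dot product between the prototype and a document's term counts."""
    return sum(proto.get(w, 0.0) * tf for w, tf in Counter(tokens).items())


def rank(proto, docs):
    """Order documents from most to least similar to the prototype."""
    return sorted(docs, key=lambda d: score(proto, d), reverse=True)
```

Only the user's labeled positives enter the positive centroid here; unlabeled documents that EM scores as positive are simply left out, which mirrors the idea of limiting EM output to negative label estimates.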

Cited By

  • (2004) Editorial. ACM SIGKDD Explorations Newsletter 6:1 (1-6). DOI: 10.1145/1007730.1007733. Online publication date: 1-Jun-2004
  • (2001) Summarization as feature selection for text categorization. Proceedings of the tenth international conference on Information and knowledge management (365-370). DOI: 10.1145/502585.502647. Online publication date: 5-Oct-2001

Published In

Information Retrieval, Volume 5, Issue 1
Jan 2002
113 pages

Publisher

Kluwer Academic Publishers
United States

Author Tags

  1. incomplete data problems
  2. imbalanced training data
  3. user modeling
  4. personalization
  5. information retrieval
