
Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

Authors:

Aleksander Kołcz,

Joshua Alspector

Information Retrieval, Volume 5, Issue 1

Pages 5 - 40

https://doi.org/10.1023/A:1012714523368

Published: 01 January 2002

Abstract

In certain classification problems there is a strong asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the unrepresented class is desired. An important representative of this group of problems is user interest/preference modeling, where there may be a large number of examples of what the user likes with essentially no counterexamples.

Recently, there has been much interest in applying the EM algorithm to incomplete-data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to user-specific document ranking. EM is used in conjunction with the Naive Bayes model, and its output is also utilized by a Support Vector Machine and Rocchio's technique.
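As a rough illustration of the setting described above, the following minimal sketch runs EM with a multinomial Naive Bayes model on labeled positive documents plus unlabeled documents. It is not the authors' implementation. The function name `em_naive_bayes`, the Laplace smoothing constant `alpha`, the fixed iteration count, and the initialization that treats every unlabeled document as negative are all assumptions made for the example; the abstract only states that several initialization schemes were examined.

```python
import math
from collections import defaultdict


def em_naive_bayes(positive_docs, unlabeled_docs, n_iter=10, alpha=1.0):
    """Return an estimate of P(positive | doc) for each unlabeled document.

    Documents are lists of tokens. Labeled positives keep weight 1.0
    throughout; only the unlabeled documents get soft labels.
    """
    vocab = sorted({w for d in positive_docs + unlabeled_docs for w in d})
    # Assumed initialization: every unlabeled document starts as negative.
    resp = [0.0] * len(unlabeled_docs)

    for _ in range(n_iter):
        # M-step: re-estimate priors and word likelihoods from soft counts.
        docs = [(d, 1.0) for d in positive_docs] + list(zip(unlabeled_docs, resp))
        mass = [0.0, 0.0]  # index 0 = negative, index 1 = positive
        counts = [defaultdict(float), defaultdict(float)]
        for doc, p_pos in docs:
            mass[1] += p_pos
            mass[0] += 1.0 - p_pos
            for w in doc:
                counts[1][w] += p_pos
                counts[0][w] += 1.0 - p_pos

        log_prior = [
            math.log((mass[c] + alpha) / (len(docs) + 2 * alpha)) for c in (0, 1)
        ]
        log_lik = []
        for c in (0, 1):
            denom = sum(counts[c].values()) + alpha * len(vocab)
            log_lik.append(
                {w: math.log((counts[c][w] + alpha) / denom) for w in vocab}
            )

        # E-step: recompute soft labels for the unlabeled documents only.
        new_resp = []
        for doc in unlabeled_docs:
            s = [log_prior[c] + sum(log_lik[c][w] for w in doc) for c in (0, 1)]
            m = max(s)  # subtract the max for numerical stability
            e_neg, e_pos = math.exp(s[0] - m), math.exp(s[1] - m)
            new_resp.append(e_pos / (e_neg + e_pos))
        resp = new_resp

    return resp


# Toy data, invented for illustration only.
positive = [
    ["sports", "football", "goal"],
    ["football", "match", "goal"],
    ["sports", "match", "team"],
]
unlabeled = [
    ["football", "goal", "team"],
    ["stocks", "market", "bank"],
    ["bank", "market", "rates"],
    ["stocks", "rates", "bank"],
]
scores = em_naive_bayes(positive, unlabeled)
# Rank unlabeled documents by estimated interest, highest first.
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
```

In this toy run the sports-like unlabeled document should drift toward the positive class while the finance documents stay negative, which is the kind of negative-class estimate the abstract refers to.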

Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is unlikely to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.
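The recommendation to keep only the negative label estimates can be sketched as follows, using a simple Rocchio-style prototype as the downstream ranker. This is a hedged illustration rather than the paper's procedure: the helper names (`tf_vector`, `centroid`, `rocchio_prototype`, `score`), the 0.5 threshold, the `beta` and `gamma` weights, and the length-normalized term-frequency weighting are all assumptions. The input `p_positive` stands for per-document estimates of P(positive | doc) produced by some EM run.

```python
import math
from collections import Counter


def tf_vector(doc):
    """Length-normalized term-frequency vector for a list of tokens."""
    counts = Counter(doc)
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}


def centroid(vectors):
    """Component-wise mean of sparse vectors stored as dicts."""
    total = Counter()
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    return {w: x / len(vectors) for w, x in total.items()}


def rocchio_prototype(positive_docs, unlabeled_docs, p_positive,
                      threshold=0.5, beta=1.0, gamma=0.5):
    """Build a prototype from true positives and EM-estimated negatives.

    Unlabeled documents estimated as positive are deliberately ignored:
    only documents with p_positive below the threshold are used, and
    only as negatives.
    """
    negatives = [d for d, p in zip(unlabeled_docs, p_positive) if p < threshold]
    pos_c = centroid([tf_vector(d) for d in positive_docs])
    neg_c = centroid([tf_vector(d) for d in negatives]) if negatives else {}
    words = set(pos_c) | set(neg_c)
    return {
        w: beta * pos_c.get(w, 0.0) - gamma * neg_c.get(w, 0.0) for w in words
    }


def score(prototype, doc):
    """Dot product between the prototype and a document vector."""
    return sum(prototype.get(w, 0.0) * x for w, x in tf_vector(doc).items())


# Toy data and soft labels, invented for illustration only.
positive = [["sports", "football", "goal"], ["football", "match", "team"]]
unlabeled = [
    ["football", "goal", "team"],
    ["stocks", "market", "bank"],
    ["bank", "market", "rates"],
]
p_positive = [0.9, 0.05, 0.02]
prototype = rocchio_prototype(positive, unlabeled, p_positive)
sports_score = score(prototype, ["football", "match", "goal"])
finance_score = score(prototype, ["stocks", "bank", "rates"])
```

Note that the first unlabeled document, although estimated as likely positive, contributes nothing to the prototype; the positive side comes solely from the labeled examples.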
