Learning from Positive and Unlabeled Examples with Different Data Distributions

Li, Xiao-Li; Liu, Bing

doi:10.1007/11564096_24

Xiao-Li Li and Bing Liu

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3720)

Included in the following conference series:

European Conference on Machine Learning

Abstract

We study the problem of learning from positive and unlabeled examples. Although several techniques exist for this problem, they all assume that the positive examples in the positive set P and the positive examples in the unlabeled set U are generated from the same distribution. This assumption may be violated in practice. For example, suppose one wants to collect all printer pages from the Web. One can use the printer pages from one site as the set P of positive pages and the product pages from another site as U, and then classify the pages in U into printer pages and non-printer pages. Although printer pages from the two sites have many similarities, they can also differ considerably, because different sites often present similar products in different styles and with different focuses. In such cases, existing methods perform poorly. This paper proposes a novel technique, A-EM, to deal with the problem. Experimental results on product page classification demonstrate the effectiveness of the proposed technique.
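
To make the setting concrete, here is a minimal Python sketch of the P/U setup from the printer-page example. It is not the A-EM technique, whose details are not given in this abstract. It is a simple baseline chosen here for illustration: every page in U is treated as a negative example, a small multinomial naive Bayes model is trained, and the pages in U are then scored. The function names (`train_nb`, `pos_score`) and the toy pages are hypothetical.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs):
    """Fit a two-class multinomial naive Bayes model with Laplace smoothing."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    total_docs = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        n_words = sum(counts.values())
        log_prior = math.log(len(docs) / total_docs)
        log_lik = {
            w: math.log((counts[w] + 1) / (n_words + len(vocab))) for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model


def pos_score(model, doc):
    """Log-odds that doc is positive; words outside the vocabulary are ignored."""
    pos_prior, pos_lik = model["pos"]
    neg_prior, neg_lik = model["neg"]
    score = pos_prior - neg_prior
    for w in doc:
        if w in pos_lik:
            score += pos_lik[w] - neg_lik[w]
    return score


# Hypothetical toy data: P holds printer pages from one site.
P = [
    ["printer", "laser", "toner", "ppm"],
    ["printer", "inkjet", "cartridge", "dpi"],
    ["printer", "laser", "duplex", "ppm"],
]
# U holds product pages from another site; pages 0 and 3 are hidden printer pages.
U = [
    ["printer", "inkjet", "wireless", "review"],
    ["camera", "lens", "zoom", "review"],
    ["laptop", "battery", "screen", "review"],
    ["printer", "laser", "toner", "review"],
]

# Baseline assumption (not from the paper): treat all of U as negative.
model = train_nb(P, U)
scores = [pos_score(model, doc) for doc in U]
for i, s in enumerate(scores):
    print(f"U[{i}] log-odds = {s:.3f}")
```

On this toy data the two hidden printer pages rank above the camera and laptop pages, yet every score is negative. Part of the reason is that a word occurring only on the second site (here "review") pulls the printer pages in U away from P, which is the kind of mismatch between P and U that the abstract describes.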

Author information

Authors and Affiliations

Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
Xiao-Li Li
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL, 60607-7053
Bing Liu

Editor information

Editors and Affiliations

Faculty of Economics of the University of Porto, Portugal
João Gama
Faculdade de Engenharia & LIAAD, Universidade do Porto, Portugal
Rui Camacho
LIAAD-INESC Porto L.A./Faculty of Economics, University of Porto, Rua de Ceuta, 118-6, 4050-190, Porto, Portugal
Pavel B. Brazdil
LIACC/FEP, Universidade do Porto, Portugal
Alípio Mário Jorge
LIAAD-INESC Porto LA / FEP, University of Porto, R. de Ceuta, 118, 6., 4050-190, Porto, Portugal
Luís Torgo

About this paper

Cite this paper

Li, XL., Liu, B. (2005). Learning from Positive and Unlabeled Examples with Different Data Distributions. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_24

DOI: https://doi.org/10.1007/11564096_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3
eBook Packages: Computer Science, Computer Science (R0)
