Learning Classification with Both Labeled and Unlabeled Data

Vittaut, Jean-Noël; Amini, Massih-Reza; Gallinari, Patrick

doi:10.1007/3-540-36755-1_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2430)

Included in the following conference series:

European Conference on Machine Learning

Abstract

A key difficulty in applying machine learning classification algorithms to many applications is that they require a large number of hand-labeled examples. Labeling large amounts of data is a costly process that in many cases is prohibitive. In this paper we show how a small number of labeled examples, used together with a large number of unlabeled examples, can produce high-accuracy classifiers. Our approach does not rely on parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning. We propose new discriminant algorithms that handle both labeled and unlabeled data for training classification models, and we analyze their performance on different information access problems, ranging from text span classification for text summarization to e-mail spam detection and text classification.
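
As an illustration of the general idea of training a discriminant classifier on both labeled and unlabeled data, the following is a minimal sketch of generic self-training (pseudo-labeling) with a logistic-regression classifier. It is not the chapter's algorithm, which the abstract does not specify. The function names (fit_logistic, self_train, make_data), the synthetic two-blob data, and all hyperparameters are assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # P(class = 1 | x) under a linear model; the bias is stored last in w.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Fit weights by batch gradient descent on the mean log-loss."""
    dim = len(X[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, t in zip(X, y):
            err = predict_proba(w, x) - t
            for j in range(dim):
                grad[j] += err * x[j]
            grad[-1] += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def self_train(X_lab, y_lab, X_unlab, rounds=5):
    """Hypothetical helper (not from the chapter): alternate between
    pseudo-labeling the unlabeled pool with the current classifier and
    refitting on labeled plus pseudo-labeled examples."""
    w = fit_logistic(X_lab, y_lab)
    for _ in range(rounds):
        pseudo = [1 if predict_proba(w, x) >= 0.5 else 0 for x in X_unlab]
        w = fit_logistic(X_lab + X_unlab, y_lab + pseudo)
    return w


def make_data(labels, rng):
    # Synthetic data (an assumption): unit-variance Gaussian blobs
    # centred at (-2, -2) for class 0 and (+2, +2) for class 1.
    X = []
    for label in labels:
        c = 2.0 if label == 1 else -2.0
        X.append([rng.gauss(c, 1.0), rng.gauss(c, 1.0)])
    return X, list(labels)


rng = random.Random(0)
X_lab, y_lab = make_data([0, 1] * 5, rng)      # 10 labeled examples
X_unlab, _ = make_data([0, 1] * 100, rng)      # 200 unlabeled examples
X_test, y_test = make_data([0, 1] * 100, rng)  # held-out evaluation set

w = self_train(X_lab, y_lab, X_unlab)
accuracy = sum(
    (predict_proba(w, x) >= 0.5) == bool(t) for x, t in zip(X_test, y_test)
) / len(X_test)
print(f"test accuracy: {accuracy:.2f}")
```

The sketch makes no generative assumption about the class-conditional densities when fitting; only the decision function is learned, and the unlabeled pool influences it through the pseudo-labels assigned at each round.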

Author information

Authors and Affiliations

Computer Science Laboratory of Paris 6 (LIP6), University of Pierre et Marie Curie, 8 rue du capitaine Scott, 75015, Paris, France
Jean-Noël Vittaut, Massih-Reza Amini & Patrick Gallinari

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, P.O. Box 26, 00014, Helsinki, Finland
Tapio Elomaa, Heikki Mannila & Hannu Toivonen

About this paper

Cite this paper

Vittaut, JN., Amini, MR., Gallinari, P. (2002). Learning Classification with Both Labeled and Unlabeled Data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Machine Learning: ECML 2002. ECML 2002. Lecture Notes in Computer Science, vol 2430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36755-1_39

DOI: https://doi.org/10.1007/3-540-36755-1_39
Published: 20 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44036-9
Online ISBN: 978-3-540-36755-0
eBook Packages: Springer Book Archive
