Abstract
A key difficulty in applying machine learning classification algorithms to many applications is that they require large numbers of hand-labeled examples. Labeling large amounts of data is costly and, in many cases, prohibitive. In this paper we show how a small amount of labeled data, used together with a large amount of unlabeled data, can yield high-accuracy classifiers. Our approach does not rely on any parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning. We propose new discriminant algorithms that handle both labeled and unlabeled data for training classification models, and we analyze their performance on different information access problems, ranging from text span classification for text summarization to e-mail spam detection and text classification.
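To make the general idea concrete, the following is a minimal sketch of one generic way a discriminative classifier can use unlabeled data: a self-training loop around logistic regression. It is an illustration only, not the algorithm proposed in the paper, which the abstract does not specify. All names (`fit_logistic`, `self_train`, `make_blobs`), the synthetic two-cluster data, and the hyperparameters are assumptions introduced for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Return the predicted class (0 or 1) for one example."""
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def self_train(labeled_x, labeled_y, unlabeled_x, rounds=5):
    """Alternate between pseudo-labeling the unlabeled pool and refitting."""
    model = fit_logistic(labeled_x, labeled_y)
    for _ in range(rounds):
        # Assign each unlabeled example the class the current model predicts.
        pseudo_y = [predict(model, x) for x in unlabeled_x]
        # Refit on the true labels plus the current pseudo-labels.
        model = fit_logistic(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return model


def make_blobs(n_per_class, seed):
    """Hypothetical toy data: two unit-variance Gaussian clusters in 2-D."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            xs.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            ys.append(label)
    return xs, ys


def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(ys)


if __name__ == "__main__":
    # Four hand-labeled points and 200 unlabeled ones (labels discarded).
    labeled_x = [[-2.0, -2.0], [-1.5, -2.5], [2.0, 2.0], [2.5, 1.5]]
    labeled_y = [0, 0, 1, 1]
    unlabeled_x, _ = make_blobs(100, seed=0)
    test_x, test_y = make_blobs(100, seed=1)
    model = self_train(labeled_x, labeled_y, unlabeled_x)
    print("test accuracy:", accuracy(model, test_x, test_y))
```

The sketch fits the conditional model directly and never estimates class-conditional densities, which is the sense in which it is discriminative rather than generative. Hard pseudo-labels are only one possible design choice; a practical system would also need a stopping criterion and some protection against reinforcing early mistakes.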
Keywords
- Support Vector Machine
- Label Data
- Unlabeled Data
- Learn Classification
- Neural Information Processing System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vittaut, JN., Amini, MR., Gallinari, P. (2002). Learning Classification with Both Labeled and Unlabeled Data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Machine Learning: ECML 2002. ECML 2002. Lecture Notes in Computer Science, vol 2430. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36755-1_39
DOI: https://doi.org/10.1007/3-540-36755-1_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44036-9
Online ISBN: 978-3-540-36755-0
eBook Packages: Springer Book Archive