Abstract
The development of data-mining applications such as text classification and molecular profiling has shown the need for machine learning algorithms that can benefit from both labeled and unlabeled data, where often the unlabeled examples greatly outnumber the labeled examples. In this paper we present a two-stage classifier that improves its predictive accuracy by making use of the available unlabeled data. It applies a weighted nearest neighbor classification algorithm with the combined example sets as its knowledge base. The examples from the unlabeled set are “pre-labeled” by an initial classifier that is built using the limited available training data. By choosing appropriate weights for this pre-labeled data, the nearest neighbor classifier consistently improves on the original classifier.
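The two-stage procedure described above can be sketched as follows. The snippet is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Euclidean distance, a decision tree as the initial classifier, and a single down-weighting factor `w_unlabeled` shared by all pre-labeled examples; all function and parameter names are hypothetical.

```python
# Sketch of the two-stage classifier: an initial model pre-labels the
# unlabeled data, then a weighted nearest neighbor classifier votes over
# the combined example set, down-weighting the pre-labeled examples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed choice of initial classifier


def weighted_knn_predict(X_train, y_train, sample_weights, X_test, k=5):
    """Predict each test point by letting its k nearest neighbors cast
    votes scaled by their per-example weights."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance (assumption)
        nn = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
        votes = {}
        for i in nn:
            votes[y_train[i]] = votes.get(y_train[i], 0.0) + sample_weights[i]
        preds.append(max(votes, key=votes.get))       # class with the largest weighted vote
    return np.array(preds)


def two_stage_classify(X_labeled, y_labeled, X_unlabeled, X_test, k=5, w_unlabeled=0.3):
    # Stage 1: build an initial classifier from the scarce labeled data
    # and use it to "pre-label" the unlabeled examples.
    initial = DecisionTreeClassifier().fit(X_labeled, y_labeled)
    y_pseudo = initial.predict(X_unlabeled)

    # Stage 2: weighted nearest neighbor over the combined example set,
    # with pre-labeled examples weighted lower than truly labeled ones.
    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, y_pseudo])
    weights = np.concatenate([np.ones(len(y_labeled)),
                              np.full(len(y_pseudo), w_unlabeled)])
    return weighted_knn_predict(X_all, y_all, weights, X_test, k=k)
```

The value of `w_unlabeled` is illustrative only; the paper's point is that an appropriately chosen weight for the pre-labeled data lets the nearest neighbor stage improve on the initial classifier.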
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Driessens, K., Reutemann, P., Pfahringer, B., Leschi, C. (2006). Using Weighted Nearest Neighbor to Benefit from Unlabeled Data. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science(), vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_10
DOI: https://doi.org/10.1007/11731139_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7