Improving Text Classification Accuracy by Training Label Cleaning

Published: 01 November 2013

Abstract

    In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semi-supervised learning and active learning are two strategies whose aim is to maximize the effectiveness of the resulting classifiers for a given amount of training effort. Both strategies have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists of devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has mislabelled them. This provides a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and evaluate them, on three widely used TC benchmarks, by their capability of spotting training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and the extent to which training label cleaning can prevent this degradation.
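The abstract above does not reproduce the paper's three boosting-based TLC techniques, but the core idea, ranking training examples by how implausible their given labels look to a trained model, can be sketched generically. The following is a hypothetical illustration only, not the authors' method: it uses cross-validated prediction confidence from a scikit-learn logistic regression on toy data, and all names and parameters in it are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data: two well-separated clusters, then deliberately flip a few
# labels to simulate annotator mistakes (as the paper does for evaluation).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
flipped = [3, 27, 81]            # indices we mislabel on purpose
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Cross-validated probability that each example belongs to class 1,
# so no example is scored by a model that saw its own label.
proba = cross_val_predict(LogisticRegression(), X, y_noisy,
                          cv=5, method="predict_proba")[:, 1]

# TLC ranking: examples whose given label the classifier finds least
# plausible come first; an annotator would inspect the top of this list.
suspicion = np.where(y_noisy == 1, 1 - proba, proba)
ranking = np.argsort(-suspicion)
print(ranking[:5])  # the flipped indices should rank near the top
```

On data this clean the deliberately flipped examples dominate the top of the ranking; real training sets are noisier, which is why the paper evaluates ranking quality rather than assuming perfect separation.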

    References

    [1]
    Abney, S., Schapire, R. E., and Singer, Y. 1999. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC’99). 38--45.
    [2]
    Agarwal, S., Godbole, S., Punjani, D., and Roy, S. 2007. How much noise is too much: A study in automatic text classification. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). 3--12.
    [3]
    Argamon-Engelson, S. and Dagan, I. 1999. Committee-based sample selection for probabilistic classifiers. J. Artif. Intell. Res. 11, 335--360.
    [4]
    Breiman, L. 1996. Bagging predictors. Machine Learning 24, 2, 123--140.
    [5]
    Brodley, C. E. and Friedl, M. A. 1996. Identifying and eliminating mislabeled training instances. In Proceedings of the 13th Conference of the American Association for Artificial Intelligence (AAAI’96). 799--805.
    [6]
    Chapelle, O., Schölkopf, B., and Zien, A., Eds. 2006. Semi-Supervised Learning. MIT Press, Cambridge, MA.
    [7]
    Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Machine Learn. 15, 2, 201--221.
    [8]
    Dickinson, M. and Meurers, W. D. 2003. Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03). 107--114.
    [9]
    Dietterich, T. G. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learn. 40, 2, 139--157.
    [10]
    Eskin, E. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’00). 148--153.
    [11]
    Esuli, A. and Sebastiani, F. 2009. Training data cleaning for text classification. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR’09). 29--41.
    [12]
    Esuli, A. and Sebastiani, F. 2010. Machines that learn how to code open-ended survey data. Int. J. Market Res. 52, 6, 775--800.
    [13]
    Esuli, A., Fagni, T., and Sebastiani, F. 2006. MP-Boost: A multiple-pivot boosting algorithm and its application to text categorization. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE’06). 1--12.
    [14]
    Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. 1992. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems, Vol. 5, MIT Press, Cambridge, MA, 483--490.
    [15]
    Friedman, J., Hastie, T., and Tibshirani, R. J. 2000. Additive logistic regression: A statistical view of boosting. Ann. Statist. 28, 2, 337--407.
    [16]
    Fukumoto, F. and Suzuki, Y. 2004. Correcting category errors in text classification. In Proceedings of the 20th International Conference on Computational Linguistics (COLING’04). 868--874.
    [17]
    Galavotti, L., Sebastiani, F., and Simi, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL’00). 59--68.
    [18]
    Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1, 1--58.
    [19]
    Grady, C. and Lease, M. 2010. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 172--179.
    [20]
    Hersh, W., Buckley, C., Leone, T., and Hickman, D. 1994. OHSUMED: An interactive retrieval evaluation and new large text collection for research. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR’94). 192--201.
    [21]
    Järvelin, K. and Kekäläinen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR’00). 41--48.
    [22]
    John, G. H. 1995. Robust decision trees: Removing outliers from databases. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD’95). 174--179.
    [23]
    Lewis, D. D. 2004. Reuters-21578 text categorization test collection Distribution 1.0 README file (v 1.3). http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt.
    [24]
    Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of the 19th ACM International Conference on Research and Development in Information Retrieval (SIGIR’96). 298--306.
    [25]
    Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5, 361--397.
    [26]
    Maclin, R. and Opitz, D. W. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the 14th Conference of the American Association for Artificial Intelligence (AAAI’97). 546--551.
    [27]
    Malik, H. H. and Bhardwaj, V. S. 2011. Automatic training data cleaning for text classification. In Proceedings of the ICDM Workshop on Domain-Driven Data Mining. 442--449.
    [28]
    Murata, M., Utiyama, M., Uchimoto, K., Isahara, H., and Ma, Q. 2005. Correction of errors in a verb modality corpus for machine translation with a machine-learning method. ACM Trans. Asian Lang. Inform. Process. 4, 1, 18--37.
    [29]
    Nakagawa, T. and Matsumoto, Y. 2002. Detecting errors in corpora using support vector machines. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02). 1--7.
    [30]
    Resta, G. 2012. On the expected average precision of the random ranker. Tech. rep. IIT TR-04/2012, Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, IT. http://www.iit.cnr.it/sites/default/files/TR-04-2012.pdf.
    [31]
    Schapire, R. and Singer, Y. 1999. Improved boosting using confidence-rated predictions. Machine Learn. 37, 3, 297--336.
    [32]
    Schapire, R. E. and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learn. 39, 2/3, 135--168.
    [33]
    Schapire, R. E. and Freund, Y. 2012. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA.
    [34]
    Shinnou, H. 2001. Detection of errors in training data by using a decision list and Adaboost. In Proceedings of the IJCAI Workshop on Text Learning Beyond Supervision.
    [35]
    Sindhwani, V. and Keerthi, S. S. 2006. Large scale semi-supervised linear SVMs. In Proceedings of the 29th ACM International Conference on Research and Development in Information Retrieval (SIGIR’06). 477--484.
    [36]
    Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’08). 254--263.
    [37]
    Vinciarelli, A. 2005. Noisy text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 27, 12, 1882--1895.
    [38]
    Yang, Y. 1994. Expert network: Effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR’94). 13--22.
    [39]
    Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inf. Retriev. 1, 1/2, 69--90.
    [40]
    Yih, W.-T., McCann, R., and Kolcz, A. 2007. Improving spam filtering by detecting gray mail. In Proceedings of the 4th Conference on Email and Anti-Spam (CEAS’07).
    [41]
    Yokoyama, M., Matsui, T., and Ohwada, H. 2005. Detecting and revising misclassifications using ILP. In Proceedings of the 8th International Conference on Discovery Science (DS’05). 75--80.
    [42]
    Yu, K., Zhu, S., Xu, W., and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proceedings of the 31st ACM International Conference on Research and Development in Information Retrieval (SIGIR’08). 635--642.
    [43]
    Zeng, X. and Martinez, T. R. 2001. An algorithm for correcting mislabeled data. Intell. Data Anal. 5, 6, 491--502.
    [44]
    Zhu, X. and Goldberg, A. B. 2009. Introduction to Semi-Supervised Learning. Morgan and Claypool, San Rafael, CA.


    Reviews

    Jun Ping Ng

    A large-scale study on the use of training label cleaning (TLC) to improve text classification is described in this paper. The purpose of TLC is to identify potentially mislabeled instances in a training dataset and to flag them for closer inspection by human annotators. The underlying premise is that incorrect annotations can have a significant, adverse impact on the performance of classifiers. TLC is slightly different from active learning, where potentially useful, unlabeled instances are flagged for human annotation. The paper makes use of several well-known datasets and examines the impact that incorrect annotations can have on classifier performance. The authors also detail three main techniques for TLC and evaluate how these can help identify instances of incorrect annotations, resulting in improvements to text classification performance.

    This well-written paper was a joy to read. The experiments are extensive and sound. The authors share many useful insights into the importance of annotation integrity, and also present an illuminating discussion of the results they obtained. Readers who want to find out more about TLC may be slightly disappointed, as the paper does not go into much depth on the actual techniques used; however, TLC is already well covered in existing literature [1,2], so this is not a big problem. Some parts of the methodology and experiments could have been better structured for a more fluent read (for example, the section on using support vector machines (SVMs) to refute doubts about the use of MP-Boost seems a lot like an afterthought), but the paper is worth reading nonetheless for the many observations and insights it contains.

    Online Computing Reviews Service



    Published In

    ACM Transactions on Information Systems, Volume 31, Issue 4
    November 2013
    192 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2536736
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 November 2013
    Accepted: 01 June 2013
    Revised: 01 April 2013
    Received: 01 June 2012
    Published in TOIS Volume 31, Issue 4


    Author Tags

    1. Text classification
    2. supervised learning
    3. synthetic noise
    4. training label cleaning
    5. training label noise

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Improving Semi-Supervised Text Classification with Dual Meta-Learning. ACM Transactions on Information Systems 42, 4, 1--28. DOI: 10.1145/3648612. Online publication date: 20-Feb-2024.
    • (2022) The Road Ahead. In Learning to Quantify, 121--123. DOI: 10.1007/978-3-031-20467-8_7. Online publication date: 29-Dec-2022.
    • (2022) The Quantification Landscape. In Learning to Quantify, 103--120. DOI: 10.1007/978-3-031-20467-8_6. Online publication date: 29-Dec-2022.
    • (2022) Advanced Topics. In Learning to Quantify, 87--101. DOI: 10.1007/978-3-031-20467-8_5. Online publication date: 29-Dec-2022.
    • (2022) Methods for Learning to Quantify. In Learning to Quantify, 55--85. DOI: 10.1007/978-3-031-20467-8_4. Online publication date: 29-Dec-2022.
    • (2022) Evaluation of Quantification Algorithms. In Learning to Quantify, 33--54. DOI: 10.1007/978-3-031-20467-8_3. Online publication date: 29-Dec-2022.
    • (2022) Applications of Quantification. In Learning to Quantify, 19--31. DOI: 10.1007/978-3-031-20467-8_2. Online publication date: 29-Dec-2022.
    • (2022) The Case for Quantification. In Learning to Quantify, 1--17. DOI: 10.1007/978-3-031-20467-8_1. Online publication date: 29-Dec-2022.
    • (2020) Natural Language Processing-Based Information Extraction and Abstraction for Lease Documents. In Neural Networks for Natural Language Processing, 170--187. DOI: 10.4018/978-1-7998-1159-6.ch011. Online publication date: 2020.
    • (2019) Identifying Mislabeled Instances in Classification Datasets. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), 1--8. DOI: 10.1109/IJCNN.2019.8851920. Online publication date: Jul-2019.
