Improving Text Classification Accuracy by Training Label Cleaning

Published: 01 November 2013

Abstract

    In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semi-supervised learning and active learning are two strategies whose aim is to maximize the effectiveness of the resulting classifiers for a given amount of training effort. Both strategies have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists of devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has mislabelled them. This provides a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods for multilabel classification, we present three different techniques for performing TLC and evaluate them, on three widely used TC benchmarks, by their capability of spotting training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and the extent to which training label cleaning can prevent this degradation.
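The abstract above does not reproduce the paper's three boosting-based TLC techniques, but the core idea, ranking training examples by how implausible their given labels look to a trained model, can be sketched generically. The following is a hypothetical illustration only, not the authors' method: it uses cross-validated prediction confidence from a scikit-learn logistic regression on toy data, and all names and parameters in it are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data: two well-separated clusters, then deliberately flip a few
# labels to simulate annotator mistakes (as the paper does for evaluation).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
flipped = [3, 27, 81]            # indices we mislabel on purpose
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Cross-validated probability that each example belongs to class 1,
# so no example is scored by a model that saw its own label.
proba = cross_val_predict(LogisticRegression(), X, y_noisy,
                          cv=5, method="predict_proba")[:, 1]

# TLC ranking: examples whose given label the classifier finds least
# plausible come first; an annotator would inspect the top of this list.
suspicion = np.where(y_noisy == 1, 1 - proba, proba)
ranking = np.argsort(-suspicion)
print(ranking[:5])  # the flipped indices should rank near the top
```

On data this clean the deliberately flipped examples dominate the top of the ranking; real training sets are noisier, which is why the paper evaluates ranking quality rather than assuming perfect separation.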

    References

    [1]
    Abney, S., Schapire, R. E., and Singer, Y. 1999. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC’99). 38--45.
    [2]
    Agarwal, S., Godbole, S., Punjani, D., and Roy, S. 2007. How much noise is too much: A study in automatic text classification. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). 3--12.
    [3]
    Argamon-Engelson, S. and Dagan, I. 1999. Committee-based sample selection for probabilistic classifiers. J. Artif. Intell. Res. 11, 335--360.
    [4]
    Breiman, L. 1996. Bagging predictors. Machine Learning 24, 2, 123--140.
    [5]
    Brodley, C. E. and Friedl, M. A. 1996. Identifying and eliminating mislabeled training instances. In Proceedings of the 13th Conference of the American Association for Artificial Intelligence (AAAI’96). 799--805.
    [6]
    Chapelle, O., Schölkopf, B., and Zien, A., Eds. 2006. Semi-Supervised Learning. MIT Press, Cambridge, MA.
    [7]
    Cohn, D., Atlas, L., and Ladner, R. 1994. Improving generalization with active learning. Machine Learn. 15, 2, 201--221.
    [8]
    Dickinson, M. and Meurers, W. D. 2003. Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL’03). 107--114.
    [9]
    Dietterich, T. G. 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learn. 40, 2, 139--157.
    [10]
    Eskin, E. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’00). 148--153.
    [11]
    Esuli, A. and Sebastiani, F. 2009. Training data cleaning for text classification. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (ICTIR’09). 29--41.
    [12]
    Esuli, A. and Sebastiani, F. 2010. Machines that learn how to code open-ended survey data. Int. J. Market Res. 52, 6, 775--800.
    [13]
    Esuli, A., Fagni, T., and Sebastiani, F. 2006. MP-Boost: A multiple-pivot boosting algorithm and its application to text categorization. In Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE’06). 1--12.
    [14]
    Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. 1992. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems, Vol. 5, MIT Press, Cambridge, MA, 483--490.
    [15]
    Friedman, J., Hastie, T., and Tibshirani, R. J. 2000. Additive logistic regression: A statistical view of boosting. Ann. Statist. 28, 2, 337--407.
    [16]
    Fukumoto, F. and Suzuki, Y. 2004. Correcting category errors in text classification. In Proceedings of the 20th International Conference on Computational Linguistics (COLING’04). 868--874.
    [17]
    Galavotti, L., Sebastiani, F., and Simi, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL’00). 59--68.
    [18]
    Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1, 1--58.
    [19]
    Grady, C. and Lease, M. 2010. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 172--179.
    [20]
    Hersh, W., Buckley, C., Leone, T., and Hickman, D. 1994. OHSUMED: An interactive retrieval evaluation and new large text collection for research. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR’94). 192--201.
    [21]
    Järvelin, K. and Kekäläinen, J. 2000. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR’00). 41--48.
    [22]
    John, G. H. 1995. Robust decision trees: Removing outliers from databases. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD’95). 174--179.
    [23]
    Lewis, D. D. 2004. Reuters-21578 text categorization test collection Distribution 1.0 README file (v 1.3). http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt.
    [24]
    Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. 1996. Training algorithms for linear text classifiers. In Proceedings of the 19th ACM International Conference on Research and Development in Information Retrieval (SIGIR’96). 298--306.
    [25]
    Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. J. Machine Learn. Res. 5, 361--397.
    [26]
    Maclin, R. and Opitz, D. W. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the 14th Conference of the American Association for Artificial Intelligence (AAAI’97). 546--551.
    [27]
    Malik, H. H. and Bhardwaj, V. S. 2011. Automatic training data cleaning for text classification. In Proceedings of the ICDM Workshop on Domain-Driven Data Mining. 442--449.
    [28]
    Murata, M., Utiyama, M., Uchimoto, K., Isahara, H., and Ma, Q. 2005. Correction of errors in a verb modality corpus for machine translation with a machine-learning method. ACM Trans. Asian Lang. Inform. Process. 4, 1, 18--37.
    [29]
    Nakagawa, T. and Matsumoto, Y. 2002. Detecting errors in corpora using support vector machines. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02). 1--7.
    [30]
    Resta, G. 2012. On the expected average precision of the random ranker. Tech. rep. IIT TR-04/2012, Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, IT. http://www.iit.cnr.it/sites/default/files/TR-04-2012.pdf.
    [31]
    Schapire, R. and Singer, Y. 1999. Improved boosting using confidence-rated predictions. Machine Learn. 37, 3, 297--336.
    [32]
    Schapire, R. E. and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learn. 39, 2/3, 135--168.
    [33]
    Schapire, R. E. and Freund, Y. 2012. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA.
    [34]
    Shinnou, H. 2001. Detection of errors in training data by using a decision list and Adaboost. In Proceedings of the IJCAI Workshop on Text Learning Beyond Supervision.
    [35]
    Sindhwani, V. and Keerthi, S. S. 2006. Large scale semi-supervised linear SVMs. In Proceedings of the 29th ACM International Conference on Research and Development in Information Retrieval (SIGIR’06). 477--484.
    [36]
    Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. 2008. Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’08). 254--263.
    [37]
    Vinciarelli, A. 2005. Noisy text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 27, 12, 1882--1895.
    [38]
    Yang, Y. 1994. Expert network: Effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR’94). 13--22.
    [39]
    Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Inf. Retriev. 1, 1/2, 69--90.
    [40]
    Yih, W.-T., McCann, R., and Kolcz, A. 2007. Improving spam filtering by detecting gray mail. In Proceedings of the 4th Conference on Email and Anti-Spam (CEAS’07).
    [41]
    Yokoyama, M., Matsui, T., and Ohwada, H. 2005. Detecting and revising misclassifications using ILP. In Proceedings of the 8th International Conference on Discovery Science (DS’05). 75--80.
    [42]
    Yu, K., Zhu, S., Xu, W., and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In Proceedings of the 31st ACM International Conference on Research and Development in Information Retrieval (SIGIR’08). 635--642.
    [43]
    Zeng, X. and Martinez, T. R. 2001. An algorithm for correcting mislabeled data. Intell. Data Anal. 5, 6, 491--502.
    [44]
    Zhu, X. and Goldberg, A. B. 2009. Introduction to Semi-Supervised Learning. Morgan and Claypool, San Rafael, CA.


    Reviews

    Jun Ping Ng

    A large-scale study on the use of training label cleaning (TLC) to improve text classification is described in this paper. The purpose of TLC is to identify potentially mislabeled instances in a training dataset and to flag them for closer inspection by human annotators. The underlying premise is that incorrect annotations can have a significant, adverse impact on the performance of classifiers. TLC is slightly different from active learning, where potentially useful, unlabeled instances are flagged for human annotation. The paper makes use of several well-known datasets and examines the impact that incorrect annotations can have on classifier performance. The authors also detail three main techniques for TLC and evaluate how these can help identify instances of incorrect annotations, resulting in improvements to text classification performance.

    This well-written paper was a joy to read. The experiments are extensive and sound. The authors share many useful insights into the importance of annotation integrity, and also present an illuminating discussion of the results they obtained. Readers who want to find out more about TLC may be slightly disappointed, as the paper does not go into much depth on the actual techniques used; however, TLC is already well covered in existing literature [1,2], so this is not a big problem. Some parts of the methodology and experiments could have been better structured for a more fluent read (for example, the section on using support vector machines (SVMs) to refute doubts about the use of MP-Boost seems a lot like an afterthought), but the paper is worth reading nonetheless for the many observations and insights it contains.

    Online Computing Reviews Service



    Published In

    ACM Transactions on Information Systems, Volume 31, Issue 4
    November 2013
    192 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2536736
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 November 2013
    Accepted: 01 June 2013
    Revised: 01 April 2013
    Received: 01 June 2012
    Published in TOIS Volume 31, Issue 4


    Author Tags

    1. Text classification
    2. supervised learning
    3. synthetic noise
    4. training label cleaning
    5. training label noise

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Improving Semi-Supervised Text Classification with Dual Meta-Learning. ACM Transactions on Information Systems 42, 4, 1--28. DOI: 10.1145/3648612. Online publication date: 20-Feb-2024.
    • (2022) The Road Ahead. In Learning to Quantify, 121--123. DOI: 10.1007/978-3-031-20467-8_7. Online publication date: 29-Dec-2022.
    • (2022) The Quantification Landscape. In Learning to Quantify, 103--120. DOI: 10.1007/978-3-031-20467-8_6. Online publication date: 29-Dec-2022.
    • (2022) Advanced Topics. In Learning to Quantify, 87--101. DOI: 10.1007/978-3-031-20467-8_5. Online publication date: 29-Dec-2022.
    • (2022) Methods for Learning to Quantify. In Learning to Quantify, 55--85. DOI: 10.1007/978-3-031-20467-8_4. Online publication date: 29-Dec-2022.
    • (2022) Evaluation of Quantification Algorithms. In Learning to Quantify, 33--54. DOI: 10.1007/978-3-031-20467-8_3. Online publication date: 29-Dec-2022.
    • (2022) Applications of Quantification. In Learning to Quantify, 19--31. DOI: 10.1007/978-3-031-20467-8_2. Online publication date: 29-Dec-2022.
    • (2022) The Case for Quantification. In Learning to Quantify, 1--17. DOI: 10.1007/978-3-031-20467-8_1. Online publication date: 29-Dec-2022.
    • (2020) Natural Language Processing-Based Information Extraction and Abstraction for Lease Documents. In Neural Networks for Natural Language Processing, 170--187. DOI: 10.4018/978-1-7998-1159-6.ch011. Online publication date: 2020.
    • (2019) Identifying Mislabeled Instances in Classification Datasets. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), 1--8. DOI: 10.1109/IJCNN.2019.8851920. Online publication date: Jul-2019.
