Short Paper

Supervised Contrast Learning Text Classification Model Based on Data Quality Augmentation

Published: 10 May 2024

Abstract

Token-level data augmentation generates new text samples by modifying words in existing sentences. However, samples that are hard to classify can negatively affect the model; in particular, random augmentation operations that ignore the role of keywords may produce low-quality augmented samples. We therefore propose a supervised contrastive learning text classification model based on data-quality-aware augmentation. First, the model's training dynamics are used to screen the dataset for high-quality samples that carry information beneficial to training. The selected samples are then augmented around important words that carry label information. To obtain text representations better suited to the downstream classification task, the model is trained with a standard supervised contrastive loss. Finally, experiments on five text classification datasets validate the effectiveness of our model, and ablation studies verify the contribution of each module to classification performance.
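To make the two technical steps in the abstract concrete, here are two minimal Python sketches. Both are illustrative readings, not the authors' implementation; the function names, the keep_fraction and temperature hyperparameters, and the batching are assumptions.

The first sketch shows a cartography-style way to screen for high-quality training samples from training dynamics, assuming the model's probability for each example's gold label has been logged at every epoch:

```python
import numpy as np

def screen_high_quality(gold_probs: np.ndarray, keep_fraction: float = 0.8):
    """Screen training examples using training dynamics.

    gold_probs: (num_epochs, num_examples) array of the model's probability
    for each example's gold label, recorded at the end of each epoch.
    Returns indices of the retained (high-quality) examples.
    """
    confidence = gold_probs.mean(axis=0)   # mean gold-label probability
    variability = gold_probs.std(axis=0)   # spread across epochs (diagnostic)
    # Drop the lowest-confidence tail, where mislabeled or otherwise
    # unhelpful examples tend to concentrate.
    order = np.argsort(-confidence)        # indices by descending confidence
    n_keep = int(len(order) * keep_fraction)
    return order[:n_keep], confidence, variability
```

The second sketch is the standard batch-wise supervised contrastive loss the abstract refers to, which pulls same-class embeddings together and pushes different-class embeddings apart:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """z: (N, d) sentence embeddings; labels: (N,) integer class labels."""
    z = F.normalize(z, dim=1)                        # cosine-similarity space
    sim = z @ z.t() / temperature                    # (N, N) similarity logits
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other in-batch examples sharing the anchor's label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)    # guard singleton classes
    # Mean log-probability over each anchor's positives, negated and averaged.
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts)
    return loss.mean()
```

In practice such a loss is typically combined with a cross-entropy term when fine-tuning an encoder such as BERT, with the temperature tuned on a validation set.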


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 5
    May 2024, 297 pages
    EISSN: 2375-4702
    DOI: 10.1145/3613584
    Editor: Imed Zitouni

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 10 May 2024
    Online AM: 19 March 2024
    Accepted: 17 March 2024
    Revised: 17 February 2024
    Received: 07 May 2023
    Published in TALLIP Volume 23, Issue 5

    Author Tags

    1. Text augmentation
    2. Data quality
    3. Text classification
    4. Contrast learning

    Qualifiers

    • Short Paper

    Funding Sources

    • Science and Technology Bureau of Changchun City
    • Jilin Province Development and Reform Commission
