
Explainability-Based Mix-Up Approach for Text Data Augmentation

Published: 20 February 2023

Abstract

Text augmentation is a strategy for increasing the diversity of training examples without explicitly collecting new data. Owing to its efficiency and effectiveness, numerous text augmentation methodologies have been proposed. Among them, modification-based methods, in particular mix-up methods that swap words between two or more sentences, are widely used because they are simple to apply and perform well. However, existing mix-up approaches are limited in that they do not reflect the importance of the manipulated words: even when a word that critically affects the classification result is manipulated, this is not taken into account when labeling the augmented data. In this study, we therefore propose an effective text augmentation technique that explicitly derives the importance of each manipulated word and reflects it in the labeling of the augmented data. The importance of each word, that is, its explainability, is calculated and explicitly incorporated into the labeling process. Experimental results confirm that reflecting the importance of the manipulated words in the labeling yields significantly higher performance than existing methods.
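The idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the random choice of swap positions, and the rule that mixes the two source labels in proportion to the importance mass each sentence retains are all assumptions; the per-word importance scores themselves would come from some explainability method (e.g. LIME- or SHAP-style attributions).

```python
# Hedged sketch of an explainability-weighted mix-up for text augmentation.
# Assumptions (not taken from the paper): swap positions are chosen at
# random, and the soft label is the importance-mass-weighted mixture of
# the two source labels rather than a count-based mixture.

import random

def explainability_mixup(tokens_a, tokens_b, imp_a, imp_b,
                         label_a, label_b, n_swaps=2, seed=0):
    """Swap n_swaps positions of tokens_a with words from tokens_b,
    then soft-label the result by the retained importance mass."""
    rng = random.Random(seed)
    n = min(len(tokens_a), len(tokens_b))
    positions = rng.sample(range(n), min(n_swaps, n))

    mixed = list(tokens_a)
    for p in positions:
        mixed[p] = tokens_b[p]  # inject words from the second sentence

    # Importance mass each source sentence contributes to the mixed one.
    mass_a = sum(imp_a) - sum(imp_a[p] for p in positions)
    mass_b = sum(imp_b[p] for p in positions)
    lam = mass_a / (mass_a + mass_b) if (mass_a + mass_b) > 0 else 0.5

    # Soft label: per-class mixture weighted by importance, not by word count.
    soft_label = {c: lam * label_a.get(c, 0.0) + (1 - lam) * label_b.get(c, 0.0)
                  for c in set(label_a) | set(label_b)}
    return mixed, soft_label
```

The key contrast with plain mix-up is the computation of `lam`: if a swapped-in word carries high importance, the donor sentence's label receives proportionally more weight in the soft label, instead of every swapped word counting equally.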





    Published In

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 1
January 2023
375 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3572846

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 February 2023
    Online AM: 27 April 2022
    Accepted: 21 April 2022
    Revised: 02 March 2022
    Received: 27 October 2021
    Published in TKDD Volume 17, Issue 1


    Author Tags

    1. Text augmentation
    2. mix-up approach
    3. XAI
    4. soft-labeling
    5. word-explainability

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT)

Article Metrics

    • Downloads (Last 12 months): 550
    • Downloads (Last 6 weeks): 36

    Reflects downloads up to 26 Sep 2024

Cited By

    • (2024) Recent Applications of Explainable AI (XAI): A Systematic Literature Review. Applied Sciences 14, 19 (8884). DOI: 10.3390/app14198884. Online publication date: 2-Oct-2024.
    • (2024) Few-shot biomedical relation extraction using data augmentation and domain information. Neurocomputing 595 (127881). DOI: 10.1016/j.neucom.2024.127881. Online publication date: Aug-2024.
    • (2024) Explainable AI models for predicting drop coalescence in microfluidics device. Chemical Engineering Journal 481 (148465). DOI: 10.1016/j.cej.2023.148465. Online publication date: Feb-2024.
    • (2024) Shapley visual transformers for image-to-text generation. Applied Soft Computing 166 (112205). DOI: 10.1016/j.asoc.2024.112205. Online publication date: Nov-2024.
    • (2024) Advanced pseudo-labeling approach in mixing-based text data augmentation method. Pattern Analysis and Applications 27, 4. DOI: 10.1007/s10044-024-01340-6. Online publication date: 30-Sep-2024.
    • (2023) A Method for Extrapolating Continuous Functions by Generating New Training Samples for Feedforward Artificial Neural Networks. Axioms 12, 8 (759). DOI: 10.3390/axioms12080759. Online publication date: 1-Aug-2023.
    • (2023) OPTIMA-DEM: An Optimized Threat Behavior Prediction Method using DEMATEL-ISM. 2023 IEEE 12th International Conference on Cloud Networking (CloudNet), 413-417. DOI: 10.1109/CloudNet59005.2023.10490058. Online publication date: 1-Nov-2023.
    • (2023) SRL-ACO. Journal of King Saud University - Computer and Information Sciences 35, 7. DOI: 10.1016/j.jksuci.2023.101611. Online publication date: 1-Jul-2023.
