A Comparative Survey of Instance Selection Methods applied to Non-Neural and Transformer-Based Text Classification

Published: 13 July 2023

Abstract

Progress in natural language processing has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by deep learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate this problem is through data engineering, rather than the algorithmic or hardware perspectives. Our focus here is an under-investigated data engineering technique with enormous potential in the current scenario – Instance Selection (IS) (a.k.a. Selective Sampling, Prototype Selection). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. We survey classical and recent state-of-the-art IS techniques and provide a scientifically sound comparison of IS methods applied to an essential natural language processing task—Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art ATC solutions and many datasets, and answer several research questions based on tradeoffs induced by a tripod: training set reduction, effectiveness, and efficiency. Our answers reveal an enormous unfulfilled potential for IS solutions. Specifically, we show that in 12 out of 19 datasets, specific IS methods—namely, Condensed Nearest Neighbor, Local Set-based Smoother, and Local Set Border Selector—can reduce the size of the training set without effectiveness losses. Furthermore, when fine-tuning Transformer methods, IS reduces the amount of data needed without losing effectiveness and with considerable training-time gains.
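To make the IS idea concrete, the oldest method highlighted in the abstract, Condensed Nearest Neighbor (Hart, 1968), can be sketched in a few lines: starting from a single instance, keep adding to the retained subset any training instance that a 1-NN classifier over the current subset misclassifies, until a full pass adds nothing. The sketch below is illustrative, not the survey's implementation; the function name and the synthetic dataset are our own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

def condensed_nearest_neighbor(X, y, seed=0):
    """Hart's CNN rule: return indices of a subset S of the training
    set such that 1-NN over S classifies every instance in X correctly."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    keep = [order[0]]                 # seed the subset with one instance
    changed = True
    while changed:                    # repeat until a pass absorbs nothing
        changed = False
        for i in order:
            knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
            if knn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)        # absorb the misclassified instance
                changed = True
    return np.array(keep)

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
idx = condensed_nearest_neighbor(X, y)
print(f"kept {len(idx)} of {len(X)} training instances")
```

By construction the retained subset preserves the 1-NN decision on the original training set while typically being much smaller, which is exactly the reduction-versus-effectiveness tradeoff the survey's tripod evaluates.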




      Published In

      ACM Computing Surveys, Volume 55, Issue 13s
      December 2023, 1367 pages
      ISSN: 0360-0300
      EISSN: 1557-7341
      DOI: 10.1145/3606252

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 July 2023
      Online AM: 24 January 2023
      Accepted: 18 January 2023
      Revised: 04 January 2023
      Received: 10 February 2022
      Published in CSUR Volume 55, Issue 13s


      Author Tags

      1. Instance selection
      2. Text classification
      3. Comparative study

      Qualifiers

      • Survey

      Funding Sources

      • CNPq
      • CAPES
      • FAPEMIG
      • Amazon Web Services
      • NVIDIA
      • Google Research Awards


      Cited By

      • (2024) Análise Comparativa de Métodos de Undersampling em Classificação Automática de Texto Baseada em Transformers. Revista Eletrônica de Iniciação Científica em Computação 2(1), 1–10. DOI: 10.5753/reic.2024.46432. Online: 28 Jun 2024
      • (2024) Pipelining Semantic Expansion and Noise Filtering for Sentiment Analysis of Short Documents – CluSent Method. Journal on Interactive Systems 15(1), 561–575. DOI: 10.5753/jis.2024.4117. Online: 11 Jun 2024
      • (2024) Genre Classification of Books in Russian with Stylometric Features: A Case Study. Information 15(6), 340. DOI: 10.3390/info15060340. Online: 7 Jun 2024
      • (2024) Simultaneous Instance and Attribute Selection for Noise Filtering. Applied Sciences 14(18), 8459. DOI: 10.3390/app14188459. Online: 19 Sep 2024
      • (2024) A selective LVQ algorithm for improving instance reduction techniques and its application for text classification. Journal of Intelligent & Fuzzy Systems, 1–14. DOI: 10.3233/JIFS-235290. Online: 24 Apr 2024
      • (2024) Ant-Based Feature and Instance Selection for Multiclass Imbalanced Data. IEEE Access 12, 133952–133968. DOI: 10.1109/ACCESS.2024.3418669
      • (2024) On Representation Learning-based Methods for Effective, Efficient, and Scalable Code Retrieval. Neurocomputing 600, 128172. DOI: 10.1016/j.neucom.2024.128172. Online: Oct 2024
      • (2024) Pose pattern mining using transformer for motion classification. Applied Intelligence 54(5), 3841–3858. DOI: 10.1007/s10489-024-05325-0. Online: 12 Mar 2024
      • (2024) Will sentiment analysis need subculture? A new data augmentation approach. Journal of the Association for Information Science and Technology 75(6), 655–670. DOI: 10.1002/asi.24872. Online: 18 Jan 2024
      • (2023) An Effective, Efficient, and Scalable Confidence-based Instance Selection Framework for Transformer-Based Text Classification. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 665–674. DOI: 10.1145/3539618.3591638. Online: 19 Jul 2023
