Abstract
Labeling data can be an expensive task, as it is usually performed manually by domain experts. This is cumbersome for deep learning, which depends on large labeled datasets. Active learning (AL) is a paradigm that aims to reduce labeling effort by letting the model select the data it deems most informative. Little research has been done on AL in a text classification setting, and next to none has involved the more recent, state-of-the-art Natural Language Processing (NLP) models. Here, we present an empirical study that compares different uncertainty-based algorithms with BERT\(_{base}\) as the classifier. We evaluate the algorithms on two NLP classification datasets: Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to solve presupposed problems of uncertainty-based AL, namely that it is unscalable and that it is prone to selecting outliers. Furthermore, we explore the influence of the query-pool size on the performance of AL. Although the proposed heuristics did not improve the performance of AL, our results show that uncertainty-based AL with BERT\(_{base}\) outperforms random sampling of data. This difference in performance can decrease as the query-pool size gets larger.
Notes
1. For our experiments, this resulted in our n ranging from 20 to 191 for the SST dataset and from 17 to 152 for the KvK dataset (the values of q used can be found in Sect. 3.5).
2. Larger values up to 100 were tested, but they induced much larger training times without noteworthy performance gains.
Acknowledgments
We would like to express our thanks and gratitude to the people at Dialogic (Utrecht), Nick Jelicic in particular, for the useful advice on the writing style of the paper and the suggested improvements to the source code.
Appendix
A.1 RET Algorithm Computational Cost Analysis
The number of forward passes required by the RET algorithm depends on two factors:
1. Basic passes: the forward passes required by the “normal” computation of uncertainty at the beginning of the computation for every query-pool.
2. RP passes: the forward passes required for intermediate updates, using the redundancy pool RP.
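To make the “basic passes” concrete, one round of uncertainty-based selection can be sketched as follows. This is a minimal illustration, not the paper's implementation: predictive entropy is used as one common uncertainty measure, and the function names are ours.

```python
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution; a common
    uncertainty score for active learning."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_query_pool(pool_probs, q):
    """One 'basic pass' round: score every unlabeled example once
    (one forward pass each) and return the indices of the q most
    uncertain examples."""
    scores = [(entropy(p), i) for i, p in enumerate(pool_probs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:q]]

# Toy example: four examples with binary class probabilities.
probs = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.55, 0.45]]
selected = select_query_pool(probs, 2)  # the two most uncertain
```

Scoring the whole unlabeled pool once per round is what makes the basic passes proportional to the pool size.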
In this analysis we assume that the size of the redundancy pool \(|\mathcal {RP}|\) is chosen as a factor \(f > 1\) of the size of the query-pool q. This is a reasonable assumption: making \(|\mathcal {RP}|\) larger than needed incurs unnecessary computational cost, whereas too small a value is expected to diminish the effect of the RET algorithm. Given this assumption, and assuming a fixed total number of examples to label, two factors influence the required number of RP passes:
- Linearly increasing the query-pool size and the coupled redundancy-pool size causes a quadratic increase in the number of required forward passes per query-pool round.
- At the same time, a linearly increased query-pool size induces a corresponding linear decrease in the number of required query-pool rounds.
Together, these two factors cause a net linear contribution of the RP passes to the total number of forward passes, which starts causing a net increase of total passes once the query-pool size exceeds a certain value. Looking at (1) more precisely, the number of passes over \(\mathcal {RP}\) that needs to be performed per query-pool round can be computed as an arithmetic progression:

$$\sum _{i=0}^{q} \left( |\mathcal {RP}| - i \right) = \frac{(q+1)\left( fq + (f-1)q \right) }{2} = \frac{f' q (q+1)}{2}, \quad \text {with } f' = 2f - 1.$$
Let’s assume we use \(f = 1.5\) (as also used in our experiments), and consequently, \(f' = 2f - 1 = 2\). The number of forward passes over \(\mathcal {RP}\) then becomes exactly \(q^2 + q\).
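This count is easy to check numerically. The sketch below sums the arithmetic progression directly (the helper name is ours) and confirms that for \(f = 1.5\) the per-round RP passes equal \(q^2 + q\):

```python
def rp_passes(q, f=1.5):
    """Forward passes over RP per query-pool round: an arithmetic
    progression of q + 1 terms, from f*q down to (f-1)*q."""
    rp = int(f * q)  # |RP| = f * q
    return sum(rp - i for i in range(q + 1))

# For f = 1.5 (so f' = 2f - 1 = 2) this is exactly q**2 + q.
for q in (10, 50, 100):
    assert rp_passes(q) == q**2 + q
```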
The complexity, in total forward passes, can then be expressed by the following formula:

$$\text {total passes} = \frac{\text {labeled-examples}}{\text {query-pool-size}} \times \left( \text {data-size} + \text {query-pool-size}^2 + \text {query-pool-size} \right)$$

This can be approximately rewritten as:

$$\text {total passes} \approx \text {labeled-examples} \times \left( \frac{\text {data-size}}{\text {query-pool-size}} + \text {query-pool-size} + 1 \right)$$

Note that the second term \(\text {query-pool-size} + 1\) only starts dominating the number of forward passes in this formula as soon as:

$$\text {query-pool-size} + 1 > \frac{\text {data-size}}{\text {query-pool-size}}$$

This is the case when \(\text {query-pool-size} \gtrsim \sqrt{\text {data-size}}\).
Until then, the computational gains from fewer basic passes outweigh the cost of more RP passes. In practice, though, the crossover may happen fairly quickly. For example, assuming a data size of 10000 examples and, as mentioned, \(|\mathcal {RP}| = 1.5 \times q\), then as soon as \(q \ge 100\) the increased computation of the RP passes starts dominating the gains made by fewer basic passes when further increasing the query-pool size, and the net effect is that the total amount of computation increases.
In summary, for the RET algorithm, RP passes contribute to the total number of forward passes. This contribution increases linearly with the query-pool size and the coupled redundancy-pool size, and starts to dominate the total number of forward passes once the query-pool size exceeds roughly \(\sqrt{\text {data-size}}\). This limits the usefulness of increasing the query-pool size as a way to decrease computation.
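The crossover can be illustrated numerically. The sketch below evaluates the total-passes formula above (with \(f' = 2\), so \(q^2 + q\) RP passes per round; the function name is ours) and shows that the minimum sits at \(q = \sqrt{\text {data-size}}\):

```python
def total_passes(data_size, labeled, q):
    """Approximate total forward passes for the RET algorithm:
    labeled/q rounds, each costing one pass over the pool (basic
    passes) plus q**2 + q RP passes (f' = 2)."""
    rounds = labeled / q
    return rounds * (data_size + q**2 + q)

N, L = 10_000, 1_000
# Cost falls while data_size/q dominates, and rises again once q
# exceeds roughly sqrt(data_size) = 100.
qs = [25, 50, 100, 200, 400]
costs = [total_passes(N, L, q) for q in qs]
best_q = qs[min(range(len(qs)), key=costs.__getitem__)]
```

Under these assumptions the cost curve is symmetric around \(q = 100\): both \(q = 50\) and \(q = 200\) cost the same, and both \(q = 25\) and \(q = 400\) cost the same.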
A.2 Algorithms
(The algorithms are given as pseudocode figures a–e in the published version.)
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Jacobs, P.F., Maillette de Buy Wenniger, G., Wiering, M., Schomaker, L. (2022). Active Learning for Reducing Labeling Effort in Text Classification Tasks. In: Leiva, L.A., Pruski, C., Markovich, R., Najjar, A., Schommer, C. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2021. Communications in Computer and Information Science, vol 1530. Springer, Cham. https://doi.org/10.1007/978-3-030-93842-0_1
Print ISBN: 978-3-030-93841-3
Online ISBN: 978-3-030-93842-0