
Active Learning for Reducing Labeling Effort in Text Classification Tasks

  • Conference paper
Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2021)

Abstract

Labeling data can be an expensive task, as it is usually performed manually by domain experts. This is cumbersome for deep learning, which depends on large labeled datasets. Active learning (AL) is a paradigm that aims to reduce labeling effort by querying labels only for the data that the model deems most informative. Little research has been done on AL in a text classification setting, and next to none has involved the more recent, state-of-the-art Natural Language Processing (NLP) models. Here, we present an empirical study that compares different uncertainty-based algorithms with BERT\(_{base}\) as the classifier. We evaluate the algorithms on two NLP classification datasets: Stanford Sentiment Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to solve presupposed problems of uncertainty-based AL, namely that it is unscalable and that it is prone to selecting outliers. Furthermore, we explore the influence of the query-pool size on the performance of AL. While the proposed heuristics did not improve the performance of AL, our results show that using uncertainty-based AL with BERT\(_{base}\) outperforms random sampling of data, although this difference in performance can decrease as the query-pool size gets larger.
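The uncertainty-based querying described above can be sketched in a few lines. The following is a minimal illustration using predictive entropy, one standard uncertainty measure, and is not the paper's exact implementation; the function name and toy probabilities are invented for the example.

```python
import numpy as np

def entropy_query(probs: np.ndarray, q: int) -> np.ndarray:
    """Return indices of the q most uncertain examples by predictive entropy.

    probs: (n_examples, n_classes) class probabilities from the classifier.
    """
    eps = 1e-12  # guard against log(0)
    # Predictive entropy: H = -sum_c p_c * log(p_c); higher = more uncertain.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Select the q examples with the highest entropy, most uncertain first.
    return np.argsort(entropy)[-q:][::-1]

# Toy example: three unlabeled examples, binary classifier.
probs = np.array([[0.99, 0.01],   # confident
                  [0.55, 0.45],   # uncertain
                  [0.80, 0.20]])  # in between
print(entropy_query(probs, q=2))  # → [1 2]
```

In an AL loop, the returned indices would be sent to the annotator, and the model retrained on the enlarged labeled set before the next query round.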


Notes

  1. For our experiments, this resulted in our n ranging from 20 to 191 for the SST dataset and from 17 to 152 for the KvK dataset (the values of q used can be found in Sect. 3.5).

  2. Larger values up to 100 were tested, but induced much larger training times without noteworthy performance gains.


Acknowledgments

We would like to express our thanks and gratitude to the people at Dialogic (Utrecht), Nick Jelicic in particular, for the useful advice on the writing style of the paper and the suggested improvements to the source code.

Author information


Correspondence to Pieter Floris Jacobs, Gideon Maillette de Buy Wenniger, Marco Wiering or Lambert Schomaker.


Appendix

A.1 RET Algorithm Computational Cost Analysis

The number of forward passes required by the RET algorithm depends on two factors:

  1. Basic passes: the forward passes required by the “normal” computation of uncertainty at the beginning of the computation for every query-pool.

  2. RP passes: the forward passes required for intermediate updates, using the redundancy pool RP.

In this analysis we will assume that the size of the redundancy pool \(|\mathcal {RP}|\) is chosen as a factor \(f > 1\) of the size of the query-pool q. This is a reasonable assumption, considering that making \(|\mathcal {RP}|\) larger than needed incurs unnecessary computational cost, whereas too small a value is expected to diminish the effect of the RET algorithm. We furthermore note that, given this assumption and a fixed total number of examples to label, two factors influence the required number of RP passes:

  • Linearly increasing the query-pool size and coupled redundancy pool size causes a quadratic increase in the number of required forward passes per query pool round.

  • At the same time, a linearly increased query-pool size also induces a corresponding linear decrease in the number of required query-pool rounds.

We will see that these two factors cause a net linear contribution of the RP passes to the total number of passes, which starts causing a net increase of total passes once the query-pool size rises above a certain value. Looking at (1) more precisely, the number of passes over \(\mathcal {RP}\) that needs to be performed per query-pool round can be computed as an arithmetic progression:

$$\begin{aligned} |\mathcal {RP}| + (|\mathcal {RP}| - 1) + (|\mathcal {RP}| - 2) + \ldots + (|\mathcal {RP}| - q) \end{aligned}$$
(7)
$$\begin{aligned} = \frac{1}{2} \times (q + 1) \times (|\mathcal {RP}| + |\mathcal {RP}| - q ) \end{aligned}$$
(8)
$$\begin{aligned} = \frac{1}{2} \times (q + 1) \times ((2f - 1) \times q) \end{aligned}$$
(9)
$$\begin{aligned} = \frac{1}{2} \times (q + 1) \times f' \times q \end{aligned}$$
(10)
$$\begin{aligned} = \frac{1}{2} \times f' \times (q^2 + q) \end{aligned}$$
(11)

Let’s assume we use \(f = 1.5\) (as also used in our experiments), and consequently, \(f' = 2f - 1 = 2\). The number of forward passes over \(\mathcal {RP}\) then becomes exactly \(q^2 + q\).
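The closed form in (11) can be checked numerically against the term-by-term sum; a small sketch, with helper names that are illustrative rather than from the paper:

```python
def rp_passes(q: int, f: float) -> int:
    """RP passes per query-pool round, summed term by term:
    |RP| + (|RP| - 1) + ... + (|RP| - q), with |RP| = f * q."""
    rp = int(f * q)
    return sum(rp - i for i in range(q + 1))

def rp_passes_closed(q: int, f: float) -> float:
    """Closed form from Eq. (11): 1/2 * f' * (q^2 + q), with f' = 2f - 1."""
    f_prime = 2 * f - 1
    return 0.5 * f_prime * (q ** 2 + q)

# With f = 1.5 (so f' = 2), both agree with q^2 + q.
q, f = 100, 1.5
print(rp_passes(q, f), rp_passes_closed(q, f), q ** 2 + q)  # → 10100 10100.0 10100
```

The agreement of all three values confirms that for \(f = 1.5\) the per-round RP cost is exactly \(q^2 + q\) forward passes.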

The complexity can then be expressed by the following formula:

$$\begin{aligned} T \times \#\text {Samples} \times \left( \frac{|\text {data}|}{q} + \frac{\frac{1}{2} \times f' \times (q^2 + q)}{q}\right) \end{aligned}$$
(12)

This can be approximately rewritten as:

$$\begin{aligned} T \times \#\text {Samples} \times (\frac{|\text {data}|}{q} + \frac{q^2 + q}{q}) \end{aligned}$$
(13)
$$\begin{aligned} = T \times \#\text {Samples} \times (\frac{|\text {data}|}{q} + q + 1) \end{aligned}$$
(14)

Note that the second term \(q + 1\) only starts dominating the number of forward passes in this formula as soon as:

$$q + 1 \approx q > \frac{|\text {data}|}{q} $$

This is the case when

$$q > \sqrt{|\text {data}|}$$

Until then, the computational gains of fewer basic passes outweigh the cost of more RP passes. In practice, though, this may happen fairly quickly. For example, assuming we have a data size of 10000 examples and, as mentioned, use \(|\mathcal {RP}| = 1.5 \times q\), then as soon as \(q \ge 100\) the increased computation of the RP passes starts dominating the gains made by fewer basic passes when further increasing the query-pool size, and the net effect is that the total amount of computation increases.
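This crossover can be made visible by evaluating Eq. (14) directly; the following sketch (function name illustrative) computes the approximate cost per unit of \(T \times \#\text{Samples}\) for a data size of 10000:

```python
def total_passes(q: int, n_data: int, f_prime: int = 2) -> float:
    """Approximate forward passes per unit of T * #Samples (Eq. 14):
    basic passes |data| / q plus RP passes q + 1."""
    return n_data / q + (f_prime / 2) * (q + 1)

n = 10_000
for q in (10, 50, 100, 200, 500):
    print(q, total_passes(q, n))
# The cost is minimised near q = sqrt(n) = 100 (cost 201.0);
# both smaller and larger q are more expensive.
```

The printed values decrease until \(q = 100\) and increase afterwards, matching the \(q > \sqrt{|\text{data}|}\) threshold derived above.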

In summary, for the RET algorithm, RP passes contribute to the total number of forward passes. Furthermore, this contribution increases linearly with the query-pool size and coupled redundancy-pool size, and starts to dominate the total number of forward passes once \(\text {query-pool-size} > \sqrt{\text {data-size}}\). This limits the algorithm's use for decreasing computation by increasing the query-pool size.

A.2 Algorithms

The algorithm listings (figures a–e) are presented as images in the original publication.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Jacobs, P.F., Maillette de Buy Wenniger, G., Wiering, M., Schomaker, L. (2022). Active Learning for Reducing Labeling Effort in Text Classification Tasks. In: Leiva, L.A., Pruski, C., Markovich, R., Najjar, A., Schommer, C. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2021. Communications in Computer and Information Science, vol 1530. Springer, Cham. https://doi.org/10.1007/978-3-030-93842-0_1


  • DOI: https://doi.org/10.1007/978-3-030-93842-0_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93841-3

  • Online ISBN: 978-3-030-93842-0

  • eBook Packages: Computer Science, Computer Science (R0)
