Subject Classification of Texts in Polish - from TF-IDF to Transformers

  • Conference paper
Theory and Engineering of Dependable Computer Systems and Networks (DepCoS-RELCOMEX 2021)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1389)

Abstract

This paper addresses the classification of Polish-language documents by subject category. We compare five approaches: the classical TF-IDF representation, two word2vec-based methods (fastText in supervised mode and doc2vec), and two contemporary transformer-based BERT models pre-trained on Polish corpora. The approaches are evaluated on five corpora that differ in the number of subject categories (from 5 to 36 labels), in size, and in text length. The results suggest that the BERT-based methods outperform the other approaches in almost all tests. Given model size and the fine-tuning time it entails, the Polbert model is recommended for subject classification of Polish texts. The paper includes a detailed analysis of the recognition and time performance of the analyzed methods.
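As a rough illustration of the classical end of this comparison, the sketch below builds TF-IDF vectors and assigns a text to the class whose summed training vectors it is most similar to under cosine similarity. It is not the paper's pipeline: the abstract does not say which classifier is paired with TF-IDF, so the nearest-centroid rule, the smoothed IDF formula, the whitespace tokenizer, and the toy Polish documents and labels are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive lowercase whitespace tokenization (an assumption); real Polish
    # text would need proper tokenization and probably lemmatization.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Term frequency times IDF, L2-normalized; terms unseen in training are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    # Sum the TF-IDF vectors per class; only the direction matters for cosine.
    centroids = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for term, value in tfidf_vector(doc, idf).items():
            centroids[label][term] += value
    return centroids


def cosine(vec, centroid):
    # vec is already unit length, so only the centroid norm is needed.
    dot = sum(v * centroid.get(t, 0.0) for t, v in vec.items())
    norm = math.sqrt(sum(v * v for v in centroid.values()))
    return dot / norm if norm else 0.0


def predict(text, idf, centroids):
    vec = tfidf_vector(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy corpus, invented purely for illustration.
train_docs = [
    "mecz piłka bramka sędzia",
    "piłka liga trener mecz",
    "sejm ustawa poseł rząd",
    "rząd minister ustawa wybory",
]
train_labels = ["sport", "sport", "polityka", "polityka"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
print(predict("trener i piłka przed mecz", idf, centroids))
```

A sparse bag-of-words baseline like this ignores word order and context, which is the kind of information the embedding-based and transformer-based approaches in the comparison are designed to capture.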

Notes

  1. http://hdl.handle.net/11321/606

  2. https://huggingface.co/dkleczek/bert-base-polish-cased-v1

  3. In March 2021.

  4. https://klejbenchmark.com/leaderboard/

  5. https://huggingface.co/allegro/herbert-large-cased

Author information

Corresponding author

Correspondence to Tomasz Walkowiak.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Walkowiak, T. (2021). Subject Classification of Texts in Polish - from TF-IDF to Transformers. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Theory and Engineering of Dependable Computer Systems and Networks. DepCoS-RELCOMEX 2021. Advances in Intelligent Systems and Computing, vol 1389. Springer, Cham. https://doi.org/10.1007/978-3-030-76773-0_44
