Abstract
The paper concerns the issue of documents in Polish classification according to the subject category. We compare five approaches, starting from classical TF-IDF one, through word2vec methods (fastText in the supervised mode and doc2vec method) up to contemporary transformer-based BERT models (i.e., two models pre-trained on Polish corpora are used). These five approaches are evaluated using five corpora with subject categories (ranging from 5 to 36 labels), different sizes, and lengths of texts. Results suggest that BERT methods outperform other approaches in almost all tests. Due to the size of the model and therefore required time for tuning, the Polbert model is recommended as a solution for Polish text subject classification. The paper includes detailed analysis of recognition and time performance of analyzed methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. CoRR abs/2004.05150 (2020). https://arxiv.org/abs/2004.05150
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2018, pp. 3483–3487 (2018)
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Hastie, T.J., Tibshirani, R.J., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York (2009). autres impressions : 2011 (corr.), 2013 (7e corr.)
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics (2017). http://aclweb.org/anthology/E17-2068
Kłeczek, D.: Polbert: attacking polish NLP tasks with transformers. In: Ogrodniczuk, M., Kobyliński, Ł. (eds.) Proceedings of the PolEval 2020 Workshop, pp. 79–88. Institute of Computer Science, Polish Academy of Sciences (2020)
Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: 8th International Conference on Learning Representations. OpenReview.net (2020). https://openreview.net/forum?id=rkgNKkHtvB
Kocon, J., Gawor, M.: Evaluating KGR10 polish word embeddings in the recognition of temporal expressions using BiLSTM-CRF. CoRR abs/1904.04055 (2019). http://arxiv.org/abs/1904.04055
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
Marcińczuk, M., Gniewkowski, M., Walkowiak, T., Będkowski, M.: Text document clustering: Wordnet vs. TF-IDF vs. word embeddings. In: Proceedings of the 11th Global Wordnet Conference, pp. 207–214. Global Wordnet Association, University of South Africa (UNISA) (January 2021). https://www.aclweb.org/anthology/2021.gwc-1.24
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Młynarczyk, K., Piasecki, M.: Wiki test - 34 categories (2015). http://hdl.handle.net/11321/217. CLARIN-PL digital repository
Młynarczyk, K., Piasecki, M.: Wiki train - 34 categories (2015). http://hdl.handle.net/11321/222. CLARIN-PL digital repository
Piskorski, J., Sydow, M.: Experiments on classification of polish newspaper. Arch. Control Sci. 15, 613–625 (2005)
Rybak, P., Mroczkowski, R., Tracz, J., Gawlik, I.: KLEJ: comprehensive benchmark for polish language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1191–1201. Association for Computational Linguistics (July 2020). https://www.aclweb.org/anthology/2020.acl-main.111
Salton, G.B.C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
Torkkola, K.: Discriminative features for textdocument classification. Formal Pattern Anal. Appl. 6(4), 301–308 (2004). https://doi.org/10.1007/s10044-003-0196-8
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Walkowiak, T., Datko, S., Maciejewski, H.: Bag-of-words, bag-of-topics and word-to-vec based subject classification of text documents in polish - a comparative study. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) Contemporary Complex Systems and Their Dependability, pp. 526–535. Springer, Cham (2019)
Walkowiak, T., Datko, S., Maciejewski, H.: Low-dimensional classification of text documents. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) Engineering in Dependability of Computer Systems and Networks, pp. 534–543. Springer, Cham (2020)
Walkowiak, T., Gniewkowski, M.: Evaluation of vector embedding models in clustering of text documents. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, pp. 1304–1311. INCOMA Ltd. (September 2019). https://www.aclweb.org/anthology/R19-1149
Walkowiak, T., Malak, P.: Polish texts topic classification evaluation. In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART, vol. 2, pp. 515–522. INSTICC, SciTePress (2018)
Zhu, H., Koniusz, P.: Simple spectral graph convolution. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=CYO5T-YjWZV
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Walkowiak, T. (2021). Subject Classification of Texts in Polish - from TF-IDF to Transformers. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) Theory and Engineering of Dependable Computer Systems and Networks. DepCoS-RELCOMEX 2021. Advances in Intelligent Systems and Computing, vol 1389. Springer, Cham. https://doi.org/10.1007/978-3-030-76773-0_44
Download citation
DOI: https://doi.org/10.1007/978-3-030-76773-0_44
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76772-3
Online ISBN: 978-3-030-76773-0
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)