Subject classification of texts in Polish - from TF-IDF to transformers

T Walkowiak - International Conference on Dependability and Complex Systems, 2021 - Springer
Abstract
The paper addresses the classification of Polish-language documents by subject category. We compare five approaches, starting from the classical TF-IDF approach, through word2vec-based methods (fastText in supervised mode and doc2vec), up to contemporary transformer-based BERT models (two models pre-trained on Polish corpora are used). These five approaches are evaluated on five corpora with subject categories (ranging from 5 to 36 labels) that differ in size and in text length. Results suggest that the BERT methods outperform the other approaches in almost all tests. Given the size of the model, and hence the time required for tuning, the Polbert model is recommended as a solution for subject classification of Polish texts. The paper includes a detailed analysis of the recognition and time performance of the analyzed methods.
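To illustrate the classical end of the comparison, the following is a minimal sketch of TF-IDF-based subject classification using only the Python standard library. It is not the paper's implementation: the whitespace tokenizer, the smoothed IDF formula, the centroid-based scorer, and the toy Polish documents and labels are all assumptions made for illustration, and the abstract does not say which classifier was paired with TF-IDF.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split; real Polish text needs proper tokenization.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Term frequency times IDF, L2-normalised; terms unseen in training are ignored.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    # Average the TF-IDF vectors of all training documents sharing a label.
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, v in tfidf_vector(doc, idf).items():
            sums[label][t] += v
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(text, idf, centroids):
    # Pick the label whose centroid has the largest dot product with the document vector.
    vec = tfidf_vector(text, idf)

    def score(label):
        return sum(v * centroids[label].get(t, 0.0) for t, v in vec.items())

    return max(centroids, key=score)


# Hypothetical toy corpus (not from the paper's datasets).
train_docs = [
    "mecz piłka bramka drużyna",
    "drużyna trener mecz liga",
    "rząd sejm ustawa minister",
    "minister premier rząd wybory",
]
train_labels = ["sport", "sport", "polityka", "polityka"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)

print(predict("trener drużyna bramka", idf, centroids))
print(predict("sejm wybory ustawa", idf, centroids))
```

The centroid scorer was chosen only to keep the sketch short; any standard classifier could be trained on the same TF-IDF vectors.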