Author Attribution of Literary Texts in Polish by the Sequence Averaging

Walkowiak, Tomasz

doi:10.1007/978-3-031-23480-4_31

Tomasz Walkowiak ORCID: orcid.org/0000-0002-7749-4251

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13589)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

The paper concerns authorship recognition in a collection of Polish literary texts from the late 19th and early 20th centuries, consisting of 99 novels by 33 authors. The books are divided into smaller parts, and classification is analyzed at the level of a single book part. To mimic the real task of testing an unknown book, the data set is split so that parts of the same book never appear in both the training set and the test set. The approaches are compared on raw texts and, to avoid the semantic features of the text, on texts represented as sequences of grammatical class bigrams. For raw text, classical TF-IDF, supervised fastText, and a contemporary transformer-based BERT model are analyzed. For grammatical classes, only TF-IDF and fastText are applied. In addition, the paper proposes a sequence averaging method that divides the text into smaller parts, classifies each part separately, and makes the final classification by averaging the results from all parts of the text. The study suggests that TF-IDF on the raw text outperforms the other methods and that sequence averaging improves the classification results for most of the analyzed schemas. Surprisingly, the BERT-based method performs worst. This phenomenon is carefully analyzed and explained.
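The sequence averaging idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the chunk size, whether probabilities or raw scores are averaged, or how chunks are weighted, so the code below assumes fixed-size token chunks, per-chunk class probabilities, and a plain unweighted mean. The names `split_into_chunks`, `sequence_average`, `predict_author`, and `toy_classifier` are hypothetical, and the toy classifier merely stands in for a trained TF-IDF, fastText, or BERT model.

```python
from typing import Callable, List, Sequence

# A chunk classifier maps a list of tokens to one probability per author.
ChunkClassifier = Callable[[List[str]], List[float]]


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def sequence_average(
    tokens: Sequence[str], classify_chunk: ChunkClassifier, chunk_size: int
) -> List[float]:
    """Classify each chunk separately and return the unweighted mean of the outputs."""
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        raise ValueError("cannot classify an empty text")
    per_chunk = [classify_chunk(chunk) for chunk in chunks]
    n_classes = len(per_chunk[0])
    return [sum(p[k] for p in per_chunk) / len(per_chunk) for k in range(n_classes)]


def predict_author(
    tokens: Sequence[str], classify_chunk: ChunkClassifier, chunk_size: int
) -> int:
    """Return the index of the author with the highest averaged score."""
    averaged = sequence_average(tokens, classify_chunk, chunk_size)
    return max(range(len(averaged)), key=averaged.__getitem__)


def toy_classifier(chunk: List[str]) -> List[float]:
    """Hypothetical two-author model: author 0 is assumed to favour the token 'x'."""
    share = chunk.count("x") / len(chunk)
    return [share, 1.0 - share]


if __name__ == "__main__":
    text = ["x", "x", "y", "y", "y", "y"]
    # Chunks: [x, x], [y, y], [y, y]; two of the three chunks point to author 1.
    print(predict_author(text, toy_classifier, chunk_size=2))  # prints 1
```

Averaging over chunks lets many weak per-chunk decisions vote on the whole text, which is one plausible reason such a scheme can help when each chunk alone is too short to be reliable.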


Notes

1. https://huggingface.co/allegro/herbert-base-cased


Author information

Authors and Affiliations

Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
Tomasz Walkowiak

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Walkowiak, T. (2023). Author Attribution of Literary Texts in Polish by the Sequence Averaging. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2022. Lecture Notes in Computer Science, vol 13589. Springer, Cham. https://doi.org/10.1007/978-3-031-23480-4_31

DOI: https://doi.org/10.1007/978-3-031-23480-4_31
Published: 24 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23479-8
Online ISBN: 978-3-031-23480-4
eBook Packages: Computer Science, Computer Science (R0)
