Recognizing Preferred Grammatical Gender in Russian Anonymous Online Confessions

Alekseev, Anton; Nikolenko, Sergey

doi:10.1007/978-3-030-58323-1_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12284)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

We present annotation results for a dataset of public anonymous online confessions in Russian (“Overheard/Podslushano” group in VKontakte, posts tagged #family). Unlike many other cases with online social network data, intentionally anonymous posts do not contain any explicit metadata such as age or gender. We consider the problem of predicting the author’s preferred grammatical gender for self-reference, a problem that proved to be surprisingly hard and not reducible to simple morphological analysis. We describe an expert labeling of a dataset for this problem, show the findings of predictive analysis, and introduce rule-based and machine learning approaches.
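
To make the task concrete, here is a minimal sketch of a naive morphological heuristic. It is not the authors' rule-based method: the patterns, the function name `guess_self_gender`, and the two-word window are all illustrative assumptions. It looks for the pronoun "я" followed closely by a word with a typical past-tense ending and reports the corresponding grammatical gender.

```python
import re

# Hypothetical toy rule (an assumption, not taken from the paper):
# the pronoun "я" followed, within at most two intervening words, by a word
# ending in "-ла"/"-лась" (feminine) or "-л"/"-лся" (masculine).
FEM = re.compile(r"\bя\s+(?:\w+\s+){0,2}?\w+(?:ла|лась)\b", re.IGNORECASE)
MASC = re.compile(r"\bя\s+(?:\w+\s+){0,2}?\w+(?:л|лся)\b", re.IGNORECASE)


def guess_self_gender(text: str) -> str:
    """Return 'feminine', 'masculine', or 'unknown' from surface cues only."""
    fem = bool(FEM.search(text))
    masc = bool(MASC.search(text))
    if fem and not masc:
        return "feminine"
    if masc and not fem:
        return "masculine"
    # No cue, or conflicting cues: refuse to guess.
    return "unknown"


examples = [
    "Я вчера сказала мужу, что устала.",  # the author uses a feminine verb form
    "Я пришёл домой и лёг спать.",         # the author uses a masculine verb form
    "Я думаю мама устала",                 # the feminine form refers to the mother
]
for post in examples:
    print(guess_self_gender(post), "<-", post)
```

The third example is labeled feminine even though the feminine verb form describes the author's mother, not the author. Surface patterns of this kind cannot tell who a verb refers to, which illustrates why the problem does not reduce to simple morphological analysis.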


Notes

1. https://vk.com/overhear, with more than 3,987,000 users reading the community as of February 20, 2020.
2. We have made several attempts to tackle the annotation task via crowdsourcing platforms, updating the instructions and adding more advanced qualification tests. However, most annotators still derived the gender based on stereotypes, so we had to ask our own experts to label the data, which explains the modest size of the corpus.
3. We have used tokens and bigrams available in the training set with the minimum document frequency of 3; scikit-learn’s [11] default TF-IDF weighting scheme was employed.
4. As of June 11, 2020, the model is available at: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131.
5. Originally in Russian, translated into English.
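
The feature setup described in Note 3 might look roughly like the following sketch with scikit-learn's `TfidfVectorizer`. The toy corpus is invented for illustration, and the tokenizer is an assumption: the note does not say how the texts were tokenized, and scikit-learn's default token pattern drops one-character tokens such as "я".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy "training set"; the real corpus is not reproduced here.
train_texts = [
    "моя мама сказала",
    "моя мама ушла",
    "моя мама пришла",
    "мой папа сказал",
]

# Unigrams and bigrams, kept only if they occur in at least 3 documents.
# All other parameters are scikit-learn defaults (an assumption about the
# "default TF-IDF weighting scheme" mentioned in the note).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3)
X = vectorizer.fit_transform(train_texts)

print(sorted(vectorizer.vocabulary_))  # surviving unigrams and bigrams
print(X.shape)                         # (number of documents, number of features)
```

With `min_df=3`, only "моя", "мама", and the bigram "моя мама" survive in this toy corpus, so the last document becomes an all-zero row. On a small corpus this threshold prunes the vocabulary heavily.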

References

ELI5: a Python package to debug machine learning classifiers and explain their predictions (2016). https://github.com/TeamHG-Memex/eli5/
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019)
Alekseev, A., Nikolenko, S.: Word embeddings for user profiling in online social networks. Computación y Sistemas 21(2), 203–226 (2017)
Anastasyev, D., Gusev, I., Indenbom, E.: Improving part-of-speech tagging via multi-task learning and character-level word representations. arXiv preprint arXiv:1807.00818 (2018)
Christopherson, K.M.: The positive and negative implications of anonymity in internet social interactions: “on the internet, nobody knows you’re a dog”. Comput. Hum. Behav. 23(6), 3038–3056 (2007)
Kang, R., Brown, S., Kiesler, S.: Why do people seek anonymity on the internet? informing policy and design. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2657–2666 (2013)
Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, pp. 3146–3154 (2017)
Kestemont, M., et al.: Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection. In: Cappellato, L., et al. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs, Avignon, France, 10–14 September 2018, pp. 1–25 (2018)
Meng, Q., et al.: A communication-efficient parallel algorithm for decision tree. In: Advances in Neural Information Processing Systems, pp. 1279–1287 (2016)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: gender and language variety identification in Twitter. Working Notes Papers of the CLEF, pp. 1613–0073 (2017)
Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., Gribov, A.: RuSentiment: an enriched sentiment analysis dataset for social media in Russian. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 755–763 (2018)
Straka, M., Straková, J.: Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Association for Computational Linguistics, Vancouver, Canada, August 2017
Straka, M., Straková, J.: Universal dependencies 2.5 models for UDPipe (2019-12-06) (2019), http://hdl.handle.net/11234/1-3131, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, pp. 5754–5764 (2019)


Acknowledgement

This work was carried out at the Samsung-PDMI Joint AI Center at Steklov Mathematical Institute at St. Petersburg and supported by Samsung Research. We would like to thank anonymous reviewers for insightful comments that helped us to improve the paper.

Author information

Authors and Affiliations

Samsung-PDMI Joint AI Center, Steklov Mathematical Institute at St. Petersburg, St. Petersburg, 191023, Russia
Anton Alekseev & Sergey Nikolenko
Neuromation OU, 10111, Tallinn, Estonia
Sergey Nikolenko


Corresponding author

Correspondence to Anton Alekseev.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák

Appendix: instructions for annotation

Which grammatical gender do authors use when talking about themselves? (See Note 5.)

Short description: We ask you to carefully read the short text and report in what grammatical gender the authors refer to themselves, based on grammatical evidence/clues.

It is usually clear which grammatical gender (feminine/masculine/etc.) users prefer when speaking about themselves in their posts. However, sometimes it is impossible to tell. Not all cases are obvious, so please read the instructions carefully.

IMPORTANT: only grammatical features and clues can be used to determine the gender. The task is not to guess whether a man or a woman wrote the text. The task is to determine with confidence which grammatical gender they prefer when talking about themselves.

Sample cases with possible errors.


About this paper

Cite this paper

Alekseev, A., Nikolenko, S. (2020). Recognizing Preferred Grammatical Gender in Russian Anonymous Online Confessions. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_24


DOI: https://doi.org/10.1007/978-3-030-58323-1_24
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)
