Recognizing Preferred Grammatical Gender in Russian Anonymous Online Confessions

  • Conference paper
Text, Speech, and Dialogue (TSD 2020)

Abstract

We present annotation results for a dataset of public anonymous online confessions in Russian (“Overheard/Podslushano” group in VKontakte, posts tagged #family). Unlike many other cases with online social network data, intentionally anonymous posts do not contain any explicit metadata such as age or gender. We consider the problem of predicting the author’s preferred grammatical gender for self-reference, a problem that proved to be surprisingly hard and not reducible to simple morphological analysis. We describe an expert labeling of a dataset for this problem, show the findings of predictive analysis, and introduce rule-based and machine learning approaches.

Notes

  1. https://vk.com/overhear, with more than 3,987,000 users reading the community as of February 20, 2020.

  2. We have made several attempts to tackle the annotation task via crowdsourcing platforms, updating the instructions and adding more advanced qualification tests. However, most annotators still derived the gender based on stereotypes, so we had to ask our own experts to label the data, which explains the modest size of the corpus.

  3. We have used tokens and bigrams available in the training set with the minimum document frequency of 3; scikit-learn’s [11] default TF-IDF weighting scheme was employed.

  4. As of June 11, 2020, the model is available at: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131.

  5. Originally in Russian, translated into English.
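
The feature extraction described in Note 3 can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names and the whitespace tokenization are assumptions, and the weighting follows what scikit-learn documents as its default TF-IDF scheme (smoothed IDF, raw term counts, L2 normalization).

```python
import math
from collections import Counter


def unigrams_and_bigrams(text):
    """Return tokens plus adjacent-token bigrams (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def fit_idf(train_texts, min_df=3):
    """Learn smoothed IDF weights for terms seen in at least `min_df` training documents."""
    docs = [Counter(unigrams_and_bigrams(t)) for t in train_texts]
    # Iterating over a Counter yields each distinct term once per document,
    # so this counts document frequency rather than total occurrences.
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
        if df >= min_df
    }


def transform(text, idf):
    """Map a text to an L2-normalized TF-IDF vector over the learned vocabulary."""
    counts = Counter(unigrams_and_bigrams(text))
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}
```

With scikit-learn itself, the corresponding setup would presumably be `TfidfVectorizer(ngram_range=(1, 2), min_df=3)`; the exact tokenizer and any other settings used in the paper are not stated in the note.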

References

  1. ELI5: a Python package to debug machine learning classifiers and explain their predictions (2016). https://github.com/TeamHG-Memex/eli5/

  2. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019)

  3. Alekseev, A., Nikolenko, S.: Word embeddings for user profiling in online social networks. Computación y Sistemas 21(2), 203–226 (2017)

  4. Anastasyev, D., Gusev, I., Indenbom, E.: Improving part-of-speech tagging via multi-task learning and character-level word representations. arXiv preprint arXiv:1807.00818 (2018)

  5. Christopherson, K.M.: The positive and negative implications of anonymity in internet social interactions: “on the internet, nobody knows you’re a dog”. Comput. Hum. Behav. 23(6), 3038–3056 (2007)

  6. Google Scholar 

  7. Kang, R., Brown, S., Kiesler, S.: Why do people seek anonymity on the internet? informing policy and design. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2657–2666 (2013)

  8. Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, pp. 3146–3154 (2017)

  9. Kestemont, M., et al.: Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection. In: Cappellato, L., et al. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs, Avignon, France, 10–14 September 2018, pp. 1–25 (2018)

  10. Meng, Q., et al.: A communication-efficient parallel algorithm for decision tree. In: Advances in Neural Information Processing Systems, pp. 1279–1287 (2016)

  11. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  12. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: gender and language variety identification in Twitter. Working Notes Papers of the CLEF, pp. 1613–0073 (2017)

  13. Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., Gribov, A.: Rusentiment: an enriched sentiment analysis dataset for social media in Russian. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 755–763 (2018)

  14. Straka, M., Straková, J.: Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Association for Computational Linguistics, Vancouver, Canada, August 2017

  15. Straka, M., Straková, J.: Universal dependencies 2.5 models for UDPipe (2019–12-06) (2019), http://hdl.handle.net/11234/1-3131, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

  16. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, pp. 5754–5764 (2019)

Acknowledgement

This work was carried out at the Samsung-PDMI Joint AI Center at Steklov Mathematical Institute at St. Petersburg and supported by Samsung Research. We would like to thank anonymous reviewers for insightful comments that helped us to improve the paper.

Author information

Corresponding author

Correspondence to Anton Alekseev.

Appendix: instructions for annotation

Which grammatical gender do authors use when talking about themselves? (See Note 5.)

Short description: We ask you to carefully read the short text and report in what grammatical gender the authors refer to themselves, based on grammatical evidence/clues.

It is usually clear which grammatical gender (feminine/masculine/etc.) the users prefer when speaking about themselves in their posts. However, sometimes it may be impossible to tell. Not all cases are obvious, so please read the instructions carefully.

IMPORTANT: only grammatical features and clues can be used to determine the gender. The task is not to guess whether a man or a woman wrote the text. The task is to determine with confidence which grammatical gender they prefer when talking about themselves.
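
To illustrate what a grammatical clue can look like, here is a deliberately naive sketch. It is not the rule-based approach from the paper: the function name, the ending lists, and the "unclear" label are hypothetical. It only checks whether the word right after the pronoun "я" ("I") looks like a past-tense form with a feminine ending ("-ла", "-лась") or a masculine one ("-л", "-лся").

```python
import re
from collections import Counter

# Hypothetical ending lists for singular past-tense forms; far from exhaustive.
FEMININE_ENDINGS = ("ла", "лась")
MASCULINE_ENDINGS = ("л", "лся")


def guess_self_reference_gender(text):
    """Vote on the grammatical gender of the word that directly follows "я"."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    votes = Counter()
    for previous, word in zip(tokens, tokens[1:]):
        if previous != "я":
            continue  # only first-person contexts count as self-reference
        if word.endswith(FEMININE_ENDINGS):
            votes["feminine"] += 1
        elif word.endswith(MASCULINE_ENDINGS):
            votes["masculine"] += 1
    if votes["feminine"] > votes["masculine"]:
        return "feminine"
    if votes["masculine"] > votes["feminine"]:
        return "masculine"
    return "unclear"  # no clue found, or the clues conflict
```

For instance, "Я сказала маме правду." yields "feminine", while "Сестра сказала, что я опоздал." yields "masculine", because the feminine verb there belongs to the sister rather than the author. A heuristic this simple ignores quoted speech, dropped pronouns, adjectives, and any word that merely happens to end in these letters, which is consistent with the abstract's remark that the task does not reduce to simple morphological analysis.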

Sample cases with possible errors.

[Figures t to x: sample cases]

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Alekseev, A., Nikolenko, S. (2020). Recognizing Preferred Grammatical Gender in Russian Anonymous Online Confessions. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_24

  • DOI: https://doi.org/10.1007/978-3-030-58323-1_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58322-4

  • Online ISBN: 978-3-030-58323-1

  • eBook Packages: Computer Science, Computer Science (R0)
