extended-abstract

Revealing the Demographic Attributes of the Authors from the Abstracts of Scientific Articles

Author:

Salim Sazzed

HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media

Pages 209 - 213

https://doi.org/10.1145/3511095.3536358

Published: 28 June 2022

Abstract

This study presents multiple strategies for automatically revealing undisclosed demographic attributes of authors in double-blind submissions. From a limited amount of text, around 100-200 words excerpted from an abstract, it aims to infer three pieces of information about the primary author: i) English language nativeness, ii) country of origin, and iii) gender. We introduce an annotated dataset of over 5600 articles labeled with the native language, country of origin, and gender of the primary authors. To determine these attributes, we employ classical machine learning (CML) algorithms with statistical n-gram features and fine-tuned transformer-based language models. The transformer-based models yield slightly better performance on all three tasks. They achieve macro F1 scores close to 75% for identifying the English language nativeness of the primary authors. For determining the country of non-native English authors (10-class classification), the fine-tuned transformer-based models obtain F1 scores of around 60%. For the gender prediction task, the transformer-based models attain F1 scores of 0.65 (65%). The experimental results demonstrate that fine-tuned language models and CML classifiers can disclose various author attributes with a level of accuracy sufficient to undermine the blindness of double-blind submission.
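
As an illustration of two ingredients named in the abstract, the following is a minimal sketch of word n-gram counting and of the macro F1 metric, using only the Python standard library. The abstract does not specify the tokenization, the n-gram range, or the classifiers, so the function names (word_ngrams, macro_f1), the whitespace tokenizer, and the default of unigrams plus bigrams are assumptions made for this example, not the author's implementation.

```python
from collections import Counter


def word_ngrams(text, n_max=2):
    """Count word n-grams (n = 1..n_max) in lowercased, whitespace-split text.

    Hypothetical helper: the tokenizer and n-gram range are assumptions.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for the binary nativeness task; not data from the paper.
    y_true = ["native", "native", "non-native", "non-native"]
    y_pred = ["native", "non-native", "non-native", "non-native"]
    print(word_ngrams("We employ classical models").most_common(3))
    print(round(macro_f1(y_true, y_pred), 4))
```

Because macro F1 averages per-class scores without weighting by class frequency, a minority class (for example, an under-represented country in the 10-class task) counts as much as a majority class.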

Supplementary Material

MP4 File (demography_scientific.mp4)

Presentation video (late-breaking results)


Published In

HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media

June 2022

272 pages

ISBN:9781450392334

DOI:10.1145/3511095

General Chairs: Alejandro Bellogín (Universidad Autonoma de Madrid, Spain), Ludovico Boratto (University of Cagliari, Italy)

Program Chair: Federica Cena (University of Torino, Italy)

Copyright © 2022 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Extended-abstract
Research
Refereed limited

Conference

HT '22

HT '22: 33rd ACM Conference on Hypertext and Social Media

June 28 - July 1, 2022

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%
