Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3511095.3536358acmconferencesArticle/Chapter ViewAbstractPublication PageshtConference Proceedingsconference-collections
extended-abstract

Revealing the Demographic Attributes of the Authors from the Abstracts of Scientific Articles

Published: 28 June 2022 Publication History
  • Get Citation Alerts
  • Abstract

    This study presents multiple strategies to automatically reveal undisclosed demographic attributes of the authors in the double-blind submissions. From a limited amount of textual content of around 100-200 words excerpted from an abstract, this study aims to reveal the following pieces of information, i) the English language nativeness of the primary author, ii) the country of origin of the primary author, and iii) the gender of the primary author. We introduce an annotated dataset of over 5600 articles labeled with the native language, country of origin, and gender information of the primary authors. We employ classical machine learning (CML) algorithms with statistical n-gram features and transformer-based fine-tuned language models to determine various demographic attributes. We observe that transformer-based models yield slightly better performances for all three tasks. The transformer-based models achieve macro F1 scores close to 75% for identifying the English language nativeness of the primary authors. To determine the country of the non-native English authors, the fine-tuned transformer-based models obtain F1 scores of around 60% (10-class classification). For the gender prediction task, we attain F1 scores of 0.65 by the transformer-based models. The experimental results demonstrate that the fine-tuned language models and CML classifiers are capable of disclosing various author attributes with an acceptable level of accuracy that can undermine the blindness of the double-blind submission.

    Supplementary Material

    MP4 File (demography_scientific.mp4)
    Presentation video (late-breaking results)
    MP4 File (demography_scientific.mp4)
    Presentation video (late-breaking results)

    References

    [1]
    Douglas Bagnall. 2015. Author identification using multi-headed recurrent neural networks. arXiv preprint arXiv:1506.04891(2015).
    [2]
    David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18, 2 (2014), 135–160.
    [3]
    Surajit Bhattacharya 2010. Authorship issue explained. Indian J Plast Surg 43, 2 (2010), 233–4.
    [4]
    Cornelia Caragea, Ana Uban, and Liviu P Dinu. 2019. The myth of double-blind review revisited: ACL vs. EMNLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2317–2327.
    [5]
    Stephen J Ceci and Douglas Peters. 1984. How blind is blind review?American Psychologist 39, 12 (1984), 1491.
    [6]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805(2018).
    [7]
    Gili Goldin, Ella Rabinovich, and Shuly Wintner. 2018. Native language identification with user generated content. In Proceedings of the 2018 conference on empirical methods in natural language processing. 3591–3601.
    [8]
    Shawndra Hill and Foster Provost. 2003. The myth of the double-blind review? Author identification using only citations. Acm Sigkdd Explorations Newsletter 5, 2 (2003), 179–184.
    [9]
    Graeme Hirst and Ol’ga Feiguina. 2007. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing 22, 4 (2007), 405–417.
    [10]
    Julian Hitschler, Esther Van Den Berg, and Ines Rehbein. 2017. Authorship attribution with convolutional neural networks and POS-eliding. In Proceedings of the Workshop on Stylistic Variation. 53–58.
    [11]
    Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and linguistic computing 17, 4 (2002), 401–412.
    [12]
    Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Determining an author’s native language by mining a text for errors. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 624–628.
    [13]
    Carole J Lee, Cassidy R Sugimoto, Guo Zhang, and Blaise Cronin. 2013. Bias in peer review. Journal of the American Society for Information Science and Technology 64, 1 (2013), 2–17.
    [14]
    Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365.html
    [15]
    Wen Li and Markus Dickinson. 2017. Gender prediction for Chinese social media data. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017. 438–445.
    [16]
    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692(2019).
    [17]
    Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. arXiv preprint arXiv:1107.4557(2011).
    [18]
    Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The ACL anthology network corpus. Language Resources and Evaluation 47, 4 (2013), 919–944.
    [19]
    Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: tracing stylometric evidence beyond topic and genre. In Proceedings of the fifteenth conference on computational natural language learning. 78–86.
    [20]
    Salim Sazzed. 2021. A Hybrid Approach of Opinion Mining and Comparative Linguistic Analysis of Restaurant Reviews. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). 1281–1288.
    [21]
    Salim Sazzed. 2022. Influence of Language Proficiency on the Readability of Review Text and Transformer-based Models for Determining Language Proficiency. (2022).
    [22]
    Andrew Tomkins, Min Zhang, and William D Heavlin. 2017. Reviewer bias in single-versus double-blind peer review. Proceedings of the National Academy of Sciences 114, 48(2017), 12708–12713.
    [23]
    Teja Tscharntke, Michael E Hochberg, Tatyana A Rand, Vincent H Resh, and Jochen Krauss. 2007. Author sequence and credit for contributions in multiauthored publications. PLoS biology 5, 1 (2007), e18.
    [24]
    Vered Volansky, Noam Ordan, and Shuly Wintner. 2015. On the features of translationese. Digital Scholarship in the Humanities 30, 1 (2015), 98–118.
    [25]
    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
    [26]
    Chuhan Wu, Fangzhao Wu, Tao Qi, Junxin Liu, Yongfeng Huang, and Xing Xie. 2019. Neural gender prediction in microblogging with emotion-aware user representation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2401–2404.

    Cited By

    View all
    • (2022)Stylometric and Semantic Analysis of Demographically Diverse Non-Native English Review DataProceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1109/ASONAM55673.2022.10068612(470-475)Online publication date: 10-Nov-2022

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    HT '22: Proceedings of the 33rd ACM Conference on Hypertext and Social Media
    June 2022
    272 pages
    ISBN:9781450392334
    DOI:10.1145/3511095
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2022

    Check for updates

    Author Tags

    1. article abstract
    2. authorship attribution
    3. demography prediction
    4. gender prediction
    5. native language identification
    6. peer review
    7. scientific publication

    Qualifiers

    • Extended-abstract
    • Research
    • Refereed limited

    Conference

    HT '22
    Sponsor:
    HT '22: 33rd ACM Conference on Hypertext and Social Media
    June 28 - July 1, 2022
    Barcelona, Spain

    Acceptance Rates

    Overall Acceptance Rate 378 of 1,158 submissions, 33%

    Upcoming Conference

    HT '24
    35th ACM Conference on Hypertext and Social Media
    September 10 - 13, 2024
    Poznan , Poland

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 09 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Stylometric and Semantic Analysis of Demographically Diverse Non-Native English Review DataProceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1109/ASONAM55673.2022.10068612(470-475)Online publication date: 10-Nov-2022

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media