A Self-Supervised Representation Learning of Sentence Structure for Authorship Attribution

Published: 08 January 2022

Abstract

The syntactic structure of sentences in a document conveys substantial information about its author's writing style. Sentence representation learning has been widely explored in recent years and has been shown to improve generalization on downstream tasks across many domains. Although probing studies suggest that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models on authorship attribution. These observations motivated us to investigate explicit representation learning of the syntactic structure of sentences. In this article, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components: a lexical sub-network and a syntactic sub-network, which take the sequence of words and their corresponding structural labels as input, respectively. Because of the n-to-1 mapping of words to structural labels, each word is embedded into a vector representation that mainly carries structural information. We evaluate the learned structural representations of sentences on different probing tasks and subsequently use them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve classification performance when concatenated with existing pre-trained word embeddings.
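To make the two-branch design concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the GRU encoders, the layer sizes, and the matched-pair training objective (scoring whether a word sequence and a structural-label sequence belong to the same sentence) are illustrative assumptions, since the abstract only states that a lexical sub-network and a syntactic sub-network consume the words and their structural labels, respectively.

import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    # Embeds an ID sequence (words or structural labels) and encodes it into one vector.
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, ids):                     # ids: (batch, seq_len)
        _, h = self.rnn(self.embed(ids))        # h: (2, batch, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden_dim)

class SelfSupervisedStructureModel(nn.Module):
    # Lexical branch reads word IDs; syntactic branch reads structural-label IDs.
    # The "do these two sequences describe the same sentence?" scorer is an assumed
    # self-supervised objective for illustration only.
    def __init__(self, word_vocab, label_vocab, hidden_dim=256):
        super().__init__()
        self.lexical = BranchEncoder(word_vocab, hidden_dim=hidden_dim)
        self.syntactic = BranchEncoder(label_vocab, hidden_dim=hidden_dim)
        self.scorer = nn.Linear(4 * hidden_dim, 1)

    def forward(self, word_ids, label_ids):
        z_lex = self.lexical(word_ids)
        z_syn = self.syntactic(label_ids)
        return self.scorer(torch.cat([z_lex, z_syn], dim=-1)).squeeze(-1)

# Toy usage: positive pairs are a sentence with its own structural-label sequence;
# negatives would pair it with the labels of a different sentence.
model = SelfSupervisedStructureModel(word_vocab=10000, label_vocab=50)
words = torch.randint(1, 10000, (4, 20))   # batch of 4 sentences, 20 tokens each
labels = torch.randint(1, 50, (4, 20))     # corresponding structural labels (e.g., POS tags)
loss = nn.BCEWithLogitsLoss()(model(words, labels), torch.ones(4))
loss.backward()

# Because many words map onto the same structural label (n-to-1), the trained
# lexical embedding table (model.lexical.embed) mainly captures structural
# information; its vectors can be concatenated with pre-trained word embeddings
# for the downstream authorship attribution classifier.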

Cited By

  • (2024) Automatic authorship attribution in Albanian texts. PLOS ONE 19, 10 (e0310057). DOI: 10.1371/journal.pone.0310057. Online publication date: 22-Oct-2024.
  • (2024) Knowledge Graph-Based Hierarchical Text Semantic Representation. International Journal of Intelligent Systems 2024. DOI: 10.1155/2024/5583270. Online publication date: 12-Jan-2024.
  • (2024) Understanding writing style in social media with a supervised contrastively pre-trained transformer. Knowledge-Based Systems 296, C. DOI: 10.1016/j.knosys.2024.111867. Online publication date: 19-Jul-2024.


    Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 4
    August 2022
    529 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/3505210

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 January 2022
    Accepted: 01 October 2021
    Revised: 01 June 2021
    Received: 01 October 2020
    Published in TKDD Volume 16, Issue 4

    Author Tags

    1. Sentence representation
    2. sentence structure
    3. neural network
    4. self-supervised learning
    5. authorship attribution

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Crystal Photonics Inc (CPI)

