survey

Text Analysis in Adversarial Settings: Does Deception Leave a Stylistic Trace?

Published: 18 June 2019

    Abstract

    Textual deception constitutes a major problem for online security. Many studies have argued that deceptiveness leaves traces in writing style, which could be detected using text classification techniques. By conducting an extensive literature review of existing empirical work, we demonstrate that while certain linguistic features have been indicative of deception in certain corpora, they fail to generalize across divergent semantic domains. We suggest that deceptiveness as such leaves no content-invariant stylistic trace, and textual similarity measures provide a superior means of classifying texts as potentially deceptive. Additionally, we discuss forms of deception beyond semantic content, focusing on hiding author identity by writing style obfuscation. Surveying the literature on both author identification and obfuscation techniques, we conclude that current style transformation methods fail to achieve reliable obfuscation while simultaneously ensuring semantic faithfulness to the original text. We propose that future work in style transformation should pay particular attention to disallowing semantically drastic changes.

    References

    [1]
    Ahmed Abbasi and Hsinchun Chen. 2008. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information and System Security 26, 2 (2008), 1--29.
    [2]
    Sadia Afroz, Michael Brennan, and Rachel Greenstadt. 2012. Detecting hoaxes, frauds, and deception in writing style online. In Proceedings of the 2012 IEEE Symposium on Security and Privacy. 461--475.
    [3]
    Sadia Afroz, Aylin Caliskan-Islam, Ariel Stolerman, Rachel Greenstadt, and Damon McCoy. 2014. Doppelgänger finder: Taking stylometry to the underground. In Proceedings of the 2014 IEEE Symposium on Security and Privacy. 212--226.
    [4]
    Mishari Almishari, Ekin Oguz, and Gene Tsudik. 2014. Fighting authorship linkability with crowdsourcing. In Proceedings of the 2nd ACM Conference on Online Social Networks. 69--82.
    [5]
    Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Communcations of the ACM 52, 2 (2009), 119--123.
    [6]
    Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion. 759--760.
    [7]
    Douglas Bagnall. 2015. Author identification using multi-headed recurrent neural networks. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum.
    [8]
    Jorge Baptista, Sandra Lourenco, and Nuno Mamede. 2016. Automatic generation of exercises on passive transformation in Portuguese. In IEEE Congress on Evolutionary Computation (CEC). 4965--4972.
    [9]
    Daniel Bennett. 2011. A ‘gay girl in Damascus’, the Mirage of the ‘authentic voice’ - and the future of journalism. In Mirage in the Desert? Reporting the Arab Spring, Richard Lance Keeble and John Mair (Eds.). Abramis, Bury St. Edmunds, 187--195.
    [10]
    Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, Edward Finegan, and Randolph Quirk. 1999. Longman Grammar of Spoken and Written English, Vol. 2. Pearson Education, Harlow.
    [11]
    Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions.
    [12]
    Bernard Bloch. 1948. A set of postulates for phonemic analysis. Language 24, 1 (1948), 3--46.
    [13]
    Charles F. Bond. and Bella M. DePaulo. 2011. Accuracy of deception judgments. Personality and Social Psychology Review 10, 3 (2011), 214--234.
    [14]
    Michael Brennan, Sadia Afroz, and Rachel Greenstadt. 2011. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security 15, 3 (2011).
    [15]
    Michael Brennan and Rachel Greenstadt. 2009. Practical attacks against authorship recognition techniques. In Proceedings of the 21st Conference on Innovative Applications of Artificial Intelligence. 60--65.
    [16]
    Marcelo Luiz Brocardo, Issa Traore, Isaac Woungang, and Mohammad S. Obaidat. 2017. Authorship verification using deep belief network systems. Communication Systems 30, 12 (2017).
    [17]
    David B. Buller and Judee K. Burgoon. 1996. Interpersonal deception theory. Communication Theory 6, 3 (1996), 203--242.
    [18]
    David B. Buller, Judee K. Burgoon, Aileen Buslig, and James Roiger. 1996. Testing interpersonal deception theory: The language of interpersonal deception. Communication Theory 6, 3 (1996), 268--289.
    [19]
    Judee K. Burgoon, J. P. Blair, Tiantian Qin, and Jay F. Nunamaker, Jr. 2003. Detecting deception through linguistic analysis. In Proceedings of the 1st NSF/NIJ Conference on Intelligence and Security Informatics. Springer-Verlag, Berlin, 91--101.
    [20]
    Pete Burnap and Matthew L. Williams. 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy 8 Internet 7, 2 (2015), 223--242.
    [21]
    John F. Burrows. 1987. Word patterns and story shapes: The statistical analysis of narrative style. Literary and Linguistic Computing 2 (1987), 61--70.
    [22]
    Aylin Caliskan and Rachel Greenstadt. 2012. Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text. In IEEE Sixth International Conference on Semantic Computing (ICSC). 121--125.
    [23]
    Erik Cambria, Praphul Chandra, Avinash Sharma, and Amir Hussain. 2010. Do not feel the trolls. In Proceedings of the 3rd International Workshop on Social Data on the Web (SDoW’10).
    [24]
    Erik Cambria and Amir Hussain. 2015. Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis. Springer International Publishing, Cham.
    [25]
    Antonio Castro and Brian Lindauer. 2013. Author identification on Twitter. In 3rd IEEE International Conference on Data Mining. 705--708.
    [26]
    Chih-Chung Chang and Chih-Jen Lin. 2011. Libsvm: A library for support vector machines. Transactions on Intelligent Systems and Technology 3, 2 (2011), 27.
    [27]
    Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Proceedings of the 2012 International Conference on Privacy, Security, Risk and Trust and of the 2012 International Conference on Social Computing (PAS-SAT/SocialCom’12). Amsterdam, 71--80.
    [28]
    Cindy K. Chung and James W. Pennebaker. 2007. The psychological functions of function words. In Frontiers of Social Psychology: Social Communication, Klaus Fiedler (Ed.). Psychology Press, New York, 343--359.
    [29]
    Jonathan H. Clark and Charles J. Hannon. 2007. A classifier system for author recognition using synonym-based features. In Lecture Notes in Computer Science, Vol. 4827, Alexander Gelbukh and Ángel Fernando Kuri Morales (Eds.). Springer, 839--849.
    [30]
    Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’17). 670--680.
    [31]
    Malcolm Coulthard. 2004. Author identification, idiolect and linguistic uniqueness. Applied Linguistics 25, 4 (2004), 431--447.
    [32]
    Erin Smith Crabb. 2014. “Time for some traffic problems”: Enhancing e-discovery and big data processing tools with linguistic methods for deception detection. Journal of Digital Forensics, Security and Law 9, 2 (2014).
    [33]
    Walter Daelemans. 2013. Explanation in computational stylometry. In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing’13). 451--462.
    [34]
    Robert Dale and Ehud Reiter. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.
    [35]
    Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the 11th Conference on Web and Social Media. 512--515.
    [36]
    Siobahn Day, James Brown, Zachery Thomas, India Gregory, Lowell Bass, and Gerry Dozier. 2016. Adversarial authorship, AuthorWebs, and entropy-based evolutionary clustering. In 25th International Conference on Computer Communication and Networks (ICCCN’16). 1--6.
    [37]
    Olivier de Vel, Alison Anderson, Malcolm Corney, and George Mohay. 2001. Mining e-mail content for author identification forensics. ACM Sigmod Record 30, 4 (2001), 55--64.
    [38]
    Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the 9th Workshop on Statistical Machine Translation. 376--380.
    [39]
    Bella M. DePaulo, James J. Lindsay, Brian E. Malone, Laura Muhlenbruck, Kelly Charlston, and Harris Cooper. 2003. Cues to deception. Psychological Bulletin 129, 1 (2003), 74--118.
    [40]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018).
    [41]
    Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, and Rosalind Picard. 2012. Common sense reasoning for detection, prevention, and mitigation of cyberbullying. ACM Transactions on Interactive Intelligence Systems 2, 3 (2012), 18:1--18:30.
    [42]
    Susan T. Dumais. 2004. Latent semantic analysis. Annual Review of Information Science and Technology 38, 1 (2004), 188--230.
    [43]
    Maciej Eder, Jan Rybicki, and Mike Kestemont. 2016. Stylometry with R: A package for computational text analysis. The R Journal 8, 1 (2016), 107--121.
    [44]
    Paul Ekman. 1985. Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage. Norton, New York.
    [45]
    Paul Ekman and Wallace V. Friesen. 1969. Nonverbal leakage and clues to deception. Psychiatry 32, 1 (1969), 88.
    [46]
    Paul. Ekman and Maureen O’Sullivan. 1991. Who can catch a liar? American Psychologist 46, 9 (1991), 913--920.
    [47]
    Frank Enos, Elizabeth Shriberg, Martin Graciarena, Julia Hirschberg, and Andreas Stolcke. 2007. Detecting deception using critical segments. In Proceedings of Interspeech. 1621--1624.
    [48]
    Iqbal Farkhund, Hamad Binsalleeh, Benjamin C. M. Fung, and Mourad Debbabi. 2013. Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation 7, 1--2 (2013), 56--64.
    [49]
    Christiane Fellbaum (Ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge.
    [50]
    Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Syntactic stylometry for deception detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 171--175.
    [51]
    Eileen Fitzpatrick and Joan Bachenko. 2009. Building a forensic corpus to test language-based indicators of deception. Language and Computers 71, 1 (2009), 183--196.
    [52]
    Eileen Fitzpatrick, Joan Bachenko, and Tommaso Fornaciari. 2015. Automatic Detection of Verbal Deception. Morgan 8 Claypool.
    [53]
    Douwe Fokkema and Elrud Ibsch. 1987. Modernist Conjectures. A Mainstream in European Literature. Hurst, London.
    [54]
    Patxi Galán-García, José Gaviria de la Puerta, Carlos Laorden Gómez, Igor Santos, and Pablo García Bringas. 2014. Supervised machine learning for the detection of troll profiles in Twitter social network: Application to a real case of cyberbullying. In International Joint Conference of Advances in Intelligent Systems and Computing. 419--428.
    [55]
    Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’13). 758--764.
    [56]
    Zhenhao Ge and Yufang Sun. 2016. Domain specific author attribution based on feedforward neural network language models. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods (ICPRAM’16). 597--604.
    [57]
    Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering 10, 4 (2015), 215--230.
    [58]
    Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57, 1 (2016), 345--420.
    [59]
    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS). 2672--2680.
    [60]
    Paul Grice. 1989. Studies in the Way of Words. Harvard University Press, Cambridge/London.
    [61]
    Jack Grieve. 2007. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing 22, 3 (2007).
    [62]
    Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All you need is “love”: Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security (AISec’11). 2--12.
    [63]
    Jeffrey T. Hancock, Lauren E. Curry, Saurabh Goorhaand, and Michael Woodworth. 2008. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes 45, 1 (2008), 1--23.
    [64]
    David M. Markowitz and Jeffrey T. Hancock. 2014. Linguistic traces of a scientific fraud: The case of Diederik Stapel. PLoS One 9, 8 (2014).
    [65]
    Graeme Hirst and Olga Feiguina. 2007. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing 22, 4 (2007), 405--417.
    [66]
    Dirk Hovy. 2016. The enemy in your own camp: How well can we detect statistically-generated fake reviews -- An adversarial study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 351--356.
    [67]
    Jeremy Howard and Sebastian Ruder. 2018. Fine-tuned language models for text classification. CoRR abs/1801.06146 (2018). arxiv:1801.06146 http://arxiv.org/abs/1801.06146.
    [68]
    Nan Hu, Indranil Bose, Noi Sian Koh, and Ling Liu. 2012. Manipulation of online reviews: An analysis of ratings, readability, and sentiments. Decision Support Systems 52, 3 (2012), 674--684.
    [69]
    Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning (ICML’17). 1587--1596.
    [70]
    Timothy Jay and Kirstin Janschewitz. 2008. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture 4, 2 (2008), 267--288.
    [71]
    Patrick Juola. 2004. Ad-hoc authorship attribution competition. In Proceedings of the 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ALLC/ACH’04).
    [72]
    Patrick Juola. 2009. JGAAP: A system for comparative evaluation of authorship attribution. Journal of Digital Humanities and Computer Science 1, 1 (2009).
    [73]
    Patrick Juola. 2012. Large-scale experiments in authorship attribution. English Studies 93, 3 (2012), 275--283.
    [74]
    Patrick Juola. 2013. Stylometry and immigration: A case study. Journal of Law and Policy 21, 2 (2013), 287--298.
    [75]
    Patrick Juola, John Sofko, and Patrick Brennan. 2006. A prototype for authorship attribution studies. Literary and Linguistic Computing 2, 21 (2006), 169--178.
    [76]
    Patrick Juola and Darren Vescovi. 2010. Empirical evaluation of authorship obfuscation using JGAAP. In Proceedings of the 3rd ACM Workshop on Artificial Intelligence and Security (AISec’10). 14--18.
    [77]
    Mika Juuti, Bo Sun, Tatsuya Mori, and N. Asokan. 2018. Stay on-topic: Generating context-specific fake restaurant reviews. In Proceedings of the 23rd European Symposium on Research in Computer Security (ESORICS’18). 132--151.
    [78]
    Gary Kacmarcik and Michael Gamon. 2006. Obfuscating document stylometry to preserve author anonymity. In Proceedings of COLING/ACL: Poster Sessions. 444--451.
    [79]
    Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, and Preslav Nakov. 2017. The case for being average: A mediocrity approach to style masking and author obfuscation. In International Conference of the Cross-Language Evaluation Forum for European Languages. 173--185.
    [80]
    Parambir S. Keila and David B. Skillicorn. 2005. Detecting unusual and deceptive communication in email. In Centers for Advanced Studies Conference. 17--20.
    [81]
    Yashwant Keswani, Harsh Trivedi, Parth Mehta, and Prasenjit Majumde. 2016. Author masking through translation - Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop -- Working Notes Papers, Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald (Eds.).
    [82]
    Foaad Khosmood and Robert Levinson. 2008. Automatic natural language style classification and transformation. In Proceedings of the 2008 BCS-IRSG Conference on Corpus Profiling. 3.
    [83]
    Foaad Khosmood and Robert Levinson. 2009. Toward automated stylistic transformation of natural language text. In Proceedings of the Digital Humanities. 177--181.
    [84]
    Foaad Khosmood and Robert Levinson. 2010. Automatic synonym and phrase replacement show promise for style transformation. In Proceedings of the 9th International Conference on Machine Learning and Applications. 958--961.
    [85]
    Bryan Klimt and Yiming Yang. 2004. The Enron corpus: A new dataset for email classification research. In Machine Learning: 15th European Conference on Machine Learning (ECML’04). 217--226.
    [86]
    Mark Knapp and Mark Comaden. 1979. Telling it like it isn’t: A review of theory and research on deceptive communications. Human Communication Research 5, 3 (1979), 270--285.
    [87]
    Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit. 79--86.
    [88]
    Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, and Richard Zens. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. 177--180.
    [89]
    Moshe Koppel and Jonathan Schler. 2004. Authorship verification as a one-class classification problem. In Proceedings of the T21st International Conference on Machine Learning. 489--495.
    [90]
    Olga V. Kukushkina, Anatoly A. Polikarpov, and Dimitry V. Khmelev. 2001. Using literal and grammatical statistics for authorship attribution. Problems of Information Transmission 37, 2 (2001), 172--184.
    [91]
    William Labov. 1984. Field methods of the project in linguistic change and variation. In Language in Use: Readings in Sociolinguistics, John Baugh and Joel Sherzer (Eds.). Prentice Hall, Englewood Cliffs, 28--66.
    [92]
    J. Clayton Lafferty and Patrick M. Eady. 1974. The Desert Survival Problem. Experimental Learning Methods, Plymouth, Michigan.
    [93]
    David F. Larcker and Anastasia A. Zakolyukina. 2012. Detecting deceptive discussions in conference calls. Journal of Accounting Research 50, 2 (2012), 495--540.
    [94]
    Raymond Y. K. Lau, S. Y. Liao, Ron Chi-Wai Kwok, Kaiquan Xu, Yunqing Xia, and Yuefeng Li. 2011. Text mining and probabilistic language modeling for online review spam detecting. ACM Transactions on Management Information Systems 4, 2 (2011), 1--30.
    [95]
    Chih-Chen Lee, Robert B. Welker, and Marcus D. Odom. 2009. Features of computer-mediated, text-based messages that support automatable, linguistics-based indicators for deception detection. Journal of Information Systems 23, 1 (2009), 5--24.
    [96]
    Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine code from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation. 24--26.
    [97]
    Jiwei Li, Myle Ott, and Claire Cardie. 2013. Identifying manipulated offerings on review portals. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP’13). 18--21.
    [98]
    Jiwei Li, Myle Ott, Claire Cardie, and Eduard Hovy. 2014. Towards a general rule for identifying deceptive opinion spam. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL’14). 1566--1576.
    [99]
    Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 3865--3878.
    [100]
    Max Louwerse, K. Lin, A. Drescher, and Gün Semin. 2010. Linguistic cues predict fraudulent events in a corporate social network. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society. 961--966.
    [101]
    Max M. Louwerse. 2004. Semantic variation in idiolect and sociolect: Corpus linguistic evidence from literary texts. Computers and the Humanities 38, 2 (2004), 207--221.
    [102]
    Daniel Lowd and Christopher Meek. 2005. Good word attacks on statistical spam filters. In Proceedings of the 2nd Conference on Email and Anti-Spam (CEAS’05).
    [103]
    Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP’15). 1412--1421.
    [104]
    Wicenty Lutoslawski. 1898. Principes de stylometrie. E. Leroux.
    [105]
    Nathan Mack, Jasmine Bowers, Henry Williams, Gerry Dozier, and Joseph Shelton. 2015. The best way to a strong defense is a strong offense: Mitigating deanonymization attacks via iterative language translation. International Journal of Machine Learning and Computing 5, 5 (2015), 409--413.
    [106]
    Nitin Madnani and Bonnie Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Journal of Computational Linguistics 36, 3 (2010), 341--387.
    [107]
    Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari. 2016. Author obfuscation using WordNet and language models -- Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop -- Working Notes Papers, Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald (Eds.).
    [108]
    Yuval Marton, Ning Wu, and Lisa Hellerstein. 2005. On compression-based text classification. In Advances in Information Retrieval. 300--314.
    [109]
    Andrew W. E. McDonald, Sadia Afroz, Aylin Caliskan, Ariel Stolerman, and Rachel Greenstadt. 2012. Use fewer instances of the letter i: Toward writing style anonymization. In Privacy Enhancing Technologies (PETS). 299--318.
    [110]
    Gerald R. McMenamin and Dongdoo Choi. 2002. Forensic Linguistics: Advances in Forensic Stylistics. CRC Press, London.
    [111]
    Yashar Mehdad and Joel Tetreault. 2016. Do characters abuse more than words? In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 299--303.
    [112]
    Thomas Corwin Mendenhall. 1887. The characteristic curves of composition. Science IX (1887), 237--49.
    [113]
    Rada Mihalcea and Carlo Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. 309--312.
    [114]
    Todor Mihaylov, Georgi D. Georgiev, and Preslav Nakov. 2015. Finding opinion manipulation trolls in news community forums. In Proceedings of the 19th Conference on Computational Language Learning. 310--314.
    [115]
    Todor Mihaylov and Preslav Nakov. 2016. Hunting for troll comments in news community forums. In The 54th Annual Meeting of the Association for Computational Linguistics (ACL). 399--405.
    [116]
    Tsvetomila Mihaylova, Georgi Karadjov, Preslav Nakov, Yasen Kiprov, Georgi Georgiev, and Ivan Koychev. 2016. SU@PAN’2016: Author obfuscation -- Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop -- Working Notes Papers, Krisztian Balog, Linda Cappellato, Nicola Ferro, and Craig Macdonald (Eds.).
    [117]
    George A. Miller. 1995. WordNet: A lexical database for english. Communications of the ACM 38, 11 (1995), 39--41.
    [118]
    Frederick Mosteller and David L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Addison-Wesley.
    [119]
    Arjun Mukherjee, Vivek Venkataraman, Bing Liu, and Nathan S. Glance. 2013. What Yelp fake review filter might be doing? In Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM’13). 409--418.
    [120]
    Shashi Narayan and Claire Gardent. 2014. Hybrid simplification using deep semantics and machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL’14). 435--445.
    [121]
    Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. 2012. On the feasibility of internet-scale author identification. In Proceedings of the 2012 IEEE Symposium on Security and Privacy. 300--314.
    [122]
    Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano, and Masayuki Takeda. 2007. Unsupervised spam detection based on string alienness measures. Discovery Science (2007), 161--172.
    [123]
    Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. 2017. Surveying stylometry techniques and applications. ACM Computing Surveys 50, 6 (2017), 86:1--86:36.
    [124]
    Matthew L. Newman, James W. Pennebaker, Diane S. Berry, and Jane M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin 29, 5 (2003), 665--675.
    [125]
    Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web. 145--153.
    [126]
    Ray Oshikawa, Jing Qian, and William Yang Wang. 2018. A survey on natural language processing for fake news detection. CoRR abs/1811.00770 (2018).
    [127]
    Ricardo Otheguy, Ofelia García, and Wallis Reid. 2015. Clarifying translanguaging and deconstructing named languages: A perspective from linguistics. Applied Linguistics Review 6, 3 (2015), 281--307.
    [128]
    Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11). 309--319.
    [129]
    Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Hancock. 2013. Negative deceptive opinion spam. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’13). 497--501.
    [130]
    Rebekah Overdorf and Rachel Greenstadt. 2016. Blogs, Twitter feeds, and reddit comments: Cross-domain authorship attribution. In Proceedings on Privacy Enhancing Technologies. 155--171.
    [131]
    James W. Pennebaker, Roger J. Booth, and Martha E. Francis. 2007. Linguistic Inquiry and Word Count (LIWC’07). Technical Report. LIWC.net, Austin, Texas.
    [132]
    Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2018. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics. 3391--3401.
    [133]
    Juan-Pablo Posadas-Duran, Grigori Sidorov, and Ildar Batyrshin. 2014. Complete syntactic N-grams as style markers for authorship attribution. In Human-Inspired Computing and Its Applications, Alexander Gelbukh, Félix Castro Espinoza, and Sofía N. Galicia-Haro (Eds.). Springer International Publishing, Cham, 9--17.
    [134]
    Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer, Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias Hagen. 2016. Who wrote the web? Revisiting influential author identification research applicable to information retrieval. In Advances in Information Retrieval, Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello (Eds.). Springer International Publishing, 393--407.
    [135]
    Martin Potthast, Matthias Hagen, and Benno Stein. 2016. Author obfuscation: Attacking the state of the art in authorship verification. In CLEF 2016 Working Notes.
    [136]
    Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers). 866--876.
    [137]
    Ella Rabinovich, Shachar Mirkin, Raj Nath Patel, Lucia Specia, and Shuly Wintner. 2016. Personalized machine translation: Preserving original author traits. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 1074--1084.
    [138]
    Roshan Ragel, Pramod Herath, and Upul Senanayake. 2013. Authorship detection of SMS messages using unigrams. In Proceedings of the 8th IEEE International Conference on Industrial and Information Systems. 387--392.
    [139]
    Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of the ACL 2010 Conference Short Papers (ACLShort’10). 38--42.
    [140]
    Ganesh Ramakrishnan, B. Prithviraj, and Pushpak Bhattacharyya. 2004. A gloss centered algorithm for word sense disambiguation. In Proceedings of the ACL SENSEVAL. 217--221.
    [141]
    Congzhou He Ramyaa and Khaled Rasheed. 2004. Using machine learning techniques for stylometry. In Proceedings of the International Conference on Artificial Intelligence (IC-AI’04), Vol. 2. 897--903.
    [142]
    Josyula R. Rao and Pankaj Rohatgi. 2000. Can pseudonymity really guarantee privacy? In 9th USENIX Security Symposium, Steven M. Bellovin and Gregory G. (Eds.).
    [143]
    Sudha Rao and Joel R. Tetreault. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 129--140.
    [144]
    Shebuti Rayana and Leman Akoglu. 2015. Collective opinion spam detection: Bridging review networks and metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’15), Steven M. Bellovin and Gregory G. (Eds.).
    [145]
    Paul Rayson, Andrew Wilson, and Geoffrey Leech. 2001. Grammatical word class variation within the British national corpus sampler. Language and Computers 36, 1 (2001), 295--306.
    [146]
    Tapani Rinta-Kahila and Wael Soliman. 2017. Understanding crowdturfing: The different ethical logics behind the clandestine industry of deception. In Proceedings of the 25th European Conference on Information Systems (ECIS’17). 1934--1949.
    [147]
    Victoria L. Rubin. 2017. Deception detection and rumor debunking for social media. In The SAGE Handbook of Social Media Research Methods, Luke Sloan and Anabel Quan-Haase (Eds.). SAGE, London.
    [148]
    Joseph Rudman. 1998. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities 31, 4 (1998), 351--365.
    [149]
    Joseph Rudman. 2010. The state of non-traditional authorship studies - 2010: Some problems and solutions. In Proceedings of the Digital Humanities. 217--219.
    [150]
    Edward Sapir. 1927. Speech as a personality trait. American Journal of Sociology 32, 6 (1927), 892--905.
    [151]
    Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 76--85.
    [152]
    Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the 5th International Workshop on Natural Language Processing for Social Media. 1--10.
    [153]
    Chun Wei Seah, Hai Leong Chieu, Kian Ming A. Chai, Loo-Nin Teow, and Lee Wei Yeong. 2015. Troll detection by domain-adapting sentiment analysis. In Proceedings of the 18th International Conference on Information Fusion. 792--799.
    [154]
    Gün R. Semin and Klaus Fiedler. 1991. The linguistic category model, its bases, applications and range. European Review of Social Psychology 2, 1 (1991), 1--30.
    [155]
    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’16). 35--40.
    [156]
    Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Proceedings of Neural Information Processing Systems (NIPS’17).
    [157]
    Rakshith Shetty, Bernt Schiele, and Mario Fritz. 2018. A4NT: Author attribute anonymity by adversarial training of neural machine translation. In Proceedings of the 27th USENIX Security Symposium (USENIX Security’18). 1633--1650.
    [158]
    Advaith Siddharthan. 2010. Complex lexico-syntactic reformulation of sentences using typed dependency representations. In Proceedings of the 6th International Natural Language Generation Conference. 125--133.
    [159]
    Advaith Siddharthan. 2011. Text simplification using typed dependencies: A comparision of the robustness of different generation strategies. In Proceedings of the 13th European Workshop on Natural Language Generation. 2--11.
    [160]
    Edgar A. Smith and R. J. Senter. 1967. Automated Readability Index. Technical Report AMRL-TR-66-22. Aerospace Medical Division, Wright-Patterson AFB, Ohio.
    [161]
    Thamar Solorio, Ragib Hasan, and Mainul Mizan. 2013. A case study of sockpuppet detection in Wikipedia. In Proceedings of the Workshop on Language in Social Media. 59--68.
    [162]
    Dan Sperber and Deirdre Wilson. 1995. Relevance: Communication and Cognition (2nd ed.). Blackwell Publishers, Oxford/Cambridge.
    [163]
    Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60, 3 (2009), 538--556.
    [164]
    K. Surendran, O. P. Harilal, Hrudya Poroli, Prabaharan Poornachandran, and N. K. Suchetha. 2017. Stylometry detection using deep learning. In Computational Intelligence in Data Mining. 749--757.
    [165]
    Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29, 1 (2010), 24--54.
    [166]
    Catalina L. Toma and Jeffrey T. Hancock. 2012. What lies beneath: The linguistic traces of deception in online dating profiles. Journal of Communication 62, 1 (2012), 78--97.
    [167]
    Takashi Uemura, Daisuke Ikeda, Takuya Kida, and Hiroki Arimura. 2011. Unsupervised spam detection by document probability estimation with maximal overlap method. Information and Media Technologies 6, 1 (2011), 231--240.
    [168]
    Hans van Halteren, R. Harald Baayen, Fiona Tweedie, Marco Haverkort, and Anneke Neijt. 2005. New machine learning methods demonstrate the existence of a human stylome. Journal of Quantitative Linguistics 12, 1 (2005), 65--77.
    [169]
    Gang Wang, Christo Wilson, Xiaohan Zhao, Yibo Zhu, Manish Mohanlal, Haitao Zheng, and Ben Y. Zhao. 2012. Serf and turf: Crowdturfing for fun and profit. In Proceedings of the 21st International Conference on World Wide Web (WWW’12). 679--688.
    [170]
    Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop. 88--93.
    [171]
    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144.
    [172]
    Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL’94). 133--138.
    [173]
    Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL’12). 1015--1024.
    [174]
    Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web. 1391--1399.
    [175]
    Jun-Ming Xu, Xiaojin Zhu, and Amy Bellmore. 2012. Fast learning for sentiment analysis on bullying. In Proceedings of the International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM’12). 1--6.
    [176]
    Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, and Colin Cherry. 2012. Paraphrasing for style. In Proceedings of COLING. 2899--2914.
    [177]
    Yinqing Xu, Bei Shi, Wentao Tian, and Wai Lam. 2015. A unified model for unsupervised opinion spamming detection incorporating text generality. In Proceedings of the 24th International Conference on Artificial Intelligence. 725--731.
    [178]
    Yuanshun Yao, Bimal Viswanath, Jenna Cryan, Haitao Zheng, and Ben Y. Zhao. 2017. Automated crowdturfing attacks and defenses in online review systems. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS’17). 1143--1158.
    [179]
    Kyung-Hyan Yoo and Ulrike Gretzel. 2009. Comparison of deceptive and truthful travel reviews. In Information and Communication Technologies in Tourism 2009: Proceedings of the International Conference. Springer Verlag, Vienna, 37--47.
    [180]
    Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2017. Recent trends in deep learning based natural language processing. CoRR abs/1708.02709 (2017).
    [181]
    Ziqi Zhang, David Robinson, and Jonathan Tepper. 2018. Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In Proceedings of ESWC. 745--760.
    [182]
    Ying Zhao and Justin Zobel. 2005. Effective and scalable authorship attribution using function words. In Information Retrieval Technology. 174--189.
    [183]
    Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. 2006. A framework of authorship identification for online messages: Writing style features and classification techniques. Journal of the American Society for Information Science and Technology 57, 3 (2006), 378--393.
    [184]
    Lina Zhou, Judee K. Burgoon, Jay F. Nunamaker Jr, and Doug P. Twitchell. 2004. Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication. Group Decision and Negotiation 13 (2004), 81--106.
    [185]
    Lina Zhou, Judee K. Burgoon, Doug P. Twitchell, Tiantian Qin, and Jay F. Nunamaker Jr. 2004. A comparison of classification methods for predicting deception in computer-mediated communication. Journal of Management Information Systems 20 (2004), 139--163.
    [186]
    Lina Zhou, Judee K. Burgoon, and Douglas P. Twitchell. 2010. A longitudinal analysis of language behavior of deception in e-mail. In Intelligence and Security Informatics, Hsinchun Chen, Richard Miranda, Daniel D. Zeng, Chris Demchak, Jenny Schroeder, and Therani Madhusudan (Eds.). Springer Verlag, Berlin, 102--110.
    [187]
    Lina Zhou, Douglas P. Twitchell, Tiantian Qin, Judee K. Burgoon, and Jay F. Nunamaker Jr. 2003. An exploratory study into deception detection in text-based computer mediated communication. In Proceedings of the 36th Hawaii International Conference on Systems Science.
    [188]
    Yan Zhou, Zach Jorgensen, and W. Meador Inge. 2007. Combating good word attacks on statistical spam filters with multiple instance learning. In 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’07). 298--305.

    Published In

    ACM Computing Surveys  Volume 52, Issue 3
    May 2020
    734 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/3341324
    • Editor: Sartaj Sahni
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 June 2019
    Accepted: 01 January 2019
    Revised: 01 January 2019
    Received: 01 September 2017
    Published in CSUR Volume 52, Issue 3

    Author Tags

    1. Stylometry
    2. author identification
    3. deanonymization
    4. deception
    5. text obfuscation

    Qualifiers

    • Survey
    • Research
    • Refereed

    Cited By

    • (2024) LLMs for Explainable Few-shot Deception Detection. Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics (37-47). DOI: 10.1145/3643651.3659898. Online publication date: 21-Jun-2024
    • (2024) Unveiling Deception in Arabic: Optimization of Deceptive Text Detection Across Formal and Informal Genres. IEEE Access 12 (94216-94230). DOI: 10.1109/ACCESS.2024.3424531. Online publication date: 2024
    • (2024) Classifying deceptive reviews for the cultural heritage domain: A lexicon-based approach for the Italian language. Expert Systems with Applications 252 (124131). DOI: 10.1016/j.eswa.2024.124131. Online publication date: Oct-2024
    • (2024) Reframing and Broadening Adversarial Stylometry for Academic Integrity. Second Handbook of Academic Integrity (1467-1485). DOI: 10.1007/978-3-031-54144-5_148. Online publication date: 18-Feb-2024
    • (2023) Implementation of a Multi-Approach Fake News Detector and of a Trust Management Model for News Sources. IEEE Transactions on Services Computing 16:6 (4288-4301). DOI: 10.1109/TSC.2023.3311629. Online publication date: Nov-2023
    • (2023) Development of Classification Model based on Arabic Textual Analysis to Detect Fake News: Case Studies. 2023 1st International Conference on Advanced Innovations in Smart Cities (ICAISC) (1-6). DOI: 10.1109/ICAISC56366.2023.10085350. Online publication date: 23-Jan-2023
    • (2023) Use of Compression Analytics to Detect Deception. 2023 10th International Conference on Behavioural and Social Computing (BESC) (1-7). DOI: 10.1109/BESC59560.2023.10386884. Online publication date: 30-Oct-2023
    • (2023) Comparative network analysis as a new approach to the editorship profiling task: A case study of the Mishnah and Tosefta from Rabbinic literature. Digital Scholarship in the Humanities 38:4 (1720-1739). DOI: 10.1093/llc/fqad038. Online publication date: 13-Jun-2023
    • (2023) How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artificial Intelligence Review 56:2 (1427-1492). DOI: 10.1007/s10462-022-10204-6. Online publication date: 1-Feb-2023
    • (2023) Reframing and Broadening Adversarial Stylometry for Academic Integrity. Handbook of Academic Integrity (1-19). DOI: 10.1007/978-981-287-079-7_148-1. Online publication date: 2-Aug-2023
